SOFTWARE:

On this page I have started assembling a collection of bits & pieces (datasets, functions, libraries, programs, routines)
that some readers may find amusing &/or useful. The software is made available under the terms of the GNU General Public License, version 3, which can be found here.

This software collection should grow gradually
as time goes by. Most of it will be in Python (version 3, not version 2) though I may include a few functions in R if inspiration strikes, and maybe even some JavaScript.
If you don't have Python3, go to the Python website and follow the links to the latest version for your hardware. The download/installation process is pretty straightforward. To obtain R, go to the R-project site and do likewise.

The simple stuff should be self-explanatory. For the more complicated software, I intend to write user notes, aimed at users already familiar with the basics of Python (or R). If you're completely fresh to Python, you can find a plethora of tutorials on-line, of which a good example is the Python Tutorial on the official site. Michael Kart has converted Allen B. Downey's textbook
Think Python for Python 3. You can download a pdf copy
here free of charge! (Printed copies can be purchased from various suppliers,
ideally those that pay their taxes.)

R is somewhat trickier. The R Book, by Michael Crawley, is excellent but also expensive.
R for Dummies by de Vries & Meys is a pretty good starter. I personally possess 6 books on R and one on S (its progenitor) and still don't feel fully competent after more than 10 years using the system. Still, I wouldn't go back to SPSS for love or money. (Well, okay, I would -- under protest -- for lots of love and stacks of money). For newcomers to R, Robert Kabacoff's pages look like a good starting point.

A comprehensive revision of the BEAGLE rule-learning package is now available (August 2016). Moreover, it turns out that 1980s PC/BEAGLE never quite went away. It is still runnable, with a little help from DOSBox 0.74. This software deserves a page to itself: for the latest versions of BEAGLE and RUNSTER (Regression Using Naturalistic Selection To Evolve Rules) click here.

This Python3 program tries to identify natural groupings in data sets, i.e. it performs computational clustering. It uses an evolutionary algorithm to partition data sets into clusters of items that in some sense belong together. A notable point is that the user doesn't have to tell the system how many groups to find: the number of groups emerges as part of the optimization process.
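To give the flavour of the approach, here is a toy sketch of my own (not the CUES code itself, which does far more): each genome assigns every data point a cluster label drawn from a generous pool, and a small penalty on the number of clusters actually used lets the group count emerge from the optimization rather than being specified in advance. The data and parameter values below are invented for illustration.

```python
import random

DATA = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 9.1, 9.0]   # toy 1-D data with 3 groups
MAX_CLUSTERS = len(DATA)                            # an upper bound, not a target

def fitness(genome):
    """Lower is better: within-cluster squared error plus a small
    penalty per cluster actually used, so superfluous clusters cost."""
    clusters = {}
    for label, x in zip(genome, DATA):
        clusters.setdefault(label, []).append(x)
    sse = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        sse += sum((x - mean) ** 2 for x in members)
    return sse + 0.5 * len(clusters)

def mutate(genome, rate=0.1):
    # Reassign each point's label with a small probability.
    return [random.randrange(MAX_CLUSTERS) if random.random() < rate else g
            for g in genome]

def crossover(a, b):
    # Single-point crossover of two labellings.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=60, generations=200, seed=0):
    random.seed(seed)
    pop = [[random.randrange(MAX_CLUSTERS) for _ in DATA]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[:pop_size // 4]          # keep the best quarter
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return min(pop, key=fitness)

best = evolve()
groups = len(set(best))   # how many clusters were actually used
```

Because unused labels simply drop out, nothing in the setup fixes the number of groups beforehand; the penalty term is what stops the optimizer from splitting everything into singletons.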

The EASTIRS constrained sequential clustering software is included with CUES for convenience, as is a user guide, which can also be found separately, below.

[The latest version of EASTIRS has been included with the CUES software (above) so I will remove this separate section at some stage.]

This Python3 program tries to identify natural break-points in sequential data, thus partitioning a series into segments, or phases, with similar values.
In essence it performs constrained clustering, doing much the same job as what statisticians more commonly call change-point analysis. Finding natural subsections of temporally organized data is a task with many applications in economics, medicine, process control and other fields. Personally, I am motivated to apply it to stylochronometry, i.e. the study of changes in an author's verbal habits over time, which, if regular enough, could allow the dates of undated works to be estimated.
It turns out that an evolutionary approach is peculiarly well suited to this sort of problem, since candidate solutions are naturally represented as a string of zeroes and ones indicating where change points occur. Such bitstrings can easily be chopped up, recombined and/or mutated by the evolutionary optimizer.
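The bitstring encoding is easy to demonstrate. The sketch below is an illustrative toy of mine (not EASTIRS itself): a 1 at position i marks a change point before item i, the fitness trades within-segment squared error against a penalty per break, and ordinary crossover and bit-flip mutation do the evolving. The series and parameters are invented.

```python
import random

SERIES = [2, 2, 3, 2, 9, 8, 9, 9, 4, 5, 4, 4]   # toy series with 2 true breaks

def segments(bits):
    """Split SERIES at every position whose bit is 1."""
    segs, start = [], 0
    for i, b in enumerate(bits, start=1):
        if b:
            segs.append(SERIES[start:i])
            start = i
    segs.append(SERIES[start:])
    return segs

def fitness(bits):
    # Within-segment squared error plus a penalty per break point.
    cost = 0.0
    for seg in segments(bits):
        mean = sum(seg) / len(seg)
        cost += sum((x - mean) ** 2 for x in seg)
    return cost + 1.0 * sum(bits)

def mutate(bits, rate=0.1):
    return [b ^ 1 if random.random() < rate else b for b in bits]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=60, generations=300, seed=1):
    random.seed(seed)
    n = len(SERIES) - 1                     # possible break positions
    pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        elite = pop[:pop_size // 4]
        pop = elite + [mutate(crossover(random.choice(elite),
                                        random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    return min(pop, key=fitness)

best = evolve()
breaks = [i + 1 for i, b in enumerate(best) if b]   # start index of each new segment
```

Note how naturally the representation fits the problem: chopping, recombining and flipping bits are exactly the operations an evolutionary optimizer wants to perform.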

Like all these programs EASTIRS is work in progress, so feedback from users would be helpful towards improving the system.

This suite of Python3 programs helps to explore the concept of formulaic language. By exploiting the notion of 'collocational coverage' these programs generate a 'formulexicon' that contains 'collocades' -- relatively repetitive token sequences that constitute cascades of collocations. With this information you can rank both individual documents and entire text types according to how pervaded they are by formulaic subsequences. In doing so the system identifies texts that are highly typical or highly anomalous with respect to different text categories, and can also be used in classification mode to assign documents to their most compatible category, along with a relatively robust indicator of the confidence to be placed on each such assignment.
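The coverage notion can be illustrated crudely as follows (this is my own toy example, not the formulexicon software, which is considerably more elaborate): find word bigrams that recur within a text, then measure what proportion of the text's tokens fall inside at least one repeated bigram.

```python
from collections import Counter

def coverage(text, n=2, min_freq=2):
    """Fraction of tokens covered by at least one repeated n-gram."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    repeated = {g for g, c in Counter(grams).items() if c >= min_freq}
    covered = [False] * len(tokens)
    for i, g in enumerate(grams):
        if g in repeated:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)

sample = ("to be or not to be that is the question "
          "whether tis nobler in the mind to be")
print(round(coverage(sample), 3))
```

Here only "to be" recurs, covering 6 of the 18 tokens; the real system works with variable-length collocades rather than fixed bigrams, and aggregates such figures across whole text categories.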

The KEYSOFT programs are included with this zipped file for convenience, as is a user guide, which can also be found separately, below. (Feedback from users welcome, as usual.)

Of making "keyness" indices there is no end, yet many corpus linguists seem to believe that there is only a single way of quantifying the degree to which a word is a keyword with respect to a particular group of texts, namely what they usually call "log-likelihood", which statisticians prefer to call "G-squared". The zipped file keysoft, below, contains a Python3 program (newquay3.py) that offers twelve different ways of calculating keyness, including G-squared, and thus twelve different flavours of ranked keyword listing. It also permits more than 2 categories of texts to be compared (unlike most other software for this task). In addition, there is a follow-on program (termite3.py) which seeks what I call desmoglyphs -- sequences of tokens that are distinctive of particular categories of text. These aren't fixed-length n-grams: the program computes the appropriate length of each desmoglyph.
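For readers who want to see what G-squared actually computes, here is a minimal sketch (mine, not newquay3.py): the log-likelihood statistic for a single word, from a 2x2 contingency table of word counts in a study corpus versus a reference corpus. The frequencies below are invented for illustration.

```python
import math

def g_squared(freq_a, total_a, freq_b, total_b):
    """G2 = 2 * sum(O * ln(O/E)) over the four cells of the
    word / not-word by corpus-A / corpus-B contingency table."""
    grand = total_a + total_b
    word_total = freq_a + freq_b
    g2 = 0.0
    for obs, col_total in ((freq_a, total_a), (freq_b, total_b)):
        for o, row_total in ((obs, word_total),
                             (col_total - obs, grand - word_total)):
            e = col_total * row_total / grand   # expected count for this cell
            if o > 0:
                g2 += 2.0 * o * math.log(o / e)
    return g2

# Invented example: 'whilst' appears 120 times in 50,000 tokens of corpus A,
# but only 15 times in 60,000 tokens of corpus B.
print(round(g_squared(120, 50000, 15, 60000), 2))
```

A word used in exactly the same proportion in both corpora scores zero; the further the proportions diverge, the larger G-squared grows. This is just one of the twelve measures on offer.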

I'm sure this software can be improved, so feedback from any brave users out there would be welcome. Thanks.

This is version 9 of my text classification system, designed with authorship attribution in mind but also capable of other kinds of document categorization. You can run it with one of the built-in techniques supplied as library modules with the distribution, or, if you know Python, use it as a testbed for your own classification algorithms. Feedback from users would be welcome.

This software is designed to make it easy to go from a corpus of documents to a data grid in a format suitable for importing into R, or similar systems, for statistical processing. Program dox2vox.py takes a collection of texts and produces a vocabulary listing in descending order of frequency. Five ways of operationalizing the idea of frequency are offered, including the most obvious. A follow-on program, vox2dat.py, takes as input such a vocabulary listing (which need not have been produced by dox2vox.py) together with a collection of text files, and produces from them a tab-delimited rectangular file in which each row represents a text and each column gives the percentage frequency of a particular vocabulary token in that text. This output data file also includes 10 columns giving the values of various vocabulary-richness measures for each text.
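A much-simplified sketch of this pipeline (names, data and details invented for illustration -- the real programs read files, offer several frequency measures and append vocabulary-richness columns):

```python
from collections import Counter

# Toy corpus standing in for a directory of text files.
texts = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat on the log on the bog",
}

def build_vocab(texts, top=3):
    """Rank tokens by total frequency across all texts (the dox2vox step)."""
    counts = Counter()
    for body in texts.values():
        counts.update(body.split())
    return [w for w, _ in counts.most_common(top)]

def frequency_grid(texts, vocab):
    """Tab-delimited grid: one row per text, one column per vocabulary
    token, each cell a percentage frequency (the vox2dat step)."""
    rows = ["text\t" + "\t".join(vocab)]
    for name, body in texts.items():
        tokens = body.split()
        c = Counter(tokens)
        cells = [f"{100 * c[w] / len(tokens):.2f}" for w in vocab]
        rows.append(name + "\t" + "\t".join(cells))
    return "\n".join(rows)

vocab = build_vocab(texts)
grid = frequency_grid(texts, vocab)
print(grid)
```

The resulting grid can be read straight into R with read.delim(); each row is a text, each column a percentage frequency.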