LINGUIST List 11.1411

Sun Jun 25 2000

Review: MonoConc Pro 2.0

Editor for this issue: Andrew Carnie <carnie@linguistlist.org>

What follows is another discussion note contributed to our Book Discussion
Forum. We expect these discussions to be informal and interactive; and
the author of the book discussed is cordially invited to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for discussion." (This means that
the publisher has sent us a review copy.) Then contact Andrew Carnie at
carnie@linguistlist.org

MonoConc Pro Concordance Software, Version 2.0 (March 2000)
Athelstan, Houston TX info@athel.com http://www.athel.com/
For Windows 3.1, 95, 98. Educational, single-user price US$85.
Reviewed by John Lawler, University of Michigan
MonoConc (MC) is a Windows program that provides most of the
functions that corpus linguists require in concordancing. MC has
been around for a long time; it was developed by Michael Barlow -
barlow@ruf.rice.edu - http://www.ruf.rice.edu/~barlow/ - of Rice's
Linguistics Department, with the needs of corpus linguists in mind.
(See Hockey 1998 on the needs of corpus linguists, and Barlow's
Corpus Linguistics page on meeting them.) The result is a fast, cheap,
and reliable program, and its utility is not restricted to specialists
in corpora by any means. This latest version, full of new features, is
software any linguist would want to have, and most can afford.
Not to keep anybody in suspense, this is largely a rave review, with a
few rants along the way. First I will discuss some of the main features
of MC, then describe three rather different projects I used it for,
indicate where I found it indispensable, and where I found it less
useful and why, with suggestions for revision of the user interface.
At the end I append some Web links in the References. Definitions
of most of the technical terms used here can be found in the
online Glossary of Lawler and Aristar 1998; see References for URLs.
The first thing one notices about MC is that it fits on one floppy
disk, and it doesn't have an "Installation" module, something
unusual for Windows programs. One simply copies three files into
whatever directory one pleases, then runs the program (MonoPro.exe).
The simplicity continues; when it is run, MC presents the user with
a blank window containing only File and Info menus.
To do anything, one must load some files from disk, or from a URL
(this is new in Vers 2.0). MC doesn't keep a list of recent files
on the File menu like Word, but it does remember the last folder
accessed, which is convenient. Any ASCII text file can be loaded as
a corpus, and one can load more than one at a time.
But you'd better specify all the files you want the first time,
because if you try to add more later in the session, they get loaded
as a separate corpus. This means that files that can't be selected
together in the Windows file dialogue may need to be renamed and
re-sorted in the window, which is at least a bother; Vers 2.0
alleviates this problem somewhat by allowing one to save and load a
"workspace", which is essentially a state dump including a specified
file list. Then you can simply grab the workspace file icon on the
desktop and drop it on the MC icon, thereby avoiding the blank grey
window.
The files get loaded and concatenated together as a corpus (though,
oddly, the "Corpus Text" window is named after the last file),
and when that happens, the "Corpus Text", "Frequency", "Window", and
"Concordance" menus appear between "File" and "Info". One can
suppress tags (or only Part-of-Speech tags), or suppress the words
and leave just the tags (useful for examining structure), on the
"Corpus Text" menu, and the "Frequency" menu produces instant
wordlists. But it's the "Concordance" menu that most of us will be
using.
On this menu there are three options available at first: "Search",
"Advanced Search", and "Search Options". Each brings up a dialog box
from which one can reach the other, so it doesn't matter which is
chosen first. The purpose of all of these commands is to determine
how the concordance is to be generated -- and in this program it's
important to realize that concordances are not permanent objects.
Rather, they're essentially reports generated on the fly, searching
for any string in any context that can be specified by regular
expressions or by a number of special wild card terms peculiar to
MC. One can, of course, do a complete concordance of the corpus
simply by searching for '*', but that doesn't begin to exhaust the
possibilities of the search function. For instance, full Regular
Expression searches are supported, as well as more sophisticated context
searches. (See Lawler 1998 for more information on Regular Expressions.)
Once the search term is entered (and usually it has to be adjusted
until it gets exactly the desired results), the concordance is
generated (very rapidly; time is almost never a limiting variable)
and displayed in Key Word In Context (KWIC) format in a separate
window, split into two parts; the lower one contains the concordance
as such, while the smaller upper one displays the context of the
selected line in the concordance as it is selected. Thus, if the
context of a line in the KWIC isn't clear, just select it with the
mouse, and the paragraph it appears in shows up in the upper window.
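The KWIC report described above can be sketched in a few lines of
Python; this is my own illustration of the general technique, not
MC's implementation (which, as a compiled Windows program, is its
own affair):

```python
import re

def kwic(corpus, pattern, width=30):
    """Return Key Word In Context lines for every regex match:
    left context, the bracketed keyword, right context."""
    lines = []
    for m in re.finditer(pattern, corpus):
        left = corpus[max(0, m.start() - width):m.start()]
        right = corpus[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

sample = "It remains to be seen whether the remainder will remain."
for line in kwic(sample, r"\bremain\w*"):
    print(line)
```

Because the concordance is just a report over the matches, regenerating
it with a different pattern or width costs nothing, which is exactly the
on-the-fly behavior MC exploits.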
The lines in the concordance window (or windows; one can concord on
any number of different search terms) appear in the original order
found in the corpus. To get better-organized lists, one uses the
"Sort" menu,
which appears (along with a "Display" menu) when the concordance
window is active. Here one is allowed to sort the concordance on
the keywords (called the "search term") or the first, second, or
third word to the right or left of them, and there can be up to
three consecutive sorts, so that similar contexts will appear
grouped together. The sort facility is very flexible; one can use
any sorting order on any character set, including unique digraphs,
so that for instance "ch" and "ll" can be made to sort as alphabetic
letters in their customary Spanish order.
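The digraph-aware sorting just described can be sketched as a custom
sort key in Python; the alphabet table and function here are my own
illustration of the idea, not MC's internals:

```python
# Traditional Spanish alphabet, with "ch" and "ll" as single letters.
ALPHABET = ["a","b","c","ch","d","e","f","g","h","i","j","k","l","ll",
            "m","n","ñ","o","p","q","r","s","t","u","v","w","x","y","z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

def spanish_key(word):
    """Turn a word into a tuple of alphabet ranks, consuming digraphs first."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        if word[i:i+2] in RANK:            # try the digraph first
            key.append(RANK[word[i:i+2]]); i += 2
        elif word[i] in RANK:
            key.append(RANK[word[i]]); i += 1
        else:
            i += 1                         # skip characters outside the alphabet
    return tuple(key)

words = ["cielo", "chico", "luna", "llama", "casa"]
print(sorted(words, key=spanish_key))
# → ['casa', 'cielo', 'chico', 'luna', 'llama']
```

Note that "chico" sorts after "cielo" and "llama" after "luna", as the
traditional ordering requires.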
Any linguist can get the hang of this kind of rapid sorting and
re-sorting (and its resultant eyeballing for patterns) very fast,
and it's quite gratifying to be able to look through a corpus for
patterns so easily and productively. At any point, of course, one
can go to the "Frequency" menu to get numeric data, sorted either
alphabetically or numerically.
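A frequency wordlist of the kind the "Frequency" menu produces, sorted
either way, is a few lines in Python (a sketch of the general idea,
with a naive tokenizer; MC's own tokenization rules may differ):

```python
import re
from collections import Counter

def wordlist(text):
    """Return (frequency-sorted, alphabetically-sorted) word counts."""
    counts = Counter(re.findall(r"[a-zA-Z']+", text.lower()))
    by_freq = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    by_alpha = sorted(counts.items())
    return by_freq, by_alpha

text = "the cat sat on the mat and the dog sat too"
by_freq, by_alpha = wordlist(text)
print(by_freq[0])   # → ('the', 3)
```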
I discovered in the course of the review that one of the principal uses
of MC in various recensions is as a classroom tool, for example in
language instruction, since it is child's play (literally) to generate
pattern exercises from corpora -- one simply searches for the word in
question and suppresses the search item, producing attested cloze
samples which can be used for drill or other exercises. Indeed, there
is a special version of MC for use in classes, and a special course
license and rate. I found it useful for other tasks as well, especially
in collaboration with other software, to do things it could not do alone.
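The cloze-generation trick just mentioned -- search for a word, then
suppress it -- can be sketched as follows (my own illustration; MC
does this through its search and display options rather than a script):

```python
import re

def cloze(corpus_lines, word, blank="____"):
    """Replace every whole-word occurrence of `word` with a blank,
    keeping only lines that contained it -- attested cloze items."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    items = []
    for line in corpus_lines:
        if pattern.search(line):
            items.append(pattern.sub(blank, line))
    return items

lines = ["She remains at home.", "The results remain unclear."]
for item in cloze(lines, "remain"):
    print(item)   # → The results ____ unclear.
```

Because every item comes straight from the corpus, the exercises are
attested usage rather than invented sentences.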
Two years ago, for instance, I prepared the index for a book (Lawler
and Aristar-Dry 1998) using an earlier version of MC (see Antworth
& Valentine 1998 and Stevens 1997 for reviews of that version), and
MS-Word's Index facility. I couldn't have done it very well using
either alone. The Word Index facility is quite reasonable, given a
list of words to index, but getting that list is not easy, to say
the least.
I used MC on the ASCII files of the various chapters to prepare a
complete concordance. After that, I simply went through the
concordance deleting words I didn't want to index, leaving only
those I did, including all the proper names. These were copied and
pasted into Word's Index facility, which then dutifully produced an
index with correct page numbers (once I had put in manual page
breaks to correspond with the physical page numbers on the galleys)
for all the important words in the book. I was then able to
edit this index further, grouping together related concepts,
differentiating homophonous word uses, and eliminating spurious
references. It was a very educational experience for me, and
produced a really thorough index in what I considered a reasonable
amount of time, without having to pore over each page of the galleys
physically. The index is available online (see References).
I recommend MC highly for any form of indexing, especially now that
it can build concordances with more than 16,000 hits at a time,
which was the limit at the time I did the index (I actually had to
do 26 concordances, for a*, b*, etc., to keep the size of each below
16,000). 16,000 is still, unaccountably, the default in MC Pro, and
one must remember to reset it each time when concording
large corpora, though I have experienced no problems with numbers as
high as 100,000 in MC Pro 2.0.
Another, more recent, project I found MC helpful with was an
investigation (Lawler 2000) of the syntactic properties of the verb
'remain' in the peculiar construction:
(1) I remain to be convinced that your plan is feasible. (Ross 1977)
As it turns out, the 2,435,659 quotations in the OED2 include 7,167
that contain some version of the word 'remain' (both noun and verb).
While useless for statistical purposes, such a collection is very
likely to contain occurrences of every possible construction, idiom,
important collocation, subcategorization, and selectional
restriction for the search term, identified by source and date.
The Digital Library Production Services OED Web interface delivers
the results as a single Web page, tagged in HTML. That amounts to a
partially tagged corpus, a real bonanza of syntactic data. (It is
available online; see References.)
The massaging was done outside of MC, which is equipped to search
for strings, but not to change them. Making a raw HTML corpus
tractable frequently requires other tools in collaboration with a
concordancer; one of the many benefits of MC is that it is ASCII
through and through, and therefore can be used in combination with
such tools as the ones I frequently use, like the stream editor sed,
and the filter language awk (available free for both DOS and UNIX;
see Lawler 1998), along with programmable editors, such as emacs and
ex, on UNIX (emacs is a screen editor -- some say *the* screen
editor, while ex is a line editor), or TextPad and Qedit on Windows
(TextPad is a Windows editor, while Qedit -- now called Semware Jr --
is a DOS editor. Both are shareware; see References).
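The kind of massaging a raw HTML corpus needs -- stripping the markup
before concordancing -- can be done with sed or awk as the author
suggests, or sketched in Python as below. This is a crude illustration
for simple pages, not a full HTML parser:

```python
import re

def strip_tags(html):
    """Turn a downloaded HTML page into plain ASCII text:
    drop tags, decode a couple of common entities, collapse spaces."""
    text = re.sub(r"<[^>]+>", " ", html)      # drop <...> tags
    text = re.sub(r"&amp;", "&", text)        # common entities
    text = re.sub(r"&nbsp;", " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()

page = "<p>It <b>remains</b> to be&nbsp;seen.</p>"
print(strip_tags(page))   # → It remains to be seen.
```

For heavily formatted pages a real parser is safer, but a filter like
this is often enough to turn Web data into a concordanceable corpus.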
In this task, I used MC extensively on the massaged corpus; since
the study was restricted to infinitive complements governed by
the verb 'remain', I had first to eliminate occurrences of noun
'remain(s)', as well as 'remainder'. MC allowed me to eliminate
all instances of 'remainder' from the target corpus easily, but of
course it could not search for zero morphology in a corpus without
Part-of-Speech tags (see Ball's overview on tagging), so I had to
put those in myself, by hand, by going through the (by this time)
around 4,000 quotations. Once that was done, MC concorded the
examples tagged with verbal 'remain' followed (anywhere) by the word
'to'. This didn't guarantee it was an infinitive 'to', but this
could be checked easily in MC, and the non-infinitive cases, and
those not governed by 'remain' were eliminated.
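The filtering step just described -- verbal 'remain' followed anywhere
by 'to' -- can be sketched with a regular expression. The "_V"/"_N"
tagging convention below is my own invention for illustration; the
author does not say what hand-tagging scheme he used:

```python
import re

# Hypothetical hand-tagging convention: remains_V (verb), remains_N (noun).
VERB_TO = re.compile(r"\bremain\w*_V\b.*\bto\b", re.IGNORECASE)

quotes = [
    "There remains_V much to be done.",
    "The remains_N were buried.",
    "It remained_V unclear.",
]
hits = [q for q in quotes if VERB_TO.search(q)]
print(hits)   # → only the first quotation
```

As in the study, this over-generates (it cannot tell infinitival 'to'
from the preposition), so the survivors still need checking by hand.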
At this point, I had about 500 sentences, dated and attributed, all
containing a construction formally resembling the one under
consideration, suitable for syntactic analysis. From this point on,
the task could be done with a wordprocessor; but getting to this
point would have been difficult or impossible without MC and the
other tools. I found that MC opened up many possibilities that I
would not have considered doing in the study, let alone been able to
do at all, and that it frequently made the difficult simple, and the
impossible merely tedious. This is a program I would recommend to any
syntactician or semanticist who wants to work with real data.
Finally, I have recently become curious about Emily Dickinson's use of
phonesthemes. After snagging a lot of her poetry online from various
places, I am once again faced with a large amount of text, variously
tagged in HTML, to deal with. And once again, MC is the tool of choice.
This time, though, I am somewhat more conscious of drawbacks in MC.
For one thing, while it's very nice to be able to generate concordances
on the fly, sometimes you want to work with the same concordance for a
long time, and MC makes it hard to do that, because concordances are
not saved in the form in which they appear in MC's "Concordance" window,
but rather strictly in ASCII, with the search term [[ marked ]] with
double brackets. This is OK for an ASCII save, but if I want to work
with the same concordance tomorrow that I'm working with today, I have
to generate it all over again tomorrow from data if I want to use MC's
searching and sorting facilities -- an ASCII concordance does not load
in the concordance window, but as a new corpus (containing [[]]'s, which
can't be deleted in MC).
This is normally not a big deal, but when dealing with HTML text one
usually simply wants to avoid the HTML tags, and this is certainly true
of the Dickinson corpus, since the poems are heavily formatted. The
"Corpus Text" menu has an option to "Suppress Tags", but it only applies
to display of the text in that window; when one opens a Concordance
window via a Search command (in this case, for *), all the HTML tags
show up in the Concordance window, duly found by the search. This
window *also* has a setting that suppresses tags, but it can't be
invoked until *after* the search, when it produces a lot of lines that
don't have any visible search term.
The workaround is to sort on the search term (which works, whether it's
suppressed or not), then delete all the lines without search terms, about
a third of the corpus of 77,000 hits, which group together for selection
at the beginning of the corpus. Then I have to remove the hits on the
various Roman numerals that stud the pages, then I can get down to
business. Tomorrow I'll have to do the same thing all over again. And
hope I've remembered all the steps, so I wind up with the same
concordance.
Once that's done, though, a variety of sorting strategies suggest
themselves, and any number of putative patterns begin to tease the
perceptions and beg to be tested. This is the *really* useful part
of an interactive concordancer -- it gives one the opportunity to
become really intimately acquainted with one's data in ways that are
simply impossible with large data sets without such help, and it is
here that MC really shines.
There are plenty of other gripes one might make about details of the
user interface, but there are workarounds for almost all of them.
And I haven't even begun to list most of the special functions it
can also perform, should one desire them; there's a big load of
functionality to explore here. The version of MC I received arrived
without documentation, but there is now a 70-page comprehensive
manual in Word format (9.8 MB), with diagrams and screen shots, that
is quite clearly written and covers pretty much everything one needs
to know. Nevertheless, it's been possible to learn how to use MC for
years just by following one's nose, a sign of an intuitive
interface, which itself is a very good sign in any piece of
software.
The flood of text that is now washing over us on the 'net has made
us aware that we need to do more than just bail frantically; we need
industrial-strength tools if we are to stay afloat. I strongly
recommend that every linguist who works with data that is
represented (or representable) in ASCII be equipped with MonoConc
Pro, for research, development, and teaching, if possible with
Departmental site licenses; this is simply too useful a tool to
overlook.
- ---------
John Lawler is Associate Professor of Linguistics at the University
of Michigan, where he is principal advisor of its undergraduate
Linguistics program, the largest in the US. He is interested in
metaphor, computing, sound symbolism, and English grammar, among
other subjects. He is co-editor of Lawler and Aristar 1998.
(Full bio at http://www.umich.edu/~jlawler/bio.doc)
- ---------
References:
Antworth, E. and J. R. Valentine. 1998. 'Software for Doing Field
Linguistics'. Ch 6 of Lawler & Aristar-Dry. Appendices:
http://www.sil.org/computing/routledge/antworth-valentine/ ,
http://www.sil.org/computing/routledge/antworth-valentine/text.html
Ball, Cathy. Tagging overview
http://www.georgetown.edu/cball/ling361/tagging_overview.html
Barlow, Michael. Corpus Linguistics Page
http://www.ruf.rice.edu/~barlow/corpus.html
-------------- Details of Monoconc 1.5 and Monoconc Pro 2.0
http://www.athel.com/rade.html
Hockey, Susan. 'Textual Databases'. Ch 4 of Lawler & Aristar-Dry.
Appendix: http://www.ualberta.ca/~shockey/UCLPP/textual.htm
Lawler, J. 2000. 'Remainders', paper delivered at LANGUAGING 2000.
Text: http://www.umich.edu/~jlawler/remainders.doc
Handout: http://www.umich.edu/~jlawler/remaindershandout.doc
Data: http://www.umich.edu/~jlawler/oedqa-remain.html (original)
and http://www.umich.edu/~jlawler/oedqc-remain.html (massaged)
------- 1998. 'The Unix Language Family'. Ch 5 of Lawler & Aristar-Dry.
Chapter text: http://www.umich.edu/~jlawler/routledge/unix.doc
Appendix: http://www.umich.edu/~jlawler/routledge/unix.html
Lawler, J, and H. Aristar-Dry (eds), 1998. _Using Computers in
Linguistics: A Practical Guide_. Routledge.
Home: http://www.routledge.com/routledge/linguistics/using-comp.html
Intro: http://www.routledge.com/routledge/linguistics/introduction.html
Index: http://www.umich.edu/~jlawler/routledge/unix.doc
Glossary: http://www.umich.edu/~jlawler/routledge/glossary.html
Ross, J.R, 1977. 'Remnants'. Studies in Language I:1.127-135.
sed and awk (universal freeware text filter languages)
http://www.umich.edu/~jlawler/routledge/sedawkperl.html
Semware Jr (formerly Qedit; shareware DOS programmable editor)
http://www.semware.com/
Stevens, V. 1997. Review of Monoconc 1.2. in CALICO (Computer
Assisted Language Instruction Consortium) newsletter.
http://www.arts.monash.edu.au/others/calico/review/monoconc.htm
TextPad (shareware Windows programmable editor)
http://www.textpad.com/