When working with corpora, it is sometimes useful to generate random samples from corpus results for manual analysis (e.g. to determine distribution percentages or the recall/precision of queries). BNCweb, CQPweb and (No)SketchEngine provide a thin function for this purpose. However, if the results of corpus queries are only available as text files, a random thinning option is available as part of GNU coreutils. The examples below create a random sample of 100 lines (adapt the sample size to your project’s needs). The reliability of manually checked results can be improved by drawing several samples of 100 lines (typically 2-3) and averaging the scores.

On Linux, there is a straightforward way to achieve this (type: man shuf for details):
cd path_to_text_file
shuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
shuf -n 100 -o random_sample.txt results.txt

On Mac OS X, it is slightly more complicated, as a Linux-like package manager (e.g. Homebrew) and the coreutils package have to be installed first (see the gshuf Tutorial OSX and the corresponding random_sample.zip for novice users who are not familiar with the OS X terminal). Once the gshuf command is available, the invocation is analogous (type: man gshuf for details):
cd path_to_text_file
gshuf -n 100 results.txt

In order to save the random sample into a new text file, specify an output file:
gshuf -n 100 -o random_sample.txt results.txt

On Windows, the following Python code snippet could be used to achieve a similar result (please let me know if there are any built-in options):
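A minimal sketch of one way to do this with the Python standard library (it works the same on Linux and OS X). The function name random_sample and its parameters are assumptions made for illustration; the file names are taken from the shuf examples above, and the input is assumed to be UTF-8 encoded.

```python
import random
from pathlib import Path


def random_sample(in_path, out_path, n=100, seed=None):
    """Write n randomly chosen lines of in_path to out_path and return them."""
    lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    # A fixed seed makes the sample reproducible; None gives a fresh sample.
    rng = random.Random(seed)
    # If the file has fewer than n lines, return all of them in random order.
    sample = rng.sample(lines, min(n, len(lines)))
    Path(out_path).write_text(
        "".join(line + "\n" for line in sample), encoding="utf-8"
    )
    return sample
```

Calling random_sample("results.txt", "random_sample.txt") from the directory that contains results.txt should produce a 100-line sample file. To obtain several samples for averaged scores, the function can be called repeatedly with different output file names.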

Description

CQPweb

CQPweb is a web-based graphical user interface (GUI) for some elements of the CWB – and in particular, the CQP query processor.

CQPweb is designed to replicate the user-interface of the popular BNCweb tool, which also (in its most recent versions) uses CQP as a back-end. Like BNCweb, CQPweb uses a database alongside the CWB to provide extra functions beyond those built into CWB/CQP. However, unlike BNCweb, CQPweb can be used with any corpus.

CQPweb is especially suitable for students, non-linguists, and others for whom a Unix-like command-line is a terrifying prospect. […]

[Click on 'Overview' and 'Select Category' ⇨ 'Corpus Development Tools' ⇨ 'Sentence Alignment' for a list of tools you could use to create your own parallel corpus.]

Install apache2 web-server environment with php5 and enable sqlite support for perl and php5

Extract files to a new AntWCF directory and create database(s)

Adapt antpwc_concordance.php to new corpus

Adapt index.php to new corpus

Create apache2 configuration file

Start a web browser and go to http://localhost/AntWCF

Install apache2 web-server environment with php5 and enable sqlite support for perl and php5

Open a terminal window and type:

sudo apt-get install apache2 php5 php5-sqlite libdbd-sqlite3-perl

Troubleshooting:

If you would like to get rid of apache2’s warning “Could not reliably determine the server’s fully qualified domain name”, click here for a quick fix (the server works just fine even if you don’t specify ServerName).

Open zip-file (available on request: [1]), extract all files into a new AntWCF directory and create database(s)

Change into the new AntWCF directory:

cd path/to/AntWCF

Troubleshooting:

If you get a permission error when trying to drag&drop files into the new AntWCF directory, you probably created the directory outside your home area, using the sudo command (e.g. in /opt). To be able to drag&drop files into the directory, you can take ownership by typing:

L1_FILE.txt is the source text file of your parallel corpus that you have just copied into the AntWCF/data/corpus directory, L2_FILE.txt is the target text file of your parallel corpus and OUTPUT_FILE.db is the name of the sqlite database to be created by the script (extension: .db).

Take a note of the name of the new database file and the number of tokens for L1/L2 displayed by the script (or make sure that you do not close the terminal window, as you will need these pieces of information later on).

Delete the temporary files created by the script:

rm temp_*

Repeat this procedure for other parallel corpora you might wish to include in AntWCF

Troubleshooting:

If the script throws an encoding error, make sure that your text files contain only valid UTF-8 characters, and eliminate all invalid byte sequences before running the database script again. Important: If you choose the same output file name, you have to delete the old file before re-running the script.
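One way to locate and remove the offending bytes is sketched below in Python. This is not part of AntWCF; the function names find_invalid_utf8 and strip_invalid_utf8 are hypothetical helpers introduced for illustration. Note that silently dropping bytes may hide a wrongly encoded file (e.g. Latin-1), so it is worth inspecting the reported lines first.

```python
def find_invalid_utf8(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


def strip_invalid_utf8(in_path, out_path):
    """Copy in_path to out_path, dropping bytes that are not valid UTF-8."""
    with open(in_path, "rb") as fh:
        text = fh.read().decode("utf-8", errors="ignore")
    # newline="" keeps the original line endings unchanged.
    with open(out_path, "w", encoding="utf-8", newline="") as fh:
        fh.write(text)
```

For example, find_invalid_utf8("L1_FILE.txt") lists the problematic lines, and strip_invalid_utf8("L1_FILE.txt", "L1_FILE_clean.txt") writes a cleaned copy (the output file name here is only an example) that can then be passed to the database script.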

Adapt antpwc_concordance.php to new corpus

Open the file www/antpwc_concordance_20120925_2342.php in a text editor of your choice (e.g.):

leafpad www/antpwc_concordance_20120925_2342.php

[Select 'Options' ⇨ 'Line Numbers' for easier navigation]

Adapt the following lines for your own parallel corpus:

Note: AntWCF is preconfigured to be able to switch between two different parallel corpora. If you created just one database file, comment out lines 32-34 and lines 40-42 (using ‘//’ at the beginning of each line) or fill in the same data twice. If, however, you created more than two databases, you can include those corpora by copying lines 32-34 and inserting them once before line 35 and a second time before line 43. You can then adapt the database information for each additional corpus (and so on …). For the number of tokens per language, refer back to the information generated by the db_creator.pl script above.

Adapt index.php to new corpus

Open the file www/index.php in a text editor of your choice.

Adapt the following lines for your own parallel corpus:

Set default database:

Adapt database information:

Note: AntWCF is preconfigured to be able to switch between two different parallel corpora. If you created just one database file, comment out line 110 or fill in the same data twice. If you created more than two database files, copy line 110, insert it before line 111 and adapt the database file and corpus names:

Adapt the language pair for your own parallel corpus (line numbers will have changed slightly if you inserted additional databases above).

Create apache2 configuration file

Create a new file called AntWCF.conf in the directory /etc/apache2/conf.d/ and open it in a text editor of your choice (e.g.):

sudo leafpad /etc/apache2/conf.d/AntWCF.conf

Copy-paste the following lines into the new file, adapt the absolute path to your AntWCF/www directory (twice for Alias and for Directory) and save the file (you will need superuser privileges (sudo) to be able to save the file in this location):

Download the icon and create a desktop shortcut (copy-paste [Desktop Entry] into a new file called InterText.desktop on your Desktop; don’t forget to adapt the full paths to the InterText binary and InterText_logo.gif):