Each corpus can be divided into smaller parts called subcorpora. Sketch Engine lets users create subcorpora in their own namespace: each user has their own subcorpora and cannot access the subcorpora of other users.

To share common subcorpora, it is possible to create a list of subcorpora which are accessible by all users (so-called “global subcorpora”). The list of global subcorpora is defined in a subcorpus definition file.

A minimal subcorpus definition file, which includes the files whose names begin with K:

=My_Subcorpus
file
filename="K.*"
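The value "K.*" is a regular expression rather than a plain prefix. The following Python sketch illustrates why the trailing ".*" is needed, under the assumption that attribute values are matched against the whole value (as is usual in CQL). Python's re module only approximates the corpus engine's regular expression syntax, and the file names are invented for illustration.

```python
import re

# Hypothetical file names, for illustration only.
file_names = ["Kafka.txt", "Keats.txt", "Orwell.txt", "kipling.txt"]

# Assumption: the pattern must match the entire attribute value,
# which re.fullmatch imitates. Matching is case-sensitive here.
pattern = re.compile(r"K.*")
selected = [name for name in file_names if pattern.fullmatch(name)]

print(selected)
```

Under that assumption, a bare "K" would select only a file named exactly K, which is why the pattern ends in ".*".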

A fuller example follows; the comments at the start of the file describe the format:

###############################################################################
# Subcorpus definition file
###############################################################################
#
# Subcorpora can be created by users in their own namespace;
# each user has their own subcorpora and cannot access the subcorpora of
# other users.
# To share common subcorpora, it is possible to create a list of
# subcorpora which are accessible by all users.
# This file defines subcorpora names and respective subqueries.
#
#
# Subcorpus definition format
# ----------------------------
# *FREQLISTATTRS attr1 attr2
#
# =subcorpus_id
# structure
# sub-query
#
# =subcorpus_id
# -CQL-
# full-cql-query
#
# FREQLISTATTRS specifies a list of attributes for which frequency
# lists should be precomputed.
#
# Sub-query is a part of a corpus query which can be used in
# a "within" clause. It can consist of an and/or combination
# of attribute-value pairs.
#
# Full-cql-query is any CQL query whose result (KWIC) is taken as subcorpus
# definition.
#
# All strings starting with # are comments and are ignored to the end of line.
#
###############################################################################
*FREQLISTATTRS word lemma lempos
=spoken
bncdoc
alltyp="Spoken context-governed" | alltyp="Spoken demographic"
=book60
bncdoc
alltim="1960-1974" & wrimed="Book"
=first1000
-CQL-
[#0-1000]
=same_as_book60
-CQL-
<bncdoc alltim="1960-1974" & wrimed="Book" />
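To make the format described in the file header more concrete, here is a minimal Python sketch of how such a file could be read. It is not the parser used by Sketch Engine; the function name and the returned dictionary layout are hypothetical. It also assumes that only lines whose first non-blank character is # are comments, so that a query such as [#0-1000] is kept intact, and that every subcorpus has exactly two body lines.

```python
def parse_subcorpus_definitions(text):
    """Return (freqlist_attrs, subcorpora) read from definition file text."""
    freqlist_attrs = []
    blocks = []  # list of (subcorpus_id, [body lines])
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Assumption: only whole-line comments are skipped.
        if not line or line.startswith("#"):
            continue
        if line.startswith("*FREQLISTATTRS"):
            freqlist_attrs = line.split()[1:]
        elif line.startswith("="):
            blocks.append((line[1:].strip(), []))
        elif blocks:
            blocks[-1][1].append(line)
        else:
            raise ValueError(f"unexpected line before first subcorpus: {line!r}")

    subcorpora = {}
    for subcorpus_id, body in blocks:
        if len(body) != 2:
            raise ValueError(
                f"subcorpus {subcorpus_id!r}: expected 2 body lines, got {len(body)}"
            )
        head, query = body
        if head == "-CQL-":
            # The whole CQL query result defines the subcorpus.
            subcorpora[subcorpus_id] = {"kind": "cql", "query": query}
        else:
            # A structure name followed by a sub-query on its attributes.
            subcorpora[subcorpus_id] = {
                "kind": "structure",
                "structure": head,
                "subquery": query,
            }
    return freqlist_attrs, subcorpora


EXAMPLE = """\
# shared subcorpora
*FREQLISTATTRS word lemma lempos
=spoken
bncdoc
alltyp="Spoken context-governed" | alltyp="Spoken demographic"
=first1000
-CQL-
[#0-1000]
"""

attrs, subcorpora = parse_subcorpus_definitions(EXAMPLE)
print(attrs)
print(subcorpora["first1000"])
```

A check like this can catch an incomplete entry, such as a -CQL- line with no query after it, before the file is uploaded or compiled.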

To compile the shared (global) subcorpora, use either the Corpus Architect (CA) interface or the mksubc.py script.

1) via the Corpus Architect interface

Once you have created your subcorpus definition file, you need to:

– upload the definition

go to the home page (corpora overview)

start by pressing Subcorpus definitions in the left-hand side menu

click on Add new subcorpus definition file at the bottom right

find and upload the definition file on your computer

enter the name by which the file should be referred to within Sketch Engine and click OK

Note that your uploaded definition files can be shared with other users. This allows the other users to compile subcorpora using your definition file or to view the file itself. This is *not* necessary for sharing the actual subcorpora you have compiled for a given corpus with other users.

– recompile the corpus

if you have uploaded a subcorpus definition file to the server or someone has shared their definition with you, open the corpus by clicking on its name (it works only on user corpora – not the preloaded ones)

select Set subcorpus definitions in the left-hand side menu (if the label is greyed out, make sure the corpus is already compiled)

choose a definition file you want to use

tick the Recompile subcorpora checkbox and click OK

if the compilation finishes without any errors then all users that have access to the corpus will also see the newly created subcorpora

2) using the mksubc.py script

Usage: mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE

SUBCORP_DIR is the directory where the subcorpora will be created; its location depends on the Sketch Engine installation. The global subcorpora (accessible by all users) should be stored in the directory set in the SUBCBASE attribute of the corpus configuration file, which is PATH/subcorp/ by default.
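When the script is run as part of an automated build, the call can be wrapped in a few lines of Python. The sketch below only assembles the three positional arguments from the usage line above. The helper names, the corpus name, and the paths are hypothetical, and it assumes mksubc.py is executable and on the PATH of the account running it.

```python
import subprocess


def build_mksubc_command(corpname, subcorp_dir, def_file, script="mksubc.py"):
    """Assemble the argument list: mksubc.py CORPNAME SUBCORP_DIR SUBCORP_DEF_FILE."""
    return [script, corpname, subcorp_dir, def_file]


def compile_global_subcorpora(corpname, subcorp_dir, def_file):
    """Run mksubc.py and raise CalledProcessError if it exits with an error."""
    command = build_mksubc_command(corpname, subcorp_dir, def_file)
    return subprocess.run(command, check=True)


# Placeholder values; the real directory is the one set in SUBCBASE.
command = build_mksubc_command("bnc", "/path/to/subcorp", "subcdef.txt")
print(" ".join(command))
```

Calling compile_global_subcorpora with the same three values would run the script itself; passing the arguments as a list avoids shell quoting problems with paths that contain spaces.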