These are the web pages of the project Issues in the Phonology of Word in Czech (13-15361P), supported by the Grant Agency of the Czech Republic (2013-2015). The project's goal was to account for various aspects of the phonology of words in modern Czech. In particular, it concentrated on the phonotactics of Czech words (phoneme occurrence and frequency, phoneme combinations, and the syllabic structure of words). The project resulted in two phonological corpora of Czech, the Lexical Corpus and the Textual Corpus, and in a series of publications.

The Lexical Corpus is a phonologically transcribed and annotated database of the Czech vocabulary (i.e. of lemmas / dictionary entries) stored in a CSV file (a comma-separated text file editable e.g. in MS Excel). The transcription reflects the phonemic constituency of words and their syllabic structure ("syllabification of words"). The Corpus also includes an allophonic transcription showing an idealized pronunciation of each lexical item. The main corpus is supplemented with several smaller lexical corpora of proprial vocabulary (proper names).

Quantitative analysis of the Lexical Corpus

The whole Lexical Corpus was quantitatively evaluated for the frequencies of various phonological units, in particular within the phonological word. See the Description of the Corpora for an explanation of this notion. The evaluation was carried out with the Evaluation program.
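To illustrate what such a frequency evaluation involves, the following Python sketch counts phonemes and syllables in a few transcribed entries. It is not the Evaluation program itself. The column names (lemma, phonemic), the separators (a space between phonemes, a dot between syllables), and the sample transcriptions are assumptions made for this example only; the actual layout of the CSV file is given in the Description of the Corpora.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. The column names and the separators
# (space between phonemes, "." between syllables) are assumptions,
# not the real format of the Lexical Corpus.
SAMPLE = """lemma,phonemic
voda,v o . d a
matka,m a t . k a
"""


def count_units(rows, column="phonemic"):
    """Return (phoneme counts, syllable counts) for transcribed rows."""
    phonemes = Counter()
    syllables = Counter()
    for row in rows:
        # Split the transcription into syllables, then into phonemes.
        for syllable in row[column].split("."):
            segments = syllable.split()
            if not segments:
                continue  # skip empty pieces, e.g. from a trailing dot
            syllables["".join(segments)] += 1
            phonemes.update(segments)
    return phonemes, syllables


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
phonemes, syllables = count_units(rows)

# Print the phoneme counts from most to least frequent.
for phoneme, count in phonemes.most_common():
    print(phoneme, count)
```

For the real corpus, the in-memory sample would be replaced by the opened CSV file, and the column name and separators adjusted to the documented format.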

Proprial lexical sub-corpora

The main lexical corpus is supplemented with several subcorpora:

Names of municipalities and their parts (zip/csv, last updated: 29/06/2016)
· 15,051 names of Czech municipalities and their parts existing at the end of 2013. This sub-corpus was analyzed and described in the paper Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the paper.

SSČ

This sample (49,506 items) was analyzed and described in the papers Corpus-based analysis of the Czech syllable (2014), Kvantitativní analýza slabiky v českém lexikonu (2015), and Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the papers.

The Textual Corpus consists of a selection of phonologically transcribed Czech texts stored in XML files. The texts are mostly Czech novels in the public domain (see the list of the currently included texts). As in the case of the Lexical Corpus, the transcription reflects the phonemic constituency of words (and sentences) and their syllabic structure. The Corpus also includes an allophonic transcription showing the idealized pronunciation of the sentences. In addition, the transcription takes into account the neutral prosodic organization of words within sentences. This prosodic structure was assigned automatically on the basis of the rules proposed by Zdena Palková for the automatic TTS (text-to-speech) synthesis of Czech. See the Description of the Corpora for more details.