Some researchers suggest [1] that in corpus analysis, even less sophisticated algorithms give better results when trained on large, web-scale corpora. Corpus-based language models bring empirical evidence into linguistic inquiry, and statistical methods have become the state-of-the-art techniques in natural language processing and linguistics [2].
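The effect of scale on simple models can be illustrated with a hypothetical toy experiment (not drawn from the cited work): sampling from a Zipf-like vocabulary, even a trivial "model" that merely remembers which word types it has seen covers a growing fraction of unseen text as the training corpus grows.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical setup: a 10,000-type vocabulary with Zipfian frequencies.
vocab = [f"w{i}" for i in range(10_000)]
weights = [1.0 / (rank + 1) for rank in range(len(vocab))]

def sample_corpus(n_tokens):
    """Draw a synthetic corpus of n_tokens from the Zipfian distribution."""
    return random.choices(vocab, weights=weights, k=n_tokens)

test_tokens = sample_corpus(5_000)

def coverage(train_tokens, held_out):
    """Fraction of held-out tokens whose type was seen in training."""
    seen = set(train_tokens)
    return sum(t in seen for t in held_out) / len(held_out)

# The same trivial method improves steadily as the data grows.
for size in (1_000, 10_000, 100_000):
    print(size, round(coverage(sample_corpus(size), test_tokens), 3))
```

The point of the sketch is only that the algorithm stays fixed while the data scales; the gain comes entirely from the corpus size.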

On the other hand, we have to face methodological questions when we use web corpora. In most cases, the industry unconsciously relies on Leech's notion of representativeness [3] and aims to use a corpus that is big enough to allow generalization to the whole language. However, usage determines sampling, and we cannot generalize outside the domain of our data. One of the most striking examples of this vicious circle is named entity recognition, which is a notoriously domain-specific task.
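The domain-dependence problem can be made concrete with a deliberately tiny, hypothetical example (the sentences and vocabularies below are invented for illustration): a vocabulary harvested from one domain says little about text from another, which is one reason tasks like named entity recognition transfer so poorly.

```python
# Hypothetical toy corpora from two domains.
news = "the minister said the parliament voted on the budget today".split()
biomed = "the protein inhibits kinase activity in the mutant cell line".split()

# "Training" on the news domain: just record the observed word types.
news_vocab = set(news)

def oov_rate(tokens, vocab):
    """Fraction of tokens whose type was never seen in training."""
    return sum(t not in vocab for t in tokens) / len(tokens)

print(oov_rate(news, news_vocab))    # 0.0 on its own domain
print(oov_rate(biomed, news_vocab))  # most biomedical tokens are unseen
```

Even this crude out-of-vocabulary measure shows the asymmetry: the model is perfect on the domain it sampled and mostly blind outside it, which is exactly the generalization limit described above.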

Although we aim at fully automatic solutions, we are very far from such applications. The human factor in processing corpora is still important, and we need more elaborate methods. One promising direction is crowdsourcing, which reduces the time and cost of annotation [4], but the cost of expertise in data curation cannot be avoided. Titles like [5] show that the industry is interested in standard practices and needs guidance to overcome ad hoc, domain-specific solutions.