Anybody could pick up a news story and identify journalism sources. Anybody could do the same job on 10, 100 or 1,000 news stories but it would be tedious. Best let JeRI do it.

It even turns out that a branch of computer science, known as entity extraction (or named entity recognition), is dedicated to identifying the names of people, organizations and places in texts. So presto, we have a start for JeRI’s brain. So far, two entity extractors are working to get the job done. One of them is proprietary, and the other is open source.
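We can’t show the two extractors themselves here, but a toy sketch gives the flavour of what they do: scan the text for likely entity mentions. This version just grabs runs of capitalized words, which is far cruder than real named entity recognition; the sentence and names below are made-up example text, not output from our system.

```python
import re

def extract_entities(text):
    """Toy entity extractor: finds runs of two or more capitalized
    words, a crude stand-in for real named-entity recognition."""
    pattern = r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b"
    return re.findall(pattern, text)

sentence = ("Police Chief Mark Saunders defended the practice, "
            "while Desmond Cole called for an end to carding.")
print(extract_entities(sentence))
# → ['Police Chief Mark Saunders', 'Desmond Cole']
```

A real extractor also labels each entity (person, organization, place) and handles lower-cased and one-word names, which is exactly why we lean on purpose-built tools rather than regexes.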

Making judgments about how to categorize the extracted entities is a much more complex problem, however. We are training JeRI to develop a dictionary of words (tokens, in natural-language-processing terms) that defines our categories of news sources. The tricky part is that, to make finer judgments about how to categorize the sources, the dictionary will also need to be combined with a machine-learning technique called conditional random field (CRF) classification, which is based on statistical probability.
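To make that concrete, here is a minimal sketch of the dictionary half of the idea, together with the kind of per-token feature bundle a CRF classifier typically consumes. The dictionary entries and category names are invented for illustration; they are not JeRI’s actual lexicon.

```python
# Hypothetical mini-dictionary mapping tokens to source categories.
SOURCE_DICTIONARY = {
    "chief": "police", "officer": "police", "constable": "police",
    "activist": "community", "resident": "community",
    "councillor": "government", "mayor": "government",
}

def token_features(tokens, i):
    """Features for token i, of the sort fed to a CRF: the token,
    its dictionary category (if any), and its immediate context."""
    token = tokens[i].lower()
    return {
        "token": token,
        "dict_category": SOURCE_DICTIONARY.get(token, "unknown"),
        "prev": tokens[i - 1].lower() if i > 0 else "<start>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<end>",
    }

tokens = "Police Chief Mark Saunders spoke".split()
print(token_features(tokens, 1))
# → {'token': 'chief', 'dict_category': 'police', 'prev': 'police', 'next': 'mark'}
```

The CRF’s job is then to weigh these features probabilistically across the whole sentence, so that context can override the dictionary when a word like “officer” appears in a non-police sense.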

We know how humans would categorize these sources – so we can compare JeRI’s success to that standard.
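The comparison itself is simple: line up JeRI’s category for each source against the human-assigned one and measure agreement. The labels below are invented for illustration; our actual evaluation data is the human-coded corpus.

```python
def agreement_rate(machine_labels, human_labels):
    """Fraction of sources where the machine's category
    matches the human-assigned one."""
    matches = sum(m == h for m, h in zip(machine_labels, human_labels))
    return matches / len(human_labels)

human   = ["police", "community", "government", "community", "police"]
machine = ["police", "community", "police",     "community", "police"]
print(agreement_rate(machine, human))
# → 0.8
```

In practice one would also look at agreement per category, since a classifier can score well overall while consistently misfiling one kind of source.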

And if we can train JeRI to categorize sources at a rate that reaches our standard, we can then begin to think about how to weight the placement of each source in a text, and how much weight the length of its quoted material should carry in JeRI’s index.
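One way to sketch such a weighting: score each source from its paragraph position and the number of words it is quoted for, then blend the two. The formula and the weights below are purely illustrative assumptions, not JeRI’s actual scoring.

```python
def source_score(paragraph_index, quote_words,
                 placement_weight=0.6, length_weight=0.4):
    """Hypothetical score for one source: earlier placement and
    longer quotes count for more. Weights are illustrative only."""
    placement = 1.0 / (1 + paragraph_index)   # paragraph 0 scores highest
    length = min(quote_words / 100.0, 1.0)    # cap credit at 100 quoted words
    return placement_weight * placement + length_weight * length

# A source quoted at length in the lead vs. briefly near the end:
print(round(source_score(0, 80), 3))   # → 0.92
print(round(source_score(10, 15), 3))  # → 0.115
```

The design question hiding in those two parameters is an editorial one: how much should a long quote buried deep in a story count against a short quote in the lead?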

Any one article could be indexed against the aggregate score for the entire corpus – in our case, articles about police carding and racial profiling in Toronto media coverage.
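A straightforward way to do that indexing is a z-score: how many standard deviations one article’s score sits from the corpus average. The scores below are made-up numbers standing in for real JeRI output.

```python
from statistics import mean, stdev

def index_against_corpus(article_score, corpus_scores):
    """How far one article's score sits from the corpus
    average, measured in standard deviations (a z-score)."""
    return (article_score - mean(corpus_scores)) / stdev(corpus_scores)

# Hypothetical JeRI scores for a small corpus of carding articles:
corpus = [0.42, 0.55, 0.48, 0.61, 0.39]
print(round(index_against_corpus(0.70, corpus), 2))
# → 2.31
```

An article more than two standard deviations above the aggregate, as here, would stand out as unusually heavy on whatever the index measures relative to the rest of the coverage.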

In theory, this training method could be applied to any journalism themes or article types.