StrepHit

StrepHit is an intelligent reading agent that understands text and translates it into Wikidata statements.

More specifically, it is a Natural Language Processing pipeline that extracts facts from text and produces Wikidata statements with references. Its final objective is to enhance the data quality of Wikidata by suggesting references to validate statements.

column vector with the sample label (i.e. the correct answer for the classifier)

Return type:

tuple

process_sentence(sentence, fes, add_unknown, gazetteer)

Extracts and accumulates features for the given sentence

Parameters:

sentence (unicode) -- Text of the sentence

fes (dict) -- Dictionary with FEs and corresponding chunks

add_unknown (bool) -- Whether unknown tokens should be added to the index or treated as a special, unknown token. Set to True when building the training set and to False when building the features used to classify new sentences (see the sketch after this parameter list)

gazetteer (dict) -- Additional features to add when a given chunk is found in the sentence. Keys should be chunks and values should be lists of features
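
The following self-contained sketch only illustrates the behaviour described by *add_unknown* and *gazetteer*; it is not the actual extractor, and the token index, feature values and helper function are assumptions made for the example.

# Illustrative sketch, not the real feature extractor: unknown tokens are
# either added to the index (training) or mapped to a special value
# (classification), and gazetteer features fire when their chunk is found.
UNKNOWN = -1

def sentence_features(sentence, token_index, add_unknown, gazetteer):
    features = []
    for token in sentence.split():
        if token not in token_index:
            if add_unknown:                  # training: grow the index
                token_index[token] = len(token_index)
            else:                            # classification: special unknown token
                features.append(UNKNOWN)
                continue
        features.append(token_index[token])
    for chunk, extra in gazetteer.items():   # gazetteer-driven features
        if chunk in sentence:
            features.extend(extra)
    return features

index = {}
print(sentence_features(u'born in new york', index, True, {u'new york': ['GAZ_LOC']}))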

Finds matches in text strings using regular expressions and transforms them according to a pattern transformation expression evaluated on the match.

The specifications are given in YAML format and allow the definition of meta functions and meta variables, as well as the pattern and transformation rules themselves. Meta variables are substituted into the patterns that use them, in order to make writing patterns easier; they are also available inside the meta functions as a dictionary named meta_vars.

A pattern transformation expression is an expression that is evaluated when the corresponding regular expression matches. It has access to all the defined meta functions and meta variables, and to a variable named 'match' containing the regex match found.
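
A minimal, self-contained sketch of this mechanism follows; the pattern, the meta variable and the transformation expression are made up for illustration and do not reflect the actual YAML specification files.

# Sketch of the mechanism only: a meta variable is substituted into a regex
# pattern, and a transformation expression is evaluated on the resulting match.
import re

meta_vars = {'month': r'(?P<month>january|february|march|april)'}

# the pattern uses the meta variable to stay readable
pattern = r'%(month)s\s+(?P<day>\d+)' % meta_vars
transformation = "{'month': match.group('month'), 'day': int(match.group('day'))}"

match = re.search(pattern, 'born on april 18', re.I)
if match:
    # the transformation expression sees the regex match as 'match'
    result = eval(transformation, {'meta_vars': meta_vars, 'match': match})
    print(result)  # {'month': 'april', 'day': 18}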

normalize_many(expression)

Find all the matching entities in the given expression

Parameters:

expression (str) -- The expression in which to look for

Returns:

Generator of tuples (start, end), category, result

Sample usage:

>>> from pprint import pprint
>>> from strephit.commons.date_normalizer import DateNormalizer
>>> pprint(list(DateNormalizer('en').normalize_many('I was born on April 18th, '
...                                                 'and today is April 18th, 2016!')))
[((14, 24), 'Time', {'day': 18, 'month': 4}),
 ((39, 55), 'Time', {'day': 18, 'month': 4, 'year': 2016})]

normalize_one(expression, conflict='longest')

Find the matching part in the given expression

Parameters:

expression (str) -- The expression in which to search the match

conflict (str) -- Whether to return the first match found, or to scan all the provided regular expressions and return the longest or shortest portion of the string matched. Note that the match is always the first one found in the string; this parameter only resolves conflicts when more than one regular expression produces a match. When several matches have the same length, the first one found counts. Allowed values are *first*, *longest* and *shortest*

Returns:

Tuple with (start, end), category, result

Return type:

tuple

Sample usage:

>>> from strephit.commons.date_normalizer import DateNormalizer
>>> DateNormalizer('en').normalize_one('Today is the 1st of June, 2016')
((13, 30), 'Time', {'month': 6, 'day': 1, 'year': 2016})

Applies the given function to each element of the iterable in parallel.

*None* values are not allowed in the iterable nor as return values; they will simply be discarded. Can be "safely" stopped with a keyboard interrupt.

Parameters:

function -- the function used to transform the elements of the iterable

processes -- how many processes to use for parallel execution. Use zero or a negative number to use all the available processors. No additional processes are used if the value is 1.

flatten -- If the mapping function returns an iterable, flatten the resulting iterables into a single one.

raise_exc -- Only used when *processes* equals 1: controls whether exceptions raised by the mapping function are propagated to the caller or simply logged, carrying on with the computation. When *processes* is different from 1 this parameter is ignored.

batch_size -- If larger than 0, the input iterable is grouped into batches of this size and each resulting list is passed as a single argument to the worker.
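
A usage sketch follows; it assumes the function described here is exposed as strephit.commons.parallel.map, which is not stated explicitly in this documentation.

# Hedged usage sketch: the import path and function name are assumptions.
from strephit.commons import parallel

def tokenize(item):
    # worker function: must not return None, as None values are discarded
    return item.split()

corpus = ['first sentence here', 'second sentence here']
# processes=0 uses all available processors; flatten=True merges the
# per-item token lists into a single stream of tokens
for token in parallel.map(tokenize, corpus, processes=0, flatten=True):
    print(token)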

POS-Tags the text documents contained in the given items. Use this for massive text tagging.

Parameters:

items -- Iterable of items to tag. Generator preferred

document_key -- Where to find the text to tag inside each item. Text must be unicode

pos_tag_key -- Where to put pos tagged text

Sample usage:

>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(list(TTPosTagger('en').tag_many(
...     [{'text': u'Item one is in first position'}, {'text': u'In the second position is item two'}],
...     'text', 'tagged'
... )))
[{'tagged': [Tag(word=u'Item', pos=u'NN', lemma=u'item'),
             Tag(word=u'one', pos=u'CD', lemma=u'one'),
             Tag(word=u'is', pos=u'VBZ', lemma=u'be'),
             Tag(word=u'in', pos=u'IN', lemma=u'in'),
             Tag(word=u'first', pos=u'JJ', lemma=u'first'),
             Tag(word=u'position', pos=u'NN', lemma=u'position')],
  'text': u'Item one is in first position'},
 {'tagged': [Tag(word=u'In', pos=u'IN', lemma=u'in'),
             Tag(word=u'the', pos=u'DT', lemma=u'the'),
             Tag(word=u'second', pos=u'JJ', lemma=u'second'),
             Tag(word=u'position', pos=u'NN', lemma=u'position'),
             Tag(word=u'is', pos=u'VBZ', lemma=u'be'),
             Tag(word=u'item', pos=u'RB', lemma=u'item'),
             Tag(word=u'two', pos=u'CD', lemma=u'two')],
  'text': u'In the second position is item two'}]

tag_one(text, skip_unknown=True, **kwargs)

POS-Tags the given text, optionally skipping unknown lemmas

Parameters:

text (unicode) -- Text to be tagged

skip_unknown (bool) -- Automatically remove unrecognized tags from the result

Sample usage:

>>> from strephit.commons.pos_tag import TTPosTagger
>>> from pprint import pprint
>>> pprint(TTPosTagger('en').tag_one(u'sample sentence to be tagged fycgvkuhbj'))
[Tag(word=u'sample', pos=u'NN', lemma=u'sample'),
 Tag(word=u'sentence', pos=u'NN', lemma=u'sentence'),
 Tag(word=u'to', pos=u'TO', lemma=u'to'),
 Tag(word=u'be', pos=u'VB', lemma=u'be'),
 Tag(word=u'tagged', pos=u'VVN', lemma=u'tag')]

tokenize(text)

Splits a text into tokens

strephit.commons.pos_tag.get_pos_tagger(language, **kwargs)

Returns an initialized instance of the preferred POS tagger for the given language
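
A minimal usage sketch, combining get_pos_tagger with the tag_one method documented above; the extra keyword arguments it accepts are not documented here, so none are assumed.

from strephit.commons.pos_tag import get_pos_tagger

# obtain the preferred tagger for English and tag a single sentence
tagger = get_pos_tagger('en')
tags = tagger.tag_one(u'George Washington was the first president')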

>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split(
...     "This is the first sentence. Mr. period doesn't always delimit sentences"
... ))
['This is the first sentence.', "Mr. period doesn't always delimit sentences"]

split_tokens(tokens)

Splits the given text into sentences.

Parameters:

tokens (list) -- the tokens of the text

Returns:

the sentences in the text

Return type:

generator

Sample usage:

>>> from strephit.commons.split_sentences import PunktSentenceSplitter
>>> list(PunktSentenceSplitter('en').split_tokens(
...     "This is the first sentence. Mr. period doesn't always delimit sentences".split()
... ))
[['This', 'is', 'the', 'first', 'sentence.'],
 ['Mr.', 'period', "doesn't", 'always', 'delimit', 'sentences']]

Extracts some sentences from the corpus following the given probabilities

Parameters:

sentences (iterable) -- Extracted sentences

probabilities (dict) -- Conditional probabilities of extracting a sentence containing a specific LU given the source of the sentence. It is therefore a mapping source -> probabilities, where probabilities is itself a mapping LU -> probability

processes (int) -- how many processes to use for parallel execution

input_encoded (bool) -- whether the corpus is an iterable of dictionaries or an iterable of JSON-encoded documents. JSON-encoded documents are preferable to large dictionaries for performance reasons

output_encoded (bool) -- whether to return a generator of dictionaries or a generator of JSON-encoded documents. Prefer encoded output for performance reasons

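The sketch below illustrates the sampling logic described by these parameters; it is not the actual implementation, and the item fields ('source', 'lu') as well as the probability values are assumptions made for the example.

# Illustrative sketch only: keep a sentence with the probability assigned to
# its LU, conditioned on the source the sentence comes from.
import random

probabilities = {
    'some-source.org': {'play': 0.8, 'win': 0.3},
}

def sample_sentences(sentences, probabilities):
    for sentence in sentences:
        lu_probabilities = probabilities.get(sentence['source'], {})
        if random.random() < lu_probabilities.get(sentence['lu'], 0.0):
            yield sentence

corpus = [{'source': 'some-source.org', 'lu': 'play', 'text': u'He played for Juventus'}]
print(list(sample_sentences(corpus, probabilities)))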

Specify the selectors to suit the website to scrape. The spider first uses

a list of selectors to reach a page containing the list of items to scrape.

Another selector is used to extract urls pointing to detail pages, containing

the details of the items to scrape. Finally a third selector is used to

extract the url pointing to the next "list" page.

*list_page_selectors* is a list of selectors used to reach the page containing the items to scrape. Each selector is applied to the page(s) fetched by extracting the url from the previous page using the preceding selector.

*detail_page_selectors* extracts the urls pointing to the detail pages. It can be a single selector or a list.

*next_page_selectors* extracts the url pointing to the next page

Selectors starting with *css:* are CSS selectors, those starting with *xpath:* are XPath selectors; all others should follow the syntax *method:selector*, where *method* is the name of a method of the spider and *selector* is another selector specified in the same way as above. The method is used to transform the result obtained by extracting the item pointed to by the selector; it should accept the response as its first parameter and, only if a selector is specified, the result of extracting the data pointed to by the selector as its second parameter.

The spider provides a simple method to parse items. The item class is specified in *item_class* (it must inherit from *scrapy.Item*) and item fields are specified in the dict *item_fields*, whose keys are field names and whose values are selectors following the syntax described above. Values can also be lists or dicts, arbitrarily nested.
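
To make the selector syntax concrete, here is a hypothetical spider configuration; the base class import path, the target website, the selectors and the clean_text helper are all assumptions used only for illustration, not taken from the actual project.

import scrapy

# assumed import path for the base spider described above
from strephit.web_sources_corpus.spiders.BaseSpider import BaseSpider

class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    bio = scrapy.Field()

class ExampleSpider(BaseSpider):
    name = 'example'
    start_urls = ['http://example.org/authors']

    # selectors used to reach the page listing the items to scrape
    list_page_selectors = ['xpath:.//a[@class="authors-index"]/@href']
    # selector extracting the urls of the detail pages
    detail_page_selectors = 'xpath:.//a[@class="author-link"]/@href'
    # selector extracting the url of the next "list" page
    next_page_selectors = 'css:a.next::attr(href)'

    item_class = AuthorItem
    item_fields = {
        'name': 'css:h1.author-name::text',
        # *method:selector* syntax: clean_text transforms the extracted values
        'bio': 'clean_text:xpath:.//div[@id="bio"]//text()',
    }

    def clean_text(self, response, values):
        # hypothetical helper: joins and strips the extracted text fragments
        return ' '.join(value.strip() for value in values)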