You cannot simply check every word, because certain words legitimately contain the same letter more than once in a row in their spelling. So one way you could go is: for each letter in the word, restrict its number of consecutive appearances to two, then check the new spelling...
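
A minimal sketch of the "at most two consecutive appearances" step, assuming a regular expression is acceptable here. The function name limit_repeats and the max_run parameter are made up for illustration; the spelling check that would follow is not shown.

```python
import re

def limit_repeats(word, max_run=2):
    """Shorten any run of one character longer than max_run down to max_run."""
    # (.)\1{max_run,} matches a character followed by at least max_run copies of itself,
    # i.e. a run of max_run + 1 or more identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, word)

print(limit_repeats("coooool"))  # run of five "o" is cut to two
print(limit_repeats("hello"))    # a legitimate double letter is left alone
```

The result is only a candidate spelling; it would still need to be looked up in a dictionary, since a word like "yesss" becomes "yess" rather than "yes".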

Firstly, it's better to leave the import at the top of your code instead of within your class: from sklearn.feature_extraction.text import TfidfVectorizer class changeToMatrix(object): def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()): ... Next, StemTokenizer doesn't seem to be a canonical class. Possibly you've got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it...

The difference between Latent Semantic Analysis and so-called Explicit Semantic Analysis lies in the corpus that is used and in the dimensions of the vectors that model word meaning. Latent Semantic Analysis starts from document-based word vectors, which capture the association between each word and the documents in which it...

Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings. In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates...

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is qnum_pattern = r"^\s*\(Q\d+\)\.\s+" You can use it to pull out the questions like this: questions = [ re.sub(qnum_pattern, "", line) for line in text...
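
A short sketch of how the pattern above could be applied end to end, assuming the text is a plain string with one question per line. The helper name extract_questions and the sample text are hypothetical.

```python
import re

# Pattern from the answer: optional indentation, "(Q<number>).", then whitespace.
qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

def extract_questions(text):
    """Return the question lines of text with their "(Qn)." labels removed."""
    questions = []
    for line in text.splitlines():
        if re.match(qnum_pattern, line):  # keep only lines that start with a label
            questions.append(re.sub(qnum_pattern, "", line).strip())
    return questions

sample = """Some introduction
(Q1). What is a token?
  (Q2). What is a lemma?
Not a question"""
print(extract_questions(sample))
```

Lines without a question label are skipped rather than passed through unchanged.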

There is no PR in GATE that will pair arguments and create instances for you. You must therefore create instances that are relevant to your problem. You can write a custom PR, or write some JAPE with a Java RHS. You can probably split your corpus into a training and...

There is no difference between them in the current version. Historically, a CASConsumer would typically not modify the CAS, but only use the data existing in the CAS (previously added by an Analysis Engine) to aggregate it/prepare it for use in other systems, e.g., ingestion in databases. In the current...

Freebase was retired and Wikidata is the recommended alternative. You can use the Wikidata Query API to get entities with a specific property. For instance, the query http://wdq.wmflabs.org/api?q=CLAIM[26] retrieves the IDs of all items having the property spouse (P26). You can combine this with the Wikidata API, for instance to...

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/ All the files are zip files. The data format is described on the Brown Corpus Wikipedia. I dunno what else to say. From there things should be obvious. EDIT: if you want original source data, I think there's...

To solve this specifically for linear SVM, we first have to understand the formulation of the SVM in sklearn and the differences that it has to MultinomialNB. The reason why the most_informative_feature_for_class works for MultinomialNB is because the output of the coef_ is essentially the log probability of features given...

You could try something like this: def nounify(verb_word): set_of_related_nouns = set() for lemma in wn.lemmas(wn.morphy(verb_word, wn.VERB), pos="v"): for related_form in lemma.derivationally_related_forms(): for synset in wn.synsets(related_form.name(), pos=wn.NOUN): if wn.synset('person.n.01') in synset.closure(lambda s:s.hypernyms()): set_of_related_nouns.add(synset) return set_of_related_nouns This method looks up all derivationally related nouns to a verb, and checks if they have...

I wrote this in Python; I believe you can read it as pseudocode. I added the Java implementation later. import re # grammar repository grammar_repo = [] s = "( S ( NP-SBJ ( PRP I ) ) ( [email protected]

Yes, the standard PCFG parser (the one that is run by default without any other options specified) will choke on this sort of long nonsense data. You might have better luck using the shift-reduce constituency parser, which is substantially faster than the PCFG and nearly as accurate.

It does split on these characters, however only when they appear as their own token and not at the end of an abbreviation such as in "etc.". So the issue here is not the sentence splitter but the tokenizer which thinks that "N." is an abbreviation and therefore does not...

There are a couple of datasets like this: Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle competition; you can sign up and download them). There are also the AOL Query Logs and MSN Query Logs, which were released as part of shared tasks over the past 10 years. I'm not...

I suspect what you really mean by stem is "tense". That is, you want the different tenses of each verb to count towards its base form. Check out the pattern package: pip install pattern Then use the en.lemma function to return a verb's base form. import...

Yes, this is possible, but a bit tricky and there is no out of the box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and in case you also have trees the parse...

The information you're trying to get isn't actually there. If you take two strings, both of which may have any number of spaces, and join them together with a space, it's no longer possible to tell unambiguously which space was joining the two strings, and which spaces were part of...
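
A tiny illustration of the ambiguity, with made-up input strings: two different pairs of inputs produce exactly the same joined string, so nothing in the result can tell you which space was the separator.

```python
def join_with_space(left, right):
    """Join two strings with a single space, as described above."""
    return left + " " + right

a = join_with_space("foo ", "bar")  # trailing space belongs to the left string
b = join_with_space("foo", " bar")  # leading space belongs to the right string
print(a == b)                       # both are "foo", two spaces, "bar"
```

If the boundary matters later, it has to be stored separately (for example as an offset) at the time of joining.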

I don't know about tokenization in mixed-language texts, so I propose the following hack: go through the text until you find an English word; all text before this word can be tokenized by the Chinese tokenizer; the English word can be appended as another token; repeat. Below is code sample....

The task of language identification is well researched and there are a lot of good libraries. For Java, try TIKA, or Language Detection Library for Java (they report "99% over precision for 53 languages"), or TextCat, or LingPipe - I'd suggest starting with the first, it seems to have...

This should work for you s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely" s = parse(s) #Create a list of all the tags you don't want dont_want = ["UH", "PRP"] sentence = parse(s).split(" ") #Go through...

This isn't really a programming question, but anyway: If your goal is prediction, as opposed to text classification, usual methods are backoff models (Katz Backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing. More complicated models like Random Forests are AFAIK not absolutely necessary and may pose problems if you need to make...

You have at least 2 options: combine 2 kinds of features with FeatureUnion: one for ngram_range of (1,1) with stop words and one for ngram_range of (2,3) without stop words (more efficient, but harder to implement and use) implement your own analyzer that will check for presence in stop word...

The most common data structures in language models are tries and hash tables. You can take a look at Kenneth Heafield's paper on his own language model toolkit KenLM for more detailed information about the data structures used by his own software and related packages.
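
To make the trie idea concrete, here is a toy sketch of an n-gram count trie built on nested hash tables (Python dicts). It is only an illustration: the class names are made up, and it does not reflect how KenLM or any other toolkit actually lays out its data.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # word -> TrieNode (a hash table at every level)
        self.count = 0      # number of stored n-grams passing through this node

class NGramTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, ngram):
        """Insert one n-gram (a sequence of words), counting every prefix on the way."""
        node = self.root
        for word in ngram:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1

    def count(self, ngram):
        """Return how many stored n-grams start with the given word sequence."""
        node = self.root
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

tokens = "the cat sat on the cat".split()
trie = NGramTrie()
for bigram in zip(tokens, tokens[1:]):
    trie.add(bigram)

print(trie.count(("the", "cat")))  # bigram "the cat" was added twice
print(trie.count(("cat", "on")))   # never seen
```

N-grams that share a history share the nodes for that history, which is the main space saving over one flat hash table keyed by whole n-grams.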

As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims. Apache UIMA is an open source implementation of the UIMA specification. The latter...

We use the tag set of the (Penn/LDC/Brandeis/UC Boulder) Chinese Treebank. See here for details on the tag set: http://www.cis.upenn.edu/~chinese/ This was documented in the parser FAQ, but I'll add it to the tagger FAQ....

That would work as a first approximation. The problem with fixed word lists for language detection, though, is that real texts (and especially short ones) may not provide enough hits in your list. A more reliable approach would collect parts of other language features (like statistics of letter n-grams that...

Your small data set script is largely correct, but with some minor errors. You are missing the if i=='date': continue line. (The source of your 'not comparable' error). In your post, your else line is mis-indented. Possibly (only possibly) you need a call to plt.hold(True) to prevent the creation of...

I usually use the subtrees function in combination with a filter for this. Changing your tree slightly to show that it only selects one of the NP's now: >>> tree = ParentedTree.fromstring("(S (NP (NNP)) (VP (VBZ) (NP (NNS))))") >>> for st in tree.subtrees(filter = lambda x: x.label() == "NP" and...

What you are trying to do is essentially Natural Language Understanding, a subfield of Natural Language Processing, which in turn is a subfield of Computational Linguistics (often thought of as the engineering arm). You could do semantic parsing or relation extraction. Either is fine for this task. I decided to read...

As Ed Cottrell commented, you need to consider what happens if you encounter a word that is not in the documents in a category. You can avoid multiplying by 0 by using Laplace smoothing. If you see a word in k out of n documents in a category, you assign...
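
A minimal sketch of the smoothing step, assuming the usual add-one form for a word that is either present or absent in a document, i.e. (k + 1) / (n + 2). The function name and the alpha parameter are illustrative, not taken from the answer.

```python
import math

def smoothed_probability(k, n, alpha=1.0):
    """Estimate P(word present | category) from k of n documents with add-alpha smoothing.

    The denominator adds 2 * alpha because there are two outcomes: present or absent.
    """
    return (k + alpha) / (n + 2 * alpha)

# A word never seen in the category no longer gets probability 0 ...
unseen = smoothed_probability(0, 10)
# ... so its logarithm is finite and can be summed with the other word scores.
print(unseen, math.log(unseen))
```

With alpha = 1 this is Laplace (add-one) smoothing; smaller values of alpha stay closer to the raw k / n estimate.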

TF in TF-IDF means frequency of a term in a document. In other words, TF-IDF is a measure for both the term and the document. Here is a good illustration of what I mean. As far as I understand your case, you don't work with any particular document, instead you...

You can build a good machine learning model for sentiment analysis using Amazon ML. Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media Since Amazon ML supports supervised learning as well as text as an input attribute, you need to get a sample of data...

Found the answer, so sharing the wisdom on SO :-) ... The deserialization can be done using the TSVReader in the edu.emory.clir.clearnlp.reader package. public void readCoNLL(String inputFile) throws Exception { TSVReader reader = new TSVReader(0, 1, 2, 4, 5, 6, 7); reader.open(new FileInputStream(inputFile)); DEPTree tree; while ((tree = reader.next()) !=...

In UIMA, the primary way to ensure that the annotations your annotator needs are present is to aggregate annotators together. So, to answer your question, that is how you are going to achieve what you want, because what you want to do (have UIMA figure out...

I have made some progress in working out whether the word is actually a preposition or a subordinating conjunction. I parsed the following sentence: She left early because Mike arrived with his new girlfriend. (Here, because is a subordinating conjunction.) After POS tagging: She_PRP left_VBD early_RB because_IN Mike_NNP arrived_VBD with_IN his_PRP$ new_JJ...

LanguageTool can do that (disclaimer: I'm the maintainer of LanguageTool), it's available under LGPL and implemented in Java. You could use GermanTagger.tag(), the result can have more than one reading (as language is often ambiguous), and each reading's AnalyzedToken finally has a lemma.

Like many components in AI, the Stanford coreference system is only correct to a certain accuracy. In the case of coreference this accuracy is actually relatively low (~60 on standard benchmarks in a 0-100 range). To illustrate the difficulty of the problem, consider the following apparently similar sentence with a...

This FAQ answer explains the difference in a long paragraph. Relevant parts are quoted below: Can you explain the different parsers? This answer is specific to English. It mostly applies to other languages although some components are missing in some languages. The file englishPCFG.ser.gz comprises just an unlexicalized PCFG grammar....

The challenge is you need to make sure that the token isn't part of its representative mention. For example, the token "Judy" has "Judy 's" as its representative mention, so if you replace it in the phrase "Judy 's", you'll end up with the double "'s". You can check if...

Hi I'll try to help out! So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc... And you want something to tag strings based off of your list. So when...

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix. Example code for two simple...

For each suffix in the given list, check whether the word ends with it; if so, remove the suffix, otherwise return the word unchanged. suffixes = ['ing'] def stem(word): for suff in suffixes: if word.endswith(suff): return word[:-len(suff)] return word print(stem('having')) >>> hav...

You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability...
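
As a rough sketch of the idea, the following trains an unsmoothed bigram model on a made-up three-sentence corpus and scores whole sentences. The function names and the corpus are hypothetical, and a real model would add smoothing so that unseen bigrams do not zero out the whole sentence.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count context words and bigrams, with <s> and </s> as sentence boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that can precede another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Multiply the maximum-likelihood estimates P(word | previous word)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0                           # unseen context word
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

corpus = ["how are you", "how are they", "how is it"]
unigrams, bigrams = train_bigram_model(corpus)
print(sentence_probability("how are you", unigrams, bigrams))
print(sentence_probability("are how you", unigrams, bigrams))
```

The well-ordered sentence gets a higher score than the scrambled one because its two-word sequences were actually observed in the corpus.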

A ± 2 window means 2 words to the left and 2 words to the right of the target word. For target word "silence", the window would be ["gavel", "to", "the", "court"], and for "hammer", it would be ["when", "the", "struck", "it"].
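
The same definition as a small function. The token lists below are hypothetical fragments chosen only so that they reproduce the two windows quoted above; the original sentence is not shown here.

```python
def context_window(tokens, index, size=2):
    """Return up to `size` words on each side of tokens[index], excluding the target."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right

fragment_1 = ["gavel", "to", "silence", "the", "court"]
fragment_2 = ["when", "the", "hammer", "struck", "it"]

print(context_window(fragment_1, fragment_1.index("silence")))
print(context_window(fragment_2, fragment_2.index("hammer")))
```

Near the start or end of a sentence the window is simply shorter, since there are fewer than two words on that side.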

You can define the dependency with @TypeCapability like this: @TypeCapability(inputs = { "com.myproject.types.MyType", ... }, outputs = { ... }) public class MyAnnotator extends JCasAnnotator_ImplBase { .... } Note that it defines a contract at the annotation level, not the engine level (meaning that any Engine could create com.myproject.types.MyType). I...

As outlined in the comments, "orientation" is not a well-defined concept in this context. A traditional word vector space has one dimension for each term. In order for word vectors to be compatible, they will need to have the same term order. This is typically not the case between different...

As I see in your code samples, you don't call tree() in this line >>> print(next(next(mp.parse_sents([sent,sent2])))) while you do call tree() in all cases with parse_one(). Otherwise I don't see the reason why it could happen: parse_one() method of ParserI isn't overridden in MaltParser and everything it does is simply...

According to http://nlp.stanford.edu/software/lex-parser.shtml, Stanford NLP does have a parser which can identify the subject and predicate of a sentence. You can try it out online http://nlp.stanford.edu:8080/parser/index.jsp. You can use the typed dependencies to identify the subject, predicate, and object. From the example page, the sentence My dog also likes eating...

Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma. In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing...

There is an Annotation constructor with a List<CoreMap> sentences argument which sets up the document if you have a list of already tokenized sentences. For each sentence you want to create a CoreMap object as follows. (Note that I also added a sentence and token index to each sentence and...

The link that you have already mentioned has two different tagsets. For tagset documentation, see nltk.help.upenn_tagset() and nltk.help.brown_tagset(). In this particular example, these tags are from Penn Treebank tagset. You can also read about these tags by: nltk.help.upenn_tagset('DT') nltk.help.upenn_tagset('EX') ...

I'm sure you've considered assigning each new word you encounter an integer. You'll have to keep track somewhere, but that's one option. You could also use whatever built-in hash method js has. If you don't mind a few hash collisions, and the size of the resulting integers doesn't matter, may...
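
The "assign each new word an integer" option could look roughly like this. It is sketched in Python for brevity (the same idea carries over to a plain object or Map in js), and the class and method names are made up.

```python
class Vocabulary:
    """Hands out consecutive integer ids to words in order of first appearance."""

    def __init__(self):
        self._ids = {}  # word -> integer id; this is the "keeping track" part

    def id_for(self, word):
        if word not in self._ids:
            self._ids[word] = len(self._ids)  # next unused integer
        return self._ids[word]

vocab = Vocabulary()
print([vocab.id_for(w) for w in "to be or not to be".split()])
```

Unlike a hash, this never produces collisions, at the cost of having to store the mapping for as long as the ids are needed.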

Is there a specific reason why you are using version 3.4.1 and not the latest version? If I run your code with the latest version, it works for me (after I change the path to the SR model to edu/stanford/nlp/models/srparser/englishSR.ser.gz but I assume you changed that path on purpose). Also...

Based on what you want to do, this should work. It will give you the closest left NP node first, then the second closest, etc. So, if you had a tree of (S (NP1) (VP (NP2) (VBZ))), your np_trees list would have [ParentedTree(NP2), ParentedTree(NP1)]. from nltk.tree import * np_trees =...

A classification report is straightforward: a report of P/R/F-Measure for each class in your test data. In multiclass problems, it is not a good idea to read Precision/Recall and F-Measure over the whole data, since any imbalance would make you feel you've reached better results. That's where such reports help....
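
For illustration, here is a small stdlib-only sketch of what such a per-class report computes. The function name and the toy labels are made up; in practice you would more likely call a library routine than write this yourself.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed separately for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted this class, but it was wrong
            fn[gold] += 1  # missed an instance of the true class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        report[label] = (p, r, f)
    return report

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
for label, scores in per_class_report(y_true, y_pred).items():
    print(label, scores)
```

Overall accuracy here is 3 out of 4, but the per-class view shows that class "b" has a precision of only one half, which a single aggregate number would hide.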

The second parameter of NERTagger is the path to the stanford tagger jar file, not the path to the model. So, change it to stanford-ner.jar (and place it there, of course). Also it seems that you should choose english.conll.4class.caseless.distsim.crf.ser.gz (from stanford-corenlp-caseless-2015-04-20-models.jar) instead of english.conll.4class.distsim.crf.ser.gz Thus try the following: english_nertagger =...

See the following example: "find me an android phone with 4 gb ram and at least 16 gb storage." First of all, you need a list of words that you can directly extract from the input and insert in your search query. This is the easy part. "find me an android...

You can get the name and the pos tag of each synset like this: from nltk.corpus import wordnet as wn synonyms = wn.synsets('good','a') for synset in synonyms: print(synset.name()) print(synset.pos()) The name is the combination of word, pos and sense, such as 'full.s.06'. If you just want the word, you can...

Here is my solution. The idea is that phonetic alphabets can have a Unicode representation and then: Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer that: Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have...

Here's a method using the command line and perl: Save the text below as removeSW.sh: #! /bin/bash MYREGEX=\\b\(`perl -pe 's/\n/|/g' $1`\)\\b perl -pe "s/$MYREGEX//g" $2 Then if you have saved your file above as stopwords.txt, and have a second file (e.g.) called testtext.txt that contains: This is a file with...

Without using the Natural Language Toolkit (NLTK), you can use a simple Python command, as follows. >>> line="a sentence with a few words" >>> line.split() ['a', 'sentence', 'with', 'a', 'few', 'words'] This is given in Split string into a list in Python ...

The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications. Here are some excerpts I found on the Web about moses.ini. In the Moses Core,...

First of all, make sure that your file actually contains line breaks; then you can safely read it line by line (discussed here): with open('my100GBfile.txt') as corpus: for line in corpus: sequence = preprocess(line) extract_n_grams(sequence) Let's assume that your corpus doesn't need any special treatment. I guess you can find...
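
A self-contained sketch of the loop above. The bodies of preprocess and extract_n_grams are placeholders of my own (lowercasing plus whitespace splitting, and sliding-window tuples), and an in-memory io.StringIO stands in for the large file so the example can be run as is.

```python
import io
from collections import Counter

def preprocess(line):
    """Placeholder preprocessing: lowercase and split on whitespace."""
    return line.lower().split()

def extract_n_grams(sequence, n=2):
    """Yield every run of n consecutive tokens as a tuple."""
    return zip(*(sequence[i:] for i in range(n)))

def count_n_grams(lines, n=2):
    """Stream over lines one at a time, so only the counts are held in memory."""
    counts = Counter()
    for line in lines:
        counts.update(extract_n_grams(preprocess(line), n))
    return counts

corpus = io.StringIO("the cat sat\nthe cat ran\n")  # stand-in for open('my100GBfile.txt')
counts = count_n_grams(corpus)
print(counts[("the", "cat")])
```

Note that the table of counts itself can still grow very large for a corpus of that size, even though the text is never loaded in full.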

A word can be part of multiple coreference mentions. Consider for example the mention "the new acquisition by Microsoft". In this case, there are two candidates for mentions: the new acquisition by Microsoft and Microsoft. From this example it also follows that a word can be part of multiple coreference...

What you need is a simple unsupervised (or semi-supervised) clustering algorithm. word2vec with its pre-trained vectors may not be very helpful because institutions, etc. are unlikely to be in it. Also, it seems that the number of "aspects" a user has is small, so you can simply have a clustering...

You can use feature union http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces There is a nice example in the documentation http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#example-hetero-feature-union-py which I think exactly fits your requirements. See TextStats transformer. Regards,...

The specified path file:///Users/dylan/Desktop/POLARITY_DIR/ should contain the unpacked data (see tutorial) in the directory txt_sentoken. You can see this in the output: Data Directory=file:/Users/dylan/Desktop/POLARITY_DIR/txt_sentoken Also, the tutorial is not set up to use a URL, so the command should be java -cp sentimentDemo.jar:../../../lingpipe-4.0.jar PolarityBasic /Users/dylan/Desktop/POLARITY_DIR...

Well, I had the same problem, and what I did was split the text on '\n'. Something like this: # in my case, when it had '\n', I called it a new paragraph, # like a collection of sentences paragraphs = [p for p in text.split('\n') if p] #...

If your text is already split into sentences, just use tokens = nltk.word_tokenize(sentence) (see tokenization from NLTK). If you need to split by sentences first, take a look at this part (there is code sample)....

It seems to me that you would be better off separating the tokenization phase from your other downstream tasks (so I'm basically answering Question 2). You have two options: Tokenize using the Stanford tokenizer (example from Stanford CoreNLP usage page). The annotators option should only contain 'tokenize' in your case....

I think you can get what you want from the standard dcoref annotator. Look at the annotation set by this annotator, CorefChainAnnotation. This is a map from document entities to "coref chains." Each CorefChain can provide you with a list of mentions for the relevant entity in textual order....