This course provides a unique opportunity for you to learn the key components of text mining and analytics, aided by real-world datasets and a text mining toolkit written in Java. Hands-on experience with core text mining techniques, including text preprocessing, sentiment analysis, and topic modeling, helps learners become competent data scientists.
By combining lecture notes with lab sessions based on the y-TextMiner toolkit developed for the class, learners will be able to develop interesting text mining applications.

Taught by

Min Song

Professor

Transcript

NER, which stands for named entity recognition, originally stems from information extraction. Another name for NER is NEE, which stands for named entity extraction. An information extraction algorithm finds and understands limited relevant parts of text. It gathers information from many different pieces of text. And finally, it produces a structured representation of the relevant information, such as relations or a knowledge base. The goal of information extraction is to organize information so that it is useful to people. In other words, it puts information in a semantically precise form that allows further inferences to be made by algorithms.
NER is a subclass of information extraction. NER is the task of identifying names of people, places, organizations, and so on in text. As you can see from the definition, NER is narrower than information extraction in terms of scope. The goal of NER is to find and classify names in text. NER is a very important subtask of information extraction.
Let's take the example of a used car advertisement, as you see on the slide. The ad reads: "For sale: 2002 Toyota Prius, 20,000 miles, 15k or best offer." If NER is correctly applied to this advertisement, it will extract five types of entities: model, brand, year, price, and mileage.
There can be any number of entity types. Popular ones are person names, such as John Smith and John Connor. The second is organizations, such as IBM and Google. The third is locations, such as New York or Korea. Another is date and time expressions, such as February 2010. The question is how to decide which types to use. It depends solely on which models the NER system provides. If you're looking for a new set of entity types, you have to prepare your own annotated training dataset and train the NER algorithm. In this course, I provide a package that is built upon Stanford NER. Stanford NER provides three different models.
The first model consists of three types: location, person, and organization. The second model consists of four types: location, person, organization, and miscellaneous. The third consists of seven types: location, person, organization, money, percent, date, and time.
NER is not an easy task. Why is it difficult? There are several reasons. Simply put, there are too many names to be included in dictionaries, if the NER technique or algorithm requires a dictionary or ontology. Language is constantly changing, so the same word may mean something different now than it did in the past. Words can appear in several different forms. Words can appear in abbreviated form after the full form is introduced. For example, one sentence says "National Science Foundation (NSF) is an organization ..." and the next sentence says "NSF can offer ...". In this case, National Science Foundation can be detected as an organization, but NSF may not be. Another difficult issue is that a word can have multiple meanings.
NER can serve as a submodule of other text mining techniques. For instance, for summarization, NER can help search engine users by automatically sifting through and summarizing web pages. For question answering, many fact-based answers to questions are in fact entities that can be detected by NER. Therefore, by incorporating NER into the question answering system, the task of finding some of the answers is simplified considerably. For ontology construction, the two core elements of an ontology are classes and relations. Classes can be generated automatically by NER.
There are quite a few open-source NER tools out there, and I introduce some of them here. The first one is Stanford NER. As I mentioned earlier, it is part of the y-TextMiner package, and it is based on a conditional random field classifier. Another one is GATE. GATE was originally developed in the context of information extraction, and GATE is distributed with an IE system called ANNIE.
ANNIE relies on finite-state algorithms and the JAPE language. The third one is MinorThird. MinorThird is a machine learning-based approach, which was developed at Carnegie Mellon University. The last example is OpenCalais. OpenCalais is a web-based NER tool developed by Thomson Reuters. It automatically extracts entities from web pages in a format that can be used on the semantic web.
By and large, there are two major approaches to NER. The first approach is knowledge-based. Knowledge-based NER is very precise since it is based on handcrafted rules, but the downside is that accuracy is very low when no rule matches the given text. It requires only a small amount of training data, because the knowledge is encoded in a set of rules. Since it requires handcrafted rules, it is labor-intensive and expensive. The rules have to be developed for each domain. For instance, rules for the biomedical domain are different from rules for the finance domain. Since language changes over time, the rules need to be adjusted accordingly.
Compared to knowledge-based NER, learning-based NER achieves higher recall and lower precision. The reason is that learning-based NER uses probabilistic models, which gives it a better chance of finding more of the correct answers. This approach does not require grammars or rules. In addition, since no linguistic rules are needed, there is no need to involve linguistic experts. Compared to creating rules, making annotations is cheaper. The learning-based approach requires a large amount of high-quality training data: the more training data, the better the result.
As I mentioned in the previous slides, the knowledge-based approach requires creating regular expressions to extract entities such as telephone numbers, email addresses, and person names from unstructured text. As far as the learning-based approach is concerned, there are two ways of doing it. The first is supervised learning and the second is unsupervised learning. Supervised learning is the dominant approach.
It requires a training dataset. Numerous methods have been proposed, including hidden Markov models, k-nearest neighbors, decision trees, AdaBoost, support vector machines, and so on. Unsupervised learning does not require a training dataset; entities are discovered automatically without training.
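To make the knowledge-based approach from the lecture more concrete, here is a minimal sketch in Python that applies handcrafted regular expressions to the used car advertisement. This is not the course toolkit or Stanford NER. The rule set, the tiny brand list (standing in for a dictionary), and the function name `extract_entities` are all assumptions made up for illustration.

```python
import re

# Hypothetical handcrafted rules: one regular expression per entity type.
# The short brand list plays the role of a dictionary and is an assumption.
RULES = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "brand": re.compile(r"\b(?:Toyota|Honda|Ford)\b"),
    "model": re.compile(r"\b(?:Toyota|Honda|Ford)\s+([A-Z][a-z]+)\b"),
    "mileage": re.compile(r"\b(\d{1,3}(?:,\d{3})*)\s+miles\b"),
    "price": re.compile(r"\b\d+\s?[kK]\b"),
}


def extract_entities(text):
    """Return {entity_type: matched string} for every rule that fires."""
    entities = {}
    for entity_type, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            # Use the captured group when the rule defines one,
            # otherwise use the whole match.
            entities[entity_type] = (
                match.group(1) if pattern.groups else match.group(0)
            )
    return entities


ad = "For sale: 2002 Toyota Prius, 20,000 miles, 15k or best offer."
print(extract_entities(ad))

# No rule matches this text, so nothing is extracted.
print(extract_entities("Selling my old Hyundai, low mileage, cheap."))
```

On the first advertisement the rules find all five entity types (year 2002, brand Toyota, model Prius, mileage 20,000, price 15k). On the second text no rule fires and the result is empty, which illustrates the downside mentioned in the lecture: rule-based NER is precise where its rules apply, but it finds nothing when no rule matches.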