Data By the Bay is the first Data Grid conference matrix with 6 vertical application areas spanned by multiple horizontal data pipelines, platforms, and algorithms. We are unifying data science and data engineering, showing what really works to run businesses at scale.

Supervised machine learning models are powerful tools for processing large volumes of text. Their applications include sentiment analysis, text classification, topic mining, part-of-speech tagging, and named entity recognition, among many others. However, supervised models rely heavily on large amounts of annotated data, and they require the annotations to be consistent and accurate. In practice, obtaining high-quality annotated data, especially with strong inter-annotator agreement, is not always possible for legal and privacy reasons: some data cannot be crowdsourced by the organizations that hold it. In this talk I propose several methods that help machine learning models get over the hurdle of insufficient labeled data by leveraging a number of computational linguistics techniques. Specifically, focusing on a conditional random field (CRF) model for named entity recognition, I discuss how language feature engineering, artificial dataset generation, and post-processing rules can significantly improve model performance, which otherwise suffers from the bottleneck of insufficient training data. These methods are scalable and practical, and machine learning practitioners can use them in situations where obtaining more training data via crowdsourcing is not a viable option.
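
As a rough illustration of two of the ideas named above, language feature engineering and post-processing rules, here is a minimal Python sketch. It is not the talk's implementation: the function names (`word_shape`, `token_features`, `sentence_features`, `repair_bio`), the specific features chosen, and the BIO tagging scheme are all assumptions made for the example. It uses only the standard library and produces one feature dictionary per token, a shape that CRF toolkits commonly accept as input.

```python
import re


def word_shape(token):
    """Collapse a token into a coarse shape, e.g. 'LinkedIn' becomes 'XxXx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol so long words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i):
    """Hand-built features for token i (hypothetical feature set)."""
    token = tokens[i]
    feats = {
        "lower": token.lower(),
        "shape": word_shape(token),
        "suffix3": token[-3:].lower(),
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
    }
    # Neighboring words give the model local context.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens):
    """One feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def repair_bio(tags):
    """Post-processing rule (assumed BIO scheme): an I-X tag that does not
    continue an entity of the same type X is rewritten as B-X."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed
```

Features like these let a model generalize from word shape and context when few labeled examples are available, and a rule such as `repair_bio` can clean up inconsistent predictions without any retraining.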

As a Staff Software Engineer at LinkedIn, I work on various natural language processing applications such as query understanding, sentiment analysis, and member/job data standardization. Before joining LinkedIn, I was a Staff Research Engineer at Samsung Research America, where...