Sunday, March 17, 2019

Ludwig, from Uber, is an AutoML-style toolbox for deep learning without code: it lets users train deep learning models and run inference without programming (caveat: you still have to use the command line). It was previously an internal tool at Uber and is now open sourced to gather contributions. It's a Python library.

Wednesday, March 13, 2019

A matrix is a rank 2 tensor. It has two axes: the first indexes the rows (each row is an array), the second indexes the individual numbers within a row.
Check the dimensions of a tensor using .size() or .shape (in PyTorch, size() is a method and shape is an attribute).
Obtain the rank of the tensor by checking the length of its shape:
len(tensor.shape) # returns 2 for a matrix
The number of elements in the tensor is the product of the component values in the shape: torch.tensor(my_tensor.shape).prod()
my_tensor.numel() # number of elements
The number of elements is important in reshaping: the new shape must account for exactly the same number of elements.
Reshaping does not change the underlying data, it only changes the shape.
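The rank and element-count bookkeeping above is plain arithmetic, so it can be checked without PyTorch installed. This is a minimal sketch: `shape` is a made-up matrix shape and `can_reshape` is a hypothetical helper, not a PyTorch function.

```python
import math

# Hypothetical shape of a matrix with 3 rows and 4 columns.
shape = (3, 4)

rank = len(shape)                # same idea as len(tensor.shape); 2 for a matrix
num_elements = math.prod(shape)  # same idea as my_tensor.numel()

def can_reshape(old_shape, new_shape):
    """A reshape is only valid if the number of elements stays the same."""
    return math.prod(old_shape) == math.prod(new_shape)

print(rank, num_elements)          # 2 12
print(can_reshape(shape, (2, 6)))  # True: 2 * 6 == 12
print(can_reshape(shape, (5, 2)))  # False: 5 * 2 != 12
```

This is why the element count matters for reshaping: only shapes whose product matches are allowed.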

Comment: maybe we can use an advanced RNN for earthquake prediction, since the data has a time series element.

Install important libraries. Installations & Dependencies

!pip install kaggle

!pip install numpy==1.15.0

!pip install catboost

import pandas as pd

import numpy as np

from catboost import CatBoostRegressor, Pool

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import GridSearchCV

from sklearn.svm import NuSVR, SVR

# kernel ridge regression, a kernel method to compare against the SVM regressors

from sklearn.kernel_ridge import KernelRidge

"Kernel methods are a way of improving Support Vector Machine Predictions. Make sure we can create a classifier line or regression line in a feature space we can visualize. You know? A lower dimension feature space"

I have noticed that some of my classmates, fellow Facebook PyTorch scholars, went fast and far beyond what the nanodegree requires. Here are some of the impressive things they did.

Train a model from scratch rather than using a pretrained model. "Try using more convolution layers, increasing the depth, decreasing learning rate, and keep the final fc layer simple, I used only one fc layer after the convolution layers" "It depends on how many epochs to train. I just did 35 epochs. Similar to VGG."

How long did it take you all to train your CNN from scratch? GPU use comes up here: some moved the training to Google Colab.

Traditional machine learning algorithms are mostly not designed for sequential data. Do B after A, then C. The kind of step wise output can not be comfortably generated by traditional machine learning algorithms.

Problem introduction: in the Kaggle Boston Housing competition, we try to predict the median house value from features such as the number of rooms. It makes sense that a house is more expensive if it has more rooms. However, there is always variance: noise in the data causes the results to fluctuate around the true trend.

A Bloom filter is a data structure for testing whether an element is a member of a set. Here it is used to look for overlap in data: checking whether there is any crossover between the train and test data.

It is useful in NLP with n-grams. 8-grams are an arbitrary choice; 20-grams are typical because sentences are 20ish words long; 7-grams match the human memory span of around seven words, and average spoken language may be close to 7-grams. We can try several sizes to see the amount of overlap. Look at all sets of n-grams and compare pairwise: how many n-grams already exist in the set? An empty Bloom filter is a bit array of m bits, all set to 0 (Wikipedia). There are k hash functions, each of which maps an element to one of the m bit positions. k is much smaller than m.
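To make the m bits and k hash functions concrete, here is a minimal sketch in plain Python. The class, the salted SHA-256 trick used to get k hash functions, and the two example sentences are all assumptions made for illustration, not a production design.

```python
import hashlib

class BloomFilter:
    """Bit array of m bits plus k hash functions (hypothetical minimal version)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # empty filter: all m bits are 0

    def _positions(self, item):
        # Derive k hash functions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos] for pos in self._positions(item))


def ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


train_text = "the quick brown fox jumps over the lazy dog"
test_text = "a quick brown fox jumps over a fence"

seen = BloomFilter(m=1024, k=3)
for gram in ngrams(train_text, 3):
    seen.add(gram)

test_grams = ngrams(test_text, 3)
overlap = [g for g in test_grams if g in seen]
print(f"{len(overlap)} of {len(test_grams)} test 3-grams probably appear in train")
```

A Bloom filter never gives a false negative, but it can give a false positive, so the overlap count is an upper bound on the true overlap.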

Friday, March 8, 2019

Google Cloud Next 2019
Interesting Data Engineer Sessions at Google Cloud:
Moving from Cassandra to Auto-Scaling Bigtable at Spotify
Data Management: The New Best Practice for Incident Response
Google Cloud Platform from 1 to 100 Million Users
Google Cloud: Data Protection and Regulatory Compliance
Organizing Your Resources for Cost Management on GCP
TensorFlow 2.0 on Google Cloud Platform
Chatbots Will Empower Students and Teachers
Deploy Your Next Application to Google Kubernetes Engine
Fast and Lean Data Science With TPUs
Creating Interactive Cost and KPI Dashboards Using BigQuery
From Blobs to Tables: Where and How to Store Your Stuff
Data Processing in Google Cloud: Hadoop, Spark, and Dataflow
Enabling Healthcare in the Cloud: Mitigating Risks and Addressing Security and Compliance Requirements with GCP
An Insider's Look: Google's Data Centers
Data Integration at Google Cloud
G Suite Data Controls and Transparency
How AI Computer Vision and IoT Is Transforming Businesses
Smart Pallets for a Smart Warehouse: Building Advanced Computer Vision Systems Using Google Cloud IoT
30 Ways Google Sheets Can Help Your Company Uncover and Share Data Insights
Data for Good: Driving Social & Environmental Impact with Big Data Solutions
Data Warehousing With BigQuery: Best Practices
Extracting Value With a Cloud Clinical Data Warehouse
Future of Google Sites
Best Practices for Storage Classes, Reliability, Performance, and Scalability
Best Practices in Building a Cloud-Based SaaS Application
Building a Global Data Presence
End-to-end Training of a Model and Prediction Generation Using BigQuery ML
How to Secure and Protect Your Data in Cloud Storage
Rethinking Business: Data Analytics With Google Cloud
The Future of Health. Powered by Google
Accelerating Machine Learning App Development with Kubeflow Pipelines
Backup, Disaster Recovery and Archival in the Cloud
Bringing the Cloud to You AMA (Ask-Me-Anything)
Building and Securing a Data Lake on Google Cloud Platform
Deploy and Manage Virtual Workstations on GCP
Google Cloud DevOps: Speed With Reliability and Security
Understanding Google Cloud IoT: Connectivity Options and Examples
Unlocking the Power of Google BigQuery
Data Analytics
Building AI-Powered Customer Service Virtual Agents for Healthcare
Case Study: Using GCP to Measure Package Sizes in 3D Images
Cloud Native Application Development, Delivery and Persistent Storage
How to Run Millions of Self-Driving Car Simulations on GCP
Cruise Automation is a leading developer of autonomous vehicle technology. In this session, we will dive into the infrastructure which allows us to run hundreds of thousands of autonomous simulations every day and analyze the results quickly and efficiently. Cruise runs the vast majority of our testing on Google Cloud, taking advantage of high scalability of compute and GPU resources for our diverse workloads. Our simulation frameworks allow us to replay data gathered from road testing or generate complex variations
Integrating Smart Devices With the Google Assistant and Google Cloud IoT
Kaggle: Where 2 Million+ Data Scientists Learn, Compete, and Collaborate on AI Projects
Kaggle's the world's largest community of data scientists and AI engineers. You'll learn how 2 million+ users leverage Kaggle to learn AI, sharpen their skills on public competitions, incorporate 10,000's of public datasets into their projects, and analyze data in hosted Jupyter notebooks.
Speaker: Ben Hamner, CTO, Kaggle
Migrating Data Analytics Solutions to Google Cloud
Take Care of Data Privacy in a Serverless World with Firebase
Machine Learning with TensorFlow and PyTorch on Apache Hadoop using Cloud Dataproc
Medical Imaging 2.0

Medical imaging is one of the largest sources of healthcare data. Join us in this talk to learn how cloud technologies and artificial intelligence enable new applications in the medical imaging domain, improving patient care and reducing physician burnout.
Python 3 and Me: Upgrading Your Python 2 Application
The Path From Cloud AutoML to Custom Model
Transforming Healthcare With Machine Learning
With the wealth of medical imaging and text data available, there’s a big opportunity for machine learning to optimize healthcare workflows. In this talk, we’ll provide an overview of the Cloud ML products that can help with healthcare scenarios, including AutoML Vision, Natural Language, and BQML. Then we’ll hear from IDEXX, a veterinary diagnostics company using AutoML Vision to classify radiology images.
How to Grow a Spreadsheet into an Application
Integrate Firebase into Your Existing Infrastructure
Customer Case for Anomaly Detection in MMORPG
Genomic Analyses on Google Cloud Platform
Using Google Cloud Platform and other open source tools such as GATK Best Practices and DeepVariant, learn how to perform end-to-end analysis of genomic data. Starting with raw files from a sequencer, progress through variant calling, importing to BigQuery, variant annotation, quality control, BigQuery analysis and visualization with phenotypic data. All the datasets will be publicly available and all the work done will be provided for participants to explore on their own.

Saving Even More Money on Compute Engine

Notable Clients of Google Cloud:
Journey to the Cloud Confidently With Citrix and Google Cloud
Square's Move to Cloud Spanner
Forbes' Road to the Cloud
Why Small and Medium Businesses are Going Google
Clorox Data Cleanup Using Advanced Cloud Dataprep Techniques
How Gordon Food Service Reimagined Collaboration Using G Suite
How Airbnb Secured Access to Their Cloud With Context-Aware Access

How Twitter Is Migrating 300 PB of Hadoop Data to GCP

Twitter has been migrating their complex Hadoop workload to Google Cloud. In this session, we deep dive into how Twitter's components use Cloud Storage Connector and describe our initial usage, features we implemented, and how Google helped us build those features in open source. We describe how Cloud Storage fits into our ecosystems and the experience and features which have helped us. We'll also talk about unique challenges we discovered in data management at scale.

Optimizing File Storage for Your Use Case

Music Recommendations at Scale with Cloud Bigtable
Spotify serves personalized music recommendations to hundreds of millions of happy customers worldwide, and powers a lot of this infrastructure with Google Cloud Bigtable. In this talk, we'll go into detail about how Cloud Bigtable allows us to deliver recommendations at scale, roll out experiments quickly, and ingest terabytes every day via Cloud Dataflow. We'll discuss a number of challenges we overcame when designing our recommendations infrastructure on top of Cloud Bigtable, including tips about how to design a good schema, how to avoid latency when ingesting new data, and effective caching strategies to scale to tens of millions of data points per second.
Real-Time, Serverless Predictions With Google Cloud Healthcare API
Target's Application Platform (TAP)

Google Cloud for Its Business Partners, Use Case Showcase
Automate Cancer MCA using Cloud Vision API and GCMLE
Learn how Pluto7 built a model to extract the text from clinical protocols using Cloud Vision API and automatically predicted whether clinical treatments, based on their criteria, were covered by the researcher of the clinical trial or by the patient's insurance. We used cancer clinical trial protocols from the customer to train word embeddings, and we constructed a dataset of short free text labeled R or S (Researcher or Sponsor).
GitLab's Journey from Azure to GCP and How We Made it Happen
How Booking.com Uses BigQuery ML to Assess Data Quality and Other Features
How News Corp Transformed into a Data-Driven Organisation
Future of Work With Cisco and Google
How Schlumberger is Building Enterprise Solutions for the Future with Google
Kaiser Permanente's Journey Towards an API-First IT Strategy
Everyone Flies Faster When BigQuery Fuels the BI Engines at AirAsia
How Pandora is Migrating Its On-Premises BI & Analytics to GCP

Composing Pandora's Move to GCP With Composer
A Glimpse Into CBS Interactive’s AI/ML Group
State of the Art: SAP on Google Cloud
What Did the Doctor Say? Mining Clinical Notes With GCP
Speaker: Marianne Slight, Product Manager, Google Cloud Healthcare & Life Sciences, Google Cloud

How Equifax Accelerates Time-to-Market with Microservices and APIs
How Macy's Executes DevOps at Scale on GCP

How HSBC Leverages GCP For Regulatory Reporting
Using Google's Data and AI Technologies with Kaggle

HSBC Invents New Technology as They Migrate to BigQuery
Learn How Cardinal Health Migrated Thousands of VMs to GCP
Using AI to Transform Your Fleet Operations

With note bloat now at 80%, it has become harder than ever to trace medical decision-making in the electronic medical record. But the physician's clinical notes provide that context along with nuggets of gold that aren't easily documented in the structured EMR. Join this session to discover how to mine clinical concepts from the physician notes, map them to standard vocabularies, augment the EHR data with them, and use them in your CDW analysis or FHIR applications.

A GAN, when given a training dataset, can generate new images or outputs that have never been seen before.

StackGAN can take a description of an image, such as a bird, and generate a photo of that bird. iGAN converts sketches to images. Pix2Pix does image translation: a blueprint for a building turns into the building, and #edges2cats turns doodles of cats into realistic cats. GANs can be trained in unsupervised ways. CartoonGAN is trained on faces and cartoons but does not need to be trained on face and cartoon pairs. It knows how to convert without being explicitly told. GANs can also turn photos of day scenes into photos of night scenes. CycleGAN, from Berkeley, is especially good at unsupervised image-to-image translation. The best example is a video of a horse turned into a video of a zebra. The surroundings even changed from grassland to savannah. GANs can also generate simulated training sets: Apple, for example, turned unrealistic synthetic eyes into realistic eyes to train models that learn where the user is looking. In imitation learning, a cousin of reinforcement learning, the model learns from data to imitate the action that would be taken by experts. GANs can also generate adversarial examples: images that look normal to humans but can fool neural networks.

Other generative models
Fully visible belief networks: the output is generated one element at a time, for example, one pixel at a time. Also known as autoregressive models, known since the 90s.
The breakthrough is to generate in one shot: GANs generate an entire image in parallel, using a differentiable function in the form of a neural network.

"Generator Network takes random noise as input, runs that noise through a differentiable function to transform the noise, reshape it so it has recognizable structure." - Ian Goodfellow

The output of a generator network is a realistic image. The choice of the noise input determines which image will come out of the network. "The goal is to have these (output) images to be a fair sample of real image data" - Ian Goodfellow

The generator network has to be trained. The training process is very different from that of a supervised model. The generator network is not supervised. "We just show it a lot of images. And ask it to make more images that come from the same probability distribution."

The second network: the discriminator, a normal neural network classifier, guides the generator network. The discriminator is shown real images half of the time, and fake images the other half of the time. It classifies whether the image is real or not.

The generator network's goal is to make compelling images that the discriminator will assign 100% probability that the image is real.
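The tug of war between the two networks is usually written as a minimax objective. The formula below is the standard form from the original GAN paper, not something stated in these notes; the symbols are assumptions introduced here: G is the generator, D is the discriminator's estimated probability that its input is real, x is a real image, and z is the random noise input.

```latex
\min_G \max_D \;
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push D(x) toward 1 and D(G(z)) toward 0, while the generator tries to push D(G(z)) toward 1, which matches the goal described above.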

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (source: Wikipedia). It's a measure of how well the decision frontier separates the classes at each threshold.

The benchmark is a random guess: the 45-degree line, with an area under the curve of 0.5. In the best scenario, a perfect split, the area under the curve is 1.

There are two "extreme" points, if we classify everything as positive, the TPR, FPR = (1,1). If we classify nothing as positive, the TPR, FPR = (0,0) because the true positive rate is true positive / all positive = zero / all positive.
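The two extreme points can be checked with a few lines of plain Python. This is a small sketch: the labels and scores are made-up values, and `roc_point` is a hypothetical helper, not a library function.

```python
def roc_point(labels, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are classified positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

for threshold in [0.0, 0.35, 0.75, 1.1]:
    tpr, fpr = roc_point(labels, scores, threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

A threshold of 0.0 classifies everything as positive and lands on (1, 1); a threshold of 1.1 classifies nothing as positive and lands on (0, 0). The thresholds in between trace out the rest of the curve.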

TensorFlow Lite
TensorFlow for mobile, portable, and embedded devices. Works especially well on Android.

"An embedded device is a highly specialized device meant for one or very few specific purposes and is usually embedded or included within another object or as part of a larger system."

TensorFlow Use Case

Machine Learning Terminologies
Features: number of bathrooms, zip code, and so on.
Training vs inference:
Data split: we don't want to use up all the data for training. We need to withhold some data so that we can test how the model performs on data it has never seen. 15 years ago a 70%/30% data split was the gold standard. K-fold cross validation: divide the training data into 10 chunks, use 9 for training and 1 for validation, then repeat with a different chunk held out each time, so that every chunk serves as the validation set once.
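The rotation of the held-out chunk can be sketched in plain Python. This is a minimal illustration: `k_fold_indices` is a hypothetical helper written for this note, and the 20 rows and 10 folds are arbitrary choices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once, up front
    folds = [indices[i::k] for i in range(k)]  # k roughly equal chunks
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in k_fold_indices(n_samples=20, k=10):
    print(len(train), "train rows, validation rows:", sorted(validation))
```

Every row lands in the validation set exactly once, and the model is never validated on a row it was trained on within the same fold.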

Tutorial Presentation Getting Started on Kaggle

Vani Mandava Director Data Science Microsoft Research @vanimt

Kaggle Competition Overview and Tutorial

Oil palm trees produce palm oil, and plantations in Africa and South and Central America often expand through deforestation. High-resolution satellite images and computer vision help us track this environmental issue. The dataset was created by the West Big Data Hub and the WiDS Datathon Committee. The task is to develop a model that detects whether an oil palm plantation is present. This competition has ended.

Follow us for beginner friendly tutorials. We will also publish a better quality version of this article on medium.

Notes from the Metis webinar of the same title, with my own commentary and opinions. Metis is a data science training bootcamp with technical portfolio projects. Metis also offers bootcamp prep courses, for example Python prep.

At his previous job at Capital One, the presenter started the machine learning powerhouse team that grew from 2 to 80 people, so he has lots of experience hiring machine learning engineers and data scientists.

Machine learning engineer is a more senior job title, a sweet marriage of data engineer and data scientist: someone who knows Java or C++ and can explain algorithms such as KNN and k-means.

8:35 In finance, the job descriptions, roles, and skill definitions are well defined, even mandated. Requirements of experiences are set and even strict. Startup role definition tend to be loose, even chaotic.

Set internal expectations. Set expectations for the people who conduct interviews. For hiring, it's better to set expectations right in order to attract the right candidates. This results in a smoother experience for candidates, who also feel that the experience is tailored. The Google onsite panel is assembled based on the candidate's strengths and interests.

12:00 A data analyst may know SQL but may not know a programming language.
A data analyst or a data engineer working on ETL infrastructure may also need to know data visualization.

Building a funnel and attract talents
13:40
Generally, with funnel, one starts with a large number of people and quickly decreases to a small number of people - monotonically decreasing.

The goal is to build a funnel, attract people to the funnel, and optimizing it. Comment: It is an important startup growth, product management technique.

Application Funnel
Funnel does not really differ by company size.
Entry point: sourcer, referral, cold applications. Sourcer will actively reach out to candidates, but it is a quick process. They won't spend more than 1 minute at your profile.

If you have a person to follow up with, you are already further down the funnel.

Cold application can result in a pool of thousands, or tens of thousands applications.

Referrals are much deeper in the funnel. Even "half way there already". At a meetup. Someone they know may be hiring even if they are not directly hiring.

Non-technical phone screen: pulse check, culture check. Is this person generally agreeable? A broad skill check with a recruiter. How are their communication skills? Check if this person knows the company language.

17:00 Technical phone screen: perhaps everyone's least favorite part. Culture check, a little CS problem, data science algorithm problems. It is a "coarse filter" for whether the candidate has enough skills to justify on-site interviews. 4-6 hours of engineering time is valuable, so it's best not to spend it on candidates who are not ready. Recommended book: Cracking the Coding Interview.

On-site: 4-8 hours, a proper day, including technical and non-technical interviews. Sometimes there is even a post-on-site step: discuss candidate feedback, make an offer, set expectations. Comment: I heard that Microsoft sometimes does this on the day of the interview.

Offer includes salary, compensation, starting date.

First day: be nice if there's a small celebration.

Sometimes there's a take home. For the presenter : "If someone asks me to do 8 hours of work for free. I will just say no. "

"Just make sure you have a good pipeline. Constantly get people through. With well defined steps." - Presenter on building a great application funnel.

Presenter's preferred method of finding best candidate: organic means of finding candidates. Conferences of relevant topics (shows that they are committed and passionate if they are spending time on a Thursday night for hours learning about a topic), meetups, speaker reception, exchange business cards, rolodex? Existing contacts, networks.

LSTM overcomes the vanishing gradient problem of the plain RNN. Backpropagation through time can make the gradient too small; the LSTM cell avoids this loss of information.

LSTM allows learning across many time steps, even 1000 steps.
The cell is fully differentiable: all its functions (sigmoid, hyperbolic tangent, multiplication, addition) have a derivative, and hence a gradient that can be computed. This makes it easy to use backpropagation and SGD to update the weights.

The sigmoid gates are the key to managing what goes into the cell, what is retained within the cell, and what passes to the output.
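The three gates can be seen in a scalar toy version of one LSTM step. This is a sketch under stated assumptions: real cells use weight matrices and vectors, and the weights below are hypothetical numbers chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to let in
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate values to write

    c = f * c_prev + i * g           # additive update of the cell state
    h = o * math.tanh(c)             # hidden state passed to the next step
    return h, c

# Hypothetical weights, chosen only for illustration.
weights = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.1f}  h={h:+.3f}  c={c:+.3f}")
```

Each gate is a sigmoid between 0 and 1 that scales another quantity, and the cell state is updated by addition, which is what helps gradients survive across many steps.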

If the hidden state passed to the RNN is set to None, the initial hidden state values will just be zeros.

At first the blue line is just flat: the network hasn’t learned anything yet. As it learns, it starts to track the red line well. Eventually it gets close. But suddenly, in this Udacity lecture, the graph looks like it flipped upside down?! It is the same graph, flipped for better visualization so that the two lines track each other nicely on the new axis. The lecturer didn’t point this out, so it looked surprising.

Natural Language Processing (NLP) is not supposed to be easy! But let’s try to simplify for beginners. Follow us for more beginner friendly articles like this.

Natural Language Processing, or NLP, is a subset of the field of Artificial Intelligence. It is a field that analyzes our human language and takes text as input. The entire text dataset, the input data, is called the corpus. For example, we can calculate how many times a word appears in the corpus. This count is called term frequency.

“Hi there! It’s good to see you. I just wanted to say hi.” # The sentence is the corpus. The term frequency of ‘hi’ is 2, because it appears twice in the corpus, if our analysis is case insensitive (‘Hi’ equals ‘hi’). If it is case sensitive, then the term frequency of ‘Hi’ is one, and the term frequency of ‘hi’ is also one.
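The two counts can be reproduced with the Python standard library. This is a minimal sketch; the regular expression used to split words is one simple choice among many.

```python
import re
from collections import Counter

corpus = "Hi there! It's good to see you. I just wanted to say hi."

# Split into word tokens; this simple pattern keeps apostrophes inside words.
tokens = re.findall(r"[A-Za-z']+", corpus)

case_sensitive_tf = Counter(tokens)
case_insensitive_tf = Counter(token.lower() for token in tokens)

print(case_sensitive_tf["Hi"], case_sensitive_tf["hi"])  # 1 1
print(case_insensitive_tf["hi"])                          # 2
```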

We will elaborate on term frequency later.

Practical tip: Sometimes it is important to be case sensitive. For example, Trump may refer to Donald Trump, while trump is a verb often used in card games to describe one card outranking another. When case doesn’t matter, a common preprocessing and data cleaning technique is to change all text of the corpus to lower case: lower_case_corpus = corpus.lower(). The function .lower() is a Python string method. For example “Hello there!” will become “hello there!”.

Codecademy.com explains bag-of-words model: “A bag-of-words model totals the frequencies of each word in a document, with each unique word being its own feature and its frequency being the value.”

If you haven’t studied Machine Learning, the word feature may make no sense. There are tricks that may help you understand. We can imagine the output of a bag-of-words model as a Python dictionary (a hashmap of key-value pairs) or as an Excel sheet. The features are the keys in the dictionary or the column headers in the Excel sheet. Features are meaningful representations of the data. Machine Learning learns from features and predicts outcomes called labels.

For example useful features of Person data — information that describes people — may include: height, gender, name, government issued ID number etc.

Pro Tip: what is the feature dimension? What is the size or the number of features? It equals the size of the vocabulary found in the corpus.

corpus = ["You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"]
# if use corpus = "..."
# receive error
# ValueError: Iterable over raw text documents expected, string object received.

Pro tip: what does CountVectorizer do per the Sklearn documentation? It will “Convert a collection of text documents to a matrix of token counts” and return a sparse matrix (scipy.sparse.csr_matrix). Just an FYI. Don’t think too hard about it now.

The feature names are returned by count_vect.get_feature_names() (renamed get_feature_names_out() in newer scikit-learn releases) and bow.toarray() gives us the frequency of the corresponding features. For example, the first word ‘about’ appears twice in the corpus so its frequency is 2. The last word ‘you’ also appears twice.
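To see where those numbers come from without installing anything, the same counts can be rebuilt with the standard library. This is a sketch that imitates CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, features in alphabetical order); it is not scikit-learn itself, and `tokenize`, `vocabulary` and `bow_rows` are names made up for this example.

```python
import re
from collections import Counter

corpus = ["You are reading a tutorial by Uniqtech. We are talking about "
          "Natural Language Processing aka NLP. Would you like to learn more? "
          "Learn more about Machine Learning today!"]

def tokenize(document):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", document.lower())

counts_per_doc = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*counts_per_doc))  # the features, alphabetically
bow_rows = [[counts[word] for word in vocabulary] for counts in counts_per_doc]

print(vocabulary[0], bow_rows[0][0])    # about 2
print(vocabulary[-1], bow_rows[0][-1])  # you 2
print(len(vocabulary))                  # 22
```

The vocabulary has 22 entries, so the feature dimension for this corpus is 22. The single-letter word "a" is dropped by the token pattern.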

How is it useful? This common model is surprisingly powerful. There is some criticism of the author of 50 Shades of Grey on the internet: the claim is that she is not a sophisticated author because her books only use a limited English vocabulary. Apparently people have found that she uses some simple non-descriptive words too often, such as love and gasp.

How did people know the author uses gasp a lot? Word count, word frequency of course!

If we read through this Word Frequency Analysis of the 50 Shades of Grey Trilogy, indeed we have to scroll down quite far to see a complex word that is also frequently used such as murmur.

Some argue however precisely because the author uses easy-to-read colloquial style the series has gained wide readership and popularity.

Surprisingly, this simple model is quite insightful and already generates a good discussion.

More on bag of words

Stop word removal: not all words in the corpus are considered important enough to be features. Some, such as a, the, and and, are called stop words, and they are sometimes removed from the feature dataset to improve machine learning model results. The word the appeared nearly 5000 times in the book but it does not mean anything in particular, thus it’s okay to remove it from our dataset.

In the bag of words model, grammar does not matter so much, nor does word order.

Pro Tip: the bag-of-words model instance is often stored in a variable called bow, which can be confusing because you may be thinking of bow and arrow, but it is the acronym for bag of words!

Sample natural language processing workflow and NLP pipeline:

Data cleaning pipeline for text data

cleaning (regular expressions)

sentence splitting

change to lower case

stopword removal (most frequent words in a language)

stemming — demo porter stemmer

POS tagging (part of speech) — demo

noun chunking

NER (named entity recognition) — demo opencalais

deep parsing — try to “understand” text.

Important Natural Language Processing Concepts

Stop Words Removal

Stop words are words that may not carry valuable information.

In some cases stop words matter. For example researchers found that stop words are useful in identifying negative reviews or recommendations. People use sentences such as “This is not what I want.” “This may not be a good match.” People may use stop words more in negative reviews. Researchers found this out by keeping the stop words and achieving better prediction results.

While it is common practice to remove stop words and return only clean text, removing stop words does not always give better prediction results. For example, not is considered a stop word in some NLP libraries, but not is a very significant word in negative reviews or recommendations in sentiment analysis. For example, if a customer states “I would not buy this product again, and would not accept any refund. Really not a good match at all.”, the word “not” is a strong signal that this review is negative. A positive review may sound, well, positive! “I really like the product! I enjoyed it very much. Not what I expected at all.” In this case, the negative review uses the word “not” 3x as often as the positive one (three times versus once).
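The effect is easy to see on the two example reviews. This is a small sketch in plain Python: the stop list is hypothetical and deliberately tiny, and it includes "not" only to show what gets lost.

```python
import re

negative = ("I would not buy this product again, and would not accept "
            "any refund. Really not a good match at all.")
positive = ("I really like the product! I enjoyed it very much. "
            "Not what I expected at all.")

# Hypothetical, deliberately tiny stop list that includes "not".
stop_words = {"i", "a", "the", "it", "at", "and", "this", "not", "would", "any", "what"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def count_not(text):
    return tokenize(text).count("not")

def remove_stop_words(text):
    return [t for t in tokenize(text) if t not in stop_words]

print(count_not(negative), count_not(positive))  # 3 1
print(remove_stop_words(negative))
```

After removal, the negative review is left with words like "good" and "match" and no trace of the negation, which is how a model can be misled.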

Removing punctuation may also yield better results in some situations.

NLP Techniques — Removing punctuations with Regex

Punctuation is not always useful in predicting the meaning of texts. Often it is removed along with stop words. What does removing punctuation mean? It means keeping only the alphanumeric characters (and the whitespace between words). Regex programming lessons can fill books! Just use the nifty function below for short texts. For longer texts that require more processing power, use generators to iterate through the text line by line and keep only alphanumeric characters. For big data, use parallel processing to process multiple lines of text at once.

This process of removing punctuation is sometimes called pruning.

Regex removes punctuation

# import the regular expression module
import re

corpus = "You are reading a tutorial by Uniqtech. We are talking about Natural Language Processing aka NLP. Would you like to learn more? Learn more about Machine Learning today!"

# [^a-zA-Z0-9\s] matches any character that is NOT a letter, a digit, or whitespace
# the raw string (r"...") keeps the backslash in \s from being treated as a string escape
corpus = re.sub(r"[^a-zA-Z0-9\s]+", "", corpus)
corpus
# returns 'You are reading a tutorial by Uniqtech We are talking about Natural Language Processing aka NLP Would you like to learn more Learn more about Machine Learning today'

Go ahead, just use the above method and avoid reinventing the wheel.
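For longer texts, the same substitution can be wrapped in a generator so that only one line is held in memory at a time. This is a minimal sketch; `clean_lines` and `NON_ALNUM` are names made up for this example.

```python
import re

NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]+")

def clean_lines(lines):
    """Lazily yield each line with punctuation removed."""
    for line in lines:
        yield NON_ALNUM.sub("", line)

# Any iterable of strings works, including an open file handle.
sample = ["Would you like to learn more?", "Learn more about Machine Learning today!"]
for cleaned in clean_lines(sample):
    print(cleaned)
```

Because the function yields instead of building a list, it can be pointed at a very large file without loading the whole thing.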

Pro Tip: Python also has a built-in alphanumeric checker, the string method .isalnum(). There is another, .isalpha(), which only returns true for letters; a string containing a digit will not evaluate to true.
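A quick look at how the two string methods behave, plus a regex-free way to drop punctuation with them. The sample strings are made up for illustration.

```python
print("abc123".isalnum())  # True: letters and digits only
print("abc123".isalpha())  # False: contains digits
print("abc".isalpha())     # True
print("abc!".isalnum())    # False: "!" is punctuation

# Character-by-character filter that keeps alphanumerics and spaces.
text = "Learn more, today!"
kept = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
print(kept)                # Learn more today
```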

There are always hackers coming up with fancy regex code! It keeps getting fancier.

#tokenize any word that has length > 1,
#effectively removing all punctuations
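The code those two comments describe is not shown in this post. One way to get that behavior with the standard library is sketched below; the pattern is an assumption, not necessarily the one originally used.

```python
import re

text = "I do not like Banana."

# \w\w+ matches runs of two or more word characters, so punctuation
# and one-letter words such as "I" are left out.
tokens = re.findall(r"\b\w\w+\b", text)
print(tokens)  # ['do', 'not', 'like', 'Banana']
```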

Tokenization

Tokenization: breaking texts into tokens. Example: breaking sentences into words, or into groups of words depending on the scenario. There’s also the n-gram model and the skip-gram model.

Basic tokenization is 1-gram. An n-gram (multi-gram) is useful when a phrase yields a better result than one word. For example, for “I do not like Banana.” the 1-grams are I, do, not, like, banana. It may yield a better result with a 3-gram model: I do not, do not like, not like banana. (If the end of the sentence is padded with spaces, like banana _space_ and banana _space_ _space_ are produced as well.)

ngram: n is the number of words we want in each token. Frequently, n = 1.
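Generating n-grams from a list of tokens takes only a sliding window. This is a minimal sketch without padding; `ngrams` is a hypothetical helper written for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I do not like banana".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 3))
# [('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'banana')]
```

With n = 3 the phrase "do not like" stays together as one token, which is exactly the information a 1-gram model throws away.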

Did you know that Google digitized many books and generated and analyzed literature based on the n gram model? Nice work Google!

Lemmatization

Lemmatization: transforming words into their roots. Example: economics, micro-economics, macro-economists, economists, economist, economy, economical, economic forum can all be reduced to a shared root such as econ, which can mean this text or article is largely about economics, finance or economic issues. Useful in situations such as topic labeling. Common tools in NLTK: WordNetLemmatizer and PorterStemmer. Strictly speaking the Porter stemmer is a stemmer, which chops off suffixes by rule, while a lemmatizer maps a word to its dictionary form.

Sentence Tagging

Sentence tagging is like the part of speech exercises your grammar teacher made you do in high school. For example, in “The cat sat.”, The is tagged as a determiner, cat as a noun, and sat as a verb.


NLP Use Cases

Sentiment analysis of tweets, amazon reviews. Classifying whether a short text is positive or negative.

Writing style analysis: authors’ favorite vocabulary choices, singers’ lyric styles. For example, style analysis identified JK Rowling as the author of a book even though she used a male pen name, after passionate readers analyzed the text and found parallels and similarities in style.

Entity tagging: find organizations or people’s names in articles

Text summarization: summarize main points of news articles

Getting Started with NLP Now!

You can use the Python nltk library to analyze texts. It’s a popular and powerful library. It includes lists of stop words in several languages.

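from nltk.corpus import stopwords  # run nltk.download('stopwords') once beforehand
stop_words = set(stopwords.words('english'))  # English stop word list as a set
tokens = "this is not what i want".split()  # example tokens, already lower case
clean_tokens = [token for token in tokens if token not in stop_words]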

Sklearn conveniently has built-in text datasets for you to experiment with, such as the 20 newsgroups dataset! These news articles can be classified into different topics. Sklearn provides cleaned training data for this classification task.