Materials or downloads needed in advance

Once Docker is running on your machine, run the following command to download the Docker image for this tutorial:

docker pull melcutz/nlu-demo

Because of all the dependencies (Spark, spaCy, NLTK, UMLS, etc.), the image is very large, so it may take a while to download. Once the command finishes successfully and the image is on your machine (run 'docker images' to verify), use the following command to start the Docker container:

docker run -it --rm -p 8888:8888 melcutz/nlu-demo
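Commands copied from a web page can pick up typographic dashes in place of plain hyphens, which Docker rejects with an "invalid reference format" error. The following is a minimal sketch of one way to clean a pasted command before running it. The helper name normalize_dashes is hypothetical and not part of the tutorial materials, and mapping an em dash to a double hyphen is an assumption about how the page reformats text.

```python
# Hypothetical helper for illustration; not part of the tutorial image.
DASH_FIXES = {
    "\u2013": "-",   # en dash, seen in place of a single hyphen
    "\u2014": "--",  # em dash, assumed to stand for a double hyphen
}


def normalize_dashes(command: str) -> str:
    """Replace typographic dashes with ASCII hyphens."""
    for fancy, plain in DASH_FIXES.items():
        command = command.replace(fancy, plain)
    return command


# A pasted command whose single hyphens were turned into en dashes.
pasted = "docker run \u2013it --rm \u2013p 8888:8888 melcutz/nlu-demo"
fixed = normalize_dashes(pasted)
print(fixed)
```

Typing the command by hand avoids the problem entirely; the sketch is only a convenience for checking pasted text.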

If the container launches successfully, the output should include something like:

Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://localhost:8888/?token=a8309a652c58fe0172483ef845461af030349e04cb0ac88e

Follow that instruction: copy and paste the provided URL into your browser of choice (we tested on Chrome), and you should be able to navigate to an instance of Jupyter Notebook running inside the Docker container.
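The token in the URL is generated each time the container starts, so the link has to be taken from your own output. As a rough sketch, assuming you have captured the container's startup text in a string, the login URL and its token could be pulled out as follows. The function name find_login_url and the regular expression are illustrative assumptions, and the sample text is the example line shown above.

```python
import re
from urllib.parse import parse_qs, urlparse

# Sample line taken from the example output; your token will differ.
startup_output = (
    "Copy/paste this URL into your browser when you connect for the first "
    "time, to login with a token: "
    "http://localhost:8888/?token=a8309a652c58fe0172483ef845461af030349e04cb0ac88e"
)


def find_login_url(text):
    """Return the first tokenized login URL found in text, or None."""
    match = re.search(r"https?://\S+\?token=[0-9a-f]+", text)
    return match.group(0) if match else None


url = find_login_url(startup_output)
# The token is the value of the "token" query parameter.
token = parse_qs(urlparse(url).query)["token"][0]
print(url)
print(token)
```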

What you'll learn

Gain hands-on experience with common NLP tasks and pipelines using spaCy and Spark NLP

Description

Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks.

David Talby and Claudiu Branzan lead a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for training distributed custom natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. You’ll spend about half your time coding as you work through three sections, each with an end-to-end working codebase that you are then asked to change and improve.

Outline

Using spaCy to build an NLP annotations pipeline that can understand text structure, grammar, and sentiment and perform entity recognition

Built-in spaCy annotators

Debugging and visualizing results

Creating custom pipelines

Practical trade-offs for large-scale projects, as well as for balancing performance versus accuracy

Using TensorFlow to build domain-specific machine-learned annotators and then integrating them into an existing NLP pipeline

Feature engineering and optimization

Measurement

Practical considerations when working on problems that require understanding text beyond keyword matching and one-hot encoding

Using Spark ML and TensorFlow to apply deep learning to expand and update ontologies

Comparison of word2vec and doc2vec

When each is useful

How to apply them to increase the accuracy of classification or information retrieval problems

Current trade-offs in integrating spaCy and Spark when engineering distributed, large-scale NLP pipelines
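To give a feel for the outline's point about going beyond keyword matching and one-hot encoding, here is a small sketch in plain Python. The three-word vocabulary and the dense vectors are hand-picked toy values chosen for illustration; they are not output from word2vec, doc2vec, or any trained model, and the tutorial's own notebooks may approach this differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


vocab = ["doctor", "physician", "banana"]

# One-hot encoding: each word gets its own axis, so distinct words
# never share any component.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Toy dense vectors (assumed values, not learned embeddings).
dense = {
    "doctor": [0.9, 0.8, 0.1],
    "physician": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

# One-hot vectors cannot express that two different words are related.
print(cosine(one_hot["doctor"], one_hot["physician"]))  # 0.0

# Dense vectors can place related words closer together than unrelated ones.
related = cosine(dense["doctor"], dense["physician"])
unrelated = cosine(dense["doctor"], dense["banana"])
print(related > unrelated)  # True
```

With one-hot vectors, every pair of distinct words looks equally unrelated, which is why embedding-based representations can help classification and retrieval tasks that depend on meaning rather than exact word overlap.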

David Talby

Pacific AI

David Talby is chief technology officer at Pacific AI, helping fast-growing companies apply big data and data science techniques to solve real-world problems in healthcare, life science, and related fields. David has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing Group, where he led business operations for Bing Shopping in the US and Europe, and worked at Amazon both in Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Claudiu Branzan

Accenture AI

Claudiu Branzan is an analytics senior manager in Accenture’s Applied Intelligence Group, based in Seattle, where he leverages his more than 10 years of expertise in data science, machine learning, and AI to promote the use and benefits of these technologies to build smarter solutions to complex problems. Previously, Claudiu held highly technical client-facing leadership roles in companies utilizing big data and advanced analytics to offer solutions for clients in healthcare, high-tech, telecom, and payments verticals.

Comments

Claudiu Branzan
| ANALYTICS SENIOR MANAGER

22/05/2018 1:07 BST

Thanks for the info, Sertan! We specifically put -- as we had that issue before :)
Also, please note that the token is dynamically generated, so it will differ from the one in the example above. Please use the link generated after executing the ‘docker run’ command. We will be in the room a few minutes early to help anyone set up their environment. Even if you don’t get to install anything, you can still follow along and it should be fun ;)

Sertan Şentürk
| DATA SCIENTIST

22/05/2018 1:03 BST

P.S.: The dashes in the command I posted before are also wrong due to the website’s automatic formatting. You should simply enter the command by hand :)

Sertan Şentürk
| DATA SCIENTIST

22/05/2018 1:00 BST

If you copy and paste the docker run command, you might get a “docker: invalid reference format.” error. This is because some of the dash characters are actually en dashes. Below I paste the command with the correct characters: