SpaCy with Matthew Honnibal - Episode 87

December 11, 2016

Summary

As the amount of text available on the internet and in businesses continues to increase, the need for fast and accurate language analysis becomes more prominent. This week Matthew Honnibal, the creator of SpaCy, talks about his experiences researching natural language processing and creating a library to make his findings accessible to industry.

Brief Introduction

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.

I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable.

When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for running your awesome app.

You’ll want to make sure that your users don’t have to put up with bugs, so you should use Rollbar for tracking and aggregating your application errors to find and fix the bugs in your application before your users notice they exist. Use the link rollbar.com/podcastinit to get 90 days and 300,000 errors for free on their bootstrap plan.

Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.

To help other people find the show, you can leave a review on iTunes or Google Play Music and tell your friends and co-workers.

Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.

Interview with Matthew Honnibal

Introductions

How did you get introduced to Python?

Can you start by sharing what SpaCy is and what problem you were trying to solve when you created it?

Another project for natural language processing that has been part of the Python ecosystem for a number of years is the Natural Language Toolkit (NLTK). How does SpaCy differ from NLTK, and are there any cases where NLTK would be the better choice?

How much knowledge of NLP and computational linguistics is necessary to be able to use SpaCy?

What does the internal design and architecture of SpaCy look like and what are the biggest challenges associated with its development to date and into the future?

One of the projects that you have built around SpaCy, and the one that caught my attention when I first found your work, is the displaCy visualization tool. Can you explain what it is and why you think it is important?

What are some applications where SpaCy would be useful but that might not be obvious candidates for it?

Why is speed such an important focus for an NLP library?

One of the ways that you have been able to gain a speed boost is by using Cython to release the global interpreter lock (GIL), which allows for true parallelism. How have you managed to ensure that this doesn’t lead to data races and program failures?

Building on the success of SpaCy, you founded a company called Explosion AI. Can you explain what your goals are for this endeavor and the kinds of services that you are offering?

What are some of the most interesting uses of SpaCy that you have seen?