JupyterCon 2017

I've just returned from the first JupyterCon, held August 22-25, 2017 in downtown Manhattan. It was a relatively low-key conference focused on Jupyter, starting with tutorials and workshops and ending with two days of talks across four tracks. Fernando Perez, the inventor of IPython and instigator of this whole Jupyter phenomenon, keynoted the conference with humility and an infectious, collaborative vision of a vibrant Jupyter-rich future. As with some of the best open-source-focused conferences, JupyterCon was far more about cooperation, shared learnings, and best practices than about advertising. Of course there were sponsors and booths, but it never felt intrusive.

Key Takeaways (TL;DR)

For those who don't want to read the full take below, here are my top-line observations:

Jupyter is continuing to grow in use, especially in the academic space where entire courses are being taught out of Jupyter notebooks

JupyterLab (or Jupyter v.next) looks very promising and is almost ready for users to start adopting (EOY17)

Careful early decisions about kernel separation and protocols have paid dividends, allowing JupyterLab to completely rewrite the Jupyter Notebook front-end while maintaining (almost) full compatibility, and enabling lightweight kernel frameworks like Xeus and some crazy kernel wrappers.

Tutorial - Polyglot Notebooks

As a speaker I didn't have access to the two-day training, so I have no first-hand feedback on its quality, but anecdotal evidence from others suggests it was useful. I only went to one tutorial, as I had to spend the rest of the day getting my talk cleaned up and all of my code running on my Data Science VM. Since my team works with such a wide variety of technologies, I often wind up using Python for one project and R for the next. I love Python, don't get me wrong, but I find myself missing dplyr and ggplot whenever I'm in Pandas/Matplotlib-land (yes, I know about Bokeh/Seaborn/Plotly/Dash), so when I saw the tutorial on mixing SQL, Python and R in one Notebook I knew that was where I needed to be.

The tutorial, given by Laurent Gautier (the creator of rpy2), had some initial setup hiccups getting people up and running (mostly due to people trying to get started without using the provided Docker container), but once those were resolved it was quite interesting. He started by setting up a SQLite instance and showing how to execute SQL queries using sqlite3 and then SQLAlchemy, pointing out that while the SQLAlchemy code looks more elegant and certainly more Pythonic than the raw SQL, by using the library you've replaced learning a language (SQL) with learning an API (SQLAlchemy). A valid point, especially given the ubiquity of SQL well beyond the RDBMS world.
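A minimal sketch of that contrast (the table and column names here are my own invention, not from the tutorial): the same query expressed through the raw DB-API and through SQLAlchemy Core.

```python
import sqlite3
from sqlalchemy import Column, Float, Integer, MetaData, Table, create_engine, select

# --- Raw DB-API: you write the SQL yourself ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE quakes (id INTEGER PRIMARY KEY, mag REAL)")
db.executemany("INSERT INTO quakes (mag) VALUES (?)", [(4.5,), (6.1,), (5.3,)])
raw_result = [r[0] for r in db.execute("SELECT mag FROM quakes WHERE mag > 5.0")]

# --- SQLAlchemy Core: the same query built through the API ---
engine = create_engine("sqlite://")
metadata = MetaData()
quakes = Table("quakes", metadata,
               Column("id", Integer, primary_key=True),
               Column("mag", Float))
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(quakes.insert(), [{"mag": 4.5}, {"mag": 6.1}, {"mag": 5.3}])
with engine.connect() as conn:
    api_result = [r[0] for r in
                  conn.execute(select(quakes.c.mag).where(quakes.c.mag > 5.0))]

print(sorted(raw_result) == sorted(api_result))
```

Both return the same rows; the difference is exactly the one Laurent called out - one asks you to know SQL, the other asks you to know SQLAlchemy.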

Laurent then walked through Pandas and R DataFrames, comparing filtering, aggregation and other operations between Pandas and dplyr. He showed examples of some reasonably advanced R plotting using lattice and ggplot against some interesting earthquake-related data. All of this R work was done from the Python kernel using the %%R cell magic, provided by running %load_ext rpy2.ipython.
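The kind of side-by-side he walked through looks roughly like this (the data and column names are my own, with the dplyr equivalent shown in a comment):

```python
import pandas as pd

quakes = pd.DataFrame({
    "region": ["Fiji", "Fiji", "Japan", "Japan"],
    "mag":    [4.5,    6.1,    5.3,     5.9],
})

# dplyr: quakes %>% filter(mag > 5) %>% group_by(region) %>%
#        summarise(max_mag = max(mag))
summary = (quakes[quakes["mag"] > 5]          # filter()
           .groupby("region", as_index=False)  # group_by()
           .agg(max_mag=("mag", "max")))       # summarise()
print(summary)
```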

Finally, with all of the preliminaries out of the way, we got to the fun part - Python and R together, sharing data and having fun like one big math-nerd play-date. He loaded rpy2.ipython as before, but then showed off importr and imported some native R packages. These could immediately be accessed just like they were Python:

from rpy2.robjects.packages import importr

stats = importr('stats')
tuple(stats.rnorm(5))

Laurent then showed off the even more important rpy2.robjects.lib.dplyr and ggplot2, allowing what looked like straight-up Python dplyr and ggplot code.

Day One Sessions

Keynotes

Fernando Perez and Andrew Odewahn opened the keynotes with, as mentioned previously, a humility and collaborative open spirit that was admirable and infectious. Fernando talked about the new Data Science course at UC Berkeley, taught entirely through Jupyter Notebooks in the (private) cloud, allowing them to avoid any student setup issues and opening the course to people from sixty different majors. Peter Wang talked about Anaconda's role and reinforced Fernando's worry about the future health of Jupyter and its community - recounting Continuum's experience as the OSS community around Python evolved and pointing out potential traps ahead as Jupyter users go from forgiving early adopters to a broader user base.

Rachel Thomas talked about her experience running fast.ai's Deep Learning course and how Jupyter helped enable the project, as well as the motivations behind it and some success stories from her students. Wes McKinney (of Pandas fame) talked about his efforts on Arrow (language-agnostic data frames), and Demba Ba discussed his JupyterHub-taught courses at Harvard.

Brian Granger, Chris Colbert, and Ian Rose presented the latest progress on JupyterLab. Honestly, I'd never seen JupyterLab before and was amazed by how promising it looks. They've managed to fully decouple model from view, allowing a single notebook to be opened in multiple views and those views to be docked in whichever configuration makes the most sense. I was also impressed by their live Markdown previewer with the ability to attach a Kernel - letting you Shift+Enter on e.g. a Python cell and execute it in a console window. Finally, they showed a collaborative editing scenario using Google Docs, which made me want to try to build a similar scenario using the Microsoft stack. It's looking like JupyterLab will be stable enough for users shortly, and ready for developers to start porting and writing new extensions by the end of the year.

Domain-Specific Notebook Extensions

I then watched talks on Graph/Network-based Notebooks and Geo-based Notebooks. Both extensions seemed fairly heavyweight and I'm not sure how they will work in the new JupyterLab world, but it was interesting to see these domain-specific adaptations. The Geo-notebook adaptation seemed more useful to me in practice as it included a GeoTile server; it seemed like the Graph extension could probably mostly be accomplished with the Gremlin extension and some creative D3 or Seaborn/GGPlot work.

Day Two Sessions

Keynotes

Fernando Perez and Andrew Odewahn opened the keynotes today as well, and led quickly into a talk by Jeremy Freeman from the Chan Zuckerberg Initiative. He discussed how Jupyter was pivotal in the work they were doing, and how tools can help enable science to happen faster (a pivotal part of their mission), and the challenges ahead in making that happen. Brett Cannon hammered the point home with a talk that, in some respects, boiled down to the OSS version of the Wil Wheaton Rule.

Paul Ivanov and Matthias Bussonnier gave a really interesting talk on the Jupyter infrastructure and protocols. Paul's section on the interaction between the notebook and the kernel was eye-opening, with a great sequence diagram that really laid out how they interact. I lost the thread a bit, however, when he showed the underlying wire protocol. Matthias talked about the protocol and how Jupyter renders objects to the notebook, showing how easy it is to enable custom rendering. It is shockingly easy! That made me excited about being able to, e.g., render DNN topologies in notebooks - we'll see how easy it is in practice.
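The hook behind that custom rendering is the family of `_repr_*_` methods: any object implementing e.g. `_repr_html_` gets rendered richly by the frontend, with the HTML shipped over the messaging protocol. A minimal sketch (the `ConfusionMatrix` class here is invented for illustration):

```python
class ConfusionMatrix:
    """Toy object that renders itself as an HTML table in a Jupyter frontend."""

    def __init__(self, labels, counts):
        self.labels = labels   # class names, e.g. ["cat", "dog"]
        self.counts = counts   # row-major list of lists of counts

    def _repr_html_(self):
        # Jupyter's display machinery calls this and sends the resulting
        # HTML to the frontend instead of the plain repr.
        header = "".join(f"<th>{label}</th>" for label in self.labels)
        rows = "".join(
            "<tr><th>{}</th>{}</tr>".format(
                label, "".join(f"<td>{c}</td>" for c in row))
            for label, row in zip(self.labels, self.counts))
        return f"<table><tr><th></th>{header}</tr>{rows}</table>"

cm = ConfusionMatrix(["cat", "dog"], [[8, 2], [1, 9]])
print(cm._repr_html_())
```

In a notebook, just evaluating `cm` in a cell would show the table; outside Jupyter, the method is an ordinary function you can call yourself.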

Patty Ryan, Lee Stott and I gave a talk outlining some of the varied work we've done within the CSE team involving Jupyter Notebooks. Lee has done a ton of work with academia, enabling large-scale Jupyter deployments and helping people adopt Azure Notebooks - showing off its capability for displaying a "Profile" and letting you publish a set of libraries that outline your entire corpus of work.

Patty discussed her work with Ernst & Young enabling easy discovery of tax policies through a custom Azure Search deployment. She also went into some detail on an IoT scenario with the Professional Ski Instructors of America for improving skiers' performance by benchmarking them against professional skiers. I discussed a scenario using object detection with Fast R-CNN for auditing inventory against vendor policies, enabling massive cost savings over the current painful manual process. Jupyter Notebooks are pivotal in these cases for quick iteration, for documenting our assumptions and settings inline with the code, and for sharing the results with both the partners and the world after our hackfest or short engagement is complete. Of course, all of our work is open source and free (as in beer).

Yoshi Masatani gave a talk on the solution they've built at the National Institute of Informatics that, honestly, was pretty mind-blowing. He was obviously struggling to speak English to such a large group and was incredibly soft-spoken (some mic issues didn't help), but the content of his talk was great. My two favorite take-aways were his "Enhanced Collapsible Headings", which let you collapse notebook sections and run every cell within them with a single click of a "play" button - locking (or freezing) successful cells and marking failing cells red - and his log summary/truncation functionality, which executed verbose commands, logged all output to a file, and rendered only the head/tail and errors to the notebook. He had a lot more in his talk, though, so I'd recommend watching it when it comes online.
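The truncation idea is simple enough to sketch: log the full output of a verbose command to a file, and surface only the head, tail, and any error lines in the notebook. The helper below is my own rough approximation, not his implementation:

```python
import tempfile
from pathlib import Path

def summarize_output(lines, logfile, head=3, tail=3):
    """Write all lines to logfile; return only the head/tail plus error lines."""
    Path(logfile).write_text("".join(line + "\n" for line in lines))
    errors = [l for l in lines if "error" in l.lower()]
    if len(lines) <= head + tail:
        shown = list(lines)
    else:
        elided = len(lines) - head - tail
        shown = lines[:head] + [f"... ({elided} lines in {logfile}) ..."] + lines[-tail:]
    # Make sure errors are visible even if they fell in the elided middle.
    return shown + [l for l in errors if l not in shown]

log = tempfile.mktemp(suffix=".log")
output = [f"step {i} ok" for i in range(10)] + ["ERROR: disk full"]
for line in summarize_output(output, log):
    print(line)
```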

Chris Wilcox from the Azure Notebooks team gave a great talk about how Azure Notebooks evolved from prototype to scalable system. I liked that he focused on the journey and their learnings in the process of going from a product team focused on Python and R Tools for Visual Studio to an online service. His insight on the increased impact of flaky tests as you went from a product ship schedule to a service ship schedule was something I hadn't considered before (I come from a services background so flaky tests were always anathema).

The final talk of the conference for me was perhaps the most mind-blowing and interesting, even if it's not directly relevant to my day-to-day work. Sylvain Corlay and Johan Mabille presented Xeus, their simple C++ Jupyter Kernel base implementation, allowing people to easily implement their own custom kernels. This in and of itself is an achievement and should allow custom kernels to flourish, but it wasn't the most mind-blowing part.

They then gave a great demo of their first custom kernel implementation - a Jupyter Kernel wrapped around Cling. Cling is a C++ interpreter built by the folks at CERN - obviously driven mad contemplating Quantum Chromodynamics. Wrapping Cling in Xeus, they demo'd interpreted C++ inside the Jupyter Notebook, showing off polymorphic inheritance and template specialization - quite impressive. They followed this up with a demo of their xtensor library, showing very NumPy-like code written in C++ and running inside Jupyter. It was awesome - in that it was both amazing and terrifying.

They also gave me the quote of the conference: "C++ was never meant to be interpreted". So very true.

End Notes

I came into this conference a fan of Jupyter, but unaware of either its current power or its future direction. I left incredibly excited for the coming year, and looking forward to the next JupyterCon and the progress it will bring. Thanks, Fernando et al., for such a great tool and a conference to go along with it!
