A planet of blogs from our members...

November 13, 2018

DjangoCon 2018 attracted attendees from around the world, including myself and several other Cakti (check out our DjangoCon recap post). Having attended a number of DjangoCons in the past, I looked forward to reconnecting with old colleagues and friends within the community, learning new things about our favorite framework, and exploring San Diego.

While it was a privilege to attend DjangoCon in person, you can experience it remotely. Thanks to technology and the motivated organizers, you can view a lot of the talks online. For that, I am thankful to the DjangoCon organizers, sponsors, and staff that put in the time and energy to ensure that these talks are readily available for viewing on YouTube.

Learn How to Give Back to the Django Framework

While I listened to a lot of fascinating talks, one stood out as the most impactful for me. I also think it is relevant and important for the whole Django community. If you have not seen it, I encourage you to watch and rewatch Carlton Gibson’s “Your web framework needs you!”. Carlton was named a Django Fellow in January of 2018 and provides a unique perspective on the state of Django as an open source software project, from the day-to-day management, to the (lack of) diversity amongst the primary contributors, to the ways that people can contribute at the code and documentation levels.

This talk resonated with me because I have worked with open source software my entire career. It has enabled me to bootstrap and build elegant solutions with minimal resources. Django and its ilk have afforded me opportunities to travel the globe and engage with amazing people. However, in over 15 years of experience, my contributions back to the software and communities that have served me well have been nominal in comparison to the benefits I have received. But I came away from the talk highly motivated to contribute more, and am eager to get that ball rolling.

Carlton says in his talk, “we have an opportunity to build the future of Django here.” He’s right: our web framework needs us. His talk shows how to get involved in the process, as well as what improvements are being made to simplify onboarding. I agree with Carlton, and believe it’s imperative to widen the net of contributors by creating multiple avenues for contributions that are easily accessible and well supported. Contributions are key to ensuring a sound future for the Django framework. Whether it’s improving documentation, increasing test coverage, fixing bugs, building new features, or some other aspect that piques your interest, be sure to do your part for your framework. The time that I am able to put toward contributing to open source software has always supplied an exponential return, so give it a try yourself!

October 25, 2018

That’s it, folks — another DjangoCon in the books! Caktus was thrilled to sponsor and attend this fantastic gathering of Djangonauts for the ninth year running. This year’s conference ran from October 14 - 19, in sunny San Diego. ☀️

What a Crowd!

At Caktus we love coding with Django, but what makes Django particularly special is the remarkable community behind it. From the inclusive code of conduct to the friendly smiles in the hallways, DjangoCon is a welcoming event and a great opportunity to meet and learn from amazing people. With over 300 Django experts and enthusiasts attending from all over the world, we loved catching up with old friends and making new ones.

#Djangocon 2018 in San Diego has been the most inclusive, breath of fresh air conference I've ever attended, with the most beautiful and diverse group of people. Way to go Team Djangocon, @FlipperPA !!! <3

What a Lineup!

DjangoCon is three full days of impressive and inspiring sessions from a diverse lineup of presenters. Between the five Cakti there, we managed to attend almost every one of the presentations.

We particularly enjoyed Anna Makarudze’s keynote address about her journey with coding, Russell Keith-Magee’s hilarious talk about tackling time zone complexity, and Tom Dyson’s interactive presentation about Django and Machine Learning. (Videos of the talks should be posted soon by DjangoCon.)

What a Game!

Thanks to the 30+ Djangonauts who joined us for the Caktus Mini Golf Outing on Tuesday, October 16! Seven teams putted their way through the challenging course at Belmont Park, talking Django and showing off their mini golf skills. We had fun meeting new friends and playing a round during the beautiful San Diego evening.

Mini golf was a blast! An ending worthy of the Masters: a 2-way tie until the very end when, drum roll, @GrahamDumpleton, last player on the course (pictured with caddies) won by a single stroke! Congrats! And thanks to everyone who came out. #DjangoCon pic.twitter.com/S79MoAr9zt

October 18, 2018

If you want to build a list page that allows
filtering and pagination, you have to get a few separate things
to work together. Django provides some tools for pagination,
but the documentation doesn't tell us how to make that work with anything else.
Similarly, django_filter makes it relatively easy to add filters to
a view, but doesn't tell you how to add pagination (or other things)
without breaking the filtering.

The heart of the problem is that both features use query parameters,
and we need to find a way to let each feature control its own query
parameters without breaking the other one.

Filters

Let's start with a review of filtering, with an example of how you might
subclass ListView to add filtering. To make it filter the way you want,
you need to create a subclass of
FilterSet
and set filterset_class to that class. (See the django_filter documentation
for how to write a filterset.)

from django.views.generic import ListView


class FilteredListView(ListView):
    filterset_class = None

    def get_queryset(self):
        # Get the queryset however you usually would. For example:
        queryset = super().get_queryset()
        # Then use the query parameters and the queryset to
        # instantiate a filterset and save it as an attribute
        # on the view instance for later.
        self.filterset = self.filterset_class(self.request.GET, queryset=queryset)
        # Return the filtered queryset
        return self.filterset.qs.distinct()

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # Pass the filterset to the template - it provides the form.
        context['filterset'] = self.filterset
        return context

Here's an example of how you might create a concrete view to use it:

class BookListView(FilteredListView):
    filterset_class = BookFilterset

And here's part of the template that uses a form created by the filterset
to let the user control the filtering.

filterset.form is a form that controls the filtering, so
we just render that however we want and add a way to submit it.

That's all you need to make a simple filtered view.

Default values for filters

I'm going to digress slightly here, and show a way to give filters default
values, so when a user loads a page initially, for example, the items will
be sorted with the most recent first. I couldn't find anything about this in the
django_filter documentation, and it took me a while to figure out a good
solution.

To do this, I override __init__ on my filter set and add default values
to the data being passed:
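A minimal sketch of that idea follows. The helper name with_defaults, the field name "ordering", and the default value "-added" are all hypothetical, and the filterset override appears only as a comment so the block runs without Django installed. It assumes the incoming data offers copy() and setdefault(), which holds for a plain dict and for request.GET.

```python
def with_defaults(data, defaults):
    """Return a copy of ``data`` with missing keys filled in from ``defaults``.

    Works with anything offering ``copy()`` and ``setdefault()``: a plain
    dict here, or request.GET (a QueryDict) in a real view. The original
    object is never modified, which matters because request.GET is immutable.
    """
    data = data.copy() if data is not None else {}
    for key, value in defaults.items():
        data.setdefault(key, value)
    return data


# Hypothetical use inside a filterset (names are illustrative only):
#
#     class BookFilterset(django_filters.FilterSet):
#         def __init__(self, data=None, *args, **kwargs):
#             data = with_defaults(data, {"ordering": "-added"})
#             super().__init__(data, *args, **kwargs)

first_visit = with_defaults({}, {"ordering": "-added"})
user_choice = with_defaults({"ordering": "title"}, {"ordering": "-added"})
print(first_visit)  # {'ordering': '-added'}
print(user_choice)  # {'ordering': 'title'}
```

Because the defaults are only applied to keys that are absent, anything the user explicitly chose in the form still wins.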

Pagination

To tell the view which page to display, we want to add a query parameter
named page whose value is a page number. In the simplest case, we can
just make a link with ?page=N, e.g.:

<a href="?page=2">Go to page 2</a>

You can use the page_obj and paginator objects to build a full set
of pagination links, but there's a problem we should solve first.

Combining filtering and pagination

Unfortunately, linking to pages as described above breaks filtering. More specifically,
whenever you follow one of those links, the view will forget whatever filtering
the user has applied, because that filtering is also controlled by query
parameters, and these links don't include the filter's parameters.

So if you're on a page
https://example.com/objectlist/?type=paperback
and then follow a page link, you'll end up at
https://example.com/objectlist/?page=3
when you wanted to be at
https://example.com/objectlist/?type=paperback&page=3.

It would be nice if Django helped out with a way to build links that set
one query parameter without losing the existing ones, but I found a
nice example of a template tag
on StackOverflow
and modified it slightly into this custom template tag that helps
with that:

# <app>/templatetags/my_tags.py
from django import template

register = template.Library()


@register.simple_tag(takes_context=True)
def param_replace(context, **kwargs):
    """
    Return encoded URL parameters that are the same as the current
    request's parameters, only with the specified GET parameters added or changed.

    It also removes any empty parameters to keep things neat,
    so you can remove a param by setting it to ``""``.

    For example, if you're on the page ``/things/?with_frosting=true&page=5``,
    then

    <a href="/things/?{% param_replace page=3 %}">Page 3</a>

    would expand to

    <a href="/things/?with_frosting=true&page=3">Page 3</a>

    Based on
    https://stackoverflow.com/questions/22734695/next-and-before-links-for-a-django-paginated-query/22735278#22735278
    """
    d = context['request'].GET.copy()
    for k, v in kwargs.items():
        d[k] = v
    for k in [k for k, v in d.items() if not v]:
        del d[k]
    return d.urlencode()

Here's how you can use that template tag to build pagination links
that preserve other query parameters used for things like filtering:
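To see what query strings such links end up carrying, here is a small framework-free sketch of the same merging logic using only the standard library. The function names replace_params and page_links are hypothetical and are not part of Django or of the template tag; unlike a QueryDict, this simplified version keeps only the last value of a repeated parameter.

```python
from urllib.parse import parse_qsl, urlencode


def replace_params(query_string, **changes):
    """Return ``query_string`` with ``changes`` applied.

    Existing parameters are kept, changed ones are overwritten, and any
    parameter whose value ends up empty is dropped.
    """
    # dict() keeps only the last value for a repeated key (a simplification).
    params = dict(parse_qsl(query_string, keep_blank_values=True))
    params.update({key: str(value) for key, value in changes.items()})
    return urlencode({k: v for k, v in params.items() if v})


def page_links(query_string, num_pages):
    """Build one relative link per page, preserving the other parameters."""
    return ["?" + replace_params(query_string, page=n) for n in range(1, num_pages + 1)]


print(page_links("type=paperback&page=1", 3))
# ['?type=paperback&page=1', '?type=paperback&page=2', '?type=paperback&page=3']
```

Each link changes only the page parameter, so the type=paperback filter survives when the user moves between pages.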

October 11, 2018

Pictured from left: Caktus team members Vinod Kurup, Karen Tracey, and David Ray.

The Caktus team includes expert developers, sharp project managers, and eagle-eyed QA analysts. However, you may not know that there’s more to them than meets the eye. Here’s a peek at how Cakti spend their off-hours.

Vinod Kurup, M.D.

By day Vinod is a mild-mannered developer, but at night he swaps his keyboard for a stethoscope and heads to the hospital. Vinod’s first career was in medicine, and prior to Caktus he worked many years as an MD. While he’s now turned his expertise to programming, he still works part-time as a hospitalist. Now that’s what I call a side hustle.

Karen Tracey, Cat Rescuer

When Karen isn’t busy as both lead developer and technical manager for Caktus, she works extensively with Alley Cats and Angels, a local cat rescue organization dedicated to improving the lives and reducing the population of homeless cats in the Triangle area. She regularly fosters cats and kittens, which is why you sometimes find feline friends hanging out in the Caktus office.

David Ray, Extreme Athlete

Software development and extreme physical endurance training don’t generally go together, but let me introduce you to developer/sales engineer David. When not building solutions for Caktus clients, David straps on a 50-pound pack and completes 24-hour rucking events. Needless to say, he’s one tough Caktus. (Would you believe he’s also a trained opera singer?)

Pictured: David Ray at a recent rucking event.

These are just a few of our illustrious colleagues! Our team also boasts folk musicians, theater artists, sailboat captains, Appalachian cloggers, martial artists, and more.

September 28, 2018

If you’re building or updating a website, you’re probably wondering about which content management system (CMS) to use. A CMS helps users — particularly non-technical users — to add pages and blog posts, embed videos and images, and incorporate other content into their site.

CMS options

You could go with something quick and do-it-yourself, like WordPress (read more about WordPress) or a drag-and-drop builder like Squarespace. If you need greater functionality, like user account management or asset tracking, or if you’re concerned about security and extensibility, you’ll need a more robust CMS. That means using a framework to build a complex website that can manage large volumes of data and content.

Wait, what’s a framework?

Put simply, a framework is a library of reusable code that is easily edited by a web developer to produce custom products more quickly than coding everything from scratch.

Django and Drupal are both frameworks with dedicated functionality for content management, but there is a key difference between them:

Drupal combines aspects of a web application framework with aspects of a CMS

Django separates the framework and the CMS

The separation that Django provides makes it easier for content managers to use the CMS because they don’t have to tinker with the technical aspects of the framework. A popular combination is Django and Wagtail, which is our favorite CMS.

I think I’ve heard of Drupal ...

Drupal is open source and built with the PHP programming language. For some applications, its customizable templates and quick integrations make it a solid choice. It’s commonly used in higher education settings, among others.

However, Drupal’s predefined templates and plugins can also be its weakness: while they are useful for building a basic site, they are limiting if you want to scale the application. You’ll quickly run into challenges attempting to extend the basic functionality, including adding custom integrations and nonstandard data models.

Other criticisms include:

Poor backwards compatibility, particularly for versions earlier than Drupal 7. In this case, updating a Drupal site requires developers to rewrite code for elements of the templates and modules to make them compatible with the newest version. Staying up-to-date is important for security reasons, which can become problematic if the updates are put off too long.

Unit testing is difficult due to Drupal’s method of storing configurations in a database, making it difficult to test the effects of changes to sections of the code. Failing to do proper testing may allow errors to make it to the final version of the website.

Another database-related challenge lies in how the site configuration is managed. If you’re trying to implement changes on a large website consisting of thousands of individual content items or users, none of the things that usually make this easier — like the ability to view line-by-line site configuration changes during code review — are possible.

What does the above mean for non-technical stakeholders? Development processes are slowed down significantly because developers have to pass massive database files back and forth with low visibility into the changes made by other team members. It also means there is an increased likelihood that errors will reach the public version of your website, creating even more work to fix them.

Caktus prefers Django

Django is used by complex, high-profile websites, including Instagram, Pinterest, and Eventbrite. It’s written in Python, a powerful, open-source programming language, and was created specifically to speed up the process of web development. It’s fast, secure, scalable, and intended for use with database-driven web applications.

A huge benefit of Django is more control over customization, plus data can easily be converted. Since it’s built on Python, Django uses a paradigm called object-oriented programming, which makes it easier to manage and manipulate data, troubleshoot errors, and re-use code. It’s also easier for developers to see where changes have been made in the code, simplifying the process of updating the application after it goes live.

How to choose the right tool

Consider the following factors when choosing between Drupal and Django:

Need for customization

Internal capacity

Planning for future updates

Need for customization:
If your organization has specific, niche features or functionality that require custom development — for example, data types specific to a library, university, or scientific application — Django is the way to go. It requires more up-front development than template-driven Drupal but allows greater flexibility and customization. Drupal is a good choice if you’re happy to use templates to build your website and don’t need customization.

Internal capacity:
Drupal’s steep learning curve means that it may take some time for content managers to get up to speed. In comparison, we’ve run training workshops that get content management teams up and running on Django-based Wagtail in only a day or two. Wagtail’s intuitive user interface makes it easier to manage regular content updates, and the level of customization afforded by Django means the user interface can be developed in a way that feels intuitive to users.

Planning for future updates:
Future growth and development should be taken into account when planning a web project. The choices made during the initial project phase will impact the time, expense, and difficulty of future development. As mentioned, Drupal has backwards compatibility challenges, and therefore a web project envisioned as fast-paced and open to frequent updates will benefit from a custom Django solution.

Need a second opinion?

Don’t just take our word for it. Here’s what Brad Busenius at the University of Chicago says about their Django solution:

"[It impacts] the speed and ease at which we can create highly custom interfaces, page types, etc. Instead of trying to bend a general system like Drupal to fit our specific needs, we're able to easily build exactly what we want without any additional overhead. Also, since we're often understaffed, the fact that it's a developer-friendly system helps us a lot. Wagtail has been a very positive experience so far."

The bottom line

Deciding between Django and Drupal comes down to your specific needs and goals, and it’s worth considering the options. That said, based on our 10+ years of experience developing custom websites and web applications, we almost always recommend Django with Wagtail because it’s:

Easier to update and maintain

More straightforward for content managers to learn and use

More efficient with large data sets and complex queries

Less likely to let errors slip through the cracks

If you want to consider Django and whether it will suit your next project, we’d be happy to talk it through and share some advice. Get in touch with us.

September 20, 2018

We’re looking forward to taking part in the international gathering of Django enthusiasts at DjangoCon 2018, in San Diego, CA. We’ll be there from October 14 - 19, and we’re proud to attend as sponsors for the ninth year! As such, we’re hosting a mini golf event for attendees (details below).

This year’s speakers are impressive, thanks in part to Erin Mullaney, one of Caktus’ talented developers, who volunteered with DjangoCon’s Program Team. The three-person team, including Developer Jessica Deaton of Wilmington, NC, and Tim Allen, IT Director at The Wharton School, reviewed 257 speaker submissions. They ultimately chose the speakers with the help of a rating system that included community input.

“It was a lot of fun reading the submissions,” said Erin, who will also attend DjangoCon. “I’m really looking forward to seeing the talks this year, especially because I now have a better understanding of how much work goes into the selection process.”

Erin and the program team also created the talk schedule. The roster of speakers includes more women and members of underrepresented communities due to the DjangoCon diversity initiatives, which Erin is proud to support.

What we’re excited about

Erin said she’s excited about a new State of Django panel that will take place on Wednesday, October 17, which will cap off the conference portion of DjangoCon, before the sprints begin. It should be an informative wrap-up session.

Our Account Manager Tim Scales is particularly excited about Tom Dyson’s talk, “Here Come The Robots,” which will explore how people are leveraging Django for machine learning solutions. This is an emerging area of interest for our clients, and one of particular interest to Caktus as we grow our areas of expertise.

Follow us on Twitter @CaktusGroup and #DjangoCon to stay tuned on the talks.

Golf anyone?

If you’re attending DjangoCon, come play a round of mini golf with us. Look for our insert in your conference tote bag. It includes a free pass to a mini golf outing that we’re hosting at Tiki Town Adventure Golf on Tuesday, October 16, at 7:00 p.m. (please RSVP online). The first round of golf is on us! Whoever shoots the lowest score will win a $100 Amazon gift card.*

No worries if you’re not into mini golf! Instead, find a time to chat with us one-on-one during DjangoCon.

*In the event of a tie, the winner will be selected from a random drawing from the names of those with the lowest score. Caktus employees can play, but are not eligible for prizes.

September 18, 2018

I recently looked into whether I could use pip-tools to improve my workflow around projects' Python dependencies. My conclusion was that pip-tools would help on some projects, but it wouldn't do everything I wanted, and I couldn't use it everywhere. (I tried pip-tools version 2.0.2 in August 2018. If there are newer versions, they might fix some of the things I ran into when trying pip-tools.)

My problems

What were the problems I wanted to solve that pip alone wasn't handling?
Software engineer Kenneth Reitz explains them pretty well
in his post, but I'll summarize here.

Let me start by briefly describing the environments I'm concerned with. First is my development environment, where I want to manage the dependencies. Second is the test environment, where I want to know exactly what packages and versions we test with. Third is the deployed environment, where I want to use exactly the same Python packages and versions as I've used in development and testing, to be sure no problems are introduced by an unexpected package upgrade.

The way we often handle that is to have a requirements file with every package and its version specified. We might start by installing the packages we know that we need, then saving the output of pip freeze to record all the dependencies that also got installed and their versions. Installing into an empty virtual environment using that requirements file gets us the same packages and versions.

But there are several problems with that approach.

First, we no longer know which packages in that file we originally wanted, and which were pulled in as dependencies. For example, maybe we needed Celery, but installing it pulled in a half-dozen other packages. Later we might decide we don't need Celery anymore and remove it from the requirements file, but we don't know which other packages we can also safely remove.

Second, it gets very complicated if we want to upgrade some of the packages, for the same reasons.

Third, having to do a complete install of all the packages into an empty virtual environment can be slow, which is especially aggravating when we know little or nothing has changed, but that's the only way to be sure we have exactly what we want.

Requirements

To list my requirements more concisely:

Distinguish direct dependencies and versions from incidental

Freeze a set of exact packages and versions that we know work

Have one command to efficiently update a virtual environment to have exactly the frozen packages at the frozen versions and no other packages

Make it reasonably easy to update packages

Work with both installing from PyPI, and installing from Git repositories

Take advantage of pip's hash checking to give a little more confidence that packages haven't been modified

Support multiple sets of dependencies (e.g. dev vs. prod, where prod is not necessarily a subset of dev)

Perform reasonably well

Be stable

That's a lot of requirements. It turned out that I could meet more of them with pip-tools than just pip, but not all of them, and not for all projects.

How to set it up

I put the top-level requirements in requirements.in/*.txt.

To manage multiple sets of dependencies, we can include "-r file.txt",
where "file.txt" is another file in requirements.in, as many times as we want.
So we might have a base.txt, a dev.txt that starts with -r base.txt
and then adds django-debug-toolbar etc.,
and a deploy.txt that starts with -r base.txt and then adds gunicorn.
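To make the layering concrete, the sketch below builds that layout in a temporary directory and expands the -r includes. It is a deliberately simplified reader, not pip's or pip-tools' actual parser; the function name top_level_requirements and the contents of base.txt (Django and celery) are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path


def top_level_requirements(path, _seen=None):
    """List the requirement lines in ``path``, following ``-r`` includes.

    Understands blank lines, ``#`` comment lines, and ``-r other.txt``
    (relative to the including file), and nothing else that pip supports.
    """
    path = Path(path).resolve()
    seen = _seen if _seen is not None else set()
    if path in seen:  # guard against include cycles
        return []
    seen.add(path)

    requirements = []
    for raw in path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("-r "):
            included = path.parent / line[3:].strip()
            requirements.extend(top_level_requirements(included, seen))
        else:
            requirements.append(line)
    return requirements


with tempfile.TemporaryDirectory() as tmp:
    req_in = Path(tmp) / "requirements.in"
    req_in.mkdir()
    # Illustrative file contents only.
    (req_in / "base.txt").write_text("Django\ncelery\n")
    (req_in / "dev.txt").write_text("-r base.txt\ndjango-debug-toolbar\n")
    (req_in / "deploy.txt").write_text("-r base.txt\ngunicorn\n")

    dev = top_level_requirements(req_in / "dev.txt")
    deploy = top_level_requirements(req_in / "deploy.txt")

print(dev)     # ['Django', 'celery', 'django-debug-toolbar']
print(deploy)  # ['Django', 'celery', 'gunicorn']
```

Neither set is a subset of the other, which is the "dev vs. prod" situation from the requirements list.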

There's one annoyance that seems minor at this point, but turns out to be a bigger problem: pip-tools only supports URLs in these requirements files if they're marked editable with -e.

pip-compile looks only at the requirements file(s) we tell it to look at, and not
at what's currently installed in the virtual environment. So one unexpected
benefit is that pip-compile is faster and simpler than installing everything
and then running pip freeze.

The output is a new requirements file at requirements/dev.txt.

pip-compile nicely puts a comment at the top of the output file to tell
developers exactly how the file was generated and how to make a newer version of it.

How to make the current virtual environment have the same packages and versions

Running pip-sync with the compiled requirements file is all it takes. There's no need to create a new empty virtual environment to
make sure only the listed requirements end up installed. If everything is already as we want it, no packages need to be installed at all. Otherwise only the necessary changes are made. And if there's anything installed that's no longer mentioned in our requirements, it gets removed.
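Conceptually, that behavior amounts to a diff between what is installed and what is required. The sketch below illustrates the idea only; it is not how pip-sync is implemented, and the function name plan_sync, the package names, and the version numbers are made up for the example.

```python
def plan_sync(installed, required):
    """Compare two {package: version} mappings and return the changes needed.

    Returns (to_install, to_uninstall): packages that are missing or at the
    wrong version, and packages that are installed but no longer required.
    """
    to_install = {
        name: version
        for name, version in required.items()
        if installed.get(name) != version
    }
    to_uninstall = sorted(name for name in installed if name not in required)
    return to_install, to_uninstall


# Illustrative data: versions here are placeholders, not recommendations.
installed = {"django": "2.0.8", "celery": "4.2.1", "kombu": "4.2.1"}
required = {"django": "2.1.1", "gunicorn": "19.9.0"}

to_install, to_uninstall = plan_sync(installed, required)
print(to_install)    # {'django': '2.1.1', 'gunicorn': '19.9.0'}
print(to_uninstall)  # ['celery', 'kombu']
print(plan_sync(required, required))  # ({}, [])
```

The last call shows the cheap case: when the environment already matches, there is nothing to install or remove.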

Except ...

pip-sync doesn't seem to know how to uninstall the packages that we installed using -e <URL>. I get errors like this:

Can't uninstall 'pkgname1'. No files were found to uninstall.
Can't uninstall 'pkgname2'. No files were found to uninstall.

I don't really know, then, whether pip-sync is keeping those packages up to date. Maybe before running pip-sync, I could just

rm -rf $VIRTUAL_ENV/src

to delete any packages that were installed with -e? But that's ugly and would be easy to forget, so I don't want to do that.

Hash checking

I'd like to use hash checking, but I can't yet. pip-compile can generate hashes for packages we will install from PyPI, but not for ones we install with -e <URL>. Also, pip-sync doesn't check hashes. pip install will check hashes, but if there are any hashes, then it will fail unless all packages have hashes. So if we have any -e <URL> packages, we have to turn off hash generation or we won't be able to pip install with the compiled requirements file. We could still use pip-sync with the requirements file, but since pip-sync doesn't check hashes, there's not much point in having them, even if we don't have any -e packages.
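For readers unfamiliar with what a hash check buys us, the idea is simply to compare a digest of the downloaded file with a digest recorded in advance. The sketch below shows that comparison with hashlib; it is not pip's code, and the function name matches_any_hash and the sample bytes are hypothetical. The "sha256:<hex digest>" form mirrors the --hash= options that appear in a requirements file.

```python
import hashlib


def matches_any_hash(content, allowed_hashes):
    """Return True if the SHA-256 of ``content`` is one of ``allowed_hashes``.

    ``allowed_hashes`` holds strings in the ``sha256:<hex digest>`` form.
    """
    digest = "sha256:" + hashlib.sha256(content).hexdigest()
    return digest in allowed_hashes


# Stand-in for a downloaded package archive (illustrative bytes only).
archive = b"pretend this is a downloaded package archive"
pinned = {"sha256:" + hashlib.sha256(archive).hexdigest()}

print(matches_any_hash(archive, pinned))                 # True
print(matches_any_hash(archive + b" tampered", pinned))  # False
```

A package installed from a Git URL has no single archive to pin in this way, which is consistent with pip-compile not generating hashes for the -e <URL> entries.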

What about pipenv?

Pipenv promises to solve many of these same problems. Unfortunately, it imposes other constraints on my workflow that I don't want. It's also changing too fast at the moment to rely on in production.

Pipenv solves several of the requirements I listed above, but fails on these:

It only supports two sets of requirements: base, and base plus dev, not arbitrary sets as I'd like.

It can be very slow.

It's not (yet?) stable: the interface and behavior are changing constantly, sometimes multiple times in the same day.

It also introduces some new constraints on my workflow. Primarily, it wants to control where the virtual environment is in the filesystem. That both prevents me from putting my virtual environment where I'd like it to be, and prevents me from using different virtual environments with the same working tree.

Shortcomings

pip-tools still has some shortcomings, in addition to the problems with checking hashes I've already mentioned.

Most concerning are the errors from pip-sync when packages have previously been
installed using -e <URL>. I feel this is an unresolved issue that needs to be fixed.

Also, I'd prefer not to have to use -e at all when installing from a URL.

This workflow is more complicated than the one we're used to, though no more complicated than we'd have with pipenv, I don't think.

The number and age of open issues in the pip-tools git repository worry me. True, it's orders of magnitude fewer than some projects, but it still suggests to me that pip-tools isn't as well maintained as I might like if I'm going to rely on it in production.

Conclusions

I don't feel that I can trust pip-tools when I need to install packages from Git URLs.

But many projects don't need to install packages from Git URLs, and for those, I think adding pip-tools to my workflow might be a win. I'm going to try it with some real projects and see how that goes for a while.

August 31, 2018

On August 11, I attended the National Day of Civic Hacking hosted by Code for Durham. More than 30 attendees came to the event, hosted in the Caktus Group Tech Space, to collaborate on civic projects that focus on the needs of Durham residents.

National Day of Civic Hacking is a nationwide day of action that brings together civic leaders, local government officials, and community organizers who volunteer their skills to help their local community. Simone Sequeira, Senior Product Manager of GetCalFresh, came from Oakland to participate and present at our Durham event. Simone inspired us with a presentation of GetCalFresh, a project supported by Code for America, that streamlines the application process for food assistance in California. It started as just an idea, and turned into a product used statewide that’s supported by over a half dozen employees. Many Code for Durham projects also start as ideas, and the National Day of Civic Hacking provided an opportunity to turn those ideas into realities.

Pictured: Laura Biedeger, a Team Captain at Code for Durham and a co-organizer of the event, speaks to attendees. I'm standing to the left.

Durham Projects

We worked on a variety of projects in Durham, including the following:

One group of designers, programmers, and residents audited the Code for Durham website. The group approached the topic from a user-centered design perspective: they identified and defined user personas and wrote common scenarios of visitors to the site. By the end of the event they had documented the needs of the site and designed mockups for the new site.

Regular volunteers with Code for Durham have been working with the Durham Innovation Team to create an automated texting platform for the Drivers License Restoration Initiative, which aims to support a regular amnesty of driver’s license suspensions in partnership with the Durham District Attorney’s Office. During our event volunteers added a Spanish language track to the platform.

The “Are We Represented?” project focused on voter education: showing how the makeup of County Commissioner boards across the state compares to the population in their counties. During the event I worked with Jason Jones, the Analytics and Innovation Manager of Greensboro, to deploy the project to the internet (and we succeeded!).

Pictured: The Are We Represented group reviews State Board of Elections data files.

Another group partnered with End Hunger in Durham, which provides a regularly updated list of food pantries and food producers (gardeners, farmers, grocery stores, bakeries) that regularly donate surplus food. The volunteers reviewed an iOS app they had developed to easily find a pantry, and discussed the development of an Android app.

Join Us Next Time!

The National Day of Civic Hacking gave volunteers a chance to get inspired about new project opportunities, to meet new volunteers and city employees, and to focus on a project for an extended period of time. The projects will continue at Code for Durham’s regularly hosted Meetup at the Caktus Group Tech Space. Volunteers are always welcome, so join us at the next Meetup!

August 27, 2018

This is the first article in a series that explores concepts of state in CircuitPython.

In this installment, we discuss the platform we're using (both CircuitPython and the Adafruit M0/M4 boards that support it), and build a simple circuit for demonstration purposes. We'll also talk a bit about abstraction.

This series is intended for people who are new to Python, programming, and/or microcontrollers, so there's an effort to explain things as thoroughly as possible. However, experience with basic Python would be helpful.

If you’ve ever struggled to implement a complicated development project, starting your next one with a discovery workshop will help. Discovery workshops save you time and money over the course of a project because we help you answer important questions in advance, ensuring that the final product lines up with your primary end goals. Our new guide, Shared Understanding: A Guide to Caktus Discovery Workshops, demonstrates the value of these workshops and why we’ve made them a core component of our client services.

Set Your Project Up for Success

Discovery workshops are vital in defining a project and are an ideal way to overcome the challenges that arise when multiple stakeholders have varying opinions and conflicting visions. By facilitating a discovery workshop, we create a shared understanding of the project and ultimately streamline the development process to ensure that our clients get the best value for their investment. Projects that begin with a discovery phase are more successful for these simple reasons:

They cost less because we build the right thing first

They’re done faster because we focus on the most valuable features first

They have better results because user needs are prioritized from the start

Discovery workshops are part of our best practices for building sharp web apps the right way. We’ve proven that these workshops ensure that projects not only hit their objectives but that they do so on budget, reducing the likelihood of requiring additional work (or money) further down the line.

August 20, 2018

On July 27 - 28, we ran our quarterly ShipIt Day here at Caktus. These 24-hour sprints, which we’ve organized since 2012, allow Cakti to explore new projects that expand or increase their skill sets. The event kicked off at 3:00 pm on Thursday and we reconvened at 3:00 pm on Friday to showcase our progress. The goal is to experiment, take risks, and try something new.

May 31, 2018

OK, this is a debug session in progress, so don't expect a nice solution at the end. We're working
on a project that does analysis of some public voter registration data. The DB is hosted on Amazon
RDS and I've been perplexed by how poorly queries are performing there, despite the tables only having
about 10 million rows. Simple queries are taking many minutes, which is orders of magnitude slower
than my laptop.

Mark suggested running 'VACUUM ANALYZE', which I didn't think would help
because my understanding was that the autovacuum process in PostgreSQL would be taking care of that
on a regular basis. These queries had been slow for days with no recent inserts or updates, so
certainly autovacuum should have caught up to them by now. But, I tried it anyway and lo and behold:

So those 2 large tables haven't been ANALYZEd in weeks, despite the fact that we import a 10 million
row CSV once every week. This is the end of my debugging road, for now. Hopefully, I'll figure out
what's going on.
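
One way to check whether autovacuum has been keeping up is to look at the timestamps PostgreSQL records in its `pg_stat_user_tables` statistics view. The sketch below is an illustration, not the query used in this session: the helper name `stale_tables` and the seven-day threshold are assumptions, and the rows are expected to come from any DB-API cursor that has executed `STATS_QUERY`.

```python
from datetime import datetime, timedelta

# Standard PostgreSQL statistics view; one row per user table.
STATS_QUERY = """
    SELECT relname, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
"""


def stale_tables(rows, now, max_age=timedelta(days=7)):
    """Return names of tables whose latest (auto)ANALYZE is older than max_age.

    rows: iterable of (relname, last_analyze, last_autoanalyze) tuples.
    Either timestamp may be None if that kind of ANALYZE never ran.
    `now` must be comparable with the timestamps (both naive or both aware).
    """
    stale = []
    for relname, last_analyze, last_autoanalyze in rows:
        runs = [t for t in (last_analyze, last_autoanalyze) if t is not None]
        # Never analyzed, or analyzed too long ago: flag it.
        if not runs or now - max(runs) > max_age:
            stale.append(relname)
    return stale
```

With a real connection, the rows would be fetched with something like `cursor.execute(STATS_QUERY)` followed by `cursor.fetchall()`, and any table that shows up in the result is a candidate for a manual `ANALYZE`.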

February 13, 2018

July 2018 update: I’ll be giving a talk based on this guide at PyOhio next week. If you’re there, please come say hello!

It’s not always obvious, but migrating from Python 2 to 3 doesn’t have to be an overwhelming effort spike. I’ve done Python 2-to-3 migration assessments with several organizations, and in each case we were able to turn the unknowns into a set of straightforward to-do lists.

I’ve written a Python 2-to-3 migration guide [PDF] to help others who want to make the leap but aren’t sure where to start, or have maybe already begun but would like another perspective. It outlines some high level steps for the migration and also contains some nitty-gritty technical details, so it’s useful for both those who will plan the migration and the technical staff that will actually perform it.

The (very brief) summary is that most of the work can be done in advance without sacrificing Python 2 compatibility. What’s more, you can divide the work into manageable chunks that you can tick off one by one as you have time to work on them. Last but not least, many of the changes are routine and mechanical (for example, changing the print statement to a function), and there are tools that do a lot of the work for you.
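
As a small illustration of one of those mechanical changes, the sketch below shows the print statement rewritten as a function call in a way that runs on both Python 2 and Python 3. The `report` function is a made-up example, not something taken from the guide.

```python
# Must come before other statements; it is a no-op on Python 3.
from __future__ import print_function

import sys


def report(count, stream=sys.stdout):
    # The Python 2 statement form would be: print >>stream, "migrated", count
    print("migrated", count, file=stream)
```

Once a module carries the `__future__` import, the statement form becomes a syntax error there even on Python 2, so each converted file stays converted.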

January 12, 2018

Summary

I recently tripped over my reliance on a simple (and probably obscure) feature in Python’s distutils that setuptools doesn’t support. The result was that I created a tarball for my posix_ipc module that lacked critical files. By chance, I noticed when uploading the new tarball that it was about 75% smaller than the previous version. That’s a red flag!

Fortunately, the bad tarball was only on PyPI for about 3 minutes before I noticed the problem and removed the release.

I made debugging harder on myself by stepping away from the project for a long time and forgetting what changes I’d made since the previous release.

Background

In February 2014, I (finally) made my distribution PyPI–friendly. Prior to that I’d built my distribution tarballs with a custom script that explicitly listed each file to be included in the tarball. The typical, modern, and PyPI–friendly way to build tarballs is by writing a MANIFEST.in file that a distribution tool (like Python’s distutils) interprets into a MANIFEST file. A command like `python setup.py sdist` reads the manifest and builds the tarball.

That’s the method to which I switched in February 2014, with one exception—since my custom script already contained an explicit list of files, it was easier to write a MANIFEST file directly and skip the intermediate MANIFEST.in. That works fine with distutils.

I released version 1.0.0 of posix_ipc in March of 2015, and haven’t needed to make any changes to the code until just now (the beginning of 2018). However, in February 2016, I made a small change to setup.py that I thought was harmless. (Ha!)

I added a conditional import of setuptools so that I could build wheels. (Side note: I really like wheels!) The change allows me to build posix_ipc wheels on my laptop where I can ensure setuptools is available, but otherwise falls back on Python’s distutils which works just fine for everything else I need setup.py to do, including installing from a tarball. The code looks like this —
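
A conditional import of this kind is usually spelled as a `try`/`except ImportError` around `import setuptools`. The sketch below generalizes that idea into a small helper; the name `import_first` is hypothetical, and this is not the actual posix_ipc code.

```python
import importlib


def import_first(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed here; try the next candidate.
            continue
    raise ImportError("none of %r could be imported" % (module_names,))


# In a setup.py, prefer setuptools (needed for wheels), else use distutils:
#     setup = import_first(("setuptools", "distutils.core")).setup
```

The catch described in the rest of this post applies to any such fallback: the two libraries expose the same `setup()` entry point but do not behave identically.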

The Problem

Just a few days ago, I released a maintenance release of posix_ipc, and it was then I noticed that the tarballs I built with my usual python setup.py sdist command were 75% smaller and missing several critical files. Because it had been 23 months since I made my “harmless” change to setup.py, the switch from using distutils to setuptools wasn’t exactly fresh in my mind.

However, some examination of my commit log and a realization that this was the first release I’d made after making that change gave me a suspicion, and grepping through setuptools’ code revealed no references to MANIFEST, only MANIFEST.in.

[B]e sure to ignore any part of the distutils documentation that deals with MANIFEST or how it’s generated from MANIFEST.in; setuptools shields you from these issues and doesn’t work the same way in any case. Unlike the distutils, setuptools regenerates the source distribution manifest file every time you build a source distribution, and it builds it inside the project’s .egg-info directory, out of the way of your main project directory.

So that was the problem—setuptools doesn’t look for a MANIFEST file, only MANIFEST.in. Since I had the former but not the latter, setuptools used its defaults instead of my list of files in MANIFEST.

The Solution

This part was easy. I converted my MANIFEST file to a MANIFEST.in which works with both setuptools and distutils. That’s probably a more robust solution than the hardcoded list in MANIFEST anyway.
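
For a MANIFEST that is just a list of file paths, the conversion can be mechanical. Here is a minimal sketch, assuming one path per line with optional `#` comment lines; the function name `manifest_to_manifest_in` is hypothetical and this is not the script I actually used.

```python
def manifest_to_manifest_in(manifest_text):
    """Turn a MANIFEST-style file list into MANIFEST.in `include` commands."""
    lines = []
    for raw in manifest_text.splitlines():
        path = raw.strip()
        # Skip blank lines and comment lines.
        if not path or path.startswith("#"):
            continue
        lines.append("include " + path)
    return "\n".join(lines) + "\n"
```

A hand-written MANIFEST.in can often be shorter than this one-line-per-file output by using patterns, but a literal translation keeps the file list exactly as it was.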

I’m pleased that posix_ipc has been stable and well-behaved for such a long time, but these long breaks between releases mean a certain amount of mental rust has always accumulated when it’s time for the next one.

September 09, 2017

I had a few issues with this many moons ago when I was trying the initial social-auth-core packaging. Yesterday I was able to get it to work with the latest version, which in turn allowed me to move from Django 1.10 to Django 1.11. You will most likely encounter failed Django migrations when making the switch. Some posts on the 'net recommend first upgrading to an intermediate version of python-social-auth to resolve that, but I wanted a simpler production switchover, which I found in this social-app-django ticket. The eventual production deploy solution, after testing locally with a copy of the production database, was:

Temporarily hack my Ansible deploy script to fail after updating the source tree and virtualenv for the new libraries but before running migrations.

On the server, as the project user, run pip uninstall python-social-auth to delete the old package.

On the server, make another copy of the production database and then run update django_migrations set app='social_django' where app='default'; via psql.

On the server, as the project user, run python manage.py migrate social_django 0001 --fake.
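
To see what the `update django_migrations` step does before running it against a real database, the relabeling can be tried on a throwaway in-memory SQLite table. This is only a sketch: the table is simplified (Django's real table also has an `applied` column), the sample rows are invented, and `relabel_app` is a hypothetical helper.

```python
import sqlite3


def relabel_app(conn, old_label, new_label):
    """Re-attribute recorded migrations from one app label to another."""
    cursor = conn.execute(
        "UPDATE django_migrations SET app = ? WHERE app = ?",
        (new_label, old_label),
    )
    return cursor.rowcount  # number of migration rows that were relabeled


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE django_migrations (id INTEGER PRIMARY KEY, app TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO django_migrations (app, name) VALUES (?, ?)",
    [("default", "0001_initial"), ("auth", "0001_initial")],
)
changed = relabel_app(conn, "default", "social_django")
apps = [row[0] for row in conn.execute("SELECT app FROM django_migrations ORDER BY id")]
```

Only rows recorded under the old `default` label are touched, which is why the migrations of other apps are unaffected by the switch.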

July 03, 2017

The last time I wrote about Quality Engineering, I mentioned that some of the reasons why people are not familiar with this term are, in no particular order:

'Quality' is usually something that is added as an afterthought and doesn't really come into the picture, if ever, until the very end of the release process

Nobody outside of a QA team really knows what they do. It has something to do with testing...

Engineering is usually identified with skills related to writing code and designing algorithms, usually by a developer and not by QA

A quick search on Google shows the following results:

104,000,000 hits for "Software Engineer"

86,900,000 hits for “Quality Control”

83,100,000 hits for “Quality Assurance”

5,390,000 hits for “Quality Engineer”

As you can see, it is no wonder that whenever I say 'quality engineer' people always think that what I really meant to say was 'quality assurance' or 'quality control'. The term is just not that well-known! So in order to clarify what the difference is between these professions, today I'd like to talk a little bit about quality assurance and what I usually think whenever someone tells me that they either work in QA or have a 'QA team'.

Wikipedia tells us that the terms 'quality assurance' (QA) and 'quality control' (QC) are often used interchangeably to refer to ways of ensuring the quality of a service or product.

Furthermore,

"Quality assurance comprises administrative and procedural activities implemented in a quality system so that requirements and goals for a product, service or activity will be fulfilled. It is the systematic measurement, comparison with a standard, monitoring of processes and an associated feedback loop that confers error prevention."
-- Wikipedia

That is quite a mouthful (the emphasized words are mine), but I feel that it does a good job of stating the following ideas:

Quality Assurance and/or Quality Control is used to assure the quality of a product, but there is no clear distinction as to when in the release process it should be used. In my experience, it usually happens when the product is close to being shipped!

Used to make sure that requirements (the what) are fulfilled (the how)

Used to measure, monitor and compare results against a standard

Used for error prevention (which to me denotes a reactive mode compared to a proactive mode)

In other words, those who do quality assurance for a living are involved in verifying that the final version of the product being tested delivers exactly what was designed with the expected behavior and outcome. It requires that the QA person fully understand what is being added to or changed in the product and, most importantly, what the end result should be. Testing is definitely a big part of the 'day to day' activities for someone in QA, which does provide useful information to create a positive feedback loop and hopefully increase error prevention.

Here's what I don't like about this whole business though:

Quality is something that must be part of all phases of a product and not at the very end of the process. A good QA person is usually so familiar with the product being tested that one could say that QA is the first customer a company has! If you have someone in your team who can fully understand how your product works, where the pain points are, knows at a glance if a new feature or a fix does not follow the existing standards, and has the ability to tell you if something doesn't feel right, would you want to hear this type of feedback at the very end? By then, can you really afford to put things on hold and re-design your product??? In my experience, the answer to this question has 99.99% of the time been 'No'.

Quality is the responsibility of everyone involved with a product and not only of those in QA! Everyone, document writers, translators, user experience (UX) experts, product managers, you name it, everyone should be in the business of delivering and assuring the quality of the product! If you bought something, would you be OK with accepting mediocre user experience, documentation, features and translations? I doubt it.

Monitoring and measuring how a product compares against some set of standardized benchmarks is definitely important but as customers request more and more new features and the product's complexity increases, are your benchmarks also keeping up with all these changes? More importantly, since you are the one using the product day and night, do you have any input into updating the benchmarks? I certainly hope so.

Lastly, if your job is to make sure that no product 'goes out the door' without a thorough validation, that it works as expected and that all known issues have been fixed, aren't you forgetting something? What about the issues that are not known yet? You may be thinking that I'm joking, but seriously. If all you do is prevent errors from being shipped to your customers, how about detecting them as early as possible to give all major stake-holders enough time to make a decision as to what should be done with them? Again, if you're catching them at the end of the release cycle, it could be too late.

If your company has a QA team, then you're already ahead of the game, since it is only when customer dissatisfaction is very high and the final numbers for the quarter start to look gloomy that people start paying attention to delivering quality. But it is not enough if you're just kicking the can down the road, only to find yourself facing the same scenario later on! Quality, good quality, is what everyone in your team should be striving for... not sometimes, but all the time!

If you are in a QA team, do you ever feel like you're ahead of the game or feel like you're constantly playing catch up? Do you wish you could have a chance to catch issues as early as possible? Wouldn't you want to stop racing against the clock to get issues verified and have a shot at doing more exploratory testing and identify problems early on? Would you say 'no' to an opportunity to provide some insight into how the product could be improved and perhaps how some work-flows could be simplified to increase the usability?

It should be clear by now that quality should be systemic for any project or company that takes customer satisfaction as its top priority! Sure, you can test the product as much as you (or your QA team) can handle, but you'd only be treating the symptoms. Maintaining a 'quality first' mentality and improving existing processes to make sure that quality is an integral part of everyone's day-to-day activities is essential if you really want to make a bigger impact!

This is where a Quality Engineer comes in! A Quality Engineer is someone who can actively and continuously drive improvements to the release cycle process and is in a unique position to help the entire team adopt these improvements so that everyone is using the same methodologies.

Next time I will then talk about quality engineering (QE), what it is, what it isn't, and how you should be either hiring more QE or, if you're in QA, how you should be working to become a QE!

As always, please let me know what your thoughts are on this topic as I'd love to get some constructive feedback!

Disclaimer: The opinions contained within this article are mine alone and do not necessarily represent the opinions of any entity whatsoever with which I have been, am now or will be affiliated.

July 02, 2017

On behalf of the Jython development team, I'm pleased to announce that the final release of Jython 2.7.1 is available! We thought 2017-07-01 was a perfect time to release version 2.7.1 :) This is a bugfix release. Bug fixes include improvements in ssl and pip support along with lots of improvements in CPython compatibility.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:

Most likely, you want the traditional installer. NOTE: the installer automatically installs pip and setuptools (unless you uncheck that option), but you must unset JYTHON_HOME if you have it set. See the installation instructions for using the installer.

June 30, 2017

Whenever I meet someone for the first time, once we get past the initial niceties, the conversation eventually shifts to work and what one does for a living. Inevitably I'm faced with what, at first glance, may sound like a simple question, and the conversation goes like this:

New acquaintance: "What do you do at Red Hat?"

Me: "I manage a team of quality engineers for a couple of different products."

New acquaintance: "Oh, you mean quality assurance, right? QA?"

Me: "No, quality engineers. QE."

What usually followed then was a lengthy monologue in which I spent around ten to fifteen minutes explaining what the difference between QA and QE is and what, in my opinion, sets these two professions apart. Now, before I get too deep into this topic, I have to add a disclaimer here so as not to give folks the impression that what I'm talking about is backed by any official definition or some type of professional trade organization! The following are my own definitions and conclusions, none of which were pulled out of thin air, but backed by (so far) 10 years of experience working in the field of delivering quality products. If there are formal definitions out there, and they match my own, it is by pure coincidence.

Why the term 'Quality Engineer' is not well known I'm not sure, but I have a hunch that it may be related to something I noticed throughout the 10 years that I have spent in this field. In my personal experience, 'quality' is something that is not always considered as part of the creation of a new company, product or project. Furthermore, the term 'quality' is also not well defined or understood by those involved in actually attempting to 'get more' of it.

In my experience, folks usually forget about the word 'quality', whatever that may be, happily start planning and developing their new ideas/products and eventually ship them to their customers. If the customer complains that something is not working or performing as advertised or it doesn't meet their expectations, no problem. Someone will convey the feedback back to the developers, a fix will eventually be provided and off it goes to the customer. Have you ever seen this before? I have!

Eventually, assuming that the business is doing well and is attracting more paying customers, it is highly likely that support requests or requests for new features will increase. After all, who wants to pay for something that doesn't work as expected? Also, who doesn't want a new feature of their own either? Depending on the size of the company and the number of new requests going into their backlog, I'd expect that either one of the following events would then take place:

More tasks from the backlog would be added to individuals' 'plates', or

New associates would be hired to handle the volume of tasks

I guess one could also stop accepting new requests for support or new features, but that would not make your customers happy, would it?

Regardless of the outcome, the influx of new tasks is dealt with and if things get out of control again, one could always try to get an intern or distribute tasks more evenly. Now, notice how the word 'quality' has not been mentioned yet? It is no accident that to deal with an increase in work, more often than not the number one solution used is to throw more resources at it. There's even a name for this type of 'solution': The Mythical Man-Month.

You see, sadly, 'quality' is something that usually only becomes important as an afterthought. It is the last piece added to the puzzle that comprises the machinery of delivering something to an end user. It is only when enough angry and unsatisfied paying customers make enough noise about the unreliability or usability of the product that folks start asking: "Was this even tested before being put on the market?"

If the pain being inflicted by customer feedback is sharp enough, a Quality Assurance (QA) team is hastily put together. Most of the time in my experience, this is a Team of One usually made up of one of the developers who, after being dragged kicking and screaming from his cubicle, eventually is beaten into accepting his new role as a button pusher, text field filler, testing guy. Issues are then assigned to him and a general sense of relief is experienced by all. Have you also seen this before? I have! I'm 2 for 2 so far!

The idea is that by creating a team of one to sit at the receiving end of the product release cycle, nothing would get shipped until some level of 'quality' is achieved. The fallacy with this statement, however, is that no matter how agile your team may be, the assurance of the quality for a product somehow is still part of a waterfall model. Wouldn't it be better if problems were caught as early as possible in the process instead of waiting until the very end? To me that is a no-brainer, but somehow the process of testing a product is still relegated to the very end, usually when the date for the release is just around the corner.

Why is the term Quality Engineer not well known then? I feel that the answer is comprised of several parts:

'Quality' doesn't come into the picture, if ever, until the very end of the game;

If there is a QA team, nobody outside of that team really knows what they do. It has something to do with testing...

Engineering is usually identified with skills related to writing code and designing algorithms, usually by a developer and not by QA;

No surprise that quality engineering is something foreign to most!

OK, so what is a Quality Engineer then? Glad you asked! The answer to that I shall provide in a subsequent post, as I still need to cover some more ground and talk about what 'quality' is, what someone in QA does and finally what is a QE!

My next article will continue this journey through the land of Quality and Engineering, and in the meantime, please let me know what you think about this subject.

June 27, 2017

This week I started reading On Writing: A Memoir of the Craft by Stephen King, a book that has been mentioned a few times by people I usually interview for my weekly podcast as something that is both inspiring and has had a major impact on their lives and careers. After the third or fourth time someone mentioned it, I finally broke down and got myself a copy at the local bookstore.

I have to say that, so far, I am completely blown away by this book! I can totally see why everyone else recommended it as something that people should add to their BTR (Books To Read) list! First of all, the first section of the book, which Stephen King calls his 'C.V.' (and not his memoirs or autobiography), covers his early life as a child, his experiences and struggles (there are quite a few passages that will most likely get you to laugh out loud) growing up with his mom and older brother, Dave. This section, roughly 100 pages or so, is so easy to relate to that you can probably be done with it in about 2 hours no matter what your reading pace is. I am always captivated to learn how someone 'came to be', the real 'behind the scenes' if you will, of how someone started out their lives and the paths they took to get to where they are now.

The next sections talk about what any aspiring writer should add to their 'toolbox' and cover many interesting topics and suggestions which, if you really think about it, make a ton of sense. This is where I am in the book right now, and though it isn't as captivating as the first section, it should still appeal to anyone looking for solid advice on how to become a better writer, in my humble opinion.

Though I do aspire to one day become a published writer (fiction most likely), and I am enjoying this book so much that I'm having a real hard time putting it down, the reason why I chose to write about it is related to a piece of advice that Stephen King shares with the reader about the habit of reading.

Stephen King claims that, to become a better writer one must at least obey the following rules:

Read every day!

Write every day!

It is by reading a lot (something that should come naturally to anyone who reads every day) that one learns new vocabulary words, different styles of prose, how to structure ideas into paragraphs and rhythm. He says that it doesn't matter if you read in 'tiny sips' or in huge 'swallows', but as long as you continue to read every day, you'll develop a great and, in his opinion, required habit for becoming a better writer. Obviously, based on his two rules you'd need to write every day too, and if you're one of us who is toying with the idea of becoming a writer one day (or want to become a better writer), I too highly recommend that you give this book a shot! I know, I know, I have not finished it yet but still... I highly recommend it!

Back to the habit of reading and the purpose of this post, I remember back in 2008 my own 'struggle' to 'find the time' to read non-technical books. You know, reading for fun? Back then I was doing a lot of reading, but mostly it consisted of blog posts and articles recommended by my RSS feeds, and since I was very much involved with a lot of different open source projects, I mostly read about GNOME, KDE, Ubuntu and Python. Just the thought of reading a book that did not cover any of these topics gave me a feeling of uneasiness and I couldn't picture myself dedicating time, precious time, to reading 'for fun.' But eventually I realized that I needed to add a bit more variety to my reading experience and that sitting in front of my computer during my lunch break would not help me with this at all. There were too many distractions to lure me away from any book I might be trying to read.

I started out by picking up a book that everyone around me had mentioned many times as being 'wicked cool' and 'couldn't put it down' kind of book. Back then I worked at a startup and most of the engineers around me were much younger than me and at one point or another most of them were into 'the new Harry Potter' book. I confess that I felt judgmental and couldn't fathom the idea of reading a 'kid book' but since I was trying to create a new habit and since my previous attempts had failed miserably, I figured that something drastic was just what the doctor would have recommended. One day after work, before driving back home, I stopped by the public library and picked up Harry Potter and the Sorcerer's Stone.

The next day at work, when I took my lunch break, I locked my laptop and went downstairs to a quiet corner of the building's lobby. I picked a nice, comfortable seat with a lot of natural sunlight and a view of the main entrance and started reading... or at least I thought I did. Whenever I started to read a paragraph, someone would open the door at the main entrance to the building either on their way in or out, and with them went my focus and my mind would start wandering. Eventually I'd catch myself and back to the book my eyes went, only to be disrupted by the next person opening the door. Needless to say, experiment 'Get More Reading Done' was an utter failure!

June 26, 2017

Plotting is an essential component of data analysis. As a data scientist, I spend a significant amount of my time making simple plots to understand complex data sets (exploratory data analysis) and help others understand them (presentations).

In particular, I make a lot of bar charts (including histograms), line plots (including time series), scatter plots, and density plots from data in Pandas data frames. I often want to facet these on various categorical variables and layer them on a common grid.

To that end, I made pythonplot.com, a brief introduction to Python plotting libraries and a "rosetta stone" comparing how to use them. I also included a comparison to ggplot2, the R plotting library that I and many others consider a gold standard.

June 22, 2017

Summary

This is a followup to an earlier post about using Python to measure the “Anglo-Saxonicity” of a text. I’ve used my code to analyze the Baby version of the British National Corpus, and I’ve found some interesting results.

Introduction

Thanks to a suggestion from Ben Sizer, I decided to analyze the British National Corpus. I started with the ‘baby’ corpus which, as you might imagine, is smaller than the full corpus.

The full corpus is described as a “100 million word snapshot of British English at the end of the twentieth century”. The Baby version categorizes text samples into four groups: academic, conversations, fiction, and news. Below are stack plots showing the percentage of Anglo-Saxon, non-Anglo-Saxon, and unknown words for each document in each of the four groups. The Y axis shows the percentage of words in each category. The numbers along the X axis identify individual documents within the group.

I’ve deliberately given the charts non-specific names of Group A, B, C, and D so that we can play a game. :-)

Before we get to the game, here are the averages for each group in table form. (The numbers might not add exactly to 100% due to rounding.)

| Group | Anglo-Saxon (%) | Non-Anglo-Saxon (%) | Unknown (%) |
|---|---|---|---|
| Group A | 67.0 | 17.7 | 15.3 |
| Group B | 56.1 | 25.8 | 18.1 |
| Group C | 72.9 | 13.2 | 13.9 |
| Group D | 58.6 | 22.0 | 19.3 |

Keep in mind that “unknown” words represent shortcomings in my database more than anything else.

The Game

The Baby BNC is organized into groups of academic, conversations, fiction, and news. Groups A, B, C, and D each represent one of those groups. Which do you think is which?

Click below to reveal the answer to the game and a discussion of the results.

Answers

| Group | Anglo-Saxon (%) | Non-Anglo-Saxon (%) | Unknown (%) |
|---|---|---|---|
| A = Fiction | 67.0 | 17.7 | 15.3 |
| B = Academic | 56.1 | 25.8 | 18.1 |
| C = Conversations | 72.9 | 13.2 | 13.9 |
| D = News | 58.6 | 22.0 | 19.3 |

Discussion

With the hubris that only 20/20 hindsight can provide, I’ll say that I don’t find these numbers terribly surprising. Conversations have the highest proportion of Anglo-Saxon (72.9%) and the lowest of non-Anglo-Saxon (13.2%). Conversations are apt to use common words, and the 100 most common words in English are about 95% Anglo-Saxon. The relatively fast pace of conversation doesn’t encourage speakers to pause to search for those uncommon words lest they bore their listener or lose their chance to speak. I think the key here is not the fact that conversations are spoken, but that they’re impromptu. (Impromptu if you’re feeling French, off-the-cuff if you’re more Middle-English-y, or extemporaneous if you want to go full bore Latin.)

Academic writing is on the opposite end of the statistics, with the lowest portion of Anglo-Saxon words (56.1%) and the highest non-Anglo-Saxon (25.8%). Academic writing tends to be more ambitious and precise. Stylistically, it doesn’t shy away from more esoteric words because its audience is, by definition, well-educated. It doesn’t need to stick to the common core of English to get its point across. In addition, those who shaped academia were the educated members of society, and for many centuries education was tied to the church or limited to the gentry, and both spoke a lot of Latin and French. That has probably influenced even the modern day culture of academic writing.

Two of the academic samples managed to use fewer than half Anglo-Saxon words. They are a sample from Colliding Plane Waves in General Relativity (a subject Anglo-Saxons spent little time discussing, I’ll wager) and a sample from The Lancet, the British medical journal (49% and 47% Anglo-Saxon, respectively). It’s worth noting that these samples also displayed the highest and 5th-highest percentages of words of unknown etymology (26% and 21%, respectively) of the 30 samples in this category. A higher proportion of unknowns depresses the results in the other two categories.
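One way to see how much the unknowns depress the other two categories is to recompute the Anglo-Saxon share using only the words whose origin is known. The sketch below does that for the four group averages from the answers table. It is an illustration rather than part of the analysis above, and it assumes (without evidence) that the unknown words would split between the two known categories in the same proportion as the known words do.

```python
# Group percentages copied from the answers table:
# (Anglo-Saxon, non-Anglo-Saxon, unknown).
GROUPS = {
    "Fiction": (67.0, 17.7, 15.3),
    "Academic": (56.1, 25.8, 18.1),
    "Conversations": (72.9, 13.2, 13.9),
    "News": (58.6, 22.0, 19.3),
}


def share_of_known(anglo_saxon, non_anglo_saxon):
    """Anglo-Saxon percentage among words of known origin only."""
    return 100 * anglo_saxon / (anglo_saxon + non_anglo_saxon)


known_shares = {
    name: share_of_known(anglo_saxon, non_anglo_saxon)
    for name, (anglo_saxon, non_anglo_saxon, _unknown) in GROUPS.items()
}

for name, share in known_shares.items():
    print(f"{name}: {share:.1f}% Anglo-Saxon among words of known origin")
```

Under that assumption every group's Anglo-Saxon share rises, but the ordering of the four groups stays the same.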

Fiction rests in the middle of this small group of 4 categories, and I’m a little surprised that the percentage of Anglo-Saxon is as high as it is. I feel like fiction lends itself to the kind of description that tends to use more non-Anglo-Saxon words, but in this sample it’s not all that different from conversation.

News stands out for having barely more Anglo-Saxon words than academic writing, and also the highest percentage of words of unknown etymological origin. The news samples are drawn principally from The Independent, The Guardian, The Daily Telegraph, The Belfast Telegraph, The Liverpool Daily Post and Echo, The Northern Echo, and The Scotsman. It would be interesting to analyze each of these groups independently to see if they differ significantly.

Future

My hypothesis that conversations have a high percentage of Anglo-Saxon words because they’re off-the-cuff rather than because they’re spoken is something I can challenge with another experiment. Speeches are also spoken, but they’re often written in advance, without the pressure of immediacy, so the author would have time to reach for a thesaurus. I predict speeches will have an Anglo-Saxon/non-Anglo-Saxon profile closer to that of fiction than of either of the extremes in this data. It might vary dramatically based on speaker and audience, so I’ll have to choose a broad sample to smooth out biases.

I would also like to work with the American National Corpus.

Stay tuned, and let me know in the comments if you have observations or suggestions!

June 20, 2017

On behalf of the Jython development team, I'm pleased to announce that the third release candidate of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include improvements in ssl and pip support.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:

Most likely, you want the traditional installer. NOTE: the installer automatically installs pip and setuptools (unless you uncheck that option), but you must unset JYTHON_HOME if you have it set. See the installation instructions for using the installer.

Run your function from the command line

{1} is replaced by each item after the ::: in turn, and that item is passed as an argument to function_name.

For example

(lazy) ~ $ cat python_file.py
from time import sleep
def function_name(arg1):
    print("Starting to run with", arg1)
    sleep(2)
    print("Finishing to run with", arg1)
if __name__ == '__main__':
    import fire
    fire.Fire()
(lazy) ~ $ parallel -j3 --lb "python -u python_file.py function_name {1} " ::: input1 input2 input3 input4 input5
Starting to run with input2
Starting to run with input1
Starting to run with input3
Finishing to run with input2
Finishing to run with input1
Finishing to run with input3
Starting to run with input4
Starting to run with input5
Finishing to run with input4
Finishing to run with input5

I added --lb and -u to keep Python and Parallel from buffering the output so you can see it being run in parallel.
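For comparison, here is a rough pure-Python analogue of the -j3 behavior: at most three calls run at once, and the remaining inputs start as workers free up. This is only a sketch of the scheduling idea. It uses threads inside one interpreter, whereas Parallel launches a separate python process per input, and the delay parameter is an addition made here to keep the example quick.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def function_name(arg1, delay=0.1):
    """Same shape as the function above, with a shorter, adjustable sleep."""
    print("Starting to run with", arg1, flush=True)
    sleep(delay)
    print("Finishing to run with", arg1, flush=True)
    return arg1


inputs = ["input1", "input2", "input3", "input4", "input5"]

# max_workers=3 plays the role of -j3: no more than three calls run at once.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(function_name, inputs))
```

The printed lines can interleave in any order, as they do with Parallel, but pool.map returns the results in input order.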

June 06, 2017

I was getting this message when I tried to install packages from conda-forge with Conda:

Fetching package metadata ...
CondaHTTPError: HTTP 401 UNAUTHORIZED for url <https://conda.anaconda.org/conda-forge/osx-64/repodata.json>
Elapsed: 00:00.920954
CF-RAY: 36ad7cbd5d1c23d8-IAD
The remote server has indicated you are using invalid credentials for this channel.
If the remote site is anaconda.org or follows the Anaconda Server API, you
will need to
(a) remove the invalid token from your system with `anaconda logout`, optionally
followed by collecting a new token with `anaconda login`, or
(b) provide conda with a valid token directly.
Further configuration help can be found at <https://conda.io/docs/config.html>.

I tried to do $ anaconda logout but didn't have a program called anaconda installed.

June 02, 2017

On behalf of the Jython development team, I'm pleased to announce that the second release candidate of Jython 2.7.1 is available! This is a bugfix release. Bug fixes include improvements in ssl and pip support.

Please see the NEWS file for detailed release notes. This release of Jython requires JDK 7 or above.

This release is being hosted at maven central. There are three main distributions. In order of popularity:

Most likely, you want the traditional installer. NOTE: the installer automatically installs pip and setuptools (unless you uncheck that option), but you must unset JYTHON_HOME if you have it set. See the installation instructions for using the installer.

May 26, 2017

I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, ‘Where did the name, dynamic programming, come from?’ The 1950s were not good years for mathematical research.

We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical.

The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons.

I decided therefore to use the word, ‘programming.’ I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying—I thought, let’s kill two birds with one stone. Let’s take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible.
Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities.

April 26, 2017

Thomas Godfrey, a self-taught mathematician, great in his way, and afterward inventor of what is now called Hadley's Quadrant. But he knew little out of his way, and was not a pleasing companion; as, like most great mathematicians I have met with, he expected universal precision in every-thing said, or was for ever denying or distinguishing upon trifles, to the disturbance of all conversation.

April 25, 2017

I took part in the Raleigh March for Science last Saturday. For the opportunity to learn about it, participate in it, photograph it, share it with you — oh, and also for, you know, being alive today — thanks, science!

I've been trying to rethink distractions in my own life, particularly in my work environment. Here are some things that have helped:

Working from Home

Working in an office, especially an open-floor plan office, is disastrous for staying focused. DeMarco and Lister wrote about this in Peopleware 30 years ago, and yet open offices are the norm for startups today.

I'm much more productive by working from home in my quiet office or on my back patio. I'm finally able to spend my time thinking about hard problems rather than ways of silencing Constant Throat Clearer or Perpetual Annoying Laugher.

Notifications

Every app and website these days wants to send you notifications. I'm aggressive about reducing notifications down to those that I need to see, and I let almost nothing notify me with sound. I use Do Not Disturb mode on my phone and Mac whenever I need to stop notifications altogether.

Slack

Slack has become the new normal for company communication. Some would say Slack itself is ruining our focus, but having it regularly available has been essential for my own work.

I've come up with a few ways to take control of Slack:

Only show "My unread, along with everything I've starred" in the sidebar. See Michael Lopp's excellent post on Slack for more here.

Enable notifications selectively.

Sign out of distracting avocational Slacks.

Social Media

I've started using an app called Focus to block distracting websites (including Facebook and Twitter.com) and apps on my work computer from 9 AM to 5:30 PM. I use Focus's scheduling feature so blocking isn't optional for me.

I've decided not to block Tweetbot. Though it can be distracting, Twitter is an invaluable way for me to learn from my professional colleagues, bounce ideas off of them, and have a good laugh.

On my iPhone, iPad, and personal laptop, I've started using Freedom to block all social media during the day. This has stopped me from instinctively checking Instagram every time I walk to the bathroom or get stuck on a hard problem. I highly recommend it.1

I also use Freedom to block social media for the first hour I'm up in the morning and before I go to bed.

Email

When emails only need a brief reply, I tend to write responses as soon as possible. At the moment, I'm trying to break people of the expectation that I'll respond quickly. Using a service like Boomerang, which lets me write emails now and have them sent later, helps here.

Reading

Long-form reading at the computer is terrible for comprehension. As Doug Lemov has argued, you have to get away from your computer and other devices to read deeply. I do this by printing articles or reading on my iPad with Freedom blocking enabled. I take my printouts or iPad and walk away from my desk to read.

Todo Items

I'm a firm believer in the Getting Things Done principle of reducing the cognitive overhead of tracking to-do items in my head. I use Omnifocus for task management. Mail Drop and this Alfred workflow help me to quickly add tasks to my Omnifocus inbox. When I think of something I need to take care of outside of work, I drop that thought into Omnifocus; this keeps those personal to-do items from distracting me while I'm working.

Staying focused is hard. I'm still learning how to do it well, and I'm sure I'm not the only one struggling to improve here. If you have any tips to share, I'd love to hear them!

1. I can't use Freedom on my work computer, because it acts as a VPN which conflicts with my work VPN.

April 10, 2017

Summary (Nutshell)

This is a first look at a work in progress. I’m using Python to study text from an etymological perspective. Specifically, I’m measuring how many words in a given English language text have Anglo-Saxon origin. Many people (including myself) think that Anglo-Saxon words convey a different sense than their counterparts of French/Latin origin. To demonstrate the point in a small way, I’ve included a Latin and Anglo-Saxon version of each heading in this blog post.

French was introduced to English as the language of conquerors and nobility. French was also the language of some European royalty in the 18th and 19th centuries, further adding to its reputation as a language associated with high status. Even today, English words with French origins often have higher cultural status than their counterparts with Anglo-Saxon origins (think cuisine versus cooking, illumination versus light, create versus make, and escargot versus snail). By contrast, the Anglo-Saxon words are often considered more visceral (think sea versus ocean, sweat versus perspire, and free versus emancipated; more on that last pair in a moment).

For instance, when taunting someone, you reach for blunt Anglo-Saxon words. “Your mother was a hamster, and your father smelled of elderberries!” is 100% Anglo-Saxon, except for “elderberries” which was coined in Middle English from “elder” and “berry”, both of Anglo-Saxon origin.

William of Normandy in 1067, addressing his English subjects.

Legal documents and government issuances, on the other hand, tend to include more words of Latin/French origin. It’s no coincidence that the Latin/French words “Emancipation Proclamation” describe a legal act, but if you want to stir the heart about emancipation, you say something like “Free at last!”(1) which is all Anglo-Saxon.

Others have written more eloquently than I about how word origin influences tone (Annalisa Quinn at NPR, Gemma Varnom, and M. Birch, to suggest a few), so I won’t belabor the point more than I already have. But I wanted to talk about how it inspired the project I’ve been working on.

The Project (The Work)

I should preface this by saying that I Am Not A Linguist, and I don’t even play one on TV.

I thought it would be interesting to perform lexicographical analysis of text from an etymological perspective. My etymological categorization is necessarily simple. When I look at a text, I put each word into one of three etymological categories: Anglo-Saxon, non-Anglo-Saxon, or unknown. From this rough grouping I generate statistics that allow me to compare one text to another.

For instance, does one author consistently use more Anglo-Saxon words than other authors? Does an author’s usage of Anglo-Saxon words change from one work to another? Also of interest to me is the etymology of words as the book progresses from front to back. Do the relative frequencies of etymologies change as the book progresses towards its exciting conclusion? For authors writing in English as a second language, is their word selection influenced by their first language?
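The three-way categorization described above can be sketched in a few lines. This is not the tool itself: the ETYMOLOGY dictionary is a tiny hypothetical stand-in for an etymological database (its entries come from the examples earlier in this post), the function names are made up, and it skips stemming and part-of-speech tagging entirely.

```python
import string
from collections import Counter

# Hypothetical, tiny stand-in for an etymological database.
ETYMOLOGY = {
    "free": "Anglo-Saxon",
    "at": "Anglo-Saxon",
    "last": "Anglo-Saxon",
    "sea": "Anglo-Saxon",
    "sweat": "Anglo-Saxon",
    "ocean": "non-Anglo-Saxon",
    "perspire": "non-Anglo-Saxon",
    "emancipation": "non-Anglo-Saxon",
    "proclamation": "non-Anglo-Saxon",
}


def categorize(text):
    """Count words per etymological category; missing words are 'unknown'."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word:
            counts[ETYMOLOGY.get(word, "unknown")] += 1
    return counts


def percentages(counts):
    """Convert category counts to percentages of all words."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


counts = categorize("Free at last! The Emancipation Proclamation")
shares = percentages(counts)
print(counts)
print(shares)
```

Note that "the" lands in the unknown category here only because the toy dictionary doesn't list it, which is the same kind of database gap that inflates the real tool's unknown counts.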

All of the questions above can be explored with the tool I’ve written. It’s easier to show the tool’s output than describe it, so here’s an analysis of Lewis Carroll’s 1865 work “Alice’s Adventures in Wonderland”.

The graph below shows the relative frequency of the three etymological categories as the book progresses from beginning to end.

This graph shows the relative frequency of the three etymological categories as counting statistics for various part-of-speech categories.

The table below is a more detailed version of the chart immediately above. Some percentages may not add up to 100% due to rounding.

Category   | Total | %age of All Words | Anglo-Saxon | non-Anglo-Saxon | Unknown
All Words  | 26624 | 100%              | 18233 (68%) | 3812 (14%)      | 4579 (17%)
Unique     | 3528  | 13%               | 1354 (38%)  | 899 (25%)       | 1275 (36%)
Nouns      | 8522  | 32%               | 4521 (53%)  | 2354 (27%)      | 1647 (19%)
Verbs      | 5479  | 20%               | 2994 (54%)  | 565 (10%)       | 1920 (35%)
Adjectives | 1639  | 6%                | 896 (54%)   | 375 (22%)       | 368 (22%)
Adverbs    | 1974  | 7%                | 1348 (68%)  | 420 (21%)       | 206 (10%)
Other      | 9010  | 33%               | 8474 (94%)  | 98 (1%)         | 438 (4%)

Observations (What I See)

There are some minor observations to be made here, but the strength of this tool will be in comparative analysis. It’s hard to draw conclusions from one analysis before I have an idea of what’s typical.

For instance, at first glance, the ratio of Anglo-Saxon to non-Anglo-Saxon words looks dramatic, but this says more about English than it does about Carroll. The most common words in English are overwhelmingly Anglo-Saxon in origin. (2) For the small sample size of works I’ve processed so far (just 8 in total), I can see that it’s common for roughly three quarters of the words to be Anglo-Saxon. Alice in Wonderland isn’t an outlier by that standard.

We can also see that the frequency of Anglo-Saxon words decreases slightly throughout the book. This is the kind of trend that I find interesting, but in this case it’s due to an increase in the number of words of unknown etymology. Sometimes a word’s etymology is truly unknown. More often, though, the etymology is classified as unknown for other reasons. Most likely, it’s simply not in my etymological database (which isn’t very complete yet). Also, the word could be a proper noun, an invented word (like “woodshadows” from James Joyce’s Ulysses), or a word for which the etymology is ambiguous. An example of this last category is “bank” which is Anglo-Saxon in origin when referring to the side of a river, but French/Italian in origin when referring to a place that handles money.

At present, the quantity of words classified as “unknown” is too large for my tastes, and I plan to reduce it by improving both my database and the tool.

Verbs are overrepresented in the “unknown” category. My guess is that this is an artifact of my stemmer having difficulty stemming verbs. (I’m currently using the Snowball Stemmer from NLTK.)

As you can see, at this point it’s easier to draw conclusions about the representation of the data than it is about the data themselves. That leads me to the next (and final) topic in this post.

Future (What’s to Come)

As I said in the introduction, this is an early look at a work in progress. Here are some of the things I’d like to add:

Better etymological data

Large scale comparisons of text to look for trends (across authors, genres, etc.)

More numeric (rather than visual) descriptions of the data to facilitate automated comparison. One idea is to add the mean and standard deviation of the percentage of Anglo-Saxon words.

Open sourcing
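As a sketch of what the numeric description might look like, the snippet below computes the percentage of Anglo-Saxon words in consecutive fixed-size windows of a text and then summarizes those percentages with a mean and standard deviation. The windowing scheme, the function name, and the sample labels are all assumptions made for illustration. It uses the population standard deviation; the sample version (statistics.stdev) would be an equally reasonable choice.

```python
from statistics import mean, pstdev


def anglo_saxon_percent_by_window(labels, window_size):
    """Percentage of 'Anglo-Saxon' labels in each consecutive window."""
    percents = []
    for start in range(0, len(labels), window_size):
        window = labels[start:start + window_size]
        hits = sum(1 for label in window if label == "Anglo-Saxon")
        percents.append(100 * hits / len(window))
    return percents


# Hypothetical per-word labels for a twelve-word text.
labels = (
    ["Anglo-Saxon"] * 3 + ["unknown"]
    + ["Anglo-Saxon"] * 2 + ["non-Anglo-Saxon"] * 2
    + ["Anglo-Saxon"] * 4
)

percents = anglo_saxon_percent_by_window(labels, window_size=4)
average = mean(percents)
spread = pstdev(percents)
print(percents, average, round(spread, 2))
```

Two texts could then be compared by these two numbers alone, without looking at a graph.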

If you have any suggestions on how to use this tool or make it more interesting, I’d love to hear them in the comments below. I moderate all comments to filter spam, which is yet another Viking influence on England.

Endnotes

Like English itself, “Endnote” is an etymological hybrid. “End” is of Anglo-Saxon origin, while “note” comes from Old French/Latin.

1. Martin Luther King, Jr. isn’t the only person to have said “Free at last!”, but his use of it is perhaps the most famous. His “I Have a Dream” speech makes brilliant use of etymological contrasts. Many of his memorable phrases in that speech (“I have a dream today”, “Let freedom ring”, “Free at last”) are Anglo-Saxon.

2. In 2014 I pulled from Wikipedia a list of the 100 most common English words. At the time, it contained just four non-Anglo-Saxon words. They were “just” (ME < Latin), “people” (ME < Anglo-French < Latin), “use” (ME < Old French, replaced OE brucan, cognate w/modern Swedish bruk-), and “because” (ME < Fr ‘par cause’). There are lots of ways to count the 100 most common words, and doubtless the list would have been different in Carroll’s day. But my guess is that the presence of Anglo-Saxon hasn’t changed dramatically from that 96% regardless of when and how one counts.

March 31, 2017

I've been tinkering with websites for nearly 20 years. My friend Hunter and I were big into making terrible Angelfire sites as pre-teens. In high school, my dad paid me to make him a webpage for his doctor's office (I used Frontpage). A year or two after that, I read Kevin Yank's "Build Your Own Database Driven Website Using PHP & MySQL" and hacked together a PHP back-end for a Lord of the Rings fan site.

In recent years, I've put together this blog, shouldigetaphd.com, and a few other simple web-based side projects. However, I haven't kept up with modern web development, and my projects have been hacked together from boilerplate or templates. Although I've programmed professionally since 2011, I've spent very little of that time writing anything close to graphical user interfaces.

I have a number of other side projects that I'd like to do at some point, and most of them would require some sort of graphical interface. While I could work on app development, I think web-based implementations would be a great starting place.

A few months back, I decided to stop watching Netflix on the treadmill and instead use those 45 minutes each morning to learn; in particular, I've been trying to learn more about modern(ish) web design and development. My work has a subscription to Safari Books Online which gives me access to copious technical books and video tutorials.

The number of resources available on Safari (along with YouTube, blog posts, etc.) is astounding. I started many video tutorials on Safari that I quickly realized weren't going to be useful. Yet there are many gems to be found, which I share here with you.

What follows is an overview of the technologies I've realized I need to learn more about and links to the resources I've found valuable in learning about them. If you think there are gaps I haven't yet filled or better resources than I've listed below, I'd love your feedback.

What I Knew Going In

I've been a professional software developer and data scientist since 2012. I mostly write Python, but I've programmed in a number of different languages.

I have a pretty good grasp on how HTML and CSS work. I've used enough Javascript over the years to be dangerous; I understand how it runs in the browser. I understand what a DOM is and how it relates to the page source.

I've used the Python Flask web framework for several projects. I understand how to respond to HTTP requests with server-generated content. I had some idea of how to run my own web server on AWS.

A Meta Tutorial on Web Development

A great place to start is Andrew Montalenti's lengthy tutorial on using Python, Flask, Bootstrap, and Mongo to rapidly prototype a website. The tutorial is out of date, but the principles still stand.

Another great resource is Cody Lindley's free Front-End Developer's Handbook. This is a substantial meta-resource that organizes links for learning all angles of front-end development. "It is specifically written with the intention of being a professional resource for potential and currently practicing front-end developers to equip themselves with learning materials and development tools."

Chrome Developer Tools

One of the most important tools for me in learning more about web development has been the Chrome Developer Tools. You can live edit the DOM elements and style sheets and watch how a website changes. I've mostly learned Developer Tools through exploring it myself, but there are lots of tutorials for it on YouTube.

HTML, CSS, and Bootstrap

Many modern websites are responsive: they automatically adapt to various size screens and devices, from phones to desktops. Writing responsive websites from scratch requires deep knowledge of HTML, CSS, Javascript, and browsers. Unless you're doing this professionally, you probably don't want to write a responsive site from scratch.

For several projects, I've used the lightweight Skeleton project to create simple, responsive pages.

Recently, I decided to dive deep into the more robust Bootstrap framework originally developed at Twitter.

Once you have a basic idea of how Bootstrap works, the best thing you can do is start playing with it. Since I was familiar with the Pelican static site generator, I decided to switch this blog to a Bootstrap theme, starting with pelican-bootstrap3.

Advanced Stylesheets (LESS, SASS, and Flexbox)

There are several alternatives to writing raw CSS. Two popular ones are Less and SASS. These "preprocessors" allow you to write CSS-like stylesheets but with constructs such as variables, nesting, inheritance, and mathematical operators.

I found this brief tutorial on Less (Safari) helpful, and I've enjoyed Less a lot. I haven't used SASS yet, but it's very similar. I'll probably switch to SASS when I start using Bootstrap 4.

Another modern innovation is the Flexbox layout model for CSS. Stone River Learning has a great tutorial on Flexbox (Safari). It seems that Flexbox is the future of CSS-based layouts, and it's worth learning about.

Advanced JavaScript (Elm, React, Angular, Backbone, Ember)

The JavaScript web framework space has exploded. Many of these are implementations of the Model, View, Controller pattern, including React, Angular, and Ember. These tools allow the creation of complex web apps (as well as mobile apps).

Web Server Operations and DNS

I learned a ton from Linux Web Operations (Safari) by Ben Whaley. "The videos discuss the relationship between web and application servers, load balancers, and databases and introduce configuration management, monitoring, containers, cryptography, and DNS."

I've struggled with DNS configuration over the years, so I watched Cricket Liu's Learning DNS series (Safari). I still wouldn't want to be responsible for a company's complex DNS infrastructure, but I can now configure my own sites' DNS with a little more understanding.

Development Automation

Package Managers

It's likely that any modern web project will have some external Javascript dependencies. Package managers (analogous to PyPI or Anaconda.org for Python) have emerged to help support this. Node.js comes with the npm package manager, but Bower seems to make more sense for front-end development.1 Cody Lindley has a nice introduction to npm and Bower. Bower is well documented and easy to start using. There is a nice Flask extension to help you integrate Bower with your Python project.

Task Automation

Web development comes with lots of build-style tasks that have to happen repeatedly. For example, before you can render a webpage in the browser, you might need to convert the Less to CSS and start a local web server. Before deploying to production, you might want to also run tests and minify your Javascript.

There's a GUI application called CodeKit that can do a lot of these tasks. You can also do it through a Node.js program called Grunt. I haven't used Grunt yet, but it looks like following the documentation would be the best way to get started.

User Experience Design

Conclusion

I've learned a lot in the past few months. I've filled in some gaps about how CSS works. I've gotten a better grasp on the Javascript prototype model. I've learned that I can start with higher level tools (e.g. Bootstrap and jQuery) to rapidly build my side projects with some amount of visual appeal. I'm learning how to use available tools to reduce the boilerplate I have to write, automate tedious tasks, and reduce my personal technical debt.

I still have a lot of learning and a lot of practicing ahead of me, but I'm starting to feel confident that I could make headway on some of my projects. The modern frontend development landscape is massive, varied, and ever changing, but that shouldn't prohibit you from diving in if you want to.

1. The recent buzz in package management has been about Yarn, a replacement for npm.

March 23, 2017

I wrote a few months back about how data scientists need more automation. In particular, I suggested that data scientists would be wise to learn more about automated system configuration and automated deployments.

In an attempt to take my own advice, I've finally been making myself learn Ansible. It turns out that a great way to learn it is to sit down and read through the docs, front to back; I commend that tactic to you. I also put together this tutorial to walk through a practical example of how a working data scientist might use this powerful tool.

What follows is an Ansible guide that will take you from installing Ansible to automatically deploying a long-running Python process to a remote machine and running it in a Conda environment using supervisord. It presumes your development machine is on OS X and the remote machine is Debian-like; however, it shouldn't require too many changes to run it on other systems.

I wrote this post in a Jupyter notebook with a Bash kernel. You can find the notebook, Ansible files, and installation directions on my Github.

Ansible provides "human readable automation" for "app deployment" and "configuration management". Unlike tools like Chef, it doesn't require an agent to be running on remote machines. In short, it translates declarative YAML files into shell commands and runs them on your machines over SSH.

Behind the scenes, Ansible is turning this -m ping command into shell commands. (Try running with the -vvv flag to see what's happening behind the scenes.) It can also execute arbitrary commands; by default, it'll use the Bourne shell sh.

Instead of specifying our inventory with the -i flag each time, we should specify an Ansible inventory file. This file is a text file specifying machines you have SSH access to; you can also group machines under bracketed headings. For example:

Ansible has to be able to connect to these machines over SSH, so you will likely need to have relevant entries in your .ssh/config file.

By default, the Ansible CLI will look for a system-wide Ansible inventory file in /etc/ansible/hosts. You can also specify an alternative path for an inventory file with the -i flag.

For this tutorial, I'd like to have an inventory file specific to the project directory without having to specify it each time we call Ansible. We can do this by creating a file called ./ansible.cfg and setting the name of our local inventory file:

cat ./ansible.cfg
[defaults]
inventory = ./hosts

You can check that Ansible is picking up your config file by running ansible --version.
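To make the relationship between the two files concrete, here is a small Python sketch that reads an ansible.cfg, follows its inventory setting, and lists the hosts under each bracketed group. It is only an illustration of the file layout, not how Ansible itself loads inventories: real inventory files support syntax (host variables, ranges, child groups) that configparser does not handle. The group names and the worker hosts are hypothetical.

```python
import configparser
import tempfile
from pathlib import Path

# Hypothetical file contents; group and worker host names are made up.
ANSIBLE_CFG = """\
[defaults]
inventory = ./hosts
"""

HOSTS = """\
[webservers]
digitalocean

[workers]
worker1
worker2
"""


def read_inventory(project_dir):
    """Return {group: [hosts]} for the inventory named in ansible.cfg."""
    cfg = configparser.ConfigParser()
    cfg.read(project_dir / "ansible.cfg")
    inventory_path = project_dir / cfg["defaults"]["inventory"]

    # Inventory lines are bare host names, so allow keys without values.
    inventory = configparser.ConfigParser(allow_no_value=True)
    inventory.read(inventory_path)
    return {group: list(inventory[group]) for group in inventory.sections()}


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "ansible.cfg").write_text(ANSIBLE_CFG)
    (project / "hosts").write_text(HOSTS)
    groups = read_inventory(project)

print(groups)
```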

For this example, I just have one host, a Digital Ocean VPS. To run the examples below, you should create a VPS instance on Digital Ocean, Amazon, or elsewhere; you'll want to configure it for passwordless authentication. I have an entry like this in my ~/.ssh/config file:

We told Ansible to run this command on all specified hosts in the inventory. It found our inventory by loading the ansible.cfg which specified ./hosts as the inventory file.

It's possible that this will fail for you even if you can SSH into the machine. If the error is something like /bin/sh: 1: /usr/bin/python: not found, this is because your VPS doesn't have Python installed on it. You can install it with Ansible, but you may just want to manually run sudo apt-get -y install python on the VPS to get started.

While ad hoc commands will often be useful, the real power of Ansible comes from creating repeatable sets of instructions called Playbooks.

A playbook contains a list of "plays". Each play specifies a set of tasks to be run and which hosts to run them on. A "task" is a call to an Ansible module, like the "ping" module we've already seen. Ansible comes packaged with about 1000 modules for all sorts of use cases. You can also extend it with your own modules and roles.

Our first playbook will just execute the ping module on all our hosts. It's a playbook with a single play comprised of a single task.

We could add these tasks to a playbook file (like ping.yml), but what if we want to share them among multiple playbooks? For this, Ansible has a construct called Roles. A role is a collection of "variable values, certain tasks, and certain handlers – or just one or more of these things". (You can learn more about variables and handlers in the Ansible docs.)

Roles are organized as subfolders of a folder called "roles" in the working directory. The rapid proliferation of folders in Ansible organization can be overwhelming, but a very simple role is just a file called main.yml nestled several folders deep. In our case, it's in ./roles/supervisor/tasks/main.yml.

Next we want to ensure that Conda is installed on our system. We could write our own role to follow the recommended process. However, Ansible has a helpful tool to help us avoid reinventing the wheel by allowing users to share roles; this is called Ansible Galaxy.

You can search the Galaxy website for miniconda and see that a handful of roles for installing Miniconda exist. I liked this one.

We can install the role locally using the ansible-galaxy command line tool.

ansible-galaxy install -f andrewrothstein.miniconda

You can have the role installed wherever you want (run ansible-galaxy install --help to see how), but by default roles go to /usr/local/etc/ansible/roles/.

You can look at the tasks/main.yml to see the core logic of installing Miniconda. It has tasks to download the installer, run the installer, delete the installer, run conda update conda, and make conda the default system Python.

Once a role is installed locally, you can add it to a play just like you can with roles you wrote. Installing Miniconda is now as simple as:

roles:
- role: andrewrothstein.miniconda

Before we add that to a playbook, I want to customize where Miniconda is installed. If you look back at the main.yml file above, you see a bunch of things surrounded by double curly braces. These are variables (in the Jinja2 template language). From the play, we can see that Miniconda will be installed at {{miniconda_parent_dir}}/{{miniconda_name}}. The role defines these variables in /andrewrothstein.miniconda/defaults/main.yml. We can override the default variables by specifying them in our play.

Cloning a repository with Ansible is easy. We just use the git module. This play will clone the repo into the specified directory. The update: yes flag tells Ansible to update the repository from the remote if it has already been cloned.

Since we've now installed conda and cloned the repository with an environment.yml file, we just need to run conda env update from the directory containing the environment spec. Here's a play to do that:

This is a normal INI-style config file, except it includes Jinja2 variables. We can use the Ansible template module to create a task which fills in the variables with information about our program and copies it into the conf.d folder on the remote machine.

ssh digitalocean cat /var/log/long_running_process.out.log
INFO:root:Process ran for the 1th time
INFO:root:Process ran for the 2th time
INFO:root:Process ran for the 3th time
INFO:root:Process ran for the 4th time

If you're lucky (i.e. your systems and networks were set up sufficiently similar to mine), you can run this exact same command to configure and start a process on your own system. Moreover, you could use this exact same command to start this program on an arbitrary number of machines by simply adding more hosts to your inventory and play spec!

Ansible is a powerful, customizable tool. Unlike some similar tools, it requires very little setup to start using it. As I've learned more about it, I've seen more and more ways in which I could've used it in copious projects in the past; I intend to make it a regular part of my toolkit. (Historically I've done this kind of thing with hacky combinations of shell scripts and Fabric; Ansible would often be better.)

👉 Quite honestly given your questions [about vacation policy] and the fact that you are considering other options, [we] may not be the best choice for you.

These quotes above are some of the reasons I've been given for why I wasn't offered a data science job after interviewing. I've been told a variety of other reasons as well: company decided against hiring remotes after interviewing (I've heard this at least 3 times), company thought I changed jobs too frequently, company decided it didn't have necessary data infrastructure in place for data science work. Multiple companies gave no particular reason; some of these were at least kind enough to notify me they weren't interested. One company hired someone with a Ph.D. from MIT soon after turning me down.

In the last five years, I've clearly interviewed for a lot of data science jobs, and I've also been turned down for a lot of data science jobs. I've spent a good bit of time reflecting on why I wasn't offered this job or that. Several folks have asked me if I had any advice to share on the experience, and I hope to offer that here.

You never really know

I learned with graduate school applications years ago: you rarely truly know why you were turned down. Maybe my GRE scores weren't high enough, or maybe the reviewer rushed through my application in the 5 minutes before lunch. Maybe my statement of interest was too weak, or maybe the department needed to accept an alumni's child.

The same goes for companies. I'm fairly skeptical that the reasons I have been given for why I was passed by are the full story, and I suspect you will rarely (if ever) know the real reasons why you weren't offered a job. I try to use the reasons I hear as a way to help me refine my skills and better present myself, but I don't put too much weight in them.

Some advice anyway

That said, here are a few takeaways from interviewing for probably 20 data science jobs since 2012.

Companies often use interviews as a time to figure out what they're really looking for. I suspect this is rarely intentional. But actually interviewing candidates forces a team to talk through what they're actually looking for, and they often realize they had differing perspectives prior to the interview.

Companies where "data science" is a new addition need your help in understanding what data science can do for them. As much as possible, use the interview to sell your vision for what data science can offer at the company, how you'll get it off the ground, and what the ROI might be.

Being the wrong fit for what a company needs is not ideal. I've come to appreciate a company trying to ensure my abilities align with their needs. You'd hope this was always the case, but I've been hired when it wasn't. That said, I hesitate to say you should always look for this: if you need a job, and someone offers you a job, you should feel free to take it!

Data infrastructure is important and many companies are lacking it. Many data scientists can attest to being hired at a company only to discover the data they needed wasn't available, and they spent months or years building the tools required for them to start their analysis. Many companies are naive about how much engineering effort is required for effective data science. Don't assume that a company with a grand vision for data science necessarily knows what it will take to accomplish that vision.

Many companies are still uneasy about data science being done remotely. I think this is silly, but I'm biased.

There's little consistency as to what you might be asked in a data science interview. I've been asked about Java design patterns, how to solve combinatorics problems, to describe my favorite machine learning model, to explain the SMO algorithm, my opinions about the TensorFlow API, how I do software testing, to analyze a never-before-seen dataset and prepare a presentation in a 4 hour window, the list goes on. I spent a flight to the west coast reading up on the statistics of A/B testing only to be asked largely soft-skills type questions for an entire interview. I've largely given up attempting any special preparation for interviews.

Networking is still king. Hiring is hard, and interviewing is hard; having a prior relationship with an applicant is attractive and reduces hiring uncertainty. In my own experience, my friendships and connections with the data science community on Twitter have shaped my career. Don't downplay the benefits of networking.

Conclusion

So how do you get a data science job? I don't know.

I've been unbelievably fortunate to be continuously employed since college, but I'm not sure how to tell you to repeat that. The best I have to offer is to reiterate the conclusion of my recent talk about data science as a career. Learn and know the hard stuff: linear algebra, probability, statistics, machine learning, math modeling, data structures, algorithms, distributed systems, etc. You probably won't use this knowledge every day in your job, but interviewers love to ask about it anyway.

At the same time, don't forget about the even harder skills: communication, careful thought, prose writing skill, software writing skill, software engineering, tenacity, Stack Overflow. You will use these every day in your job, and they'll help you present yourself well in an interview.1

March 01, 2017

I gave a slightly expanded version of the talk I gave at PyData Carolinas 2016 about connecting Python to compiled languages like C, C++, and Fortran. (Slides from that talk are here.) I appreciate the time and attention of everyone who attended last night, especially Francois Dion for organizing and reminding us of some of the interesting new things in Python 3.5 and 3.6.

Background

I have a long list of objects, each with the properties “color” and “shape”. I want to count the frequency of each color/shape combination. A sample of what I’m trying to achieve could be represented in a grid like this –

         circle  square  star
blue          8      41    18
orange        5      33    25
red          53      64    58

At first I implemented this with a dictionary of collections.Counter instances where the top level dictionary is keyed by shape, like so –
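A minimal sketch of such a structure, assuming the three shapes from the grid above (the name SHAPES and the dict comprehension are illustrative choices, not necessarily how the original was written):

```python
from collections import Counter

# Assumed: the shapes shown in the sample grid above.
SHAPES = ("circle", "square", "star")

# Top-level dict keyed by shape; each value counts colors for that shape.
frequencies = {shape: Counter() for shape in SHAPES}

# A Counter treats missing keys as 0, so a cell can be incremented
# without being initialized first.
frequencies["star"]["blue"] += 1
```

Because a Counter returns 0 for keys it has not seen, the colors do not need to be known in advance.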

Then I counted my frequencies using the code below. (For simplicity, assume that my objects are simple 2-tuples of (shape, color)).

for shape, color in all_my_objects:
    frequencies[shape][color] += 1

So far, so good.

Enter the Pandas

This looked to me like a perfect opportunity to use a Pandas DataFrame which would nicely support the operations I wanted to do after tallying the frequencies, like adding a column to represent the total number (sum) of instances of each color.

It was especially easy to try out a DataFrame because my counting loop ( for...all_my_objects) wouldn’t change, only the definition of frequencies. (Note that the code below requires I know in advance all the possible colors I can expect to see, which the Dict + Counter version does not. This isn’t a problem for me in my real-world application.)

It Works, But…

Both versions of the code get the job done, but using the DataFrame as a frequency counter turned out to be astonishingly slow. A DataFrame is simply not optimized for repeatedly accessing individual cells as I do above.

How Slow is it?

To isolate the effect pandas was having on performance, I used Python’s timeit module to benchmark some simpler variations on this code. In the version of Python I’m using (3.6), the default number of iterations for each timeit test is 1 million.

First, I timed how long it takes to increment a simple variable, just to get a baseline.

Second, I timed how long it takes to increment a variable stored inside a collections.Counter inside a dict. This mimics the first version of my code (above) for a frequency counter. It’s more complex than the simple variable version because Python has to resolve two hash table references (one inside the dict, and one inside the Counter). I expected this to be slower, and it was.
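The first two benchmarks can be approximated with a sketch like the one below. It is not the original benchmark code: the timed statements, the names d, 'shape' and 'color', and the reduced iteration count are all assumptions, and absolute timings will vary by machine.

```python
import timeit

# Assumed: fewer iterations than timeit's default of 1,000,000,
# only to keep the run short.
ITERATIONS = 100_000

# Baseline: increment a plain local variable.
simple_seconds = timeit.timeit("i += 1", setup="i = 0", number=ITERATIONS)

# Increment a value stored in a Counter that is itself stored in a dict.
nested_seconds = timeit.timeit(
    "d['shape']['color'] += 1",
    setup="from collections import Counter; d = {'shape': Counter()}",
    number=ITERATIONS,
)

print(f"simple variable: {simple_seconds:.3f}s")
print(f"dict + Counter:  {nested_seconds:.3f}s")
```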

Third, I timed how long it takes to increment one cell inside a 2×2 NumPy array. Since Pandas is built atop NumPy, this gives an idea of how the DataFrame’s backing store performs without Pandas involved.

Fourth, I timed how long it takes to increment one cell inside a 2×2 Pandas DataFrame. This is what I had used in my real code.

Benchmark Results Summary

Here’s a summary of the results from above (decimals truncated at 3 digits). The rightmost column shows the results normalized so the fastest method (incrementing a simple variable) equals 1.

Method              Actual (seconds)   Normalized
Simple variable                0.092        1
Dict + Counter                 0.683        7.398
NumPy 2D array                 0.890        9.639
Pandas DataFrame             157.564     1704.784

As you can see, resolving the index references in the middle two cases (Dict + Counter in one case, NumPy array indices in the other) slows things down, which should come as no surprise. The NumPy array is a little slower than the Dict + Counter.

The DataFrame, however, is roughly 175 to 230 times slower than either of those two methods. Ouch!

I can’t really even give you a graph of all four of these methods together because the time consumed by the DataFrame throws the chart scale out of whack.

Here’s a bar chart of the first three methods –

Here’s a bar chart of all four –

Why Is My DataFrame Access So Slow?

One of the nice features of DataFrames is that they support dictionary-like labels for rows and columns. For instance, if I define my frequencies to look like this –

This flexibility has a price. Slicing (which is what is invoked by the square brackets) calls an object’s __getitem__() method. The parameter to __getitem__() is whatever was inside the square brackets. A DataFrame’s __getitem__() has to figure out what the passed parameter represents. Determining whether the parameter is a label reference, a callable, a boolean array, or something else takes time.

If you look at the DataFrame’s __getitem__() implementation, you can see all the code that has to execute to resolve a reference. (I linked to the version of the code that was current when I wrote this in February of 2017. By the time you read this, the actual implementation may differ.) Not only does __getitem__() have a lot to do, but because I’m accessing a cell (rather than a whole row or column), there are two slice operations, so __getitem__() gets invoked twice each time I increment my counter.
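The double invocation can be seen with a toy stand-in. The Recorder class below is hypothetical and has nothing to do with the Pandas implementation; it only logs each key that reaches __getitem__() when a nested cell is read.

```python
class Recorder:
    """Hypothetical container that logs every key passed to __getitem__()."""

    def __init__(self, data, log):
        self._data = data
        self._log = log

    def __getitem__(self, key):
        self._log.append(key)
        value = self._data[key]
        # Wrap nested dicts so the second pair of brackets is logged too.
        if isinstance(value, dict):
            return Recorder(value, self._log)
        return value


calls = []
grid = Recorder({"circle": {"blue": 8}}, calls)

# One cell access, two pairs of square brackets.
value = grid["circle"]["blue"]
print(value, calls)
```

Reading one cell produces two logged calls, one per pair of brackets.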

This explains why the DataFrame is so much slower than the other methods. The dictionary and Counter both only support key lookup in a hash table, and a NumPy array has far fewer slicing options than a DataFrame, so its __getitem__() implementation can be much simpler.

Better DataFrame Indexing?

DataFrames support a few accessors that exist explicitly to support “fast” getting and setting of scalars. Those accessors are .at[] (for label lookups) and .iat[] (for integer-based index lookups). DataFrames also provide get_value() and set_value(), but those methods are deprecated in the version I have (0.19.2).

“Fast” is how the Pandas documentation describes these methods. Let’s use timeit to get some hard data. I’ll try .at and .iat; I’ll also try get_value()/set_value() even though they’re deprecated.
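As a rough illustration of the syntax (this is not the benchmark code), the sketch below increments single cells through .at and .iat. Both are used with square brackets rather than called like functions. The zero-filled frame and its labels are assumptions borrowed from the color/shape grid earlier in the post.

```python
import pandas as pd

# Assumed labels, taken from the color/shape grid earlier in the post.
colors = ["blue", "orange", "red"]
shapes = ["circle", "square", "star"]

# A zero-filled frame: one row per color, one column per shape.
frequencies = pd.DataFrame(0, index=colors, columns=shapes)

# Label-based scalar access: row label first, then column label.
frequencies.at["blue", "star"] += 1

# Integer-position scalar access: row 2 is "red", column 0 is "circle".
frequencies.iat[2, 0] += 1

print(frequencies)
```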

These methods are better, but they’re still pretty bad. Let’s put those numbers in context by comparing them to other techniques. This time, for normalized results, I’m going to use my Dict + Counter method as the baseline of 1 and compare all other methods to that. The row “DataFrame (naïve)” refers to naïve slicing, like frequencies[0][0].

Method                 Actual (seconds)   Normalized
Dict + Counter                    0.683        1
NumPy 2D array                    0.890        1.302
DataFrame (get/set)              15.050       22.009
DataFrame (at)                   36.331       53.130
DataFrame (iat)                  42.015       61.441
DataFrame (naïve)               157.564      230.417

The best I can do with a DataFrame uses deprecated methods, and is still over 20 times slower than the Dict + Counter. If I use non-deprecated methods, it’s over 50 times slower.

Workaround

I like label-based access to my frequency counters, I like the way I can manipulate data in a DataFrame (not shown here, but it’s useful in my real-world code), and I like speed. I don’t necessarily need blazing fast speed, I just don’t want slow.

I can have my cake and eat it too by combining methods. I do my counting with the Dict + Counter method, and use the result as initialization data to a DataFrame constructor.

The rows and columns appear in essentially random order; they’re ordered by whatever order Python returns the dict keys during DataFrame initialization. Getting them in a specific order is left as an exercise for the reader.

There’s one more detail to be aware of. If a particular (shape, color) combination doesn’t appear in my data, it will be represented by NaN in the DataFrame. They’re easy to set to 0 with frequencies.fillna(0).
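A sketch of this combined approach is shown below. The sample data, the use of setdefault(), and the explicit sorting are assumptions added for illustration; the counting loop mirrors the one shown earlier in the post.

```python
from collections import Counter

import pandas as pd

# Assumed sample data: (shape, color) 2-tuples.
all_my_objects = [
    ("circle", "blue"),
    ("star", "red"),
    ("circle", "blue"),
    ("square", "red"),
]

# Count with plain dicts and Counters, which is the fast part.
counts = {}
for shape, color in all_my_objects:
    counts.setdefault(shape, Counter())[color] += 1

# Outer keys (shapes) become columns, inner keys (colors) become rows.
# Combinations that never occurred come out as NaN, so fill them with 0.
frequencies = pd.DataFrame(counts).fillna(0).astype(int)

# Sort rows and columns by label to get a predictable order.
frequencies = frequencies.sort_index().sort_index(axis=1)

# The kind of follow-up operation a DataFrame makes easy: a per-color total.
frequencies["total"] = frequencies.sum(axis=1)
print(frequencies)
```

The per-cell increments happen in plain Python containers, and the DataFrame is built only once at the end.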

Conclusion

What I was trying to do with Pandas – unfortunately, the very first thing I ever tried to do with it – didn’t play to its strengths. It didn’t break my code, but it slowed it down by a factor of ~1700. Since I had thousands of items to process, the difference was hard to overlook!

Pandas looks great for some things, and I expect I’ll continue using it. This was just a bump in the road, albeit an interesting one.

February 14, 2017

I recently gave a talk to the Duke Big Data Initiative entitled Dr. Hopper, or How I Quit My Ph.D. and Learned to Love Data Science. The talk was well received, and my slides seemed to resonate in the Twitter data science community.

I've started a long-form blog post with the same message, but it's not done yet. In the meantime, I wanted to share the slides that went along with the talk.

February 11, 2017

Summary

I recently wrote a blog post that involved exception handling, and gave short shrift to the part of exception handling I didn’t want to talk about in order to focus on the part I did want to talk about. For some readers, that clearly backfired.

Background

My recent blog post about coercing Python objects to integers caught people’s attention in a way I hadn’t intended. The point I was trying to make was that an innocent-looking call like int(an_object) calls the method an_object.__int__(), and since that can be arbitrary code, it can raise arbitrary exceptions. Therefore, it’s insufficient to catch only the usual exceptions of ValueError and TypeError if you don’t know the type of an_object in advance.

Here’s the code I suggested –

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are
    # really likely.
    except Exception:
        return else_value

Several commenters objected to the fact that this code discards (and therefore silences/masks/hides) all exceptions. Here’s why I made that choice.

The Two Parts of Exception Handling

In Python, there are two parts to consider about exception handling: what to catch, and what to do with the exception once you’ve caught it. My intention was to write only about the former.

The latter is an interesting topic, too. Once you’ve caught an exception, you might want to log it and then discard it, log it and then re-raise it, re-raise it as a different exception, silence it, let it pass up to the caller, modify its attributes and re-raise it, etc. There’s enough material for an entire blog post about different ways to react to an exception, and the pros and cons of each.
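Two of those reactions, sketched under stated assumptions: the function names and the ConversionError class are hypothetical and exist only for this illustration. The first function logs the exception and re-raises it unchanged; the second re-raises it as a different exception while keeping the original attached as its cause.

```python
import logging

logger = logging.getLogger(__name__)


class ConversionError(Exception):
    """Hypothetical application-level error used only in this sketch."""


def int_logged(value):
    """Log any failure, then let the original exception continue upward."""
    try:
        return int(value)
    except Exception:
        logger.exception("int() failed for %r", value)
        raise


def int_translated(value):
    """Re-raise any failure as ConversionError, chained to the original."""
    try:
        return int(value)
    except Exception as exc:
        raise ConversionError(f"cannot convert {value!r}") from exc
```

With `raise ... from exc`, the original exception remains available as `__cause__`, so nothing is silenced.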

Someday I might write that post about different ways to react to trapped exceptions, and if I do, I’ll dedicate the entire post to the subject to give it the attention it deserves. That other blog post – that was not it. In fact, it was the opposite. I gave the topic of processing the trapped exception as little attention as possible so as not to detract attention from what I wanted to be the main topic (what exceptions need to be trapped).

That backfired.

Conclusion

My post was not advocacy of discarding exceptions, nor was it advocacy of not discarding exceptions. What’s the right choice? It depends. One situation where you might want to discard exceptions is in a blog post where you’re trying to keep the code as brief as possible for readability. Then again, you might regret that. :-)

In the future, I’ll be clearer about what shortcuts I’m taking for brevity of presentation.

Agree? Disagree? I’d like to hear from you. I like it when people agree with me. Those who disagree can expand my horizons, and I like that too. In short, all civil comments are welcome. I feel I’ve spent enough time thinking about this topic for now, but that doesn’t make me right! Let me know what you think.

I was there with Tobias McNulty of Caktus Group. We (Tobias and I) trained the Libyan employees of Libya’s High National Election Commission (HNEC) in the maintenance and use of the HNEC-commissioned SMS-based voter registration system that I had helped to develop while working with Caktus. The system has been open sourced as Smart Elect.

If the big picture was promoting democracy, the medium picture was training system admins and developers. And the very small picture was working together on the nitty gritty of features and bug fixes, like figuring out that if a @property method raises an exception when invoked by hasattr(), the exception isn’t propagated under Python 2.7.

The admin training consisted of a comprehensive review of the system, including the obscure corners and edge case handling. The developers were eager to get their hands dirty, so after some organizational review, we dove into fixing bugs and implementing some new features that HNEC wanted.

Abdullah (Photo by Tobias McNulty)

Tobias and I worked with the developers as both mentors and peers. Grinding through bugs from start to finish was really valuable. Our trainees have good development experience, but working in groups with us allowed them to participate in our approach to debugging, problem reporting, development, and test. It seemed a little different from what they were used to. We were very methodical about creating an issue in our tracker, creating a branch for that issue, reviewing one another’s code, documenting the fix, etc. “It’s a lot of process,” said one trainee after working through one particular bug with us. He’s right. I wish I had thought to ask if Libyan culture has a proverb similar to “For want of a nail…“. I could have said, “For want of filing an issue in the tracker, a voter was disenfranchised,” but it doesn’t have the same ring to it.

Tobias and Ahmed

This was my first trip to Africa, and, grand notions aside, what stood out to me was how mundane much of the experience was. The guys we worked with would have fit right in at any coding meetup I’ve been to. They had opinions about laptops. They were distracted by their phones. Everyone enjoyed a successful bug hunt. I remember one trainee being tired at 5PM, saying he had no more left in him, and seeing him there grinning 2 hours later when we finally solved the problem we’d been working on.

Outside of the training, I especially enjoyed the dinners at Sakura/Pasta Cosy and Chez Zina (my favorites, in that order).

We also ate at Le Bon Vieux Temps, where the handwritten chalkboard menu is carted from table to table on a charming-but-impractical frame. Tunisia is principally French speaking, with Arabic on an almost equal footing. At Le Bon Vieux Temps (“The Good Old Times”), the menu was all in French, and my vestigial French came in handy for translating the menu into English for the Libyans who in turn peppered the waiter with questions in Arabic. (That night in the restaurant began and ended my career as a French-to-English translator.)

On the weekends we rested, walked in the city, and paid a visit to the Bardo National Museum. The Bardo was famously attacked in 2015, and has since sprouted a razor wire fence around the entire property. Bored soldiers sat on a truck by the gate and motioned us to enter. It’s a nice museum, and I’m glad I went.

Inside the classroom and out, I got to know and really like our Libyan colleagues. They were generous with their good humor and kindness. If they lacked anything, it was a willingness to complain.

Libya is a difficult place to live at the moment. I think we all know that in an abstract sense, but talking to my Libyan friends made it more concrete for me. Banks don’t have enough cash. Electricity isn’t reliable. People they know have been kidnapped. My friends have a lot on their minds, and yet they found room to squeeze in opinions about good software development practices.

Munir

I’m glad I got the chance to go, and to get to know the people I did. In addition to working with Tobias and the Libyans, I had a lot of non-work experiences I’ll remember for a long time. I walked among ruins in Carthage that are over 2000 years old. I drove solo (and lost) through rush hour traffic in Tunis and survived. I saw a Tunisian wedding, and got to use the word “ululating” for the first time outside of Scrabble or Bananagrams. I swam in the Mediterranean. I saw flocks of flamingoes (many, many thanks to Hichem and Claudia of Les Amis des Oiseaux).

HNEC is now better positioned than ever to use the Smart Elect system, and I hope they do so again soon. That’s partly for egotistical reasons — I like to see my work get used. Who doesn’t? But more importantly, if it gets used, that means Libyans are voting to determine their own future.

January 02, 2017

Summary

In my opinion, the best way in Python to safely coerce things to integers requires use of an (almost) “naked” except, which is a construct I rarely want to use. Read on to see how I arrived at this conclusion, or you can jump ahead to what I think is the best solution.

The Problem

Suppose you had to write a Python function to convert string values representing temperatures to integers, like this list:

['22', '24', '24', '24', '23', '27']

The strings come from a file that a human has typed in, so even though most of the values are good, a few will have errors ('25C') that int() will reject.
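A first attempt might look like the sketch below. The name force_to_int comes from the traceback in this post, but the body here is an assumption about what such a function would contain: it catches only ValueError and returns None for values int() rejects.

```python
def force_to_int(value):
    """Return value as an int, or None if int() rejects it with ValueError."""
    try:
        return int(value)
    except ValueError:
        return None


readings = ['22', '24', '25C', '27']
print([force_to_int(reading) for reading in readings])
```

Bad strings such as '25C' become None, but an input like None raises TypeError, which this version does not catch.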

Suppose this function gets input that’s even more unexpected, like None —

>>> print(force_to_int(None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in force_to_int
TypeError: int() argument must be a string or a number, not 'NoneType'

Hmmm, let’s write a better version that catches TypeError in addition to ValueError —

While the class Unintable is contrived, it reminds us that classes control their own conversion to int, and can raise any error they please, even a custom error. A scenario that’s more realistic than the Unintable class might be a class that wraps an industrial sensor. Calling int() on an instance normally returns a value representing pressure or temperature. However, it might reasonably raise a SensorNotReadyError.
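The sensor scenario can be sketched as follows. PressureSensor, its ready flag, and the reading of 101 are hypothetical; SensorNotReadyError is the custom error named above. The version of force_to_int shown here is an assumed one that catches both ValueError and TypeError, and the custom error still escapes it.

```python
class SensorNotReadyError(Exception):
    """Raised when the (hypothetical) sensor cannot supply a reading yet."""


class PressureSensor:
    """Hypothetical wrapper around an industrial sensor."""

    def __init__(self, ready):
        self.ready = ready

    def __int__(self):
        # A class controls its own conversion and may raise anything it likes.
        if not self.ready:
            raise SensorNotReadyError("sensor is still warming up")
        return 101


def force_to_int(value):
    """Assumed version that catches ValueError and TypeError only."""
    try:
        return int(value)
    except (ValueError, TypeError):
        return None
```

Calling force_to_int(PressureSensor(ready=False)) raises SensorNotReadyError, because that error is neither a ValueError nor a TypeError.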

And Finally, the Naked Except

Since any exception is possible when calling int(), our code has to accommodate that. That requires the ugly “naked” except. A “naked” except is an except statement that doesn’t specify which exceptions it catches, so it catches all of them, even SyntaxError. They give bugs a place to hide, and I don’t like them. Here, I think it’s the only choice:

It inherits from BaseException instead of Exception so that it is not accidentally caught by code that catches Exception. This allows the exception to properly propagate up and cause the interpreter to exit.

The difference between these two is only a side note here, but I wanted to point it out because (a) it was educational for me and (b) it explains why I’ve updated this post to hedge on what I was originally calling a ‘naked’ except.
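The difference can be seen in a small sketch. The helper names below are hypothetical. SystemExit inherits from BaseException rather than Exception, so a bare except swallows it while except Exception lets it propagate.

```python
def run_with_bare_except(action):
    """Catches everything, including SystemExit and KeyboardInterrupt."""
    try:
        action()
    except:  # noqa: E722 (deliberately bare for the demonstration)
        return "caught"
    return "ok"


def run_with_except_exception(action):
    """Catches only exceptions derived from Exception."""
    try:
        action()
    except Exception:
        return "caught"
    return "ok"


def request_exit():
    raise SystemExit(1)


print(run_with_bare_except(request_exit))
try:
    run_with_except_exception(request_exit)
except SystemExit:
    print("SystemExit propagated past 'except Exception'")
```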

The Final Version

We can make this a bit nicer by allowing the caller to control the non-int return value, giving the “naked” except a fig leaf, and changing the function name —

def int_or_else(value, else_value=None):
    """Given a value, returns the value as an int if possible.
    If not, returns else_value which defaults to None.
    """
    try:
        return int(value)
    # I don't like catch-all excepts, but since objects can raise arbitrary
    # exceptions when executing __int__(), then any exception is
    # possible here, even if only TypeError and ValueError are
    # really likely.
    except Exception:
        return else_value

December 27, 2016

Athena Setup and Quick Start

Last week, I needed to retrieve a subset of some log files stored in S3. This seemed like a good opportunity to try Amazon's new Athena service. According to Amazon:

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis.

Athena uses Presto in the background to allow you to run SQL queries against data in S3. On paper, this seemed equivalent to and easier than mounting the data as Hive tables in an EMR cluster.

The Athena user interface is similar to Hue and even includes an interactive tutorial where it helps you mount and query publicly available data. It was easy for me to mount my private data using the same CREATE statement I'd run in Hive:

At this point, I could write SQL queries against default.logs. Queries run from the Athena UI run in the background; even if you close the browser window, the query continues to run. Up to 5 queries can be run simultaneously.

Query results can be downloaded from the UI as CSV files. Results are also written as a CSV file to an S3 bucket; by default, results go to s3://aws-athena-query-results-<account-id>-region/. You can change the bucket by clicking Settings in the Athena UI.

Up to this point, I was thrilled with the Athena experience. However, after this, I started to uncover the limitations.

Athena Limitations

First, Athena doesn't allow you to create an external table on S3 and then write to it with INSERT INTO or INSERT OVERWRITE. Thus, you can't script where your output files are placed. More unsupported SQL statements are listed here.

Next, the Athena UI only allowed one statement to be run at a time. Because I wanted to load partitioned data, I had to run a bunch of statements of the form ALTER TABLE default.logs ADD PARTITION (d = numeric-date) LOCATION 's3://bucket/path/numeric-date/'; using the Athena UI would've required me to run these one day at a time. Thankfully, I was able to run them all at once in SQL Workbench.
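Statements like these are easy to generate with a short script. The sketch below is an illustration only: the function name is hypothetical, the YYYYMMDD format for numeric-date is an assumption, and the table name and S3 prefix are the placeholders from the text above.

```python
from datetime import date, timedelta


def partition_statements(start, end, table="default.logs",
                         prefix="s3://bucket/path"):
    """Build one ALTER TABLE ... ADD PARTITION statement per day, inclusive."""
    statements = []
    day = start
    while day <= end:
        numeric_date = day.strftime("%Y%m%d")  # assumed date format
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION (d = {numeric_date}) "
            f"LOCATION '{prefix}/{numeric_date}/';"
        )
        day += timedelta(days=1)
    return statements


for statement in partition_statements(date(2016, 12, 1), date(2016, 12, 3)):
    print(statement)
```

The printed statements can then be pasted into a client that accepts multiple statements at once.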

Third, Athena's output format is highly limited. It strictly outputs CSV files where every field is quoted. This was particularly problematic for me because I hoped to later load my data into Impala, and Impala can't extract text data from quoted fields! I was told by Athena support "We do plan to make improvements in this area but I don’t have an ETA yet."

Finally, Athena fell flat on its face in the presence of bad records. I'm not sure whether I had bad GZIPs or malformed logs, but when I did, Athena stopped in its tracks. For my application, I needed my query engine to be able to ignore bad files. Adding to the frustration, even when a query failed, Athena would write partial output (up to the failure) to S3, yet the output files didn't provide any indication that they were partial, incomplete output.

Conclusion

My first encounter with Athena was a flop. I ended up switching to EMR and filtering my logs with Hive. Until it offers more control over output and better error handling, Athena will be of limited value to me.

One thing I mention in the code that’s worth repeating here is that the code uses ElementTree to manipulate XML. It’s sufficient for this demo, and the fact that it’s part of the Python standard library means you can run the demo without installing any third party libraries. For real world (i.e. non-demo) usage, I recommend lxml as a more robust and helpful alternative to ElementTree.

A Curious Coincidence: Stinkin’ Badges

The title of my PyOhio talk was “We Don’t Need No Stinkin’ PDF Library: Build PDFs with Python the Lazy Way”. You know the “we don’t need no stinkin’ [whatever]” meme, don’t you? It’s from the Mel Brooks movie Blazing Saddles. (You can find the clip on YouTube.) Did you know that Blazing Saddles is quoting another movie?

The night before I gave my talk, I walked from my AirBnB to a nearby bar and bottle shop. (It’s simply called “The Bottle Shop”. Ohioans are plain dealers, apparently). I settled in there, happy with a pint of stout. On the big screen they were playing an old black and white Western — The Treasure of the Sierra Madre.

It’s pretty close to the line from B. Traven’s novel of the same name.

I didn’t have time in my talk to mention Blazing Saddles, the mysterious B. Traven, The Treasure of the Sierra Madre, Humphrey Bogart, The Bottle Shop, nor the stout. But I was amused by our brief coincidence in Columbus.

November 18, 2016

My former colleagues at Parse.ly wrote the fantastic pykafka library with an optional C backend using rdkafka. I've had trouble getting it to work, and here are a few things I've learned:

The version of rdkafka installable with apt-get was out of date, and pykafka couldn't find the headers it needed. I instead used the simple build instructions in the rdkafka README to build it from head.

I was getting the error ImportError: librdkafka.so.1: cannot open shared object file: No such file or directory when trying to use rdkafka from pykafka. Setting LD_LIBRARY_PATH=/usr/local/lib works around it in the short term. However, I fixed it permanently by running sudo ldconfig after building rdkafka.

Pykafka has to be installed after building rdkafka. At the moment, pykafka tries to build a C extension to connect to rdkafka, and if that fails, it will install without offering the rdkafka backend. Check the output of pip install pykafka to see if the rdkafka extension built.
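Here is a small sketch for checking both failure modes from Python instead of reading build output. It uses only the standard library. The helper names are my own, and I'm assuming the optional backend lives in a module named pykafka.rdkafka, so adjust that name if your version differs.

```python
import ctypes.util
import importlib


def shared_library_visible(name="rdkafka"):
    """Return True if the dynamic loader can locate lib<name>.

    If this is False after building rdkafka, running `sudo ldconfig`
    (or setting LD_LIBRARY_PATH) is the likely fix.
    """
    return ctypes.util.find_library(name) is not None


def module_importable(module_name):
    """Return True if the named module imports without an ImportError."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("librdkafka visible to loader:", shared_library_visible())
    # Assumption: the optional C backend is exposed as pykafka.rdkafka.
    print("pykafka.rdkafka importable:", module_importable("pykafka.rdkafka"))
```

If the first check passes and the second fails, pykafka was probably installed before rdkafka was built, and reinstalling pykafka is worth a try.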

November 15, 2016

Whether we are managing production services or running computations on AWS machines, many data scientists are working on computers besides their laptops.

For me, this often takes the form of SSH-ing into remote boxes, manually configuring the system with a combination of apt installs, Conda environments, and bash scripts.

To run my service or scripts, I open a tmux window, activate my virtual environment, and start the process.

When I need to check my logs or see the output, I SSH back into each box, reconnect to tmux (after I remember the name of my session), and tail my logs. When running on multiple boxes, I repeat this process N times. If I need to restart a process, I flip through my tmux tabs until I find the correct process, kill it with a Ctrl-C, and use the up arrow to reload the last run command.

I recently introduced several colleagues to some Python-based tools that can help. Fabric is a "library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks." Fabric allows you to encapsulate sequences of commands as you might with a Makefile. Its killer feature is the ease with which it lets you execute those commands on remote machines over SSH. With Fabric, you could tail all the logs on all your nodes with a single command executed in your local terminal. There are a number of talks about Fabric on YouTube if you want to learn more. One of my colleagues reduced his daily workload by writing his system management tasks into a Fabric file.
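To make the "tail all the logs on all your nodes" idea concrete, here is a rough sketch of the loop that Fabric saves you from writing. This is not Fabric's API. It shells out to the ssh command with the standard library, and the host names and log path are made up for illustration.

```python
import shlex
import subprocess


def build_tail_command(host, log_path, lines=20):
    """Build the argv for tailing a remote log over ssh."""
    remote = "tail -n {} {}".format(int(lines), shlex.quote(log_path))
    return ["ssh", host, remote]


def tail_all(hosts, log_path, lines=20, timeout=30):
    """Run the tail command on each host and collect stdout by host name."""
    output = {}
    for host in hosts:
        completed = subprocess.run(
            build_tail_command(host, log_path, lines),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        output[host] = completed.stdout
    return output


if __name__ == "__main__":
    # Hypothetical hosts and path; replace with your own.
    for host, text in tail_all(["box1", "box2"], "/var/log/app.log").items():
        print("==", host, "==")
        print(text)
```

Fabric adds connection handling, host lists, and a command-line entry point on top of this, which is why I'd reach for it rather than growing a script like the one above.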

Another great tool is Supervisor. If you run long running processes in tmux/screen/nohup, Supervisor might be for you. It allows you to define the tasks you want to run in an INI file and "provides you with one place to start, stop, and monitor your processes". Supervisor will log the stdout and stderr to a log location of your choice. It can be a little confusing to set up, but will likely make your life easier in the longer run.
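As a sketch of what that INI file contains, the snippet below renders a single program section with Python's configparser. The option names (command, autostart, autorestart, stdout_logfile, stderr_logfile) are standard Supervisor settings. The program name, command, and log directory are placeholders I made up.

```python
import configparser
import io


def render_program_section(name, command, log_dir):
    """Return the text of a Supervisor [program:<name>] section."""
    config = configparser.ConfigParser(interpolation=None)
    config["program:" + name] = {
        "command": command,
        "autostart": "true",
        "autorestart": "true",
        # Supervisor writes the process's stdout and stderr to these files.
        "stdout_logfile": "{}/{}.out.log".format(log_dir, name),
        "stderr_logfile": "{}/{}.err.log".format(log_dir, name),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_program_section("worker", "python worker.py", "/var/log/myapp"))
```

In practice you would probably write this file by hand. Generating it is mostly useful when you have many near-identical processes to manage.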

A tool I want to learn but haven't is Ansible, "a free-software platform for configuring and managing computers which combines multi-node software deployment, ad hoc task execution, and configuration management". Unlike Chef and Puppet, Ansible doesn't require an agent on the systems you need to configure; it does all the configuration over SSH. You can use Ansible to configure your systems and install your dependencies, even Supervisor! Ansible is written in Python and, mercifully, doesn't require learning a Ruby-based DSL (as does Chef).

Recently I've been thinking that Fabric, Supervisor, and Ansible combined become a powerful toolset for management and configuration of data science systems. Each tool is also open source and can be installed in a few minutes. Each tool is well documented and offers helpful tutorials on getting started; however, learning to use them effectively may require some effort.

I would love to see someone create training materials on these tools (and others!) focused on how data scientists can improve their system management, configuration, and operations. A screencast series may be the perfect thing. Someone please help data scientists be lazier, do less work, and reduce the mental overhead of dealing with computers!

It turns out you can easily use it to filter a DatetimeIndex level by a single date with df['2016-11-07'] or a range of dates with df['2016-11-07':'2016-11-11']. This applies whether or not it's a MultiIndex.

If you get an error like KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)', it's because "MultiIndex Slicing requires the index to be fully lexsorted". You may fix your problem by calling df = df.sort_index().
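Here is a minimal sketch of date-string filtering on a plain DatetimeIndex, with made-up data. The rows start out of order on purpose, so sort_index() is called before slicing. I use .loc because newer pandas releases expect it for single-date row selection.

```python
import pandas as pd

# Made-up timestamps, deliberately unsorted.
timestamps = pd.to_datetime([
    "2016-11-12 09:00",
    "2016-11-07 17:00",
    "2016-11-06 09:00",
    "2016-11-09 09:00",
    "2016-11-07 09:00",
])
df = pd.DataFrame({"requests": [5, 2, 1, 4, 3]}, index=timestamps)

# Slicing by date strings needs a sorted index.
df = df.sort_index()

# Every row that falls on a single day.
one_day = df.loc["2016-11-07"]

# Every row from the start of the 7th through the end of the 11th.
date_range = df.loc["2016-11-07":"2016-11-11"]

print(one_day)
print(date_range)
```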

For the last three weeks in Tunisia (which is next door to Libya and a whole lot safer), Caktus’ Tobias McNulty and I trained a dozen HNEC employees on how to use, develop, and maintain the system. We talked about Python, Django, open source culture, GitHub Flow, and, of course, the upcoming U.S. election.

On the eve of that election, I thought it appropriate to express gratitude for my opportunity to participate in the messy sausage-making that is democracy. Good luck to our new Libyan friends; I hope they get the opportunity to do the same in the very near future.

October 25, 2016

I gave a talk last week at Research Triangle Analysts on understanding probabilistic topic models (specifically LDA) by using Python for simulation. Here's the description:

Latent Dirichlet Allocation and related topic models are often presented in the form of complicated equations and confusing diagrams. Tim Hopper presents LDA as a generative model through probabilistic simulation in simple Python. Simulation will help data scientists to understand the model assumptions and limitations and more effectively use black box LDA implementations.
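To give a flavor of what "LDA as a generative model" means, here is a small standard-library sketch of the generative process. It is not the code from the talk. The vocabulary, hyperparameter values, and function names are all my own choices for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_length,
                    alpha=1.0, beta=0.5, seed=0):
    """Simulate documents under the LDA generative story."""
    rng = random.Random(seed)
    # Each topic is a distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng)
              for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # Each document gets its own mixture over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Pick a topic for this word slot, then a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=topics[z])[0])
        docs.append(words)
    return topics, docs


if __name__ == "__main__":
    vocab = ["ball", "goal", "team", "vote", "law", "court"]
    topics, docs = generate_corpus(vocab, n_topics=2, n_docs=3, doc_length=8)
    for doc in docs:
        print(" ".join(doc))
```

Fitting LDA runs this story in reverse: given only the documents, infer the topics and the per-document mixtures that plausibly produced them.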

Here in part 3, I review the conversation we (the audience and I) had at the end of the PyOhio talk. I committed the speaker’s cardinal sin of not repeating (into the microphone) the questions people asked, so they’re inaudible in the video. In addition, we had some interesting conversations among multiple people that didn’t get picked up by the microphone. I don’t want them to get lost, so I summarized them here.

The most interesting thing I learned out of this conversation is that LibreOffice can open PDFs; once opened they’re like an ordinary LibreOffice document. You can edit them, save them to ODF, export to PDF, etc. Is this cool, or what?

First Question: What about Using Excel or Word?

One of the attendees jumped in to confirm that modern MS Word formats are XML-based. However, he went on to say, the XML contains a statement at the top that says something like “You cannot legally read the rest of this file”. I made a joke about not having one’s lawyer present when reading the file.

In all seriousness, I can’t find anything online that suggests that Microsoft’s XML contains a warning like that, and the few examples I looked at didn’t have any such warning. If you can shed any light on this, please do so in the comments!

We also discussed the fact that one must invoke the office app (LibreOffice or Word, Excel, etc.) in order to render the document to PDF. LibreOffice has a reputation for performing badly when invoked repeatedly for this purpose. LibreOffice 5 may have addressed some of these problems, but as of this writing it’s still pretty new so the jury is still out on how this will work in practice.

Another attendee noted that Microsoft Office can save to LibreOffice's format, so if Word (or Excel) is your document-editing tool of choice, you can still use LibreOffice to render it to PDF. That’s really useful if you edit in MS Office but render on a BSD/Linux server.

Question 2: What about Scraping PDFs?

The questioner noted that scraping a semi-complex PDF is very painful. It’d be ideal, he said, to be able to take a complex form like the 1040 and extract key value pairs of the question and answer. Is the story getting better for scraping PDFs?

My answer was that for the little experience I have with scraping PDFs, I’ve used PDFMiner, and the attendee said he was using the same.

Someone else chimed in that it’s a great use case for [Amazon’s] Mechanical Turk; in his case he was dealing with old faxes that had been scanned.

Question 3: Helper Libraries

Matt Wilson asked if it would make sense to begin building helper libraries to simplify common tasks related to manipulating LibreOffice XML. My answer was that I wasn’t sure since each project has very specific needs. Someone else suggested that one would have to start learning the spec in order to begin creating abstractions.

In the YouTube comments, Paul Hoffman called our attention to OdfPy, a “thin abstraction over direct XML access”. It looks quite interesting.

Comment 1: Back to Scraping

One of the attendees commented that he had used Jython and PDFBox for PDF scraping. “It took a lot to get started, but once I started to figure out my way around it, it was a pretty good tool and it moved pretty speedily as compared to some of the other tools I used.” He went on to say that it was pretty complete and that it worked very well.

Question 4: About XML Parsing

The question was what I used to parse the XML, and my answer was that I used ElementTree from the standard library. Your favorite XML parsing library will work just fine.
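For anyone who hasn't used ElementTree with namespaced XML, here is a small sketch that lists the bookmark names in a document fragment. The sample XML is invented for illustration, not taken from the talk's demo code. The namespace URI is the OpenDocument text namespace.

```python
import xml.etree.ElementTree as ET

# The OpenDocument "text" namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Invented fragment standing in for part of a document's content.xml.
SAMPLE = """<root xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <text:p>Dear <text:bookmark text:name="customer_name"/>,</text:p>
  <text:p>Total due: <text:bookmark text:name="total"/></text:p>
</root>"""


def bookmark_names(xml_text):
    """Return the names of all bookmark elements, in document order."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced tags and attributes as {uri}localname.
    tag = "{%s}bookmark" % TEXT_NS
    attr = "{%s}name" % TEXT_NS
    return [element.get(attr) for element in root.iter(tag)]


if __name__ == "__main__":
    print(bookmark_names(SAMPLE))
```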

Question 5: Protecting Bookmarks

The question was whether or not I did anything special to protect the bookmarks in the document. My answer was that I didn’t. (I’m not even sure it’s possible.) If you go through multiple rounds of editing with your client, those invisible bookmarks are inevitably going to get moved or deleted, so expect a little maintenance work related to that.

Comment 2: Weasyprint

One of the attendees commented that Weasyprint is a useful HTML/CSS to PDF converter. My observation was that tools in this class (HTML/CSS to PDF converters) are not as precise as either of the methods I outlined in this talk, but if you don’t need precision they’re probably a nice way to go.

Question 6: unoconv in a Web Server

Can one use unoconv in a Web server? My answer was that it’s possible, but it’s not practical to use it in-process. For me, it worked to do so in a demo of an intranet application, but that’s about as far as you want to go with it. It’s much more practical to use a distributed processing application (Celery, for example).

One of the attendees concurred that it makes sense to spin it off into a separate process, but “unoconv inexplicably crashes when it feels like it”.
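Here is a sketch of the kind of function I'd hand to a background worker rather than call inside a request. It runs unoconv as a separate process with a timeout, so a hang or crash costs one job rather than the whole server. The function names and the timeout value are my own, and the Celery wiring is left out.

```python
import subprocess


def build_convert_command(input_path):
    """Build the argv for converting one document to PDF with unoconv."""
    return ["unoconv", "-f", "pdf", input_path]


def convert_to_pdf(input_path, timeout=60):
    """Return True if unoconv exits cleanly within the timeout."""
    try:
        completed = subprocess.run(
            build_convert_command(input_path),
            capture_output=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Covers a hung conversion and a missing unoconv executable.
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    # Hypothetical file name for illustration.
    print("converted:", convert_to_pdf("report.odt"))
```

Because the function only reports success or failure, the caller can decide whether to retry, which seems prudent given the crash reports above.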

Comment 3: Converting from Word

The initial comment was that pandoc might help with converting from Word to LibreOffice. This started a conversation which I’d summarize this way:

LibreOffice can open MS Office docs, so use that instead of pandoc and save as LibreOffice

If you open MS Office documents with LibreOffice, double check the formatting because it doesn’t always survive the transition