FLOSS Project Planets

Visualizations help us make sense of the world and allow us to convey large amounts of complex information, data and predictions in a concise form. Expert predictions that need to be conveyed to non-expert audiences, whether they be the path of a hurricane or the outcome of an election, always contain a degree of uncertainty. If this uncertainty is not conveyed in the relevant visualizations, the results can be misleading and even dangerous.

Here, we explore the role of data visualization in plotting the predicted paths of hurricanes. We explore different visual methods to convey the uncertainty of expert predictions and the impact on layperson interpretation. We connect this to a broader discussion of best practices with respect to how news media outlets report on both expert models and scientific results on topics important to the population at large.

No Spaghetti Plots?

We have recently seen the damage wreaked by tropical storm systems in the Americas. News outlets such as the New York Times have conveyed a great deal of what has been going on using interactive visualizations, for Hurricanes Harvey and Irma, for example. These include geographical visualizations of the percentage of people without electricity, the amount of rainfall, the extent of damage and the number of people in shelters, among many other things.

One particular type of plot has understandably been coming up recently and raising controversy: how to plot the predicted path of a hurricane, say, over the next 72 hours. There are several ways to visualize predicted paths, each way with its own pitfalls and misconceptions. Recently, we even saw an article in Ars Technica called Please, please stop sharing spaghetti plots of hurricane models, directed at Nate Silver and fivethirtyeight.

In what follows, I'll compare three common ways, explore their pros and cons and make suggestions for further types of plots. I'll also delve into why these types are important, which will help us decide which visual methods and techniques are most appropriate.

Disclaimer: I am decidedly a non-expert in meteorological matters and hurricane forecasting. But I have thought a lot about visual methods for conveying data, predictions and models. I welcome and actively encourage the feedback of experts, along with that of others.

Visualizing Predicted Hurricane Paths

There are three common ways of creating visualizations for predicted hurricane paths. Before talking about them, I want you to look at them and consider what information you can get from each. Do your best to interpret what each is trying to tell you, in turn, and then we'll delve into their intentions, along with their pros and cons:

From The New York Times. Surrounding text tells us 'One of the best hurricane forecasting systems is a model developed by an independent intergovernmental organization in Europe, according to Jeff Masters, a founder of the Weather Underground. The system produces 52 distinct forecasts of the storm’s path, each represented by a line [above].'

Interpretation and Impact of Visualizations of Hurricanes' Predicted Paths
The Cone of Uncertainty

The cone of uncertainty, a tool used by the National Hurricane Center (NHC) and communicated by many news outlets, shows us the most likely path of the hurricane over the next five days, given by the black dots in the cone. It also shows how certain forecasters are of this path. As time goes on, the prediction becomes less certain, and this is captured by the cone: there is an approximately 66.6% chance that the centre of the hurricane will fall within the bounds of the cone.

Was this apparent from the plot itself?

It wasn't to me initially and I gathered this information from the plot itself, the NHC's 'about the cone of uncertainty' page and weather.com's demystification of the cone post. There are three more salient points, all of which we'll return to:

It is a common initial misconception that the widening of the cone over time suggests that the storm will grow;

The plot contains no information about the size of the storm, only about the potential path of its centre, and so is of limited use in telling us where to expect, for example, hurricane-force winds;

There is essential information contained in the text that accompanies the visualization, as well as the visualization itself, such as the note placed prominently at the top, '[t]he cone contains the probable path of the storm center but does not show the size of the storm...'; when judging the efficacy of a data visualization, we'll need to take into consideration all its properties, including text (and whether we can actually expect people to read it!); note that interactivity is a property that these visualizations do not have (but maybe should).

Spaghetti Plots (Type I)

This type of plot shows several predictions in one figure. On any given Type I spaghetti plot, the visualized trajectories are predictions from models built by different agencies (the NHC, the National Oceanic and Atmospheric Administration and the UK Met Office, for example). They are useful in that, like the cone of uncertainty, they inform us of the general region that may be in the hurricane's path. They are unhelpful, and actually misleading, in that they weight each model (or prediction) equally.

In the Type I spaghetti plot above, there are predictions with varying degrees of uncertainty, from agencies whose previous predictions have met with varying degrees of success. So some paths are more likely than others, given what we currently know, yet this information is not present in the plot. Even more alarmingly, some of the paths are barely even predictions. Take the black dotted line XTRP, which is a straight-line extrapolation of the storm's current trajectory. This is not even a model. Eric Berger goes into more detail in this Ars Technica article.

Essentially, this type of plot provides an ensemble model (compare with aggregate polling). Yet a key aspect of ensemble models is that each constituent model is given an appropriate weight, and these weights need to be communicated in any data visualization. We'll soon see how to do this using a variation on Type I.

Spaghetti Plots (Type II)

These plots show many, say 50, different realizations of a single model. The point is that if we simulate (run) a model several times, it will give a different trajectory each time. Why? Nate Cohn put it well in The Upshot:

"It’s really tough to forecast exactly when a storm will make a turn. Even a 15- or 20-mile difference in when it turns north could change whether Miami is hit by the eye wall, the fierce ring of thunderstorms that include the storm’s strongest winds and surround the calmer eye."

These are perhaps my favourite of the three for several reasons:

By simulating multiple runs of the model, they provide an indication of the uncertainty underlying each model;

They give a picture of relative likelihood of the storm centre going through any given location. Put simply, if more of the plotted trajectories go through location A than through location B, then under the current model it is more likely that the centre of the storm will go through location A;

They are unlikely to be misinterpreted (at least compared to the cone of uncertainty and the Type I plots). All the words required on the visualization are 'Each line represents one forecast of Irma's path'.

One con of Type II is that they are not representative of multiple models but, as we'll see, this can be altered by combining them with Type I plots. Another con is that they, like the others, only communicate the path of the centre of the storm and say nothing about its size. Soon we'll also see how we can remedy this. Note that the distinction between Type I and Type II spaghetti plots is not one that I have found in the literature, but one that I created because these plots have such different interpretations and effects.

For the time being, however, note that we've been discussing the efficacy of certain types of plots without explicitly discussing their purpose, that is, why we need them at all. Before going any further, let's step back a bit and try to answer the question 'What is the purpose of visualizing the predicted path of a hurricane?' Performing such ostensibly naive tasks is often illuminating.

Why Plot Predicted Paths of Hurricanes?

Why are we trying to convey the predicted path of a tropical storm? I'll provide several answers to this in a minute.

But first, let me say what these visualizations are not intended for. We are not using these visualizations to help people decide whether or not to evacuate their homes or towns. Ordering or advising evacuation is something that is done by local authorities, after repeated consultation with experts, scientists, modelers and other key stakeholders.

The major point of this type of visualization is to allow the general populace to be as well-informed as possible about the possible paths of the hurricane and allow them to prepare for the worst if there's a chance that where they are or will be is in the path of destruction. It is not to unduly scare people. As weather.com states with respect to the function of the cone of uncertainty, '[e]ach tropical system is given a forecast cone to help the public better understand where it's headed' and '[t]he cone is designed to show increasing forecast uncertainty over time.'

To this end, I think that an important property would be for a reader to be able to look at it and say 'it is very likely/likely/50% possible/not likely/very unlikely' that my house (for example) will be significantly damaged by the hurricane.

Even better, to be able to say "There's a 30-40% chance, given the current state-of-the-art modeling, that my house will be significantly damaged".

Then we have a hierarchy of what we want our visualization to communicate:

At a bare minimum, we want civilians to be aware of the possible paths of the hurricane.

Then we would like civilians to be able to say whether it is very likely, likely, unlikely or very unlikely that their house, for example, is in the path.

Ideally, a civilian would look at the visualization and be able to read off quantitatively what the probability (or range of probabilities) of their house being in the hurricane's path is.

On top of this, we want our visualizations to be neither misleading nor easy to misinterpret.

The Cone of Uncertainty versus Spaghetti Plots

All three methods perform the minimum required function, to alert civilians to the possible paths of the hurricane. The cone of uncertainty does a pretty good job at allowing a civilian to say how likely it is that a hurricane goes through a particular location (within the cone, it's about two-thirds likely). At least qualitatively, Type II spaghetti plots also do a good job here, as described above, 'if more of the trajectories go through location A than through location B, then under the current model it is more likely that the centre of the storm will go through location A'.

If you plot 50 trajectories, you get a sense of where the centre of the storm will likely be, that is, if around half of the trajectories go through a location, then there's an approximately 50% chance (according to our model) that the centre of the storm will hit that location. None of these methods yet perform the 3rd function and we'll see below how combining Type I and Type II spaghetti plots will allow us to do this.

The major problem with the cone of uncertainty and Type I spaghetti plots is that the cone of uncertainty is easy to misinterpret (in that many people interpret the cone as a growing storm and do not appreciate the role of uncertainty) and that the Type I spaghetti plots are misleading (they make all models look equally believable). These plots then don't satisfy the basic requirement that 'we want our visualizations to be neither misleading nor easy to misinterpret.'

Best Practices for Visualizing Hurricane Prediction Paths

Type II spaghetti plots are the most descriptive and the least open to misinterpretation. But they do fail at presenting the results of all models. That is, they don't aggregate over multiple models like we saw in Type I.

So what if we combined Type I and Type II?

To answer this, I did a small experiment using Python, folium and numpy. You can find all the code here.

I first took one of the NHC's predicted paths for Hurricane Irma from last week, added some random noise and plotted 50 trajectories. Note that, once again, I am a non-expert in all matters meteorological. The noise that I generated and added to the predicted signal/path was not based on any model and, in a real use case, would come from the models themselves (if you're interested, I used Gaussian noise). For the record, I also found it difficult to find data for any of the predicted paths reported in the media. The data I finally used I found here.
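The procedure can be sketched as follows. This is a minimal stand-in for what I did, not the exact code: the waypoints, noise scale and function name are my own inventions, not NHC data.

```python
import numpy as np

# Hypothetical (lat, lon) forecast waypoints standing in for an NHC track.
base_path = np.array([
    [16.0, -54.0], [17.5, -58.0], [19.0, -62.0],
    [20.5, -66.0], [22.0, -70.0], [23.5, -74.0],
])

def simulate_trajectories(path, n=50, scale=0.3, seed=42):
    """Return n noisy copies of path. Gaussian noise (std `scale`, in
    degrees) grows linearly with the forecast step, mimicking the way
    uncertainty increases over time."""
    rng = np.random.default_rng(seed)
    steps = np.arange(1, len(path) + 1).reshape(-1, 1)  # widen over time
    noise = rng.normal(0.0, scale, size=(n, *path.shape)) * steps
    return path + noise

trajectories = simulate_trajectories(base_path)
print(trajectories.shape)  # (50, 6, 2): 50 paths, 6 waypoints, lat/lon
```

Each simulated row can then be drawn as one folium PolyLine over the base map.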

Here's a simple Type II spaghetti plot with 50 trajectories:

But these are possible trajectories generated by a single model. What if we had multiple models from different agencies? Well, we can plot 50 trajectories from each:

One of the really cool aspects of Type II spaghetti plots is that, if we plot enough of them, each trajectory becomes indistinct and we begin to see a heatmap of where the centre of the hurricane is likely to be. All this means is that the more blue in a given region, the more likely it is for the path to go through there. Zoom in to check it out.

Moreover, if we believe that one model is more likely than another (if, for example, the experts who produced that model have produced far more accurate models previously), we can weight these models accordingly via, for example, transparency of the trajectories, as we do below. Note that weighting these models is a task for an expert and an essential part of this process of aggregate modeling.
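As a toy illustration of that weighting step (the agency names and weights here are invented; real weights would be assigned by experts based on historical forecast skill), normalized model weights can be mapped to line opacities:

```python
# Hypothetical per-model weights; expert-assigned in a real workflow.
model_weights = {"ECMWF": 0.5, "GFS": 0.3, "UKMET": 0.2}

max_w = max(model_weights.values())
# Most-trusted model fully opaque; weaker models fade but stay visible.
opacities = {name: round(0.2 + 0.8 * (w / max_w), 2)
             for name, w in model_weights.items()}
print(opacities)  # {'ECMWF': 1.0, 'GFS': 0.68, 'UKMET': 0.52}
```

These values could then be passed as the opacity of each model's folium PolyLine trajectories.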

What the above does is satisfy the first two properties that we want our visualizations to have. To achieve the third, a reader being able to read off that it's, say, 30-40% likely for the centre of a hurricane to pass through a particular location, there are two solutions:

To alter the heatmap so that it moves between, say, red and blue, and include a key that says, for example, that red means a probability greater than 90%;

To transform the heatmap into a contour map that shows regions in which the probability takes on certain values.
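Either option needs per-cell probabilities to display. Here is a minimal, self-contained sketch (synthetic trajectories, my own function name) of computing the empirical probability that a simulated path enters each lat/lon cell:

```python
import numpy as np

# Synthetic stand-in for simulated trajectories:
# shape (n_paths, n_steps, 2) of (lat, lon) waypoints.
rng = np.random.default_rng(0)
base = np.array([[16.0, -54.0], [19.0, -62.0], [22.0, -70.0]])
trajectories = base + rng.normal(0.0, 1.0, size=(50, 3, 2))

def path_probability_grid(trajs, bins=10):
    """Fraction of simulated paths whose centre enters each grid cell."""
    lat_edges = np.linspace(trajs[..., 0].min(), trajs[..., 0].max(), bins + 1)
    lon_edges = np.linspace(trajs[..., 1].min(), trajs[..., 1].max(), bins + 1)
    hits = np.zeros((bins, bins))
    for path in trajs:
        h, _, _ = np.histogram2d(path[:, 0], path[:, 1],
                                 bins=[lat_edges, lon_edges])
        hits += (h > 0)        # count each path at most once per cell
    return hits / len(trajs)   # empirical probability per cell

grid = path_probability_grid(trajectories)
```

Thresholding the grid at chosen levels (e.g. 0.3 and 0.4) gives the contour bands; mapping it through a red-blue colour scale, plus a key, gives the labelled heatmap.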

Also, do note that this will tell somebody the probability that a given location will be hit by the hurricane's centre. You could combine (well, convolve) this with information about the size of the hurricane to transform the heatmap into one of the probability of a location being hit by hurricane-force winds. If you'd like to do this, go and hack around with the code that I wrote to generate the plots above (I plan to write a follow-up post doing this and walking through the code).

Visualizing Uncertainty and Data Journalism

What can we take away from this? We have explored several types of visualization methods for predicted hurricane paths, discussed the pros and cons of each and suggested a way forward for more informative and less misleading plots of such paths, plots that communicate not only the results but also the uncertainty around the models.

This is part of a broader conversation that we need to be having about reporting uncertainty in visualizations and data journalism in general. We need to actively participate in conversations about how experts report uncertainty to civilians via news media outlets. Here's a great piece from The Upshot demonstrating what the jobs report could look like due to statistical noise alone, even if jobs were steady. Here's another Upshot piece showing the role of noise and uncertainty in interpreting polls. I'm well aware that we need headlines to sell news, and of the role of click-bait in the modern news media landscape, but we need to communicate not merely results but also the uncertainty around those results, so as not to mislead the general public and, potentially, ourselves. Perhaps more importantly, the education system needs to shift to equip all civilians with the data literacy and statistical literacy required to navigate the data-driven age. We can all contribute to this.

After a strong start, we continue our early access program (EAP) with its second release. Download EAP 2 now!

Testing RESTful Applications

Many of us work on web applications which expose a RESTful API, or at least an API that pretends to be RESTful. To test these, some of us use cURL, a browser extension, or some other piece of software. There is a REST client in PyCharm, but we've decided it can use some improvement, so we're making an all-new one.

The new REST client is entirely editor based, you write your request in a file, and then run the request to get a response. Sounds easy enough, right?

We'll start out by creating a new todo. This is done by POST-ing to the /todos/ endpoint. To use the new PyCharm REST client, we start by creating a .http file. If we don't intend to save it, we can create a scratch file instead: press Ctrl+Alt+Shift+Insert (Shift+Command+N on macOS) and choose 'HTTP Request' as the type. Let's type our request into the file:
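For illustration, the contents of the .http file might look like the following (the host, port and JSON payload are my assumptions for a hypothetical todo API, not the exact service used in this post):

```
POST http://localhost:8000/todos/
Content-Type: application/json

{"title": "Try the new REST client", "done": false}
```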

Now click the green play button next to the first line, and you should see that the task was created:

You can see the response in the Run tool window, and you might also notice that PyCharm wrote a new line in our file with the name of a .json file. This file contains the response, so if we Ctrl+Click (Cmd+Click) the filename, or use Ctrl+B (Cmd+B) to go to definition we see the full response in a separate file.

Those files become really useful when we do the same request a couple times but get different results. If we use a GET request to get our todo, and then use a PUT to change it, and redo our GET, we’ll now have two files there. We can then use the blue icon with the arrows to see the difference between the responses:

Earlier this year, I showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. A similar blog post showed how to read data from an Apache Kafka topic using Kerberos. In this tutorial I will show how to create a job in Talend Open Studio for Big Data to read data from an Apache Hive table using Kerberos. As a prerequisite, please follow a recent tutorial on setting up Apache Hadoop and Apache Hive using Kerberos.

1) Download Talend Open Studio for Big Data and create a job

Download Talend Open Studio for Big Data (6.4.1 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Create a new job called "HiveKerberosRead".

2) Add and link the components

In the search bar under "Palette" on the right-hand side, enter "hive" and hit enter. Drag "tHiveConnection" and "tHiveInput" to the middle of the screen. Do the same for "tLogRow". "tHiveConnection" will be used to configure the connection to Hive, "tHiveInput" will be used to perform a query on the "words" table we created in Hive (as per the earlier tutorial linked above), and finally "tLogRow" will simply log the data so that we can be sure it was read correctly. The next step is to join the components up. Right-click on "tHiveConnection", select "Trigger/On Subjob Ok" and drag the resulting line to "tHiveInput". Then right-click on "tHiveInput", select "Row/Main" and drag the resulting line to "tLogRow":

3) Configure the components

Now let's configure the individual components. Double click on "tHiveConnection". Select the following configuration options:

Distribution: Hortonworks

Version: HDP V2.5.0

Host: localhost

Database: default

Select "Use Kerberos Authentication"

Hive Principal: hiveserver2/localhost@hadoop.apache.org

Namenode Principal: hdfs/localhost@hadoop.apache.org

Resource Manager Principal: mapred/localhost@hadoop.apache.org

Select "Use a keytab to authenticate"

Principal: alice

Keytab: Path to "alice.keytab" in the Kerby test project.

Unselect "Set Resource Manager"

Set Namenode URI: "hdfs://localhost:9000"

Now click on "tHiveInput" and select the following configuration options:

Select "Use an existing Connection"

Choose the tHiveConnection name from the resulting "Component List".

Click on "Edit schema". Create a new column called "word" of type String, and a column called "count" of type int.

Table name: words

Query: "select * from words where word == 'Dare'"

Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf".

Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see the following output in the Run window in the Studio:

I am pleased to announce that Qt 5.6.3 has been released today. As always with a patch release, Qt 5.6.3 does not bring any new features, just error corrections. For details of the bug fixes in Qt 5.6.3, please check the change logs for each module and the known issues wiki page for Qt 5.6.3.

Qt 5.6 LTS is currently in the 'Strict' phase, and only receives fixes for security issues, crashes, regressions and the like. Since the end of 2016 we have already reduced the number of fixes going into the 5.6 branch, and after Qt 5.6 LTS enters the 'Very Strict' phase it will receive security fixes only. The reason for gradually reducing the number of changes going into an LTS version of Qt is to avoid stability problems. While each fix as such is beneficial, each also carries a risk of behavior changes and regressions, which we want to avoid in LTS releases.

As part of our LTS commitment, we continue to support commercial Qt 5.6 LTS users throughout the three-year standard support period, after which it is possible to purchase extended support. For a description of our support services, please check the recent blog post describing Standard, Extended and Premium Support. In May 2017 we released Qt 5.9 LTS, which includes a wide range of new features, functionality and overall performance improvements. We expect to release the Qt 5.9.2 patch release during September, including all the bug fixes of Qt 5.6.3 and many more. To learn more about the improvements that come with Qt 5.9 LTS, you can find all the relevant blogs and on-demand webinars here.

If you are using the online installer, Qt 5.6.3 can be updated using the maintenance tool. Offline packages are available for commercial users in the Qt Account portal and at the qt.io Download page for open-source users.

On Sunday, in my weekly report on my free software activities, I wrote about how sustainable my current level of activity is. I had identified the risk that the computer I use for almost all of my free software work was slowly dying. Last night it entered an endless reboot loop, and subsequent efforts to save it have failed. I cannot afford to replace this machine, and my next best machine has half the cores, half the RAM and less than half the screen real estate.

I will be at SUSECON in Prague next week to present the new librmb/rbox project sponsored by Deutsche Telekom. The presentation slot is Tuesday, Sep 26, 1:30 PM - 2:30 PM. If you attend and are interested in how to store emails directly in Ceph RADOS, add the session to your schedule.

This talk is based on the speaker’s experience running a Python
focused software company for more than 15 years and a recent consulting
project to support the valuation of a Python startup company in the due
diligence phase.

For the valuation we had to come up with metrics, a catalog of
criteria analyzing risks, potential and benefits of the startup’s
solution, as well as an estimate for how much effort it would take to
reimplement the solution from scratch.

In the talk, I am going to show the metrics we used, how they can be
applied to Python code, the importance of addressing risk factors, well
designed code and data(base) structures.

By following some of the advice from this talk, you should be able to
improve the valuation of your startup or consulting business in
preparation for investment rounds or an acquisition.

If you are interested in learning more about these advanced techniques, eGenix now offers Python project coaching and consulting services to give your project teams advice on how to design Python applications, successfully run projects, or find excellent Python programmers. Please contact our eGenix Sales Team for information.

I just want to spread the word that Bennet Schulz yesterday posted a great short blog post on getting started with Apache Camel using just Java. It shows you the basics of creating a new Camel project, writing your first Camel route, and running it with plain Java.

Less than a month after Krita 3.2.1, we’re getting ready to release Krita 3.3.0. We’re bumping the version because there are some important changes for Windows users in this version!

Alvin Wong has implemented support for the Windows 8 event API, which means that Krita now supports the n-trig pen in the Surface line of laptops (and similar laptops from Dell, HP and Acer) natively. This is still very new, so you have to enable this in the tablet settings:

He also refactored Krita's hardware-accelerated display functionality to optionally use ANGLE on Windows instead of native OpenGL. That means that many problems with Intel display chips and broken driver versions are worked around, because Krita now indirectly uses Direct3D.

There are more changes in this release, of course:

Some visual glitches when using hi-dpi screens are fixed (remember: on Windows and Linux, you need to enable this in the settings dialog).

If you create a new image from clipboard, the image will have a title

Favorite blending modes and favorite brush presets are now loaded correctly on startup

G'MIC

the plugin has been updated to the latest version for Windows and Linux.

the configuration for setting the path to the plugin has been removed. Krita looks for the plugin in the folder where the krita executable is, and optionally inside a folder with a name that starts with ‘gmic’ next to the krita executable.

there are several fixes for handling layers and communication between Krita and the plugin

Some websites save jpeg images with a .png extension: that used to confuse Krita, but Krita now first looks inside the file to see what kind of file it really is.

PNG:

16 and 32 bit floating point images are now converted to 16 bit integer when saving the images as PNG.

It’s now possible to save the alpha channel to PNG images even if there are no (semi-) transparent pixels in the image

When hardware accelerated display is disabled, the color picker mode of the brush tool showed a broken cursor; this has been fixed.

The Reference Images docker now only starts loading images when it is visible, instead on Krita startup. Note: the reference images docker uses Qt’s imageio plugins to load images. If you are running on Linux, remove all Deepin desktop components. Deepin comes with severely broken qimageio plugins that will crash any Qt application that tries to display images.

File layers now correctly reload on change again

Several new command-line options have been added:

--nosplash to start Krita without showing the splash screen

--canvasonly to start Krita in canvas-only mode

--fullscreen to start Krita full-screen

--workspace Workspace to start Krita with the given workspace

Selections

The Select All action now first clears the selection before selecting the entire image

It is now possible to extend selections outside the canvas boundary

Performance improvements: in several places superfluous reads from the settings were eliminated, which makes generating a layer thumbnail faster and improves painting if display acceleration is turned off.

The smart number input boxes now use the current locale to follow desktop settings for numbers

The system information dialog for bug reports is improved

macOS/OSX specific changes:

Bernhard Liebl has improved the tablet/stylus accuracy. The problem with circles having straight line segments is much improved, though it’s not perfect yet.

On macOS/OSX systems with an AMD GPU, support for hardware-accelerated display is disabled, because saving to PNG and JPG otherwise hangs Krita.

Download
Windows

Note for Windows users: if you encounter crashes, please follow these instructions to use the debug symbols so we can figure out where Krita crashes. There are no 32-bit packages at this point, but there will be for the final release.

The new PyDev release is now out and offers some really nice features on a number of fronts!

The interpreter configuration now integrates with both pip and conda, showing the installed packages and allowing any package to be installed and uninstalled from inside the IDE.

It also goes a step further with the conda integration and allows users to load the proper environment variables from the env. This is off by default; when PyDev identifies an interpreter as being managed by conda, it can be turned on in the interpreter configuration page by checking the "Load conda env vars before run" option (so, if you have a library which relies on such configuration, you don't have to activate the env outside the IDE).

Another change which is pretty nice is that now when creating a project there's an option to specify that the project should always use the interpreter version for syntax validation.

Previously a default version for the grammar was set, but users could be confused when the version didn't match the interpreter... note that it's still possible to set a different version or even add additional syntax validators, for cases when you're actually dealing with supporting more than one Python version.

The editor now has support for subword navigation (so navigating a word such as MyReallyNiceClass with Ctrl+Left/Right will stop after each subword -- i.e.: 'My', 'Really', 'Nice', 'Class'; remember that Shift+Alt+Up can be used to select the full word in cases where Ctrl+Shift+Left/Right did it previously).

This mode is now also consistent among all platforms (previously each platform had its own style based on the underlying platform -- it's still possible to revert to that mode in the Preferences > PyDev > Editor > Word navigation option).

Integration with PyLint and isort were also improved: the PyLint integration now provides an option to search for PyLint in the interpreter which a project is using and isort integration was improved to know about the available packages (i.e.: based on the project/interpreter configuration, PyDev knows a lot about which should be third party/ library projects and passes that information along to isort).

On the unittest front, Robert Gomulka did some nice work: the name of the unittest being run is now properly shown in the run configuration, and it's possible to right-click a given selection in the dialog to run tests (Ctrl+F9) and to edit the run configuration (to set environment variables, etc.) before running it.

Aside from that there were also a number of other fixes and adjustments (see http://pydev.org for more details).

Several of our Lullabots and the team from our sister company, Drupalize.me, are about to descend upon the City of Music to present seven kick-ass sessions to the Drupal community in the EU. There will be a cornucopia of topics presented — from softer human-centric topics such as imposter syndrome to more technical topics such as Decoupled Drupal. So, if you're headed to DrupalCon Vienna next week, be sure to eat plenty of Sachertorte, drink lots of Ottakringer, and check out these sessions that will Rock You Like Amadeus:

Contenta is a decoupled Drupal distribution that ships with many example front-ends as best-practice guides. Lullabot Senior Technical Architect Sally Young, Christina Chumillas, and Daniel Wehner will bring you up to speed on the latest Contenta developments, including its current features and roadmap. You will also get a tour of Contenta’s possibilities through its reference applications, which implement the cooking recipe from the out-of-the-box initiative.

Lullabot Developer, Ezequiel “Zequi” Vázquez, will explore the current state of test automation and present the most useful tools that provide testing capabilities for security, accessibility, performance, scaling, and more. Zequi will also give you advice on the best strategies to implement automated testing for your application, and how to cover relevant aspects of your software.

Drupalize.me Production Manager & Trainer, Amber Himes Matz, will survey the current state of voice and conversational interface APIs with an eye toward global language support. She’ll cover services including Alexa, Google, and Cortana by examining their distinct features and the devices, platforms, interactions, and spoken languages they support. If you’re looking for a better understanding of the voice and conversational interface services landscape, ideas on how to approach the voice UI design process, an understanding of concepts and terminology related to voice interaction, and ways to get started, this is the right session for you - complete with a demo!

Lullabot Developer, Juan Olalla Olmo, and Salvador Molina will share their experiences and explore the areas and attitudes that can help everyone become better professionals by embracing who they are and ultimately empower others to do the same. This inspiring session aims to help you grow professionally and provide more value at work by focusing on fostering the human relationships and growing as people.

Juan gave this presentation internally at Lullabot’s recent Design and Development Retreat. It was a highlight that sparked a lively conversation.

Want to make your own virtual reality experiences? Lullabot Senior Front-end Developer Wes Ruvalcaba will show you how. Starting with an overview of VR (and AR) concepts, technologies, and what its uses are, Wes will also demo and share code examples of VR websites we’ve made at Lullabot. You’ll also get an intro to A-Frame and Wes will explain how you can get started.

We’re especially proud of Drupalize.me's Joe Shindelar for being selected to give the Community Keynote. If you’ve been around Drupal for a while, it’s likely you’ve either met or learned from Joe. In this session, Joe will reflect on 10 years of both successfully and unsuccessfully engaging with the community. By doing so he hopes to help others learn about what they have to share, and the benefits of doing so. This is important because sharing:

Creates diversity, both of thought and culture

Builds people up, helps them realize their potential, and enriches our community

Lullabot Developer Marcos Cano will be presenting on Entity Browser, which is a Drupal 8 contrib module created to upload multiple images/files at once, select and re-use an image/file already present on the server, and more. In this session Marcos will:

Explain the basic architecture of the module, and how to take advantage of its plugin-based approach to extend and customize it

Show how to configure it from scratch to solve different use-cases, including some pitfalls that often occur in that process

Review what we can copy or re-use from other contrib modules

Explore some possible integrations with other parts of the media ecosystem

We start by reproducing a blogpost published last June, but with a 30x
speedup. Then we talk about how we achieved the speedup with Cython and Dask.

All code in this post is experimental. It should not be relied upon.

Experiment

In June Ravi Shekhar
published a blogpost Geospatial Operations at Scale with Dask and GeoPandas
in which he counted the number of rides originating from each of the official
taxi zones of New York City. He read, processed, and plotted 120 million
rides, performing an expensive point-in-polygon test for each ride, and produced a
figure much like the following:

This took about three hours on his laptop. He used Dask and a bit of custom
code to parallelize GeoPandas across all of his cores. Using this combination
he got close to the speed of PostGIS, but from Python.

Today, using an accelerated GeoPandas and a new dask-geopandas library, we can do
the above computation in around eight minutes (half of which is reading CSV
files) and so can produce a number of other interesting images with faster
interaction times.

The rest of this article talks about GeoPandas, Cython, and speeding up
geospatial data analysis.

Background in Geospatial Data

The Shapely User Manual begins
with the following passage on the utility of geospatial analysis to our society.

Deterministic spatial analysis is an important component of computational
approaches to problems in agriculture, ecology, epidemiology, sociology, and
many other fields. What is the surveyed perimeter/area ratio of these patches
of animal habitat? Which properties in this town intersect with the 50-year
flood contour from this new flooding model? What are the extents of findspots
for ancient ceramic wares with maker’s marks “A” and “B”, and where do the
extents overlap? What’s the path from home to office that best skirts
identified zones of location based spam? These are just a few of the possible
questions addressable using non-statistical spatial analysis, and more
specifically, computational geometry.

Shapely is part of Python’s GeoSpatial stack which is currently composed of the
following libraries:

These libraries provide intuitive Python wrappers around the OSGeo C/C++
libraries (GEOS, GDAL, …) that power virtually every open source geospatial
tool, such as PostGIS, QGIS, etc. The wrappers provide the same functionality,
but are typically much slower due to the way they use Python. This is
acceptable for small datasets, but becomes an issue as we transition to larger
and larger datasets.

In this post we focus on GeoPandas, a geospatial extension of Pandas which
manages tabular data that is annotated with geometry information like points,
paths, and polygons.

GeoPandas Example

GeoPandas makes it easy to load, manipulate, and plot geospatial data. For
example, we can download the NYC taxi
zones, load and plot
them in a single line of code.

Cities are now doing a wonderful job publishing data in the open. This
provides transparency and an opportunity for civic involvement to help analyze,
understand, and improve our communities. Here are a few fun geospatially-aware
datasets to whet your appetite:

Unfortunately GeoPandas is slow. This limits interactive exploration on larger
datasets. For example the Chicago crimes data (the first dataset above) has
seven million entries and is several gigabytes in memory. Analyzing a dataset
of this size interactively with GeoPandas is not feasible today.

This slowdown is because GeoPandas wraps each geometry (like a point, line, or
polygon) with a Shapely object and stores all of those objects in an
object-dtype column. When we compute a GeoPandas operation on all of our
shapes we just iterate over these shapes in Python. As an example, here is how
one might implement a distance method in GeoPandas today.
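The original snippet is not reproduced in this excerpt, but the idea can be sketched in plain Python. The Point class below is a hypothetical stand-in for a Shapely geometry; the shape of the loop is what matters:

```python
import math

class Point:
    # Hypothetical stand-in for a Shapely Point: one Python wrapper
    # object per geometry, as stored in an object-dtype column.
    def __init__(self, x, y):
        self.x, self.y = x, y

    def distance(self, other):
        return math.hypot(self.x - other.x, self.y - other.y)

def series_distance(geometries, other):
    # The GeoPandas-style approach: a Python-level loop that calls a
    # method on each wrapper object, one interpreter dispatch at a time.
    return [g.distance(other) for g in geometries]
```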

Unfortunately this just iterates over elements in the series, each of which is
an individual Shapely object. This is inefficient for two reasons:

Iterating through Python objects is slow relative to iterating through those same objects in C.

Shapely Python objects consume more memory than the GEOS Geometry objects that they wrap.

This results in slow performance.

Cythonizing GeoPandas

Fortunately, we’ve rewritten GeoPandas with Cython to directly loop over the
underlying GEOS pointers. This provides a 10-100x speedup depending on the
operation. Instead of using a Pandas object-dtype column that holds Shapely
objects, we now store a NumPy array of direct pointers to the GEOS objects.

Before

After

As an example, our function for distance now looks like the following Cython
implementation (some liberties taken for brevity):
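The Cython code itself is not reproduced in this excerpt, but the structural change it makes can be illustrated in plain Python: loop over flat coordinate arrays (standing in for the array of GEOS pointers) rather than one wrapper object per geometry. Names here are hypothetical:

```python
import math
from array import array

def distances_flat(xs, ys, ox, oy):
    # A single tight loop over primitive coordinates; in the real
    # Cython version the analogous loop runs at C speed over an
    # array of GEOS pointers, with no Python objects in the way.
    return [math.hypot(x - ox, y - oy) for x, y in zip(xs, ys)]

# Coordinates stored as flat numeric arrays, not wrapper objects.
xs = array("d", [0.0, 3.0])
ys = array("d", [0.0, 4.0])
```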

For fast operations we see speedups of 100x. For slower operations we’re
closer to 10x. Now these operations run at full C speed.

In his EuroSciPy
talk Joris compares the
performance of GeoPandas (both before and after Cython) with PostGIS, the standard geospatial plugin for the popular
PostgreSQL database (original
notebook
with the comparison). I’m stealing some plots from his talk below:

Cythonized GeoPandas and PostGIS run at almost exactly the same speed. This is
because they use the same underlying C library, GEOS. These algorithms are not
particularly complex, so it is not surprising that everyone implements them
in exactly the same way.

This is great. The Python GIS stack now has a full-speed library that operates
as fast as any other open GIS system is likely to manage.

Problems

However, this is still a work in progress, and there is still plenty of work
to do.

First, we need Pandas to track our arrays of GEOS pointers differently from
how it tracks a normal integer array. This is both for usability reasons (we
want to render them differently, and don't want users to perform numeric
operations like sum and mean on these arrays) and for stability reasons,
because we need to track these pointers and release their allocated
GEOSGeometry objects from memory at the appropriate times. Currently, we
pursue this goal by creating a new block type, the GeometryBlock (‘blocks’ are
the internal building blocks of Pandas that hold the data of the different
columns). This will require some changes to Pandas itself to enable custom
block types (see this issue on the Pandas issue tracker).

Second, data ingestion is still quite slow. This relies not on GEOS but on
GDAL/OGR, which is handled in Python today by Fiona. Fiona is optimized more
for consistency and usability than for raw speed. Previously, when GeoPandas
was slow, this made sense because no one was operating on particularly large
datasets. Now, however, we observe that data loading is often several times
more expensive than all of our manipulations, so this will probably need some
effort in the future.

Third, there are some algorithms within GeoPandas that we haven’t yet
Cythonized. This includes both particular features like overlay and dissolve
operations as well as small components like GeoJSON output.

Finally, as with any rewrite of a codebase that is not exhaustively tested
(we’re trying to improve testing as we go), there are probably several bugs
that we won’t detect until some patient and forgiving user runs into them
first.

Still though, all linear geospatial operations work well and are thoroughly
tested. Also spatial joins (a backbone of many geospatial operations) are up
and running at full speed. If you work in a non-production environment then
Cythonized GeoPandas may be worth your time to investigate.

Cythonizing gives us speedups in the 10x-100x range. We use a single core as
effectively as is possible with the GEOS library. Now we move on to using
multiple cores in parallel. This gives us an extra 3-4x on a standard 4 core
laptop. We can also scale to clusters, though I’ll leave that for a future
blogpost.

To parallelize we need to split apart our dataset into multiple chunks. We can
do this naively by placing the first million rows in one chunk, the second
million rows in another chunk, etc. or we can partition our data spatially,
for example by placing all of the data for one region of our dataset in one
chunk and all of the data for another region in another chunk, and so on.
Both approaches are implemented in a rudimentary
dask-geopandas library
available on GitHub.
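A minimal pure-Python sketch of the naive, row-based strategy, with a thread pool standing in for Dask's scheduler (function names here are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def partition_rows(rows, chunksize):
    # Naive partitioning: consecutive blocks of rows, no spatial awareness.
    return [rows[i:i + chunksize] for i in range(0, len(rows), chunksize)]

def map_partitions(func, partitions, max_workers=4):
    # Apply func to every chunk in parallel, in the spirit of
    # dask.dataframe's map_partitions.  Threads are enough when the
    # underlying work releases the GIL, as the Cythonized GEOS calls do.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, partitions))
```

Spatial partitioning would replace `partition_rows` with a scheme that groups rows by region, which is what enables the smarter spatial joins discussed below.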

So, just as dask-array organizes many NumPy arrays along a grid and
dask-dataframe organizes many Pandas dataframes along a linear index, the
dask-geopandas library organizes many GeoPandas dataframes into spatial
regions. In the example below we might partition data in the city of New York
into its different boroughs. Data for each borough would be handled
separately by a different thread or, in a distributed situation, might live on
a different machine.

This gives us two advantages:

Even without geospatial partitioning, we can use many cores (or many
machines) to accelerate simple operations.

For spatially aware operations, like spatial joins or subselections, we can
engage only those parts of the parallel dataframe that we know are relevant
for various parts of the computation.

However, this is also expensive and not always necessary. In our initial
exercise with the NYC Taxi data we didn’t do this, and we still got
significant speedups just from normal multicore operation.

Exercise

To produce the images at the top of this post, we used a combination of
dask.dataframe to load the CSV files, dask-geopandas to perform the spatial
join, and then dask.dataframe and normal Pandas to perform the actual
computations. Our code looked something like the following:

We’ve replaced most of Ravi’s custom analysis with a few lines of new standard
code. This maxes out our CPUs when doing spatial joins. Everything here
releases the GIL well, and the entire computation runs in under a couple of
gigabytes of RAM.

Problems

The dask-geopandas project is
currently a prototype. It will easily break for non-trivial applications (and
indeed many trivial ones). It was designed to see how hard it would be to
implement some of the trickier operations like spatial joins, repartitioning,
and overlays. This is why, for example, it supports a fully distributed
spatial join, but lacks simple operations like indexing. There are
other longer-term issues as well.

Serialization costs are manageable but still fairly high. We currently use the
standard “well known binary” (WKB) format common in other geospatial
applications, but have found it to be fairly slow, which bogs down
inter-process parallelism.

Similarly, distributed and spatially partitioned data stores don’t seem to be
common (or at least I haven’t run across them yet).

It’s not clear how dask-geopandas dataframes and normal Dask dataframes should
interact. It would be very convenient to reuse all of the algorithms in
dask.dataframe, but the index structures of the two libraries are very
different. This may require some clever software engineering on the part of
the Dask developers.

Still though, these seem surmountable and generally this process has been easy
so far. I suspect that we can build an intuitive and performant parallel GIS
analytics system with modest effort.

The notebook for the example at the start of this blogpost shows
dask-geopandas in use, with good results.

Conclusion

With established technologies in the PyData space like Cython and Dask we’ve
been able to accelerate and scale GeoPandas operations above and beyond
industry standards. However this work is still experimental and not ready for
production use. This work is a bit of a side project for both Joris and
Matthew and they would welcome effort from other experienced open source
developers. We believe that this project can have a large social impact and
are enthusiastic about pursuing it in the future. We hope that you share our
enthusiasm.

Little’s law can be used to describe a system in steady state from a queuing perspective, i.e. one in which arrival and departure rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl’s law, in that throughput is one over latency.

However, this is an inaccurate way to model a system with locks, because Amdahl’s law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost, the measured cost would be much lower than in a multi-threaded environment, where cache coherence, other OS costs such as scheduling, and the lock implementation itself all need to be considered. The Universal Scalability Law (USL) accounts for both the contention and the coherence costs: http://www.perfdynamics.com/Manifesto/USLscalability.html

When modelling locks it is necessary to consider how contention and coherence costs vary with the implementation. Consider how in Java we have biased locking, thin locks, fat locks, inflation, and the revoking of biases, which can cause safepoints that bring all threads in the JVM to a stop with a significant coherence component.
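As a concrete illustration (the parameter values below are made up), the two models can be compared directly. In Gunther's formulation, Amdahl's law predicts a speedup of N / (1 + σ(N − 1)) for N threads with contention fraction σ, and the USL adds a coherence term κN(N − 1) to the denominator:

```python
def amdahl(n, sigma):
    # Amdahl's law: contention (serial fraction sigma) only.
    return n / (1 + sigma * (n - 1))

def usl(n, sigma, kappa):
    # Universal Scalability Law: contention plus a coherence
    # penalty kappa that grows with the number of thread pairs.
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))
```

With σ = 0.05 and κ = 0.001, for example, Amdahl predicts a speedup of about 12.5 at 32 threads while the USL predicts only about 9, and with larger κ the USL curve can even bend downward as threads are added.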

There will be a lot of sessions at DrupalCon Vienna. That's nothing new, to be fair. DrupalCons are the biggest Drupal events, so you can't catch all the sessions you want. Therefore, we have made a short list of the sessions you don't want to miss. We hope it will help you.
But before looking at it, it's fair to say that the Business sessions were excluded because we have already presented them on another occasion. Moreover, our commercial director Iztok Smolic was directly involved in selecting them, so if we pointed out any session from the business track, you might have argued about the…

Here’s a list of 10 important tips and tricks to help make sure you have a magical BADCamp experience.

BADCamp is sure to be a great event. Tickets are FREE. Register today!

1. Wear Good, Comfortable Shoes

If you want to have a great time the whole time you’re at BADCamp, we STRONGLY recommend wearing shoes that are comfortable and give you lots of support. You don’t want to miss out on all the fun stuff we have planned because you have to take a break to rest your poor tootsies. Don’t wear brand new shoes either, and you might also want to get insoles.

2. Dress in Layers

October in Berkeley is beautiful, but let’s face it, room temperatures are unpredictable. For this reason, bring a hoodie (or donate to get a special edition 2017 BADCamp hoodie) that you can throw on and/or take off as the climate requires. The historical average for that time of year is in the mid 70’s (about 22 – 25 Celsius).

3. Make Connections

Do you want to find a new employer? Check out the job board and sponsors expo to meet awesome Drupal shops.

Who do you want to meet while you are at BADCamp? A famous podcaster or module maintainer? Find out who is coming on the attendee list and reach out to connect. Magical moments are frequent at BADCamp.

4. Bring a Laptop

If you want to get the most out of your BADCamp experience, be sure to bring a laptop. You will want to follow along and try some of the cool things the presenters show you. Sometimes getting to an outlet to charge your laptop can be tricky, so bring a notebook or notepad and a pen to take notes while your laptop charges.

5. Bring a Water Bottle/Travel Mug

There will be water fountains and FREE coffee/tea. Bringing a refillable water bottle means that you can stay focused on what you’re doing longer and get the most out of the sessions you're attending. Not to mention, using a water bottle instead of buying bottles of water is far better for the environment.

6. Bring Chargers for ALL your Devices and a Mobile Charger

There’s nothing worse than being out and about with a dead phone/tablet/laptop. Bring chargers for all of the devices you intend to use at BADCamp. If you use a battery-operated mouse (or wireless remote for presenting), bringing an extra set of batteries couldn’t hurt either. Even if you don’t end up needing them, you could find yourself with a new friend when you share those extra batteries with someone in need.

7. Bring Business Cards

Make it easy to connect! You will meet lots of great people and some of them you will want to keep in touch with. Get in the habit of giving out a card when you meet someone.

8. Condense your Stuff

You will walk around campus, so a lighter load is ideal. Plus you will want room for SWAG! Condense your backpack down. Pro Tip: Get a small tote or even a Ziploc bag to stick all of your conference swag in. That way all of the stickers and little bits and pieces are in one bag that you can stick in your luggage at the end of the conference.

9. Bring a Pair of Headphones

As much as we all want to be able to unplug from our jobs and just focus on the sessions, it’s not always possible. Sometimes you have to put your nose to the grindstone and get some work done. If you’re the type that needs to listen to some music while you work, bring along a pair of earbuds so that you can focus and not disturb others around you.

10. Bring a Friend

While not required, having a friend tag along with you can make for a memorable BADCamp experience. If you’re like me and you’re road tripping to BADCamp, think of all of the awesome photos, sing-a-longs, and weird roadside attractions that you’ll see and get to enjoy together. If you’re flying, it’s always nice to have a travel buddy to keep you company while you’re waiting at the airport during the inevitable layover.

Pro Tip: Don’t use your buddy as a reason to shut out others. Go in with an open mind and you’re sure to find another new friend (or friends!) to share the experience with.

BADCamp is sure to be a great event. Tickets are FREE. Register today!

The following problems appeared in the programming assignments of the Coursera course Applied Social Network Analysis in Python. The descriptions of the problems are taken from the assignments, and the analysis is done using NetworkX. 1. Creating and Manipulating Graphs: Eight employees at a small company were asked to choose 3 movies that they would most enjoy…

