Description

PyData Berlin 2016

Data processing often splits into two disjoint categories. Classic access to an RDBMS with SQL and ORMs is well understood and convenient, but often scales poorly once the data grows beyond a considerable number of gigabytes. Big data approaches are powerful, but complex to set up and maintain. In a test setup we tried a compromise between the two: what happens if you glue more than 1,000 individual SQL databases into one huge cluster? We learned a whole lot of lessons!

Thanks to access to an unused IaaS cluster, we had the opportunity to study how many nodes behave when clustered together. Data loading became a real challenge, and maintaining and monitoring such a drove of containers by hand was no longer possible. We also investigated the effect of changing the container-to-VM ratio. For our experiments we used Crate, an open source, highly scalable, shared-nothing distributed SQL database that comes with Python client connectors and support for several ORMs.

We will share our unexpected experiences with data schema design, explain some tuning options that turned out to be effective, and campaign for more open data projects.