Vendor-Free Data Science

We often get asked by clients what vendor solution(s) we propose for their data science needs. In this
blog post, I try to summarize why the answer is (almost always) some open source tool.

We make no secret of the fact that GoDataDriven loves open source. But why is that?

Open source software doesn't do it all

This needs some explanation. When people develop and release an open source product or project, they usually do so because they believe it is interesting to the community at large. Their goal is achieved when people start using their software. But, especially in data science, they don't pretend to do everything for you.

Since they are not trying to do it all, they are (or should be) extremely flexible when it comes to interfacing with other parts of your infrastructure, pipelines, or tools.

That means that using a particular tool does not force you to change all the other tools you are
using.

This is also important when it comes to the tools you use in your workflow.

Take for example git: I firmly believe git, or any other distributed version control system, is fundamental when it comes to developing and productionizing data science solutions, be it going back in time, integrating with your build system, or working together on the same feature at once.

On the other hand, many commercial vendors1 try to be your one-stop-shop solution for all your data science needs. They handle your code, your data management, your data wrangling, your models, and how you put your models into production. And they do not offer, or even integrate with, any mature version control system.

You might argue, even before factoring in price, that this approach is more attractive as you don't have to think about many moving parts. However, it takes away a lot of freedom: if a new system comes out that is better in some way, say for data wrangling, most of the time you cannot replace just that part: it is usually in the vendor's best interest to keep you2 on their platform.

That means there is no (straightforward) way to mix and match with one-stop shops.

Open source relies on mature software engineering principles

When I first mention this, people are puzzled, and wonder what the heck I'm talking about. I
usually end up explaining why software engineering is important for data science.

It boils down to data science needing software engineering discipline.

Since open source projects rely on software engineering principles, they are naturally architected so that, when you make use of them, you can follow those same principles yourself.

Example

Let me give an example, just in case your head is spinning. When I'm using a tool like Airflow, I can use git for all my import pipelines. Airflow's goal is to orchestrate the pipelines, not to take over my workflow. And as Airflow itself is developed using git, it makes sense for its maintainers to let their users do the same.

Another example is Spark. When you write Spark code, you should really
write the accompanying unit tests with it. Spark provides you with the machinery to write these
tests, because, guess what, Spark is also extensively tested.3
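To make that point concrete, here is a minimal, Spark-free sketch of the principle: factor your transformation logic into plain functions so it can be unit tested without spinning up a cluster. The function and data below are hypothetical; with pyspark installed, the same function could back a UDF or be expressed as DataFrame operations and tested against a local SparkSession.

```python
def fill_missing_age(rows, default=0.0):
    """Replace missing ages with a default value.

    Hypothetical cleaning step: in Spark the equivalent would be
    something like df.fillna({"age": default}), but keeping the
    logic in a plain function makes it trivially unit-testable.
    """
    return [dict(r, age=default if r["age"] is None else r["age"]) for r in rows]


# A tiny unit test, runnable without any cluster:
sample = [{"age": None}, {"age": 31.0}]
assert fill_missing_age(sample) == [{"age": 0.0}, {"age": 31.0}]
```

The same function can later be applied inside a Spark job, while the test suite stays fast and cluster-free.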

People used to open source tools might be baffled that some vendors do not integrate with your version control system or do not allow writing unit tests. But such vendors are out there. Mathematica, for example, added a unit testing framework only in version 10. SAS and SPSS, to the best of my knowledge, do not offer any unit testing framework.

Open source gives you more power

Hence more responsibility! Except in cases where companies provide support and premium features on top of an open source project, such as the Spark-Databricks, Kafka-Confluent, and Cassandra-Datastax combos, there is no support for open source projects. This means that if there is a bug you would like fixed or a feature you would like implemented, you are at the mercy of the project's maintainers. This is the responsibility part.

But... is it bad? Do vendors implement new features or fix bugs at your request and when you need
them? How much money do you have to pay your database vendor to give you what you want?

If the project is open source, the people working for you can fix it and then you can decide if
you want to contribute it back to the community.

It might seem scary at first: modifying software! However, if you think about it, your entire
company runs on software, and some, if not most, of that software has been written in house. It
could be your reporting software, your website, the SQL query that your DBAs wrote, etc. Eventually
it's all just code!

By now you are probably starting to understand the multitude of posts about our open source
contributions. We find new things that we want from the tools that we use, we implement them, and
we give back to the community. So the famous Spiderman quote should really be the other way around when it comes to open source software: with great responsibility comes great power.

Now comes the part where the objectors will say that they do not have the human resources available to fix bugs or implement new features. What they often fail to see, however, is that the money you save by not going vendor-driven4 can be much better invested in good people and in giving them more time to do this sort of thing. Don't assume your employees or colleagues are not interested: they probably chose the open source tool in the first place, and I bet they'd love to give back to the community.

All it takes is overcoming the fear that accompanies us every time we do something for the first time.

Besides, people are almost certainly the most important asset your company has5: would you rather empower them or empower your vendor?

Open source accelerates innovation

No vendor likes to talk about this one. Before you even know whether what you want to accomplish (a new algorithm, a new data storage option, etc.) is viable, you need to start talking to the potential vendor's sales department to find out how much it is going to cost before you can start. When everything has been arranged, the fun can begin: the algorithm starts to take form, the results are validated, and you are ready to take it to production.

Wait! The license you have is not really suited for production. You need another license. Ugh!

Or let's say that you get a new exciting data source. You can't wait to connect it to the
rest of the data. A new database is created but... you cannot create new databases as the limit for
the current subscription has been reached. Contact sales to upgrade!

I have heard this over and over at many different companies. People can't innovate because everything has been set up so rigidly: every piece of the infrastructure is in a different kingdom. You can get through, but there'll be no energy left to innovate by the time you're done.

Open source is transparent

By transparent I mean that you can see what is going on and act accordingly. This is relevant for two things:

Documentation;

Quality assurance.

Documentation

Documentation seems a silly point, but ask programmers what they actually document. Usually documentation explains how to use an API, but it doesn't say much about the (software) architecture in which that API should be used.

A recent example from Heap analytics illustrates this problem best: the company was ingesting thousands of events per day to enable real-time analytics on customer data. When inserting new records into the database, the process would look like this: one record for customer A, then one for customer B, then C, then D, then A again, and so on, at random.

The database solution Heap uses, PostgreSQL, caches a part of the data when you create indices for these records6. However, this random ingestion pattern (customer A-B-C-D-A-...) caused the cached pages for any particular customer to be evicted almost immediately.

At most scales this random access pattern does not matter. The behavior was not mentioned in the documentation, as the database was doing exactly what you would expect from a good database: writing the data in the order it was asked to and, on top of that, caching to speed things up.

For Heap, however, this was an issue: it meant that many customers had to wait up to an hour to see
their data in their dashboard as the caching mechanism wasn't really working.

An engineer therefore decided to take a deeper look. He ended up reading the PostgreSQL source code and found what the issue was.

He then implemented a client-side fix (PostgreSQL was, at the end of the day, not the culprit) that saved Heap millions of dollars.7
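As a purely hypothetical sketch of what such a client-side reordering might look like (Heap's actual fix differs in detail), one could group an interleaved event stream by customer before writing, so that the writes for each customer land together and the relevant index pages stay warm:

```python
from collections import defaultdict


def batch_by_customer(events):
    """Group an interleaved event stream by customer id.

    Hypothetical illustration: instead of writing A, B, C, D, A, ...
    in arrival order, collect events per customer so each batch of
    writes touches one customer's index pages at a time.
    """
    batches = defaultdict(list)
    for event in events:
        batches[event["customer_id"]].append(event)
    return batches


stream = [{"customer_id": c, "n": i} for i, c in enumerate("ABAB")]
grouped = batch_by_customer(stream)
assert [e["n"] for e in grouped["A"]] == [0, 2]
assert [e["n"] for e in grouped["B"]] == [1, 3]
```

The point is not this particular grouping, but that reading the database's source code told the engineer which access pattern the client should produce.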

Again, if you object to open source you will say: but we have support from our vendor.

Yes, this is true. But the more arcane the edge case, the closer you have to get to the source code. And I can assure you that the first- and second-line support at your vendor do not even have access to, let alone the skills to understand, the source code of the product you are using!

Once you get to third-line support, they will probably need to understand your use case, and maybe they even want to take a peek at your database (then NDAs need to be signed, the engineers will probably be in the US, and the list goes on). By the time a solution has been found, maybe three months have passed, and some of your customers will already have left because your product was slow.

Quality assurance

Another tricky one! This is biased towards scientific software, the Python scientific stack in
general, and scikit-learn in particular.

Scikit-learn is a library of machine learning algorithms written in Python. The algorithms are written by scientific researchers all over the world. Every algorithm that gets into scikit-learn is thoroughly vetted by other researchers.

When you use the code, you can sleep safe knowing this. But if you suffer from insomnia or paranoia, you can download scikit-learn, open up a text editor, and take a look.

This is not only true for scikit-learn: each open source project can be inspected to see if it does
what it claims it does.

What then?

The next relevant question is: what is GoDataDriven's advice? We apply an open-source-first, vendors-second approach. If a vendor solution is much better than what any open source solution offers, then we choose the vendor, with two conditions though: there is no data lock-in and there's no code lock-in. This avoids a 'Hotel California' type of situation wherein "You can check out anytime you like, but you can never leave".

Want to know more about what it means to work like that? We're hiring!