Archive for the ‘Data Pipelines’ Category

Companies, non-profit organizations, and governments are all starting to realize the huge value that data can provide to customers, decision makers, and concerned citizens. What is often neglected is the amount of engineering required to make that data accessible. Simply using SQL is no longer an option for large, unstructured, or real-time data. Building a system that makes data usable becomes a monumental challenge for data engineers.

There is no plug and play solution that solves every use case. A data pipeline meant for serving ads will look very different from a data pipeline meant for retail analytics. Since there are unlimited permutations of open-source technologies that can be cobbled together, it can be overwhelming when you first encounter them. What do all these tools do and how do they fit into the ecosystem?

Insight Data Engineering Fellows face these same questions when they begin working on their data pipelines. Fortunately, after several iterations of the Insight Data Engineering Program, we have developed this framework for visualizing a typical pipeline and the various data engineering tools. Along with the framework, we have included a set of tools for each category in the interactive map.
…

This looks quite handy if you are studying for a certification test and need to know the components and a brief bit about each one.

For engineering purposes, it would be even better if you could connect your pieces together and then map the data flows through the pipelines. That is: where did the data previously held in table X go during each step, and what operations were performed on it? Not to mention being able to track an individual datum through the process.

Is there a tool I have overlooked that allows that type of insight into a data pipeline? With subject identities, of course, for the various subjects along the way.
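Lacking such a tool, a minimal sketch of what per-datum lineage tracking could look like follows. The `Tracked` wrapper and all names here are invented for illustration; the idea is simply that every transformation records itself, so any value can be traced back to its source table and the operations performed on it.

```python
# Hypothetical sketch of per-record lineage tracking. Each transformation
# wraps the new value together with the operation name and a detail string,
# so an individual datum carries its own history through the pipeline.

class Tracked:
    def __init__(self, value, lineage=None):
        self.value = value
        self.lineage = lineage or []  # list of (operation, detail) steps

    def apply(self, op_name, fn, detail=""):
        # Return a new Tracked value with one more lineage entry appended.
        return Tracked(fn(self.value), self.lineage + [(op_name, detail)])

# A datum loaded from "table X", then cleaned and cast:
row = Tracked(" 42 ", [("load", "table X")])
row = row.apply("strip", str.strip)
row = row.apply("cast", int, "to int")

print(row.value)    # 42
print(row.lineage)  # [('load', 'table X'), ('strip', ''), ('cast', 'to int')]
```

Real systems would persist these lineage records and attach stable subject identities to each step, but even this toy version answers "what happened to this value?" at any point in the flow.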

Like other companies pushing databases, San Francisco startup MemSQL wants to solve low-level problems, such as easily importing data from critical sources. Today MemSQL is acting on that impulse by releasing a tool to send data from the S3 storage service on the Amazon Web Services cloud and from the Hadoop open-source file system into its proprietary in-memory SQL database — or the open-source MySQL database.

The existing “LOAD DATA” command in MemSQL and MySQL can bring data in, although it has its shortcomings, as Wayne Song, a software engineer at the startup, wrote in a blog post today. Song and his colleagues ran into those snags and started coding.
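For readers unfamiliar with the command in question, here is a rough illustration of what a generated `LOAD DATA` statement looks like in MySQL (and MemSQL, which supports the same syntax). The file path, table name, and helper function are illustrative, not part of either product:

```python
# Build a generic MySQL/MemSQL LOAD DATA statement for a CSV file.
# Path and table names are hypothetical examples.
def load_data_sql(path, table, delimiter=",", skip_header=True):
    stmt = (
        f"LOAD DATA LOCAL INFILE '{path}' "
        f"INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{delimiter}' "
        "LINES TERMINATED BY '\\n'"
    )
    if skip_header:
        stmt += " IGNORE 1 LINES"  # skip the CSV header row
    return stmt

print(load_data_sql("/data/events.csv", "events"))
```

The shortcomings Song describes stem from exactly this shape: `LOAD DATA` takes one local file at a time, which is awkward when the data lives in S3 or HDFS rather than on the database host.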
…

How very cool!

Not every database project seeks to “easily import… data from critical sources,” but I am very glad to see MemSQL take up the challenge.

Reducing the friction between data stores and tools will make data pipelines more robust, reducing the amount of time spent troubleshooting routine data traffic issues and increasing the time spent on analysis that fuels your ROI from data science.

True enough, if you want to make ASCII importing a task that requires custom assistance from your staff, that is one business model. On the whole, I would not say it is a very viable one, particularly with more production-minded folks like MemSQL around.

Motivation: The analysis of large biological datasets often requires complex processing pipelines that run for a long time on large computational infrastructures. We designed and implemented a simple script-like programming language with a clean and minimalist syntax to develop and manage pipeline execution and provide robustness to various types of software and hardware failures as well as portability.

Results: We introduce the BigDataScript (BDS) programming language for data processing pipelines, which improves abstraction from hardware resources and assists with robustness. Hardware abstraction allows BDS pipelines to run without modification on a wide range of computer architectures, from a small laptop to multi-core servers, server farms, clusters and clouds. BDS achieves robustness by incorporating the concepts of absolute serialization and lazy processing, thus allowing pipelines to recover from errors. By abstracting pipeline concepts at programming language level, BDS simplifies implementation, execution and management of complex bioinformatics pipelines, resulting in reduced development and debugging cycles as well as cleaner code.
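The "lazy processing" idea from the abstract is worth unpacking: steps that already completed in a previous run are skipped, so a pipeline that failed partway can be resumed rather than restarted. Here is a toy sketch of that concept only; it is not BDS code, and BDS's actual recovery mechanism (absolute serialization of execution state) is considerably more involved:

```python
# Toy illustration of lazy processing for pipeline recovery.
# `checkpoint` stands in for persisted execution state; in a real system
# it would be serialized to disk after each step completes.

def run_pipeline(steps, checkpoint):
    """steps: list of (name, fn) pairs; checkpoint: set of completed names."""
    executed = []
    for name, fn in steps:
        if name in checkpoint:
            continue               # lazy: already done in a prior run, skip
        fn()                       # do the work (may be hours of compute)
        checkpoint.add(name)       # record success before moving on
        executed.append(name)
    return executed

# Simulate recovery: a prior run completed "align" before crashing,
# so the resumed run only executes "call".
done = {"align"}
ran = run_pipeline([("align", lambda: None), ("call", lambda: None)], done)
print(ran)  # ['call']
```

The design choice is the same one `make` embodies for builds: record what finished, re-run only what didn't, and long-running bioinformatics pipelines stop being all-or-nothing.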