The DSS platform covers every step of an end-to-end data-science project, from connectivity, data wrangling and visualization to machine learning and production deployment. Its machine-learning module supports standard libraries such as scikit-learn, XGBoost, MLlib and H2O. Developers can also connect to a Hadoop cluster and integrate multiple Spark engines.

DSS is tailored for common use cases in predictive analytics such as demand forecasting, lifetime value optimization, churn analytics and fraud detection. Dataiku customers include companies such as General Electric, L'Oréal and Unilever. The company has been named a "visionary" in the Gartner 2018 Magic Quadrant for Data Science Platforms for the second consecutive year.

InfoQ sat down with Florian Douetteau, CEO of Dataiku, to learn more about the company and its flagship product.

Florian Douetteau: Dataiku is software that users download and install on their own infrastructure. For many customers that means the cloud, but for others it's still their data center (it's about 50/50).

We are generally guided by the deployment constraints and challenges of our clients, so we have to keep our product as simple as possible. Our architecture is multi-process but also monolithic, in the sense that it's self-contained. Basically, the solution embeds everything it needs, including its databases, which are SQLite and H2. We code primarily in Java, which is one of the principal languages of big data and is considered a good compromise between performance and productivity.

On the back end, you'll find a web server that does job scheduling, storage and management of metadata, and search indexing. We also have some Python and R processes as well as, obviously, Spark processes. And on the front end, we use a single-page application (SPA) in AngularJS.
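To make the "self-contained" idea concrete, here is a minimal sketch of a back end that keeps its job metadata in an embedded SQLite database rather than in an external database server. It is written in Python with the standard `sqlite3` module purely for illustration; the `jobs` table and the function names `open_metadata_store`, `enqueue` and `next_queued` are hypothetical and do not describe Dataiku's actual schema or code.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Open (and create if needed) an embedded SQLite database for job metadata.

    The table layout is an assumption made for this sketch.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " name TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    conn.commit()
    return conn


def enqueue(conn, name):
    """Record a new job in the 'queued' state and return its id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    return cur.lastrowid


def next_queued(conn):
    """Mark the oldest queued job as running and return (id, name), or None."""
    with conn:
        row = conn.execute(
            "SELECT id, name FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row
```

Because the database is a file (or, here, an in-memory store) opened by the process itself, nothing extra has to be installed or administered, which fits the stated goal of keeping deployment simple.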

InfoQ: Who is the typical user of the platform? Can the marketing or sales department use it or does it require some level of data science expertise?

Douetteau: The great thing about Dataiku is that it is for everyone within an enterprise who uses or interacts with data. Of course, there are lots of features specifically for those with coding and data science expertise - they can use their favorite big data programming languages for more advanced and custom work. But we also have many data scientists using the platform who combine those coding features with the point-and-click visual interface, because at times it can just be more efficient.

The visual interface in Dataiku lets analysts and other non-technical users go from connecting data sources to data wrangling to applying machine learning models to visualization and more without writing a single line of code. For larger teams with lots of analysts, this is great because it lets many more people contribute to data projects. For small teams that maybe don't have a data scientist, this allows for a lot of flexibility as well.

InfoQ: With the release of DSS 4.2, you added deep-learning-based image recognition. What is your deep learning product strategy?

Douetteau: Throughout 2018 and into 2019, we'll have our sights set on helping businesses remove the roadblocks standing in the way of productionalized data projects while also providing the structure and stability necessary for long-term success. This means an accelerated focus on deep learning, AI, and deployment to production in our product development roadmap.

InfoQ: How do you handle machine learning at scale with big data and deep learning in terms of computing power and storage?

Douetteau: In terms of performance, because Dataiku is software that users download and install on their own infrastructure, it's up to the client to deploy multiple instances to ensure good performance.

And this is exactly what we support - scaling out and adding new nodes. Convincing our customers to trust us with hosting their data would be complicated, especially since we're focused on large, international enterprises. But on top of that, there are also underlying technical issues; for example, when it comes to processing as close to the data as possible, SaaS is not a good solution. On the other hand, this works well in the cloud - we integrate with AWS, Microsoft Azure, and GCP via their managed Hadoop solutions.
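The point about processing as close to the data as possible can be illustrated with a toy comparison. The sketch below uses an in-memory SQLite table as a stand-in for a data store; in a real deployment the engine would be something like a Hadoop or Spark cluster, and the `sales` table and function names here are hypothetical. One function pulls every row to the client before aggregating, the other pushes the aggregation down to the engine so that only the results cross the boundary.

```python
import sqlite3


def make_sales_db():
    """Create a small in-memory table standing in for a remote data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("EU", 10.0), ("EU", 5.0), ("US", 7.5)],
    )
    conn.commit()
    return conn


def totals_client_side(conn):
    """Pull every row out of the store, then aggregate in the client."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """Let the engine aggregate; only one row per region leaves the store."""
    return dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
```

Both functions return the same totals, but the first transfers one row per sale while the second transfers one row per region. With large datasets that difference is the reason to run the computation where the data lives.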