The progress made in the field of machine learning and AI over the past several years has been tremendous. We’ve moved from science projects and edge use cases to core business functions and competitive advantages, with more companies across industries looking for opportunities to tap into their data and leverage AI. In 2018, we saw...

Continued Innovation and Expanded Availability for the Next-gen Unified Analytics Engine. Databricks Delta, the next-generation unified analytics engine built on top of Apache Spark and aimed at helping data engineers build robust production data pipelines at scale, continues to make strides. Already a powerful approach to building data pipelines, new capabilities and performance...

On February 12, 2019, Databricks became aware of a new critical runc vulnerability (CVE-2019-5736) that allows malicious container users to gain root access to the host operating system. This vulnerability affects many container runtimes, including Docker and LXC. The Databricks security team has evaluated the vulnerability and confirmed that, due to the Databricks platform architecture,...

In January 2013, when Stephen O’Grady, an analyst at RedMonk, published “The New Kingmakers: How Developers Conquered the World,” the book’s central argument resonated (then, as now) with an emerging open-source community. He convincingly charts developers’ movement “out of the shadows and into the light as new influencers on society’s [technical landscape].” Using...

With 2018 behind us, it’s been amazing to see AI projects gain steam and make a significant impact across industries. In fact, a recent survey by CIO.com reports that 90% of enterprises are actively investing in AI. What has fueled this innovation is the massive influx of organizations tapping into the potential of their data and...

In the previous blog post, we introduced the new built-in Apache Avro data source in Apache Spark and explained how you can use it to build streaming data pipelines with the from_avro and to_avro functions. Apache Kafka and Apache Avro are commonly used together to build scalable, near-real-time data pipelines. In this blog post,...
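Part of what makes Avro attractive for Kafka pipelines is its compact binary encoding, which from_avro and to_avro read and write for you inside Spark. As a hedged, pure-Python illustration (not Spark's API), the sketch below implements the zigzag-plus-varint scheme Avro uses for int and long values; the function names here are our own, not part of any library:

```python
# Illustration of Avro's binary encoding for int/long values:
# zigzag-map signed integers, then emit 7 bits per byte with the
# high bit set while more bytes follow. Pure Python, for intuition
# only -- in a real pipeline, Spark's from_avro/to_avro handle this.

def zigzag_encode(n: int) -> int:
    """Map signed ints to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def varint_encode(z: int) -> bytes:
    """Variable-length encoding: 7 payload bits per byte, MSB = continuation flag."""
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def avro_encode_long(n: int) -> bytes:
    return varint_encode(zigzag_encode(n))

def avro_decode_long(data: bytes) -> int:
    z = shift = 0
    for b in data:
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return zigzag_decode(z)
```

Small values near zero, positive or negative, take a single byte, which is why Avro records stay compact on the wire.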

You might be using Bayesian techniques in your data science without knowing it! And if you're not, they could enhance the power of your analysis. This blog follows the introduction to Bayesian reasoning on Data Science Central, and will demonstrate how these ideas can improve a real-world use case: estimating hard drive failure rate...
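To give a feel for the idea before diving into the post, here is a minimal Beta-Binomial sketch of estimating a failure rate Bayesianly. The drive counts below are hypothetical, not the post's actual data:

```python
# Minimal Bayesian failure-rate estimate: with a Beta prior and
# Binomial observations (failed / not failed), the posterior is
# also a Beta, updated by simple counting. Numbers are hypothetical.

def beta_posterior(failures, drives, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) after seeing `failures` out of `drives`,
    starting from a Beta(prior_a, prior_b) prior (uniform by default)."""
    return prior_a + failures, prior_b + (drives - failures)

def posterior_mean(a, b):
    """Point estimate of the failure rate under the posterior."""
    return a / (a + b)

# Two hypothetical drive models with the same raw 2.0% failure rate,
# observed on very different sample sizes.
small = beta_posterior(failures=2, drives=100)
large = beta_posterior(failures=20, drives=1000)
```

The posterior mean for the small sample is pulled further toward the prior than the large one, which is exactly the kind of honest uncertainty accounting the post is about: a 2/100 rate and a 20/1000 rate are not equally trustworthy estimates.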

The most important factor in successful machine learning is having the right data at scale, but organizations struggle to combine the right datasets in the right format for their projects. At Databricks, we’ve seen customers achieve success by bringing data and ML together, and we’ve partnered with Snowflake to share their stories...

As organizations move to the cloud, the architecture of a Modern Data Warehouse (MDW) allows a new level of performance and scalability. A modern data warehouse makes it easy to bring together data at any scale and to derive insights through analytical dashboards, operational reports, or advanced analytics. But what does an MDW look like? The following...

We are excited to announce the release of Databricks Runtime 5.2, which introduces several new features, including Delta Time Travel, Fast Parquet Import, and Databricks Advisor. Let’s unpack each of these features in more detail. Delta Time Travel: Time Travel, released as an Experimental feature, adds the ability to query a snapshot of a table...
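The core idea behind Time Travel is that every write produces a new immutable table version, so past snapshots remain queryable. The toy class below is a conceptual sketch of those semantics in plain Python; it is our illustration only, not Delta's implementation, and the exact query syntax for a given Databricks Runtime release is covered in the post:

```python
# Conceptual sketch of Time Travel semantics: each write creates a new
# immutable snapshot, and reads can target any historical version.
# Toy code for intuition; Delta's real transaction log works differently.

class VersionedTable:
    def __init__(self):
        self._snapshots = [[]]  # version 0 is the empty table

    def write(self, rows):
        """Append rows as a new version; older snapshots stay intact."""
        new = self._snapshots[-1] + list(rows)
        self._snapshots.append(new)
        return len(self._snapshots) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or the table 'as of' a past version."""
        if version is None:
            version = len(self._snapshots) - 1
        return list(self._snapshots[version])

t = VersionedTable()
v1 = t.write([{"id": 1}])   # version 1
v2 = t.write([{"id": 2}])   # version 2
```

Reading with `t.read(version=1)` returns the table as it was after the first write, while `t.read()` returns the latest state, which is the reproducibility guarantee that makes time travel useful for audits and rollbacks.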