Dremio Fleshes Out Data Platform

Alex Woodie

Dremio today rolled out a version 3.0 update to its multi-purpose data analytics tool with new features like support for Kubernetes, a built-in data catalog, and support for Teradata, Elasticsearch, and Microsoft ADLS data sources, among others.

Dremio advertises its software as a data-as-a-service platform that’s designed to improve the ability of data analysts and data scientists to access and process data residing in multiple locations.

In many cases, the Dremio product intercepts SQL queries generated by BI tools like Tableau or Looker and processes them more quickly in its own Dremio cluster, thereby eliminating the need to have engineers create complicated ETL jobs that touch multiple systems. In other cases, Dremio uses its “push down” processing capability and relies on an Oracle or (now) Teradata database to actually process the query.

In either case, the core benefit is the same: making it easier for users to get access to and process data.

“It’s 10 years into the AWS era. Now you’ve got almost everything available as a service,” says Kelly Stirman, the vice president of strategy and CMO of Dremio. “Companies love this model because it lets them focus on what’s core to their business. What we hear over and over again is we have these things for our infrastructure, we have them for our applications. Why not for our data? That’s what Dremio is all about.”

With Dremio version 3.0, the company has taken the data-as-a-service concept to a new level. Among the most exciting developments here is the new integrated data catalog in Dremio that allows users to record where they’re storing certain data sets, along with metadata and a wiki page that describes the data.

“When someone starts their data journey, the first question they ask is ‘Where is my data?’” Stirman tells Datanami. “Most companies don’t have an inventory of their data assets, so this gives them a way to inventory and document the metadata and tribal knowledge about their data sets in one central location.”

Dremio is not looking to compete with data catalog providers such as Alation or Informatica, Stirman says. Those products offer more cataloging functionality than Dremio does. But one advantage of the Dremio data catalog over those offerings is that it helps customers move quickly along the analytics path, he says.

“The idea here is, after taking the first step to finding a data set you’re looking for, you can immediately take the next step in Dremio to start doing analysis in your next tool, whether that’s Tableau or Jupyter notebook or something else,” Stirman says. “It’s about improving the quality of experience for the data consumer and for the data engineer by keeping things in one integrated platform.”

Support for Kubernetes in Dremio 3.0 will help customers scale their Dremio clusters up and down rapidly, particularly in Amazon and Azure cloud environments.

“If you’re running Dremio in Hadoop, you’d use something like YARN to provision it. But if you’re running it outside of Hadoop, like on S3 or Azure, then people want to orchestrate and provision with Kubernetes,” Stirman says. “You can literally with the press of a button click go from one configuration of a cluster to scale up to 10x for a heavy workload in a holiday season, without taking the cluster down.”

New workload management features in Dremio 3.0 will give administrators more fine-grained control over the execution of workloads in Dremio, including the capability to restrict cluster resources according to compute and memory usage and the duration of jobs.

“When it comes to data, you don’t just have one user accessing data — you have different users with different types of need,” Stirman says. “Not everyone is equal. Maybe you have executives who need a VIP experience with quality and speed of data, which is different from a bulk job that runs at odd hours.”

With Dremio 3.0, administrators can associate different jobs with different types of users. For example, the admin could dedicate 80% of a cluster’s resources to a certain group of users at a certain time of day. “It’s an incredibly flexible set of options in terms of associating workloads with resources in the cluster,” he says.
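To make the idea concrete, here is a minimal Python sketch of rule-based workload routing of the kind described above. It is not Dremio’s API or configuration format: the `Queue` and `Rule` types, the queue names, the time window, and the runtime limits are all hypothetical, chosen only to mirror the example of reserving 80% of a cluster for one group of users at a certain time of day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Queue:
    """Hypothetical resource queue: a share of the cluster plus a job time limit."""
    name: str
    cluster_share: float   # fraction of cluster resources reserved for this queue
    max_runtime_s: int     # jobs running longer than this would be cancelled


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: user group + time window -> queue name."""
    group: str
    start: time
    end: time
    queue: str


# Assumed example values, not taken from any real deployment.
QUEUES = {
    "vip": Queue("vip", cluster_share=0.8, max_runtime_s=300),
    "bulk": Queue("bulk", cluster_share=0.2, max_runtime_s=6 * 3600),
}
RULES = [Rule("executives", time(8, 0), time(18, 0), "vip")]


def route(group: str, at: time, rules=RULES, default: str = "bulk") -> Queue:
    """Return the queue for a job submitted by `group` at time-of-day `at`."""
    for rule in rules:
        if rule.group == group and rule.start <= at < rule.end:
            return QUEUES[rule.queue]
    return QUEUES[default]
```

In this sketch, a query from the `executives` group at 9 a.m. lands in the queue holding 80% of the cluster, while the same query at 10 p.m., or any query from another group, falls through to the bulk queue.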

The product also now supports end-to-end encryption over TLS, which should help assuage the security concerns of companies large and small. It also supports row- and column-level access controls on data, even for data sources that don’t natively offer them, such as HDFS and S3. “In Dremio, you get those controls and features, as well as dynamic masking, just by putting Dremio between your tool and your data sources,” Stirman says.
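As a rough illustration of what row-level filtering, column-level control, and dynamic masking mean when a layer sits between the tool and the data source, consider the following Python sketch. It is an assumption-laden toy, not how Dremio implements these controls: the function name `apply_policy`, the `region` field used for row filtering, and the `****` mask value are all hypothetical.

```python
def apply_policy(rows, allowed_regions, masked_columns, mask="****"):
    """Filter rows by region and mask sensitive columns before returning results.

    rows            -- list of dicts, one per record (stand-in for a query result)
    allowed_regions -- regions this user may see (row-level control, assumed field)
    masked_columns  -- column names whose values are hidden (column-level masking)
    """
    visible = []
    for row in rows:
        # Row-level control: drop records the user is not entitled to see.
        if row["region"] not in allowed_regions:
            continue
        # Dynamic masking: replace sensitive values at read time,
        # leaving the underlying stored data unchanged.
        visible.append(
            {col: (mask if col in masked_columns else val) for col, val in row.items()}
        )
    return visible


records = [
    {"region": "EU", "name": "Ada", "ssn": "123-45-6789"},
    {"region": "US", "name": "Bob", "ssn": "987-65-4321"},
]
eu_view = apply_policy(records, allowed_regions={"EU"}, masked_columns={"ssn"})
```

Here `eu_view` contains only the EU record, with its `ssn` value replaced by the mask, while `records` itself is left untouched.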

Dremio was founded by the co-creators of Apache Arrow, the in-memory, columnar data format designed to address integration and runtime challenges in the big data environment. The Arrow format has been adopted by a range of vendors and projects, including Apache Spark, TensorFlow, H2O.ai, Anaconda, and Dask, among many others.

With Dremio 3.0, the Arrow engine is being upgraded through something called the Gandiva Initiative, which has resulted in a new kernel based on LLVM compiler technology. As a result of that work, the Arrow engine can process data up to 100x faster, according to Dremio.

Separately, the company has delivered new connectors for Teradata, AWS GovCloud, Elasticsearch version 6, and the Azure Data Lake Store (ADLS). As with Oracle, Dremio customers targeting data in the Teradata environment can use the “Reflections” feature to optimize data access, but push the actual SQL query processing down to the underlying data warehouse, according to Stirman.