The amount of data collected by enterprises continues to grow exponentially. Coupled with constant disruption in the business environment, this puts great pressure on data analysts to analyze data rapidly and give their organizations a competitive edge. This phenomenon is driving a change in the way data analysis is done. Firstly, the analysis needs to be done quickly, because insight has a shelf life. Secondly, the depth of analysis that needs to be performed is unprecedented. The rapid proliferation of advanced analytics, machine learning, and real-time analytics means that competitors are tackling data as soon as it comes in. If your data analysis reveals how you could have satisfied your customers yesterday, you are lagging behind competitors who have already figured out how to make them happy today and in the future. The only way to manage this time and complexity pressure is to remove both technology and capability hurdles for data analysts. The goal of analytics technologies should be to make data analysts self-sufficient across the entire analytics spectrum, so that they can independently analyze data and share insights freely and widely within the organization.

Data analysis typically involves a series of operations, from data preparation to data modeling to visualization. This can be thought of as a pipeline of distinct operations, each of which requires a different skill and is carried out in a separate tool. So far, analysts have been made self-sufficient only through tools that handle a single analytical operation out of the entire pipeline. For example, modern data visualization tools allow analysts to create reports and dashboards on their own, but these tools need data to be made available to them a priori in the right shape and form. Similarly, self-service data preparation tools make it easy for analysts to transform data and produce clean datasets. However, analysts must then move to another tool, and in some cases move the data too, before they can start generating visualizations. Although there is strong motivation to merge these tools, with clear savings in time, risk, and money, it has not been done before. It is worth analyzing why these different analytical processes were never merged:

Diverse Workloads: Data visualization generates an interactive workload, whereas data processing generates streaming or batch workloads. Supporting both of these workloads in a single environment in a cost-effective way puts tremendous performance pressure on the analytics environment.

Diverse User Experience: The user experience of building visualizations is very different from that of generating clean datasets. For visualization, the requirement is to bind and configure visuals; for data processing, the goal is to string different data tasks together. The interaction with the development environment and the output produced in these two cases are quite divergent.

Governance around Data Products: Historically, the data model in the data warehouse provided a clean separation between the worlds of data processing and data visualization. When analysts can create data products along the entire analytics pipeline, however, governance, data lineage, and role-based access all become critical to maintaining sanity on the data lake. Ensuring that these advanced features are integrated and available to the end user is both a user-experience and a performance challenge.

Accelerite ShareInsights addresses these problems by unifying different analytical operations from data processing to visualization in a single cohesive tool, thereby making data analysts self-sufficient in every analytical operation regardless of complexity and enabling them to respond to business questions in real time.

While unifying all these tools, we faced some of the problems outlined earlier, and I will illustrate how we solved them. In ShareInsights, data visualization and data preparation, the two most common operations, are treated as two sides of the same coin. On one side, you apply various transformations to create a dataset of the right shape. On the other side, for visualization, you enable the end user to apply similar transformations in a visual way and find the right slice of data; we call this data slice an insight. To solve the diverse-workloads problem, ShareInsights works on top of an existing big data cluster, where it leverages innovative technologies that support batch, streaming, and interactive workloads alike. Finally, we came up with a declarative format that captures the information about the analytics pipeline in a single text file. Having all the information in a single text file allows us to provide fine-grained governance around different analytical operations and enables analysts to navigate different data products easily. This file is also used by the code generator to generate code for a specific architecture. More details about this can be found in our SIGMOD paper.
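To make the single-text-file idea concrete, here is a minimal sketch in Python of what a declarative pipeline description and a lineage walk over it might look like. The JSON layout, task names, and `after` field are invented for illustration; the actual ShareInsights format is the one described in the SIGMOD paper.

```python
import json

# Hypothetical pipeline file: a list of tasks, each declaring which
# tasks it depends on via "after". Names and fields are illustrative,
# not the real ShareInsights format.
PIPELINE_SPEC = """
{
  "name": "customer_churn",
  "tasks": [
    {"id": "load",  "op": "read_csv",  "args": {"path": "churn.csv"}},
    {"id": "clean", "op": "dropna",    "after": ["load"]},
    {"id": "chart", "op": "bar_chart", "after": ["clean"],
     "args": {"x": "region", "y": "churn_rate"}}
  ]
}
"""

def lineage(spec_text, task_id):
    """Walk the 'after' edges backwards to recover a task's lineage."""
    spec = json.loads(spec_text)
    tasks = {t["id"]: t for t in spec["tasks"]}
    chain, frontier = [], [task_id]
    while frontier:
        tid = frontier.pop()
        chain.append(tid)
        frontier.extend(tasks[tid].get("after", []))
    return list(reversed(chain))

print(lineage(PIPELINE_SPEC, "chart"))  # ['load', 'clean', 'chart']
```

Because everything lives in one file, governance questions like "which upstream data feeds this widget?" reduce to a simple traversal like the one above.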

Combining operations that have traditionally been performed by multiple tools within ShareInsights solves problems that have plagued analytics pipelines for ages.

End-to-end Optimization: Because ShareInsights captures the entire analytics spectrum from data processing to visualization, it has full visibility into the nature of the data and the operations performed on it, in both the preparation stage and the visualization stage. This visibility allows ShareInsights to optimize the entire pipeline far better than any single tool could. For example, a tool in the middle of the pipeline, such as an OLAP layer, has no visibility into the workload that the reporting tools consuming its output will generate, making it difficult to build indexes a priori. ShareInsights inherently stores lineage, from data to transformations to widgets, as part of a single text file. Best of all, this optimization happens automatically, without any involvement from the analyst.
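The kind of optimization this visibility enables can be sketched as follows: if the engine can see the widget definitions downstream, it knows exactly which group-by columns the visualization layer will hit and can pre-build only those indexes or aggregates. The widget dictionaries and field names below are invented for illustration.

```python
# Hedged sketch: downstream widget definitions, visible to the engine
# because the whole pipeline is captured in one place. Field names
# ("group_by", "measure") are hypothetical, not a real API.
widgets = [
    {"type": "bar",  "group_by": "region", "measure": "revenue"},
    {"type": "line", "group_by": "month",  "measure": "revenue"},
]

def columns_to_index(widgets):
    """Collect the group-by columns the visualization layer will query,
    so indexes/aggregates can be built a priori."""
    return sorted({w["group_by"] for w in widgets})

print(columns_to_index(widgets))  # ['month', 'region']
```

A standalone OLAP layer would have to guess at (or over-provision for) this workload; with end-to-end visibility the answer is simply read off the pipeline definition.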

Unbeatable DataOps: ShareInsights captures the definition of the entire data pipeline in a text file, in most cases a single one. This makes managing the lifecycle of the entire pipeline, historically a big pain point, as easy as managing the version of that one file. Furthermore, features prevalent in source control systems, such as branching, versioning, and merging, can be used directly. Deploying analytics is also simplified, as there is a clean separation between the analytics pipeline and the analytics architecture.

Collaboration: The only effective way to manage complexity is to enable collaboration amongst analysts. However, multiple tools and technologies mean that it is hard for analysts to share work easily. Combining the capabilities of different tools in one tool solves this problem to a great extent. Furthermore, enforcing clean abstractions for various analytical components such as tasks, widgets, and data sets means that analysts can easily mix and match not just the pipelines but all the intermediate components.

Insight2Action: The target outcome for data analytics has traditionally been to generate insights that data analysts share with others, e.g. managers, so they can act on them. Today, that action happens outside the analytics tools. ShareInsights' Insight2Action functionality allows enterprises to initiate custom actions, ranging from sending a simple text or email to initiating full enterprise workflows, via a simple API when certain conditions are met. Automating the integration of actions with insights removes the human intervention otherwise required to respond to business needs.
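The insight-to-action pattern can be sketched in a few lines: a condition is evaluated over a computed metric, and when it fires, a registered action runs. The function names and payload shape below are purely illustrative and are not the ShareInsights API; a real action would send an email/SMS or start a workflow rather than append to a list.

```python
# Hypothetical sketch of condition -> action wiring. The real
# Insight2Action API differs; names here are invented.
actions_fired = []

def register_action(name):
    """Return a stub action that records its invocation; a real
    deployment would notify a person or kick off a workflow."""
    def action(payload):
        actions_fired.append((name, payload))
    return action

notify_ops = register_action("email_ops")

def check_insight(metric_name, value, threshold, action):
    """Fire the action when the metric crosses its threshold."""
    if value > threshold:
        action({"metric": metric_name, "value": value})

check_insight("churn_rate", 0.12, 0.10, notify_ops)
print(actions_fired)  # [('email_ops', {'metric': 'churn_rate', 'value': 0.12})]
```

The key design point is the decoupling: the insight definition states only the condition, while the action behind it can be swapped, from a text message to a full workflow, without touching the analytics.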

We truly believe that ShareInsights delivers on our goal which is to enable analysts in an organization to work on their data lake and find, act upon, and share insights.