README.md

OAP - Optimized Analytics Package for Spark Platform

OAP - Optimized Analytics Package (previously known as Spinach) is designed to accelerate ad-hoc queries. OAP defines a new Parquet-like columnar storage format and offers a fine-grained hierarchical in-memory cache whose unit is the “Fiber”. In addition, OAP extends the Spark SQL DDL so that users can define custom indices on a relation.

Building

By default, it builds for Spark 2.1.0. To specify the Spark version, use the profile spark-2.1 or spark-2.2.

mvn -DskipTests package
mvn -DskipTests -Pspark-2.2 package

Prerequisites

You should have Apache Spark version 2.1.0 or 2.2.0 installed on your cluster. Refer to the Apache Spark documentation for details.

Run Spark with bin/spark-sql, bin/spark-shell, sbin/start-thriftserver or bin/pyspark and try our examples.

NOTE:
1. For Spark standalone mode, you have to put oap-<version>.jar on both the driver and the executors, since spark.files does not work. Also don't forget to update extraClassPath.
2. For YARN mode, configure spark.driver.memory, spark.memory.offHeap.size and spark.yarn.executor.memoryOverhead (which should be close to offHeap.size) to enable the fiber cache.
3. Comprehensive guidance and examples of OAP configuration are available at https://github.com/Intel-bigdata/OAP/wiki/OAP-User-guide. In brief, the recommended configuration is one executor per node, given the node's full memory and compute capacity.
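
As a rough illustration of note 2, the sketch below assembles the three YARN settings into `--conf` arguments for a Spark launcher. The sizes are placeholder assumptions, not recommended values, and the helper `to_submit_args` is hypothetical. Only the three property names come from the notes above.

```python
# Placeholder sizes; tune them for your own cluster.
OFF_HEAP_GIB = 10  # assumed size of the off-heap region

yarn_conf = {
    "spark.driver.memory": "4g",                  # assumed value
    "spark.memory.offHeap.size": f"{OFF_HEAP_GIB}g",
    # Kept close to the off-heap size, expressed in MiB.
    "spark.yarn.executor.memoryOverhead": str(OFF_HEAP_GIB * 1024),
}


def to_submit_args(conf):
    """Turn a property dict into a flat list of --conf key=value arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(yarn_conf)))
```

The printed string can be appended to a bin/spark-shell or bin/spark-sql command line.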

Features

Index - BTREE, BITMAP
Indexing is an optimization that is widely used in traditional databases. OAP adopts the two most commonly used index types.
BTREE index (the default in 0.2.0) is intended for datasets that have many distinct, randomly distributed values, such as telephone numbers or ID numbers.
BITMAP index is intended for datasets with a limited number of distinct values, such as state or age.
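
To make the difference between the two index types concrete, here is a minimal Python sketch. It is not OAP's implementation: the bitmap index uses plain integers as bitsets, and a sorted list searched with `bisect` stands in for a B-tree's ordered keys. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int used as a bitset of row ids."""
    bitmaps = defaultdict(int)
    for row_id, value in enumerate(values):
        bitmaps[value] |= 1 << row_id
    return dict(bitmaps)


def rows_from_bitmap(bitmap):
    """Decode a bitset back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def build_sorted_index(values):
    """Sorted (value, row_id) pairs; a stand-in for a B-tree's ordered keys."""
    return sorted((v, i) for i, v in enumerate(values))


def range_lookup(index, low, high):
    """Row ids whose value lies in [low, high], located by binary search."""
    keys = [v for v, _ in index]
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return sorted(row_id for _, row_id in index[start:end])


# Few distinct values: one bitmap per value, combined with bitwise OR.
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
bitmap_index = build_bitmap_index(states)
print(rows_from_bitmap(bitmap_index["CA"]))                       # [0, 2, 5]
print(rows_from_bitmap(bitmap_index["NY"] | bitmap_index["TX"]))  # [1, 3, 4]

# Many distinct values: an ordered structure answers range queries.
ids = [1007, 1003, 1010, 1001, 1005]
print(range_lookup(build_sorted_index(ids), 1003, 1007))          # [0, 1, 4]
```

With few distinct values the bitmap index stays small, because it holds one bitset per value. With mostly unique values it would need one bitset per row, which is why an ordered structure suits that case better.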

Statistics - MinMax, Bloom Filter
Sometimes reading the index adds cost to a query, for example when all the data has to be read anyway because there is no valid filter. OAP automatically writes statistics to the index files, depending on the type of index in use. With these statistics, OAP uses the index only when it can actually speed up execution.
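
The following sketch shows one way MinMax statistics could drive that decision. The three outcomes and the function name `plan_with_minmax` are assumptions made for illustration; OAP's actual decision logic may differ.

```python
def plan_with_minmax(stats_min, stats_max, low, high):
    """Pick an access path for the filter `low <= col <= high`,
    given the column's stored minimum and maximum."""
    if high < stats_min or low > stats_max:
        return "skip"        # no row can match the filter
    if low <= stats_min and high >= stats_max:
        return "full_scan"   # every row matches, so the index only adds overhead
    return "use_index"       # the filter narrows the range, so the index can help


print(plan_with_minmax(10, 50, 60, 70))   # skip
print(plan_with_minmax(10, 50, 0, 100))   # full_scan
print(plan_with_minmax(10, 50, 20, 30))   # use_index
```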

Fine-grained cache
An OAP format data file consists of several row groups. Each row group holds the columns defined by the user's table schema. The data of one column within one row group is called a "Fiber", and it is the minimum cache unit.
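
As a toy illustration of caching at this granularity, the sketch below keys cache entries by (file, row group, column) and evicts the least recently used entry. The key layout, the LRU policy and the class name `FiberCache` are assumptions for this example, not OAP's implementation.

```python
from collections import OrderedDict


class FiberCache:
    """Toy LRU cache with one entry per (file, row group, column)."""

    def __init__(self, capacity):
        self.capacity = capacity      # maximum number of fibers kept
        self._fibers = OrderedDict()
        self.loads = 0                # number of times the loader was called

    def get(self, file_name, row_group, column, loader):
        key = (file_name, row_group, column)
        if key in self._fibers:
            self._fibers.move_to_end(key)      # mark as most recently used
            return self._fibers[key]
        data = loader(file_name, row_group, column)
        self.loads += 1
        self._fibers[key] = data
        if len(self._fibers) > self.capacity:
            self._fibers.popitem(last=False)   # evict the least recently used
        return data


def fake_loader(file_name, row_group, column):
    # Stand-in for reading one column chunk from storage.
    return f"{file_name}/rg{row_group}/{column}"


cache = FiberCache(capacity=2)
cache.get("data.oap", 0, "id", fake_loader)
cache.get("data.oap", 0, "id", fake_loader)     # served from the cache
cache.get("data.oap", 0, "name", fake_loader)
print(cache.loads)  # 2
```

Because entries are per column, a query that touches only `id` never forces `name` into memory.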

Parquet Data Adaptor
Parquet is the most popular and recommended data format in the Spark open-source community. Since many potential users already store their data in Parquet, it would be expensive for them to migrate existing data to OAP. So we designed a compatibility layer that lets users create indices directly on top of Parquet data. With the Parquet reader we implemented, queries over indexed Parquet data are also accelerated, though not as much as with OAP.

Usage: The fiber cache resides in off-heap storage memory, and its size is roughly spark.memory.offHeap.size * 0.7. Because execution can borrow some memory from storage in UnifiedMemoryManager mode, the actual size may vary during execution.
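
A quick worked example of the sizing rule above, assuming spark.memory.offHeap.size is set to 10g. The helper name is hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def fiber_cache_bytes(off_heap_bytes, storage_percent=70):
    """Approximate fiber cache size: 70% of spark.memory.offHeap.size."""
    return off_heap_bytes * storage_percent // 100


off_heap = 10 * GIB  # assumed spark.memory.offHeap.size=10g
print(fiber_cache_bytes(off_heap) // MIB, "MiB")  # 7168 MiB
```

So a 10 GiB off-heap region leaves roughly 7 GiB for fibers, less whatever execution borrows at runtime.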

Full Scan Threshold - If the analysis result is above this threshold, OAP scans the whole data file instead of reading the index data.
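
The sketch below shows how such a threshold might be applied. It assumes, purely for illustration, that the "analysis result" is the estimated fraction of rows matching the filter. The function name, parameter names and the threshold value are all hypothetical.

```python
def choose_access_path(estimated_matching_rows, total_rows, full_scan_threshold):
    """Return 'full_scan' when the estimated selectivity exceeds the threshold."""
    if total_rows == 0:
        return "full_scan"   # nothing to index into
    ratio = estimated_matching_rows / total_rows
    return "full_scan" if ratio > full_scan_threshold else "index_scan"


print(choose_access_path(900, 1000, 0.2))  # full_scan
print(choose_access_path(10, 1000, 0.2))   # index_scan
```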