Here is a very basic sample of connecting to a drillbit via Java and retrieving records.

The code connects to the drillbit using DriverManager and the constructed URL. A Statement object is then created and used to execute the SQL query, which returns a ResultSet. The while loop iterates over the records in the ResultSet and prints the value of the column labeled "name". Finally, the ResultSet and the connection are closed.

```java
package drilltest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.drill.jdbc.Driver;

public class DrillTest {

    public static void main(String[] args) throws SQLException, ClassNotFoundException {
        // make sure the Drill JDBC driver class is loaded and registered
        Class.forName(Driver.class.getName());
        // the drillbit host and port below are examples - adjust them to your environment
        Connection connection = DriverManager.getConnection("jdbc:drill:drillbit=localhost:31010");
        Statement statement = connection.createStatement();
        // placeholder query - substitute your own storage plugin, table and column names
        ResultSet resultSet = statement.executeQuery("SELECT name FROM dfs.`/path/to/data`");
        while (resultSet.next()) {
            System.out.println(resultSet.getString("name"));
        }
        resultSet.close();
        connection.close();
    }
}
```

I hope this gets you going with the exciting possibilities of Apache Drill.

I think all of us have trouble getting rid of data over time. We accumulate data in data marts over the years, but it is much harder to let go of it. OK - we can delete the data. It's gone. But it's not that easy. Maybe the data will be needed in the future. The business sometimes has great ideas, and when they plan to implement one, they want statistics about what the data looked like in the past, to determine whether it's worth the effort.

By the way, I am not talking about a data warehouse right now. I am thinking more about data marts that serve specific reporting or analytics purposes. In a data warehouse you might actually want to store many years of data, and there are certainly strategies for doing so while maintaining good performance.

Another aspect is that more data makes everything slower: ETL processes, updates, inserts. The database has to update not only the data but also the indexes, and the more data there is, the slower the process gets. Again, bigger hardware, more hardware or more memory can solve this. But that is often a question of money...

On the other hand, there are legal requirements. Maybe we have to keep the data for 5 or 10 years. But when the users usually only care about the last two or three years, what do we do with the rest? We could make a database dump. But there is the risk that a future import will no longer work once we get a new database server version. And of course we might change the database schemas over time, so we cannot simply import the old data anymore.

So what we are looking for is an easy way to store the data. We do not want to spend too much effort on old data that probably nobody ever looks at - or only rarely. But it should still be accessible.

I have looked at Apache Drill over the last few weeks to understand it better, and I believe it would be a good candidate for the use case above. Drill is very easy to set up. It runs on a single machine if that is all you need, yet it scales horizontally if necessary. It can work with CSV files as well as files in Parquet format. It also allows you to query a whole directory of files: you can put multiple files into one folder and query them together in one go.
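To give an idea of what that looks like, here are two small example queries; dfs is Drill's file-system storage plugin, and the paths are invented for illustration:

```sql
-- query a single file through the dfs storage plugin (the path is an example)
SELECT * FROM dfs.`/archive/sales/2010/sales_2010.csv` LIMIT 10;

-- query every file in a folder (and its subfolders) in one go
SELECT COUNT(*) FROM dfs.`/archive/sales`;
```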

So why CSV files? CSV is a simple format that does not depend on software versions and can be read with many different tools. It can also be converted and compressed easily.

And Parquet files? I am not an expert on the format. But the idea is that it is organized by columns rather than by rows. Storage is optimized per column, and since a column - unlike a row - holds a single data type, queries on columns become more efficient and fast. Two things I like about that:

- Apache Drill has an easy way of converting CSV files into Parquet files (a sketch follows below). So if I decide to use the Parquet format, it's easy to do.
- Parquet allows the data to be partitioned into separate files by a given partitioning column (or several). So one can split the data - in the use case above, "time" would be a good candidate: we partition by year, or by year and month. That depends on the amount of data and of course on the queries that are likely to be run against it.

But again, we do not want to spend too much time on this - it should be easy, fast and straightforward. And it should work the same way for the different database systems and other data sources that we have.
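As a sketch of what that conversion could look like - the table name, path and columns are all made up for illustration:

```sql
-- write the output of CREATE TABLE AS in Parquet format
ALTER SESSION SET `store.format` = 'parquet';

-- convert a CSV extract into a Parquet table partitioned by year (Drill 1.1+)
CREATE TABLE dfs.tmp.`sales_archive`
PARTITION BY (sale_year)
AS SELECT CAST(columns[0] AS INT)           AS sale_year,
          TO_DATE(columns[1], 'yyyy-MM-dd') AS sale_date,
          CAST(columns[2] AS DOUBLE)        AS amount
   FROM dfs.`/archive/sales_old.csv`;
```

Note that PARTITION BY requires the Parquet output format, which is why the session option is set first; Drill's default CSV reader also exposes the fields as a `columns` array, hence the casts to named, typed columns.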

Example: we have 10 years of data in our database, and we assume that we only need three of them: the current year and the last two. The rest is data nobody usually looks at. So we extract the data for the other seven years from the database into a CSV file - or into multiple ones, e.g. one per year. In the best case, we can export the data in a ready-to-use format for Drill. Otherwise we can use tools such as sed, grep and cut to adjust the data.
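As an illustration, the extraction could be one query per year against the source database; the table and column names here are invented, and the actual export mechanism depends on your database:

```sql
-- export one year of old data; run once per year to get one file per year
SELECT *
FROM sales
WHERE sale_date >= DATE '2010-01-01'
  AND sale_date <  DATE '2011-01-01';
```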

Then we go to our running Apache Drill instance. We create a new storage location for our file(s) in the configuration, and then we copy the files into a folder on a hard disk or shared storage. That's it. The data is ready to be used through the Drill web interface, a query tool, or programmatically via the JDBC driver. If we decide to convert the data to Parquet format, we write a CREATE TABLE AS statement that defines the columns (and column names) and the partitioning (available from Drill 1.1 onwards) and run it. The Parquet file(s) will be created, and then we can query the data efficiently through Drill as well.
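A query against the converted table might then look like this (using the invented names from the sketch above). Because the table is partitioned by year, Drill only has to read the files for the requested year:

```sql
-- only the Parquet files of the 2012 partition need to be scanned
SELECT SUM(amount)
FROM dfs.tmp.`sales_archive`
WHERE sale_year = 2012;
```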

I think most of this can even be automated, so that there is minimal effort to create the files, store them and potentially convert them to Parquet format. The data remains accessible, and we can remove it from the production data mart or database.

In one of the next entries here, I will show an example with real data and screenshots to make the process more visible.