September 11, 2017

Have you thought of trying out AWS Athena to query your CSV files in S3? This post outlines some steps you would need to do to get Athena parsing your files correctly. Let's walk through it step by step.

Our pet data has some dates, doubles, ints, and strings with commas in them. It is a good representation of the data in a typical CSV file. I uploaded this pet.csv into an S3 location, s3://test-athena-linh/pet/, then headed to AWS Athena to create the table. (NOTE: run any required DROP TABLE statement before running the CREATE statement.)
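For reference, the table definition looked roughly like this — reconstructed from the columns and types used later in this post, so treat it as a sketch (a plain delimited definition like this implies LazySimpleSerDe, Athena's default for CSV):

CREATE EXTERNAL TABLE athena_test.pet_data (
    date_of_birth DATE,
    pet_type STRING,
    pet_name STRING,
    weight DOUBLE,
    age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://test-athena-linh/pet/'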

Our DDL is quite ambitious, with double, int, and date as data types. Let's see if Athena can parse our data correctly. Viewing the data is interesting: with the above table definition, Athena doesn't parse quoted commas correctly using LazySimpleSerDe. Because a field has a comma in the middle of it, the columns are shifted.

Columns are shifted with quoted comma

Quirk #1: with all default settings, Athena doesn't deal with quoted commas in CSV very well.

Now, with all columns declared as string, our CSV with quoted commas can be read correctly.
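A sketch of the revised definition, assuming all columns are declared as string and the table is switched to OpenCSVSerde to handle the quoted fields:

CREATE EXTERNAL TABLE athena_test.pet_data (
    date_of_birth STRING,
    pet_type STRING,
    pet_name STRING,
    weight STRING,
    age STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://test-athena-linh/pet/'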

No error when parsing String data type

From the output, we can see the header row is included, and it breaks type parsing. Searching on the Internet suggested OpenCSVSerde has a TBLPROPERTIES config, 'skip.header.line.count'='1', which could be useful. However, trying it out in Athena didn't lead to the expected outcome. AWS Support confirmed: "it's a known issue in Athena that property "skip.header.line.count"="1" does not work because of the following Presto issue: https://github.com/prestodb/presto/issues/1848. Hive does honor the skip.header.line property and skips header while querying the table. However, presto displays the header record when querying the same table. The Athena Product team is aware of this issue and is planning to fix it."
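For reference, the property in question would be attached to the table definition like this (though, per the support response above, Athena ignored it at the time):

CREATE EXTERNAL TABLE athena_test.pet_data (
    ...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://test-athena-linh/pet/'
TBLPROPERTIES ('skip.header.line.count'='1')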

Quirk #3: header row is included in result set when using OpenCSVSerde.

We can certainly exclude the header row with a query condition, but we can't do arithmetic operations (SUM, AVG) on strings.

Arithmetic on Strings doesn't work

To move forward with our data while accommodating all the Athena quirks so far, we will need to create the table with all string columns and do type conversion on the fly.

SELECT SUM(weight)
FROM
(
    SELECT
        date_of_birth,
        pet_type,
        pet_name,
        cast(weight AS DOUBLE) AS weight,
        cast(age AS INTEGER) AS age
    FROM athena_test."pet_data"
    WHERE date_of_birth <> 'date_of_birth'
)

Type casting in inner query allows outer query to do arithmetic

Yayyy, we got the expected result.

Let's attempt to parse the date in the format "MMMM dd, yyyy" (e.g. "July 10, 2017"), as we might want to do arithmetic on dates rather than leaving them as strings. Presto has a date_parse function that takes a string and a pattern; the date patterns are defined here: https://prestodb.io/docs/current/functions/datetime.html. The above date pattern is represented as '%M %d, %Y'.
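For example, parsing our sample date with that pattern:

SELECT date_parse('July 10, 2017', '%M %d, %Y')
-- returns the timestamp 2017-07-10 00:00:00.000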

Our complete casting of pet_data to its right data types looks like this:

SELECT
    date_parse(date_of_birth, '%M %d, %Y') AS date_of_birth,
    pet_type,
    pet_name,
    cast(weight AS DOUBLE) AS weight,
    cast(age AS INTEGER) AS age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'

Unfortunately, we will need to wrap this as a subquery in any query we run against our pet_data table. It would be better to encapsulate the inner query in a view, but as far as I know Athena doesn't support views. I also tried a version of Presto's CREATE TABLE AS but hit an error; I think Athena SQL is not the full set of Presto SQL: https://prestodb.io/docs/current/sql/create-table-as.html

-- This query has an error
CREATE TABLE athena_test.pet_data_typed (date_of_birth, pet_type, pet_name, weight, age)
AS
SELECT
    date_of_birth,
    pet_type,
    pet_name,
    cast(weight AS DOUBLE) AS weight,
    cast(age AS INTEGER) AS age
FROM athena_test."pet_data"
WHERE date_of_birth <> 'date_of_birth'

Athena doesn't support CREATE TABLE AS

Quirk #4: Athena doesn't support views.

From my trials with Athena so far, I am quite disappointed in how Athena handles CSV files. There is a lot of fiddling around with type casting. I'm not sure what I did wrong there; if you have a better way, please point out how I could improve on the above, and thanks in advance. I've heard Athena might work better with JSON or Avro files, so I will try that out next. I will also write about my experience with Google BigQuery in future blog posts. Stay tuned!

February 25, 2017

Data warehouses come in different shapes and sizes; our legacy implementation is Microsoft SQL Server 2008 with many tables inside. Recently we had to export a sample of all tables in our database for auditing purposes, specifically to verify there is no PII (Personally Identifiable Information) in our database.

A simple approach is doing a SELECT TOP 100 * from each table. This can be a little tedious if you have hundreds of tables. One way is to use a loop to go through each of the tables.

This query will return the names of all the tables in the current database.

SELECT name FROM sysobjects WHERE xtype='U' order by 1;

Or, if your database is organised into schemas, you will need the schema name together with the table name.

SELECT
t.name AS table_name,
SCHEMA_NAME(schema_id) AS schema_name
FROM sys.tables AS t
ORDER BY schema_name, table_name;

With that principle, we can loop through the tables and apply SELECT TOP 100 * to each one.
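A sketch of such a loop using a cursor over sys.tables (the variable and cursor names are my own):

DECLARE @schema sysname, @table sysname, @sql nvarchar(max);
DECLARE table_cursor CURSOR FOR
    SELECT SCHEMA_NAME(schema_id), name FROM sys.tables;
OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @schema, @table;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Build and run SELECT TOP 100 * for the current table
    SET @sql = N'SELECT TOP 100 * FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N';';
    EXEC sp_executesql @sql;
    FETCH NEXT FROM table_cursor INTO @schema, @table;
END
CLOSE table_cursor;
DEALLOCATE table_cursor;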

Well, it turns out that MSSQL has a built-in stored procedure to loop through each of the tables.

exec sp_MSforeachtable @command1='SELECT TOP 100 * FROM ?'

Another improvement: the statement 'SELECT TOP 100 *' does just that: it returns the top 100 rows of the table, which might not give a good idea of the data in it. The top 100 rows might contain only bad data and not give us a grasp of the rest of the table. We can use TABLESAMPLE to get a more random selection of rows, a closer approximation to a true sample of the data.
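For example, combining the stored procedure above with TABLESAMPLE (a sketch; note that TABLESAMPLE samples whole data pages, so the row count is approximate and the rows are not perfectly random):

exec sp_MSforeachtable @command1='SELECT * FROM ? TABLESAMPLE (100 ROWS)'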