Presto supports multiple array and JSON functions that you can use to write queries and get the required results. There is no single way to define the CREATE TABLE and the queries that follow. You can have the CREATE TABLE predefine the schema structure so that later queries refer to the elements they need, or you can define the element as a string and use functions in your queries to parse that string into the required results.

Now, I was able to use the following CREATE TABLE syntax on DynamoDB items having a list of strings, and I was able to flatten out the list of strings using functions like CAST. Please note that this is not the only way to define the table and query it. My query might be overcomplicating what you are trying to get; there might be a simpler way as well. So, it's really important that you understand the data types you define and what each query returns, and that you use the correct functions, since each function takes one datatype and returns another.

# damageShapes can take different structures in the schema, e.g. damageShapes struct<L:array<struct<S:string>>> or damageShapes map<string,array<map<string,string>>> or struct<L:array<map<string,string>>>.

# Or it may be a simple string, in which case you need to rely on queries to parse this string.

# Since damageshapes.l is array(row(s varchar)), you can look for Presto functions that can flatten this into the required format, which is array<varchar>. If there is no direct function, you might need to do 2 conversions.

# values_of_damageshapes is in JSON format and you can use JSON functions to get the necessary output.

# https://prestodb.io/docs/current/functions/json.html

-------------------------------------------------

# Here I am converting values_of_damageshapes (json) to ARRAY(ARRAY(varchar)) so that I can use an array function called flatten, which flattens an array(array(T)) to an array(T) by concatenating the contained arrays.
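# For example, a query along these lines could do the conversion and flattening (a hedged sketch; the table name my_table is hypothetical):

SELECT flatten(CAST(json_parse(values_of_damageshapes) AS ARRAY(ARRAY(VARCHAR)))) AS all_values
FROM my_table;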

S3 server access logs can grow very big over time, and it is very hard for a single machine to process, query and analyze all these logs. So, we can use distributed computing to query the logs quickly. Ideally we might think of Apache Hadoop/Hive/Spark/Presto etc. to process these logs, but the simplicity of the AWS Athena service as a serverless model makes it even easier. This article will guide you through using Athena to process your S3 access logs with example queries, and covers some partitioning considerations which can help you query TBs of logs in just a few seconds.

The server access log files consist of a sequence of new-line delimited log records. Each log record represents one request and consists of space delimited fields. The following is an example log consisting of six log records.

Data Format considerations:

On first look, the data format appears simple: a text file with a space field delimiter and newline (\n) record delimiter. However, there is a catch in this data format: columns like Time, Request-URI and User-Agent can have spaces in their data (e.g. [06/Feb/2014:00:00:38 +0000], "GET /gdelt/1980.csv.gz HTTP/1.1" and aws-cli/1.7.36 Python/2.7.9 Linux/3.14.44-32.39.amzn1.x86_64), which will interfere with the SerDe parsing.

For the Request-URI and User-Agent fields, the data is enclosed in quotes with spaces inside. This can be parsed by any SerDe that supports quotes. Unfortunately, Athena does not support such SerDes, like org.apache.hadoop.hive.serde2.OpenCSVSerde, which does have a quotes feature.

The org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe included with Athena does not support quotes yet.

A custom SerDe called com.amazon.emr.hive.serde.s3.S3LogDeserializer ships with all EMR AMIs just for parsing these logs, and a table using it can be created easily. However, this SerDe is not supported by Athena.

Looking at these limitations, org.apache.hadoop.hive.serde2.RegexSerDe seems like the only feasible option. Writing the RegEx for your log structure was a bit time consuming, but I was able to write it with the help of the ELB log structure example and http://regexr.com/ .

The pattern can also be found at http://regexr.com/3erct , where I removed some extra escape characters like \'s. In the DDL, we need to add these escape characters back, as shown in the group patterns below and in the DDL sketch that follows them.

([^ ]*) – matches a continuous string. Because even number-like fields such as 'Time' can return a '-' character, I did not use the ([-0-9]*) regex that is normally used for numbers. For the same reason, I had to use STRING for all fields in the DDL. We can use Presto string functions in DML to convert the strings and apply the appropriate comparison operators.

\\[(.*?)\\] or \\\[(.*?)\\\] – is for the Time field – for a string like [06/Feb/2014:00:00:38 +0000], it will match the contents without the brackets and return a string like 05/Dec/2016:16:56:36 +0000, which is easier for querying between times.

\\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" – is for Request-URI field – removes the quotes and splits the "GET /mybucket?versioning HTTP/1.1" into three groups.

(\"[^\"]*\") – is for User-Agent – matches the whole string with quotes.

(-|[0-9]*) – is for HTTPstatus, which is always a number like 200 or 404. If you expect a '-' string, you can use ([^ ]*) instead.
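Putting the pieces above together, the table definition could look like the following. This is a hedged sketch reconstructed from the group descriptions above: the table name is hypothetical, the column list and field order should be verified against your actual log lines, the Referrer group is assumed to be quoted like User-Agent, the exact escaping of the regex inside the DDL may need adjusting, and a trailing catch-all group may be needed if AWS adds fields to the end of the line.

CREATE EXTERNAL TABLE IF NOT EXISTS s3_accesslogs (
  BucketOwner string, Bucket string, RequestDateTime string, RemoteIP string,
  Requester string, RequestID string, Operation string, Key string,
  RequestURI_operation string, RequestURI_key string, RequestURI_httpProtoversion string,
  HTTPstatus string, ErrorCode string, BytesSent string, ObjectSize string,
  TotalTime string, TurnAroundTime string, Referrer string, UserAgent string, VersionId string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") (\"[^\"]*\") ([^ ]*)'
)
LOCATION 's3://s3server-accesslogs-bucket/logs/';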

WHERE parse_datetime(RequestDateTime,'dd/MMM/yyyy:HH:mm:ss Z')>parse_datetime('2016-12-05:16:56:40','yyyy-MM-dd:HH:mm:ss')

GROUP BY key

ORDER BY cnt DESC;
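For context, the clauses above would typically sit in a full query like the following (a hedged sketch that assumes the hypothetical s3_accesslogs table and column names from the DDL sketch above):

SELECT Key, count(*) AS cnt
FROM s3_accesslogs
WHERE parse_datetime(RequestDateTime,'dd/MMM/yyyy:HH:mm:ss Z') > parse_datetime('2016-12-05:16:56:40','yyyy-MM-dd:HH:mm:ss')
GROUP BY Key
ORDER BY cnt DESC;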

Any field can be set to “-” to indicate that the data was unknown or unavailable, or that the field was not applicable to this request.

Considerations:

1. From time to time, AWS might extend the access log record format by adding new fields to the end of each line. The DDL/RegEx must be written to handle trailing fields that it does not understand.
2. If the S3 directory is huge with a lot of log files, then depending on the query you might encounter Athena's "Query timeout: 30 minutes" ( http://docs.aws.amazon.com/athena/latest/ug/service-limits.html ). Some optimizations are discussed here: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

Partitioning / Compression :

Over time, the logs in your target prefix (s3://s3server-accesslogs-bucket/logs/) can grow big. This is especially true if you use the same bucket/prefix as the target for other buckets' access logs. If you enable logging on multiple source buckets that identify the same target bucket, the target bucket will have access logs for all those source buckets, but each log object will report access log records for a specific source bucket.

Amazon S3 uses the following object key format for the log objects it uploads to the target bucket: TargetPrefixYYYY-mm-DD-HH-MM-SS-UniqueString. So, it basically creates many files, each containing a bunch of log records, in the S3 bucket.

Querying can be slow if there are a large number of small files in text format.

To speed things up, we need at least one of these options to consolidate these logs.
1. PARTITIONING – work on a subset of files instead of going through all files.
2. CONVERSION TO COLUMNAR STORAGE – like ORC or Parquet, for faster query performance.
3. COMPRESSION – Snappy or GZ.
4. BLOOM FILTERS.

Note that S3 access log data is not partitioned and is not organized with prefixes like /year/month/day or /year=2010/month=10/day=11/, which Athena could use to create static or dynamic partitions respectively. Access logs are also uncompressed, flat text files.

Unfortunately, we cannot enable all of the above options just by defining a DDL. We need to copy/move our text-based data set with the above options enabled. One option is to use AWS EMR to periodically structure and partition the S3 access logs so that you can query those logs easily with Athena. At this moment, it is not possible to use Athena itself to convert non-partitioned data into partitioned data.

On EMR, we could use either
– HIVE : to partition/compress/convert, or
– the S3-DIST-CP utility : to compress/groupBy.

Note that we can use EMR's Presto / Spark etc. to query the logs, but Athena, being serverless, is much easier to use since there is nothing to manage.

3. Run queries on this table with WHERE clauses on a specific year/month/day partition to speed up querying.

Ex: select * from Accesslogs_partitionedbyYearMonthDay where request_year='2016' AND request_month='12';

-------------------------------------------------

Continuing the Partitioning for incoming logs :

Note that the above discussion talks about a one-time creation of all partitions. There are two things to notice:
1. While the above job is running, there could be new incoming logs which will not be partitioned into the destination.
2. After the job is completed, you keep getting new logs.

Which S3 files will be missed by the above job really depends on the hive client's split calculation on your source S3 bucket. When you run your 'insert overwrite' command, the hive client calculates splits initially by listing all objects inside the S3 prefix. This usually takes minutes and depends on the number of S3 objects. After the split calculation is done, the actual partitioning work is done by mappers/reducers. So, any log files pushed to the source prefix after the split calculation is done will not be included in your job.

When you want partitions to be added automatically to the Athena table (Accesslogs_partitionedbyYearMonthDay) as new files with newer request days arrive in your S3 access log bucket/prefix, you will need a 'batch job on schedule' or a 'trigger mechanism' to make sure the new data is also partitioned accordingly.
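Whichever mechanism you choose, registering a new day usually boils down to a statement like the following (a hedged sketch: the request_year/request_month column names follow the example query above, request_day is assumed from the table name, and the partitioned S3 path is a placeholder):

ALTER TABLE Accesslogs_partitionedbyYearMonthDay
ADD IF NOT EXISTS PARTITION (request_year='2016', request_month='12', request_day='06')
LOCATION 's3://s3server-accesslogs-bucket/partitioned/2016/12/06/';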

Batch job on schedule :

Note that our source S3 access logs are dumped into one single directory, with file names like 2016-12-05-17-17-06-5D51435906E10586. Since we run our batch jobs after the above one-time bulk job, on the daily (or more frequent) batches we need to avoid the files that were already processed by the previous one-time job. For this, we can rely on the file name itself to get a list of newly generated files and then use an ETL job to convert these files to partitioned data. We could also maintain a manifest file listing the files already processed and the files which still need to be processed.

Managing all of that is beyond the scope of this blog. There are multiple ways to do the ETL. One way is to run an AWS Data Pipeline with EMR, on a schedule, and run the Hive queries on the new data.


Trigger based solution :

If you want to run a partition job on every S3 PUT, or on batches of PUTs, you can use AWS Lambda, which can trigger a piece of code on every S3 object PUT. You can get metadata about that PUT and handle it in your Lambda code accordingly to fire off an ETL job etc. to do the partitioning. AWS Lambda comes with the AWS SDK, and you can make all types of AWS API calls with the proper permissions.
http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html

Example 3 :

Example 4 :

If you want to concatenate matching files in the root directory, and all matching files inside a 'sample_directory', into a single file and compress it in gzip format, a groupBy pattern like the one at http://regexr.com/3ftn9 will concatenate all matched file contents and create one .gz file, as in the sketch below.
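As a hedged illustration only (the source/destination buckets and the groupBy pattern here are placeholders, not the exact pattern from the link above), an s3-dist-cp invocation of that shape could look like:

s3-dist-cp --src s3://s3server-accesslogs-bucket/logs/ \
  --dest s3://s3server-accesslogs-bucket/consolidated/ \
  --groupBy '.*(2016-12-05).*' \
  --outputCodec gz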

Task Runner Configuration Options :

--tasks : The number of task poll threads to run simultaneously. Optional. The default is 2. However, these threads are just for polling.

Maximum number of Activities that can be run in parallel by Taskrunner :

If you have 10 activities (which may be from different Data Pipelines) currently running on a single Task Runner worker group, new activities cannot be processed by this Task Runner and will wait for the previous activities to finish. In other words, a Task Runner tied to a worker group has a hard limit of a maximum of 10 activities executing in parallel.

For example :

Suppose 10 activities are to be executed. By default the poller pulls 2 tasks (activities) per second. In the next second it pulls another 2 activities, so at the end of 5 seconds there will be 10 activities submitted by the poller to the worker group.

Help on this command, or on AgentOptions, lists the options that you can specify for the agent. So, you could have multiple agents, each running with a different configuration. Currently (Oct 2016), each agent can only have one authentication configuration set, so each agent process can only write to streams in just one account. With separate config files you can therefore send data to different accounts. The authentication for each agent can use IAM user credentials or an assumed role.

AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.

CloudTrail usually delivers the logs to an S3 bucket, organized by account, region and date (Accounts/Regions/date/logFile.json.gz).

CloudTrail recommends using the aws-cloudtrail-processing-library, but it may be complex if you wish to run ad-hoc queries over a huge number of big log files quickly. If there are many files, it may also be hard to download all the logs to a Linux/Unix node, unzip them and do RegEx matching on all these files. So, we can use a distributed environment like Apache Hive on an AWS EMR cluster to parse/organize all these files using very simple SQL-like commands.

This article guides you through querying your CloudTrail logs using EMR Hive, and provides some example queries which may be useful in different scenarios. It assumes that you have a running EMR cluster with the Hive application installed and that you have explored it a bit.

# Each log file has a structure similar to the following with nested JSON structure.

# The following example shows that an IAM user named Alice

# used the AWS CLI to call the Amazon EC2 StartInstances action

# by using the ec2-start-instances command for instance i-ebeaf9e2.

{
  "Records": [{
    "eventVersion": "1.0",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "EX_PRINCIPAL_ID",
      "arn": "arn:aws:iam::123456789012:user/Alice",
      "accessKeyId": "EXAMPLE_KEY_ID",
      "accountId": "123456789012",
      "userName": "Alice"
    },
    "eventTime": "2014-03-06T21:22:54Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StartInstances",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "205.251.233.176",
    "userAgent": "ec2-api-tools 1.6.12.2",
    "requestParameters": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-ebeaf9e2"
        }]
      }
    },
    "responseElements": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-ebeaf9e2",
          "currentState": {
            "code": 0,
            "name": "pending"
          },
          "previousState": {
            "code": 80,
            "name": "stopped"
          }
        }]
      }
    }
  },
  ...additional entries...
  ]
}

The following Hive statements show how to create a Hive table that references the CloudTrail S3 bucket. CloudTrail data is processed by the CloudTrailInputFormat implementation, which defines the input data splits and key/value records. The CloudTrailLogDeserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types in the table. Data to be written (such as via an INSERT statement) is translated by the Serializer class defined in the SerDe into the format that the OUTPUTFORMAT class (HiveIgnoreKeyTextOutputFormat) can write.

These classes are part of the EMR goodies JARs under /usr/share/aws/emr/goodies/lib/ (for example, EmrHadoopGoodies-x.jar) and are automatically included in the Hive classpath. Hive can also automatically decompress the GZ files. All you need to do is run queries very similar to standard SQL commands. Some sample queries are included below.


# Invoke the Hive shell by typing hive on the EMR master node.

# You can also use the HUE UI to run the following commands.

$ hive

# Since the cloudtrail log files are organized into directories,

# we want Hive to recursively parse through all these directories and work on the files.

# So, we set the following option.

set mapred.input.dir.recursive=true;

# You can use either tez or mr as the Hive execution engine. Tez is the default on EMR 5.x.

set hive.execution.engine=mr;

# Using an EXTERNAL table references the S3 folder and its directories.

# This is just a data-definition statement and does not involve any data transfer.
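# The original DDL was not preserved here; the following is a hedged sketch of what such a
# CREATE EXTERNAL TABLE might look like, based on the classes described above. The exact
# SerDe/InputFormat class names and the column list should be verified against the EMR
# goodies JARs on your cluster, and the S3 location/account id are placeholders.

CREATE EXTERNAL TABLE IF NOT EXISTS cloudtrail_logs_2016 (
  eventVersion STRING,
  userIdentity STRUCT<type:STRING, principalId:STRING, arn:STRING, accountId:STRING, userName:STRING>,
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIpAddress STRING,
  userAgent STRING,
  requestId STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-cloudtrail-bucket/AWSLogs/<account-id>/CloudTrail/';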

BETWEEN TO_UNIX_TIMESTAMP("2016-10-26T14:00:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'")

AND

TO_UNIX_TIMESTAMP("2016-10-26T15:05:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'");

# QUERY 6

--Show all API calls made by given user

SELECT DISTINCT(eventName)

FROM cloudtrail_logs_2016

WHERE userIdentity.principalId="123456";

# QUERY 7

--Show count of different clients used

SELECT userAgent, count(requestId) AS cnt

FROM CloudTrailTable

GROUP BY userAgent

ORDER BY cnt DESC;

EMR uses an instance profile role on its nodes to authenticate requests made to the CloudTrail bucket. The default IAM policy on the role EMR_EC2_DefaultRole allows access to all S3 buckets. If your cluster does not have access, you may need to make sure the instance profile/role has access to the necessary S3 CloudTrail bucket.

Do not run any INSERT OVERWRITE on this Hive table. If EMR has write access to the S3 bucket, an INSERT OVERWRITE may delete all logs from this bucket. Please check the Hive language manual before attempting any commands.

CloudTrail JSON elements are extensive and differ for different kinds of requests. This SerDe (which is more or less abandoned by the EMR team) doesn't include all possible fields in CloudTrail. For example, if you try to query requestParameters, it gives: FAILED: SemanticException [Error 10004]: Line 6:28 Invalid table alias or column reference 'requestparameters'.

If your CloudTrail bucket has a large number of files, Tez's grouped splits or MR's input split calculation may take considerable time and memory. Make sure you allocate enough resources to the hive-client or tez-client.

2017-12-11T23:22:19.685Z INFO main com.facebook.presto.server.PrestoServer

========SERVER STARTED========

Now, run the queries on the CloudTrail table already created with Hive. The query syntax and functions in Presto are different from Hive, and you should use Presto's functions. Some example queries are provided in AWS Athena's documentation (Athena uses Presto): http://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html
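For instance, a Presto version of the time-range query might look like the following (a hedged sketch assuming the cloudtrail_logs_2016 table from the DDL sketch above; note that Presto's from_iso8601_timestamp replaces Hive's TO_UNIX_TIMESTAMP):

SELECT eventsource, eventname, count(*) AS cnt
FROM cloudtrail_logs_2016
WHERE from_iso8601_timestamp(eventtime)
      BETWEEN from_iso8601_timestamp('2016-10-26T14:00:53Z')
      AND from_iso8601_timestamp('2016-10-26T15:05:53Z')
GROUP BY eventsource, eventname
ORDER BY cnt DESC;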

Oftentimes HiveServer2 can be a single point of failure. It can easily crash with an OOM. If HiveServer2 restarts now and then, it is most likely due to OOM, where it is configured to be killed and re-spawned. We need to check the JVM options to see the behavior of this process on OOM, analyze thread dumps and check the underlying issue to mitigate such issues.

This post explains how to identify an OOM issue on HiveServer2, how to analyze thread/heap dumps, and how to mitigate the issue. Some of the discussion is generic to EMR.

Some observations from the above JVM:

1. Heap space for the HiveServer2 JVM is 1000MB, and this option is sourced from hive-env.sh.
2. Hive logging options like (-Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-server2.log -Dhive.log.threshold=INFO -server) are sourced from /etc/init/hive-server2.conf.
3. All other options starting from '-server' are set by EMR.
4. If this JVM hits an OutOfMemoryError, the -XX:OnOutOfMemoryError="kill -9 %p" option will kill HS2 on OOM.

If for some reason HiveServer2 goes OOM, it simply gets killed and there is no way to find out why: hive-server2.log and hive-server2.out may not show the OOM at all. You can check hive-server2.log to verify that HiveServer2 is getting restarted. So, we tweak the JVM options for HS2 to include GC verbose logging and to dump the heap on OutOfMemoryError.

With the following settings, if the HS2 process does go OOM, it will display an error like the following in /var/log/hive/hive-server2.out. This message means that for some reason the garbage collector of this HS2 JVM is taking an excessive amount of time and recovering very little memory in each run.


[Full GC 22988K->22494K(84480K), 0.1552850 secs]

#

# java.lang.OutOfMemoryError: Java heap space

# -XX:OnOutOfMemoryError="kill -9 %p"

# Executing /bin/sh -c "kill -9 111510"...

An example script to enable GC verbose logging on the HS2 JVM on EMR 4.x.x and 5.x.x can look like the following:


#!/bin/bash

# This script sources the hive-env.sh on EMR 4.x.x/5.x.x with custom HADOOP_OPTS for hiveserver2 and restarts the HS2 process.

# These options enable GC verbose on HS2 which gets logged to /var/log/hive/hive-server2.out.

PART 2 : How to analyze Thread/Heap dumps

With the above options, if HS2 fails with an OOM it will log the error to /var/log/hive/hive-server2.out. It will also dump the heap to a file like /var/log/hive/java_pid8548.hprof. There are multiple ways to open and analyze this file. I found the Eclipse Memory Analyzer Tool helpful in identifying memory leaks and exploring thread dumps across Java classes and sub-classes.

PART 3 : How to Mitigate this issue

1. Tweaking hive-site.xml by checking stack trace:

The best way to mitigate the OOMs is to check the stack trace of the HiveServer2 process and see if there are any leaks. It is also a good idea to check for the top memory consumers and act accordingly. For each of those threads, see if any of the hive-site.xml settings would reduce its memory consumption. Sometimes you may not have control over any of the thread specifics, in which case you may need to further increase the heap space.

For example, in the stack trace:

– if you see multiple threads getting blocked, you can edit a setting like hive.server2.thrift.http.max.worker.threads

2. Increasing heap space in hive-env.sh :

The heap space is defined in hive-env.sh using an environment variable. It is important to identify the memory requirements of the 3 Hive services – HiveServer2, Hive Metastore and Hive clients – in advance, using load/concurrency tests, before moving to PROD. If you observe any OOMs, you may need to increase the memory accordingly.

Identifying hive services and increasing memory :

If the line 'export HADOOP_HEAPSIZE=2048' is present in hive-env.sh, it will be applied to all 3 Hive services when they are restarted. So, you can use if-statements to provide different settings (HADOOP_HEAPSIZE and HADOOP_OPTS) for each of these 3 Hive services, as sketched below.
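A minimal sketch of such if-statements in hive-env.sh (the SERVICE names, heap sizes and extra GC/heap-dump flags here are illustrative assumptions; verify them against your EMR release):

# hive-env.sh is sourced by the hive launcher, which sets $SERVICE
if [ "$SERVICE" = "hiveserver2" ]; then
  # larger heap plus GC verbose logging and heap dump on OOM for HS2 only (assumed values)
  export HADOOP_HEAPSIZE=4096
  export HADOOP_OPTS="$HADOOP_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hive/"
elif [ "$SERVICE" = "metastore" ]; then
  export HADOOP_HEAPSIZE=2048
else
  # hive CLI and other clients
  export HADOOP_HEAPSIZE=1024
fi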

3. High availability and LoadBalancing of Hiveserver2 :

Increasing the heap space for HS2 may help. However, as per Cloudera, after a certain memory limit there is a possibility that you continue to hit OOMs no matter how much you increase the memory for HS2. In that case, "Cloudera recommends splitting HiveServer2 into multiple instances and load balancing once you start allocating >12 GB to HiveServer2. The objective is to size to reduce impact of Java garbage collection on active processing by the service."

4. Watch out for any bugs in the LockManager :

If you are using Hive's Table Lock Manager Service by setting hive.support.concurrency to true, check if there are issues with the lock manager, which is responsible for maintaining locks for concurrent user support. The lock manager can be ZooKeeperHiveLockManager based, or it can be EmbeddedLockManager, a shared lock manager for a dedicated Hive server where all locks are managed in memory. If the issue seems to be with these lock managers, you may need to edit their configs.

spark-redshift is a library to load data into Spark SQL DataFrames from Amazon Redshift, and to write them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift. This library is more suited to ETL than to interactive queries, since large amounts of data could be extracted to S3 for each query execution.

spark-redshift installation instructions on EMR:

Steps 1-8 show how to compile your own spark-redshift package (JAR). You can skip directly to step 9 if you wish to use the pre-compiled JARs (v0.6.1). Later we use spark-shell to load these JARs and run Scala code to query a Redshift table and put the contents into a DataFrame, as in the sketch below. This guide assumes you have followed the github page https://github.com/databricks/spark-redshift and its tutorial.
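For instance, once the JARs are on the spark-shell classpath, reading a Redshift table into a DataFrame might look like the following (a hedged sketch: the JDBC URL, table name and tempdir bucket are placeholders, and the option names follow the spark-redshift 0.6.x documentation):

// run inside spark-shell on the EMR master node
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/mydb?user=myuser&password=mypassword")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-temp-bucket/spark-redshift/")
  .load()

// simple sanity check that the load worked
df.count()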