AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The recorded information includes the identity of the API caller, the time of the API call, the source IP address of the API caller, the request parameters, and the response elements returned by the AWS service.

CloudTrail typically delivers the logs to an S3 bucket, organized into per-account, per-region, per-date prefixes: s3://&lt;bucket&gt;/AWSLogs/&lt;account-id&gt;/CloudTrail/&lt;region&gt;/&lt;yyyy&gt;/&lt;mm&gt;/&lt;dd&gt;/&lt;logFile&gt;.json.gz.

AWS recommends the aws-cloudtrail-processing-library, but it can be complex when you want fast ad-hoc queries over a large number of big log files. With many files, it is also impractical to download all the logs to a single Linux/Unix node, unzip them, and run regex matching across them. Instead, we can use a distributed environment such as Apache Hive on an AWS EMR cluster to parse and organize all these files using simple SQL-like commands.

This article shows how to query your CloudTrail logs using Hive on EMR, and provides some example queries that may be useful in different scenarios. It assumes that you have a running EMR cluster with the Hive application installed and have explored it a bit.

# Each log file has a nested JSON structure similar to the following.
# The example shows that an IAM user named Alice
# used the AWS CLI to call the Amazon EC2 StartInstances action
# by using the ec2-start-instances command for instance i-ebeaf9e2.

{
  "Records": [{
    "eventVersion": "1.0",
    "userIdentity": {
      "type": "IAMUser",
      "principalId": "EX_PRINCIPAL_ID",
      "arn": "arn:aws:iam::123456789012:user/Alice",
      "accessKeyId": "EXAMPLE_KEY_ID",
      "accountId": "123456789012",
      "userName": "Alice"
    },
    "eventTime": "2014-03-06T21:22:54Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StartInstances",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "205.251.233.176",
    "userAgent": "ec2-api-tools 1.6.12.2",
    "requestParameters": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-ebeaf9e2"
        }]
      }
    },
    "responseElements": {
      "instancesSet": {
        "items": [{
          "instanceId": "i-ebeaf9e2",
          "currentState": {
            "code": 0,
            "name": "pending"
          },
          "previousState": {
            "code": 80,
            "name": "stopped"
          }
        }]
      }
    }
  },
  ...additional entries...
  ]
}

The following Hive queries show how to create a Hive table that references the CloudTrail S3 bucket. CloudTrail data is processed by the CloudTrailInputFormat implementation, which defines the input data splits and key/value records. The CloudTrailLogDeserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types of the table. Data to be written (such as via an INSERT statement) is translated by the Serializer class defined in the SerDe into a format that the OUTPUTFORMAT class (HiveIgnoreKeyTextOutputFormat) can write.

These classes are part of the JAR files under /usr/share/aws/emr/goodies/lib/ (for example EmrHadoopGoodies-x.jar) and are automatically included in the Hive classpath. Hive can also automatically decompress the .gz files. All you need to do is run queries similar to SQL commands. Some sample queries are included below.


# Invoke the Hive shell by typing hive on the EMR master node.

# You can also use HUE UI to run the following commands.

> hive

# Since the cloudtrail logs files are organized into directories,

# we want Hive to recursively parse all these directories and work on the files.

# So, we set the following option.

set mapred.input.dir.recursive=true;

# You can use either tez or mr as the Hive execution engine. Tez is the default on EMR 5.x.

set hive.execution.engine=mr;

# An EXTERNAL table references the S3 folder and its subdirectories.

# This is just a data definition statement and does not involve any data transfer.
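A minimal DDL sketch follows, using the SerDe and input/output format classes described above. The bucket path, account ID, table name, and column list are placeholders; adjust them to your own trail and to the fields you need.

```sql
-- Sketch only: bucket/account/table names are placeholders.
CREATE EXTERNAL TABLE cloudtrail_logs_2016 (
  eventVersion STRING,
  userIdentity STRUCT<
    type: STRING,
    principalId: STRING,
    arn: STRING,
    accountId: STRING,
    accessKeyId: STRING,
    userName: STRING>,
  eventTime STRING,
  eventSource STRING,
  eventName STRING,
  awsRegion STRING,
  sourceIpAddress STRING,
  userAgent STRING,
  requestId STRING,
  eventId STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/';
```

Because the table is EXTERNAL, dropping it later removes only the table definition, never the log files in S3.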

# QUERY 5

--Show all events in a given time window

SELECT *

FROM cloudtrail_logs_2016

WHERE TO_UNIX_TIMESTAMP(eventTime,"yyyy-MM-dd'T'HH:mm:ss'Z'")

BETWEEN TO_UNIX_TIMESTAMP("2016-10-26T14:00:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'")

AND

TO_UNIX_TIMESTAMP("2016-10-26T15:05:53Z","yyyy-MM-dd'T'HH:mm:ss'Z'");

# QUERY 6

--Show all API calls made by a given user

SELECT DISTINCT(eventName)

FROM cloudtrail_logs_2016

WHERE userIdentity.principalId="123456";

# QUERY 7

--Show count of different clients used

SELECT userAgent, count(requestId) AS cnt

FROM cloudtrail_logs_2016

GROUP BY userAgent

ORDER BY cnt DESC;

EMR uses an instance profile role on its nodes to authenticate requests made to the CloudTrail bucket. The default IAM policy on the EMR_EC2_DefaultRole role allows access to all S3 buckets. If your cluster does not have access, make sure the instance profile/role has access to the necessary CloudTrail S3 bucket.

Do not run any INSERT OVERWRITE on this Hive table. If EMR has write access to the S3 bucket, an INSERT OVERWRITE may delete all logs from the bucket. Check the Hive language manual before attempting any write commands.

CloudTrail JSON elements are extensive and differ by request type. This SerDe (which is more or less abandoned by the EMR team) does not expose all possible CloudTrail fields. For example, if you try to query requestParameters, you get: FAILED: SemanticException [Error 10004]: Line 6:28 Invalid table alias or column reference 'requestparameters'.

If your CloudTrail bucket has a large number of files, Tez's grouped-splits or MR's input-splits calculation may take considerable time and memory. Make sure you allocate enough resources to the hive-client or tez-client.
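For example, Tez container and ApplicationMaster sizes can be raised from the Hive shell before running a query. The values below are purely illustrative; tune them to your instance types.

```sql
-- Illustrative values only; size these to your cluster's nodes.
set hive.tez.container.size=4096;      -- MB per Tez container
set hive.tez.java.opts=-Xmx3276m;      -- roughly 80% of the container size
set tez.am.resource.memory.mb=8192;    -- Tez ApplicationMaster memory
```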

You can also query the same table with Presto if it is installed on the cluster; the Presto server log indicates when the server is ready:

2017-12-11T23:22:19.685Z INFO main com.facebook.presto.server.PrestoServer

========SERVER STARTED========

Now, run the queries on the CloudTrail table already created with Hive. Presto's query syntax and functions differ from Hive's, so you should use Presto's functions. Some example queries are provided in the AWS Athena documentation (Athena uses Presto): http://docs.aws.amazon.com/athena/latest/ug/cloudtrail-logs.html
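As a rough sketch, QUERY 6 above combined with a time filter could be written in Presto syntax as follows. Dotted struct access works the same as in Hive, but timestamp handling differs; the table name assumes the Hive table created earlier.

```sql
-- Presto syntax sketch (not Hive): note from_iso8601_timestamp
-- instead of TO_UNIX_TIMESTAMP for parsing eventTime.
SELECT DISTINCT eventname
FROM cloudtrail_logs_2016
WHERE useridentity.principalid = '123456'
  AND from_iso8601_timestamp(eventtime)
      BETWEEN timestamp '2016-10-26 14:00:53'
          AND timestamp '2016-10-26 15:05:53';
```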