This is a customer post by Ajay Rathod, a Staff Data Engineer at Realtor.com.

Realtor.com, in their own words: Realtor.com®, operated by Move, Inc., is a trusted resource for home buyers, sellers, and dreamers. It offers the most comprehensive database of for-sale properties, among competing national sites, and the information, tools, and professional expertise to help people move confidently through every step of their home journey.

Move, Inc. processes hundreds of terabytes of data partitioned by day and hour. Various teams run hundreds of queries on this data. Using AWS services, Move, Inc. has built an infrastructure for gathering and analyzing data:

To increase the effectiveness of the storage and subsequent querying, the data is converted into a Parquet format, and stored again in S3.

Amazon Athena is used as the SQL (Structured Query Language) engine to query the data in S3. Athena is easy to use and is often quickly adopted by various teams.

Teams visualize query results in Amazon QuickSight. Amazon QuickSight is a business analytics service that allows you to quickly and easily visualize data and collaborate with other users in your account.

This architecture is known as the data platform and is shared by the data science, data engineering, and the data operations teams within the organization. Move, Inc. also enables other cross-functional teams to use Athena. When many users use Athena, it helps to monitor its usage to ensure cost-effectiveness. This leads to a strong need for Athena metrics that can give details about the following:

Users

Amount of data scanned (to monitor the cost of AWS service usage)

The databases used for queries

Actual queries that teams run

Currently, the Move, Inc. team does not have an easy way of obtaining all these metrics from a single tool. Having a way to do this would greatly simplify monitoring efforts. For example, the data operations team wants to collect several metrics every day obtained from queries run on Athena for their data. They require the following metrics:

Amount of data scanned by each user

Number of queries by each user

Databases accessed by each user

In this post, I discuss how to build a solution for monitoring Athena usage. To build this solution, you rely on AWS CloudTrail. CloudTrail is a web service that records AWS API calls for your AWS account and delivers log files to an S3 bucket.

Solution

Here is the high-level overview:

Use the CloudTrail API to audit the user queries, and then use Athena to create a table from the CloudTrail logs.

Query the Athena API with the AWS CLI to gather metrics about the data scanned by the user queries and put this information into another table in Athena.

Combine the information from these two sources by joining the two tables.

Use the resulting data to analyze, build insights, and create a dashboard that shows the usage of Athena by users within different teams in the organization.

Step 1: Create a table in Athena for data in CloudTrail

The CloudTrail API records all Athena queries run by different teams within the organization. These logs are saved in S3. The fields of most interest are:

User identity

Start time of the API call

Source IP address

Request parameters

Response elements returned by the service

When end users make queries in Athena, these queries are recorded by CloudTrail as responses from Athena web service calls. In these responses, each query is represented as a JSON (JavaScript Object Notation) string.

You can use the following CREATE TABLE statement to create the cloudtrail_logs table in Athena. For more information, see Querying CloudTrail Logs in the Athena documentation.

Step 2: Create a table in Amazon Athena for data from API output

Athena provides an API that can be queried to obtain information of a specific query ID. It also provides an API to obtain information of a batch of query IDs, with a batch size of up to 50 query IDs.

You can use this API call to obtain information about the Athena queries that you are interested in and store this information in an S3 location. Create an Athena table to represent this data in S3. For the purpose of this post, the response fields that are of interest are as follows:

You can inspect the query IDs and user information for the last day. The query is as follows:

with data AS (
SELECT
json_extract(responseelements,
'$.queryExecutionId') AS query_id,
(useridentity.arn) AS uid,
(useridentity.sessioncontext.sessionIssuer.userName) AS role,
from_iso8601_timestamp(eventtime) AS dt
FROM cloudtrail_logs
WHERE eventsource='athena.amazonaws.com'
AND eventname='StartQueryExecution'
AND json_extract(responseelements, '$.queryExecutionId') is NOT null)
SELECT *
FROM data
WHERE dt > date_add('day',-1,now() )

Step 3: Obtain Query Statistics from Athena API

You can write a simple Python script to loop through queries in batches of 50 and query the Athena API for query statistics. You can use the Boto library for these lookups. Boto is a library that provides you with an easy way to interact with and automate your AWS development. The response from the Boto API can be parsed to extract the fields that you need as described in Step 2.

An example Python script is available in the AthenaMetrics GitHub repo.

Format these fields, for each query ID, as CSV strings and store them for the entire batch response in an S3 bucket. This S3 bucket is represented by the table created in Step 2, cloudtrail_logs.

In your Python code, create a variable named sql_query, and assign it a string representing the SQL query defined in Step 2. The s3_query_folder is the location in S3 that is used by Athena for storing results of the query. The code is as follows:

You can iterate through the results in the response object and consolidate them in batches of 50 results. For each batch, you can invoke the Athena API, batch-get-query-execution.

Store the output in the S3 location pointed to by the CREATE TABLE definition for the table athena_api_output, in Step 2. The SQL statement above returns only queries run in the last 24 hours. You may want to increase that to get usage over a longer period of time. The code snippet for this API call is as follows:

The batchqueryids value is an array of 50 query IDs extracted from the result set of the SELECT query. This script creates the data needed by your second table, athena_api_output, and you are now ready to join both tables in Athena.

Step 4: Join the CloudTrail and Athena API data

Now that the two tables are available with the data that you need, you can run the following Athena query to look at the usage by user. You can limit the output of this query to the most recent five days.

Conclusion

Using the solution described in this post, you can continuously monitor the usage of Athena by various teams. Taking this a step further, you can automate and set user limits for how much data the Athena users in your team can query within a given period of time. You may also choose to add notifications when the usage by a particular user crosses a specified threshold. This helps you manage costs incurred by different teams in your organization.

Realtor.com would like to acknowledge the tremendous support and guidance provided by Hemant Borole, Senior Consultant, Big Data & Analytics with AWS Professional Services in helping to author this post.

About the Author

Ajay Rathod is Staff Data Engineer at Realtor.com. With a deep background in AWS Cloud Platform and Data Infrastructure, Ajay leads the Data Engineering and automation aspect of Data Operations at Realtor.com. He has designed and deployed many ETL pipelines and workflows for the Realtor Data Analytics Platform using AWS services like Data Pipeline, Athena, Batch, Glue and Boto3. He has created various operational metrics to monitor ETL Pipelines and Resource Usage.

Big things are afoot in the world of HackSpace magazine! This month we’re running our first special issue, with wearables projects throughout the magazine. Moreover, we’re giving away our first subscription gift free to all 12-month print subscribers. Lastly, and most importantly, we’ve made the cover EXTRA SHINY!

Prepare your eyeballs — it’s HackSpace magazine issue 4!

Wearables

In this issue, we’re taking an in-depth look at wearable tech. Not Fitbits or Apple Watches — we’re talking stuff you can make yourself, from projects that take a couple of hours to put together, to the huge, inspiring builds that are bringing technology to the runway. If you like wearing clothes and you like using your brain to make things better, then you’ll love this feature.

We’re continuing our obsession with Nixie tubes, with the brilliant Time-To-Go-Clock – Trump edition. This ingenious bit of kit uses obsolete Russian electronics to count down the time until the end of the 45th president’s term in office. However, you can also program it to tell the time left to any predictable event, such as the deadline for your tax return or essay submission, or the date England gets knocked out of the World Cup.

We’re also talking to Dr Lucy Rogers — NASA alumna, Robot Wars judge, and fellow of the Institution of Mechanical Engineers — about the difference between making as a hobby and as a job, and about why we need the Guild of Makers. Plus, issue 4 has a teeny boat, the most beautiful Raspberry Pi cases you’ve ever seen, and it explores the results of what happens when you put a bunch of hardware hackers together in a French chateau — sacré bleu!

Tutorials

As always, we’ve got more how-tos than you can shake a soldering iron at. Fittingly for the current climate here in the UK, there’s a hot water monitor, which shows you how long you have before your morning shower turns cold, and an Internet of Tea project to summon a cuppa from your kettle via the web. Perhaps not so fittingly, there’s also an ESP8266 project for monitoring a solar power station online. Readers in the southern hemisphere, we’ll leave that one for you — we haven’t seen the sun here for months!

And there’s more!

We’re super happy to say that all our 12-month print subscribers have been sent an Adafruit Circuit Playground Express with this new issue:

This gadget was developed primarily with wearables in mind and comes with all sorts of in-built functionality, so subscribers can get cracking with their latest wearable project today! If you’re not a 12-month print subscriber, you’ll miss out, so subscribe here to get your magazine and your device, and let us know what you’ll make.

Following intense pressure from entertainment industry groups, in 2014 Australia began developing legislation which would allow ‘pirate’ sites to be blocked at the ISP level.

In March 2015 the Copyright Amendment (Online Infringement) Bill 2015 (pdf) was introduced to parliament and after just three months of consideration, the Australian Senate passed the legislation into law.

Soon after, copyright holders began preparing their first cases and in December 2016, the Australian Federal Court ordered dozens of local Internet service providers to block The Pirate Bay, Torrentz, TorrentHound, IsoHunt, SolarMovie, plus many proxy and mirror services.

Since then, more processes have been launched establishing site-blocking as a permanent fixture on the Aussie anti-piracy agenda. But with yet more applications for injunction looming on the horizon, how is the mechanism performing and does anything else need to be done to improve or amend it?

Those are the questions now being asked by the responsible department of the Australian Government via a consultation titled Review of Copyright Online Infringement Amendment. The review should’ve been carried out 18 months after the law’s introduction in 2015 but the department says that it delayed the consultation to let more evidence emerge.

“The Department of Communications and the Arts is seeking views from stakeholders on the questions put forward in this paper. The Department welcomes single, consolidated submissions from organizations or parties, capturing all views on the Copyright Amendment (Online Infringement) Act 2015 (Online Infringement Amendment),” the consultation paper begins.

The three key questions for response are as follows:

– How effective and efficient is the mechanism introduced by the Online Infringement Amendment?

– Is the application process working well for parties and are injunctions operating well, once granted?

– Are any amendments required to improve the operation of the Online Infringement Amendment?

Given the tendency for copyright holders to continuously demand more bang for their buck, it will perhaps come as a surprise that at least for now there is a level of consensus that the system is working as planned.

“Case law and survey data suggests the Online Infringement Amendment has enabled copyright owners to work with [Internet service providers] to reduce large-scale online copyright infringement. So far, it appears that copyright owners and [ISPs] find the current arrangement acceptable, clear and effective,” the paper reads.

Thus far under the legislation there have been four applications for injunctions through the Federal Court, notably against leading torrent indexes and browser-based streaming sites, which were both granted.

The other two processes, which began separately but will be heard together, at least in part, involve the recent trend of set-top box based streaming.

Village Roadshow, Disney, Universal, Warner Bros, Twentieth Century Fox, and Paramount are currently presenting their case to the Federal Court. Along with Hong Kong-based broadcaster Television Broadcasts Limited (TVB), which has a separate application, the companies have been told to put together quality evidence for an April 2018 hearing.

With these applications already in the pipeline, yet more are on the horizon. The paper notes that more applications are expected to reach the Federal Court shortly, with the Department of Communications monitoring to assess whether current arrangements are refined as additional applications are filed.

Thus far, however, steady progress appears to have been made. The paper cites various precedents established as a result of the blocking process including the use of landing pages to inform Internet users why sites are blocked and who is paying.

“Either a copyright owner or [ISP] can establish a landing page. If an [ISP] wishes to avoid the cost of its own landing page, it can redirect customers to one that the copyright owner would provide. Another precedent allocates responsibility for compliance costs. Cases to date have required copyright owners to pay all or a significant proportion of compliance costs,” the paper notes.

But perhaps the issue of most importance is whether site-blocking as a whole has had any effect on the levels of copyright infringement in Australia.

The Government says that research carried out by Kantar shows that downloading “fell slightly from 2015 to 2017” with a 5-10% decrease in individuals consuming unlicensed content across movies, music and television. It’s worth noting, however, that Netflix didn’t arrive on Australian shores until May 2015, just a month before the new legislation was passed.

Research commissioned by the Department of Communications and published a year later in 2016 (pdf) found that improved availability of legal streaming alternatives was the main contributor to falling infringement rates. In a juicy twist, the report also revealed that Aussie pirates were the entertainment industries’ best customers.

“The Department is aware that other factors — such as the increasing availability of television, music and film streaming services and of subscription gaming services — may also contribute to falling levels of copyright infringement,” the paper notes.

Submissions to the consultation (pdf) are invited by 5.00 pm AEST on Friday 16 March 2018 via the government’s website.

See how we built it, including our materials, code, and supplemental instructions, on Hackster.io: https://www.hackster.io/hackerhouse/automated-indoor-gardener-a90907 With how busy our lives are, it’s sometimes easy to forget to pay a little attention to your thirsty indoor plants until it’s too late and you are left with a crusty pile of yellow carcasses.

Building an automated gardener

Tired of their plants looking a little too ‘crispy’, Hacker House have created an automated gardener using a Raspberry Pi Zero W alongside some 3D-printed parts, a 5v USB grow light, and a peristaltic pump.

They designed and 3D printed a PLA casing for the project, allowing enough space within for the Raspberry Pi Zero W, the pump, and the added electronics including soldered wiring and two N-channel power MOSFETs. The MOSFETs serve to switch the light and the pump on and off.

Due to the amount of power the light and pump need, the team replaced the Pi’s standard micro USB power supply with a 12v switching supply.

Raspberry Pi and your home garden

Raspberry Pis make great babysitters for your favourite plants, both inside and outside your home. Here at Pi Towers, we have Bert, our Slack- and Twitter-connected potted plant who reminds us when he’s thirsty and in need of water.

Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.

Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.

In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and also significant operational costs for maintenance and technical staff to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottleneck of infrastructure resources, capital expense, and operational costs have been significantly reduced or have totally gone away. There is no more excuse for allowing obstacles of the past to delay delivering timely insights to decision makers, no matter how much data you have.

With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.

Handling data feeds

I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.

To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information. My system then routes it to be stored where I need on Amazon S3 by Amazon Kinesis Data Firehose. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no service monitoring to write.

Setting up a Kinesis stream can be done with a few clicks, and the same for Kinesis Firehose. Firehose can be configured to automatically consume data from a Kinesis Data Stream, and then write compressed data every N minutes to Amazon S3. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console, as shown following.

Regardless of the format I receive the data from our partners, I can send it to Kinesis as JSON data using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure I received but compressed, encrypted, and optimized for reading.

This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.

Working with Amazon Athena and Amazon Redshift for analysis

I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.

For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.

To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.

Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable for our use as our production environments. When the architecture was in place, scaling out was either completely handled by the service, or a matter of a simple API call, and crucially doesn’t require me to change one line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).

In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.

About the Author

Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.

Application architects are faced with key decisions throughout the process of designing and implementing their systems. One decision common to nearly all solutions is how to manage the storage and access rights of application configuration. Shared configuration should be stored centrally and securely with each system component having access only to the properties that it needs for functioning.

This post demonstrates how to create and access shared configurations in Parameter Store from AWS Lambda. Both encrypted and plaintext parameter values are stored with only the Lambda function having permissions to decrypt the secrets. You also use AWS X-Ray to profile the function.

Solution overview

An unencrypted Parameter Store parameter that the Lambda function loads

A KMS key that only the Lambda function can access. You use this key to create an encrypted parameter later.

Lambda function code in Python 3.6 that demonstrates how to load values from Parameter Store at function initialization for reuse across invocations.

Launch the AWS SAM template

To create the resources shown in this post, you can download the SAM template or choose the button to launch the stack. The template requires one parameter, an IAM user name, which is the name of the IAM user to be the admin of the KMS key that you create. In order to perform the steps listed in this post, this IAM user will need permissions to execute Lambda functions, create Parameter Store parameters, administer keys in KMS, and view the X-Ray console. If you have these privileges in your IAM user account you can use your own account to complete the walkthrough. You can not use the root user to administer the KMS keys.

SAM template resources

The following sections show the code for the resources defined in the template.Lambda function

In this YAML code, you define a Lambda function named ParameterStoreBlogFunctionDev using the SAM AWS::Serverless::Function type. The environment variables for this function include the ENV (dev) and the APP_CONFIG_PATH where you find the configuration for this app in Parameter Store. X-Ray tracing is also enabled for profiling later.

The IAM role for this function extends the AWSLambdaBasicExecutionRole by adding IAM policies that grant the function permissions to write to X-Ray and get parameters from Parameter Store, limited to paths under /dev/parameterStoreBlog*.Parameter Store parameter

This YAML code creates an encryption key with a key policy with two statements.

The first statement allows a given user (${IAMUsername}) to administer the key. Importantly, this includes the ability to encrypt values using this key and disable or delete this key, but does not allow the administrator to decrypt values that were encrypted with this key.

The second statement grants your Lambda function permission to encrypt and decrypt values using this key. The alias for this key in KMS is ParameterStoreBlogKeyDev, which is how you reference it later.

Beneath the import statements, you import the patch_all function from the AWS X-Ray library, which you use to patch boto3 to create X-Ray segments for all your boto3 operations.

Next, you create a boto3 SSM client at the global scope for reuse across function invocations, following Lambda best practices. Using the function environment variables, you assemble the path where you expect to find your configuration in Parameter Store. The class MyApp is meant to serve as an example of an application that would need its configuration injected at construction. In this example, you create an instance of ConfigParser, a class in Python’s standard library for handling basic configurations, to give to MyApp.

The load_config function loads the all the parameters from Parameter Store at the level immediately beneath the path provided in the Lambda function environment variables. Each parameter found is put into a new section in ConfigParser. The name of the section is the name of the parameter, less the base path. In this example, the full parameter name is /dev/parameterStoreBlog/appConfig, which is put in a section named appConfig.

Finally, the lambda_handler function initializes an instance of MyApp if it doesn’t already exist, constructing it with the loaded configuration from Parameter Store. Then it simply returns the currently loaded configuration in MyApp. The impact of this design is that the configuration is only loaded from Parameter Store the first time that the Lambda function execution environment is initialized. Subsequent invocations reuse the existing instance of MyApp, resulting in improved performance. You see this in the X-Ray traces later in this post. For more advanced use cases where configuration changes need to be received immediately, you could implement an expiry policy for your configuration entries or push notifications to your function.

To confirm that everything was created successfully, test the function in the Lambda console.

For KMS Key ID, choose alias/ParameterStoreBlogKeyDev, which is the key that your SAM template created.

For Value, enter {"secretKey": "secretValue"}.

Choose Create Parameter.

If you now try to view the value of this parameter by choosing the name of the parameter in the parameters list and then choosing Show next to the Value field, you won’t see the value appear. This is because, even though you have permission to encrypt values using this KMS key, you do not have permissions to decrypt values.

In the Lambda console, run another test of your function. You now also see the secret parameter that you created and its decrypted value.

If you do not see the new parameter in the Lambda output, this may be because the Lambda execution environment is still warm from the previous test. Because the parameters are loaded at Lambda startup, you need a fresh execution environment to refresh the values.

Adjust the function timeout to a different value in the Advanced Settings at the bottom of the Lambda Configuration tab. Choose Save and test to trigger the creation of a new Lambda execution environment.

Profiling the impact of querying Parameter Store using AWS X-Ray

By using the AWS X-Ray SDK to patch boto3 in your Lambda function code, each invocation of the function creates traces in X-Ray. In this example, you can use these traces to validate the performance impact of your design decision to only load configuration from Parameter Store on the first invocation of the function in a new execution environment.

From the Lambda function details page where you tested the function earlier, under the function name, choose Monitoring. Choose View traces in X-Ray.

This opens the X-Ray console in a new window filtered to your function. Be aware of the time range field next to the search bar if you don’t see any search results. In this screenshot, I’ve invoked the Lambda function twice, one time 10.3 minutes ago with a response time of 1.1 seconds and again 9.8 minutes ago with a response time of 8 milliseconds.

Looking at the details of the longer running trace by clicking the trace ID, you can see that the Lambda function spent the first ~350 ms of the full 1.1 sec routing the request through Lambda and creating a new execution environment for this function, as this was the first invocation with this code. This is the portion of time before the initialization subsegment.

Next, it took 725 ms to initialize the function, which includes executing the code at the global scope (including creating the boto3 client). This is also a one-time cost for a fresh execution environment.

Finally, the function executed for 65 ms, of which 63.5 ms was the GetParametersByPath call to Parameter Store.

Looking at the trace for the second, much faster function invocation, you see that the majority of the 8 ms execution time was Lambda routing the request to the function and returning the response. Only 1 ms of the overall execution time was attributed to the execution of the function, which makes sense given that after the first invocation you’re simply returning the config stored in MyApp.

While the Traces screen allows you to view the details of individual traces, the X-Ray Service Map screen allows you to view aggregate performance data for all traced services over a period of time.

In the X-Ray console navigation pane, choose Service map. Selecting a service node shows the metrics for node-specific requests. Selecting an edge between two nodes shows the metrics for requests that traveled that connection. Again, be aware of the time range field next to the search bar if you don’t see any search results.

After invoking your Lambda function several more times by testing it from the Lambda console, you can view some aggregate performance metrics. Look at the following:

From the client perspective, requests to the Lambda service for the function are taking an average of 50 ms to respond. The function is generating ~1 trace per minute.

The function itself is responding in an average of 3 ms. In the following screenshot, I’ve clicked on this node, which reveals a latency histogram of the traced requests showing that over 95% of requests return in under 5 ms.

Parameter Store is responding to requests in an average of 64 ms, but note the much lower trace rate in the node. This is because you only fetch data from Parameter Store on the initialization of the Lambda execution environment.

Conclusion

Deduplication, encryption, and restricted access to shared configuration and secrets is a key component to any mature architecture. Serverless architectures designed using event-driven, on-demand, compute services like Lambda are no different.

In this post, I walked you through a sample application accessing unencrypted and encrypted values in Parameter Store. These values were created in a hierarchy by application environment and component name, with the permissions to decrypt secret values restricted to only the function needing access. The techniques used here can become the foundation of secure, robust configuration management in your enterprise serverless applications.

If these statements are familiar, most likely you rely on file server backups to safeguard your valuable endpoint data.

The problem is, the workplace has changed. Where server backups might have fit how offices worked at one time in the past, relying solely on server backups today means you could be missing valuable endpoint data from your backups. On top of that, you likely are unnecessarily expending valuable user and IT time in attempting to secure and restore endpoint data.

Times Have Changed, and so have Effective Enterprise Backup Strategies

The ways we use computers and handle files today are vastly different from just five or ten years ago. Employees are mobile, and we no longer are limited to monolithic PC and Mac-based office suites. Cloud applications are everywhere. Company-mandated network drive policies are difficult to enforce as office practices change, devices proliferate, and organizational culture evolves. Besides, your IT staff has other things to do than babysit your employees to make sure they follow your organization’s policies for managing files.

Server Backup has its Place, but Does it Support How People Work Today?

Many organizations still rely on server backup. If your organization works primarily in centralized offices with all endpoints — likely desktops — connected directly to your network, and you maintain tight control of how employees manage their files, it still might work for you.

Your IT department probably has set network drive policies that require employees to save files in standard places that are regularly backed up to your file server. Turns out, though, that even standard applications don’t always save files where IT would like them to be. They could be in a directory or folder that’s not regularly backed up.

As employees have become more mobile, they have adopted practices that enable them to access files from different places, but these practices might not fit in with your organization’s server policies. An employee saving a file to Dropbox might be planning to copy it to an “official” location later, but whether that ever happens could be doubtful. Often people don’t realize until it’s too late that accidentally deleting a file in one sync service directory means that all copies in all locations — even the cloud — are also deleted.

Employees are under increasing demands to produce, which means that network drive policies aren’t always followed; time constraints and deadlines can cause best practices to go out the window. Users will attempt to comply with policies as best they can — and you might get 70% or even 75% effective compliance — but getting even to that level requires training, monitoring, and repeatedly reminding employees of policies they need to follow — none of which leads to a good work environment.

Even if you get to 75% compliance with network file policies, what happens if the critical file needed to close out an end-of-year financial summary isn’t one of the files backed up? The effort required for IT to get from 70% to 80% or 90% of an endpoint’s files effectively backed up could require multiple hours from your IT department, and you still might not have backed up the one critical file you need later.

Your Organization Operates on its Data — And Today That Data Exists in Multiple Locations

Users are no longer tied to one endpoint, and may use different computers in the office, at home, or traveling. The greater the number of endpoints used, the greater the chance of an accidental or malicious device loss or data corruption. The loss of the Sales VP’s laptop at the airport on her way back from meeting with major customers can affect an entire organization and require weeks to resolve.

Even with the best intentions and efforts, following policies when out of the office can be difficult or impossible. Connecting to your private network when remote most likely requires a VPN, and VPN connectivity can be challenging from the lobby Wi-Fi at the Radisson. Server restores require time from the IT staff, which can mean taking resources away from other IT priorities and a growing backlog of requests from users to need their files as soon as possible. When users are dependent on IT to get back files critical to their work, employee productivity and often deadlines are affected.

Managing Finite Server Storage Is an Ongoing Challenge

Network drive backup usually requires on-premises data storage for endpoint backups. Since it is a finite resource, allocating that storage is another burden on your IT staff. To make sure that storage isn’t exceeded, IT departments often ration storage by department and/or user — another oversight duty for IT, and even more choices required by your IT department and department heads who have to decide which files to prioritize for backing up.

Having an endpoint backup strategy in place can mitigate these problems and improve user productivity, as well. A good endpoint backup service, such as Backblaze Cloud Backup, will ensure that all devices are backed up securely, automatically, without requiring any action by the user or by your IT department.

For 99% of users, no configuration is required for Backblaze Backup. Everything on the endpoint is encrypted and securely backed up to the cloud, including program configuration files and files outside of standard document folders. Even temp files are backed up, which can prove invaluable when recovering a file after a crash or other program interruption. Cloud storage is unlimited with Backblaze Backup, so there are no worries about running out of storage or rationing file backups.

The Backblaze client can be silently and remotely installed to both Macintosh and Windows clients with no user interaction. And, with Backblaze Groups, your IT staff has complete visibility into when files were last backed up. IT staff can recover any backed up file, folder, or entire computer from the admin panel, and even give file restore capability to the user, if desired, which reduces dependency on IT and time spent waiting for restores.

You Need Data Security That Matches the Way People Work Today

Both file server and endpoint backup have their places in an organization’s data security plan, but their use and value differ. If you already are using file server backup, adding endpoint backup will make a valuable contribution to your organization by reducing workload, improving productivity, and increasing confidence that all critical files are backed up.

By guaranteeing fast and automatic backup of all endpoint data, and matching the current way organizations and people work with data, Backblaze Backup will enable you to effectively and affordably meet the data security demands of your organization.

I recently heard my manager (Ariel Kelman, VP of Marketing for AWS) talk about the important role that education plays in our work. In fact, he assigned it a significantly higher priority than traditional marketing activities that focus on leads or conversions. I’ve also heard our other leaders talk about their work to create highly scalable education programs that will allow developers, architects, and other IT professionals to improve their skills and to earn AWS Certifications.

AWS Developer Professional Series Today I would like to tell you about the new AWS Developer Professional Series. The AWS Training and Certification team has teamed up with edX to create this new three-part series. Founded by MIT and Harvard, edX is the leading non-profit online learning destination, with a global community of over 14 million learners, backed up by 130 global partners including universities, non-profits, and institutions. This collaboration expands our offerings, and gives you another training option!

The new series is designed to help you and your colleagues to build development and DevOps skills on AWS. The courses are self-paced and build on each other in order to help you to create Python applications that run on AWS by way of the AWS SDK for Python (also known as Boto). Here are the courses:

AWS Developer: Optimizing on AWS – This course focuses on performance optimization and tuning of the application that you built in the predecessor courses. You will learn how to use caching and content distribution to increase performance and to improve the end-user experience for your app. You’ll also learn how to use AWS Key Management Service (KMS) to encrypt data at rest and in transit.

The courses are built with the expectation that you already have one to three years of software development experience, including some Python skills. Each course runs for six weeks and requires three to four hours of work per week on your part. Courses start in February (Building), April (Deploying), and May (Optimizing), and you can enroll now at no charge. You can also pursue a Verified Certificate for a fee of $149 per course.

Amazon EMR enables data analysts and scientists to deploy a cluster running popular frameworks such as Spark, HBase, Presto, and Flink of any size in minutes. When you launch a cluster, Amazon EMR automatically configures the underlying Amazon EC2 instances with the frameworks and applications that you choose for your cluster. This can include popular web interfaces such as Hue workbench, Zeppelin notebook, and Ganglia monitoring dashboards and tools.

These web interfaces are hosted on the EMR master node and must be accessed using the public DNS name of the master node (master public DNS value). The master public DNS value is dynamically created, not very user friendly and is hard to remember— it looks something like ip-###-###-###-###.us-west-2.compute.internal. Not having a friendly URL to connect to the popular workbench or notebook interfaces may impact the workflow and hinder your gained agility.

Some customers have addressed this challenge through custom bootstrap actions, steps, or external scripts that periodically check for new clusters and register a friendlier name in DNS. These approaches either put additional burden on the data practitioners or require additional resources to execute the scripts. In addition, there is typically some lag time associated with such scripts. They often don’t do a great job cleaning up the DNS records after the cluster has terminated, potentially resulting in a security risk.

The solution in this post provides an automated, serverless approach to registering a friendly master node name for easy access to the web interfaces.

Before I dive deeper, I review these key services and how they are part of this solution.

CloudWatch Events

CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules, you can match events and route them to one or more target functions or streams. An event can be generated in one of four ways:

In this solution, I cover the first type of event, which is automatically emitted by EMR when the cluster state changes. Based on the state of this event, either create or update the DNS record in Route 53 when the cluster state changes to STARTING, or delete the DNS record when the cluster is no longer needed and the state changes to TERMINATED. For more information about all EMR event details, see Monitor CloudWatch Events.

Route 53 private hosted zones

A private hosted zone is a container that holds information about how to route traffic for a domain and its subdomains within one or more VPCs. Private hosted zones enable you to use custom DNS names for your internal resources without exposing the names or IP addresses to the internet.

Route 53 supports resource record sets with a wide range of record types. In this solution, you use a CNAME record that is used to specify a domain name as an alias for another domain (the ‘canonical’ domain). You use a friendly name of the cluster as the CNAME for the EMR master public DNS value.

Lambda

Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when needed and scales automatically to thousands of requests per second. Lambda takes care of high availability, and server and OS maintenance and patching. You pay only for the consumed compute time. There is no charge when your code is not running.

Lambda provides the ability to invoke your code in response to events, such as when an object is put to an Amazon S3 bucket or as in this case, when a CloudWatch event is emitted. As part of this solution, you deploy a Lambda function as a target that is invoked by CloudWatch Events when the event matches your rule. You also configure the necessary permissions based on the Lambda permissions model, including a Lambda function policy and Lambda execution role.

Putting it all together

Now that you have all of the pieces, you can put together a complete solution. The following diagram illustrates how the solution works:

Start with a user activity such as launching or terminating an EMR cluster.

EMR automatically sends events to the CloudWatch Events stream.

A CloudWatch Events rule matches the specified event, and routes it to a target, which in this case is a Lambda function. In this case, you are using the EMR Cluster State Change

The Lambda function performs the following key steps:

Get the clusterId value from the event detail and use it to call EMR. DescribeCluster API to retrieve the following data points:

MasterPublicDnsName – public DNS name of the master node

Locate the tag containing the friendly name to use as the CNAME for the cluster. The key name containing the friendly name should be The value should be specified as host.domain.com, where domain is the private hosted zone in which to update the DNS record.

Update DNS based on the state in the event detail.

If the state is STARTING, the function calls the Route 53 API to create or update a resource record set in the private hosted zone specified by the domain tag. This is a CNAME record mapped to MasterPublicDnsName.

Conversely, if the state is TERMINATED, the function calls the Route 53 API to delete the associated resource record set from the private hosted zone.

Deploying the solution

Because all of the components of this solution are serverless, use the AWS Serverless Application Model (AWS SAM) template to deploy the solution. AWS SAM is natively supported by AWS CloudFormation and provides a simplified syntax for expressing serverless resources, resulting in fewer lines of code.

Overview of the SAM template

For this solution, the SAM template has 76 lines of text as compared to 142 lines without SAM resources (and writing the template in YAML would be even slightly smaller). The solution can be deployed using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SAM Local.

CloudFormation transforms help simplify template authoring by condensing a multiple-line resource declaration into a single line in your template. To inform CloudFormation that your template defines a serverless application, add a line under the template format version as follows:

Before SAM, you would use the AWS::Lambda::Function resource type to define your Lambda function. You would then need a resource to define the permissions for the function (AWS::Lambda::Permission), another resource to define a Lambda execution role (AWS::IAM::Role), and finally a CloudWatch Events resource (Events::Rule) that triggers this function.

With SAM, you need to define just a single resource for your function, AWS::Serverless::Function. Using this single resource type, you can define everything that you need, including function properties such as function handler, runtime, and code URI, as well as the required IAM policies and the CloudWatch event.

CodeUri – Before you can deploy a SAM template, first upload your Lambda function code zip to S3. You can do this manually or use the aws cloudformation package CLI command to automate the task of uploading local artifacts to a S3 bucket, as shown later.

Lambda execution role and permissions – You are not specifying a Lambda execution role in the template. Rather, you are providing the required permissions as IAM policy documents. When the template is submitted, CloudFormation expands the AWS::Serverless::Function resource, declaring a Lambda function and an execution role. The created role has two attached policies: a default AWSLambdaBasicExecutionRole and the inline policy specified in the template.

CloudWatch Events rule – Instead of specifying a CloudWatch Events resource type, you are defining an event source object as a property of the function itself. When the template is submitted, CloudFormation expands this into a CloudWatch Events rule resource and automatically creates the Lambda resource-based permissions to allow the CloudWatch Events rule to trigger the function.

NOTE: If you are trying this solution outside of us-east-1, then you should download the necessary files, upload them to the buckets in your region, edit the script as appropriate and then run it or use the CLI deployment method below.

3.) Choose Next.

4.) On the Specify Details page, keep or modify the stack name and choose Next.

5.) On the Options page, choose Next.

6.) On the Review page, take the following steps:

Acknowledge the two Transform access capabilities. This allows the CloudFormation transform to create the required IAM resources with custom names.

Under Transforms, choose Create Change Set.

Wait a few seconds for the change set to be created before proceeding. The change set should look as follows:

7.) Choose Execute to deploy the template.

After the template is deployed, you should see four resources created:

Validating results

To test the solution, launch an EMR cluster. The Lambda function looks for the cluster_name tag associated with the EMR cluster. Make sure to specify the friendly name of your cluster as host.domain.com where the domain is the private hosted zone in which to create the CNAME record.

Here is a sample CLI command to launch a cluster within a specific subnet in a VPC with the required tag cluster_name.

After the cluster is launched, log in to the Route 53 console. In the left navigation pane, choose Hosted Zones to view the list of private and public zones currently configured in Route 53. Select the hosted zone that you specified in the ZONE tag when you launched the cluster. Verify that the resource records were created.

You can also monitor the CloudWatch Events metrics that are published to CloudWatch every minute, such as the number of TriggeredRules and Invocations.

Now that you’ve verified that the Lambda function successfully updated the Route 53 resource records in the zone file, terminate the EMR cluster and verify that the records are removed by the same function.

Conclusion

This solution provides a serverless approach to automatically assigning a friendly name for your EMR cluster for easy access to popular notebooks and other web interfaces. CloudWatch Events also supports cross-account event delivery, so if you are running EMR clusters in multiple AWS accounts, all cluster state events across accounts can be consolidated into a single account.

I hope that this solution provides a small glimpse into the power of CloudWatch Events and Lambda and how they can be leveraged with EMR and other AWS big data services. For example, by using the EMR step state change event, you can chain various pieces of your analytics pipeline. You may have a transient cluster perform data ingest and, when the task successfully completes, spin up an ETL cluster for transformation and upload to Amazon Redshift. The possibilities are truly endless.

My career is a different story. Over the past two decades and a change, I went from writing CGI scripts and setting up WAN routers for a chain of shopping malls, to doing pentests for institutional customers, to designing a series of network monitoring platforms and handling incident response for a big telco, to building and running the product security org for one of the largest companies in the world. It’s been an interesting ride – and now that I’m on the hook for the well-being of about 100 folks across more than a dozen subteams around the world, I’ve been thinking a bit about the lessons learned along the way.

Of course, I’m a bit hesitant to write such a post: sometimes, your efforts pan out not because of your approach, but despite it – and it’s possible to draw precisely the wrong conclusions from such anecdotes. Still, I’m very proud of the culture we’ve created and the caliber of folks working on our team. It happened through the work of quite a few talented tech leads and managers even before my time, but it did not happen by accident – so I figured that my observations may be useful for some, as long as they are taken with a grain of salt.

But first, let me start on a somewhat somber note: what nobody tells you is that one’s level on the leadership ladder tends to be inversely correlated with several measures of happiness. The reason is fairly simple: as you get more senior, a growing number of people will come to you expecting you to solve increasingly fuzzy and challenging problems – and you will no longer be patted on the back for doing so. This should not scare you away from such opportunities, but it definitely calls for a particular mindset: your motivation must come from within. Look beyond the fight-of-the-day; find satisfaction in seeing how far your teams have come over the years.

With that out of the way, here’s a collection of notes, loosely organized into three major themes.

The curse of a techie leader

Perhaps the most interesting observation I have is that for a person coming from a technical background, building a healthy team is first and foremost about the subtle art of letting go.

There is a natural urge to stay involved in any project you’ve started or helped improve; after all, it’s your baby: you’re familiar with all the nuts and bolts, and nobody else can do this job as well as you. But as your sphere of influence grows, this becomes a choke point: there are only so many things you could be doing at once. Just as importantly, the project-hoarding behavior robs more junior folks of the ability to take on new responsibilities and bring their own ideas to life. In other words, when done properly, delegation is not just about freeing up your plate; it’s also about empowerment and about signalling trust.

Of course, when you hand your project over to somebody else, the new owner will initially be slower and more clumsy than you; but if you pick the new leads wisely, give them the right tools and the right incentives, and don’t make them deathly afraid of messing up, they will soon excel at their new jobs – and be grateful for the opportunity.

A related affliction of many accomplished techies is the conviction that they know the answers to every question even tangentially related to their domain of expertise; that belief is coupled with a burning desire to have the last word in every debate. When practiced in moderation, this behavior is fine among peers – but for a leader, one of the most important skills to learn is knowing when to keep your mouth shut: people learn a lot better by experimenting and making small mistakes than by being schooled by their boss, and they often try to read into your passing remarks. Don’t run an authoritarian camp focused on total risk aversion or perfectly efficient resource management; just set reasonable boundaries and exit conditions for experiments so that they don’t spiral out of control – and be amazed by the results every now and then.

Death by planning

When nothing is on fire, it’s easy to get preoccupied with maintaining the status quo. If your current headcount or budget request lists all the same projects as last year’s, or if you ever find yourself ending an argument by deferring to a policy or a process document, it’s probably a sign that you’re getting complacent. In security, complacency usually ends in tears – and when it doesn’t, it leads to burnout or boredom.

In my experience, your goal should be to develop a cadre of managers or tech leads capable of coming up with clever ideas, prioritizing them among themselves, and seeing them to completion without your day-to-day involvement. In your spare time, make it your mission to challenge them to stay ahead of the curve. Ask your vendor security lead how they’d streamline their work if they had a 40% jump in the number of vendors but no extra headcount; ask your product security folks what’s the second line of defense or containment should your primary defenses fail. Help them get good ideas off the ground; set some mental success and failure criteria to be able to cut your losses if something does not pan out.

Of course, malfunctions happen even in the best-run teams; to spot trouble early on, instead of overzealous project tracking, I found it useful to encourage folks to run a data-driven org. I’d usually ask them to imagine that a brand new VP shows up in our office and, as his first order of business, asks “why do you have so many people here and how do I know they are doing the right things?”. Not everything in security can be quantified, but hard data can validate many of your assumptions – and will alert you to unseen issues early on.

When focusing on data, it’s important not to treat pie charts and spreadsheets as an art unto itself; if you run a security review process for your company, your CSAT scores are going to reach 100% if you just rubberstamp every launch request within ten minutes of receiving it. Make sure you’re asking the right questions; instead of “how satisfied are you with our process”, try “is your product better as a consequence of talking to us?”

Whenever things are not progressing as expected, it is a natural instinct to fall back to micromanagement, but it seldom truly cures the ill. It’s probable that your team disagrees with your vision or its feasibility – and that you’re either not listening to their feedback, or they don’t think you’d care. It’s good to assume that most of your employees are as smart or smarter than you; barking your orders at them more loudly or more frequently does not lead anyplace good. It’s good to listen to them and either present new facts or work with them on a plan you can all get behind.

In some circumstances, all that’s needed is honesty about the business trade-offs, so that your team feels like your “partner in crime”, not a victim of circumstance. For example, we’d tell our folks that by not falling behind on basic, unglamorous work, we earn the trust of our VPs and SVPs – and that this translates into the independence and the resources we need to pursue more ambitious ideas without being told what to do; it’s how we game the system, so to speak. Oh: leading by example is a pretty powerful tool at your disposal, too.

The human factor

I’ve come to appreciate that hiring decent folks who can get along with others is far more important than trying to recruit conference-circuit superstars. In fact, hiring superstars is a decidedly hit-and-miss affair: while certainly not a rule, there is a proportion of folks who put the maintenance of their celebrity status ahead of job responsibilities or the well-being of their peers.

For teams, one of the most powerful demotivators is a sense of unfairness and disempowerment. This is where tech-originating leaders can shine, because their teams usually feel that their bosses understand and can evaluate the merits of the work. But it also means you need to be decisive and actually solve problems for them, rather than just letting them vent. You will need to make unpopular decisions every now and then; in such cases, I think it’s important to move quickly, rather than prolonging the uncertainty – but it’s also important to sincerely listen to concerns, explain your reasoning, and be frank about the risks and trade-offs.

Whenever you see a clash of personalities on your team, you probably need to respond swiftly and decisively; being right should not justify being a bully. If you don’t react to repeated scuffles, your best people will probably start looking for other opportunities: it’s draining to put up with constant pie fights, no matter if the pies are thrown straight at you or if you just need to duck one every now and then.

More broadly, personality differences seem to be a much better predictor of conflict than any technical aspects underpinning a debate. As a boss, you need to identify such differences early on and come up with creative solutions. Sometimes, all you need is taking some badly-delivered but valid feedback and having a conversation with the other person, asking some questions that can help them reach the same conclusions without feeling that their worldview is under attack. Other times, the only path forward is making sure that some folks simply don’t run into each for a while.

Finally, dealing with low performers is a notoriously hard but important part of the game. Especially within large companies, there is always the temptation to just let it slide: sideline a struggling person and wait for them to either get over their issues or leave. But this sends an awful message to the rest of the team; for better or worse, fairness is important to most. Simply firing the low performers is seldom the best solution, though; successful recovery cases are what sets great managers apart from the average ones.

Oh, one more thought: people in leadership roles have their allegiance divided between the company and the people who depend on them. The obligation to the company is more formal, but the impact you have on your team is longer-lasting and more intimate. When the obligations to the employer and to your team collide in some way, make sure you can make the right call; it might be one of the the most consequential decisions you’ll ever make.

Beginning in April 2013, Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers. Each entry consists of the date, manufacturer, model, serial number, status (operational or failed), and all of the SMART attributes reported by that drive. As of the end of 2017, there are about 88 million entries totaling 23 GB of data. You can download this data from our website if you want to do your own research, but for starters here’s what we found.

Overview

At the end of 2017 we had 93,240 spinning hard drives. Of that number, there were 1,935 boot drives and 91,305 data drives. This post looks at the hard drive statistics of the data drives we monitor. We’ll review the stats for Q4 2017, all of 2017, and the lifetime statistics for all of the drives Backblaze has used in our cloud storage data centers since we started keeping track. Along the way we’ll share observations and insights on the data presented and we look forward to you doing the same in the comments.

Hard Drive Reliability Statistics for Q4 2017

At the end of Q4 2017 Backblaze was monitoring 91,305 hard drives used to store data. For our evaluation we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives (read why after the chart). This leaves us with 91,243 hard drives. The table below is for the period of Q4 2017.

A few things to remember when viewing this chart:

The failure rate listed is for just Q4 2017. If a drive model has a failure rate of 0%, it means there were no drive failures of that model during Q4 2017.

There were 62 drives (91,305 minus 91,243) that were not included in the list above because we did not have at least 45 of a given drive model. The most common reason we would have fewer than 45 drives of one model is that we needed to replace a failed drive and we had to purchase a different model as a replacement because the original model was no longer available. We use 45 drives of the same model as the minimum number to qualify for reporting quarterly, yearly, and lifetime drive statistics.

Quarterly failure rates can be volatile, especially for models that have a small number of drives and/or a small number of drive days. For example, the Seagate 4 TB drive, model ST4000DM005, has a annualized failure rate of 29.08%, but that is based on only 1,255 drive days and 1 (one) drive failure.

AFR stands for Annualized Failure Rate, which is the projected failure rate for a year based on the data from this quarter only.

Bulking Up and Adding On Storage

Looking back over 2017, we not only added new drives, we “bulked up” by swapping out functional and smaller 2, 3, and 4TB drives with larger 8, 10, and 12TB drives. The changes in drive quantity by quarter are shown in the chart below:

For 2017 we added 25,746 new drives, and lost 6,442 drives to retirement for a net of 19,304 drives. When you look at storage space, we added 230 petabytes and retired 19 petabytes, netting us an additional 211 petabytes of storage in our data center in 2017.

2017 Hard Drive Failure Stats

Below are the lifetime hard drive failure statistics for the hard drive models that were operational at the end of Q4 2017. As with the quarterly results above, we have removed any non-production drives and any models that had fewer than 45 drives.

The chart above gives us the lifetime view of the various drive models in our data center. The Q4 2017 chart at the beginning of the post gives us a snapshot of the most recent quarter of the same models.

Let’s take a look at the same models over time, in our case over the past 3 years (2015 through 2017), by looking at the annual failure rates for each of those years.

The failure rate for each year is calculated for just that year. In looking at the results the following observations can be made:

The failure rates for both of the 6 TB models, Seagate and WDC, have decreased over the years while the number of drives has stayed fairly consistent from year to year.

While it looks like the failure rates for the 3 TB WDC drives have also decreased, you’ll notice that we migrated out nearly 1,000 of these WDC drives in 2017. While the remaining 180 WDC 3 TB drives are performing very well, decreasing the data set that dramatically makes trend analysis suspect.

The Toshiba 5 TB model and the HGST 8 TB model had zero failures over the last year. That’s impressive, but with only 45 drives in use for each model, not statistically useful.

The HGST/Hitachi 4 TB models delivered sub 1.0% failure rates for each of the three years. Amazing.

A Few More Numbers

To save you countless hours of looking, we’ve culled through the data to uncover the following tidbits regarding our ever changing hard drive farm.

116,833 — The number of hard drives for which we have data from April 2013 through the end of December 2017. Currently there are 91,305 drives (data drives) in operation. This means 25,528 drives have either failed or been removed from service due for some other reason — typically migration.

29,844 — The number of hard drives that were installed in 2017. This includes new drives, migrations, and failure replacements.

81.76 — The number of hard drives that were installed each day in 2017. This includes new drives, migrations, and failure replacements.

95,638 — The number of drives installed since we started keeping records in April 2013 through the end of December 2017.

55.41 — The average number of hard drives installed per day from April 2013 to the end of December 2017. The installations can be new drives, migration replacements, or failure replacements.

1,508 — The number of hard drives that were replaced as failed in 2017.

4.13 — The average number of hard drives that have failed each day in 2017.

6,795 — The number of hard drives that have failed from April 2013 until the end of December 2017.

3.94 — The average number of hard drives that have failed each day from April 2013 until the end of December 2017.

Can’t Get Enough Hard Drive Stats?

We’ll be presenting the webinar “Backblaze Hard Drive Stats for 2017” on Thursday February 9, 2017 at 10:00 Pacific time. The webinar will dig deeper into the quarterly, yearly, and lifetime hard drive stats and include the annual and lifetime stats by drive size and manufacturer. You will need to subscribe to the Backblaze BrightTALK channel to view the webinar. Sign up today.

As a reminder, the complete data set used to create the information used in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purpose. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone — it is free.

High-altitude ballooning

Some of you may be familiar with Raspberry Pi being used as the flight computer, or tracker, of high-altitude balloon (HAB) payloads. For those who aren’t, high-altitude ballooning is a relatively simple activity (at least in principle) where a tracker is attached to a large weather balloon which is then released into the atmosphere. While the HAB ascends, the tracker takes pictures and data readings the whole time. Eventually (around 30km up) the balloon bursts, leaving the payload free to descend and be recovered. For a better explanation, I’m handing over to the students of UTC Oxfordshire:

On Tuesday 2nd May, students launched a Raspberry Pi computer 35,000 metres into the stratosphere as part of an Employer-Led project at UTC Oxfordshire, set by the Raspberry Pi Foundation. The project involved engineering, scientific and communication/publicity skills being developed to create the payload and code to interpret experiments set by the science team.

Skycademy

Over the past few years, we’ve seen schools and their students explore the possibilities that high-altitude ballooning offers, and back in 2015 and 2016 we ran Skycademy. The programme was simple enough: get a bunch of educators together in the same space, show them how to launch a balloon flight, and then send them back to their students to try and repeat what they’ve learned. Since the first Skycademy event, a number of participants have carried out launches, and we are extremely proud of each and every one of them.

The case of the vanishing PACA HAB

Not every launch has been a 100% success though. There are many things that can and do go wrong during HAB flights, and watching each launch from the comfort of our office can be a nerve-wracking experience. We had such an experience back in July 2017, during the launch performed by Skycademy graduate and Raspberry Pi Certified Educator Dave Hartley and his students from Portslade Aldridge Community Academy (PACA).

Dave and his team had been working on their payload for some time, and were awaiting suitable weather conditions. Early one Wednesday in July, everything aligned: they had a narrow window of good weather and so set their launch plan in motion. Soon they had assembled the payload in the school grounds and all was ready for the launch.

Just before 11:00, they’d completed their final checks and released their payload into the atmosphere. Over the course of 64 minutes, the HAB steadily rose to an altitude of 25647m, where it captured some amazing pictures before the balloon burst and a rapid descent began.

Soon after the payload began to descend, the team noticed something worrying: their predicted descent path took the payload dangerously far south — it was threatening to land in the sea. As the payload continued to lose altitude, their calculated results kept shifting, alternately predicting a landing on the ground or out to sea. Eventually it became clear that the payload would narrowly overshoot the land, and it finally landed about 2 km out to sea.

The path of the balloon

It’s not uncommon for a HAB payload to get lost. There are many ways this can happen, particularly in a narrow country with a prevailing easterly wind like the UK. Payloads can get lost at sea, land somewhere inaccessible, or simply run out of power before they are located and retrieved. So normally, this would be the end of the story for the PACA students — even if the team had had a speedboat to hand, their payload was surely lost for good.

A message from Denmark

However, this is not the end of our story! A couple of months later, I arrived at work and saw this tweet from a colleague:

Anyone lost a Raspberry Pi HAB? Someone found this one on a beach in south western Denmark yesterday #UKHAS https://t.co/7lBzFiemgr

Good Samaritan Henning Hansen had found a Raspberry Pi washed up on a remote beach in Denmark! While walking a stretch of coast to collect plastic debris for an environmental monitoring project, he came across something unusual near the shore at 55°04’53.0″N and 8°38’46.9″E.

This of course piqued my interest, and we began to investigate the image he had shared on Facebook.

Inspecting the photo closely, we noticed a small asset label — the kind of label that, over a year earlier, we’d stuck to each and every bit of Skycademy field kit. We excitedly claimed the kit on behalf of Dave and his students, and contacted Henning to arrange the recovery of the payload. He told us it must have been carried ashore with the tide some time between 21 and 27 September, and probably on 21 September, since that day had the highest tide over the period. This meant the payload must have spent over two months at sea!

From the photo we could tell that the Raspberry Pi had suffered significant corrosion, having been exposed to salt water for so long, and so we felt pessimistic about the chances that there would be any recoverable data on it. However, Henning said that he’d been able to read some files from the FAT partition of the SD card, so all hope was not lost.

After a few weeks and a number of complications around dispatch and delivery (thank you, Henning, for your infinite patience!), Helen collected the HAB from a local Post Office.

SUCCESS!

We set about trying to read the data from the SD card, and eventually became disheartened: despite several attempts, we were unable to read its contents.

In a last-ditch effort, we gave the SD card to Jonathan, one of our engineers, who initially laughed at the prospect of recovering any data from it. But ten minutes later, he returned with news of success!

Since then, we’ve been able to reunite the payload with the PACA launch team, and the students sent us the perfect message to end this story:

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you easily to make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas, to simple de-normalized tables for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

COPY data from multiple, evenly sized files.

Use workload management to improve ETL runtimes.

Perform table maintenance regularly.

Perform multiple steps in a single transaction.

Loading data in bulk.

Use UNLOAD to extract large result sets.

Use Amazon Redshift Spectrum for ad hoc ETL processing.

Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.

Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.

Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.

Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables regularly are VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM & ANALYZE executed in a regular fashion.

Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.

Use the following approaches to ensure that VACCUM is completed in a timely manner:

Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.

If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.

Consider using time series This helps reduce the amount of data you need to VACUUM.

Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer using statistics about tables to make good decisions about the query plan for the SQL statements. Regular statistics collection after the ETL completion ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the statistics off (pct_stats_off) less than 20% ensures effective query plans for the SQL queries.

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. For example, here is an example multi-step ETL script that performs one commit at the end:

5. Loading data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.

User ALTER table APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data.

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:

Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter to and keep files are 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of monitoring of your ETL processes.

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1: Extract from the RDBMS source to a S3 bucket

In this ETL process, the data extract job fetches change data every 1 hour and it is staged into multiple hourly files. For example, the staged S3 folder looks like the following:

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3:// <<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses on the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring. I also demonstrated the best practices being used in a typical sample ETL workload to transform the data into Amazon Redshift.

If you have questions or suggestions, please comment below.

About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Welcome to TimeShift This is an event-heavy issue of TimeShift. We’ve been busy prepping for the Grafana v5 Beta release and finalizing details for our upcoming GrafanaCon. Below you’ll find a presenation on Prometheus monitoring, tracking a problematic ADSL connection and how to use Elasticsearch as a data source in Grafana. Enjoy!
Latest Stable Release Grafana 4.6.3 is available. Check out the full release notes too see the features and fixes.

Anti-piracy campaigns come in all shapes and sizes, from oppressive and scary to the optimistically educational. It is rare for any to be labeled ‘brilliant’ but a campaign just revealed in Belgium hits really close to the mark.

According to an announcement by the Belgian Entertainment Association (BEA), Belgian Federation of Cinemas, together with film producers and distributors, cinemas and directors, a brand new campaign has been targeting those who download content from illegal sources. It is particularly innovative and manages to hit pirates in a way they can’t easily avoid.

Working on the premise that many locals download English language movies and then augment them with local language subtitles, a fiendish plot was hatched. Instead of a generic preaching video on YouTube or elsewhere, the movie companies decided to ‘infect’ pirate subtitles with messages of their own.

“Suddenly the story gets a surprising turn. With a playful wink it suddenly seems as if Samuel L. Jackson in The Hitman’s Bodyguard directly appeals to the illegal viewer and says that you should not download,” the group explains.

Samuel is watching…..

>

“I do not need any research to see that these are bad subtitles,” Jackson informs the viewer.

In another scene with Ryan Reynolds, Jackson notes that illegal downloading can have a negative effect on a person.

Don’t download…..

Don’t download…..

“And you wanted to become a policeman, until you started downloading,” he says.

The movie groups say that they also planted edited subtitles in The Bridge, with police officers in the show noting they’re on the trail of illegal downloaders. The movies Logan Lucky and The Foreigner got similar treatment.

It’s not clear on which sites these modified subtitles were distributed but according to the companies involved, they’ve been downloaded 10,000 times already.

“The viewer not only feels caught but immediately realizes that you do not necessarily get a real quality product through illegal sources,” the companies say.

The campaign is the work of advertising agency TBWA, which appropriately bills itself as the Disruption Company.

“We are not a traditional ad agency network — we are a radically open creative collective. We look at what everyone else is doing and strive to do something completely new,” the company says.

Coincidentally, the company refers to its staff as pirates who rewrite rules and have ideas to take on “conventionally-steered ships.”

“As creative director of communication agency TBWA, protecting creative work is very important to us,” says TBWA Creative Director Gert Pauwels. “That is precisely why we came up with the subtle prank to work together with the sector to tackle illegal downloading.”

Although framed as a joke, one which may even raise a wry smile and a nod of respect from some pirates, there’s an underlying serious message from the companies involved.

“Maybe many think that everything is possible on the internet and that downloading will remain without consequences,” says Pieter Swaelens, Managing Director of BEA. “That is not the case. Here too, many jobs are being challenged in Belgium and we have to tackle this behavior.”

It’s also worth noting that while this campaign is both innovative and light-hearted, at least one of the companies involved is also a supporter of much tougher action.

Dutch Filmworks recently obtained permission from the Dutch Data Authority to begin monitoring pirates. Once it has their IP addresses it will attempt to make contact, offering a cash settlement agreement to make a potential lawsuit disappear.

“We are pleased with the extra attention to the problem of downloading from illegal sources,” says René van Turnhout, COO Dutch FilmWorks. “Too many jobs in our sector have been lost. Moreover, piracy endangers the creativity and quality of the legal offer.”

2017 was a big year for the Prometheus project, as it published
its 2.0 release in November. The new release ships numerous
bug fixes, new features, and, notably, a new storage engine that brings major
performance improvements. This comes at the cost of incompatible changes to
the storage and configuration-file formats. An overview of
Prometheus and its new release was presented to the Kubernetes community in a talk
held during KubeCon
+ CloudNativeCon. This article covers what changed in this new release
and what is brewing next in the Prometheus community; it is a companion tothis article, which provided a general
introduction to monitoring with Prometheus.

When managing your AWS resources, you often need to grant one AWS service access to another to accomplish tasks. For example, you could use an AWS Lambdafunction to resize, watermark, and postprocess images, for which you would need to store the associated metadata in Amazon DynamoDB. You also could use Lambda, Amazon S3, and Amazon CloudFront to build a serverless website that uses a DynamoDB table as a session store, with Lambda updating the information in the table. In both these examples, you need to grant Lambda functions permissions to write to DynamoDB.

In this post, I demonstrate how to create an AWS Identity and Access Management (IAM) policy that will be attached to an IAM role. The role is then used to grant a Lambda function access to a DynamoDB table. By using an IAM policy and role to control access, I don’t need to embed credentials in code and can tightly control which services the Lambda function can access. The policy also includes permissions to allow the Lambda function to write log files to Amazon CloudWatch Logs. This allows me to view utilization statistics for your Lambda functions and to have access to additional logging for troubleshooting issues.

Solution overview

The following architecture diagram presents an overview of the solution in this post.

The architecture of this post’s solution uses a Lambda function (1 in the preceding diagram) to make read API calls such as GET or SCAN and write API calls such as PUT or UPDATE to a DynamoDB table (2). The Lambda function also writes log files to CloudWatch Logs (3). The Lambda function uses an IAM role (4) that has an IAM policy attached (5) that grants access to DynamoDB and CloudWatch.

Overview of the AWS services used in this post

I use the following AWS services in this post’s solution:

IAM – For securely controlling access to AWS services. With IAM, you can centrally manage users, security credentials such as access keys, and permissions that control which AWS resources users and applications can access.

DynamoDB – A fast and flexible NoSQL database service for all applications that need consistent, single-digit-millisecond latency at any scale.

Lambda – Run code without provisioning or managing servers. You pay only for the compute time you consume—there is no charge when your code is not running.

IAM access policies

I have authored an IAM access policy with JSON to grant the required permissions to the DynamoDB table and CloudWatch Logs. I will attach this policy to a role, and this role will then be attached to a Lambda function, which will assume the required access to DynamoDB and CloudWatch Logs

I will walk through this policy, and explain its elements and how to create the policy in the IAM console.

The following policy grants a Lambda function read and write access to a DynamoDB table and writes log files to CloudWatch Logs. This policy is called MyLambdaPolicy. The following is the full JSON document of this policy (the AWS account ID is a placeholder value that you would replace with your own account ID).

The first element in this policy is the Version, which defines the JSON version. At the time of this post’s publication, the most recent version of JSON is 2012-10-17.

The next element in this first policy is a Statement. This is the main section of the policy and includes multiple elements. This first statement is to Allow access to DynamoDB, and in this example, the elements I use are:

An Effect element – Specifies whether the statement results in an Allow or an explicit Deny. By default, access to resources is implicitly denied. In this example, I have used Allow because I want to allow the actions.

An Action element – Describes the specific actions for this statement. Each AWS service has its own set of actions that describe tasks that you can perform with that service. I have used the DynamoDB actions that I want to allow. For the definitions of all available actions for DynamoDB, see the DynamoDB API Reference.

A Resource element – Specifies the object or objects for this statement using Amazon Resource Names (ARNs). You use an ARN to uniquely identify an AWS resource. All Resource elements start with arn:aws and then define the object or objects for the statement. I use this to specify the DynamoDB table to which I want to allow access. To build the Resource element for DynamoDB, I have to specify:

The AWS service (dynamodb)

The AWS Region (eu-west-1)

The AWS account ID (123456789012)

The table (table/SampleTable)

The complete Resource element of the first statement is: arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable

In this policy, I created a second statement to allow access to CloudWatch Logs so that the Lambda function can write log files for troubleshooting and analysis. I have used the same elements as for the DynamoDB statement, but have changed the following values:

For the Action element, I used the CloudWatch actions that I want to allow. Definitions of all the available actions for CloudWatch are provided in the CloudWatch API Reference.

For the Resource element, I specified the AWS account to which I want to allow my Lambda function to write its log files. As in the preceding example for DynamoDB, you have to use the ARN for CloudWatch Logs to specify where access should be granted. To build the Resource element for CloudWatch Logs, I have to specify:

The AWS service (logs)

The AWS region (eu-west-1)

The AWS account ID (123456789012)

All log groups in this account (*)

The complete Resource element of the second statement is: arn:aws:logs:eu-west-1:123456789012:*

Create the IAM policy in your account

Before you can apply MyLambdaPolicy to a Lambda function, you have to create the policy in your own account and then apply it to an IAM role.

To create an IAM policy:

Navigate to the IAM console and choose Policies in the navigation pane. Choose Create policy.

Because I have already written the policy in JSON, you don’t need to use the Visual Editor, so you can choose the JSON tab and paste the content of the JSON policy document shown earlier in this post (remember to replace the placeholder account IDswith your own account ID). Choose Review policy.

Name the policy MyLambdaPolicy and give it a description that will help you remember the policy’s purpose. You also can view a summary of the policy’s permissions. Choose Create policy.

You have created the IAM policy that you will apply to the Lambda function.

Attach the IAM policy to an IAM role

To apply MyLambdaPolicy to a Lambda function, you first have to attach the policy to an IAM role.

To create a new role and attach MyLambdaPolicy to the role:

Navigate to the IAM console and choose Roles in the navigation pane. Choose Create role.

Choose AWS service and then choose Lambda. Choose Next: Permissions.

On the Attach permissions policies page, type MyLambdaPolicy in the Search box. Choose MyLambdaPolicy from the list of returned search results, and then choose Next: Review.

On the Review page, type MyLambdaRole in the Role name box and an appropriate description, and then choose Create role.

You have attached the policy you created earlier to a new IAM role, which in turn can be used by a Lambda function.

Apply the IAM role to a Lambda function

You have created an IAM role that has an attached IAM policy that grants both read and write access to DynamoDB and write access to CloudWatch Logs. The next step is to apply the IAM role to a Lambda function.

On the Create function page under Author from scratch, name the function MyLambdaFunction, and choose the runtime you want to use based on your application requirements. Lambda currently supports Node.js, Java, Python, Go, and .NET Core. From the Role dropdown, choose Choose an existing role, and from the Existing role dropdown, choose MyLambdaRole. Then choose Create function.

MyLambdaFunction now has access to CloudWatch Logs and DynamoDB. You can choose either of these services to see the details of which permissions the function has.

If you have any comments about this blog post, submit them in the “Comments” section below. If you have any questions about the services used, start a new thread in the applicable AWS forum: IAM, Lambda, DynamoDB, or CloudWatch.

It’s fair to say that of all video games anti-piracy technologies, Denuvo is perhaps the most hated of recent times. That hatred unsurprisingly stems from both its success and complexity.

Those with knowledge of the system say it’s fiendishly difficult to defeat but in recent times, cracks have been showing. In 2017, various iterations of the anti-tamper system were defeated by several cracking groups, much to the delight of the pirate masses.

Now, however, a new development has the potential to herald a new lease of life for the Austria-based anti-piracy company. A few moments ago it was revealed that the company has been bought by Irdeto, a global anti-piracy company with considerable heritage and resources.

“Irdeto has acquired Denuvo, the world leader in gaming security, to provide anti-piracy and anti-cheat solutions for games on desktop, mobile, console and VR devices,” Irdeto said in a statement.

“Denuvo provides technology and services for game publishers and platforms, independent software vendors, e-publishers and video publishers across the globe. Current Denuvo customers include Electronic Arts, UbiSoft, Warner Bros and Lionsgate Entertainment, with protection provided for games such as Star Wars Battlefront II, Football Manager, Injustice 2 and others.”

Irdeto says that Denuvo will “continue to operate as usual” with all of its staff retained – a total of 45 across Austria, Poland, the Czech Republic, and the US. Denuvo headquarters in Salzburg, Austria, will also remain intact along with its sales operations.

“The success of any game title is dependent upon the ability of the title to operate as the publisher intended,” says Irdeto CEO Doug Lowther.

“As a result, protection of both the game itself and the gaming experience for end users is critical. Our partnership brings together decades of security expertise under one roof to better address new and evolving security threats. We are looking forward to collaborating as a team on a number of initiatives to improve our core technology and services to better serve our customers.”

Denuvo was founded relatively recently in 2013 and employs less than 50 people. In contrast, Irdeto’s roots go all the way back to 1969 and currently has almost 1,000 staff. It’s a subsidiary of South Africa-based Internet and media group Naspers, a corporate giant with dozens of notable companies under its control.

While Denuvo is perhaps best known for its anti-piracy technology, Irdeto is also placing emphasis on the company’s ability to hinder cheating in online multi-player gaming environments. This has become a hot topic recently, with several lawsuits filed in the US by companies including Blizzard and Epic.

Denuvo CEO Reinhard Blaukovitsch

“Hackers and cybercriminals in the gaming space are savvy, and always have been. It is critical to implement robust security strategies to combat the latest gaming threats and protect the investment in games. Much like the movie industry, it’s the only way to ensure that great games continue to get made,” says Denuvo CEO Reinhard Blaukovitsch.

“In joining with Irdeto, we are bringing together a unique combination of security expertise, technology and enhanced piracy services to aggressively address security challenges that customers and gamers face from hackers.”

While it seems likely that the companies have been in negotiations for some, the timing of this announcement also coincides with negative news for Denuvo.

Yesterday it was revealed that the latest variant of its anti-tamper technology – Denuvo v4.8 – had been defeated by online cracking group CPY (Conspiracy). Version 4.8 had been protecting Sonic Forces since its release early November 2017 but the game was leaked out onto the Internet late Sunday with all protection neutralized.

It’s also noteworthy that Irdeto supplied behind-the-scenes support in two of the largest IPTV provider raids of recent times, one focused on Spain in 2017 and more recently in Cyprus, Bulgaria, Greece and the Netherlands (1,2,3).

GrafanaCon EU Tickets are Going Fast!

We’re six weeks from kicking off GrafanaCon EU! Join us for talks from Google, Bloomberg, Tinder, eBay and more! You won’t want to miss two great days of open source monitoring talks and fun in Amsterdam. Get your tickets before they sell out!

Grafana Plugins

We have a couple of plugin updates to share this week that add some new features and improvements. Updating your plugins is easy. For on-prem Grafana, use the Grafana-cli tool, or update with 1 click on your Hosted Grafana.

UPDATED PLUGIN

Druid Data Source – This new update is packed with new features. Notable enhancement include:

Breadcrumb Panel – The Breadcrumb Panel is a small panel you can include in your dashboard that tracks other dashboards you have visited – making it easy to navigate back to a previously visited dashboard. The latest release adds support for dashboards loaded from a file.

Upcoming Events

In between code pushes we like to speak at, sponsor and attend all kinds of conferences and meetups. We also like to make sure we mention other Grafana-related events happening all over the world. If you’re putting on just such an event, let us know and we’ll list it here.

FOSDEM | Brussels, Belgium – Feb 3-4, 2018: FOSDEM is a free developer conference where thousands of developers of free and open source software gather to share ideas and technology. There is no need to register; all are welcome.

Jfokus | Stockholm, Sweden – Feb 5-7, 2018:Carl Bergquist – Quickie: Monitoring? Not OPS ProblemWhy should we monitor our system? Why can’t we just rely on the operations team anymore? They use to be able to do that. What’s currently changing? Presentation content: – Why do we monitor our system – How did it use to work? – Whats changing – Why do we need to shift focus – Everyone should be on call. – Resilience is the goal (Best way of having someone care about quality is to make them responsible).

Jfokus | Stockholm, Sweden – Feb 5-7, 2018:Leonard Gram – Presentation: DevOps DeconstructedWhat’s a Site Reliability Engineer and how’s that role different from the DevOps engineer my boss wants to hire? I really don’t want to be on call, should I? Is Docker the right place for my code or am I better of just going straight to Serverless? And why should I care about any of it? I’ll try to answer some of these questions while looking at what DevOps really is about and how commodisation of servers through “the cloud” ties into it all. This session will be an opinionated piece from a developer who’s been on-call for the past 6 years and would like to convince you to do the same, at least once.

Stockholm Metrics and Monitoring | Stockholm, Sweden – Feb 7, 2018:Observability 3 ways – Logging, Metrics and Distributed TracingLet’s talk about often confused telemetry tools: Logging, Metrics and Distributed Tracing. We’ll show how you capture latency using each of the tools and how they work differently. Through examples and discussion, we’ll note edge cases where certain tools have advantages over others. By the end of this talk, we’ll better understand how each of Logging, Metrics and Distributed Tracing aids us in different ways to understand our applications.

OpenNMS – Introduction to “Grafana” | Webinar – Feb 21, 2018:IT monitoring helps detect emerging hardware damage and performance bottlenecks in the enterprise network before any consequential damage or disruption to business processes occurs. The powerful open-source OpenNMS software monitors a network, including all connected devices, and provides logging of a variety of data that can be used for analysis and planning purposes. In our next OpenNMS webinar on February 21, 2018, we introduce “Grafana” – a web-based tool for creating and displaying dashboards from various data sources, which can be perfectly combined with OpenNMS.

Grafana Labs is Hiring!

We are passionate about open source software and thrive on tackling complex challenges to build the future. We ship code from every corner of the globe and love working with the community. If this sounds exciting, you’re in luck – WE’RE HIRING!

Tags

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.