Use the Join button above to receive notification of new posts in this series.

In 2009, Google disclosed that they had 400 recruiters on staff working to hire nearly 10,000 people. Someday, that might be your challenge, but most companies in their early days are looking to hire a handful of people — the right people — each year. Assuming you are closer to startup stage than Google stage, let’s look at who you need to hire, when to hire them, where to find them (and how to help them find you), and how to get them to join your company.

Who Should Be Your First Hires

In later stage companies, the roles in the company have been well fleshed out, don’t change often, and each role can be segmented to focus on a specific area. A large company may have an entire department focused on just cubicle layout; at a smaller company you may not have a single person whose actual job encompasses all of facilities. At Backblaze, our CTO has a passion and knack for facilities and mostly led that charge. Also, the needs of a smaller company are quick to change. One of our first hires was a QA person, Sean, who ended up being 100% focused on data center infrastructure. In the early stage, things can shift quite a bit and you need people that are broadly capable, flexible, and most of all willing to pitch in where needed.

That said, there are times you may need an expert. At a previous company we hired Jon, a PhD in Bayesian statistics, because we needed algorithmic analysis for spam fighting. However, even that person was not only able and willing to do the math, but also code, and to not only focus on Bayesian statistics but explore a plethora of spam fighting options.

When To Hire

If you’ve raised a lot of cash and are willing to burn it with mistakes, you can guess at all the roles you might need and start hiring for them. No judgement: that’s a reasonable strategy if you’re cash-rich and time-poor.

If your cash is limited, try to see what you and your team are already doing and then hire people to take those jobs. It may sound counterintuitive, but if you’re already doing it presumably it needs to be done, you have a good sense of the type of skills required to do it, and you can bring someone on-board and get them up to speed quickly. That then frees you up to focus on tasks that can’t be done by someone else. At Backblaze, I ran marketing internally for years before hiring a VP of Marketing, making it easier for me to know what we needed. Once I was hiring, my primary goal was to find someone I could trust to take that role completely off of me so I could focus solely on my CEO duties

Where To Find the Right People

Finding great people is always difficult, particularly when the skillsets you’re looking for are highly in-demand by larger companies with lots of cash and cachet. You, however, have one massive advantage: you need to hire 5 people, not 5,000.

People You Worked With

The absolutely best people to hire are ones you’ve worked with before that you already know are good in a work situation. Consider your last job, the one before, and the one before that. A significant number of the people we recruited at Backblaze came from our previous startup MailFrontier. We knew what they could do and how they would fit into the culture, and they knew us and thus could quickly meld into the environment. If you didn’t have a previous job, consider people you went to school with or perhaps individuals with whom you’ve done projects previously.

People You Know

Hiring friends, family, and others can be risky, but should be considered. Sometimes a friend can be a “great buddy,” but is not able to do the job or isn’t a good fit for the organization. Having to let go of someone who is a friend or family member can be rough. Have the conversation up front with them about that possibility, so you have the ability to stay friends if the position doesn’t work out. Having said that, if you get along with someone as a friend, that’s one critical component of succeeding together at work. At Backblaze we’ve hired a number of people successfully that were friends of someone in the organization.

Friends Of People You Know

Your network is likely larger than you imagine. Your employees, investors, advisors, spouses, friends, and other folks all know people who might be a great fit for you. Make sure they know the roles you’re hiring for and ask them if they know anyone that would fit. Search LinkedIn for the titles you’re looking for and see who comes up; if they’re a 2nd degree connection, ask your connection for an introduction.

People You Know About

Sometimes the person you want isn’t someone anyone knows, but you may have read something they wrote, used a product they’ve built, or seen a video of a presentation they gave. Reach out. You may get a great hire: worst case, you’ll let them know they were appreciated, and make them aware of your organization.

Other Places to Find People

There are a million other places to find people, including job sites, community groups, Facebook/Twitter, GitHub, and more. Consider where the people you’re looking for are likely to congregate online and in person.

A Comment on Diversity

Hiring “People You Know” can often result in “Hiring People Like You” with the same workplace experiences, culture, background, and perceptions. Some studies have shown [1, 2, 3, 4] that homogeneous groups deliver faster, while heterogeneous groups are more creative. Also, “Hiring People Like You” often propagates the lack of women and minorities in tech and leadership positions in general. When looking for people you know, keep an eye to not discount people you know who don’t have the same cultural background as you.

Helping People To Find You

Reaching out proactively to people is the most direct way to find someone, but you want potential hires coming to you as well. To do this, they have to a) be aware of you, b) know you have a role they’re interested in, and c) think they would want to work there. Let’s tackle a) and b) first below.

Your Blog

I started writing our blog before we launched the product and talked about anything I found interesting related to our space. For several years now our team has owned the content on the blog and in 2017 over 1.5 million people read it. Each time we have a position open it’s published to the blog. If someone finds reading about backup and storage interesting, perhaps they’d want to dig in deeper from the inside. Many of the people we’ve recruited have mentioned reading the blog as either how they found us or as a factor in why they wanted to work here.[BTW, this is Gleb’s 200th post on Backblaze’s blog. The first was in 2008. — Editor]

Your Email List

In addition to the emails our blog subscribers receive, we send regular emails to our customers, partners, and prospects. These are largely focused on content we think is directly useful or interesting for them. However, once every few months we include a small mention that we’re hiring, and the positions we’re looking for. Often a small blurb is all you need to capture people’s imaginations whether they might find the jobs interesting or can think of someone that might fit the bill.

Your Social Involvement

Whether it’s Twitter or Facebook, Hacker News or Slashdot, your potential hires are engaging in various communities. Being socially involved helps make people aware of you, reminds them of you when they’re considering a job, and paints a picture of what working with you and your company would be like. Adam was in a Reddit thread where we were discussing our Storage Pods, and that interaction was ultimately part of the reason he left Apple to come to Backblaze.

Convincing People To Join

Once you’ve found someone or they’ve found you, how do you convince them to join? They may be currently employed, have other offers, or have to relocate. Again, while the biggest companies have a number of advantages, you might have more unique advantages than you realize.

Why Should They Join You

Here are a set of items that you may be able to offer which larger organizations might not:

Role: Consider the strengths of the role. Perhaps it will have broader scope? More visibility at the executive level? No micromanagement? Ability to take risks? Option to create their own role?

Compensation: In addition to salary, will their options potentially be worth more since they’re getting in early? Can they trade-off salary for more options? Do they get option refreshes?

Benefits: In addition to healthcare, food, and 401(k) plans, are there unique benefits of your company? One company I knew took the entire team for a one-month working retreat abroad each year.

Location: Most people prefer to work close to home. If you’re located outside of the San Francisco Bay Area, you might be at a disadvantage for not being in the heart of tech. But if you find employees close to you you’ve got a huge advantage. Sometimes it’s micro; even in the Bay Area the difference of 5 miles can save 20 minutes each way every day. We located the Backblaze headquarters in San Mateo, a middle-ground that made it accessible to those coming from San Jose and San Francisco. We also chose a downtown location near a train, restaurants, and cafes: all to make it easier and more pleasant. Also, are you flexible in letting your employees work remotely? Our systems administrator Elliott is about to embark on a long-term cross-country journey working from an RV.

Environment: Open office, cubicle, cafe, work-from-home? Loud/quiet? Social or focused? 24×7 or work-life balance? Different environments appeal to different people.

Team: Who will they be working with? A company with 100,000 people might have 100 brilliant ones you’d want to work with, but ultimately we work with our core team. Who will your prospective hires be working with?

Market: Some people are passionate about gaming, others biotech, still others food. The market you’re targeting will get different people excited.

Product: Have an amazing product people love? Highlight that. If you’re lucky, your potential hire is already a fan.

Mission: Curing cancer, making people happy, and other company missions inspire people to strive to be part of the journey. Our mission is to make storing data astonishingly easy and low-cost. If you care about data, information, knowledge, and progress, our mission helps drive all of them.

Culture: I left this for last, but believe it’s the most important. What is the culture of your company? Finding people who want to work in the culture of your organization is critical. If they like the culture, they’ll fit and continue it. We’ve worked hard to build a culture that’s collaborative, friendly, supportive, and open; one in which people like coming to work. For example, the five founders started with (and still have) the same compensation and equity. That started a culture of “we’re all in this together.” Build a culture that will attract the people you want, and convey what the culture is.

Writing The Job Description

Most job descriptions focus on the all the requirements the candidate must meet. While important to communicate, the job description should first sell the job. Why would the appropriate candidate want the job? Then share some of the requirements you think are critical. Remember that people read not just what you say but how you say it. Try to write in a way that conveys what it is like to actually be at the company. Ahin, our VP of Marketing, said the job description itself was one of the things that attracted him to the company.

Orchestrating Interviews

Much can be said about interviewing well. I’m just going to say this: make sure that everyone who is interviewing knows that their job is not only to evaluate the candidate, but give them a sense of the culture, and sell them on the company. At Backblaze, we often have one person interview core prospects solely for company/culture fit.

Onboarding

Hiring success shouldn’t be defined by finding and hiring the right person, but instead by the right person being successful and happy within the organization. Ensure someone (usually their manager) provides them guidance on what they should be concentrating on doing during their first day, first week, and thereafter. Giving new employees opportunities and guidance so that they can achieve early wins and feel socially integrated into the company does wonders for bringing people on board smoothly

In Closing

Our Director of Production Systems, Chris, said to me the other day that he looks for companies where he can work on “interesting problems with nice people.” I’m hoping you’ll find your own version of that and find this post useful in looking for your early and critical hires.

Of course, I’d be remiss if I didn’t say, if you know of anyone looking for a place with “interesting problems with nice people,” Backblaze is hiring. 😉

Microservice-application requirements have changed dramatically in recent years. These days, applications operate with petabytes of data, need almost 100% uptime, and end users expect sub-second response times. Typical N-tier applications can’t deliver on these requirements.

Reactive Manifesto, published in 2014, describes the essential characteristics of reactive systems including: responsiveness, resiliency, elasticity, and being message driven.

Being message driven is perhaps the most important characteristic of reactive systems. Asynchronous messaging helps in the design of loosely coupled systems, which is a key factor for scalability. In order to build a highly decoupled system, it is important to isolate services from each other. As already described, isolation is an important aspect of the microservices pattern. Indeed, reactive systems and microservices are a natural fit.

Many ad-tracking companies collect massive amounts of data in near-real-time. In many cases, these workloads are very spiky and heavily depend on the success of the ad-tech companies’ customers. Typically, an ad-tracking-data use case can be separated into a real-time part and a non-real-time part. In the real-time part, it is important to collect data as fast as possible and ask several questions including:, “Is this a valid combination of parameters?,””Does this program exist?,” “Is this program still valid?”

Because response time has a huge impact on conversion rate in advertising, it is important for advertisers to respond as fast as possible. This information should be kept in memory to reduce communication overhead with the caching infrastructure. The tracking application itself should be as lightweight and scalable as possible. For example, the application shouldn’t have any shared mutable state and it should use reactive paradigms. In our implementation, one main application is responsible for this real-time part. It collects and validates data, responds to the client as fast as possible, and asynchronously sends events to backend systems.

The non-real-time part of the application consumes the generated events and persists them in a NoSQL database. In a typical tracking implementation, clicks, cookie information, and transactions are matched asynchronously and persisted in a data store. The matching part is not implemented in this reference architecture. Many ad-tech architectures use frameworks like Hadoop for the matching implementation.

The system can be logically divided into the data collection partand the core data updatepart. The data collection part is responsible for collecting, validating, and persisting the data. In the core data update part, the data that is used for validation gets updated and all subscribers are notified of new data.

Components and Services

Main Application The main application is implemented using Java 8 and uses Vert.x as the main framework. Vert.x is an event-driven, reactive, non-blocking, polyglot framework to implement microservices. It runs on the Java virtual machine (JVM) by using the low-level IO library Netty. You can write applications in Java, JavaScript, Groovy, Ruby, Kotlin, Scala, and Ceylon. The framework offers a simple and scalable actor-like concurrency model. Vert.x calls handlers by using a thread known as an event loop. To use this model, you have to write code known as “verticles.” Verticles share certain similarities with actors in the actor model. To use them, you have to implement the verticle interface. Verticles communicate with each other by generating messages in a single event bus. Those messages are sent on the event bus to a specific address, and verticles can register to this address by using handlers.

With only a few exceptions, none of the APIs in Vert.x block the calling thread. Similar to Node.js, Vert.x uses the reactor pattern. However, in contrast to Node.js, Vert.x uses several event loops. Unfortunately, not all APIs in the Java ecosystem are written asynchronously, for example, the JDBC API. Vert.x offers a possibility to run this, blocking APIs without blocking the event loop. These special verticles are called worker verticles. You don’t execute worker verticles by using the standard Vert.x event loops, but by using a dedicated thread from a worker pool. This way, the worker verticles don’t block the event loop.

Our application consists of five different verticles covering different aspects of the business logic. The main entry point for our application is the HttpVerticle, which exposes an HTTP-endpoint to consume HTTP-requests and for proper health checking. Data from HTTP requests such as parameters and user-agent information are collected and transformed into a JSON message. In order to validate the input data (to ensure that the program exists and is still valid), the message is sent to the CacheVerticle.

This verticle implements an LRU-cache with a TTL of 10 minutes and a capacity of 100,000 entries. Instead of adding additional functionality to a standard JDK map implementation, we use Google Guava, which has all the features we need. If the data is not in the L1 cache, the message is sent to the RedisVerticle. This verticle is responsible for data residing in Amazon ElastiCache and uses the Vert.x-redis-client to read data from Redis. In our example, Redis is the central data store. However, in a typical production implementation, Redis would just be the L2 cache with a central data store like Amazon DynamoDB. One of the most important paradigms of a reactive system is to switch from a pull- to a push-based model. To achieve this and reduce network overhead, we’ll use Redis pub/sub to push core data changes to our main application.

The verticle subscribes to the appropriate Redis pub/sub-channel. If a message is sent over this channel, the payload is extracted and forwarded to the cache-verticle that stores the data in the L1-cache. After storing and enriching data, a response is sent back to the HttpVerticle, which responds to the HTTP request that initially hit this verticle. In addition, the message is converted to ByteBuffer, wrapped in protocol buffers, and send to an Amazon Kinesis Data Stream.

The following example shows a stripped-down version of the KinesisVerticle:

Kinesis Consumer This AWS Lambda function consumes data from an Amazon Kinesis Data Stream and persists the data in an Amazon DynamoDB table. In order to improve testability, the invocation code is separated from the business logic. The invocation code is implemented in the class KinesisConsumerHandler and iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped and transformed from ByteBuffer to protocol buffers and converted into a Java object. Those Java objects are passed to the business logic, which persists the data in a DynamoDB table. In order to improve duration of successive Lambda calls, the DynamoDB-client is instantiated lazily and reused if possible.

Redis Updater From time to time, it is necessary to update core data in Redis. A very efficient implementation for this requirement is using AWS Lambda and Amazon Kinesis. New core data is sent over the AWS Kinesis stream using JSON as data format and consumed by a Lambda function. This function iterates over the Kinesis events pulled from the Kinesis stream by AWS Lambda. Each Kinesis event is unwrapped and transformed from ByteBuffer to String and converted into a Java object. The Java object is passed to the business logic and stored in Redis. In addition, the new core data is also sent to the main application using Redis pub/sub in order to reduce network overhead and converting from a pull- to a push-based model.

The following example shows the source code to store data in Redis and notify all subscribers:

Infrastructure as Code As already outlined, latency and response time are a very critical part of any ad-tracking solution because response time has a huge impact on conversion rate. In order to reduce latency for customers world-wide, it is common practice to roll out the infrastructure in different AWS Regions in the world to be as close to the end customer as possible. AWS CloudFormation can help you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS.

You create a template that describes all the AWS resources that you want (for example, Amazon EC2 instances or Amazon RDS DB instances), and AWS CloudFormation takes care of provisioning and configuring those resources for you. Our reference architecture can be rolled out in different Regions using an AWS CloudFormation template, which sets up the complete infrastructure (for example, Amazon Virtual Private Cloud (Amazon VPC), Amazon Elastic Container Service (Amazon ECS) cluster, Lambda functions, DynamoDB table, Amazon ElastiCache cluster, etc.).

Conclusion In this blog post we described reactive principles and an example architecture with a common use case. We leveraged the capabilities of different frameworks in combination with several AWS services in order to implement reactive principles—not only at the application-level but also at the system-level. I hope I’ve given you ideas for creating your own reactive applications and systems on AWS.

About the Author

Sascha Moellering is a Senior Solution Architect. Sascha is primarily interested in automation, infrastructure as code, distributed computing, containers and JVM. He can be reached at [email protected]

Back in January 2017, China’s Ministry of Industry and Information Technology announced a 14-month campaign to crack down on ‘unauthorized’ Internet platforms.

China said that Internet technologies and services had been expanding in a “disorderly” fashion, so regulation was required. No surprise then that the campaign targeted censorship-busting VPN services, which are used by citizens and corporations to traverse the country’s Great Firewall.

Heralding a “nationwide Internet network access services clean-up”, China warned that anyone operating such a service would require a government telecommunications business license. It’s now been more than a year since that announcement and much has happened in the interim.

In July 2017, Apple removed 674 VPN apps from its App Store and in September, a local man was jailed for nine months for selling VPN software. In December, another man was jailed for five-and-a-half years for selling a VPN service without an appropriate license from the government.

This week the government provided an update on the crackdown, telling the media that it will begin forcing local and foreign companies and individuals to use only government-approved systems to access the wider Internet.

Ministry of Industry and Information Technology (MIIT) chief engineer Zhang Feng reiterated earlier comments that VPN operators must be properly licensed by the government, adding that unlicensed VPNs will be subjected to new rules which come into force on March 31. The government plans to block unauthorized VPN providers, official media reported.

“Any foreign companies that want to set up a cross-border operation for private use will need to set up a dedicated line for that purpose,” he said.

“They will be able to lease such a line or network legally from the telecommunications import and export bureau. This shouldn’t affect their normal operations much at all.”

Radio Free Asia reports that state-run telecoms companies including China Mobile, China Unicom, and China Telecom, which are approved providers, have all been ordered to prevent their 1.3 billion subscribers from accessing blocked content with VPNs.

So, it appears that VPN providers are still allowed in China, so long as they’re officially licensed and approved by the government. However, in order to get that licensing they need to comply with government regulations, which means that people cannot use them to access content restricted by the Great Firewall.

All that being said, Zhang is reported as saying that people shouldn’t be concerned that their data is insecure as a result – neither providers nor the government are able to access content sent over a state-approved VPN service, he claimed.

“The rights for using normal intentional telecommunications services is strictly protected,” said Zhang, adding that regulation means that communications are “secure”.

My career is a different story. Over the past two decades and a change, I went from writing CGI scripts and setting up WAN routers for a chain of shopping malls, to doing pentests for institutional customers, to designing a series of network monitoring platforms and handling incident response for a big telco, to building and running the product security org for one of the largest companies in the world. It’s been an interesting ride – and now that I’m on the hook for the well-being of about 100 folks across more than a dozen subteams around the world, I’ve been thinking a bit about the lessons learned along the way.

Of course, I’m a bit hesitant to write such a post: sometimes, your efforts pan out not because of your approach, but despite it – and it’s possible to draw precisely the wrong conclusions from such anecdotes. Still, I’m very proud of the culture we’ve created and the caliber of folks working on our team. It happened through the work of quite a few talented tech leads and managers even before my time, but it did not happen by accident – so I figured that my observations may be useful for some, as long as they are taken with a grain of salt.

But first, let me start on a somewhat somber note: what nobody tells you is that one’s level on the leadership ladder tends to be inversely correlated with several measures of happiness. The reason is fairly simple: as you get more senior, a growing number of people will come to you expecting you to solve increasingly fuzzy and challenging problems – and you will no longer be patted on the back for doing so. This should not scare you away from such opportunities, but it definitely calls for a particular mindset: your motivation must come from within. Look beyond the fight-of-the-day; find satisfaction in seeing how far your teams have come over the years.

With that out of the way, here’s a collection of notes, loosely organized into three major themes.

The curse of a techie leader

Perhaps the most interesting observation I have is that for a person coming from a technical background, building a healthy team is first and foremost about the subtle art of letting go.

There is a natural urge to stay involved in any project you’ve started or helped improve; after all, it’s your baby: you’re familiar with all the nuts and bolts, and nobody else can do this job as well as you. But as your sphere of influence grows, this becomes a choke point: there are only so many things you could be doing at once. Just as importantly, the project-hoarding behavior robs more junior folks of the ability to take on new responsibilities and bring their own ideas to life. In other words, when done properly, delegation is not just about freeing up your plate; it’s also about empowerment and about signalling trust.

Of course, when you hand your project over to somebody else, the new owner will initially be slower and more clumsy than you; but if you pick the new leads wisely, give them the right tools and the right incentives, and don’t make them deathly afraid of messing up, they will soon excel at their new jobs – and be grateful for the opportunity.

A related affliction of many accomplished techies is the conviction that they know the answers to every question even tangentially related to their domain of expertise; that belief is coupled with a burning desire to have the last word in every debate. When practiced in moderation, this behavior is fine among peers – but for a leader, one of the most important skills to learn is knowing when to keep your mouth shut: people learn a lot better by experimenting and making small mistakes than by being schooled by their boss, and they often try to read into your passing remarks. Don’t run an authoritarian camp focused on total risk aversion or perfectly efficient resource management; just set reasonable boundaries and exit conditions for experiments so that they don’t spiral out of control – and be amazed by the results every now and then.

Death by planning

When nothing is on fire, it’s easy to get preoccupied with maintaining the status quo. If your current headcount or budget request lists all the same projects as last year’s, or if you ever find yourself ending an argument by deferring to a policy or a process document, it’s probably a sign that you’re getting complacent. In security, complacency usually ends in tears – and when it doesn’t, it leads to burnout or boredom.

In my experience, your goal should be to develop a cadre of managers or tech leads capable of coming up with clever ideas, prioritizing them among themselves, and seeing them to completion without your day-to-day involvement. In your spare time, make it your mission to challenge them to stay ahead of the curve. Ask your vendor security lead how they’d streamline their work if they had a 40% jump in the number of vendors but no extra headcount; ask your product security folks what’s the second line of defense or containment should your primary defenses fail. Help them get good ideas off the ground; set some mental success and failure criteria to be able to cut your losses if something does not pan out.

Of course, malfunctions happen even in the best-run teams; to spot trouble early on, instead of overzealous project tracking, I found it useful to encourage folks to run a data-driven org. I’d usually ask them to imagine that a brand new VP shows up in our office and, as his first order of business, asks “why do you have so many people here and how do I know they are doing the right things?”. Not everything in security can be quantified, but hard data can validate many of your assumptions – and will alert you to unseen issues early on.

When focusing on data, it’s important not to treat pie charts and spreadsheets as an art unto itself; if you run a security review process for your company, your CSAT scores are going to reach 100% if you just rubberstamp every launch request within ten minutes of receiving it. Make sure you’re asking the right questions; instead of “how satisfied are you with our process”, try “is your product better as a consequence of talking to us?”

Whenever things are not progressing as expected, it is a natural instinct to fall back to micromanagement, but it seldom truly cures the ill. It’s probable that your team disagrees with your vision or its feasibility – and that you’re either not listening to their feedback, or they don’t think you’d care. It’s good to assume that most of your employees are as smart or smarter than you; barking your orders at them more loudly or more frequently does not lead anyplace good. It’s good to listen to them and either present new facts or work with them on a plan you can all get behind.

In some circumstances, all that’s needed is honesty about the business trade-offs, so that your team feels like your “partner in crime”, not a victim of circumstance. For example, we’d tell our folks that by not falling behind on basic, unglamorous work, we earn the trust of our VPs and SVPs – and that this translates into the independence and the resources we need to pursue more ambitious ideas without being told what to do; it’s how we game the system, so to speak. Oh: leading by example is a pretty powerful tool at your disposal, too.

The human factor

I’ve come to appreciate that hiring decent folks who can get along with others is far more important than trying to recruit conference-circuit superstars. In fact, hiring superstars is a decidedly hit-and-miss affair: while certainly not a rule, there is a proportion of folks who put the maintenance of their celebrity status ahead of job responsibilities or the well-being of their peers.

For teams, one of the most powerful demotivators is a sense of unfairness and disempowerment. This is where tech-originating leaders can shine, because their teams usually feel that their bosses understand and can evaluate the merits of the work. But it also means you need to be decisive and actually solve problems for them, rather than just letting them vent. You will need to make unpopular decisions every now and then; in such cases, I think it’s important to move quickly, rather than prolonging the uncertainty – but it’s also important to sincerely listen to concerns, explain your reasoning, and be frank about the risks and trade-offs.

Whenever you see a clash of personalities on your team, you probably need to respond swiftly and decisively; being right should not justify being a bully. If you don’t react to repeated scuffles, your best people will probably start looking for other opportunities: it’s draining to put up with constant pie fights, no matter if the pies are thrown straight at you or if you just need to duck one every now and then.

More broadly, personality differences seem to be a much better predictor of conflict than any technical aspects underpinning a debate. As a boss, you need to identify such differences early on and come up with creative solutions. Sometimes, all you need is taking some badly-delivered but valid feedback and having a conversation with the other person, asking some questions that can help them reach the same conclusions without feeling that their worldview is under attack. Other times, the only path forward is making sure that some folks simply don’t run into each for a while.

Finally, dealing with low performers is a notoriously hard but important part of the game. Especially within large companies, there is always the temptation to just let it slide: sideline a struggling person and wait for them to either get over their issues or leave. But this sends an awful message to the rest of the team; for better or worse, fairness is important to most. Simply firing the low performers is seldom the best solution, though; successful recovery cases are what sets great managers apart from the average ones.

Oh, one more thought: people in leadership roles have their allegiance divided between the company and the people who depend on them. The obligation to the company is more formal, but the impact you have on your team is longer-lasting and more intimate. When the obligations to the employer and to your team collide in some way, make sure you can make the right call; it might be one of the the most consequential decisions you’ll ever make.

It seems like only yesterday that we crossed the 350 petabyte mark. It was actually June 2017, but boy have we been growing since. In October 2017 we crossed 400 petabytes. Today, we’re proud to announce we’ve crossed the 500 petabyte mark. That’s a very healthy clip, see for yourself!

Whether you have 50 GB, 500 GB or are just an avid blog reader, thank you for being on this incredible journey with us through the years.

…we’re literally moving at 1,000,000 files per hour.

We’re extremely proud of our track record. Throughout these 11 years we’ve striven to be the simplest, fastest, and most affordable online backup (and now cloud storage) solution available. We’re not just focusing on data ingress, but also adhering to our original goal of making sure that “no one ever loses data again.” How quickly are we restoring data? On average, we’re literally moving at 1,000,000 files per hour.

Even after all these years, one of the most frequent questions asked is, “How has Backblaze maintained such affordable pricing, particularly when the industry continues to move away from unlimited data plans?”

The cloud storage industry is very competitive, with cloud sync, storage, and backup providers leaving the unlimited market every single day: OneDrive, Amazon Cloud Storage, and most recently CrashPlan. Other providers either have tiered pricing (iDrive), or charge almost double or even triple for all the features we provide for our unlimited backup service (Carbonite). So how do we do it?

The answer comes down to our relentless pursuit of lowering costs. Our open-source Backblaze Storage Pods comprise our Backblaze Vaults, and the less expensive and more performant our Storage Pods are, the better the service that we can provide. This all directly translates into the service and pricing we can offer you.

A key part of our service is to be as open as possible with our costs and structure. After all, you are entrusting us with some of your most valuable assets. Still, it is very difficult to find an apples to apples comparison to what our competitors are doing. For example, we can gain some insight from a 2011 interview with Carbonite’s CEO, who gave an interview in which he said Carbonite’s cost of storing a petabyte was $250,000. At the time, our cost to store a petabyte was $76,481 (more on that calculation can be found here and here). If Backblaze’s fundamental cost to store data is one-third that of Carbonite’s, it makes sense that Carbonite’s cost to its customers would be more than Backblaze’s. Today, Backblaze backup is $50/year and Carbonite’s equivalent service is $149.99.

Our continued focus on reducing costs has allowed us to maintain a healthy business. And after accepting customer data for almost 10 years, we sincerely want to thank you all for giving us your trust, and allowing us to protect your important data and memories for you. Here’s to the next 500 petabytes; they’ll be here before we know it.

Update 2/5/18

Since publishing this post, we have posted the latest in our series of Hard Drive Stats, in which we summarize the performance of the hard drives we used in our data centers in 2017 and previously.

The copyright holders of Dallas Buyers Club have sued thousands of BitTorrent users over the past few years.

The film company first obtains the identity of the Internet account holder believed to have pirated the movie, after which most cases are settled behind closed doors.

It doesn’t always go this easily though. A lawsuit in an Oregon federal court has been ongoing for nearly three years but in this case the defendant was running a Tor exit node, which complicates matters.

Tor is an anonymity tool and operating a relay or exit point basically means that the traffic of hundreds or thousands of users hit the Internet from your IP-address. When pirates use Tor, it will then appear as if the traffic comes from this connection.

The defendant in this lawsuit, John Huszar, has repeatedly denied that he personally downloaded a pirated copy of the film. However, he is now facing substantial damages because he failed to respond to a request for admissions, which stated that he distributed the film.

Not responding to such an admission means that the court can assume the statement is true.

“An admission, even an admission deemed admitted because of a failure to respond, is binding on the party at trial,” Dallas Buyers Club noted in a recent filing, demanding a summary judgment.

The unanswered admissions

Huszar was represented by various attorneys over the course of the lawsuit, but when the admissions were “deemed admitted” he was unrepresented and in poor health.

According to his lawyer, Ballas Buyers Club is using this to obtain a ruling in its favor. The film company argues that the Tor exit node operator admitted willful infringement, which could cost him up to $150,000 in damages.

The admissions present a serious problem. However, even if they’re taken as truth, they are not solid proof, according to the defense. For example, the portion of the film could have just been a trailer.

In addition, the defense responds with several damaging accusations of its own.

According to Huszar’s lawyer, it is unclear whether Dallas Buyers Club LLC has the proper copyrights to sue his client. In previous court cases in Australia and Texas, this ownership was put in doubt.

“In the case at bar, because of facts established in other courts, there is a genuine issue as to whether or not DBC owns the right to sue for copyright infringement,” the defense writes.

As licensing constructions can be quite complex, this isn’t unthinkable. Just last week another U.S. District Court judge told the self-proclaimed owners of the movie Fathers & Daughters that they didn’t have the proper rights to take an alleged pirate to trial.

Another issue highlighted by the defense is the reliability of witnesses Daniel Macek and Ben Perino. Both men are connected to the BitTorrent tracking outfit MaverickEye, and are not without controversy, as reported previously.

“[B]oth parties have previously been found to lack the qualifications, experience, education, and licenses to offer such forensic or expert testimony,” the defense writes, citing a recent case.

Finally, the defense also highlights that given the fact that Huszar operated a Tor exit-node, anyone could have downloaded the film.

The defense, therefore, asks the court to deny Dallas Buyers Club’s motion for summary judgment, or at least allow the defendant to conduct additional discovery to get to the bottom of the copyright ownership issue.

It’s not often that movies escape being pirated online but last weekend was a pretty miserable one for the people behind Thor:Ragnarok.

Just four months after the superhero movie’s theatrical debut, the Marvel hit was due to be released on disc February 26th, with digital distribution on iTunes planned for February 19th.

However, due to what appeared to be some kind of pre-order blunder, the $180 million movie was leaked online, resulting in a pirate frenzy that’s still ongoing.

But with the accidental early release of Thor:Ragnarok making waves within the torrent system and beyond, it seems ironic that its talented director actually has another relationship with piracy that most people aren’t aware of.

In an interview for ‘Q’, a show broadcast on Canada’s CBC radio, Taika Waititi noted that Thor: Ragnarok might be a “career ender” for him, something that was previously highlighted in the media.

However, the softly-spoken New Zealander also said some other things that flew completely under the radar but given recent developments, now have new significance.

Speaking with broadcaster Tom Power, Waititi revealed that when putting together his promotional showreel for Thor: Ragnarok, he obtained its source material from illegal sources.

Explaining the process used to acquire clips to create his ‘sizzle reel’ (a short video highlighting a director’s vision and tone for a proposed movie), Waititi revealed his less-than-official approach.

With Power quickly assuring the director that admitting doing something illegal was OK on air, Waititi perhaps realized it probably wasn’t.

“You can cut that out,” he suggested.

That Waititi took the ‘pirate’ approach to obtaining source material for his ‘sizzle reel’ isn’t really a surprise. Content is freely accessible online, crucially in easier to consume and edit formats than even Waititi has access to on short notice. And, since every film in memory is just a few clicks away, it’d be counter-intuitive not to use the resource in the name of creativity.

Overall then, it’s extremely unlikely that Waititi’s pirate confession will come to much. Two of his previous feature films, ‘Boy’ and ‘Hunt For The Wilderpeople’, held titles for the highest-grossing New Zealand film, the latter achieving the accolade in 2017.

Also in 2017, Waititi was named New Zealander of the Year in recognition of his “outstanding contribution to the well being of the nation.” Praise doesn’t come much higher than that.

How many torrent swarms he helped to keep healthy is destined remain a secret forever though, but as an emerging movie hero in his own right, people will forgive him that.

Late 2017, Boing Boing co-editor Xena Jardin posted an article in which he linked to an archive containing every Playboy centerfold image to date.

“Kind of amazing to see how our standards of hotness, and the art of commercial erotic photography, have changed over time,” Jardin noted.

While Boing Boing had nothing to do with the compilation, uploading, or storing of the Imgur-based archive, Playboy took exception to the popular blog linking to the album.

Noting that Jardin had referred to the archive uploader as a “wonderful person”, the adult publication responded with a lawsuit (pdf), claiming that Boing Boing had commercially exploited its copyrighted images.

Last week, with assistance from the Electronic Frontier Foundation, Boing Boing parent company Happy Mutants filed a motion to dismiss in which it defended its right to comment on and link to copyrighted content without that constituting infringement.

“This lawsuit is frankly mystifying. Playboy’s theory of liability seems to be that it is illegal to link to material posted by others on the web — an act performed daily by hundreds of millions of users of Facebook and Twitter, and by journalists like the ones in Playboy’s crosshairs here,” the company wrote.

EFF Senior Staff Attorney Daniel Nazer weighed in too, arguing that since Boing Boing’s reporting and commenting is protected by copyright’s fair use doctrine, the “deeply flawed” lawsuit should be dismissed.

Now, just a week later, Playboy has fired back. Opposing Happy Mutants’ request for the Court to dismiss the case, the company cites the now-famous Perfect 10 v. Amazon/Google case from 2007, which tried to prevent Google from facilitating access to infringing images.

Playboy highlights the court’s finding that Google could have been held contributorily liable – if it had knowledge that Perfect 10 images were available using its search engine, could have taken simple measures to prevent further damage, but failed to do so.

Turning to Boing Boing’s conduct, Playboy says that the company knew it was linking to infringing content, could have taken steps to prevent that, but failed to do so. It then launches an attack on the site itself, offering disparaging comments concerning its activities and business model.

“This is an important case. At issue is whether clickbait sites like Happy Mutants’ Boing Boing weblog — a site designed to attract viewers and encourage them to click on links in order to generate advertising revenue — can knowingly find, promote, and profit from infringing content with impunity,” Playboy writes.

“Clickbait sites like Boing Boing are not known for creating original content. Rather, their business model is based on ‘collecting’ interesting content created by others. As such, they effectively profit off the work of others without actually creating anything original themselves.”

Playboy notes that while sites like Boing Boing are within their rights to leverage works created by others, courts in the US and overseas have ruled that knowingly linking to infringing content is unacceptable.

Even given these conditions, Playboy argues, Happy Mutants and the EFF now want the Court to dismiss the case so that sites are free to “not only encourage, facilitate, and induce infringement, but to profit from those harmful activities.”

Claiming that Boing Boing’s only reason for linking to the infringing album was to “monetize the web traffic that over fifty years of Playboy photographs would generate”, Playboy insists that the site and parent company Happy Mutants was properly charged with copyright infringement.

Playboy also dismisses Boing Boing’s argument that a link to infringing content cannot result in liability due to the link having both infringing and substantial non-infringing uses.

First citing the Betamax case, which found that maker Sony could not be held liable for infringement because its video recorders had substantial non-infringing uses, Playboy counters with the Grokster decision, which held that a distributor of a product could be liable for infringement, if there was an intent to encourage or support infringement.

“In this case, Happy Mutants’ offending link — which does nothing more than support infringing content — is good for nothing but promoting infringement and there is no legitimate public interest in its unlicensed availability,” Playboy notes.

In its motion to dismiss, Happy Mutants also argued that unless Playboy could identify users who “in fact downloaded — rather than simply viewing — the material in question,” the case should be dismissed. However, Playboy rejects the argument, claiming it is based on an erroneous interpretation of the law.

Citing the Grokster decision once more, the adult publisher notes that the Supreme Court found that someone infringes contributorily when they intentionally induce or encourage direct infringement.

“The argument that contributory infringement only lies where the defendant’s actions result in further infringement ignores the ‘or’ and collapses ‘inducing’ and ‘encouraging’ into one thing when they are two distinct things,” Playboy writes.

As for Boing Boing’s four classic fair use arguments, the publisher describes these as “extremely weak” and proceeds to hit them one by one.

In respect of the purpose and character of the use, Playboy discounts Boing Boing’s position that the aim of its post was to show “how our standards of hotness, and the art of commercial erotic photography, have changed over time.” The publisher argues that is the exact same purpose of Playboy magazine, while highliting its publication Playboy: The Compete Centerfolds, 1953-2016.

Moving on to the second factor of fair use – the nature of the copyrighted work – Playboy notes that an entire album of artwork is involved, rather than just a single image.

On the third factor, concerning the amount and substantiality of the original work used, Playboy argues that in order to publish an opinion on how “standards of hotness” had developed over time, there was no need to link to all of the pictures in the archive.

“Had only representative images from each decade, or perhaps even each year, been taken, this would be a very different case — but Happy Mutants cannot dispute that it knew it was linking to an illegal library of ‘Every Playboy Playmate Centerfold Ever’ since that is what it titled its blog post,” Playboy notes.

Finally, when considering the effect of the use upon the potential market for or value of the copyrighted work, Playbody says its archive of images continues to be monetized and Boing Boing’s use of infringing images jeopardizes that.

“Given that people are generally not going to pay for what is freely available, it is disingenuous of Happy Mutants to claim that promoting the free availability of infringing archives of Playboy’s work for viewing and downloading is not going to have an adverse effect on the value or market of that work,” the publisher adds.

While it appears the parties agree on very little, there is agreement on one key aspect of the case – its wider importance.

On the one hand, Playboy insists that a finding in its favor will ensure that people can’t commercially exploit infringing content with impunity. On the other, Boing Boing believes that the health of the entire Internet is at stake.

“The world can’t afford a judgment against us in this case — it would end the web as we know it, threatening everyone who publishes online, from us five weirdos in our basements to multimillion-dollar, globe-spanning publishing empires like Playboy,” the company concludes.

Playboy’s opposition to Happy Mutants’ motion to dismiss can be found here (pdf)

An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse that enables you easily to make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas, to simple de-normalized tables for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:

COPY data from multiple, evenly sized files.

Use workload management to improve ETL runtimes.

Perform table maintenance regularly.

Perform multiple steps in a single transaction.

Loading data in bulk.

Use UNLOAD to extract large result sets.

Use Amazon Redshift Spectrum for ad hoc ETL processing.

Monitor daily ETL health using diagnostic queries.

1. COPY data from multiple, evenly sized files

Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.

When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:

When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.

When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.

2. Use workload management to improve ETL runtimes

Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.

I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.

When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:

Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.

Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.

Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.

Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.

3. Perform table maintenance regularly

Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables regularly are VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and have VACUUM & ANALYZE executed in a regular fashion.

Use VACUUM to sort tables and remove deleted blocks

During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.

DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.

After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.

Use the following approaches to ensure that VACCUM is completed in a timely manner:

Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.

If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.

Consider using time series This helps reduce the amount of data you need to VACUUM.

Use ANALYZE to update database statistics

Amazon Redshift uses a cost-based query planner and optimizer using statistics about tables to make good decisions about the query plan for the SQL statements. Regular statistics collection after the ETL completion ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the statistics off (pct_stats_off) less than 20% ensures effective query plans for the SQL queries.

4. Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. For example, here is an example multi-step ETL script that performs one commit at the end:

5. Loading data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.

User ALTER table APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data.

6. Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:

Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter to and keep files are 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.

7. Use Redshift Spectrum for ad hoc ETL processing

Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.

8. Monitor daily ETL health using diagnostic queries

Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following monitoring scripts can be used to provide insights into the health of your ETL processes:

Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.

There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of monitoring of your ETL processes.

Example ETL process

The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Step 1: Extract from the RDBMS source to a S3 bucket

In this ETL process, the data extract job fetches change data every 1 hour and it is staged into multiple hourly files. For example, the staged S3 folder looks like the following:

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3:// <<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Summary

Amazon Redshift lets you easily operate petabyte-scale data warehouses on the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring. I also demonstrated the best practices being used in a typical sample ETL workload to transform the data into Amazon Redshift.

If you have questions or suggestions, please comment below.

About the Author

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Abstract: This report assesses the impact disclosure of data breaches has on the total returns and volatility of the affected companies’ stock, with a focus on the results relative to the performance of the firms’ peer industries, as represented through selected indices rather than the market as a whole. Financial performance is considered over a range of dates from 3 days post-breach through 6 months post-breach, in order to provide a longer-term perspective on the impact of the breach announcement.

Key findings:

While the difference in stock price between the sampled breached companies and their peers was negative (1.13%) in the first 3 days following announcement of a breach, by the 14th day the return difference had rebounded to + 0.05%, and on average remained positive through the period assessed.

For the differences in the breached companies’ betas and the beta of their peer sets, the differences in the means of 8 months pre-breach versus post-breach was not meaningful at 90, 180, and 360 day post-breach periods.

For the differences in the breached companies’ beta correlations against the peer indices pre- and post-breach, the difference in the means of the rolling 60 day correlation 8 months pre- breach versus post-breach was not meaningful at 90, 180, and 360 day post-breach periods.

In regression analysis, use of the number of accessed records, date, data sensitivity, and malicious versus accidental leak as variables failed to yield an R2 greater than 16.15% for response variables of 3, 14, 60, and 90 day return differential, excess beta differential, and rolling beta correlation differential, indicating that the financial impact on breached companies was highly idiosyncratic.

Based on returns, the most impacted industries at the 3 day post-breach date were U.S. Financial Services, Transportation, and Global Telecom. At the 90 day post-breach date, the three most impacted industries were U.S. Financial Services, U.S. Healthcare, and Global Telecom.

The market isn’t going to fix this. If we want better security, we need to regulate the market.

Note: The article is behind a paywall. An older version is here. A similar article is here.

AWS recently announced the general availability of Windows container management for Amazon Elastic Container Service (Amazon ECS). Docker containers and Amazon ECS make it easy to run and scale applications on a virtual machine by abstracting the complex cluster management and setup needed.

Classic .NET applications are developed with .NET Framework 4.7.1 or older and can run only on a Windows platform. These include Windows Communication Foundation (WCF), ASP.NET Web Forms, and an ASP.NET MVC web app or web API.

Why classic ASP.NET?

ASP.NET MVC 4.6 and older versions of ASP.NET occupy a significant footprint in the enterprise web application space. As enterprises move towards microservices for new or existing applications, containers are one of the stepping stones for migrating from monolithic to microservices architectures. Additionally, the support for Windows containers in Windows 10, Windows Server 2016, and Visual Studio Tooling support for Docker simplifies the containerization of ASP.NET MVC apps.

Getting started

In this post, you pick an ASP.NET 4.6.2 MVC application and get step-by-step instructions for migrating to ECS using Windows containers. The detailed steps, AWS CloudFormation template, Microsoft Visual Studio solution, ECS service definition, and ECS task definition are available in the aws-ecs-windows-aspnet GitHub repository.

To help you getting started running Windows containers, here is the reference architecture for Windows containers on GitHub: ecs-refarch-cloudformation-windows. This reference architecture is the layered CloudFormation stack, in that it calls the other stacks to create the environment. The CloudFormation YAML template in this reference architecture is referenced to create a single JSON CloudFormation stack, which is used in the steps for the migration.

Steps for Migration

Your development environment needs to have the latest version and updates for Visual Studio 2017, Windows 10, and Docker for Windows Stable.

Next, containerize the ASP.NET application and test it locally. The size of Windows container application images is generally larger compared to Linux containers. This is because the base image of the Windows container itself is large in size, typically greater than 9 GB.

After the application is containerized, the container image needs to be pushed to Amazon Elastic Container Registry (Amazon ECR). Images stored in ECR are compressed to improve pull times and reduce storage costs. In this case, you can see that ECR compresses the image to around 1 GB, for an optimization factor of 90%.

Create a CloudFormation stack using the template in the ‘CloudFormation template’ folder. This creates an ECS service, task definition (referring the containerized ASP.NET application), and other related components mentioned in the ECS reference architecture for Windows containers.

After the stack is created, verify the successful creation of the ECS service, ECS instances, running tasks (with the threshold mentioned in the task definition), and the Application Load Balancer’s successful health check against running containers.

Navigate to the Application Load Balancer URL and see the successful rendering of the containerized ASP.NET MVC app in the browser.

Key Notes

Generally, Windows container images occupy large amount of space (in the order of few GBs).

All the task definition parameters for Linux containers are not available for Windows containers. For more information, see Windows Task Definitions.

An Application Load Balancer can be configured to route requests to one or more ports on each container instance in a cluster. The dynamic port mapping allows you to have multiple tasks from a single service on the same container instance.

IAM roles for Windows tasks require extra configuration. For more information, see Windows IAM Roles for Tasks. For this post, configuration was handled by the CloudFormation template.

Summary

In this post, you migrated an ASP.NET MVC application to ECS using Windows containers.

The logical next step is to automate the activities for migration to ECS and build a fully automated continuous integration/continuous deployment (CI/CD) pipeline for Windows containers. This can be orchestrated by leveraging services such as AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, Amazon ECR, and Amazon ECS. You can learn more about how this is done in the Set Up a Continuous Delivery Pipeline for Containers Using AWS CodePipeline and Amazon ECS post.

Many customers ask for guidance on migrating end-to-end solutions running on virtual machines over to AWS. This post provides an overview of moving a common WordPress blog running on a virtualized platform to AWS, including re-pointing the DNS records associated to with the website.

AWS Server Migration Service (AWS SMS) is an agentless service that makes it easier and faster for you to migrate thousands of on-premises workloads to AWS. AWS SMS allows you to automate, schedule, and track incremental replications of live server volumes, making it easier for you to coordinate large-scale server migrations.

Walkthrough

The key elements of this migration process include the following steps:

Before you start, ensure that your source systems OS and vCenter version are supported by AWS. For more information, see the Server Migration Service FAQ.

Establish your AWS environment

For this walkthrough, your WordPress blog is currently running as a two-tier LAMP stack in a corporate data center. You have a frontend running Apache and PHP, plus a backend database running on MySQL. All systems are hosted on a virtualized platform.

First, establish your AWS environment. If your organization is new to AWS, this may include account or subaccount creation, a new virtual private cloud (VPC), and associated subnets, route tables, internet gateways, and so on. Think of this phase as setting up your software-defined data center. For more information, see Getting Started with Amazon EC2.

The blog is a two-tier stack, so go with two private subnets. Because you want it to be highly available, use multiple Availability Zones. A zone resides within an AWS Region. Each zone is isolated, but the zones within a region are connected through low-latency links. This allows architects and solution designers to build highly available solutions.

Replicate your database

WordPress uses a MySQL relational database. You could continue to manage MySQL and the associated EC2 instances associated with maintaining and scaling a database. For this walkthrough, use this opportunity to migrate to an RDS instance of Amazon Aurora, as it is a MySQL compliant database. Not only is Amazon Aurora a high-performant database engine but it frees you up to focus on application development by managing time-consuming database administration tasks, including backups, software patching, monitoring, scaling, and replication.

Use AWS Database Migration Service to migrate your MySQL database to Amazon Aurora easily and securely. After a database migration instance has been instantiated, configure the source and destination endpoints and create a replication task.

Install and configure the SMS Connector appliance

Launch a new VM based on the SMS Connector that you downloaded. To configure the connector, connect to it via HTTPS. You can obtain the SMS Connector IP address from your hypervisor.

Connect to the SMS Connector via HTTPS. In the example above, the connector IP address is 10.0.0.31. In your browser, enter https://10.0.0.31.

Configure the connector with the IAM and hypervisor credentials that you created earlier.

After it’s configured, and the associated connectivity and authentication checks have passed, return to the console and view your connector in AWS SMS.

Import your virtual machine inventory and create a replication job After validating that the SMS Connector is in a “HEALTHY” state, import your server catalog to AWS SMS. This process can take up to a minute.

Select the server to migrate and choose Create replication job. The console guides you through the process. The time that the initial replication task takes to complete is dependent on the available bandwidth and the size of your VM. After the initial seed replication, network bandwidth is minimized as AWS SMS replicates only incremental changes occurring on the VM.

Launch your EC2 instance

When your replication task is complete, the artifact created by AWS SMS is a custom AMI that you can use to deploy an EC2 instance. Follow the usual process to launch your EC2 instance, noting that you may need to replace any host-based firewalls with security groups and NACLs.

When you create an EC2 instance, ensure that you pick the most suitable EC2 instance type and size to match your performance requirements while optimizing for cost.

While your new EC2 instance is a replica of your on-premises VM, you should always validate that applications are functioning. How you do this differs on an application-by-application basis. You can use a combination of approaches, such as editing a local host file and testing your application, SSH, or Telnet.

From the RDS console, get your connection string details and update your WordPress configuration file to point to the Amazon Aurora database. As WordPress is expecting a MySQL database and Amazon Aurora is MySQL-compliant, this change of database engine is transparent to WordPress.

You have validated that your WordPress application is running correctly, as you are still receiving changes from your on-premises data center via AWS DMS into your Amazon Aurora database.

You can now update your DNS zone file using Amazon Route 53. Route 53 can be driven by multiple methods: console, SDK, or AWS CLI.

For this walkthrough, update your DNS zone file via the AWS CLI. The JSON example shows upserting the A record in your zone to resolve to your EC2 instance.

Use the AWS CLI to execute the request and update the record in your zone file. The cut-over period between the original off-cloud location and AWS is defined by the TTL in the SOA (statement of authority) in your DNS zone. During this period, any requests resolving to your off-cloud server that result in database writes are automatically replicated to your Amazon Aurora instance via AWS DMS.

You have now successfully migrated your WordPress blog to AWS. Based on the TTL of your DNS zone file, end users slowly resolve the WordPress blog to AWS.

Summary

In this post, you moved a WordPress blog to AWS, using AWS SMS and AWS DMS to re-point the associated DNS records.

Many architectures can be extended to use many of the inherent benefits of AWS, with little effort. For example, by using Amazon CloudWatch metrics to drive Auto Scaling policies, you can use an Application Load Balancer as your frontend. This removes the single point of failure for a single Amazon EC2 instance and ensures that your deployed capacity closely follows customer demand. Think big and get building!

At inception, Backblaze was a consumer company. Thousands upon thousands of individuals came to our website and gave us $5/mo to keep their data safe. But, we didn’t sell business solutions. It took us years before we had a sales team. In the last couple of years, we’ve released products that businesses of all sizes love: Backblaze B2 Cloud Storage and Backblaze for Business Computer Backup. Those businesses want to integrate Backblaze deeply into their infrastructure, so it’s time to hire our first Sales Engineer!

Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 – robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.

We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.

We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.

Some Backblaze Perks:

Competitive healthcare plans

Competitive compensation and 401k

All employees receive Option grants

Unlimited vacation days

Strong coffee

Fully stocked Micro kitchen

Catered breakfast and lunches

Awesome people who work on awesome projects

Childcare bonus

Normal work hours

Get to bring your pets into the office

San Mateo Office – located near Caltrain and Highways 101 & 280.

Backblaze B2 cloud storage is a building block for almost any computing service that requires storage. Customers need our help integrating B2 into iOS apps to Docker containers. Some customers integrate directly to the API using the programming language of their choice, others want to solve a specific problem using ready made software, already integrated with B2.

At the same time, our computer backup product is deepening it’s integration into enterprise IT systems. We are commonly asked for how to set Windows policies, integrate with Active Directory, and install the client via remote management tools.

We are looking for a sales engineer who can help our customers navigate the integration of Backblaze into their technical environments.

Are you 1/2” deep into many different technologies, and unafraid to dive deeper?

Can you confidently talk with customers about their technology, even if you have to look up all the acronyms right after the call?

Are you excited to setup complicated software in a lab and write knowledge base articles about your work?

Then Backblaze is the place for you!

Enough about Backblaze already, what’s in it for me?
In this role, you will be given the opportunity to learn about the technologies that drive innovation today; diverse technologies that customers are using day in and out. And more importantly, you’ll learn how to learn new technologies.

Just as an example, in the past 12 months, we’ve had the opportunity to learn and become experts in these diverse technologies:

How to setup VM servers for lab environments, both on-prem and using cloud services.

Create an automatically “resetting” demo environment for the sales team.

How to install and monitor online backup installations using RMM tools, like JAMF.

Tape (LTO) systems. (Yes – people still use tape for storage!)

How can I know if I’ll succeed in this role?

You have:

Confidence. Be able to ask customers questions about their environments and convey to them your technical acumen.

Curiosity. Always want to learn about customers’ situations, how they got there and what problems they are trying to solve.

Organization. You’ll work with customers, integration partners, and Backblaze team members on projects of various lengths. You can context switch and either have a great memory or keep copious notes. Your checklists have their own checklists.

You are versed in:

The fundamentals of Windows, Linux and Mac OS X operating systems. You shouldn’t be afraid to use a command line.

Building, installing, integrating and configuring applications on any operating system.

Novice development skills in any programming/scripting language. Have basic understanding of data structures and program flow.

Your background contains:

Bachelor’s degree in computer science or the equivalent.

2+ years of experience as a pre or post-sales engineer.

The right extra credit:
There are literally hundreds of previous experiences you can have had that would make you perfect for this job. Some experiences that we know would be helpful for us are below, but make sure you tell us your stories!

Experience using or programming against Amazon S3.

Experience with large on-prem storage – NAS, SAN, Object. And backing up data on such storage with tools like Veeam, Veritas and others.

Experience with photo or video media. Media archiving is a key market for Backblaze B2.

Experience with Windows Servers, Active Directory, Group policies and the like.

What’s it like working with the Sales team?
The Backblaze sales team collaborates. We help each other out by sharing ideas, templates, and our customer’s experiences. When we talk about our accomplishments, there is no “I did this,” only “we”. We are truly a team.

We are honest to each other and our customers and communicate openly. We aim to have fun by embracing crazy ideas and creative solutions. We try to think not outside the box, but with no boxes at all. Customers are the driving force behind the success of the company and we care deeply about their success.

Foundational changes in the protocol

While MQTT 5 is a major update to the existing protocol specification, the new version of the lightweight IoT protocol is more of an evolution rather than a revolution and retained all characteristics that contributed to its success: Its lightweightness, push communication, unique features, ease of use, extreme scalability, suitability for mobile networks and decoupling of communication participants.

Although some foundational mechanics were added or changed slightly, the new version still feels like MQTT and it sticks to its principles that made it the most popular Internet of Things protocol to date. This blog post will analyze everything you need to know about the foundational changes in version 5 of the MQTT specification before digging deep into the details of the new features during the next weeks.

MQTT is still MQTT. Mostly.

The good news: If you’re familiar with MQTT 3.1.1 (if you aren’t – start here and read the MQTT Essentials first!), then all principles and features you know about MQTT are still valid for MQTT 5. Some details of former features like Last Will and Testament changed a bit or some features were extended and additional popular features implemented by HiveMQ like TTL or Shared Subscriptions were added to the new specification.

The protocol also slightly changed on the wire and an additional control packet (AUTH) was added. But all in all, the MQTT version 5 is still clearly recognizable as MQTT.

Properties in the MQTT Header & Reason Codes

One of the most exciting and most flexible new MQTT 5 features is the possibility to add custom key-value properties in the MQTT header. This feature deserves its own blog post as it is a game changer for many deployments. Similar to protocols like HTTP, MQTT clients and brokers can add an arbitrary number of custom (or pre-defined) headers to carry metadata. This metadata can be used for application specific data. Pre-defined headers are used for the implementation of most of the new MQTT features.

Many MQTT packets now also include Reason Codes. A Reason Code indicates that a pre-defined protocol error occurred. These reason codes are typically carried on acknowledgement packets and allow client and broker to interpret error conditions (and potentially work around them). The Reason Codes are sometimes called Negative Acknowledgements. The following MQTT packets can carry Reason Codes:

CONNACH

PUBACK

PUBREC

PUBREL

PUBCOMP

SUBACK

UNSUBACK

AUTH

DISCONNECT

Reason Codes for negative acknowledgements range from “Quota Exceeded” to “Protocol Error”. Clients and brokers are responsible for interpreting these new Reason Codes.

CONNACK Return Codes for unsupported features

With the popularity of MQTT, a lot of MQTT implementations were created and offered by companies. Not all of these implementations are completely MQTT specification compatible since sometimes features are not implemented like QoS 2, retained messages or persistent sessions. On a side note, HiveMQ is of course fully MQTT specification conformal and supports all features.

MQTT 5 provides a way for incomplete MQTT implementations (as often found in SaaS offerings) to indicate that the broker does not support specific features. It’s the client’s job to make sure that none of the unsupported features are used. The broker implementation uses pre-defined headers in the CONNACK packet (which is sent by the broker after the client sent a CONNECT packet) to indicate that specific features are not supported. These headers can of course also be used to send a notice to the client that it has no permission to use specific features.

The following pre-defined headers for indicating unimplemented features (or features not permitted for use by the client) are available in MQTT 5:

Pre-Defined Header

Data Type

Description

Retain Available

Boolean

Are retained messages available?

Maximum QoS

Boolean

The maximum QoS the client is allowed to use for publishing messages or subscribing to topics

These return codes are a major step forward for communicating the permissions of individual MQTT clients in heterogeneous environments. The downside of this new functionality is that MQTT clients need to implement the interpretation of these codes themselves and need to make sure that application programmers don’t use features that are not supported by the broker (or the client has no permission for).
HiveMQ is going to support all MQTT 5 features 100%, so these custom headers are only expected to be used if the administrator choses to use them for permissions in deployments.

Clean Session is now Clean Start

A popular MQTT 3.1.1 functionality is the use of clean sessions by MQTT clients, which have temporary connections or don’t subscribe to messages at all. When connecting to the broker, the client had to choose to send a CONNECT packet with the cleanSession flag enabled or disabled. With a clean session, a MQTT client indicates that the broker should discard any data for the client as soon as the underlying TCP connection breaks or if the client decides to disconnect from the broker. Also, if there was a previous session associated with the client identifier on the broker, a cleanSession CONNECT packet forced the broker to delete the previous data.

With MQTT 5, a client can choose to use a Clean Start (indicated by the Clean Start flag in the CONNECT message). When using this flag, the broker discards any previous session data and the client starts with a fresh session. The session won’t be cleaned automatically after the TCP connection was closed between client and server. To trigger the deletion of the session after the client disconnected, a new header field called “Session Expiry Interval” must be set to the value 0.

The new Clean Start streamlines and simplifies the session handling of MQTT, as it allows more flexibility and is easier to implement than the cleanSession/persistent session concept. With MQTT 5 all session are persistent unless the “Session Expiry Interval” is 0. Deletion of a session occurs after the timeout or when the client reconnects with Clean Start.

Additional MQTT Packet

MQTT 5 introduces a new MQTT Packet: The AUTH packet. This new packet is extremely useful for implementing non-trivial authentication mechanisms and we expect that this packet will be used a lot in production environments. The exact semantics are covered in a future blog post.

For the moment, it’s important to realize that this new packet can be sent by brokers and clients after connection establishment to use complex challenge/response authentication methods (like SCRAM or Kerberos as defined in the SAML framework, but can also be used for state-of-the-art authentication methods for IoT like OAuth. This packet also allows re-authentication of MQTT clients without closing the connection. Stay tuned for a deep dive into the packet very soon!

New Data Type: UTF-8 String pairs

The advent of custom headers also required a new data type to be introduced. UTF-8 string pairs. This string pair is essentially a key-value structure with both, key and value, as String data type. This data type is currently only used for custom headers.

With this new data type, MQTT uses 7 different data types that are used on the wire:

Bit

Two Byte Integer

Four Byte Integer

UTF-8 Encoded String

Variable Byte Integer

Binary Data

UTF-8 String Pair

Most application users typically use Binary Data and UTF-8 encoded Strings in the APIs of their MQTT library. With MQTT 5, UTF-8 String Pairs may also be used frequently. All other data types are hidden from the user but are used on the wire to craft valid MQTT packets by the MQTT client libraries and brokers.

Bi-directional DISCONNECT packets

With MQTT 3.1.1, the client could indicate that it wanted to disconnect gracefully by sending a DISCONNECT packet prior to closing the underlying TCP connection. There was no way for the MQTT broker to notify a MQTT client that something bad happened and that the broker is going to close the TCP connection. This changed with the new protocol version.
The broker is now allowed to send a MQTT DISCONNECT packet prior to closing the socket. The client is now able to interpret the reason why it was disconnected and take the respective action. A broker is not required to indicate the exact reason (e.g. for security reasons), but at least for developing applications this helps a lot to figure out why a connection was closed by the broker.

Of course DISCONNECT packets can carry Reason Codes, so it’s easy to indicate what was the reason for the disconnection (e.g. in case of invalid permissions).

No retry for QoS 1 and 2 messages

MQTT clients use standing TCP (or similar protocols with the same guarantees) connections as underlying transport. A healthy TCP connection gives bi-directional connectivity with exactly-once and in-order guarantees, so all MQTT packets sent by clients or brokers will arrive on the other end. In case the TCP connection breaks, while the message is in-flight, QoS 1 and 2 give message delivery guarantees over multiple TCP connections.

MQTT 3.1.1 allowed the re-delivery of MQTT messages while the TCP connection is healthy. In practice, this is a very bad idea, since overloaded MQTT clients may get overloaded even more. Just imagine a case where a MQTT client receives a message from a MQTT broker and needs 11 seconds to process the message (and would acknowledge the packet after the processing). Now imagine the broker would retransmit the message after a 10 second timeout. There is no advantage to this approach and it just uses precious bandwidth and overloads the MQTT client even more.

With MQTT 5, brokers and clients are not allowed to retransmit MQTT messages for healthy TCP connections. The brokers and clients must re-send unacknowledged packets when the TCP connection was closed, though. So the QoS 1 and 2 guarantees are as important as with MQTT 3.1.1.

In case you rely on retransmission packets in your use case (e.g. because your implementation does not acknowledge packets in certain cases), we suggest reconsidering this decision before upgrading to MQTT 5.

Using passwords without usernames

MQTT 3.1.1 required the MQTT client to send a username when using a password in the CONNECT packet. For certain use cases this was very inconvenient in case there was no username. A good example would be the use of OAuth, which uses a JSON web token as the only authentication and authorization information. When using such a token with MQTT 3.1.1, static usernames were used frequently, as the only relevant information was in the password field.

While there are more elegant ways in MQTT 5 to carry tokens (e.g. via the AUTH packet), it is still possible to utilize the password field of the CONNECT packet. Now users can just use the password field and don’t need to fill out the username anymore.

Conclusion

Although the MQTT protocol stays basically the same, there are some small changes under the hood, which are enablers for many of the new features in version 5 of the popular IoT protocol. As a user of MQTT libraries, most of these changes are good to know but do not affect how MQTT is used. Developers for MQTT libraries and brokers need to take care of the changes, especially on the changes on the protocol nitty-grittys. There were also some small changes in the specified behavior (e.g. retransmission of messages), so it’s good to revisit design decisions of deployments when upgrading to MQTT 5.

What do you like most of the foundational changes in MQTT? Let us know in the comments!

This is all aimed at frivolous pursuits like video games. Hell, even video games where money is at stake should be deferring to someone who knows way more than I do. Otherwise you might find out that your deck shuffles in your poker game are woefully inadequate and some smartass is cheating you out of millions. (If your random number generator has fewer than 226 bits of state, it can’t even generate every possible shuffling of a deck of cards!)

Say you want to pitch up a sound by a random amount, perhaps up to an octave. Your audio API probably has a way to do this that takes a pitch multiplier, where I say “probably” because that’s how the only audio API I’ve used works.

Easy peasy. If 1 is unchanged and 2 is pitched up by an octave, then all you need is rand() + 1. Right?

No! Pitch is exponential — within the same octave, the “gap” between C and C♯ is about half as big as the gap between B and the following C. If you pick a pitch multiplier uniformly, you’ll have a noticeable bias towards the higher pitches.

One octave corresponds to a doubling of pitch, so if you want to pick a random note, you want 2 ** rand().

For two dimensions, you can just pick a random angle with rand() * TAU.

If you want a vector rather than an angle, or if you want a random direction in three dimensions, it’s a little trickier. You might be tempted to just pick a random point where each component is rand() * 2 - 1 (ranging from −1 to 1), but that’s not quite right. A direction is a point on the surface (or, equivalently, within the volume) of a sphere, and picking each component independently produces a point within the volume of a cube; the result will be a bias towards the corners of the cube, where there’s much more extra volume beyond the sphere.

No? Well, just trust me. I don’t know how to make a diagram for this.

Anyway, you could use the Pythagorean theorem a few times and make a huge mess of things, or it turns out there’s a really easy way that even works for two or four or any number of dimensions. You pick each coordinate from a Gaussian (normal) distribution, then normalize the resulting vector. In other words, using Python’s random module:

Note that it is possible to get zero (or close to it) for every component, in which case the result is nonsense. You can re-roll all the components if necessary; just check that the magnitude (or its square) is less than some epsilon, which is equivalent to throwing away a tiny sphere at the center and shouldn’t affect the distribution.

Since I brought it up: the Gaussian distribution is a pretty nice one for choosing things in some range, where the middle is the common case and should appear more frequently.

That said, I never use it, because it has one annoying drawback: the Gaussian distribution has no minimum or maximum value, so you can’t really scale it down to the range you want. In theory, you might get any value out of it, with no limit on scale.

In practice, it’s astronomically rare to actually get such a value out. I did a hundred million trials just to see what would happen, and the largest value produced was 5.8.

But, still, I’d rather not knowingly put extremely rare corner cases in my code if I can at all avoid it. I could clamp the ends, but that would cause unnatural bunching at the endpoints. I could reroll if I got a value outside some desired range, but I prefer to avoid rerolling when I can, too; after all, it’s still (astronomically) possible to have to reroll for an indefinite amount of time. (Okay, it’s really not, since you’ll eventually hit the period of your PRNG. Still, though.) I don’t bend over backwards here — I did just say to reroll when picking a random direction, after all — but when there’s a nicer alternative I’ll gladly use it.

And lo, there is a nicer alternative! Enter the beta distribution. It always spits out a number in [0, 1], so you can easily swap it in for the standard normal function, but it takes two “shape” parameters α and β that alter its behavior fairly dramatically.

With α = β = 1, the beta distribution is uniform, i.e. no different from rand(). As α increases, the distribution skews towards the right, and as β increases, the distribution skews towards the left. If α = β, the whole thing is symmetric with a hump in the middle. The higher either one gets, the more extreme the hump (meaning that value is far more common than any other). With a little fiddling, you can get a number of interesting curves.

Screenshots don’t really do it justice, so here’s a little Wolfram widget that lets you play with α and β live:

Note that if α = 1, then 1 is a possible value; if β = 1, then 0 is a possible value. You probably want them both greater than 1, which clamps the endpoints to zero.

Also, it’s possible to have either α or β or both be less than 1, but this creates very different behavior: the corresponding endpoints become poles.

Anyway, something like α = β = 3 is probably close enough to normal for most purposes but already clamped for you. And you could easily replicate something like, say, NetHack’s incredibly bizarre rnz function.

Say you want some event to have an 80% chance to happen every second. You (who am I kidding, I) might be tempted to do something like this:

1
2

ifrandom()<0.8*dt:do_thing()

In an ideal world, dt is always the same and is equal to 1 / f, where f is the framerate. Replace that 80% with a variable, say P, and every tic you have a P / f chance to do the… whatever it is.

Each second, f tics pass, so you’ll make this check f times. The chance that any check succeeds is the inverse of the chance that every check fails, which is \(1 – \left(1 – \frac{P}{f}\right)^f\).

For P of 80% and a framerate of 60, that’s a total probability of 55.3%. Wait, what?

Consider what happens if the framerate is 2. On the first tic, you roll 0.4 twice — but probabilities are combined by multiplying, and splitting work up by dt only works for additive quantities. You lose some accuracy along the way. If you’re dealing with something that multiplies, you need an exponent somewhere.

But in this case, maybe you don’t want that at all. Each separate roll you make might independently succeed, so it’s possible (but very unlikely) that the event will happen 60 times within a single second! Or 200 times, if that’s someone’s framerate.

If you explicitly want something to have a chance to happen on a specific interval, you have to check on that interval. If you don’t have a gizmo handy to run code on an interval, it’s easy to do yourself with a time buffer:

Using while means rolls still happen even if you somehow skipped over an entire second.

(For the curious, and the nerds who already noticed: the expression \(1 – \left(1 – \frac{P}{f}\right)^f\) converges to a specific value! As the framerate increases, it becomes a better and better approximation for \(1 – e^{-P}\), which for the example above is 0.551. Hey, 60 fps is pretty accurate — it’s just accurately representing something nowhere near what I wanted. Er, you wanted.)

Of course, you can fuss with the classic [0, 1] uniform value however you want. If I want a bias towards zero, I’ll often just square it, or multiply two of them together. If I want a bias towards one, I’ll take a square root. If I want something like a Gaussian/normal distribution, but with clearly-defined endpoints, I might add together n rolls and divide by n. (The normal distribution is just what you get if you roll infinite dice and divide by infinity!)

It’d be nice to be able to understand exactly what this will do to the distribution. Unfortunately, that requires some calculus, which this post is too small to contain, and which I didn’t even know much about myself until I went down a deep rabbit hole while writing, and which in many cases is straight up impossible to express directly.

Here’s the non-calculus bit. A source of randomness is often graphed as a PDF — a probability density function. You’ve almost certainly seen a bell curve graphed, and that’s a PDF. They’re pretty nice, since they do exactly what they look like: they show the relative chance that any given value will pop out. On a bog standard bell curve, there’s a peak at zero, and of course zero is the most common result from a normal distribution.

(Okay, actually, since the results are continuous, it’s vanishingly unlikely that you’ll get exactly zero — but you’re much more likely to get a value near zero than near any other number.)

For the uniform distribution, which is what a classic rand() gives you, the PDF is just a straight horizontal line — every result is equally likely.

If there were a calculus bit, it would go here! Instead, we can cheat. Sometimes. Mathematica knows how to work with probability distributions in the abstract, and there’s a free web version you can use. For the example of squaring a uniform variable, try this out:

(The \[Distributed] is a funny tilde that doesn’t exist in Unicode, but which Mathematica uses as a first-class operator. Also, press shiftEnter to evaluate the line.)

This will tell you that the distribution is… \(\frac{1}{2\sqrt{u}}\). Weird! You can plot it:

1

Plot[%, {u, 0, 1}]

(The % refers to the result of the last thing you did, so if you want to try several of these, you can just do Plot[PDF[…], u] directly.)

The resulting graph shows that numbers around zero are, in fact, vastly — infinitely — more likely than anything else.

What about multiplying two together? I can’t figure out how to get Mathematica to understand this, but a great amount of digging revealed that the answer is -ln x, and from there you can plot them both on Wolfram Alpha. They’re similar, though squaring has a much better chance of giving you high numbers than multiplying two separate rolls — which makes some sense, since if either of two rolls is a low number, the product will be even lower.

What if you know the graph you want, and you want to figure out how to play with a uniform roll to get it? Good news! That’s a whole thing called inverse transform sampling. All you have to do is take an integral. Good luck!

This is all extremely ridiculous. New tactic: Just Simulate The Damn Thing. You already have the code; run it a million times, make a histogram, and tada, there’s your PDF. That’s one of the great things about computers! Brute-force numerical answers are easy to come by, so there’s no excuse for producing something like rnz. (Though, be sure your histogram has sufficiently narrow buckets — I tried plotting one for rnz once and the weird stuff on the left side didn’t show up at all!)

By the way, I learned something from futzing with Mathematica here! Taking the square root (to bias towards 1) gives a PDF that’s a straight diagonal line, nothing like the hyperbola you get from squaring (to bias towards 0). How do you get a straight line the other way? Surprise: \(1 – \sqrt{1 – u}\).

I don’t claim to have a very firm grasp on this, but I had a hell of a time finding it written out clearly, so I might as well write it down as best I can. This was a great excuse to finally set up MathJax, too.

Say \(u(x)\) is the PDF of the original distribution and \(u\) is a representative number you plucked from that distribution. For the uniform distribution, \(u(x) = 1\). Or, more accurately,

Remember that \(x\) here is a possible outcome you want to know about, and the PDF tells you the relative probability that a roll will be near it. This PDF spits out 1 for every \(x\), meaning every number between 0 and 1 is equally likely to appear.

We want to do something to that PDF, which creates a new distribution, whose PDF we want to know. I’ll use my original example of \(f(u) = u^2\), which creates a new PDF\(v(x)\).

The trick is that we need to work in terms of the cumulative distribution function for \(u\). Where the PDF gives the relative chance that a roll will be (“near”) a specific value, the CDF gives the relative chance that a roll will be less than a specific value.

The conventions for this seem to be a bit fuzzy, and nobody bothers to explain which ones they’re using, which makes this all the more confusing to read about… but let’s write the CDF with a capital letter, so we have \(U(x)\). In this case, \(U(x) = x\), a straight 45° line (at least between 0 and 1). With the definition I gave, this should make sense. At some arbitrary point like 0.4, the value of the PDF is 1 (0.4 is just as likely as anything else), and the value of the CDF is 0.4 (you have a 40% chance of getting a number from 0 to 0.4).

Calculus ahoy: the PDF is the derivative of the CDF, which means it measures the slope of the CDF at any point. For \(U(x) = x\), the slope is always 1, and indeed \(u(x) = 1\). See, calculus is easy.

Okay, so, now we’re getting somewhere. What we want is the CDF of our new distribution, \(V(x)\). The CDF is defined as the probability that a roll \(v\) will be less than \(x\), so we can literally write:

$$V(x) = P(v \le x)$$

(This is why we have to work with CDFs, rather than PDFs — a PDF gives the chance that a roll will be “nearby,” whatever that means. A CDF is much more concrete.)

What is \(v\), exactly? We defined it ourselves; it’s the do something applied to a roll from the original distribution, or \(f(u)\).

$$V(x) = P\!\left(f(u) \le x\right)$$

Now the first tricky part: we have to solve that inequality for \(u\), which means we have to do something, backwards to \(x\).

$$V(x) = P\!\left(u \le f^{-1}(x)\right)$$

Almost there! We now have a probability that \(u\) is less than some value, and that’s the definition of a CDF!

$$V(x) = U\!\left(f^{-1}(x)\right)$$

Hooray! Now to turn these CDFs back into PDFs, all we need to do is differentiate both sides and use the chain rule. If you never took calculus, don’t worry too much about what that means!

Wait! Where did that absolute value come from? It takes care of whether \(f(x)\) increases or decreases. It’s the least interesting part here by far, so, whatever.

There’s one more magical part here when using the uniform distribution — \(u(\dots)\) is always equal to 1, so that entire term disappears! (Note that this only works for a uniform distribution with a width of 1; PDFs are scaled so the entire area under them sums to 1, so if you had a rand() that could spit out a number between 0 and 2, the PDF would be \(u(x) = \frac{1}{2}\).)

$$v(x) = \left|\frac{d}{dx}f^{-1}(x)\right|$$

So for the specific case of modifying the output of rand(), all we have to do is invert, then differentiate. The inverse of \(f(u) = u^2\) is \(f^{-1}(x) = \sqrt{x}\) (no need for a ± since we’re only dealing with positive numbers), and differentiating that gives \(v(x) = \frac{1}{2\sqrt{x}}\). Done! This is also why square root comes out nicer; inverting it gives \(x^2\), and differentiating that gives \(2x\), a straight line.

Incidentally, that method for turning a uniform distribution into any distribution — inverse transform sampling — is pretty much the same thing in reverse: integrate, then invert. For example, when I saw that taking the square root gave \(v(x) = 2x\), I naturally wondered how to get a straight line going the other way, \(v(x) = 2 – 2x\). Integrating that gives \(2x – x^2\), and then you can use the quadratic formula (or just ask Wolfram Alpha) to solve \(2x – x^2 = u\) for \(x\) and get \(f(u) = 1 – \sqrt{1 – u}\).

Multiply two rolls is a bit more complicated; you have to write out the CDF as an integral and you end up doing a double integral and wow it’s a mess. The only thing I’ve retained is that you do a division somewhere, which then gets integrated, and that’s why it ends up as \(-\ln x\).

And that’s quite enough of that! (Okay but having math in my blog is pretty cool and I will definitely be doing more of this, sorry, not sorry.)

Sometimes, random isn’t actually what you want. We tend to use the word “random” casually to mean something more like chaotic, i.e., with no discernible pattern. But that’s not really random. In fact, given how good humans can be at finding incidental patterns, they aren’t all that unlikely! Consider that when you roll two dice, they’ll come up either the same or only one apart almost half the time. Coincidence? Well, yes.

If you ask for randomness, you’re saying that any outcome — or series of outcomes — is acceptable, including five heads in a row or five tails in a row. Most of the time, that’s fine. Some of the time, it’s less fine, and what you really want is variety. Here are a couple examples and some fairly easy workarounds.

The nature of games is such that NPCs will eventually run out of things to say, at which point further conversation will give the player a short brush-off quip — a slight nod from the designer to the player that, hey, you hit the end of the script.

Some NPCs have multiple possible quips and will give one at random. The trouble with this is that it’s very possible for an NPC to repeat the same quip several times in a row before abruptly switching to another one. With only a few options to choose from, getting the same option twice or thrice (especially across an entire game, which may have numerous NPCs) isn’t all that unlikely. The notion of an NPC quip isn’t very realistic to start with, but having someone repeat themselves and then abruptly switch to something else is especially jarring.

The easy fix is to show the quips in order! Paradoxically, this is more consistently varied than choosing at random — the original “order” is likely to be meaningless anyway, and it already has the property that the same quip can never appear twice in a row.

If you like, you can shuffle the list of quips every time you reach the end, but take care here — it’s possible that the last quip in the old order will be the same as the first quip in the new order, so you may still get a repeat. (Of course, you can just check for this case and swap the first quip somewhere else if it bothers you.)

Random drops are often implemented as a flat chance each time. Maybe enemies have a 5% chance to drop health when they die. Legally speaking, over the long term, a player will see health drops for about 5% of enemy kills.

Over the short term, they may be desperate for health and not survive to see the long term. So you may want to put a thumb on the scale sometimes. Games in the Metroid series, for example, have a somewhat infamous bias towards whatever kind of drop they think you need — health if your health is low, missiles if your missiles are low.

I can’t give you an exact approach to use, since it depends on the game and the feeling you’re going for and the variables at your disposal. In extreme cases, you might want to guarantee a health drop from a tough enemy when the player is critically low on health. (Or if you’re feeling particularly evil, you could go the other way and deny the player health when they most need it…)

The problem becomes a little different, and worse, when the event that triggers the drop is relatively rare. The pathological case here would be something like a raid boss in World of Warcraft, which requires hours of effort from a coordinated group of people to defeat, and which has some tiny chance of dropping a good item that will go to only one of those people. This is why I stopped playing World of Warcraft at 60.

Dialing it back a little bit gives us Enter the Gungeon, a roguelike where each room is a set of encounters and each floor only has a dozen or so rooms. Initially, you have a 1% chance of getting a reward after completing a room — but every time you complete a room and don’t get a reward, the chance increases by 9%, up to a cap of 80%. Once you get a reward, the chance resets to 1%.

The natural question is: how frequently, exactly, can a player expect to get a reward? We could do math, or we could Just Simulate The Damn Thing.

We’ve got kind of a hilly distribution, skewed to the left, which is up in this histogram. Most of the time, a player should see a reward every three to six rooms, which is maybe twice per floor. It’s vanishingly unlikely to go through a dozen rooms without ever seeing a reward, so a player should see at least one per floor.

Of course, this simulated a single continuous playthrough; when starting the game from scratch, your chance at a reward always starts fresh at 1%, the worst it can be. If you want to know about how many rewards a player will get on the first floor, hey, Just Simulate The Damn Thing.

Cool. Though, that’s assuming exactly 12 rooms; it might be worth changing that to pick at random in a way that matches the level generator.

(Enter the Gungeon does some other things to skew probability, which is very nice in a roguelike where blind luck can make or break you. For example, if you kill a boss without having gotten a new gun anywhere else on the floor, the boss is guaranteed to drop a gun.)

Say you have a battle sim where every attack has a 6% chance to land a devastating critical hit. Presumably the same rules apply to both the player and the AI opponents.

Consider, then, that the AI opponents have exactly the same 6% chance to ruin the player’s day. Consider also that this gives them an 0.4% chance to critical hit twice in a row. 0.4% doesn’t sound like much, but across an entire playthrough, it’s not unlikely that a player might see it happen and find it incredibly annoying.

Perhaps it would be worthwhile to explicitly forbid AI opponents from getting consecutive critical hits.

An emerging theme here has been to Just Simulate The Damn Thing. So consider Just Simulating The Damn Thing. Even a simple change to a random value can do surprising things to the resulting distribution, so unless you feel like differentiating the inverse function of your code, maybe test out any non-trivial behavior and make sure it’s what you wanted. Probability is hard to reason about.

For developers, tracing and instrumenting code is one of the most valuable tools when debugging code. When you are developing locally, you can use local debugging and profiling tools, but when you deploy an application to the cloud, the task is more challenging. In this blog post, we will look at a new way to instrument your application using AWS X-Ray without adding tracing code to your business logic.

AWS X-Ray

Released to the public earlier this year, AWS X-Ray provides a mechanism for developers to instrument and trace their code while running on AWS. AWS X-Ray enables developers to analyze and debug distributed applications through the entire AWS stack. Developers and operations personnel can also follow the flow of a request through the entire AWS infrastructure. X-Ray works well in a monolithic or microservices model. Either way, developers can get a complete view of their application’s performance and behavior.

X-Ray provides two key mechanisms for analyzing an application:

The service map.

The trace view.

The service map, shown here, provides a high-level view of the services consumed by an application and their relative health:

Application developers often look for a more detailed view of their application to help them answer questions like “Which functions are my bottlenecks?” and “Where is the most latency in the application?” The trace view allows you to see the flow of an application through service calls. It shows you latency between services and the exact execution time of each X-Ray segment.

Similar to other logging and metrics frameworks, X-Ray requires the developer to insert specific code throughout the application. This might cause issues because the logging and metrics code has the potential to increase the complexity of the application code. Increased complexity leads, in turn, to increased maintenance and testing costs. Ideally, developers should add X-Ray tracing to an application in a non-invasive manner, one that does not affect the underlying business logic.

Aspect-oriented programming and the Spring Framework

Aspect-oriented programming (AOP) is a mechanism by which code runs either before, after, or around a target function. The code that runs outside the target code is called an aspect. An aspect provides the ability to perform actions like logging, transaction management, and method retries. The goal is for the aspect to provide these capabilities without affecting the target code. Pointcuts define where these aspects should act in the code. AOP allows developers to leverage powerful functionality without affecting their business logic.

An aspect-oriented approach is a perfect way to implement AWS X-Ray because it keeps the underlying code clean and provides non-invasive, reusable tracing logic.

Simply creating an aspect to invoke X-Ray is not enough. We also have to create pointcuts that tell the aspect where to act. These pointcuts will define which methods we wrap with tracing logic. After they are defined, the aspect sends the entire call stack to X-Ray for visualization.

The Spring Framework is a common application framework for the development of Java software. It provides an extensive programming and config method for modern Java applications.

Spring provides facilities for a range of application functions, including web applications, messaging applications, and streaming data applications. Spring applications are capable of being cloud-native from the onset.

AOP is one of the core components of the Spring Framework. Spring’s implementation of AOP “weaves” application code at runtime. Load-time weaving is also available, but it requires extra configuration at compile or application runtime to work properly. The Spring runtime weaving does not require any special compilation or agents.

AWS X-Ray Spring extensions

Starting with version 1.3.0, the AWS X-Ray SDK lets you use AOP in the Spring Framework to instrument code with no change to the application’s business logic. This means that there is now a non-invasive way to instrument your applications running remotely in AWS.

To include the extension in the code, first add the dependency to the application. If you are using Maven, you add the dependency this way:

After you’ve included this in your application, there are a couple of steps that need configuration before tracing is enabled. First, classes must either be annotated with the @XRayEnabled annotation, or implement the XRayTraced interface. This tells the AOP system to wrap the functions of the affected class for X-Ray instrumentation.

Second, you need an interceptor to actually wrap the code. This involves extending an abstract class, AbstractXRayInterceptor, to activate X-Ray tracing in the application. The AbstractXRayInterceptor contains methods that must be overridden:

generateMetadata – This function allows customization of the metadata attached to the current function’s trace. By default, the class name of the executing function is recorded in the metadata. You can add more data if you need additional insights.

xrayEnabledClasses – This function is empty, and should remain so. It serves as the host for a pointcut instructing the interceptor about which methods to wrap. The developer should define the pointcut. You can specify which classes that are annotated with XRayEnabled you want traced. A pointcut statement of @Pointcut(“@within(com.amazonaws.xray.spring.aop.XRayEnabled) && bean(*Controller)”) tells the interceptor to wrap all controller beans annotated with the @XRayEnabled annotation.

By default, the AbstractXRayInterceptor instruments around all Spring data repository instances.

After it’s configured, the Spring application runs as normal. The X-Ray interceptor picks up annotated classes automatically and builds a trace of the call stack. You can view this call stack in the Traces section of the X-Ray console.

Looking at the trace, you will see the complete call stack of the application, from the controller down through the service calls. Stack traces are arranged in a hierarchy in the same way as typical X-Ray traces. Any exceptions that occur in the call stack are added to the trace by the interceptor automatically. This gives you a complete view of the application’s functionality, performance, and error states. X-Ray provides the convenience of a managed service of the tracing and reporting engine. Without it, the developer would have to manage the tracing and reporting infrastructure.

Conclusion

On its own, X-Ray provides powerful functionality to trace your applications running on AWS. When you combine the service with the ease of use of AOP and the Spring Framework, it is a natural fit.

The demo app code is available for download. Download it today and start integrating deep tracing into your Spring applications.

The Pirate Bay has been both an early adopter and a pioneer when it comes to cryptocurrencies.

Earlier this year the site made headlines when it started to mine cryptocurrency through its visitors, which proved to be a controversial move. Still, many sites followed Pirate Bay’s example.

Pirate Bay’s interest in cryptocurrency wasn’t new though.

The torrent site first allowed people to donate Bitcoin five years ago, which paid off right away. In little more than a day, 73 transactions were sent to Pirate Bay’s address, adding up to a healthy 5.56 BTC, roughly $700 at the time.

Today, the site still accepts Bitcoin donations. While it doesn’t bring in enough to pay all the bills, it doesn’t hurt either.

Around Christmas, The Pirate Bay decided to expand its cryptocurrency donation options. In addition to the traditional Bitcoin address, the torrent site added a Bitcoin Segwit Bech32 option, plus Litecoin and Monero addresses.

While the new donation options show that The Pirate Bay has faith in multiple currencies, the site doesn’t appear to be a fan of them all. The Bitcoin fork “Bitcoin Cash” is also listed, for example, but in a rather unusual way.

“BCH: Bcash. LOL,” reads a mention posted on the site.

BCH: Bcash. LOL

Those who are following the cryptocurrency scene will know that there has been quite a bit of infighting between some supporters of the Bitcoin Cash project and those of the original Bitcoin in recent weeks.

Several high-profile individuals have criticized Bitcoin’s high transaction fees and limitations, while others have very little faith in the future of the Bitcoin Cash alternative.

Although there are not a lot of details available, the “LOL” mention suggests that the TPB team is in the latter camp.

In recent years The Pirate Bay has received a steady but very modest flow of Bitcoin donations. Lasy year we calculated that it ‘raked’ in roughly $9 per day.

However, with the exponential price increase recently, the modest donations now look pretty healthy. Since 2013 The Pirate Bay received well over 135 BTC in donations, which is good for $2 million today. LOL.

As companies mature in their cloud journey, they implement layered security capabilities and practices in their cloud architectures. One such practice is to continually assess golden Amazon Machine Images (AMIs) for security vulnerabilities. AMIs provide the information required to launch an Amazon EC2 instance, which is a virtual server in the AWS Cloud. A golden AMI is an AMI that contains the latest security patches, software, configuration, and software agents that you need to install for logging, security maintenance, and performance monitoring. You can build and deploy golden AMIs in your environment, but the AMIs quickly become dated as new vulnerabilities are discovered.

A security best practice is to perform routine vulnerability assessments of your golden AMIs to identify if newly found vulnerabilities apply to them. If you identify a vulnerability, you can update your golden AMIs with the appropriate security patches, test the AMIs, and deploy the patched AMIs in your environment. In this blog post, I demonstrate how to use Amazon Inspector to set up such continuous vulnerability assessments to scan your golden AMIs routinely.

Solution overview

Amazon Inspector performs security assessments of Amazon EC2 instances by using AWS managed rules packages such as the Common Vulnerabilities and Exposures (CVEs) package. The solution in this post creates EC2 instances from golden AMIs and then runs an Amazon Inspector security assessment on the created instances. When the assessment results are available, the solution consolidates the findings and advises you about next steps. Furthermore, the solution schedules an Amazon CloudWatch Events rule to run the golden AMI vulnerability assessments on a regular basis.

The following solution diagram illustrates how this solution works.

Here’s how this solution works, as illustrated in the preceding diagram:

Later in this blog post, I provide instructions for creating this JSON parameter.

For each AMI specified in the JSON parameter, the Lambda function creates an EC2 instance. When each instance starts, it installs the Amazon Inspector agent by using the user-data script provided in the JSON. The Lambda function then copies each golden AMI’s tags (you will assign custom metadata in the form of tags to each golden AMI when you set up the solution) to the corresponding EC2 instance. The function also adds a tag with the key of continuous-assessment-instance and value as true. This tag identifies EC2 instances that require regular security assessments. The Lambda function copies the AMI’s tags to the instance (and later, to the security findings found for the instance) to help you identify the golden AMIs for each security finding. After you analyze security findings, you can patch your golden AMIs.

The first time the StartContinuousAssessment function runs, it creates:

An Amazon Inspector assessment template: The template contains a reference to the Amazon Inspector assessment target created in the preceding step and the following AWS managed rules packages to evaluate:

For subsequent assessments, the StartContinuousAssessment function reuses the target and the template created during the first run of StartContinuousAssessment function.

Note: Amazon Inspector can start an assessment only after it finds at least one running Amazon Inspector agent. To allow EC2 instances to boot and the Amazon inspector agent to start, the Lambda function waits four minutes. Because the assessment runs for approximately one hour and boot time for EC2 instances typically takes a few minutes, all Amazon Inspector agents start before the assessment ends.

The Lambda function then runs the assessment. The Amazon Inspector agents collect behavior and configuration data, and pass it to Amazon Inspector. Amazon Inspector analyzes the data and generates Amazon Inspector findings, which are possible security findings you may need to address.

After the Lambda function completes the assessment, Amazon Inspector publishes an assessment-completion notification message to an Amazon SNS topic called ContinuousAssessmentCompleteTopic. SNS uses topics, which are communication channels for sending messages and subscribing to notifications.

The notification message published to SNS triggers the AnalyzeInspectorFindings Lambda function, which performs the following actions:

Associates the tags of each EC2 instance with security findings found for that EC2 instance. This enables you to identify the security findings using the app-name tag you specified for your golden AMIs. You can use the information provided in the findings to patch your golden AMIs.

Terminates all instances associated with the continuous-assessment-instance=true tag.

Aggregates the number of findings found for each EC2 instance by severity and then publishes a consolidated result to an SNS topic called ContinuousAssessmentResultsTopic.

How to deploy the solution

To deploy this solution, you must set it up in the AWS Region where you build your golden AMIs. If that AWS Region does not support Amazon Inspector, at the end of your continuous integration pipeline, you can copy your AMIs to an AWS Region where Amazon Inspector assessments are supported. To learn more about continuous integration pipelines, see What is Continuous Integration?

Run the supplied AWS CloudFormation template and subscribe to an SNS topic to receive assessment results – Set up the infrastructure required to run vulnerability assessments and subscribe to an SNS topic to receive assessment results via email.

Test golden AMI vulnerability assessments – Ensure you have successfully set up the required resources to run vulnerability assessments.

Set up a CloudWatch Events rule for triggering continuous golden AMI vulnerability assessments – Schedule the execution of vulnerability assessments on a regular basis.

1. Tag your golden AMIs

You can search assessment findings based on golden AMI tags after Amazon Inspector completes an assessment.

Choose your AMI from the list, and then choose Actions > Add/Edit Tags.

Choose Create Tag. In the Key column, type app-name. In the Value column, type your application name. Following the same steps, create the app-version and app-environment tags. Choose Save.

Now that you have tagged your golden AMIs, you need to create golden AMI metadata, which will be read by the StartContinuousAssessment function to initiate vulnerability assessments. You will store the golden AMI metadata in the Systems Manager Parameter Store.

This solution reads golden AMI metadata from a parameter stored in the Systems Manager Parameter Store. The metadata must be in JSON format and must contain the following information for each golden AMI:

Ami-Id

InstanceType

UserData

Step A: Find the AMI ID of your golden AMI.

An AMI ID uniquely identifies an AMI in an AWS Region and is a required parameter for launching an EC2 instance from a golden AMI. To find the AMI ID of your golden AMI:

The user-data script automates the installation of software packages when an EC2 instance launches for the first time. In this step, you create an operating system specific, JSON-compatible user-data script that installs and starts the Amazon Inspector agent.

Based on Running Commands on Your Linux Instance at Launch, you make a Linux shell script user-data compatible by prefixing it with a #!/bin/bash. In this step, you add the #!/bin/bash prefix to the script from the preceding step. The following is the user-data compatible version of the script from the preceding step.

The user-data script provided in the JSON metadata must be JSON-compatible, which you will do next.

Make the user-data script JSON compatible

To make the user-data script JSON compatible, you must replace all new-line characters with a \r\n\r\n sequence. The following is the JSON-compatible user-data script that you specify for your Amazon Linux-based golden AMI in Step D.

Repeat Steps A, B, and C to find the Ami Id, InstanceType, and UserData for each of your golden AMIs. When you have this metadata, you can create the JSON document of metadata for all your golden AMIs. The StartContinuousAssessment Lambda function reads this JSON to start golden AMI vulnerability assessments.

Replace all placeholder values with values corresponding to your first golden AMI. If your golden AMI is Amazon Linux-based, you can specify the userData as the JSON-compatible-user-data-for-Amazon-Linux-AMI from Step C.5. Next, replace the placeholder values for your second golden AMI. You can add more entries to your JSON document, if you have more than two golden AMIs.

Note: The total number of characters in the JSON document must be fewer than or equal to 4,096 characters, and the number of golden AMIs must be fewer than 500. You must verify whether your account has permissions to run one on-demand EC2 instance for each of your golden AMIs. For information about how to verify service limits, see Amazon EC2 Service Limits.

Now that you have created the JSON document of your golden AMIs, you will store the JSON document in a Systems Manager parameter. The StartContinuousAssessment Lambda function will read the metadata from this parameter.

Choose Create subscription. In the Topic ARN field, paste the ARN of ContinuousAssessmentResultsTopic that you noted in the previous section.

In the Protocol drop-down, choose Email.

In the Endpoint box, type the email address where you will receive notifications.

Choose Create subscription.

Navigate to your email application and open the message from AWS Notifications. Click the link to confirm your subscription to the SNS topic.

4. Test golden AMI vulnerability assessments

Before you schedule vulnerability assessments, you should test the process by running the StartContinuousAssessment function. In this test, you trigger a security assessment and monitor it. You then receive an email after the assessment has completed, which shows that vulnerability assessments have been successfully set up.

On Dashboard under Recent Assessment Runs, you will see an entry with the status, Collecting Data. This status indicates that Amazon Inspector agents are collecting data from instances running your golden AMIs. The agents collect data for an hour and then Amazon Inspector analyzes the collected data.

After Amazon Inspector completes the assessment, the status in the console changes to Analysis complete. Amazon Inspector then publishes an SNS message that triggers the AnalyzeInspectionReports Lambda function. When AnalyzeInspectionReports publishes results, you will receive an email containing consolidated assessment results. You also will be able to see the findings.

In the navigation pane, choose Assessment Runs. In the table on the Amazon Inspector – Assessment Runs page, choose the findings of the latest assessment run.

Choose the settings () icon and choose the appropriate tags to see the details of findings, as shown in the following screenshot. The findings also contain information about how you can address each underlying vulnerability.

Having verified that you have successfully set up all components of golden AMI vulnerability assessments, you now will schedule the vulnerability assessments to run on a regular basis to give you continual insight into the health of instances created from your golden AMIs.

For Rule definition, type ContinuousGoldenAMIAssessmentTrigger for the name, and type as the description, This rule triggers the continuous golden AMI vulnerability assessment process.

Choose Create rule.

The vulnerability assessments are executed on the first occurrence of the schedule you chose while setting up the CloudWatch Events rule. After the vulnerability assessment is executed, you will receive an email to indicate that your continuous golden AMI vulnerability assessments are set up.

Summary

To get visibility into the security of your EC2 instances created from your golden AMIs, it is important that you perform security assessments of your golden AMIs on a regular basis. In this blog post, I have demonstrated how to set up vulnerability assessments, and the results of these continuous golden AMI vulnerability assessments can help you keep your environment up to date with security patches. To learn how to patch your golden AMIs, see Streamline AMI Maintenance and Patching Using Amazon EC2 Systems Manager.

If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread on the Amazon Inspector forum or contact AWS Support.

The Paris Region will benefit from three AWS Direct Connect locations. Telehouse Voltaire is available today. AWS Direct Connect will also become available at Equinix Paris in early 2018, followed by Interxion Paris.

All AWS infrastructure regions around the world are designed, built, and regularly audited to meet the most rigorous compliance standards and to provide high levels of security for all AWS customers. These include ISO 27001, ISO 27017, ISO 27018, SOC 1 (Formerly SAS 70), SOC 2 and SOC 3 Security & Availability, PCI DSS Level 1, and many more. This means customers benefit from all the best practices of AWS policies, architecture, and operational processes built to satisfy the needs of even the most security sensitive customers.

AWS is certified under the EU-US Privacy Shield, and the AWS Data Processing Addendum (DPA) is GDPR-ready and available now to all AWS customers to help them prepare for May 25, 2018 when the GDPR becomes enforceable. The current AWS DPA, as well as the AWS GDPR DPA, allows customers to transfer personal data to countries outside the European Economic Area (EEA) in compliance with European Union (EU) data protection laws. AWS also adheres to the Cloud Infrastructure Service Providers in Europe (CISPE) Code of Conduct. The CISPE Code of Conduct helps customers ensure that AWS is using appropriate data protection standards to protect their data, consistent with the GDPR. In addition, AWS offers a wide range of services and features to help customers meet the requirements of the GDPR, including services for access controls, monitoring, logging, and encryption.

From Our Customers Many AWS customers are preparing to use this new Region. Here’s a small sample:

Societe Generale, one of the largest banks in France and the world, has accelerated their digital transformation while working with AWS. They developed SG Research, an application that makes reports from Societe Generale’s analysts available to corporate customers in order to improve the decision-making process for investments. The new AWS Region will reduce latency between applications running in the cloud and in their French data centers.

SNCF is the national railway company of France. Their mobile app, powered by AWS, delivers real-time traffic information to 14 million riders. Extreme weather, traffic events, holidays, and engineering works can cause usage to peak at hundreds of thousands of users per second. They are planning to use machine learning and big data to add predictive features to the app.

Radio France, the French public radio broadcaster, offers seven national networks, and uses AWS to accelerate its innovation and stay competitive.

Les Restos du Coeur, a French charity that provides assistance to the needy, delivering food packages and participating in their social and economic integration back into French society. Les Restos du Coeur is using AWS for its CRM system to track the assistance given to each of their beneficiaries and the impact this is having on their lives.

AlloResto by JustEat (a leader in the French FoodTech industry), is using AWS to to scale during traffic peaks and to accelerate their innovation process.

AWS Consulting and Technology Partners We are already working with a wide variety of consulting, technology, managed service, and Direct Connect partners in France. Here’s a partial list:

AWS in France We have been investing in Europe, with a focus on France, for the last 11 years. We have also been developing documentation and training programs to help our customers to improve their skills and to accelerate their journey to the AWS Cloud.

As part of our commitment to AWS customers in France, we plan to train more than 25,000 people in the coming years, helping them develop highly sought after cloud skills. They will have access to AWS training resources in France via AWS Academy, AWSome days, AWS Educate, and webinars, all delivered in French by AWS Technical Trainers and AWS Certified Trainers.

Use it Today The EU (Paris) Region is open for business now and you can start using it today!

Tags

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.