She went on to say how much has changed in the last decade. We talked about how the database administrator, as a career role, wasn’t really being hired for much these days. Things had changed. Evolved a lot.

How do you keep up with all the new technology, she asked.

I went on to talk about Amazon RDS, EC2, Lambda & serverless as really exciting stuff. And let’s not forget Terraform (I wrote a howto on Terraform), Ansible, Jenkins and all the other deployment automation technologies.


We talked about Redshift too. It seems to be everywhere these days, and it’s starting to supplant Hadoop as the warehouse of choice for analytics.

It was a great conversation, and afterward I decided to summarize my thoughts. Here’s how I think automation and the cloud are impacting the DBA role.

My career pivots

Over the years I’ve poured all those computer science algorithms, coding & hardware skills into a lot of areas. Tools & popular languages change. Frameworks change. But solid deductive reasoning remains priceless.

o C++ Developer

Fresh out of college I was doing Object Oriented Programming on the Macintosh with CodeWarrior & PowerPlant. C++ development is no joke, and daily coding builds strength in a lot of areas. It turns out the application was a database application, so I was already getting my feet wet with databases.

o Jack of all trades developer & Unix admin

One type of job role that I highly recommend early on is as a generalist. At a small startup with less than ten employees, you become the primary technology solutions architect, so you get your hands dirty with any project that comes along. I was able to land one of these roles. I got to work on Windows one day, Mac programming another, & Unix administration & Oracle yet another day.

o Oracle DBA

The third pivot was to work primarily on Oracle. I attended Oracle conferences & my peers were Oracle admins. Interestingly, many of the Oracle “experts” came from more of a business background, not computer science. So to have a more technical foundation really made you stand out.

For the startups I worked with, I was a performance guru and scalability expert. Managers may know they have Oracle in the mix, but ultimately the end goal is to speed up the website & make the business run. The technical nuts & bolts of the Oracle DBA role were almost incidental.

o MySQL & Postgres

As Linux matured, so did a lot of other open source projects. In particular, the two big open source databases, MySQL & Postgres, became viable.

Suddenly startups were willing to put their businesses on these technologies. They could avoid huge fees in Oracle licenses. Still there were not a lot of career database experts around, so this proved a good niche to focus on.

o RDS & Redshift on Amazon Cloud

Fast forward a few more years and it’s my fifth career pivot. Amazon Web Services bursts on the scene. Every startup is deploying their applications in the cloud. And they’re using Amazon RDS, their managed database service, to do it. That meant the traditional DBA role was less crucial. Sure, the business still needed data expertise, but usually not as a dedicated role.

Time to shift gears and pour all of that Linux & server building experience into cloud deployments & migrating to the cloud.

o Devops, data, scalability & performance

Now of course the big sysadmin-type role is usually called an SRE or Devops role, SRE being site reliability engineer. New name, but many of the same responsibilities.

Now, though, infrastructure as code is front & center. Tools like CloudFormation & Terraform, plus Ansible, Chef & Jenkins, are all quite mature and being used everywhere.

Check out your infrastructure code from git, and run terraform apply. And minutes later you have rebuilt your entire stack from bare metal to fully functioning & autoscaling application. Cool!

Cons of automation & databases

However, these days cloud, automation & microservices have brought a lot of madness too! Don’t believe me? Check out this piece on microservice madness.

With microservices you have more databases across the enterprise, on more platforms. How do you restore them all at the same time? How do you do point-in-time recovery? What if your managed service goes down?

Migration scripts have become popular for making DDL changes in the database. Going forward (adding columns or tables) is great. But should we be letting our deployment automation roll *BACK* DDL changes? Remember, that deletes data, right? 🙂
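To make that concrete, here’s a minimal sketch of an Alembic-style migration (the table & column names are hypothetical). The upgrade path is harmless; letting automation run the downgrade path throws away every value in that column.

```python
# Hypothetical Alembic migration. Rolling forward adds a nullable column;
# rolling back drops it, along with all the data it held.
from alembic import op
import sqlalchemy as sa


def upgrade():
    # Safe: adding a nullable column is non-destructive.
    op.add_column("users", sa.Column("last_login", sa.DateTime(), nullable=True))


def downgrade():
    # Not safe: dropping the column silently deletes every value in it.
    op.drop_column("users", "last_login")
```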

What about database drop & rebuild? Or throwing databases in a docker container? No bueno. But we’re seeing this more and more, and new performance problems are cropping up because of it.

What about when your database upgrades automatically? Remember, when you use a managed service, it is built for 1000 users, not one. So if your use case is different you may struggle.

In my experience upgrading RDS was a nightmare. Database-as-a-service upgrades lack visibility. You don’t have OS or SSH access, so you can’t keep track of things. You simply wait.

No longer do we have “zero downtime”. With Amazon RDS you have guaranteed-downtime upgrades. No seriously.

As the field of databases fragments, we are wearing many more hats. If you like this challenge & enjoy being a generalist, you may feel at home here. But it is a long way from the one-platform, one-skill-set career path.

Also, fragmented db platforms mean more complex recovery. I can’t stress this enough. It would be practically impossible to restore all microservices, all their underlying databases & all systems to one single point in time, if you needed to.

Hi, I’m currently an IT professional and I’m training for AWS Solutions Architect – Associate exam. My question is how to gain some valuable hands-on experience without quitting my well-paying consulting gig I currently have which is not cloud based. I was thinking, perhaps I could do some cloud work part time after I get certified.

I work in the public sector and the IT contract prohibits the agency from engaging any cloud solutions until the current contract expires in 2019. But I can’t just sit there without using these new skills – I’ll lose them. And if I jump ship I’ll lose $$$ because I don’t have the cloud experience.

Hi George,

Here’s what I’d suggest:

1. Set up your AWS account

From there, be sure to enable two-factor authentication. Then stop using your root account! Create a new IAM user with command line & API permissions, and use that to authenticate. You’ll be using the awscli python package.
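If you’d rather script that than click through the console, here’s a rough boto3 sketch. The user name and the (deliberately broad) policy ARN are just example values.

```python
# Rough sketch: create an IAM user for CLI/API access so you can stop using root.
# Requires boto3; the user name & managed policy below are example values only.
import boto3

iam = boto3.client("iam")

iam.create_user(UserName="deploy-admin")

# Attach a managed policy. Scope this down for real work.
iam.attach_user_policy(
    UserName="deploy-admin",
    PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
)

# Generate the access key pair the awscli will authenticate with.
key = iam.create_access_key(UserName="deploy-admin")["AccessKey"]
print(key["AccessKeyId"], key["SecretAccessKey"])
```

Drop those credentials into aws configure and you’re off the root account for good.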

2. Automatic deployments

Get your code into a git repository. GitHub works, or you can use Amazon’s own CodeCommit, which is a drop-in replacement for GitHub and works fine too.

Next, set up CodeDeploy so that you can deploy that application to your EC2 instance with one command.

But you’re not done yet. Now automate the spinup of the EC2 instance itself with Ansible. If you’re comfortable with shell scripts, or other operational tools, the learning curve should be pretty easy for you.

3. Version your infrastructure

Amazon provides CloudFormation as its foundational templating system. You can use JSON or YAML. Basically you can describe every object in your account, from IAM users to VPCs, from RDS instances to EC2 & Lambda code, & on & on, all inside of a template file.
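Just to give a flavor, here’s a tiny hedged example: a one-resource template (an S3 bucket, all names placeholders) defined inline and launched with boto3.

```python
# Minimal illustration: describe one resource in a CloudFormation template,
# then create the stack with boto3. Stack & bucket names are placeholders.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AppBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "my-example-app-bucket"},
        }
    },
}

cfn = boto3.client("cloudformation")
cfn.create_stack(StackName="example-stack", TemplateBody=json.dumps(template))
```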

Terraform is a sort of cloud-agnostic version of the same thing. It’s also more feature-rich & has a huge following. All reasons to consider it.

Once you’ve got all your objects in templates, you can check these files into your git or CodeCommit repository. Then updating infrastructure is like updating any other piece of code. Now you’re self-documenting, and you can roll forward & backward if you make a mistake!

4. Bonus: database skills

o Learn RDS – MySQL, Postgres, Aurora, Oracle, SQL Server, etc.

For a bonus page on your resume, dig into Amazon Relational Database Service, or RDS. The platform supports various databases, so try out the ones you know already first. You’ll find that there are a few surprises. I wrote Is upgrading RDS like a sh*t storm that will not end? after a very frustrating weekend upgrading a customer’s production RDS instance. 🙂
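If you want to practice from code rather than the console, a boto3 sketch like this stands up a small MySQL instance. The identifier, instance class, storage & credentials are example values only, and the instance bills until you delete it.

```python
# Sketch: launch a small practice MySQL instance on RDS with boto3.
# Identifier, class, storage & credentials are example values only.
import boto3

rds = boto3.client("rds")
rds.create_db_instance(
    DBInstanceIdentifier="practice-mysql",
    DBInstanceClass="db.t3.micro",
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me-please",
    AllocatedStorage=20,
)
# Clean up afterwards with delete_db_instance() so it doesn't keep billing.
```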

5. Bonus: Data warehousing

o Redshift, Spectrum, Glue, QuickSight, etc.

If you’re interested in the data side of the house, there is a *LOT* happening at AWS. From their Spectrum technology, which allows you to keep most of your data in S3 and still query it, to Glue, which provides ETL as a service.

You can also use a world-class columnar storage database called Redshift. This is purpose built for reporting & batch jobs. It’s not going to meet your transactional web-backend needs, but it will bring up those Tableau reports blazingly fast!

6. Now go find that cloud deployment job!

With the above under your belt there’s plenty of work for you. There is tons of demand right now for this stuff.

Did you learn all that? You’ve now got very, very in-demand skills. The recruiters will be chomping at the bit. Update those buzzwords (I mean keywords). This will help match you with folks looking for someone just like you!

Let’s say you’re hiring a devops engineer & you want to suss out their database knowledge. Or you’re hiring a professional services firm or freelance consultant. Whatever the case, you’ll need to sift through for the best people. Here’s how.

What database does Amazon support for caching?

Caching is a popular way to speed up access to your backend database. Put Amazon’s ElastiCache behind your webserver, and you can reduce load on your database by 90%. Nice!

The two engines that Amazon supports are Memcached & Redis. Memcached is historically more popular, but these days Redis seems a clear winner. It’s faster, and can maintain your cached data between restarts. That will save you, I promise!
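To show where that load reduction comes from, here’s a generic cache-aside sketch with the redis client, the same pattern you’d point at an ElastiCache endpoint. The endpoint and the fetch_user helper are placeholders.

```python
# Cache-aside pattern: check Redis first, fall back to the database on a miss,
# then cache the result with a TTL. Endpoint & db.fetch_user() are placeholders.
import json
import redis

cache = redis.Redis(host="my-cluster.abc123.use1.cache.amazonaws.com", port=6379)


def get_user(user_id, db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: the database never sees it
    row = db.fetch_user(user_id)            # cache miss: one real query
    cache.setex(key, 300, json.dumps(row))  # keep it warm for five minutes
    return row
```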

How can I store big data in AWS?

When you’re doing large reports for your business intelligence team, you don’t want to bog down your backend relational database. Redshift is purpose built for this use case.

I’ve seen a report that took over 8 hours in MySQL return in under 60 seconds in Redshift!

A new offering is Redshift Spectrum. This tech is super cool. Load up all your data into S3, in standard CSV format. Then without even loading it into Redshift, you can query the S3 data directly. This is super useful, firstly because S3 is 1/10th the price, but also because it allows you to stage your data before loading it into Redshift itself. Goodbye Google BigQuery! I talked about Spectrum here.

What relational database options are there on Amazon?

Amazon supports a number of options through its Relational Database Service, or RDS. These are managed databases, which means less work on your DBA’s shoulders. It may also make upgrades slower and harder, with more downtime, but you get what you pay for.

There are a lot of platforms available. As you might guess, MySQL & Postgres are there. Great! Even better, you can use MariaDB if that’s your favorite. You can also go with Aurora, which is Amazon’s own home-brew drop-in replacement for MySQL that promises greater durability and some speedups.

If you’re a glutton for punishment, you can even get Oracle & SQL Server working on RDS. Very nice!

How do I do ETL & migrate data to AWS?

Let’s be honest, Amazon wants to make this really easy. The quicker & simpler it is to get your data there, the more you’ll buy!

Amazon’s Database Migration Service, or DMS, allows you to configure your old database as a data source, then choose an Amazon db solution as the destination, then just turn on the spigot and pump your data in!
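Roughly, the boto3 wiring looks like this, assuming you’ve already created the source & target endpoints and a replication instance; every ARN and identifier below is a placeholder.

```python
# Rough sketch of the DMS plumbing with boto3: point a replication task at a
# source & target endpoint you've already defined. All ARNs are placeholders.
import json
import boto3

dms = boto3.client("dms")
dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",   # copy everything, then stream changes
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-everything",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```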

ETL is extract, transform and load: data warehouse terminology for slicing and dicing data before you load it into your warehouse. Many of today’s warehouses are being built with the data lake model, because databases like Redshift have gotten so damn fast. That model means you stage all your source data as-is in your warehouse, then build views & summary tables as needed to speed up queries & reports. Even better, you might look at a tool like xplenty.

It may be you have 10-zillion rows of source data and don’t want or need to get all of that into Redshift and keep it there. But it would be nice to have access to it when you do.

Enter EXTERNAL tables, aka Spectrum. Now you can keep all your raw data in S3, an in-place operational datastore of data before it’s been reworked and transformed. Use SQL to access it right where it sits.

Get all the advantages of lifecycle management in S3, and don’t pay all the Redshift costs for data you don’t need all the time. Cool!

Let’s see how it works.

What is an EXTERNAL table?

Spectrum is Amazon’s rebranding of an old database technology called EXTERNAL TABLES. Back in the 90’s Oracle pioneered this work, allowing you to essentially map a CSV file that sits outside the database proper. This means you can query all that juicy data sitting in flat files. Cool!

Athena allows you to query this stuff as a service, native to AWS. Spectrum allows you to create those external tables inside of Redshift.
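Here’s roughly what that looks like from a client connected to Redshift, shown via psycopg2; the cluster endpoint, IAM role, bucket and table definition are all placeholders.

```python
# Sketch: define an external schema & table over CSV files in S3 (Spectrum),
# then query it like any other table. Endpoint, role & names are placeholders.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=change-me")
conn.autocommit = True   # external-table DDL can't run inside a transaction
cur = conn.cursor()

cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_raw
    FROM DATA CATALOG DATABASE 'rawdata'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;
""")

cur.execute("""
    CREATE EXTERNAL TABLE spectrum_raw.clicks (
        user_id    BIGINT,
        url        VARCHAR(2048),
        clicked_at TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://my-example-bucket/clicks/';
""")

cur.execute("SELECT count(*) FROM spectrum_raw.clicks;")
print(cur.fetchone()[0])
```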

A reader pointed out that some No-SQL databases do support joins. Huh? My face contorts in total confusion. How? Why?

For years SQL was misunderstood & unloved

Relational databases have been around for decades. 43 years according to the StackExchange article. That’s a lot of years. I’ve spent a few years as a DBA, aka database administrator. The role can be distilled down to a herder of sorts: keep all the data bits in the right boxes with the right labels. A digital librarian who makes sure the books don’t get lost.

Of course patrons don’t always put books back where they should, and strict rules get put in place to avoid losing that one volume of Shakespeare in miles of shelves.

In the fast moving world of web + mobile, product is king, and agile rules the day. And anything that can make us more agile also wins. SQL, much maligned & misunderstood, was not one of those things.

NoSQL burst on the scene with much fanfare

With all that pressure, it was no wonder engineers thought there must be a better way. Then along comes the NoSQL database. I mean, just the name speaks volumes about the design goal.

We’ll sacrifice anything, just please don’t make me write SQL!

The promises…

1. Never have to deal with pesky SQL that we don’t understand!
2. Interact with the database like any other data structure in our code!
3. Be schemaless! Crotchety Database Administrators be damned!
4. Be distributed. Be everywhere consistent! Be indistinguishable from magic!
5. Always be fast.

In that rush into the abyss, we lost track of durability. And down the rabbithole we went!

Relational databases tried to be key-value

Then I started hearing about crazy things, like MySQL providing a memcached plugin, so you could use it, lightning fast, as a key-value store. You could sidestep that pesky SQL engine, and get right down to the bare metal. But why? Memcache & Redis were already doing that, & purpose built. Why indeed?

Meanwhile Hadoop is losing ground. BigQuery & Redshift both speak SQL

But then something funny started to happen. It seemed there was a backlash against MongoDB. A lot of customers were losing data. (Yep, that’s what durability means, guys…) And the hype started reversing. Even the mighty Hadoop has been losing popularity of late. How long does it take to write an EMR job versus a SQL query? Let’s be honest.

Engineering truth versus fashion

43 years is a lot of years. And when we drop all the fashion trends in tech, and the new database du jour, what do we find?

There is room for NoSQL databases. Yep. They do certain things, and solve certain types of problems, well. But they’re not general workhorses, nor can they slice and dice your data however you like. And when you get to that point in your project, you’re going to want to ask interesting questions of your data.

Answer the questions you haven’t thought of

No-SQL databases are great if you know how you want to access the data. Users come from the users table, and that’s that!

But what if later on you want to ask questions like: which users watched this video, which users are active, which users spent $100 in January? These questions may not be answerable, because NoSQL can’t join those other tables.

Relational databases shine when you need to aggregate your data, reorganize it, or ask unanticipated questions. And aren’t those most of the interesting questions?

Rugged, Proven & Open

SQL, having been around so long, is a fairly open standard. Sure there are extensions of it, but most of the basic stuff is there in all the products. That means you learn it once, and can interact with databases across the spectrum. That’s a win for everybody.

Business users can write it

Another underappreciated feature, though, is that basic queries are easy to write. They don’t require complex syntax like a hadoop job, or your favorite imperative programming language. The queries are readable, almost English-like sentences.

Given all that, it seems SQL is likely to be around for a long time to come!

Everyone is hot under the collar again. So-called serverless or no-ops services are popping up everywhere, allowing you to deploy “just code” into the cloud. Not only won’t you have to log in to a server, you won’t even have to know they’re there.

As cloud events occur, such as a file upload or a hit on an HTTP endpoint, your code runs. Behind the scenes, through the magic of containers & autoscaling, Amazon & others are able to provision in milliseconds.

Pretty cool. Yes, even as it outsources the operations role to invisible teams behind Amazon Lambda, Google Cloud Functions or Webtask, it’s also making companies more agile, and allowing startup innovation to happen even faster.

Believe it or not I’m a fan too.

That said, I thought it would be fun to poke a hole in the bubble, and throw some criticisms at the technology. I mean, going serverless today is still bleeding edge, and not everyone is cut out to be a pioneer!

With that, here are 30 questions to throw at the serverless fanboys (and ladies!)…

1. Security

o Are you comfortable removing the barrier around your database?
o With more services, there is more surface area. How do you prevent malicious code?
o How do you know your vendor is doing security right?
o How transparent is your vendor about vulnerabilities?

2. Testing

o How do you do integration testing with multiple vendor service components?
o How do you test your API Gateway configurations?
o Is there a way to version control changes to API Gateway configs?
o Can Terraform or CloudFormation help with this?
o How do you do load testing with a third party db backend?
o Are your QA tests hitting the prod backend db?
o Can you easily create & destroy test dbs?

3. Management

o How do you do zero downtime deployments with Lambda?
o Is there a way to deploy functions in groups, all at once?
o How do you manage vendor lock-in, at the monitoring & tools level but also at the code & services level?
o How do you mitigate your vendor’s maintenance? Downtime? Upgrades?
o How do you plan for a move to an alternate vendor? Database import & export may not be ideal, plus code & infrastructure would need to be duplicated.
o How do you manage a third party service for authentication? What are the pros & cons there?
o What are the pros & cons of using a service-based backend database?
o How do you manage redundancy of code when every client needs to talk to backend db?

4. Monitoring & debugging

o How do you build a third-party monitoring tool? Where are the APIs?
o When you’re down, is it your app or a system-wide problem?
o Where is the New Relic for Lambda?
o How do you degrade gracefully when using multiple vendors?
o How do you monitor execution duration so your function doesn’t fail unexpectedly?
o How do you monitor your account wide limits so dev deploy doesn’t take down production?

5. Performance

o How do you handle startup latency?
o How do you optimize code for mobile?
o Does battery life preclude a large codebase on client?
o How do you do caching on server when each invocation resets everything?
o How do you do database connection pooling?

2. Query S3 with Athena

Chances are if you’re using AWS for anything, you’ve got data in S3. And wouldn’t it be nice to pick that apart and dig through it, where it sits?

Oracle had a feature called “external tables” and MySQL had something similar. Now Amazon is offering that natively within its own cloud universe. Thanks to some tricky lambda code, now you can do that. Don’t worry about how they did it, because it’s been packaged into a nice easy service for your use!
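A hedged example of what that looks like with boto3 (the database, table and result bucket are placeholders; the data itself never leaves S3):

```python
# Sketch: run SQL directly over files in S3 with Athena via boto3.
# Database, table & result bucket are placeholders.
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT status, count(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "rawdata"},
    ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
)
print(resp["QueryExecutionId"])   # poll get_query_execution() until it finishes
```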

4. Expanded RDS

If you hadn’t heard, there is now MariaDB support. And with it, there’s a migration path from MySQL to MariaDB as well.

Using MariaDB may bring you performance advantages & improvements. But RDS may mitigate this by productizing & standardizing things.

You can also now move encrypted snapshots across regions. In my view this isn’t really a new feature, but rather a fix for something that was broken before. The previous limitation was really more a symptom of their global network of data centers than any built feature per se.

Like every other startup, you’ve built everything on AWS or are moving there quickly.

So it makes sense & is an easy win to build your analytics & bigdata engine on AWS too. Enter Redshift to the rescue.

Unlike the old days of Oracle, where you had career DBAs to handle your data & hold the keys to the data kingdom, these days nobody is managing the data. What makes matters worse is that the documentation is weak, and it hasn’t been out long enough to warrant good books on the topic.

But there’s hope!

Here are five great tools to help you in your day-to-day management of Redshift

1. Slow queries

Just like MySQL, Postgres & Oracle, your database performance & responsiveness lives & dies by the SQL that you write. And just like all those databases, you want to look at the slowest queries, to know where to tune. Spend your time on the worst offenders first.
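For instance, Redshift keeps recent query history in the stl_query system table; a rough starting point looks like this (connection details are placeholders).

```python
# Rough sketch: pull the ten slowest queries from the last week out of stl_query.
# Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=change-me")
cur = conn.cursor()
cur.execute("""
    SELECT query,
           datediff(seconds, starttime, endtime) AS duration_s,
           trim(querytxt) AS sql_text
    FROM stl_query
    WHERE starttime > dateadd(day, -7, getdate())
    ORDER BY duration_s DESC
    LIMIT 10;
""")
for query_id, duration_s, sql_text in cur.fetchall():
    print(query_id, duration_s, sql_text[:80])
```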

2. Fragmented Tables

You add data, you delete data. And just like all the other relational databases we know & love, this process leaves gaps. New data is still added at the high water mark, and full table scans still read those empty blocks.

The solution is the vacuum command, but which tables should we run it on?
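One hedged way to answer that is the svv_table_info view, which reports how unsorted each table has become (connection details are placeholders).

```python
# Sketch: list tables with a high percentage of unsorted rows -- the best
# vacuum candidates. Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=change-me")
cur = conn.cursor()
cur.execute("""
    SELECT "schema", "table", unsorted, tbl_rows
    FROM svv_table_info
    WHERE unsorted > 20           -- more than 20% of rows unsorted
    ORDER BY unsorted DESC;
""")
for schema, table, unsorted_pct, rows in cur.fetchall():
    print(f"VACUUM {schema}.{table};   -- {unsorted_pct}% unsorted, {rows} rows")
```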

3. Analyze Schema Compression

Getting your schema compression right on your data in Redshift turns out to be essential to performance too. Each datatype has a recommended compression type that works best for it. You can let Redshift decide when it first loads data, but it doesn’t always guess correctly.

This tool will help you analyze your data, and set compression types properly.
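Under the hood this leans on the ANALYZE COMPRESSION command; a bare-bones version of the idea looks roughly like this (the table name and connection details are placeholders, and the command samples the table, so it can take a while on big tables).

```python
# Sketch: ask Redshift which column encodings it would recommend for a table.
# Table name & connection details are placeholders; this locks the table briefly.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=change-me")
conn.autocommit = True   # ANALYZE COMPRESSION can't run inside a transaction
cur = conn.cursor()
cur.execute("ANALYZE COMPRESSION public.clicks;")
for table, column, encoding, est_reduction_pct in cur.fetchall():
    print(f"{table}.{column}: {encoding} (~{est_reduction_pct}% smaller)")
```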

4. Audit Users

Once you have your Redshift cluster up & running, you’ll set up users for applications to connect as. Each will have different privileges for objects & schemas in your database. You’re auditing those, right? 🙂

This script will output who can read & write what. Useful to ensure your application isn’t overly loose with permissions. Shoot for least privileges where possible.
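A rough version of such a script, built on the has_table_privilege() function (connection details are placeholders; add more privileges as needed):

```python
# Sketch: report which users can read or write each user table.
# Connection details are placeholders.
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                        "port=5439 dbname=analytics user=admin password=change-me")
cur = conn.cursor()
cur.execute("""
    SELECT u.usename,
           t.schemaname || '.' || t.tablename AS tbl,
           has_table_privilege(u.usename, t.schemaname || '.' || t.tablename, 'select'),
           has_table_privilege(u.usename, t.schemaname || '.' || t.tablename, 'insert')
    FROM pg_user u
    CROSS JOIN pg_tables t
    WHERE t.schemaname NOT IN ('pg_catalog', 'information_schema');
""")
for user, tbl, can_read, can_write in cur.fetchall():
    print(user, tbl, "read" if can_read else "-", "write" if can_write else "-")
```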

5. Lambda Data Loader

Want to automatically load data into Redshift? Download this handy Lambda-based tool to respond to S3 events. Whenever a new file appears, its data will get loaded into Redshift. Complete with metadata configuration control in DynamoDB.

Everybody is hot under the collar these days over Redshift. I heard one customer say a query that took 10 hours before now finishes in under a minute. Without modification. When businesses see a 600-times speedup, that can change the way they do business.

What’s more Redshift is easy to deploy. No complicated licenses like the Oracle days. No hardware, just create your cluster & go.

So you’ve made the decision, and you have data in your transactional database, MySQL RDS or Postgres. Now what?

Here are some systems that will help you synchronize data on the regular. And keep it in sync. Most of these are near real-time, so you can expect reports to be looking at the data your business created today.

1. RJ Metrics Pipeline

One of the simplest options is RJ Metrics Pipeline. Set up a trial account, configure your Redshift credentials in the warehouse section (port, user, password, endpoint) and save. Then configure your data source. For MySQL, specify hostname, user, password & port. You get the option to go through an ssh tunnel for security. That’s good. You’ll also be given the grant code to create a user in MySQL for RJM.

RJM uses a primary or unique key to figure out which rows have changed. Well, that’s not completely true; that only applies if you’re using incremental refresh. If you’re using complete refresh, then it just selects all the data & replaces it each time.

The user interface is a bit clunky. You have to go in and CONFIGURE EACH TABLE you want to replicate. There’s no REPLICATE-ALL option. This is a pain. If you have 500 tables, it might take hours to configure them all.

Also since RJM isn’t CDC (change data capture) based, it won’t be as close to real-time as some of the other options.

2. xplenty

xplenty is really a lot more than just a sync tool. It’s a full-featured ETL system. Want to avoid writing tons of python jobs to convert datatypes, transform 0 to paid & 1 to free, things like that? Well, xplenty is made to allow building ETL systems without code.

It’s a bit complex to set up at first, but very full featured. It is the DIY developer’s or DBA’s tool of the bunch. If you need hardcore functionality, xplenty seems to have it.

3. Alooma

Alooma might possibly be the most interesting of the bunch.

After a few stumbles during the setup process, we managed to get this up and running smoothly. Again as with xplenty & Fivetran, it uses CDC to grab changes from the MySQL binlogs. That means you get near realtime.

Although it’s a bit more complex to set up than Fivetran, it gives you a lot more. There’s excellent visibility around data errors, which you *will* have. Knowing where they happen means your data team can be very proactive. This is great for the business.

What’s more, there is a python-based Code Engine which allows you to write bits of code that transform data in the pipeline. That’s huge! Want to do some simple ETL? This is a way to do that. You can also send notifications, or requeue events. All this means you get a state-of-the-art pipeline, with good configurability & logging.

4. Fivetran

Fivetran is super point-n-click. It is CDC based like Flydata & Alooma, so you’re gonna get near realtime sync with low overhead. It monitors your binlogs for changed data, and ships it to Redshift. No mess.

The dashboard is simple, the setup is trivial, and it just seems to work. Least pain, best bang.

This will do a *full* reload each night, so if your db is too big for that, it might need modifications. Also if you’re using MySQL as source db you’ll need to change code. One thing I found in there was Perl & Sed commands to transform your source schema CREATE & ALTER statements into Redshift compatible ones. That in itself is worth a look.

Lambda to the rescue

The awslabs github team has put together a lambda-based Redshift loader. This might be just what you need. Remember though that you’ll need to deliver your source data in CSV files to S3 on the regular. So you’ll need some method to dump it. But if you have that half of the equation, this is ideal.
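Not the awslabs loader itself, but the core idea is roughly this: an S3-triggered Lambda handler that fires a Redshift COPY for each new file. The cluster endpoint, credentials, table and IAM role below are all placeholders.

```python
# Sketch of the idea (not the awslabs tool itself): a Lambda handler, triggered
# by S3 object-created events, that COPYs each new CSV file into Redshift.
# Cluster endpoint, credentials, table & IAM role are placeholders.
import psycopg2


def handler(event, context):
    conn = psycopg2.connect("host=my-cluster.example.redshift.amazonaws.com "
                            "port=5439 dbname=analytics user=loader password=change-me")
    conn.autocommit = True
    cur = conn.cursor()
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        cur.execute(f"""
            COPY staging.events
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            CSV;
        """)
    conn.close()
```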

Database Migration Service or DMS

This appears to have supported Redshift early on, but does not appear to do so now. I’ve gotten conflicting reports, so I should dig a bit more. Anybody want to comment on this one?