Category Archives: Technology

Years ago, before great services like Dropbox were available, I purchased a NAS device for my home network. If you aren’t familiar, the idea is essentially to hook up a storage device that’s available while you’re at home, so that

All your home computers can access any files or data that you’ve stored there, over a fast local network.

All of your data and files are backed up redundantly, to multiple disks. These devices mirror (copy and synchronize) files across two or more hard disk drives. They’re designed so that if one disk goes bad, all of your data is still OK, and you can replace the bad drive while the device is running and fully functional.

You get a lot of headroom in terms of storage capacity. Many of these devices allow you to store terabytes (a shiz ton) of data.

I set up my NAS with two 500 GB drives, and proceeded to copy over my 20-year-old CD collection (in MP3 format), all of my photos, and a lot of other important data. I have multiple computers, on various operating systems, in various rooms at my house, and they could all access my music and photos over both wireless and wired connections. It worked pretty well for a while.

Then one day, one of my drives went bad. I got on Amazon, searched for one of the drives on the very specific compatibility list the vendor published, found, purchased and installed it. It worked. I hot-swapped the drive for the bad one and everything continued to work great.

Being a generally paranoid person, and having experienced pains in this arena before, I also wanted all of my stuff backed-up somewhere outside of my house. So I shopped around, and decided to use the Memopal service. There were a number of nascent services coming on to the market at the time, and theirs seemed the most reputable and solid. I set things up so that the Memopal software was synchronizing files from my NAS to their cloud, set the account to auto-renew yearly, and for the most part left it alone. There were a few times when I checked on it, or when I had to fuss with their software to get it to work again, but otherwise it seemed to work pretty well.

Years went by, and in the last three years I got more and more busy with the start-up I co-founded, BigDoor. My personal email queue was the last thing on the list to maintain, unfortunately.

A couple of weeks back, a very unlikely event occurred: both of the 500 GB NAS disks had problems. The manufacturer of the device, Infrant, was purchased by Netgear in 2007. I’ve been working with their support on this issue, and have high confidence that I’ll be able to recover my data after spending more money. Their support has been pretty good so far, and I’ll report back here if it goes south.

On the flip side, when I contacted Memopal support, I learned that even though they’d taken my money last year and this year via auto-renew, since I didn’t enter the licence code into their software last year, all of my data was deleted and can’t be recovered. This approach is new to me; I’m used to the standard “if you pay for something, we’ll at the very least not destroy what you’ve paid for” vs. “to prevent the irreparable deletion of all your data, it’s not enough just to pay us, you have to put the code we emailed you into our software”. Below is the support thread, in case you’re as incredulous as I continue to be.

With product decisions like this, it’s no wonder they’re getting their asses kicked by Dropbox and other new services. I’m curious to hear if you can think of a good reason why an online data back-up service would collect payment, but then delete your data (without reasonable warning), because you didn’t enter a licence code.

2/1/2013 Update : fairness in reporting; below is how their customer service responded. Amazing.

A friend was recently asking about our backend database systems. Our systems are able to successfully handle high-volume transactional traffic through our API coming from various customers, having vastly different spiking patterns, including traffic from a site that’s in the top-100 list for highest traffic on the net. Don’t get me wrong; I don’t want to sound overly impressed with what our little team has been able to accomplish, we’re not perfect by any means and we’re not talking about Google or Facebook traffic levels. But serving requests to over one million unique users in an hour, and doing 50K database queries per second isn’t trivial, either.

I responded to my friend along the following lines:

If you’re going with an RDBMS, MySQL is the right, best choice in my opinion. It’s worked very well for us over the years.

The obvious : you’ll be able to do roughly shard-count times more reads and writes than you’d otherwise be able to do with a monolithic, non-sharded backend.

Alternatively, with a single-primary read-write or write-only node, and multi-secondary read-only nodes you could scale reads to some degree.

But be prepared to manage the complexities that come along with eventual read-consistency, including replication-lag instrumentation and discovery, beyond any user notifications around data not being up-to-date (if needed).
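To make the sharding point above concrete, here is a minimal sketch of key-based routing. It assumes a simple hash-modulo scheme, which is only one of several possible strategies and is not necessarily what any particular product does; `shard_for` and `spread` are hypothetical names used for illustration.

```python
import hashlib
from typing import Iterable, List


def shard_for(key: str, shard_count: int) -> int:
    """Map a shard key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def spread(keys: Iterable[str], shard_count: int) -> List[int]:
    """Count how many keys land on each shard."""
    counts = [0] * shard_count
    for key in keys:
        counts[shard_for(key, shard_count)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical user IDs; each shard should receive roughly 1/16th of them,
    # which is why reads and writes scale with shard count (approximately).
    print(spread((f"user-{i}" for i in range(16000)), 16))
```

Because each key always maps to the same shard, every shard only handles its own slice of the traffic, which is where the roughly shard-count-times throughput comes from.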

It was built by folks who have only been thinking about sharding and its complexities, for many years

who have plans on their roadmap to fill any gaps with their current product

gaps that will start to appear quickly, to anyone trying to build their own sharding solution.

In other words, do-it-yourself-ers will at some point be losing a race with CodeFutures to close the same gaps, while already trying to win the race against their market competitors.

It’s in Java, vs. some other non-performant or obscure (syntactically or otherwise) language.

It allows for multiple shard trees; if you want (or have to) trade in other benefits for sharding on more than one key, you can.

Benefits of just sharding on one key include, amongst other things, knowing that if you have 16 shards, and one is unavailable, and the rest of the cluster is available, 1/16th of your data is unavailable.

With more than one shard tree, good luck doing that kind of math.
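The 1/16th arithmetic above, and why it gets murkier with a second shard tree, can be sketched as follows. The two-tree calculation assumes that an operation needs one shard from each tree and that keys are evenly and independently distributed; real workloads may not behave that way, and the function names are hypothetical.

```python
from fractions import Fraction


def unavailable_fraction(shards: int, down: int) -> Fraction:
    """Fraction of data unreachable in a single shard tree (even distribution assumed)."""
    return Fraction(down, shards)


def reachable_with_two_trees(shards_a: int, down_a: int,
                             shards_b: int, down_b: int) -> Fraction:
    """Fraction of operations that can complete when each one needs a shard
    from both trees (independence between the two keys is an assumption)."""
    up_a = 1 - unavailable_fraction(shards_a, down_a)
    up_b = 1 - unavailable_fraction(shards_b, down_b)
    return up_a * up_b


if __name__ == "__main__":
    print(unavailable_fraction(16, 1))                 # 1/16
    print(1 - reachable_with_two_trees(16, 1, 8, 1))   # 23/128
```

Even under these generous assumptions, the answer stops being a number you can do in your head, and it depends on which operations touch which trees.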

It provides a solution for the auto-increment or “I need unique key IDs” problem.

It provides a solution for the “I need connection pooling that’s balanced to shard and node count” problem.

It provides a solution for the “I want an algorithm for balancing shard reads and writes” problem.

Additionally, “I want the shard key to be based on a column I’m populating with the concatenated result of two other string keys”.
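For the “I need unique key IDs” problem mentioned above, one common approach is to hand out IDs in batches from a single authority, so application nodes rarely need to contact it. The sketch below illustrates that general technique only; it is not a description of dbShards internals, and `BatchIdSource` and `IdAllocator` are hypothetical names.

```python
import threading


class BatchIdSource:
    """Stand-in for a central authority that reserves ranges of IDs."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, size: int) -> range:
        # Reserve a contiguous block atomically so no two callers overlap.
        with self._lock:
            first = self._next
            self._next += size
        return range(first, first + size)


class IdAllocator:
    """Per-node allocator that only contacts the source when its batch runs out."""

    def __init__(self, source: BatchIdSource, batch_size: int = 100) -> None:
        self._source = source
        self._batch_size = batch_size
        self._batch = iter(())

    def next_id(self) -> int:
        try:
            return next(self._batch)
        except StopIteration:
            self._batch = iter(self._source.reserve(self._batch_size))
            return next(self._batch)


if __name__ == "__main__":
    source = BatchIdSource()
    node_a, node_b = IdAllocator(source, 3), IdAllocator(source, 3)
    print([node_a.next_id(), node_b.next_id(), node_a.next_id()])
```

The trade-off is that IDs are unique but not strictly ordered across nodes, and the central source has to be reachable whenever a node needs a fresh batch.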

Streaming agents allow you to plug into the update/insert stream, and do what you like with changes to data.

We use this to stream data into Redis, amongst other things. Redis has worked out very well for us thus far, by the way.

Other dbShards customers use this to replicate to other DBMS engines, managed by dbShards or not, such as a column store like MonetDb, InfoBright, even a single standalone MySQL server if it can handle the load.
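The streaming idea above boils down to subscribing a handler to a feed of change events. The following is a rough, generic sketch of that pattern: the event shape, `ChangeStream`, and `CacheMirror` are all hypothetical, a plain dict stands in for Redis, and none of this reflects the actual dbShards plugin API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChangeEvent:
    table: str
    op: str                 # "insert", "update", or "delete"
    key: str
    row: Optional[dict]     # new row contents, or None for deletes


class ChangeStream:
    """Minimal publish/subscribe feed of data changes (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self._handlers.append(handler)

    def publish(self, event: ChangeEvent) -> None:
        for handler in self._handlers:
            handler(event)


class CacheMirror:
    """Keeps a key-value cache in step with the stream; the dict stands in for Redis."""

    def __init__(self) -> None:
        self.cache: Dict[str, dict] = {}

    def __call__(self, event: ChangeEvent) -> None:
        cache_key = f"{event.table}:{event.key}"
        if event.op == "delete":
            self.cache.pop(cache_key, None)
        else:
            self.cache[cache_key] = dict(event.row or {})


if __name__ == "__main__":
    stream, mirror = ChangeStream(), CacheMirror()
    stream.subscribe(mirror)
    stream.publish(ChangeEvent("users", "insert", "42", {"points": 10}))
    print(mirror.cache)
```

Doing this off the change stream rather than in the request path keeps the extra work out of API latency.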

It supports consistent writes to global tables; when a write is done to a global table, it’s guaranteed to have been done on all global tables.

It doesn’t rely on MySQL’s replication and its shortcomings, but rather on its own robust, low-maintenance and flexible replication model.

Its command-line console provides a lot of functionality you’d rather not have to build.

Allows you to run queries against the shard cluster, like you were at the MySQL command line.

Soon they’re releasing a new plug-compatible version of the open source MyOSP driver, so we’ll be able to use the same mysql command line to access both dbShards and non-dbShards managed MySQL databases.

Its web console provides a lot of functionality you’d rather not have to build.

As you can see, some of these are on the pro list too, double-edged swords.

Cost – it’s not free obviously, nor is it open source.

Weigh the cost against market opportunity, and/or the additional headcount required to take a different approach.

It’s in Java, vs. Python (snark). Good thing we’ve got a multi-talented, kick-ass engineer who is now writing Java plugins when needed.

Doesn’t rely on MySQL replication, which has its annoyances but has been under development for a long time.

Nor is there enough instrumentation around lag. What’s needed is a programmatic way to find this out.
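As an illustration of what a programmatic lag check could look like, here is a small sketch of the generic heartbeat technique: the primary periodically writes a timestamp, and the lag on a secondary is the age of the newest timestamp it has received. This is an assumption about one workable approach, not a dbShards feature, and the function names are hypothetical.

```python
import time
from typing import Dict, List, Optional


def replication_lag_seconds(replicated_heartbeat_ts: float,
                            now: Optional[float] = None) -> float:
    """Age of the newest heartbeat seen on a secondary, clamped at zero
    so small clock skew never reports negative lag."""
    if now is None:
        now = time.time()
    return max(0.0, now - replicated_heartbeat_ts)


def lagging_nodes(heartbeats: Dict[str, float], threshold_seconds: float,
                  now: Optional[float] = None) -> List[str]:
    """Names of secondaries whose lag exceeds the threshold, sorted for stable output."""
    if now is None:
        now = time.time()
    return sorted(
        name for name, ts in heartbeats.items()
        if replication_lag_seconds(ts, now) > threshold_seconds
    )


if __name__ == "__main__":
    # Hypothetical heartbeat timestamps read from two secondaries.
    now = time.time()
    print(lagging_nodes({"shard1-b": now - 2, "shard2-b": now - 90}, 30, now))
```

A check like this can feed alerts, or let the application decide whether a read is safe to send to a secondary.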

Allows for multiple shard trees.

I’m told many businesses need this as a P0, and that might be true, even for us.

But I’d personally prefer to jump through fire in order to have a single shard tree, if at all possible.

The complexities of multiple shard trees, particularly when it comes to HA, are too expensive to justify unless absolutely necessary, in my humble opinion.

Better monitoring instrumentation is needed, ideally we’d have a programmatic way to determine various states and metrics.

Command line console needs improvement, not all standard SQL is supported.

That said, we’ve managed to get by with it, only occasionally using it for diagnostics.

Can’t do SQL JOINs between shard trees. I’ve heard this is coming in a future release.

This can be a real PITA, but it’s a relatively complex feature.

Another reason not to have multiple shard trees, if you can avoid them.

Go-fish queries are very expensive, and can slow performance to a halt, across the board.

We’re currently testing a hot-fix that makes this much less severe.

But slow queries can take down MySQL (e.g. thread starvation), sharding or no.
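The cost difference is easy to see in a toy model. The sketch below assumes that a “go-fish” query is one that carries no shard key and therefore has to be sent to every shard, while a keyed query is routed to exactly one; the hash scheme and function names are hypothetical.

```python
import zlib
from typing import List, Optional


def shards_touched(shard_count: int, shard_key: Optional[str] = None) -> List[int]:
    """Shards a query must visit: one if it carries a shard key, all otherwise."""
    if shard_key is None:
        return list(range(shard_count))
    return [zlib.crc32(shard_key.encode("utf-8")) % shard_count]


def query_can_run(shard_up: List[bool], shard_key: Optional[str] = None) -> bool:
    """A query can run only if every shard it must visit is available."""
    return all(shard_up[i] for i in shards_touched(len(shard_up), shard_key))


if __name__ == "__main__":
    print(len(shards_touched(16, "user-42")))  # keyed query: 1 shard
    print(len(shards_touched(16)))             # fan-out query: 16 shards
```

The fan-out query does sixteen times the work here, and it is only as fast and as available as the slowest or least healthy shard it touches.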

HA limitations, gaps that are on their near-term roadmap, I think to be released this year:

No support for eventually-consistent writes to global tables means all primaries must be available for global writes.

Async, eventually consistent writes should be available as a feature in their next build, by early October.

Fail-over to secondaries or back to primaries can only happen if both nodes are responding.

in other words, you can’t say via the console:

‘ignore the unresponsive primary, go ahead and use the secondary’

or:

‘stand me up a new EC2 instance for a secondary, in this zone/region, sync it with the existing primary, and go back into production with it’

Reliable replication currently requires two nodes to be available.

In other words, if a single host goes down, writes for its shard are disallowed.

In the latest versions, there’s a configuration “switch” that allows for failing-down to primary

But not fail down to secondary. This is expected in an early Q4 2012 version release.

dbsmanage host must be available.

dbShards can run without it for a bit, but stats/alerts will be unavailable for that period.

Shard 1 must be available for new auto-increment batch requests.

go-fish queries depend on all primaries (or maybe all secondaries via configuration, but not some mix of the two as far as I’m aware) to be available

DIY

I can rattle off the names of a number of companies who have done this, and it took them many months longer than our deployment of dbShards (about six weeks, mostly because our schema was already largely ready for it).

Given a lot of time to do it, DIY appeals to me even now, but I still wouldn’t go this route, given the pros/cons above.

The latest release of MySQL Cluster may be an option for you. It wasn’t for us back with MySQL 5.0, and likely isn’t now, due to its limitations (e.g. no InnoDB).

AWS RDS was an option for us from the onset, and I chose to manage our own instances running MySQL, before deciding how we’d shard.

For the following reasons:

I wanted ownership/control around the replication stream, which RDS doesn’t allow for (last I looked) for things like:

BI/reporting tools that don’t require queries to be run against secondary hosts.

This hasn’t panned out as planned, but could still be implemented, and I’m happy we have this option, hope to get to it sometime soon.

Asynchronous post-transaction data processing.

This has worked out very well, particularly with dbShards, which allows you to build streaming plugins and do whatever you want when data changes, with that data.

Event-driven model.

Better for us than doing it at the app layer, which would increase latencies to our API.

Concern that the critical foundational knobs and levers would be out of our reach.

Can’t say for sure, but this has likely been a good choice for our particular use-case; without question we’ve been able to see and pull levers that we otherwise wouldn’t have been able to, in some cases saving our bacon.

Their uptime SLAs, which hinted at unacceptable downtime for our use-case.

Perhaps the biggest win on the decision not to use RDS; they’ve had a lot of down-time with this service.

Ability to run tools, like mk-archiver (which we use extensively for data store size management), on a regular basis without a hitch. Not 100% sure, but I don’t think you can do this with RDS.

CloudWatch metrics/graphing is a very bad experience, and we want/need better operational insight than it provides. Very glad we don’t depend on CW for this.
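On the data store size management point above, the general idea behind an archiving tool is to move old rows out of a hot table in small batches, so that no single transaction holds locks for long. The sketch below illustrates that technique with Python's built-in sqlite3 module; it is not how mk-archiver itself is implemented, and the `events` and `events_archive` tables are hypothetical.

```python
import sqlite3


def archive_old_rows(conn: sqlite3.Connection, cutoff: int, batch_size: int = 1000) -> int:
    """Move rows with created_at < cutoff from events to events_archive,
    one small transaction per batch. Returns the number of rows moved."""
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id, created_at, payload FROM events "
            "WHERE created_at < ? ORDER BY id LIMIT ?",
            (cutoff, batch_size),
        ).fetchall()
        if not rows:
            return moved
        with conn:  # commits on success, rolls back if the batch fails
            conn.executemany(
                "INSERT INTO events_archive (id, created_at, payload) VALUES (?, ?, ?)",
                rows,
            )
            conn.executemany("DELETE FROM events WHERE id = ?", [(r[0],) for r in rows])
        moved += len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE events_archive (id INTEGER PRIMARY KEY, created_at INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [(i, i, "x") for i in range(10)])
    print(archive_old_rows(conn, cutoff=7, batch_size=3))
```

Running something like this on a schedule is what keeps table sizes in check; the batch size is the knob that trades total run time against impact on live traffic.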

All of these reasons have come at considerable cost to us as well, of course.

Besides the obvious host management cycles, we have to manage :

MySQL configurations, that have to map to instance sizes.

Optimization and tuning of the configurations, poor-performance root-cause analysis.

MySQL patches/upgrades.

maybe more of the backup process than we’d like to.

maybe more HA requirements than we’d like to; although I’m glad we have more control over this, per my earlier comment regarding downtime.

maybe more of the storage capacity management than we’d like to.

DBA headcount costs.

We’ve gone through two very expensive and hard-to-find folks on this front, plus costly and often not-helpful, cycle-costing out-sourced DBA expertise.

Currently getting by with a couple of experienced engineers in-house and support from CodeFutures as-needed.

As I’ve seen numerous times in the past, AWS ends up building in features that fill gaps that we’ve either developed solutions for, or worked around.

So if some of the RDS limitations can be worked-around, there’s a good chance that the gaps will be filled by AWS in the future.

I’ve never been a big fan of meetings, so naturally conferences were on my no-fly list for a long time: a big building with a big meeting in the early AM, followed by a mitosis into smaller meetings, followed by more and more meetings, all gradually shrinking throughout the day, finally to be absorbed by the nearest bar once some sort of conferential Hayflick limit is reached.

I’m happy to say that I was wrong about them; over the last few years I’ve attended some fantastic, rewarding conferences. I’ve also attended some anti-awesome ones. Being keen on agile retrospectives and data-driven decision-making, I’ll posit the following as a formula for measuring the efficacy and ROI of any given conference, spec-style. You may be privy to my Ruger Fault Equivalency, this is my Conference Non-Con Postulation:

notes : total line count of notes taken during a conference. A good conference causes me to write furiously; even though there may be slides or notes offered online afterwards, this is the best way for me to internalize, to any degree.

refactor index : number of minutes I spend cleaning up my notes, so that I can share them. A good conference will cause me to review my notes, clean and boil some of the salient moments up into a handful of takeaways.

note virality : number of people I share my notes with afterward. A good conference will inspire me to inflict my notes on my team at work, at which point they will be thankful for a high note refactor index.

players : people I had the pleasure of spending time with, who also impressed, inspired, or gave me a laugh at the conference. Expressed as a quality score, 1 to 3. A good conference will even have a few folks that exhibit all of these traits (e.g. Keith Smith, Brad Feld, Ryan McIntyre), that would incur a score of 3.

injuries : number of bodily injuries incurred at said conference, not caused by dude-hold-my-beer moments or other virtuous activity. Good to have a low count of these.

bullshit : number of times I look at the ceiling, for reasons other than math or loudspeaker brand detection.

Given those input definitions, the Conference Non-Con Postulation is as follows:

A few things to note:

The number of minutes spent refactoring my notes provides diminishing returns. Also, it’s no coincidence that the refactor index summation will result in a harmonic number.

Note that virality has an exponential effect.

Player count wraps everything with an even greater exponential effect, and it only takes one great player to make a huge difference.

While bullshit and injuries ultimately decrease overall ROI, they will only have material effect when other inputs (e.g. notes, virality) are low.
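The formula itself is not reproduced in this text. Purely as an illustration, here is one hypothetical form that is consistent with the notes above; the exact shape is an assumption, not the original postulation. Here r is the refactor index, v is note virality, and p is the players score.

```latex
\mathrm{ROI} \;=\; \Bigl( \mathit{notes} \cdot H_{r} \cdot e^{\,v} \Bigr)^{p}
\;-\; \bigl( \mathit{bullshit} + \mathit{injuries} \bigr),
\qquad
H_{r} \;=\; \sum_{k=1}^{r} \frac{1}{k}
```

In this form the harmonic sum gives the refactor index its diminishing returns, virality enters exponentially, the players score wraps the whole product as an exponent, and the subtracted terms only matter when the first term is small.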

Ever since we registered our startup’s domain with them years ago, we’ve been anxious to get off the free DNS provided by GoDaddy at the least, and ideally change registrars as well. With all the issues other companies have had with them + their political positioning … we just want out. It’s actually embarrassing to admit we were in this situation for so long, but I’m swallowing my pride in hopes that this will help others out – open-sourced embarrassment (O-ASSMENT). Until recently, we really haven’t had the time/resources to tackle it without affecting product development efforts and higher priorities. One of our senior guys has been exploring options for weeks, and we thought we were in a good position to make a change.

There are two parts of this puzzle that need to be fit: GoDaddy is (was) the registrar of our domain, and they also are hosting DNS for us. That’s a typical set-up when you first register your domain these days; most registrars also offer managed DNS. But it’s not a good practice to leave your DNS hosted with your registrar – it’s better to separate them right when you register the domain, if you can.

The two parts (registrar and managed DNS) are intertwined. I’m trying to avoid DNS details for the non-technical, but essentially/simply/horribly put: DNS is much like a big phone book that yourDomain.com has a page in, and that page maps friendly names like www.yourDomain.com and api.yourDomain.com to IP addresses. One particularly critical mapping provides the IPs pointing to our authoritative name servers. This mapping is also stored in the index of the phone book, by a higher DNS authority…like Elvis. Servers that need to know where www.yourDomain.com is (in other words, its IP address) look in the index if they need to, and then get the IP from our page in the book. This is where the registrar comes in: you can only change the IP of the authoritative name servers through the registrar of the domain. Otherwise, with regard to DNS/WHOIS records, the registrar is just a text string, a name without a number.

But this makes registrars ultimately all-powerful; you can make all the DNS changes you want, but if the authoritative name servers are changed and pointed to hosts that don’t have our DNS information, or don’t have the right information – you’re totally FUBARD.
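The phone book analogy can be modeled in a few lines. This is a deliberately simplified sketch: real DNS resolution involves root and TLD servers, caching, and many record types, and the host names and addresses below are placeholders (the IPs come from a range reserved for documentation).

```python
from typing import Dict, List, Optional

# The "index of the phone book": which name servers are authoritative for a domain.
# Only the registrar can change this mapping.
DELEGATIONS: Dict[str, List[str]] = {"yourdomain.com": ["ns1.dns-host.example"]}

# Each name server's "pages": the records it holds for the zones it hosts.
NAME_SERVERS: Dict[str, Dict[str, Dict[str, str]]] = {
    "ns1.dns-host.example": {
        "yourdomain.com": {
            "www.yourdomain.com": "192.0.2.10",
            "api.yourdomain.com": "192.0.2.20",
        }
    },
    "ns1.other-host.example": {},  # a server that has no records for our zone
}


def resolve(hostname: str,
            delegations: Dict[str, List[str]],
            name_servers: Dict[str, Dict[str, Dict[str, str]]]) -> Optional[str]:
    """Look up the delegation for the domain, then ask those servers for the record."""
    hostname = hostname.lower()
    domain = ".".join(hostname.split(".")[-2:])
    for ns in delegations.get(domain, []):
        zone = name_servers.get(ns, {}).get(domain, {})
        if hostname in zone:
            return zone[hostname]
    return None


if __name__ == "__main__":
    print(resolve("www.yourDomain.com", DELEGATIONS, NAME_SERVERS))
    # Point the delegation at a server without our records and lookups fail.
    broken = {"yourdomain.com": ["ns1.other-host.example"]}
    print(resolve("www.yourDomain.com", broken, NAME_SERVERS))
```

The second lookup returns nothing even though the records still exist on the old server, which is the failure mode described above.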

We shopped around for a different registrar, and at one point were ready to sign an expensive deal with MarkMonitor, who from all accounts is the market leader in terms of locking things down from a security standpoint. But they couldn’t seem to get their act together fast enough and were too expensive for our growth stage anyway. We decided to go with NetworkSolutions, the “first” registry operator and registrar for the com, net, and org registries.

GoDaddy offers free DNS when you register your domain with them, but they also offer Premium DNS. We upgraded to premium weeks ago, to get a better idea for our DNS traffic and to price out competitors. To be totally clear, at this point in the story we’re paying GoDaddy for their premium DNS hosting option. GoDaddy offers this to their customers as a stand-alone service; in other words, you can use GoDaddy just as a managed DNS provider (as long as you have a domain or two registered with them, I’d assume).

So, given that we wanted to move our registrar (because we didn’t want GoDaddy to own the gate to our authoritative name servers), and our DNS, we had a few options:

Try to move both at once. Not a good-feeling option for probably obvious reasons.

Move to a different managed DNS provider, then once that’s complete, move registrars. Moving DNS is more complicated and in theory (or logically) more risky than moving registrars.

Move registrars, and once that’s complete move to a different managed DNS provider. This seemed like the lowest risk option, given all the inputs at the time, and it’s what we tried to do.

Here’s the relative timeline, what happened, and what we expected/should have happened:

Our senior engineer talked on the phone with folks at GoDaddy and our new registrar, NetworkSolutions, both of which confirmed our understanding and expectation that during and after the change, the name server addresses would remain pointed at GoDaddy’s name servers until we took action to change them. The only thing that was supposed to change was the registrar’s name. We reiterated with them that downtime wasn’t an option for us, and they reassured.

Our engineer initiated the transfer song-and-dance. The first thing he noticed was that we couldn’t get to any DNS information in GoDaddy anymore, including the NS records. OK….kinda makes sense to prevent changes while the transfer is happening I suppose, but we should at least have read-access to the current records, right?

So he called GoDaddy, who pointed us to a page where we could access the current DNS records if we did a ‘view source’ (!), and also pointed us to a ‘pending transfers’ section of their site that would expedite the acceptance process. No email or other instructions about this bit were previously given; this whole process normally takes place mostly via automated email, and registrar documentation on all of this is sh** across the market.

Then we took a step that I’m quite glad about : we saved all of our zone file information and DNS records to a spreadsheet. Go do this now for yourself, if you are in the same kind of situation. Seriously.

As instructed by someone at GoDaddy, we then ‘accepted’ the registrar transfer on their site.

At this point, we’re thinking that our premium DNS is going to sit there untouched, and that it’s going to be five days before the registrar is transferred. 5 days because the registry operator – the root authority for the .com domain – has that as a grace period before making the change, in case any party cancels the transfer. Wrong on both counts.

Shortly after we accepted the transfer in GoDaddy’s web interface, they deleted our DNS records. We had a short time-to-live setting on the records, so after 30 minutes hosts could no longer look up which IP to use for any ourDomain.com service. The name server entries weren’t changed of course, because GoDaddy was no longer in a position to do so. But the records sitting on those name servers that mapped our service names to IPs were gone. That meant that slowly, across the net, customers stopped being able to access services on ourDomain.com, including email.

Our engineer called them immediately, described what happened, and asked why our DNS records disappeared. Answer : “because you moved your domain to another registrar”.

OK, can we get that re-instated? Answer : “I can give you access to the DNS manager page again for this domain, but you have to put all the information in yourself.” I’m pretty sure they’re required to keep this information for some period of time, to be in compliance with their registrar agreement with ICANN.

So our engineer gets out our trusty spreadsheet, and manually copies the information back in. Shortly thereafter we start to see a gradual recovery, as clients start to be able to resolve hostnames to IP addresses again.
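The spreadsheet step in the timeline above is easy to script. Here is a minimal sketch that writes DNS records to CSV and reads them back; the column layout and the sample records are assumptions for illustration, and getting the records out of your provider in the first place is left out.

```python
import csv
import io
from typing import Dict, Iterable, List, TextIO

FIELDS = ["name", "type", "ttl", "value"]  # assumed column layout


def dump_records(records: Iterable[Dict[str, object]], out: TextIO) -> None:
    """Write DNS records as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def load_records(src: TextIO) -> List[Dict[str, object]]:
    """Read records back, restoring the TTL to an integer."""
    return [dict(row, ttl=int(row["ttl"])) for row in csv.DictReader(src)]


if __name__ == "__main__":
    # Placeholder records; the IP is from a range reserved for documentation.
    records = [
        {"name": "www.example.com.", "type": "A", "ttl": 1800, "value": "192.0.2.10"},
        {"name": "example.com.", "type": "MX", "ttl": 1800, "value": "10 mail.example.com."},
    ]
    buffer = io.StringIO()
    dump_records(records, buffer)
    print(buffer.getvalue())
```

To write to disk instead of a buffer, open the file with `open("zone-backup.csv", "w", newline="")` and pass it as `out`. Keeping the backup somewhere other than the DNS provider is the whole point.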

That whole escapade pretty much escalated the priority of us getting off their managed DNS, which we did in the next week. After looking at various (mostly expensive) options, we moved over to Amazon Route 53, which went relatively seamlessly. The nice thing about Route 53 is that it’s accessible programmatically and can be managed via scripts just like the rest of our AWS resources.
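As a rough idea of what managing records by script can look like, the sketch below builds the change batch structure that the Route 53 API accepts for creating or updating records. The input record format and `build_change_batch` are hypothetical; the block itself makes no AWS calls, so it runs without credentials.

```python
from typing import Dict, Iterable, List, Tuple


def build_change_batch(records: Iterable[Dict[str, object]],
                       comment: str = "restore from backup") -> Dict[str, object]:
    """Group records by (name, type) and express them as UPSERT changes."""
    grouped: Dict[Tuple[str, str], Dict[str, object]] = {}
    for rec in records:
        key = (str(rec["name"]), str(rec["type"]))
        entry = grouped.setdefault(key, {"ttl": int(rec["ttl"]), "values": []})
        entry["values"].append(str(rec["value"]))

    changes: List[Dict[str, object]] = []
    for (name, rtype), entry in grouped.items():
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": rtype,
                "TTL": entry["ttl"],
                "ResourceRecords": [{"Value": v} for v in entry["values"]],
            },
        })
    return {"Comment": comment, "Changes": changes}


if __name__ == "__main__":
    batch = build_change_batch([
        {"name": "www.example.com.", "type": "A", "ttl": 300, "value": "192.0.2.10"},
        {"name": "www.example.com.", "type": "A", "ttl": 300, "value": "192.0.2.11"},
    ])
    print(batch)
    # Applying it needs the boto3 library and AWS credentials, e.g.:
    #   boto3.client("route53").change_resource_record_sets(
    #       HostedZoneId="<your hosted zone id>", ChangeBatch=batch)
```

Records that share a name and type are grouped into a single record set because that is how Route 53 models them; sending them as separate changes would not give the intended result.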

I totally get that the herky-jerky, WHOIS-on-first nature of transferring name server and DNS ownership puts registrars in an odd situation, one that requires competitors to coordinate if they’re going to act in the best interest of their soon-to-be/just-cancelled customers. But there’s got to be a better way than this ridiculous bullsh** we just went through. Registrars who offer DNS hosting as a service have an obligation to publish the ‘how do I get out without getting ass-f*****’ instructions at the very least. Better yet, for a grace period, leave DNS the way it is until an NS record gets changed at the root level, messaging their customers about what’s coming in the meanwhile. I know that some registrars do provide a grace period like this.

I’m obviously not a registrar, and admit that my proposed solutions may not be tenable. But there’s got to be a better way.

We’re not the only startup in the bus that’s running over GoDaddy, there’s pretty much wide agreement on this topic. I’m glad we’re over that speed-bump and the startup bus is barreling forward at high-speed as usual.

I’m tempted to turn this into an ICANN complaint – any input on whether that would hold up, or be worthwhile? ( to comment you have to be on this post’s page, rather than the blog home page)

Update : in case it’s helpful for anyone, I’ve started gathering some numbers on what some friends’ startups are using (without major complaint) for registrar and hosted DNS, and will update here for now. Please email me directly if you’d like me to add something to this list.