Posted
by
timothy on Thursday February 21, 2013 @03:04PM
from the mostly-for-the-web-scale-of-it dept.

Nerval's Lobster writes "Software developer Jeff Cogswell, who matched up Java and C# and peeked under the hood of Facebook's Graph Search, is back with a new tale: why his team decided to go with Amazon's DynamoDB over MongoDB when it came to building a highly customized content system, even though his team specialized in MongoDB. While DynamoDB did offer certain advantages, it also came with some significant headaches, including issues with embedded data structures and Amazon's sometimes-confusing billing structure. He offers a walkthrough of his team's tips and tricks, with some helpful advice on avoiding pitfalls for anyone interested in considering DynamoDB. 'Although I'm not thrilled about the additional work we had to do (at times it felt like going back two decades in technology by writing indexes ourselves),' he writes, 'we did end up with some nice reusable code to help us with the serialization and indexes and such, which will make future projects easier.'"

Where I work, the most senior and backstabby developer saddles the rest of the team with the worst tools he can find, and then blames them (behind their backs, of course) for the results of his poor decision-making.

An ORM isn't a silver bullet. You still need to understand how your objects map onto the database, or you're back to square one with a poorly performing database. In fact it's probably worse, as you then need to figure out what the database *and* the ORM are doing.
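The classic trap here is the N+1 query pattern. A minimal sketch with sqlite3 (table and column names invented for illustration) showing how naive per-object loading multiplies round trips compared to the single join an ORM *should* emit:

```python
import sqlite3

# Hypothetical schema for illustration: articles and their authors.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (id INTEGER PRIMARY KEY, article_id INTEGER, name TEXT);
    INSERT INTO article VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO author VALUES (1, 1, 'Alice'), (2, 2, 'Bob'), (3, 3, 'Carol');
""")

# Naive ORM-style lazy loading: one query for the articles, then one per article.
queries = 0
articles = conn.execute("SELECT id, title FROM article").fetchall()
queries += 1
for art_id, _title in articles:
    conn.execute("SELECT name FROM author WHERE article_id = ?", (art_id,)).fetchall()
    queries += 1          # N extra round trips: the N+1 problem

# Eager loading: a single join returns the same data in one query.
joined = conn.execute("""
    SELECT article.title, author.name
    FROM article JOIN author ON author.article_id = article.id
""").fetchall()

print(queries)      # 4 queries the lazy way
print(len(joined))  # 3 rows from the single join
```

If you don't know which of these your ORM is emitting, you haven't finished the mapping job.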

"Our client is paying less than $100 per month for the data. Yes, there are MongoDB hosting options for less than this; but as I mentioned earlier, those tend to be shared options where your data is hosted alongside other data."

Having actually RTFA, it just reinforces how poorly most programmers understand relational databases and why they shouldn't be let near them. It's so consistently wrong it could just be straight trolling (which, given it's posted to post-Taco Slashdot, is likely).

"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."

This is completely wrong; that's a textbook case of something perfectly suited to a traditional (relational) database.
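For reference, the textbook relational shape for "each article has multiple authors" is a junction table. A minimal sqlite3 sketch (names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author  (id INTEGER PRIMARY KEY, name TEXT, bio TEXT);
    -- The junction table: one row per (article, author) pairing.
    CREATE TABLE article_author (
        article_id INTEGER REFERENCES article(id),
        author_id  INTEGER REFERENCES author(id),
        PRIMARY KEY (article_id, author_id)
    );
    INSERT INTO article VALUES (1, 'On Widgets');
    INSERT INTO author  VALUES (1, 'Alice', 'bio a'), (2, 'Bob', 'bio b');
    INSERT INTO article_author VALUES (1, 1), (1, 2);
""")

# "More authors than articles" is no problem at all: just join through.
names = [row[0] for row in conn.execute("""
    SELECT author.name
    FROM article_author
    JOIN author ON author.id = article_author.author_id
    WHERE article_author.article_id = 1
    ORDER BY author.name
""")]
print(names)  # ['Alice', 'Bob']
```

Each author's bio lives in exactly one row, no matter how many articles they wrote.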

After a few years in other fields, I'm doing some serious coding again. Postgres and Doctrine. I can do in a few lines of code and SQL what would take a small program or module to do without the power of SQL and an ORM.

Anyone who reinvents that wheel because he thinks he can do the 2% he recoded better is a moron.

I'm pretty sure that my coding does not satisfy some theoretical top-of-the-mountain coding-structure fanatics. But that is "wrong" only in the sense that it does not satisfy opinions. And when it comes to coding styles, in the end they're just opinions, and ten years from now we'll laugh about most of today's patterns.

And then there is programmatically correct, not unnecessarily wasteful with resources, and easy to understand. Those are not opinions: your code either gives the right results or it doesn't.

I jest slightly. Certainly there are applications where SQL and relational systems in general are overkill, or where they don't solve certain kinds of problems well. But I'll be frank: they're pretty rare. I will use binary search/sort mechanisms for simple hashes and other similar two-column key-value problems, mainly because there's absolutely no need to truck along gazillions of bytes worth of RDBMS where quicksort and a binary search are all that's needed. But if you get beyond that, you're almost inevitably going to start wishing you had JOIN. And then you end up having to implement such functionality yourself.
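A sketch of that point, assuming plain sorted key-value lists (data invented for illustration): Python's bisect covers the two-column case fine, but the moment you need to correlate two such lists, you're hand-rolling a join.

```python
import bisect

# Two sorted key-value "tables" (illustrative data).
prices = [("apple", 3), ("banana", 1), ("cherry", 4)]
stock  = [("apple", 10), ("cherry", 0)]

def lookup(table, key):
    """Binary search on a sorted list of (key, value) pairs."""
    keys = [k for k, _ in table]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return table[i][1]
    return None

print(lookup(prices, "banana"))  # 1

# ...and here is the hand-rolled equivalent of
# SELECT p.key, p.price, s.qty FROM prices p JOIN stock s ON p.key = s.key
joined = [(k, v, lookup(stock, k)) for k, v in prices
          if lookup(stock, k) is not None]
print(joined)  # [('apple', 3, 10), ('cherry', 4, 0)]
```

Fine for two lists; by the third or fourth correlated lookup, you've rebuilt a worse query planner.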

Every tool for the job, to be sure, but I just happen to think there are far fewer problems that nosql style systems solve than some like to think.

Every tool for the job, to be sure, but I just happen to think there are far fewer problems that nosql style systems solve than some like to think.

I strongly agree with this, and because of that I've been severely chastised by quite a few kool-aid drinkers. On my current job we have a NoSQL database (a MongoDB one, actually) and we indeed have had to reinvent some SQL here and there, including a few manual joins. The job would just have been far smoother (and faster to develop), and surely more performant, if we used a well-established SQL database, but someone decided that it wasn't buzzwordy enough.

"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."

Good god, how would he model invoices with multiple line items? Where, you know, there were actually more line items than invoices?! Mind blown.

Or customers that might belong to zero or more demographics? There could be more customers than defined demographics to tag them with... or fewer... we don't even know, and it could change as more of either are added!!

We need a whole new database paradigm!

Or the sample Northwind database that's been shipping with Access since the '90s.

No, it's completely right: the traditional way to use a database is to blob everything together into one huge table, preferably with many NULLs, then limit your query to SELECT * FROM Table; and finally process the results directly in VB6, with bonus points for a buggy parser for unpicking comma-separated fields.

Note: he said "traditional" not "sane relational".

Sarcasm aside, his reason for not using a relational database is that he'd need to use more than one table, and then he'd have to join them.

No, no, no, you let your tedious "DBAs" think they're right and do all that "normalization" and "tuning" shit they keep yammering on about (whatevs), then get the new shiny [microsoft.com] so you can blob the whole fucker up and never have to worry about anything but said "SELECT * FROM FOO." It's great because our developers no longer have to talk to our DBAs about "optimizing" all that dynamic SQL our webforms were generating. The DBAs are now screaming about resource utilization, but, HELLO, they're the ones who insisted on all that normalization in the first place.

Entity Framework is good when you use it with a properly designed database. It saves a lot of work, which, correct me if I'm wrong, is the whole point of computers. There are so many times that people forget that very simple fact in their rush to wave their e-peen around.

For normalized databases, this is often considered a best practice, although another option would be to store multiple author IDs in the article tables—something that would require extra fields, since most articles had more than one author. That would also require that we anticipate the maximum of author fields needed, which could lead to problems down the road.

A single field with delimited index keys pointing to an author table: I learned that in 1996. Then compressing the field with a dictionary, increasing the number of keys that fit and speeding up searches through it: learned that in 1998.
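For the curious, the 1996-era trick being described looks roughly like this (a sketch, not an endorsement; as other comments in this thread point out, a junction table does this job properly):

```python
# Sketch of the "delimited keys in one field" trick described above.
# Author IDs are packed into a single delimited string column.

def pack_ids(ids, sep=","):
    """Pack a list of integer keys into one delimited field."""
    return sep.join(str(i) for i in ids)

def unpack_ids(field, sep=","):
    """Parse the delimited field back into integer keys."""
    return [int(s) for s in field.split(sep)] if field else []

article_row = {"title": "On Widgets", "author_ids": pack_ids([17, 42, 9])}

# Every query that touches authors now needs this parse step --
# which is exactly the work a relational JOIN would do for you.
print(unpack_ids(article_row["author_ids"]))  # [17, 42, 9]
```

The cost shows up later: you can't index, constrain, or join on the packed field without unpacking every row.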

Having actually RTFA, it just reinforces how poorly most programmers understand relational databases and why they shouldn't be let near them. It's so consistently wrong it could just be straight trolling (which, given it's posted to post-Taco Slashdot, is likely).

"However, the articles also contained data less suited to a traditional database. For example, each article could have multiple authors, so there were actually more authors than there were articles."

This is completely wrong; that's a textbook case of something perfectly suited to a traditional (relational) database.

Well, based on how many things are wrong in the Java vs C# comparison, too, one can only guess that the "software developer" is just some hack who is comped by Slashdot to drive clicks to their sub-sites.

Man, this place has really gone to shit in the last year -- just a waste of time to read. Sucks it's hard to break 15 years of habit...

MongoDB would have been perfect based on the structure of the data, but the client didn't want to pay for setup and hosting costs. DynamoDB was the cheaper alternative, but more of a pain in the ass to implement. Makes me wonder if the hosting cost savings offset the additional development time.

As someone whose work and thinking are firmly planted in traditional RDBMS, a few of those decisions did not make sense.

I understand what he's saying about normalized tables for authors, keywords, and categories. But when he then has to build and maintain index tables for authors, keywords, and categories anyway, doesn't that negate any advantage of not having those tables?

I understand he's designed things to ease retrieval of articles, but it seems the trade-offs on other functions are too great. It's nice that an author's bio is right there in the article object, but when it's time to update the bio, does that mean going through and touching every article by that author?

I've got a bunch of similar examples, and I would not be at all surprised if they all boiled down to "I don't understand what this guy is doing." But basically, isn't NoSQL's strength in dealing with dynamic content? And in this example, serving static articles, isn't the choice between NoSQL and a traditional RDBMS essentially down to personal preference?

Maybe you should factor in the usage pattern and instance counts as well.

Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.

Oh come on now. Play fair. If you start throwing around advanced database features like materialized views then you will immediately invalidate 90% of the use cases commonly used for choosing NoSQL over relational databases. That is just mean.
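For readers who haven't met them: a materialized view precomputes and stores a query result, so reads are cheap and one refresh propagates updates everywhere. SQLite has no MATERIALIZED VIEW, so this sketch (invented names) simulates one with a plain table plus a refresh step; in Postgres it would be CREATE MATERIALIZED VIEW / REFRESH MATERIALIZED VIEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT, bio TEXT);
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Alice', 'old bio');
    INSERT INTO article VALUES (1, 'A', 1), (2, 'B', 1);
""")

def refresh_article_view(conn):
    """Rebuild the denormalized read-side table. Postgres would replace
    this drop-and-recreate with REFRESH MATERIALIZED VIEW."""
    conn.executescript("""
        DROP TABLE IF EXISTS article_with_bio;
        CREATE TABLE article_with_bio AS
            SELECT article.title, author.name, author.bio
            FROM article JOIN author ON author.id = article.author_id;
    """)

refresh_article_view(conn)
conn.execute("UPDATE author SET bio = 'new bio' WHERE id = 1")
refresh_article_view(conn)  # one refresh fixes the bio in every article row

bios = {row[0] for row in conn.execute("SELECT bio FROM article_with_bio")}
print(bios)  # {'new bio'}
```

You get the denormalized read performance NoSQL promises, without hand-writing the update fan-out.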

Maybe you should factor in the usage pattern and instance counts as well.

Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.

That's exactly the sort of answer I was looking for. Thank you. (Actually, I'd expect most bios get updated only a handful of times over the life of the author. You start with first publications as a grad student, then you leave school, maybe change jobs a couple of times, maybe a few notable achievements, then the author dies.)

That is the sort of design consideration I'd like to read about. That would give a useful comparison between platforms. As it is, this article boils down to "I went NoSQL over RDBMS because... well, just because. I went Amazon over something else because it's easier for my idiot client to administer."

Someone's bio might appear in how many articles? A few hundred? And how often will the bio be updated? A couple of times a year? So, updating a bio comes down to touching a few hundred records a few times a year. Compare that with thousands of accesses per day and you've suddenly tipped the scale.

That would make sense if you had to pull bios with an article, which should hardly be the case. At most, you'd have to pull in the current authors' affiliations. A bio would ideally stay behind an author link and be pulled up quite rarely. I for one would much rather have a list of authors immediately followed by the abstract than have to move through several pages of biographies for an article with 4-5 authors in order to find the abstract and the actual article. So for me, the decision to put every bio in every article makes little sense.

Don't try to actually make sense of the decisions made in the article. I am glad that he summed up all of the reasons why he didn't go with a relational database early in the article, so I didn't have to bother reading the rest. I am an advocate of NoSQL, but this whole article is describing a project that is almost perfect for a relational database.

But considering this author's previous analysis of Java vs C#, I am not surprised that this article was hardly worth the time to read.

Don't try to actually make sense of the decisions made in the article. I am glad that he summed up all of the reasons why he didn't go with a relational database early in the article, so I didn't have to bother reading the rest. I am an advocate of NoSQL, but this whole article is describing a project that is almost perfect for a relational database.

Heck yeah, it reminds me of a project I did in 2004 or 2005, which stored hundreds of thousands of articles (some of them more than 64 KB!) with multiple authors.

In my opinion, you must have a VERY good reason before even considering giving up ACID transactions. If your RDBMS isn't fast enough, almost certainly it's because you're doing it wrong, not because there's anything fundamentally wrong with the tool.

It's nice an author's bio is right there in the article object, but when it's time to update the bio, that does mean going through and touching every article by that author?

Actually, you don't update the biographical information for an article. The biographical information in the article is supposed to reflect the biographical information for the author at the time at which the article is published. When you update the biographical information, it goes into any articles published after the bio is updated.
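That point-in-time behavior falls out naturally in either data model if you treat the bio as a snapshot taken at publish time. A sketch with invented names, using plain dicts for brevity:

```python
# Sketch: snapshot the author's bio into the article at publish time,
# so later bio edits don't rewrite already-published articles.
authors = {17: {"name": "Alice", "bio": "Grad student at Example U."}}

def publish(title, author_id):
    # Copy (don't reference) the bio as it stands right now.
    return {"title": title, "author_id": author_id,
            "bio_at_publish": authors[author_id]["bio"]}

art = publish("On Widgets", 17)
authors[17]["bio"] = "Professor at Example U."   # later bio update

print(art["bio_at_publish"])   # the article keeps the bio as of publication
print(authors[17]["bio"])      # the author record moves on independently
```

In a relational schema this is just a bio_at_publish column on the article (or a versioned author table); embedding in a document store gives the same semantics by accident.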

Throughout the article the client says they don't want full-text search. The author says he can "add it later," then compresses the body text field. Metadata like authorship information is also stored in a nasty JSON format—so say goodbye to being able to search that later, too!

About that compression...

That compression proved to be important due to yet another shortcoming of DynamoDB, one that nearly made me pull my hair out and encourage the team to switch back to MongoDB. It turns out the maximum record size in DynamoDB is 64K. That’s not much, and it takes me back to the days of 16-bit Windows where the text field GUI element could only hold a maximum of 64K. That was also, um, twenty years ago.

Which is a limit that, say, InnoDB in MySQL also has. So, let's tally it up:

There's no way at all to search article text.

Comma-separated lists must be parsed to query by author name.

The same applies to keywords...

And categories...

So what the hell is this database for? It's unusable, unsearchable, and completely pointless. You have to know the title of the article you're interested in to query it! It sounds, honestly, like this is a case where the client didn't know what they needed. I really, really am hard-pressed to fathom a repository for scientific articles where they store the full text but only need to look up titles. With that kind of design, they could drop their internal DB and just use PubMed or Google Scholar... and get way better results!

I think the author and his team failed the customer in this case by providing them with an inflexible system. Either they forced the client into accepting these horrible limitations so they could play with new (and expensive!) toys, or the client just flat-out doesn't need this database for anything (in which case it's a waste of money.) This kind of data absolutely needs to be kept in a relational database to be useful.

OK, that's what I thought. Well, first, for anyone who hasn't read or doesn't remember that "Java vs. C#" thing, don't go back and read it now. Save your time, it's horrible.

Now, for the current article, isn't designing a database all about trade-offs? E.g., indexes make it easier to find stuff, but then add extra work (updating indexes) when adding stuff. It's about balancing reading and writing, speed and maintenance, etc. And it seems like this guy has only thought about pulling out a single article, to the exclusion of everything else.

Do we just not understand DynamoDB? How does this system pull all the articles by a certain author, or with a certain keyword? What if they need to update an author's bio? With categories stored within the article object, how does he enforce integrity, so all "general relativity" articles end up with "general relativity" and not a mix of GR, Gen Rel, g relativity, etc.?
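In a relational design that last worry is solved declaratively: a foreign key rejects any spelling that isn't in the categories table. A sqlite3 sketch (invented names; note that SQLite requires foreign keys to be switched on explicitly):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE category (name TEXT PRIMARY KEY);
    CREATE TABLE article_category (
        article_id INTEGER,
        category   TEXT REFERENCES category(name)
    );
    INSERT INTO category VALUES ('general relativity');
""")

# The canonical spelling is accepted...
conn.execute("INSERT INTO article_category VALUES (1, 'general relativity')")

# ...and any ad-hoc variant is rejected at the database layer.
try:
    conn.execute("INSERT INTO article_category VALUES (2, 'Gen Rel')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True: the misspelled category never gets in
```

With categories embedded in each article object, every writer of every client has to re-implement that check, or the mix of GR / Gen Rel / g relativity happens on its own.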

What happens when they want to add full text search? Or pictures to articles? That 64k limit would seem like a deal breaker. 64k that includes EVERYTHING about an article--abstract, full text, authors and bios, etc.

My first thought was, this does not make much sense. Then I thought, well, I work with old skool RDBMS, and I just don't get NoSQL. But now I think, naw, this guy really doesn't know enough to merit the level of attention his blatherings get on /.

That compression proved to be important due to yet another shortcoming of DynamoDB, one that nearly made me pull my hair out and encourage the team to switch back to MongoDB. It turns out the maximum record size in DynamoDB is 64K. That’s not much, and it takes me back to the days of 16-bit Windows where the text field GUI element could only hold a maximum of 64K. That was also, um, twenty years ago.

I didn't understand why he dismissed S3 for storing his documents in the first place:

Amazon has their S3 storage, but that’s more suited to blob data—not ideal for documents

Why wouldn't an S3 blob be an ideal place to store a document of unknown size that you don't care about indexing? Later he says, "In the DynamoDB record, simply store the identifier for the S3 object. That doesn’t sound like much fun, but it would be doable." Is storing an S3 pointer worse than deploying a solution that will fail on the first document that exceeds 64KB, at which point he'll need to come up with a scheme to split large docs across multiple records? Especially when DynamoDB storage costs 10 times more than S3 storage ($1/GB/month vs. $0.095/GB/month).
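The pointer pattern being described is simple enough to sketch. Pure logic only, no AWS calls; the 64 KB item limit and the per-GB prices are the figures quoted in this thread, and the key layout is invented for illustration:

```python
# Sketch of the "S3 pointer" pattern: inline small documents in the
# DynamoDB record, spill large ones to S3 and store only the object key.
DYNAMO_RECORD_LIMIT = 64 * 1024          # 64 KB item limit cited above

def make_record(doc_id, body):
    if len(body) <= DYNAMO_RECORD_LIMIT:
        return {"id": doc_id, "body": body}
    # Too big to inline: the record holds only a pointer to the S3 object
    # (the actual body would be uploaded to S3 under this key).
    return {"id": doc_id, "s3_key": f"articles/{doc_id}"}

small = make_record("a1", b"x" * 1000)
large = make_record("a2", b"x" * (80 * 1024))
print("s3_key" in small, "s3_key" in large)  # False True

# Storage cost per GB/month quoted in the parent comment:
dynamo_cost, s3_cost = 1.00, 0.095
print(round(dynamo_cost / s3_cost, 1))  # 10.5
```

No split-across-records scheme needed, and the bulk of the bytes land in the storage tier that's an order of magnitude cheaper.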

The AWS platform and the ease of scaling it offers. The application can actually scale itself with their API. I know you can scale *sql horizontally, but you can't argue that it's as easy.

From TFA: "Our client said they didn't need a full-text search on the text or abstract of the documents; they only cared about people searching keywords and categories. That’s fine—we could always add further search capabilities later on, using third-party indexing and searching tools such as Apache Lucene."

I've been messing around writing my own Java NoSQL CMS called Magneato. It stores articles in XML because I use XForms for the front end (maybe a bad choice, but there isn't a good forms solution yet, not even with HTML5), and I use Lucene/Bobo for the navigation and search side of things. It is focused on faceted navigation, although you can have relations between articles (parent of, sibling, etc.) via Lucene.

It actually sounds like my efforts are better than what this team has produced.

We decided that MongoDB was adequate but didn't leverage the synergies we were trying to harvest from our development methodologies.

We looked at GumboDB and found it was lacking in visualization tools to create a warehouse for our data that would provide a real-time dashboard of the operational metrics we were seeking.

Next up was SuperDuperDB which was great from a client-server-man-in-the-middle perspective but required a complex LDAP authentication matrix that reticulated splines within our identity management roadmap.

I think you're trying to be funny (or at least sarcastic), but the last time I worked on a system that stored multiple values in a field as a delimited string--as this guy proposes storing multiple authors and keywords--was for a late-90s dotcom running a web site off of an Access 97 mdb.

I guess it's ironic that we both recently posted articles that documented our thought process on choosing a NoSQL database, but coming to an opposite conclusion. Apologies, maybe it was just coincidental?

The same sentiment is echoed multiple times in the article. So this really isn't about how a development team chose DynamoDB over MongoDB; rather, the financial limitations of the client mandated that the development team use DynamoDB. In fact, the article is more in favor of using MongoDB over DynamoDB, but the client's requirements forced their hands into using an alternative that was less favorable for development.

We weren't thrilled about this, because writing your own indexes can be problematic. Any time we stored a document, we would have to update the index. That's fine, except if anything goes wrong in between those two steps, the index would be wrong. However, the coding wouldn't be too terribly difficult, and so we decided this wouldn't be a showstopper. But just to be sure, we would need to follow best practices, and include code that periodically rebuilds the indexes.

Hello, I'm a time traveller from 1973 where I've been fondly imagining you folks in the future had written software to solve this kind of problem in a more generic fashion. Back in the past we have some visionary guy by the name of Codd, and in my wilder dreams I sometimes imagine by the year 2000 someone has created some kind of revolutionary database software which is based on his "SEQUEL" ideas and does fancy stuff like maintaining its own indexes.
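Joking aside, the failure window in that quoted passage is easy to see in a sketch (invented in-memory structures standing in for the DynamoDB tables): store the document, update the index as a second step, and a crash between the two leaves the index stale until a rebuild pass fixes it.

```python
# Sketch of hand-rolled secondary indexing, as described in the quote:
# documents keyed by id, plus a keyword -> {doc ids} index maintained by hand.
documents = {}
keyword_index = {}

def store(doc_id, keywords):
    documents[doc_id] = {"keywords": keywords}
    # If the process dies right here, the index below is now stale.
    for kw in keywords:
        keyword_index.setdefault(kw, set()).add(doc_id)

def rebuild_index():
    """The 'periodically rebuild the indexes' best practice from the quote."""
    keyword_index.clear()
    for doc_id, doc in documents.items():
        for kw in doc["keywords"]:
            keyword_index.setdefault(kw, set()).add(doc_id)

store("a1", ["physics"])
documents["a2"] = {"keywords": ["physics"]}   # simulated crash: doc written, index not
print(keyword_index["physics"])               # {'a1'} -- stale
rebuild_index()
print(keyword_index["physics"] == {"a1", "a2"})  # True
```

This store-then-index gap is precisely what a transactional RDBMS closes for free: the row insert and the index update commit or roll back together.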