
Last week I was in a chat with a couple of guys who had just started experimenting with large-scale databases. At one point I said: “Unless you can afford enough RAM to fit all your important data along with the indexes, don’t even consider using MongoDB!”. To my surprise, these guys were shocked and expressed great disappointment… It seems the general idea newcomers get when considering NoSQL databases is that they provide a huge performance boost for free. After working with MongoDB for 18 months, and having been disappointed myself when discovering how much reality differed from the hype, I beg to differ. Everything comes at a cost, and in this case that cost is greater memory and disk space requirements.

Why all the RAM?

MongoDB gets most of its performance gains from the fact that it tries to keep as much data in RAM as possible. As we saw in last week’s article, memory consumption in MongoDB can become a pitfall if not well planned out. It’s true that MongoDB offers other powerful methods of improving performance, but no matter how much you optimize your code, it won’t help if you simply cannot load the data fast enough from wherever it is stored. Therefore, if you are continuously generating IO requests to load your working data from a hard drive, most of the performance gains will be lost.

But why wouldn’t you want to consider MongoDB if there is not enough RAM? The reason is that you will reach a point where the amount of RAM available is so small compared with the data you’re working with that your working speed becomes directly dependent on your IO speed (i.e. on how fast you can load data from the hard drive into memory). In that case, you would want to store your data in the smallest space possible so that it can be read faster from the hard drive(s). That is why some DBMSs provide optional data compression, and unfortunately this is where MongoDB falls short. At the time of writing, MongoDB does not yet provide any means of data compression. Apart from this, it also tends to consume more disk space than its RDBMS counterparts.

MySQL vs MongoDB – Disk Usage Benchmark

For the purpose of this post I created a small case study with some benchmarks to provide you with tangible comparison results. I built a sample data set which was stored in both MySQL and MongoDB so that their disk space usage could be compared. Since I have been working on software usage analytics for a while, I used a very basic sample data set that could be used to store usage data in its most simplistic form. Now, before you go on to post comments such as “…but there are better ways to store that data…”, keep in mind that this is not an optimization exercise. This test is simply intended to compare like with like between MySQL and MongoDB when it comes to disk space usage. For this test, assume we want to store daily data about every user. We will be storing the date, some form of user ID, the product version number, the build number and the language of the product installed on the user’s machine. We will also store some basic platform information, such as the Operating System version and OS language, as well as product usage statistics such as how many times the product was run and how long the user spent interacting with the monitored product (represented in runtime minutes and sessions). These are shown in the table below:

Field   Meaning
_id     ID Field
dt      Date
us      User ID
ve      Product Version
bl      Product Build
ov      OS Version
ol      OS Language
ln      Product Language
rt      Runtime Minutes
se      Sessions

You may notice I have also added a field named “_id”. That is because MongoDB always stores such a field, which is also indexed. This field can be filled with custom data, or else MongoDB can generate its own ID. For this sample, I’m filling all the fields with integers; in real life these might then be mapped to friendly names. The field names I used were all only two characters long. In MySQL this doesn’t make much difference, but MongoDB stores the field names inside every single document, so the longer the names, the more space each record takes. Here we try to keep the data as small as possible.
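To illustrate why the short field names matter, here is a small sketch comparing the same record with two-character names versus descriptive ones. JSON stands in for BSON here (both formats repeat the field names inside every document), and the long-name variants are my own hypothetical equivalents, not names from the original data set.

```python
import json

# The same daily-usage record, once with the post's two-character
# field names and once with hypothetical descriptive names.
short_names = {"_id": 1, "dt": 20120101, "us": 42, "ve": 3, "bl": 1500,
               "ov": 601, "ol": 1033, "ln": 1033, "rt": 95, "se": 4}
long_names = {"_id": 1, "date": 20120101, "user_id": 42,
              "product_version": 3, "product_build": 1500,
              "os_version": 601, "os_language": 1033,
              "product_language": 1033, "runtime_minutes": 95,
              "sessions": 4}

# Because field names are stored per document, the long-name record
# is larger even though the values are identical.
short_size = len(json.dumps(short_names))
long_size = len(json.dumps(long_names))
print(short_size, long_size)
```

Multiplied across millions of documents, that per-record difference adds up to a noticeable amount of disk space.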

Benchmark Results

The test sample was built to simulate data collected from an application being used by 10,000 users daily for 365 days. Therefore, I inserted 3.65 million records. In MySQL I also added an index on the “_id” field to emulate the index that MongoDB forces on that field. Since MySQL offers the option of compressing both the data and the indexes, I also measured the data size with compression turned on; the results in the third column were obtained after enabling compression with a key block size of 4KB. The resulting data sizes were as follows:
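A generator along these lines can produce such a synthetic data set. This is a sketch of my own, not the original benchmark code; the value ranges and the seed are arbitrary assumptions, chosen only to fill the fields with plausible integers.

```python
import datetime
import random

def generate_sample(users=10_000, days=365, seed=0):
    """Yield one usage record per user per day, using the post's
    two-character field names. Value ranges are illustrative only."""
    rng = random.Random(seed)
    start = datetime.date(2012, 1, 1)
    _id = 0
    for day in range(days):
        # Store the date as an integer like 20120101, keeping every
        # field a plain integer as in the sample data set.
        dt = int((start + datetime.timedelta(days=day)).strftime("%Y%m%d"))
        for user in range(users):
            _id += 1
            yield {"_id": _id, "dt": dt, "us": user,
                   "ve": rng.randint(1, 5), "bl": rng.randint(1000, 2000),
                   "ov": rng.choice([501, 601, 602]), "ol": 1033,
                   "ln": 1033, "rt": rng.randint(0, 480),
                   "se": rng.randint(1, 10)}

# With the defaults, 10,000 users x 365 days = 3.65 million records.
```

The same stream of records can then be bulk-inserted into both databases so that each engine stores exactly the same data.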

                MongoDB    MySQL InnoDB      MySQL InnoDB
                           (uncompressed)    (4KB block size compression)
Data:           306MB      251MB             113MB
Index:          83MB       59MB              22MB
Total:          439MB      310MB             135MB
Relative size:  100%       71%               31%

Conclusion

As can be seen from the results above, MySQL consumed less storage space both in terms of the data and the index. The difference became much more significant when enabling compression, where MySQL consumed just 31% of the space MongoDB needed to store the same data. Obviously, compression comes at the cost of slower writes, due to the CPU cycles spent compressing the data, so this also has to be taken into consideration. One should also point out that MongoDB can, in some scenarios, beat MySQL hands down on performance once the data is in RAM. However, if your IT budget is low and you cannot afford a server with enough RAM to host MongoDB, or if you are expecting a free performance boost, it might make more sense to stick with a traditional RDBMS, because switching to a technology like MongoDB might actually make your system run slower!


Paolo Dominict Umali

Even if you put your MongoDB data in RAM, you will still need to do I/O with a drive, as RAM is not where you store your data — RAM is temporary storage. IMHO, NoSQL databases are good only for not-so-important data, like a chat thread. I won’t spend too much time and money on not-so-important data.

Clifford Farrugia

You do have some good points there, Paolo, however not everything is that clear-cut. Regarding I/O, it depends on how many reads vs. writes your application needs to do. If you have enough RAM to fit all your data, almost all reads will come from RAM and you will only do disk I/O for writes, which suits a lot of applications that are read-heavy.

Another point regarding not-so-important data: that depends on what is important for you and what is not. It’s true that it’s probably not a very good idea to use NoSQL databases for, say, financial transactions, but there are some very good use cases. For example, imagine using one to store logs that you will later build reports from. You will need that data to be in RAM unless you’re willing to wait for over half an hour to build the report. Since you’d be collecting millions of logs, losing one or two here and there would not have any effect on the reporting mechanism, because it’s just a drop in the ocean. So in such a case, a single log would not be so important, but collectively, that data could be very useful.

Paolo Dominict Umali

Good follow-up points you shared in there too.

thinktanktheory

I’m new to NoSQL, but my understanding is that they are preferred over RDBMSs because they offer much faster reads/writes, or I/O as you guys put it.
I’m currently exploring Mongo due to a requirement to write data to disk as fast as possible.
So since the data can be written to disk and read off later, why is there the issue of data loss?
My only concern is how Mongo decides which data goes into RAM, and which to free up. That’s something I need to look further into.

Clifford Farrugia

The issues regarding data loss have been largely exaggerated by some critics, and most of the issues are more related to the default configuration rather than what MongoDB is capable of. By default, writes to MongoDB operate in a “fire and forget” mode where there is no guarantee that the data will actually ever be written to disk (although in most cases it does). This offers very good performance, but cannot be used in cases where one cannot afford data to be lost. For such cases, you can set the “write concern” to “journaled” so that the client will wait for a confirmation from MongoDB that the data has been written to the journal, and can therefore survive server reboots, etc. You can get more information about Write Concern here: http://docs.mongodb.org/manual/core/write-operations/#write-concern
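As a rough illustration of the difference (the collection name and documents below are made up, and the exact shell syntax may vary between MongoDB versions — see the linked documentation for the authoritative form), the two modes look something like this in the mongo shell:

```
// "Fire and forget" (the old default): the client does not wait
// for any acknowledgement that the write reached disk.
db.logs.insert({ msg: "page hit" });

// Journaled write concern: the client waits until the write has
// reached the on-disk journal, so it survives a crash or reboot.
db.logs.insert({ msg: "payment" }, { writeConcern: { j: true } });
```

The trade-off is exactly the one discussed above: the journaled form is slower per write, but it is the one to use when losing a record is not acceptable.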