Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community.

I have a MongoDB collection that works like a queue: new documents are inserted, and old documents (after 60 days) are removed. I can see a rapid growth of the data file size, too rapid. It may be reasonable, since we only remove old data after 60 days, but I was wondering: are my deletes effective without running a defragmentation? (In short: what is a good way to manage disk space in MongoDB?)

What is a correct defragmentation / collection cleanup policy? It is a production database, and the version is 2.6.9.

Basically: 1. for a 100% fixed collection size, use a capped collection; 2. use the latest version of MongoDB, where space is reused as well as possible; 3. do some maintenance every few weeks (compact/repair/...); 4. if there are constant timeouts in the function that is responsible for allocating free space; 5. some rarer issues mentioned in that article.
– aldwinaldwin, Aug 6 '15 at 9:05

What specific version of MongoDB are you using (and if >3.0, what storage engine)?
– Stennie, Aug 10 '15 at 0:27

2 Answers

Reasons for unexpected growth of data files

"Data fragmentation" and data file preallocation

When a document is deleted, its space is reused right away if a new document fits into that space. Let's say you delete a document which takes 1 KB of disk space and a new document requiring 0.9 KB of disk space is synced to disk: the first sufficiently large free slot (the deleted document's, in our example) will be used. Now assume the new document needs 1.1 KB. In a worst case scenario, a new data file of 2 GB has to be preallocated even though only 0.1 KB of space was missing. The reason data files are preallocated is a good one, by the way: allocating them on demand during a disk sync would simply take too long.
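The reuse rule described above can be sketched as a toy model. This is pure JavaScript, not MongoDB code; sizes are in KB, and the first-fit strategy is a simplification of MongoDB's actual free-list behavior:

```javascript
// Toy model of deleted-record reuse: a new document takes the first
// free slot it fits into; if none fits, a whole new file is allocated.
var freeSlots = [1.0];               // one freed 1 KB slot (a deleted document)
var FILE_SIZE_KB = 2 * 1024 * 1024;  // data files are preallocated in 2 GB chunks

function store(docKb) {
  for (var i = 0; i < freeSlots.length; i++) {
    if (docKb <= freeSlots[i]) {
      freeSlots.splice(i, 1);        // reuse the freed slot
      return "reused slot";
    }
  }
  return "preallocated new " + FILE_SIZE_KB + " KB file";
}

store(0.9);  // fits into the freed 1 KB slot
store(1.1);  // no slot fits: a new 2 GB file is preallocated
```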

Padding

When a document is written, some extra space (padding) is added so the document can grow in size without triggering a rather expensive document migration each time. Documents are migrated when they no longer fit into their position in the data file, since

Documents are never fragmented

So if your documents grow beyond their padding, they have to be migrated and new padding is applied. It may well be that millions of gaps in the data files each offer nearly enough space for a 1 KB document, yet a new data file still has to be preallocated.

Another "problem" is the way padding is calculated. As of MongoDB 2.6, record sizes are by default rounded up to powers of 2. So let's assume your document is 513 bytes in size. Since the next power of 2 is 1 KB, almost half of the space allocated for the document is unused until it grows in size. So in a worst case scenario, half of the space allocated for your data files minus 1 byte per document might be "wasted".
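The rounding rule can be sketched in a few lines of JavaScript. The 32-byte minimum quantum here is an illustrative assumption, and real MongoDB switches to a different scheme for very large documents; this only shows the power-of-2 idea:

```javascript
// Sketch of MongoDB 2.6's default "power of 2 sizes" record allocation.
function allocatedRecordSize(documentBytes) {
  var size = 32;                 // minimum allocation quantum (illustrative)
  while (size < documentBytes) {
    size *= 2;                   // round up to the next power of 2
  }
  return size;
}

var doc = 513;                              // document size in bytes
var alloc = allocatedRecordSize(doc);       // 1024 bytes allocated
var wasted = alloc - doc;                   // 511 bytes unused until the document grows
```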

Increased usage

Your application might well be gaining momentum, and there simply is more data stored than you expected. Congratulations!

What to do

Usually, one of three ways of dealing with data file growth is suggested.

the compact command

the repair command

Forcing a resync from replica set members

I'll go over them with their Pros and Cons from my point of view and explain why I think all of them are improper ways of dealing with that data file growth.

The compact command

How it works

The compact command defragments the data files of a collection. It does so by preallocating a new 2 GB data file and moving documents around until there are no gaps between them any more.
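In the mongo shell, a compact run looks like this. The collection name "queue" and the option values are illustrative, and on 2.6 the padding options only apply to the MMAPv1 storage engine:

```javascript
// Defragment the records and indexes of the "queue" collection.
// This locks the database for the duration of the command.
db.runCommand({
  compact: "queue",
  force: true,          // required to run it on a replica set primary
  paddingFactor: 1.1    // re-apply 10% padding to the compacted records
});
```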

Pros

The compact command is relatively fast when compared to the other solutions. The defragmentation helps a bit to prevent unnecessary data file preallocation.

Cons

The database containing the target collection is locked during the execution.

No disk space is reclaimed

You really should have a backup of the target collection before using the compact command. So in order to have that backup, you need to over provision your disks by 2 GB (the additional data file) plus the size of your largest collection (for the backup). But with disks over provisioned like that, space would not be a problem in the first place.

It doesn't help at all when space really is a problem: if you are in a critical situation, the problems detailed above prevent you from using the compact command.

Why I don't think it is a proper solution

Well, it's kind of obvious: you lock your database, which means downtime. For really large databases, this means a lot of downtime, and all of this for the relatively small gain of potentially preventing one or two data files from being created (which means 4 GB of disk space at most).

The repairDatabase command

How it works

Simplified, the repairDatabase command creates a second instance of your database, iterates over the documents in the original database, verifies them and writes them into the new database in consecutive order. In the last step, the old database is deleted and the new database is renamed.

Pros

With proper planning, you can reclaim disk space with very little downtime, since the repairDatabase command can be run against secondaries. So you can do the following:

Run the repairDatabase command against all secondaries

Have the primary step down. This might lead to 3-5 seconds of downtime during the election of the new primary.

Run the repairDatabase command against the recently stepped down primary
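The rolling procedure above, sketched in mongo shell terms. The database name is illustrative, and depending on your setup you may need to restart a secondary as a standalone before repairing it; the `mongod --repair` variant shown in the comment repairs all databases offline:

```javascript
// Step 1: on each secondary in turn, rebuild its data files.
db.getSiblingDB("mydb").repairDatabase();   // "mydb" is illustrative

// ...or offline, via the mongod binary (repairs all databases):
//   mongod --dbpath /var/lib/mongodb --repair

// Step 2: on the primary, trigger an election (a few seconds of
// write unavailability while a new primary is elected):
rs.stepDown(60);

// Step 3: repeat step 1 on the former primary, which is now a secondary.
```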

Sounds nice, right? However, there are huge

Cons

You need to massively over provision your disks, since basically a copy of your database is made. Let's assume you run this command against a database which is already in an optimal state: to make sure the command executes successfully, you need at least as much free disk space as the database occupies when you issue the repair command. And since the repair command is potentially even more critical than the compact command, you should make a backup beforehand or use the backupOriginalFiles option.

Why I don't think it is a proper solution

The cons detailed above show that you have to provision your disks to at least 200% of your payload data. With that massive amount of disk space, you would not have a problem in the first place.

Forcing a resync from replica set members

How it works

You shut down a secondary, delete its data files and restart it. The node notices that it is effectively a new member added to the replica set and performs an initial sync. Since the initial sync is document oriented, only the necessary data files are allocated, potentially freeing formerly used disk space.

As with the repair command, you do this for all secondaries (one after another, of course), then have the primary step down, delete its data files and let it resync.
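In shell terms, the rolling resync looks roughly like this (the dbpath, service name and stepdown timeout are illustrative):

```javascript
// On each secondary, one at a time:
db.getSiblingDB("admin").shutdownServer();
// Then, on that host's OS shell, wipe the data directory and restart:
//   rm -rf /var/lib/mongodb/*      (dbpath is illustrative!)
//   service mongod start
// The member performs an initial sync from the replica set; wait until
// rs.status() shows it in SECONDARY state before touching the next one.

// Finally, have the primary step down and repeat the wipe/restart there:
rs.stepDown(60);
```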

Pros

You do not need to over provision the disks of an individual node

There is just very little downtime

It is a relatively straightforward process

Cons

This process takes a while, may well have some impact on performance and reduces your planned level of redundancy. Let me explain this in a bit more detail: when planning a replica set, you choose how many replicas you want to have, ranging from one (two data bearing nodes plus an arbiter) to 50 at the time of this writing. You have a good reason for this redundancy, whatever it may be. When arbitrarily shutting down replica set members in order to reclaim disk space, you effectively reduce or even eliminate your failover capabilities. So it is safe to say that in order to keep your desired level of redundancy during the resync, you need one additional node to maintain it.

Why I don't think it is a proper solution

Put plainly: putting half the money you would spend on the additional node into additional disk space should solve any space problem in the first place. However, this might not be possible in your case (although if so, that may well be due to under-dimensioned hardware), and thus the resync might be a viable solution in some cases.

Ok, smarty pants: What to do?

Frankly, from my experience, the need to reclaim disk space is a sure sign of a badly planned cluster.

Granted, MongoDB is not the most efficient when it comes to disk space consumption, but after a while, it levels out. So when MongoDB constantly adds new datafiles, you can be sure that you simply need more disk space.

This can be achieved either through vertical or horizontal scaling. If you can still scale vertically and get an adequate bang for your buck, your hardware was underprovisioned until now. Go for it; problem solved!

If you already get the most bang for your buck and the size of your data (not only the number of your data files) is constantly growing, it is time to scale horizontally, read: to shard your cluster.

As a rule of thumb: when more than 80% of your disk space is used and the size of your data hasn't shown a massive spike but is constantly growing, I'd add a shard or start sharding. Determining the exact threshold, and how to shard exactly, requires some experience and knowledge and is out of scope even for this long answer.

With this approach, the decision when to shard is based on empirical information, sharding is started early enough to prevent serious problems, maintenance effort and risk are reduced, and you can scale properly.

One last word: often people say that adding a shard is too expensive, or that they are not up to paying for three config servers in addition to the data bearing nodes, and so they start to shard their data manually. The reason for that is plainly a wrong calculation of their own costs and a wrong understanding of how to do things sustainably. In the long run, reinventing the wheel is going to bite you in the neck.

Sharding won't solve the fragmentation issue. In fact, the chunk moves will create additional fragmentation on the source shard. In the end, one of the methods that removes fragmentation should be applied.
– Antonios, Aug 16 '15 at 8:40

@Antonios Fragmentation isn't really an issue at all. It only becomes an issue when disk space is getting really tight or spinning disks are used (which is discouraged for good reasons); hence the question was narrowed down to a "good way to manage disk space in MongoDB". For the reasons above, using repair or compact really isn't a viable solution. To shorten my answer into a single sentence and make my approach a bit clearer: when using MongoDB, you should take data file fragmentation into account when planning the dimensions of your hardware, with all the consequences.
– Markus W Mahlberg, Aug 24 '15 at 14:39

MongoDB uses memory-mapped files (at least in v2.6), and disk fragmentation will cause fragmentation in RAM as well.
– Antonios, Aug 26 '15 at 12:30

@Antonios Well, that is correct. But fragmented memory is hardly an issue. We are talking about access times to random access memory somewhere in the range of 75 nanoseconds (give or take). Reading the documents from disk ("mmapped" basically means that pointers to the individual documents are stored in RAM) takes orders of magnitude longer: around 5 × 10^2 times, even for the fastest (and extremely expensive) SSDs. Mind you, a nanosecond is a billionth of a second, and the response time of an average human is 100,000,000 nanoseconds. Plenty of time.
– Markus W Mahlberg, Aug 27 '15 at 14:13

I am not disagreeing with you; I am just concerned about the memory footprint. With fragmentation you load unnecessary stuff into RAM and you might evict frequently accessed data.
– Antonios, Aug 28 '15 at 15:04

The best approach is to use a 3-member replica set.
Periodically, stop one of the secondaries, wipe its data directory and start it again. The secondary will begin an initial sync, which removes all fragmentation since it rewrites all data files from scratch.
Then do the same for the other secondary and perform a stepdown. The stepdown requires 15 seconds of downtime or even less, and one of the defragmented secondaries becomes the new primary. In the end, do an initial sync for the ex-primary.

When a secondary is removed from the replica set, there should not be any downtime at all.
– Markus W Mahlberg, Aug 10 '15 at 11:04


It's not removed from the replica set (as with rs.remove); it's shut down, so no downtime occurs. The only downtime is during the primary stepdown until a new primary gets elected, but that can be handled gracefully by the driver.
– Antonios, Aug 10 '15 at 20:38