Force linux to not swap the JVM

Details

Description

The way mmap()'d IO is handled in Cassandra is dangerous. It allocates potentially massive buffers without any care for bounding the total size of the program's buffers. As the node's dataset grows, this will lead to swapping and instability.

This is a dangerous and wrong default for a couple of reasons.

1) People are likely to test Cassandra with the default settings. This issue is insidious because it only appears once a node holds enough data, there is absolutely no way to control it, and it doesn't respect the memory limits that you give to the JVM.

That can all be ascertained by reading the code, and people should certainly do their homework, but nevertheless, Cassandra should ship with sane defaults that don't break down when you cross some magic unknown threshold.

2) It's deceptive. Unless you are extremely careful with capacity planning, you will get bitten by this. Most people won't really be able to use this in production, so why get them excited about performance that they can't actually have?
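
To make the concern concrete, here is a minimal Java sketch of how a file gets memory-mapped through the JDK. It is not Cassandra's actual I/O code; the class and method names (MmapSketch, mapReadOnly, mappedBufferIsOffHeap) are made up for illustration. What it shows is that the mapped region is a direct buffer living outside the Java heap, so heap limits such as -Xmx do not bound it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not taken from the Cassandra code base.
final class MmapSketch {

    // Maps the whole file read-only. The mapping stays valid after the channel is closed.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    // Writes a small temp file, maps it, and reports whether the buffer is off-heap.
    static boolean mappedBufferIsOffHeap() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".db");
            file.toFile().deleteOnExit();
            Files.write(file, "sstable bytes".getBytes(StandardCharsets.UTF_8));
            MappedByteBuffer buffer = mapReadOnly(file);
            // A mapped buffer is a direct buffer: its pages are managed by the OS, not the Java heap.
            return buffer.isDirect() && buffer.get(0) == (byte) 's';
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the kernel decides which of those pages stay resident, many large mappings compete for RAM with everything else on the machine, including the JVM's own heap.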

Activity

Jonathan Ellis
added a comment - 09/Feb/11 14:20 Because we don't want mmap'd data to be locked into memory – typical data sizes far exceed available RAM. The OS deals well with keeping hot mmap'd data paged in, so we want to let it do its job there. We just don't want it to be confused by the JVM's GC behavior into paging part of the JVM itself out.

Yang Yang
added a comment - 09/Feb/11 07:06 Jonathan:
Why was MCL_CURRENT chosen? I thought you would want to use MCL_FUTURE (ignoring the discussion above that these two seem to have the same value).
With MCL_CURRENT, SSTables that you mmap() later can supposedly still be paged out. Or maybe I am not understanding it correctly?
Thanks
Yang

Jon Hermes
added a comment - 18/Aug/10 20:44 +1.
It's a best-effort patch dependent on the OS (which is all we can do, short of defaulting to mmap_index_only and taking a performance hit by default). Assuming the average use case, this is a much better default than before.

Jonathan Ellis
added a comment - 18/Aug/10 20:07 v4 includes the ivy changes to download jna at build time.
Again, the relevant text from http://www.apache.org/legal/3party.html is, "LGPL v2.1-licensed works must not be included in Apache products, although they may be listed as system requirements or distributed elsewhere as optional works." We are not including JNA, nor are we even requiring it [although it explicitly states it would be fine to do so]. The only restriction is on distributing the LGPL work itself. So while Hadoop is welcome to pile additional restrictions on themselves, this is fine for us, since (and perhaps this wasn't clear) dependencies we pull in with ivy are build-time only and are not distributed with our source or binary artifacts.
(FWIW it is also fine for an Apache-licensed Debian package to declare a dependency on an LGPL one.)

Folke Behrens
added a comment - 18/Aug/10 05:38 I meant that com.sun.jna.Native would be a hard runtime dependency of FBUtilities. (NoClassDefFoundError != ClassNotFoundException) Cassandra wouldn't start without JNA, or did I miss something?

Folke Behrens
added a comment - 18/Aug/10 00:10 Wouldn't it then be better if tryMlockAll() loads another class with Class.forName() and catches the ClassNotFoundException if JNA jar is not on the classpath?
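
The distinction raised in these two comments can be sketched in plain Java. This is only an illustration of the Class.forName() idea, not code from any of the attached patches; the class and method names (OptionalDependencySketch, isOnClasspath, jnaAvailable) are hypothetical. It probes for a class by name so that nothing in the calling class links against JNA directly.

```java
// Hypothetical helper; names are illustrative and not taken from the patch.
final class OptionalDependencySketch {

    // Returns true if the named class can be loaded, without running its static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, OptionalDependencySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class itself is absent.
            // LinkageError (e.g. NoClassDefFoundError): the class was found, but something
            // it needs, such as a superclass, could not be loaded.
            return false;
        }
    }

    // Probe for the JNA entry point mentioned in the thread.
    static boolean jnaAvailable() {
        return isOnClasspath("com.sun.jna.Native");
    }
}
```

A caller would check jnaAvailable() first and only then touch a separate class that references JNA types, which keeps a missing jar from surfacing as a NoClassDefFoundError at startup.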

Todd Lipcon
added a comment - 17/Aug/10 18:25 I don't think it's kosher to pull in LGPL as a build dependency with ivy either - in Hadoop we dynamically linked some JNI against LZO (LGPL licensed) but it was decided even that was not allowed, so we had to move the entire LZO support out to github.
Regarding the FD issue, although reflecting out the FD field isn't that portable, I've seen it done in an awful lot of places, so I don't think it's going to change any time soon. There's a patch in the works for Hadoop that adds some JNI calls for IO-related things, and we grab the fd field there. There's also an interface sun.misc.JavaIOFileDescriptorAccess which you can sneak out of sun.misc.SharedSecrets, if that makes you feel better than using reflection.

Jonathan Ellis
added a comment - 17/Aug/10 17:16 Patch that uses JNA, with catches for various error conditions and more informative logging where possible.
As discussed above, we can't ship JNA with Cassandra but we can pull it in with ivy at build time. So one of the conditions handled is simply "JNA doesn't exist at runtime." (But we don't need to resort to reflection to allow it to compile without JNA.) [A sufficiently recent version of JNA is not available in the main public maven repo, and that won't change in the near future, so we will host one on Riptano's repo. I will update this patch when that is ready.]

Peter Schuller
added a comment - 17/Aug/10 09:13 Well, posix_fadvise() is potentially a bit more problematic than mlockall(). It again takes flags, whose values I suppose may be as standardized in practice as those for mlockall() supposedly are (though I have not yet checked). In addition it takes an off_t which, being an abstract type, raises potential portability concerns. A quick Googling suggests (http://markmail.org/message/qvf7hhq2mgmwwmw3) that JNA has some particular support for the off_t data type, though I did not find it right now in the API docs (will have to check more carefully).
The other thing is that posix_fadvise() will need a file descriptor in integer form. java.io.FileDescriptor is decidedly abstract and does not expose this information (which is understandable). I am not aware, off hand, of a good way for us to obtain the relevant underlying file descriptor; anyone? Molesting FileDescriptor with reflection should technically do the trick with openjdk/sun derived VMs (at least based on current openjdk7 FileDescriptor.java), but... yuck.
If it weren't for the build problems implied by JNI I would strongly prefer it. Under the circumstances I'm not sure. One observation is that given the kind of ifs and buts one seems to have to resort to anyway, writing some simple semi-portable build rules in Ant, specifically targeting certain platforms and compilers, does not feel so bad. Even if one hard-codes each common platform to avoid solving the native build problem generally, that does not feel worse to me in practice than making the assumptions necessary with JNA and stuff like using reflection to access private fields...
As long as the native building remains optional and does not hinder anyone getting Cassandra to work with just Java, and as long as it is relatively easy for someone on an unsupported/problematic platform to simply build the JNI libraries themselves (doable by e.g. a simple Makefile with clear instructions for pointing to JDK headers etc), JNI feels pretty reasonable to me.
Thoughts? Am I painting a bleaker picture than reality with respect to using JNA?
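
For reference, the reflection trick mentioned above looks roughly like the following. This is a hedged sketch, not code from any patch on this issue: the class and method names (RawFdSketch, rawFd) are hypothetical, and it relies on the assumption that the JVM is OpenJDK/Sun-derived and keeps the descriptor in a private int field named "fd". Newer JDKs may refuse the setAccessible call unless java.io is explicitly opened to the caller, so the sketch reports "unknown" instead of failing.

```java
import java.io.FileDescriptor;
import java.lang.reflect.Field;
import java.util.OptionalInt;

// Hypothetical illustration of reading the integer descriptor via reflection.
final class RawFdSketch {

    // Assumption: the JVM stores the OS descriptor in a private int field called "fd".
    static OptionalInt rawFd(FileDescriptor descriptor) {
        try {
            Field field = FileDescriptor.class.getDeclaredField("fd");
            field.setAccessible(true); // may be rejected by the JVM's access checks
            return OptionalInt.of(field.getInt(descriptor));
        } catch (ReflectiveOperationException | RuntimeException e) {
            // Field missing, access denied, or a security manager said no: give up quietly.
            return OptionalInt.empty();
        }
    }
}
```

Returning an empty OptionalInt lets a caller skip the posix_fadvise() hint entirely on JVMs where the field is not reachable, which matches the "optimization only" spirit of the discussion.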

Folke Behrens
added a comment - 17/Aug/10 05:36 Note that this is a political matter, not a legal one. It's against the ASF policy to distribute packages containing LGPL code. The licenses are compatible.

Jonathan Ellis
added a comment - 17/Aug/10 04:43 Ugh, that's a pain. (JFFI is also LGPL.)
It's not a deal breaker for us since we'd like to use it for basically optimizations... ASF says "LGPL v2.1-licensed works must not be included in Apache products, although they may be listed as system requirements or distributed elsewhere as optional works" so that would be workable if sub-optimal.
Curious if Peter thinks we're going to have to go raw JNI for fadvise on compactions. If we're going to have to bite that bullet anyway then JNA gets less interesting.

Todd Lipcon
added a comment - 17/Aug/10 03:58 AFAIK JNA is LGPL and thus incompatible with the Apache 2 license. I've wanted to use it in other ASF projects, too, and it's a pain there isn't an Apache-licensed alternative. If some of the Cassandra people are interested in a cleanroom implementation, I'd be interested in helping, though!

Peter Schuller
added a comment - 09/Aug/10 17:55 It all sounds reasonable.
So I take it the way forward would be to take your JNA version and combine with the configuration/policy parts of my patch (assuming people agree that those parts are a good idea) and go for that version for now and maybe move to JNI in the future if JNI becomes a dependency anyway for some other reason.
Any objections?

Folke Behrens
added a comment - 09/Aug/10 15:57
> How does the JNA approach behave if there is no C library (Windows?) or mlockall doesn't exist (OS X?)
In case of Mac OS X an UnsatisfiedLinkError will be thrown. Windows? I don't know. Maybe a JNA-specific exception, maybe a ULE, too. OSes can be easily detected with Platform.isXXX() and dealt with accordingly.
> something as simple as "grab errno" became a holy mess of portability concerns.
Yes, but errno is a particularly hard case. The "inventors" messed up big time with this. That's why the JNA developers provide two ways to check errno: you either mark your methods with "throws LastErrorException" or you ask Native.getLastError(). This works under Windows, too.
> The proposed JNA patch seems to suffer from exactly this problem as far as I can see, making assumptions about what the concrete values are of MCL_CURRENT and MCL_FUTURE.
Theoretically you're right; in practice, however, I can't find a single POSIX system that assigns different values to MCL_CURRENT or MCL_FUTURE, and I think it's highly unlikely that these will change in the future. If so, Cassandra's code can be adjusted.
> As far as I can tell, once one has gotten over the initial one-time hurdle of using JNI and the associated building issues, you have a much more correct/standards-compliant access to the native platform than through JNA since you're in compile time with access to appropriate headers etc.
> Please do correct me if I'm wrong, since the idea of avoiding compile time/build issues is certainly very attractive and the reason why I tried to find an acceptable solution with JNA in the past.
You're absolutely right, and your JNI code is really superb. If Cassandra needs to bind a couple more native functions I'd say JNI is the way to go. But not just yet.

Peter Schuller
added a comment - 09/Aug/10 08:44 I'll admit I did not investigate JNA (or POSIX-JNA) for this particular case. Last time I did, however, I found it lacking. Very trivial cases were okay, but even something as simple as "grab errno" became a holy mess of portability concerns.
I looked briefly at what posix-jna does, and I was unable to find any magic bullets in there and instead saw things like hard-coded constants that are non-portable and difficult to detect when they break due to changes to some particular platform.
The proposed JNA patch seems to suffer from exactly this problem as far as I can see, making assumptions about what the concrete values are of MCL_CURRENT and MCL_FUTURE.
As far as I can tell, once one has gotten over the initial one-time hurdle of using JNI and the associated building issues, you have a much more correct/standards-compliant access to the native platform than through JNA since you're in compile time with access to appropriate headers etc.
Please do correct me if I'm wrong, since the idea of avoiding compile time/build issues is certainly very attractive and the reason why I tried to find an acceptable solution with JNA in the past.

Peter Schuller
added a comment - 08/Aug/10 20:22 This is a draft (as in, submitted now as a work-in-progress for review rather than for commit) patch to add mlockall() support. It allows 'off', 'auto' and 'required' to be specified in the configuration, with the default being 'auto'.
mlockall() can fail either because the native JNI library is missing or because mlockall() itself fails; neither is a terminal condition unless 'required' is specified in the configuration file.
The patch currently does not address building and packaging, except for a toy change to build.xml that is more of an example for a human to use for testing.
I'd be interested to hear any opinions about how building and deployment should be handled given JNI libraries. I think it is important that no one is prevented from using Cassandra without mlockall() functionality due to native build issues, so it should presumably be optional. Even then, any suggestions for a preferred method of building JNI libraries portably in a way that hooks nicely into ant and the Cassandra build infrastructure? In particular taking into consideration deployment (e.g. how it fits into Debian packaging or similar infrastructure).
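
The 'off' / 'auto' / 'required' behaviour described in this comment can be sketched as a small decision function. This is a guess at the shape of the logic, not the draft patch itself; the names (MlockallPolicySketch, Policy, Outcome, decide) are hypothetical, and "lockSucceeded" stands for the combined result of "native library found and mlockall() returned success".

```java
import java.util.Locale;

// Hypothetical model of the configuration policy described in the comment above.
final class MlockallPolicySketch {

    enum Policy {
        OFF, AUTO, REQUIRED;

        // Parses the configuration value, e.g. "auto" -> AUTO.
        static Policy parse(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    enum Outcome { SKIPPED, LOCKED, CONTINUE_UNLOCKED, ABORT_STARTUP }

    // Decides what startup should do given the policy and whether locking worked.
    static Outcome decide(Policy policy, boolean lockSucceeded) {
        switch (policy) {
            case OFF:
                return Outcome.SKIPPED; // never attempt mlockall()
            case AUTO:
                // Failure is logged but not fatal.
                return lockSucceeded ? Outcome.LOCKED : Outcome.CONTINUE_UNLOCKED;
            case REQUIRED:
                // Failure is a terminal condition.
                return lockSucceeded ? Outcome.LOCKED : Outcome.ABORT_STARTUP;
            default:
                throw new AssertionError(policy);
        }
    }
}
```

With 'auto' as the default, a node without the native library still starts, which is the property the comment asks for.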

Jonathan Ellis
added a comment - 16/Jul/10 00:26 according to http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html , using huge pages automatically gives us the lock-jvm-heap-in-memory behavior we want, and may provide a substantial performance benefit as well.
See also: http://java.sun.com/javase/technologies/hotspot/largememory.jsp

Tupshin Harper
added a comment - 29/Jun/10 23:12 I am strongly in favor of defaults that are as flexible and stable as possible. If it is hard for even a relatively small percentage of users to get stable performance with mmap, then I would agree that the default should be standard I/O. There should then be a Cassandra Tuning wiki page that includes an mmap discussion.
That said, I also agree that it is worth doing the native code work to get mmap more stable with larger datasets and/or smaller machines.

James Golick
added a comment - 22/Jun/10 16:21 I have tried many levels of swappiness (including 0) without any change in behaviour. Additionally, I haven't seen much if any change in performance with standard IO.
Continuing to iterate on the mmap code might be a good idea. But, it's the wrong default. Especially now that we've agreed that it is currently broken. It's possible that it may be a sensible default in the future, but right now, it's not a good choice for production (in most cases).

Jonathan Ellis
added a comment - 22/Jun/10 05:16 It seems that what is happening is:
1) the JVM hasn't needed to run a major collection in a while,
2) so Linux says "I'll swap part of the JVM's heap so I can pull more of this hot sstable into ram,"
3) then the JVM goes to GC and thrashes pulling its heap in from swap.
The "right" solution is probably to use mlockall(MCL_CURRENT) on JVM start (with min heap = max heap so that gets pre-allocated). Then perform the mmapping.
mmap'd io is enough faster that this is probably worth biting the native code bullet for.
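
As a rough sketch of what calling mlockall(MCL_CURRENT) at startup could look like, the following reaches JNA purely through reflection so that it compiles and runs even when the JNA jar is absent. That is a choice made for this illustration only; it is not how the patch on this issue binds the function. The class, enum and method names are hypothetical, and MCL_CURRENT = 1 is an assumption taken from <sys/mman.h> on Linux x86/x86-64, which is exactly the kind of hard-coded constant debated elsewhere in this thread. The JNA calls used are NativeLibrary.getInstance(String), NativeLibrary.getFunction(String) and Function.invokeInt(Object[]).

```java
import java.lang.reflect.Method;

// Hypothetical best-effort wrapper; not the code attached to this issue.
final class MlockallSketch {

    // Assumption: value of MCL_CURRENT on Linux x86/x86-64. Other platforms may differ.
    static final int MCL_CURRENT = 1;

    enum Result { LOCKED, FAILED, UNAVAILABLE }

    // Tries to lock the pages currently mapped by this process; never throws.
    static Result tryMlockall() {
        try {
            Class<?> libraryClass = Class.forName("com.sun.jna.NativeLibrary");
            Object libc = libraryClass.getMethod("getInstance", String.class).invoke(null, "c");
            Object function = libraryClass.getMethod("getFunction", String.class).invoke(libc, "mlockall");
            Method invokeInt = function.getClass().getMethod("invokeInt", Object[].class);
            int rc = (Integer) invokeInt.invoke(function, (Object) new Object[] { MCL_CURRENT });
            // mlockall() returns 0 on success and -1 on error (for example, RLIMIT_MEMLOCK too low).
            return rc == 0 ? Result.LOCKED : Result.FAILED;
        } catch (ReflectiveOperationException | LinkageError e) {
            // JNA is not on the classpath, or libc / the mlockall symbol could not be linked.
            return Result.UNAVAILABLE;
        }
    }
}
```

Following the comment above, this would be called once at JVM start, before any SSTables are mmap'd, with the minimum heap set equal to the maximum heap (-Xms equal to -Xmx) so that the heap is already allocated when the lock is taken.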

Jeff Hodges
added a comment - 20/Jun/10 22:41 This is one of the very first things we've had to do with every cluster we've built. The mmap implementation just does not work for anything I've seen in production beyond trivial datasets. This would be a wonderful, reality-driven change.