fine-grained control over data directories

Details

Description

Currently Cassandra supports multiple data directories but no way to control what sstables are placed where. Particularly for systems with mixed SSDs and rotational disks, it would be nice to pin frequently accessed columnfamilies to the SSDs.

Jonathan Ellis
added a comment - 08/Jun/11 13:44 There's some tension between managing this cluster-wide and the actual data directory definitions being per-machine. Not sure what the best solution there is.

Since each keyspace is stored in a different sub-directory of the DataDiretories, you can already split the storage of different keyspaces with some clever mount options. Maybe we could give column families the same treatment?

Ryan King
added a comment - 08/Jun/11 16:55 Since each keyspace is stored in a different sub-directory of the DataDiretories, you can already split the storage of different keyspaces with some clever mount options. Maybe we could give column families the same treatment?

+1 on that. We have been discussing the same thing, for the same purpose. The only kink is that you don't want to do something like having a per-cf setting that is tied to local node details like paths. But simply placing CF:s in a named subdirectory (similar to the pg tablespace) which can, on a per-node basis, by a symlink or a mountpoint, avoids that.

This means there's no problem doing a rolling re-configuration of a cluster, and there is no need to realize before hand that you might want to move some particular CF and do something like assign it to a tablespace (to get the level of indirection). It all just works by default, and you can move CF:s at any time on any node without co-ordination other than the node being down for a bit.

I can foresee it being easier to accidentally start a node which seems to work but has some CF:s be completely empty, because Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory). Something simple like creating a marker of some kind on CF creation might help with that; on start-up CF:s that are missing the marker could be rejected. But - I suppose this is overkill at least initially.

Peter Schuller
added a comment - 08/Jun/11 17:38 +1 on that. We have been discussing the same thing, for the same purpose. The only kink is that you don't want to do something like having a per-cf setting that is tied to local node details like paths. But simply placing CF:s in a named subdirectory (similar to the pg tablespace) which can, on a per-node basis, by a symlink or a mountpoint, avoids that.
This means there's no problem doing a rolling re-configuration of a cluster, and there is no need to realize before hand that you might want to move some particular CF and do something like assign it to a tablespace (to get the level of indirection). It all just works by default, and you can move CF:s at any time on any node without co-ordination other than the node being down for a bit.
I can foresee it being easier to accidentally start a node which seems to work but has some CF:s be completely empty, because Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory). Something simple like creating a marker of some kind on CF creation might help with that; on start-up CF:s that are missing the marker could be rejected. But - I suppose this is overkill at least initially.

Chris Burroughs
added a comment - 09/Aug/11 15:00 It would also be cool (but this is obviously speculative) to have the ability to keep Index files on an SSD, and the larger data files on rotating disks.

Pavel Yaskevich
added a comment - 16/Aug/11 16:27 After crushing my head again this for a few days I can say that this is more complicated that it sounds for a few reasons:
We will need to support both old/new directory structures which requires major changes in the way how Descriptor class works and how CFS and SSTable classes do file lookup and path generation.
Adds additional complexity to the way how we do backups, snapshots and recover which could potentially lead to some nasty bugs.
As Peter already mentioned "Cassandra won't be able to distinguish between an actual empty CF and a directory that wasn't mounted (or a symlink pointing to a non-mounted directory)".
There are more but those mentioned are major ones. Let's skip this for now.

Marcus Eriksson
added a comment - 30/Sep/11 12:23 submitting a "working" patch for separating column families into subdirectories, <ks>/<cf>/<cf>-xyz.db
There are alot of things to clean up/refactor, but submitting patch for comments.
Unit tests work both for new-style dirs and old (except for a backup test that i will fix when backups actually end up where they should).
TODO: (probably more than this)
snapshots need to go into the <ks>/<cf>/snapshots dir since the purpose of this patch is to make it possible to have the <cf>/ directory on a separate drive
incremental backups - same issue as for snapshots
refactoring of Descriptor class, quite hairy now
Disk space checking in column family subdirs

Thanks for your work, Marcus! First of all, can you please write your algorithm down so people can participate without reading actual code?

I don't think that this is the right way to go because I can't find how does it manage to keep backward compatibility with current directory structure (and that test SSTables should be moved is yet another confirmation) that would imply major change in the Descriptor/Component classes, all the current changes relay on DatabaseDescriptor.getPerCFDirectory() which is static "true" which makes it redundant. In my vision of things major changes should be made only in Descriptor, Component, ColumnFamilyStore and SSTable classes to change the way they create/lookup file locations and do backups and snapshoting.

Pavel Yaskevich
added a comment - 30/Sep/11 21:16 Thanks for your work, Marcus! First of all, can you please write your algorithm down so people can participate without reading actual code?
I don't think that this is the right way to go because I can't find how does it manage to keep backward compatibility with current directory structure (and that test SSTables should be moved is yet another confirmation) that would imply major change in the Descriptor/Component classes, all the current changes relay on DatabaseDescriptor.getPerCFDirectory() which is static "true" which makes it redundant. In my vision of things major changes should be made only in Descriptor, Component, ColumnFamilyStore and SSTable classes to change the way they create/lookup file locations and do backups and snapshoting.

It is not backwards compatible, i figured it was not worth the extra complexity, i imagine the "upgrade"-path to be:
1. shut down node
2. edit config file
3. symlink in SSDs and move the files into the subdirectories
4. start node

Of course, this could be done on startup by cassandra itself, but providing a shell-script that does it keeps it simple, what do you think? I really dont want the complexity with sstable files in two places at the same time.

My solution simply changed the Descriptor class to return paths to sstable files in subdirs, and then fix everything that was affected. Many of the classes modified were due to assumptions in the unit tests. I'm currently refactoring the Descriptor class to make it cleaner and a central place to glob for files etc. Expect a patch tonight or tomorrow.

I will of course change DatabaseDescriptor.getPerCFDirectory() to read the config file, kept it hard coded to make testing simpler.

Marcus Eriksson
added a comment - 01/Oct/11 06:58 It is not backwards compatible, i figured it was not worth the extra complexity, i imagine the "upgrade"-path to be:
1. shut down node
2. edit config file
3. symlink in SSDs and move the files into the subdirectories
4. start node
Of course, this could be done on startup by cassandra itself, but providing a shell-script that does it keeps it simple, what do you think? I really dont want the complexity with sstable files in two places at the same time.
My solution simply changed the Descriptor class to return paths to sstable files in subdirs, and then fix everything that was affected. Many of the classes modified were due to assumptions in the unit tests. I'm currently refactoring the Descriptor class to make it cleaner and a central place to glob for files etc. Expect a patch tonight or tomorrow.
I will of course change DatabaseDescriptor.getPerCFDirectory() to read the config file, kept it hard coded to make testing simpler.

Pavel Yaskevich
added a comment - 01/Oct/11 10:59 This should be done like compression which compresses only newly created SSTables, we don't want to make users to stop their node for indefinite time to change config and move files.

I don't think taking compression as a model makes sense. With compression, it's fine to set the option globally and each node can start using compression when it gets the schema update. But this affects the storage engine "below" schema.

I think it's reasonable to make the change offline, especially if an upgrade tool is provided to make it simple.

Jonathan Ellis
added a comment - 01/Oct/11 18:17 I don't think taking compression as a model makes sense. With compression, it's fine to set the option globally and each node can start using compression when it gets the schema update. But this affects the storage engine "below" schema.
I think it's reasonable to make the change offline, especially if an upgrade tool is provided to make it simple.

Marcus Eriksson
added a comment - 01/Oct/11 18:24 agree, gonna look at how much pain it is to actually build though
wont need alot of downtime if it is done offline, can rsync files into the subdir right before stopping the node

Marcus Eriksson
added a comment - 04/Oct/11 12:51 First version of a backwards compatible patch.
Approach is:
On startup, recursively glob all data directories to find existing sstables
whenever generating new files, look at the setting in the config, and put the file there
probably more, cant remember, my brain is messed up after fighting with this for a few days, look at patch!
this gives that a user can enable/disable the setting without mocking around manually with the sstables.
Comments? Yes, the setting is still hardcoded as a marker that this is not ready to be merged

Marcus Eriksson
added a comment - 05/Oct/11 14:42 Updated patch, working snapshots and incremental backups
also a bit of cleanups
btw, the patches are always against trunk, not against the older patches, is this the correct way to do it?

Anyone got a suggestion on how to do the available disk space checking? Reason in that now we cannot simply check which dataFileLocation that has the most available disk space, we need to check the subdirs where files are actually written.

My naive first approach would be to simply change DatabaseDescriptor.getDataFileLocationForTable(..) to take a column family name, which we ought to always know when calling that method (and a quick check seems to say that only compactions might prove a problem for this approach).

Marcus Eriksson
added a comment - 10/Oct/11 16:55 Anyone got a suggestion on how to do the available disk space checking? Reason in that now we cannot simply check which dataFileLocation that has the most available disk space, we need to check the subdirs where files are actually written.
My naive first approach would be to simply change DatabaseDescriptor.getDataFileLocationForTable(..) to take a column family name, which we ought to always know when calling that method (and a quick check seems to say that only compactions might prove a problem for this approach).
Comments?

Pavel Yaskevich
added a comment - 13/Oct/11 20:42 Looks better now but I don't like how you changed DescriptorTest especially testLegacy() "userActionUtilsKey" there is meant to be a CF name, so it seems like current version does not work correctly?

Good, can you briefly tell what is left? I see that ColumnFamilyStoreTest has an commented assertion and code still have some TODOs. Also please make sure you follow your CodeStyle and put space after if/for/while/... statements...

Pavel Yaskevich
added a comment - 13/Oct/11 21:27 Good, can you briefly tell what is left? I see that ColumnFamilyStoreTest has an commented assertion and code still have some TODOs. Also please make sure you follow your CodeStyle and put space after if/for/while/... statements...

Marcus Eriksson
added a comment - 14/Oct/11 07:20 Biggest part left is figuring out how to estimate disk space to know where to write an sstable, ill work on that this weekend
Other than that small cleanups (codestyle), unit tests and adding the config parameter

Marcus Eriksson
added a comment - 24/Oct/11 13:39 third version, this is could be considered feature-complete, changes compared to v2 of the patch:
make sure disk space is calculated correctly
if there is a subdirectory in the datadirectory with the same name as the CF, check in that directory, otherwise, check in the datadirectory (the subdirectory will be created)
add config param
add test for RecursiveGlob
make RecursiveGlob ignore case when comparing file names
add test for scrubbing old style CFs to new style CFs (or, modify an existing test to check this too)

Pavel Yaskevich
added a comment - 04/Nov/11 09:29 v3 looks good (needs rebase), here is my comments:
DatabaseDescriptor.getPerCFDirectory() should be renamed to something like "useSeparateCFDirectories"
in the ColumnFamilyStory.accept method file name should be checked depending on case because we can CF names are case-sensitive.
we don't have ColumnFamilyStore.rename method anymore which is a good thing for this patch, you can just rebase and remove related code.
in the Table.snapshotExists I think we should be more careful determining if snapshot actually exists.
TODO should be cleaned up.
Wish: if it's possible I think we can remove ColumnFamily name from the SSTable files if those are in the CF directory already.
I think we are almost done in here.

Marcus Eriksson
added a comment - 05/Nov/11 19:17 v4 attached
renamed DatabaseDescriptor method
made file name checking case sensitive again...
removed unused rename code (following the renameSSTables call path, it looks very strange, even in trunk, ill fix this after this patch is merged)
cant figure out a way to make snapshotExists any better, im looking for a directory with the snapshot tag in all places, not sure if it could be wrong here (even if it is quite naive)
removed dead todos

Tested v4 on trunk and I see few test failures - LegacySSTableTest and ColumnFamilyStoreTest, it is related to how you determine what placement style is used for a given SSTable, feels like !DatabaseDescriptor.useSeparateCFDirectories() condition is insufficient. Also I think it would be correct to rename "one_directory_per_column_family" option to "use_separate_column_family_directories" with comment like "Useful when you have disks with different speeds (HDD/SSD) and you want explicitly distribute Column Families between them".

Pavel Yaskevich
added a comment - 08/Nov/11 12:32 Tested v4 on trunk and I see few test failures - LegacySSTableTest and ColumnFamilyStoreTest, it is related to how you determine what placement style is used for a given SSTable, feels like !DatabaseDescriptor.useSeparateCFDirectories() condition is insufficient. Also I think it would be correct to rename "one_directory_per_column_family" option to "use_separate_column_family_directories" with comment like "Useful when you have disks with different speeds (HDD/SSD) and you want explicitly distribute Column Families between them".

make LegacySSTable test work by telling Descriptor that the file is old-style

make incremental backups work

problem was when migrating from old-style to new-style, the generation-counter in ColumnFamilyStore generated clashing ids, which was not a problem until it tried to hard link the files to the same directory with the same name. This should be refactored in CASSANDRA-1983 (i'll give that a try when this is merged)

Marcus Eriksson
added a comment - 08/Nov/11 19:19 4 patches, one is the same as the v4, the other three patches are
change configuration option name
make LegacySSTable test work by telling Descriptor that the file is old-style
make incremental backups work
problem was when migrating from old-style to new-style, the generation-counter in ColumnFamilyStore generated clashing ids, which was not a problem until it tried to hard link the files to the same directory with the same name. This should be refactored in CASSANDRA-1983 (i'll give that a try when this is merged)

I don't really like the idea of keeping a map of all generates for each of ks/cf and updating it all the time (this also implies that we need to resize it after cf is created/dropped). After re-thinking this once again it feels like we really should just make a tool to transfer SSTable to separate directories and back, make a flag in config that will indicate old/new style of file structure, because I don't see a error-prone way to handle this when old/new style SSTables are mixed...

Pavel Yaskevich
added a comment - 08/Nov/11 20:48 I don't really like the idea of keeping a map of all generates for each of ks/cf and updating it all the time (this also implies that we need to resize it after cf is created/dropped). After re-thinking this once again it feels like we really should just make a tool to transfer SSTable to separate directories and back, make a flag in config that will indicate old/new style of file structure, because I don't see a error-prone way to handle this when old/new style SSTables are mixed...

I have tested your latest patch using stress and sstablemover and then stress (+ flush/scrub/snapshot) again for Standard (+ secondary indexes) and Super ColumnFamilies and everything worked just fine for me. Here are my last comments and I think after this gets done we are ready to push your changes:

1). a). Tool should be made more user-friendly by adding at least --help and add an option to move sstables back to the keyspace directory (as we have an two states in the config);
b). I think it should delete old sstables after copy automatically or by asking user using --delete-old option?
c). Move sstablemover tool the ./bin instead of ./tools/sstablemover.

3). I think we should extend comment in the config describing what should be done before it is safe to change to change the option e.g. "Note: before setting this option to 'true' you should move existing SSTable files to the separate directories manually by using ./bin/sstablemover tool, the same applies for 'false' state - ./bin/sstablemover could be used to restore original directory structure (Please use ./bin/sstablemover --help for help and information)."...

Pavel Yaskevich
added a comment - 19/Nov/11 23:11 Thanks a lot for taking care about this, Marcus!
I have tested your latest patch using stress and sstablemover and then stress (+ flush/scrub/snapshot) again for Standard (+ secondary indexes) and Super ColumnFamilies and everything worked just fine for me. Here are my last comments and I think after this gets done we are ready to push your changes:
1). a). Tool should be made more user-friendly by adding at least --help and add an option to move sstables back to the keyspace directory (as we have an two states in the config);
b). I think it should delete old sstables after copy automatically or by asking user using --delete-old option?
c). Move sstablemover tool the ./bin instead of ./tools/sstablemover.
2). LegacySSTableTest fails on my machine can you, please, check that?
3). I think we should extend comment in the config describing what should be done before it is safe to change to change the option e.g. "Note: before setting this option to 'true' you should move existing SSTable files to the separate directories manually by using ./bin/sstablemover tool, the same applies for 'false' state - ./bin/sstablemover could be used to restore original directory structure (Please use ./bin/sstablemover --help for help and information)."...
Also please remove/implement remaining // TODO: comments.

I've only look at the last posted patchset (2749_proper.tar.gz) but I've found the following few problems:

The disk space free space checks doesn't take the new layout into account. In other words, it always checks the keyspace directory for free space. But in case we use CF directories, the keyspace directory could likely be on a disk with almost no freespace (and in any case, this won't always be the right disk to check).

The check that a snapshot is existing and the clearing of snapshot is broken for the CF directories layout (in Table.java)

The leveled manifest is not put into it's CF directory (it follows that its snapshot is not in the right place either).

I'm also not a fan of having the directory argument given to the Descriptor constructors not be the actual directory of the sstable it describes. I think it's error prone, making Descriptor harder to use (the fact that Descriptor.getDirectory() doesn't necessarily returns the directory you've fed to the constructor is confusing).

More generally, the current code dealing with the paths to the data directories is currently very badly encapsulated (not this patch fault at all). Which means that adding a new "parallel" data directory layout leaks everywhere and make the code hard to read/maintain. I strongly believe we should use the opportunity to improve that code encapsulation.

So attaching a patch that refactor a bit more the code to improve encapsulation. It essentially consists in adding a new Directories class that abstracts the actual layout of the data directory (including the fact that we can use multiple locations, which imho greatly clean the code). Aside from that, the patch reuse the sstablemover of the preceding patch (with an added verbose option). The patch also includes fairly extensive unit tests for the new class.

Now there is one thing that really bother me with this patch (the preceding patches have the same problem). It's the fact that the Descriptor.fromFilename method needs to query the global useSeparateCFDirectories option to work correctly. This has a number of practical consequences:

it is the reason for a number of super annoying gymnastic in the unit tests, like the need to duplicate the sstables used by LegacySSTableTest for instance. I've put all of this gymnastic in a separate patch (0002-fix-unit-tests.patch).

this means we cannot stream between two nodes, one using separate cf directory, one that don't (that limitation could be solve independently for streaming only, but it's worth considering a more general fix imo, in particular because a specific fix would be a tad ugly).

tools like the sstableloader will require that you put the sstable to load into a ksname/cfname/ directory when you use the separate directory option. It's not a big deal, but it feels arbitrary and annoying.

At first, I had hoped that we could make Descriptor.fromFilename 'detect' whether or not we were using the separate directory layout by using the fact that the file name contains the column family name already, but that doesn't work when the ksname == cfname (which is allowed afaik).

One option would be to stop relying on the directory structure the sstable is in to detect the keyspace name. We could do that if we included the keyspace name in the filename for instance. The pros would be to fix all the problem listed above, and in particular we would be able to allow stuffs like sstableloader <filename>, instead of requiring that the files are put inside a specifically name directory, which would be nice. I don't see much cons except making the file names longer. Any opinion on that idea?

Sylvain Lebresne
added a comment - 06/Dec/11 12:06 I've only look at the last posted patchset (2749_proper.tar.gz) but I've found the following few problems:
The disk space free space checks doesn't take the new layout into account. In other words, it always checks the keyspace directory for free space. But in case we use CF directories, the keyspace directory could likely be on a disk with almost no freespace (and in any case, this won't always be the right disk to check).
The check that a snapshot is existing and the clearing of snapshot is broken for the CF directories layout (in Table.java)
The leveled manifest is not put into it's CF directory (it follows that its snapshot is not in the right place either).
I'm also not a fan of having the directory argument given to the Descriptor constructors not be the actual directory of the sstable it describes. I think it's error prone, making Descriptor harder to use (the fact that Descriptor.getDirectory() doesn't necessarily returns the directory you've fed to the constructor is confusing).
More generally, the current code dealing with the paths to the data directories is currently very badly encapsulated (not this patch fault at all). Which means that adding a new "parallel" data directory layout leaks everywhere and make the code hard to read/maintain. I strongly believe we should use the opportunity to improve that code encapsulation.
So attaching a patch that refactor a bit more the code to improve encapsulation. It essentially consists in adding a new Directories class that abstracts the actual layout of the data directory (including the fact that we can use multiple locations, which imho greatly clean the code). Aside from that, the patch reuse the sstablemover of the preceding patch (with an added verbose option). The patch also includes fairly extensive unit tests for the new class.
Now there is one thing that really bother me with this patch (the preceding patches have the same problem). It's the fact that the Descriptor.fromFilename method needs to query the global useSeparateCFDirectories option to work correctly. This has a number of practical consequences:
it is the reason for a number of super annoying gymnastic in the unit tests, like the need to duplicate the sstables used by LegacySSTableTest for instance. I've put all of this gymnastic in a separate patch (0002-fix-unit-tests.patch).
this means we cannot stream between two nodes, one using separate cf directory, one that don't (that limitation could be solve independently for streaming only, but it's worth considering a more general fix imo, in particular because a specific fix would be a tad ugly).
tools like the sstableloader will require that you put the sstable to load into a ksname/cfname/ directory when you use the separate directory option. It's not a big deal, but it feels arbitrary and annoying.
At first, I had hoped that we could make Descriptor.fromFilename 'detect' whether or not we were using the separate directory layout by using the fact that the file name contains the column family name already, but that doesn't work when the ksname == cfname (which is allowed afaik).
One option would be to stop relying on the directory structure the sstable is in to detect the keyspace name. We could do that if we included the keyspace name in the filename for instance. The pros would be to fix all the problem listed above, and in particular we would be able to allow stuffs like sstableloader <filename> , instead of requiring that the files are put inside a specifically name directory, which would be nice. I don't see much cons except making the file names longer. Any opinion on that idea?

agree that the file structure is all over the place, that's why this patch was so incredibly painful to do

seems i only did the disk space checking in sub directories in the backwards compatible patch

i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.

but i agree, it is incredibly ugly, your modifications make it alot better, and yes, this would be a good time to do this properly, not like me just trying to do a minimal patch that "works".

regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more.

Marcus Eriksson
added a comment - 06/Dec/11 12:56 agree that the file structure is all over the place, that's why this patch was so incredibly painful to do
seems i only did the disk space checking in sub directories in the backwards compatible patch
i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.
but i agree, it is incredibly ugly, your modifications make it alot better, and yes, this would be a good time to do this properly, not like me just trying to do a minimal patch that "works".
regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more.

i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.

I tried that too, but this doesn't work when say you're opening a sstable by the sstableloader, because the files are not in a data directories then.

Sylvain Lebresne
added a comment - 06/Dec/11 13:04 i had on algorithm for detecting what kind of file it is in the old backwards compatible patch, it iterates over all data directories and figures out which data directory the file is in, then it knows the keyspace is the next part of the filename, and can check if there are files or directories in that directory.
I tried that too, but this doesn't work when say you're opening a sstable by the sstableloader, because the files are not in a data directories then.

I don't see any reason to continue to support the old-style directory layout. That adds complexity (operationally as well as in the code) for no benefit that I can think of. I think we should migrate from old layout to new on the first startup under 1.1.

regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more

I'm on the fence here – on the one hand having ks + cf in the filename simplifies some things. On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.

Jonathan Ellis
added a comment - 15/Dec/11 17:45 we cannot stream between two nodes, one using separate cf directory
I don't see any reason to continue to support the old-style directory layout. That adds complexity (operationally as well as in the code) for no benefit that I can think of. I think we should migrate from old layout to new on the first startup under 1.1.
regarding keyspaces in file names, sure, why not, guess having a header with this info in the file is out of the question, then the only meta data we have is the file name, right? A problem could be if we want to do CASSANDRA-1983 later, that would increase the file name length even more
I'm on the fence here – on the one hand having ks + cf in the filename simplifies some things. On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.

On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.

The thing with the metadata component is that from a code perspective, there is lots of places where we want to create a Descriptor, which involves extracting the keyspace/cf names only based on the filename. Adding the necessity to locate and read the metadata in those places will likely don't be very fun.

So I'd be in favor of just limiting the keyspace and column family names. It's one for which there is no real point to have very long names. Limiting each one to 32 characters shouldn't be a strong limitation.

Sylvain Lebresne
added a comment - 15/Dec/11 18:23
On the other hand, we allow arbitrary-length KS + CF names (up to 64K iirc) so UUID aside we're already in trouble on ext3/ext4, xfs, and ntfs, which all support max filename length of ~256. I'm starting to think we should move these into the metadata component instead of the filename.
The thing with the metadata component is that from a code perspective, there is lots of places where we want to create a Descriptor, which involves extracting the keyspace/cf names only based on the filename. Adding the necessity to locate and read the metadata in those places will likely don't be very fun.
So I'd be in favor of just limiting the keyspace and column family names. It's one for which there is no real point to have very long names. Limiting each one to 32 characters shouldn't be a strong limitation.

Marcus Eriksson
added a comment - 15/Dec/11 20:20 sounds great (both just supporting new-style-layout and limiting names to 32chars)
guess we need to supply a tool to rename sstables files if anyone is on longer names? and rolling upgrades are out of the question then right? (maybe the already are?)

guess we need to supply a tool to rename sstables files if anyone is on longer names?

We probably don't need to do anything. I don't think anyone is really using names long enough for them to it the file system limit, the goal of limiting the names is just so to prevent this from happening but there will be no other assumption that the names are short from the code.

I also don't think anything will prevent rolling upgrades, do you had something in mind?

Note: I have a long flight ahead of me so I plan to update my last patch with both those changes, as I still like the moving of all the directories handling in a dedicated class, even if we don't support both layout.

Sylvain Lebresne
added a comment - 15/Dec/11 20:40 guess we need to supply a tool to rename sstables files if anyone is on longer names?
We probably don't need to do anything. I don't think anyone is really using names long enough for them to it the file system limit, the goal of limiting the names is just so to prevent this from happening but there will be no other assumption that the names are short from the code.
I also don't think anything will prevent rolling upgrades, do you had something in mind?
Note: I have a long flight ahead of me so I plan to update my last patch with both those changes, as I still like the moving of all the directories handling in a dedicated class, even if we don't support both layout.

It might be worth adding a "are my filenames going to be too large" check against all KS + CF combinations before starting to migrate data files around, though. It would suck to end up with a partially converted database if some short CF names complete early on, before erroring out on a long one.

Jonathan Ellis
added a comment - 15/Dec/11 20:58 It might be worth adding a "are my filenames going to be too large" check against all KS + CF combinations before starting to migrate data files around, though. It would suck to end up with a partially converted database if some short CF names complete early on, before erroring out on a long one.

adds the keyspace into the file name (to not rely on the directory sstables are in)

limit keyspace and cf names to 32 characters

During the initial upgrade, we also check if the user would be fine with the current keyspace and cf name. If not, we just refuse to start. I hope this won't happen to anyone because renaming a CF/Keyspace in a rolling fashion is not something fun (or even possible for that matter). Note that this check don't fully enforce the 32 chars limitation however but rather tries to be as permissive as possible checking that any file path resulting for the upgrade should be less than 256 chars (the windows limit).

PS: the first patch is the bulk of the change, the second one is the unit tests. The later is huge because it renames the sstables in test/data to include the keyspace name in them.

Sylvain Lebresne
added a comment - 19/Dec/11 17:11
Patch attached (0001-2749-v2.patch and 0002-fix-unit-tests-v2.patch) that:
only support the new layout with sstables in cf directories
migrate automagically files the first time
adds the keyspace into the file name (to not rely on the directory sstables are in)
limit keyspace and cf names to 32 characters
During the initial upgrade, we also check if the user would be fine with the current keyspace and cf name. If not, we just refuse to start. I hope this won't happen to anyone because renaming a CF/Keyspace in a rolling fashion is not something fun (or even possible for that matter). Note that this check don't fully enforce the 32 chars limitation however but rather tries to be as permissive as possible checking that any file path resulting for the upgrade should be less than 256 chars (the windows limit).
PS: the first patch is the bulk of the change, the second one is the unit tests. The later is huge because it renames the sstables in test/data to include the keyspace name in them.

o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.

o.a.c.io.sstable.SSTableReaderTest won't compile

if you start with empty data directory you get following exception and process exits

INFO 23:51:11,987 Upgrade from pre-1.1 version detected: migrating sstables to new directory layout
ERROR 23:51:11,988 Exception encountered during startup
java.lang.NullPointerException
at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
java.lang.NullPointerException
at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
Exception encountered during startup: null

Pavel Yaskevich
added a comment - 22/Dec/11 21:23 Thanks, Sylvain! We are getting really close
Here are problems I found:
o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.
o.a.c.io.sstable.SSTableReaderTest won't compile
if you start with empty data directory you get following exception and process exits
INFO 23:51:11,987 Upgrade from pre-1.1 version detected: migrating sstables to new directory layout
ERROR 23:51:11,988 Exception encountered during startup
java.lang.NullPointerException
at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
java.lang.NullPointerException
at org.apache.cassandra.db.Directories.migrateSSTables(Directories.java:416)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:164)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:360)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:107)
Exception encountered during startup: null
on snapshot doesn't create or move (from older schema) index SSTables related to CF
shouldn't old "snapshots" directory be removed after move?

o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.

Fixed, thanks

o.a.c.io.sstable.SSTableReaderTest won't compile

Sorry, I forgot to check the test after a last rebase, fixed too (this involved renaming a number of sstables from test/data/legacy-sstables/hb to include the keyspace name, so that specific change is in the 2nd 'fix unit tests' patch to avoid polluting the 3rd one).

if you start with empty data directory you get following exception and process exits

Fixed. I've actually made two modifications: the migration checks the existence of the directory to avoid the NPE during listFiles(), but I've also modified the 'should we migrate' check to detect new nodes (checking if the system keyspace directory exists) and thus not print the migration message at all.

I'm not sure I see what this one is. Are we talking of the migration process?

In any case, you made me think about secondary indexes. Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs? Since the indexes name is not really something exposed (granted you don't have to be a genius to figure it out), it feels like it would slightly simplify administration to not put them in a separate directory.

I've updated the patch to implement this last idea (so indexes are in the same directory than their base cf), but it would be nice to have multiple opinions on that move since we don't want to have to do a new migration in 6 month because "we've changed our mind".

Sylvain Lebresne
added a comment - 23/Dec/11 11:27 Attaching rebased patches, with a 3rd patches (0003-Fixes.patch) addressing the Pavel's remarks. More specifically:
o.a.c.db.Directories comment should be updated because it still uses SSTable file name without keyspace.
Fixed, thanks
o.a.c.io.sstable.SSTableReaderTest won't compile
Sorry, I forgot to check the test after a last rebase, fixed too (this involved renaming a number of sstables from test/data/legacy-sstables/hb to include the keyspace name, so that specific change is in the 2nd 'fix unit tests' patch to avoid polluting the 3rd one).
if you start with empty data directory you get following exception and process exits
Fixed. I've actually made two modifications: the migration checks the existence of the directory to avoid the NPE during listFiles(), but I've also modified the 'should we migrate' check to detect new nodes (checking if the system keyspace directory exists) and thus not print the migration message at all.
on snapshot doesn't create or move (from older schema) index SSTables related to CF
I'm not sure I see what this one is. Are we talking of the migration process?
In any case, you made me think about secondary indexes. Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs? Since the indexes name is not really something exposed (granted you don't have to be a genius to figure it out), it feels like it would slightly simplify administration to not put them in a separate directory.
I've updated the patch to implement this last idea (so indexes are in the same directory than their base cf), but it would be nice to have multiple opinions on that move since we don't want to have to do a new migration in 6 month because "we've changed our mind".
shouldn't old "snapshots" directory be removed after move?
Your right, fixed (for backups too).

Pavel Yaskevich
added a comment - 23/Dec/11 11:48 I'm not sure I see what this one is. Are we talking of the migration process?
I was testing it like this :
run 1.1 without modifications
./tools/stress/bin/stress -n 50000 -S 512 -x KEYS
./bin/nodetool -h localhost flush Keyspace1 Standard1
./bin/nodetool -h localhost snapshot Keyspace1
made sure that Standard1.Idx-* SSTables are in the snapshots/<timestamp> directory
run 1.1 with you patch applied
checked if snapshots directory was moved and what files did it include - it was lucking Standard1.Idx-* files
cleaned data directory
repeated steps 1 - 5 but this time with your patch applied and it didn't include Standard1.Idx-* into snapshot
Maybe it is more "natural" to have secondary indexes sstables be in the same directory than the base cfs?
+1

Weird. I just tried the same scenario and everything worked correctly. I should mention that when moving the snapshots/backups, the migration process rename them to the new filename convention, so they will be called Keyspace1-Standard1.Idx-*. Or maybe I fixed it with the last version of the patch without realizing it.

Sylvain Lebresne
added a comment - 23/Dec/11 14:41 Weird. I just tried the same scenario and everything worked correctly. I should mention that when moving the snapshots/backups, the migration process rename them to the new filename convention, so they will be called Keyspace1-Standard1.Idx-*. Or maybe I fixed it with the last version of the patch without realizing it.

NTFS is technically okay with paths up to 32K long[1], but the windows api is limited to 256[2]. Common Linux filesystems have a limit of 255 bytes per path component (i.e. directory or filename) but no total path limit. However, Linux defines PATH_MAX and FILENAME_MAX, both 4096. [3]

In short: restricting KS and CF names to 32 characters is a good idea for the benefit of Windows portability. However, we may want to exempt Linux systems from the startup length check to allow easier upgrades.

Jonathan Ellis
added a comment - 02/Apr/12 22:22 Did some more research on path limitations:
NTFS is technically okay with paths up to 32K long [1] , but the windows api is limited to 256 [2] . Common Linux filesystems have a limit of 255 bytes per path component (i.e. directory or filename) but no total path limit. However, Linux defines PATH_MAX and FILENAME_MAX, both 4096. [3]
[1] http://en.wikipedia.org/wiki/Comparison_of_file_systems
[2] http://msdn.microsoft.com/en-us/library/aa365247.aspx
[3] http://serverfault.com/questions/9546/filename-length-limits-on-linux
In short: restricting KS and CF names to 32 characters is a good idea for the benefit of Windows portability. However, we may want to exempt Linux systems from the startup length check to allow easier upgrades.

Chris Chiappone
added a comment - 05/Oct/12 21:00 This actually causes a bunch of problems for customers that have used longer the 48 character keyspace and cf names. How will this affect the upgrade path for those customers?