I have a directory that contains hundreds of thousands of files.
I need to list a sample of these files (say, 10 files) without processing all the files found in the directory, which would take too much processing time.

@invert Actually, sorting a list of hundreds of thousands of files is probably not such a good idea. Maybe use ls -f | tail -n 1000 instead of just ls for the answer in the link
– daniel kullmann, Jul 13 '12 at 8:09

tail is awful in this case, since it will wait for ls to finish its work! head would terminate it after reading the 1000 lines it needs.
– lynxlynxlynx, Jul 13 '12 at 9:33

However you throw it, you're going to have to interrogate the inode structure. You could use a file indexer to query instead; note that it will require an initial directory scan too. See man locate.
– invert, Jul 13 '12 at 9:43

1 Answer

I don't think you can sample from the whole file list without reading them all in one way or another, even at the filesystem level.

Unless their names follow a pattern, that is (e.g. fileXXXXXXX), in which case you could pregenerate a random list of names before accessing the files, as sketched below. For such a large number of files it would be odd if their names were random.
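A minimal sketch of that lucky case, assuming names of the form fileNNNNNNN (seven digits) and GNU shuf; the range, count, and name pattern are placeholders:

```
# pick 10 random numbers in the assumed name range, then build the names
# (file%07d is a hypothetical pattern -- adjust to your actual naming scheme)
for i in $(shuf -i 0-9999999 -n 10); do
    name=$(printf 'file%07d' "$i")
    # a generated name may not actually exist, so test before using it
    [ -e "$name" ] && printf '%s\n' "$name"
done
```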

But let's assume you're not that lucky. Using find is preferred over ls, since it can terminate each filename with a null byte (-print0), making it immune to nonstandard characters in filenames. If we don't want to read all the filenames, the fastest option is to take the ones at the start of the listing. To get a better sample, I would first take a bigger sample ($oversamplesize below) and then do a random subselection of size $samplesize from it. I didn't manage to make sort -R or shuf work well with null separators, so the shuffling and final selection is done by awk:
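The original snippet is missing here, so the following is a reconstruction of the approach just described rather than the author's exact code. It assumes GNU coreutils (head -z for null-terminated input), GNU xargs, and gawk (which accepts a NUL record separator); samplesize, oversamplesize, the search path (~), and the final command (ls -ld) are the placeholders mentioned in the notes below:

```
samplesize=10
oversamplesize=100   # grab extra entries up front, subselect randomly below

find ~ -maxdepth 1 -type f -print0 |
  head -z -n "$oversamplesize" |
  awk -v RS='\0' -v ORS='\0' -v n="$samplesize" '
    { lines[NR] = $0 }                 # slurp the null-separated names
    END {
      srand()
      while (n-- > 0 && NR > 0) {
        i = int(rand() * NR) + 1       # random index among remaining names
        print lines[i]
        lines[i] = lines[NR--]         # swap the last entry into the hole
      }
    }' |
  xargs -r0 ls -ld                     # the "final command" -- replace as needed
```

The awk body keeps the whole oversample in memory, which is fine for a few hundred names; swapping each picked slot with the last remaining entry gives a uniform sample without replacement.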

Two notes here. For some reason, it often also prints an empty filename, so I increased the sample size just in case. The trivial note: don't forget to change the search path (~ here) and the final command.