Can one use a sorted file as an index?

Hopefully this won't be posted twice. I wrote a response, but it seems to have disappeared.

Tagsort appears to be the best/easiest solution. I tried sorting a partial file, merging it with the original file, and then indexing the file. The first steps ran significantly faster, but indexing the file turned out to take as much time as simply using the tagsort option.

For each index, it seems to do a tagsort and then write the tags and the RID (record identifier, which I think is what they call the address of each row) to the index.
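
To make the "tags plus row address" idea concrete, here is a minimal sketch in Python rather than SAS. It is only an illustration of the general technique, not a description of how SAS builds its indexes internally; the names `rows` and `build_tag_index` are hypothetical, and a list position stands in for the row address.

```python
# Toy "data set": each row is a dict; its list position stands in for the row address.
rows = [
    {"id": 30, "name": "carol"},
    {"id": 10, "name": "alice"},
    {"id": 20, "name": "bob"},
]

def build_tag_index(rows, key):
    """Return (key value, row address) pairs sorted by key value."""
    tags = [(row[key], addr) for addr, row in enumerate(rows)]
    tags.sort()  # only the small tags are sorted; the wide rows never move
    return tags

index = build_tag_index(rows, "id")
print(index)  # [(10, 1), (20, 2), (30, 0)]

# Reading through the index fetches rows in key order without rewriting the data.
ordered_names = [rows[addr]["name"] for _, addr in index]
print(ordered_names)  # ['alice', 'bob', 'carol']
```

The point of the sketch is that the sort only ever handles the key values and addresses, which is why the approach needs little temporary space but has to go back to the data to fetch each full row.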

When we sort and build an index in the same step, it would be nice if the compiler recognised that the requested index is supported by the sort keys, merged the two tasks, and reused the sort for index building. It might also be within the wit of a good compiler to create the tags for all requested indexes in one pass instead of re-reading the entire data for every index requested :-)
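
As a rough sketch of what "tags for all indexes in one read" could look like, here is a small Python illustration. It is a wish-list idea expressed in code, not something SAS is known to do; `build_indexes_one_pass` and the sample rows are hypothetical.

```python
def build_indexes_one_pass(rows, index_keys):
    """Collect tags for every requested index during a single scan of the rows."""
    tags = {key: [] for key in index_keys}
    for addr, row in enumerate(rows):  # the data is read exactly once
        for key in index_keys:
            tags[key].append((row[key], addr))
    for key in index_keys:
        tags[key].sort()  # each index sorts only its own small tag list
    return tags

rows = [
    {"id": 3, "city": "Oslo"},
    {"id": 1, "city": "Rome"},
    {"id": 2, "city": "Bern"},
]

indexes = build_indexes_one_pass(rows, ["id", "city"])
print(indexes["id"])    # [(1, 1), (2, 2), (3, 0)]
print(indexes["city"])  # [('Bern', 2), ('Oslo', 0), ('Rome', 1)]
```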

The tagsort option gives a single-threaded sort, whereas the standard proc sort operation is multi-threaded. Tagsort is helpful when you need to reduce temporary disk usage, either because you do not have the space or because your utility space is on very slow drives. As noted in the documentation for the tagsort option, it will sometimes have much higher processing time. I have tested this myself before on some very wide files, let's say 10,000 columns and 100,000,000 rows with a sort on two keys, and tagsort significantly increased processing time. The best ways to improve the performance of sorts on wide data are to increase the sortsize memory allocation, increase the number and size of data buffers, and make sure the disk you write utility files to is fast.
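
A back-of-the-envelope calculation shows why the temporary-space difference can be so large for a file of that shape. This is only a rough sketch in Python: it assumes every column is an 8-byte numeric and that each tag carries an 8-byte row address, and it ignores all overhead, so it does not model how SAS actually lays out its utility files.

```python
ROWS = 100_000_000
COLUMNS = 10_000
BYTES_PER_COLUMN = 8    # assumption: every column is an 8-byte numeric
SORT_KEYS = 2
ROW_ADDRESS_BYTES = 8   # assumption: size of the row address stored in each tag

full_row_bytes = COLUMNS * BYTES_PER_COLUMN                   # whole row moves in a standard sort
tag_bytes = SORT_KEYS * BYTES_PER_COLUMN + ROW_ADDRESS_BYTES  # only keys + address in a tag sort

full_sort_tb = ROWS * full_row_bytes / 1e12
tag_sort_gb = ROWS * tag_bytes / 1e9

print(f"full rows: about {full_sort_tb:.1f} TB")  # full rows: about 8.0 TB
print(f"tags only: about {tag_sort_gb:.1f} GB")   # tags only: about 2.4 GB
```

Under these assumptions the tags are a few thousand times smaller than the rows, which is the space saving; the price is the extra pass needed to fetch each wide row in sorted order.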

If you have a specific case in which tagsort is of significant benefit for a sort on big data, I would love to see examples, as this is a real problem for me to deal with at times.

Can one use a sorted file as an index?

Actually, for my sample dataset, TAGSORT worked much better than using a multi-threaded sort. I'm still investigating alternatives, and was going to post the TAGSORT results, but Peter's suggestion was posted before I had an opportunity to post them.

Can one use a sorted file as an index?

Yes, testing it ourselves appears to be the best option. Paul Dorfman, over on SAS-L, suggested some reading material:

Art,

A sorted SAS file IS a self-index. Does binary/interpolation search ring the bell? You can find a plethora of examples on sas-l, just search for the germane keywords. Or check the section "8. Best Index - NO Index?" in:
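
As a rough illustration of the self-index idea in the message above (written in Python, and not taken from the referenced paper), a binary search over a sorted file needs nothing but direct access to a row by its position. The function name `binary_search` and the sample rows are hypothetical; in SAS the same pattern would rely on direct row access rather than list indexing.

```python
def binary_search(sorted_rows, key, target):
    """Return the position of the row whose key equals target, or None if absent.

    sorted_rows must already be sorted ascending by key.
    """
    lo, hi = 0, len(sorted_rows) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value = sorted_rows[mid][key]  # direct access to one row by position
        if value == target:
            return mid
        if value < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return None

sorted_rows = [{"id": i, "payload": f"row {i}"} for i in (10, 20, 30, 40, 50)]

print(binary_search(sorted_rows, "id", 40))  # 3
print(binary_search(sorted_rows, "id", 35))  # None
```

Each lookup touches about log2(n) rows, so no separate index file is needed as long as the data stays sorted on the lookup key.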

Can one use a sorted file as an index?

I used a server with a very fast disk, '/local', and created an expanded version of your data there. I initialized the session with work space on a slow, remote NAS, so I had poor util space for the sort. Tagsort in this case gave 2x the performance of the standard sort. This is, I guess, as expected, though I am not sure whether there is a breakpoint. I assume that if the situation were reversed, with my dataset stored on the NAS and my util space on my fast local disk, I would see the opposite, since tagsort will read the data file twice. I then need to test with both data and work stored in fast space. I will have to continue testing another time, but this is a topic that certainly interests me.

That is a very good paper; I am familiar with it already. It does not really discuss tagsort, though.