We all know that operating on strings can be prohibitively expensive. In my free time, I am working on my own scripting language, and there are several places where strings are used: during the compilation of scripts, building the AST, code generation, and assembly generation, to name a few. To help speed up compilation, I knocked together this small class, which basically replaces string comparisons with integer comparisons. The class itself is nothing more than a glorified hash table that lets you insert, find, and retrieve strings.
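For readers who want something concrete, here is a minimal sketch of such a table. It is not the class from this post: the method names Insert, Find, and Get follow the ones mentioned in the thread, while StringTable, kNotFound, and the choice of std::unordered_map plus std::vector for storage are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sentinel returned when a string is not in the table.
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Hypothetical string table: every distinct string gets a small integer
// index, so later comparisons can be done on integers instead of characters.
class StringTable {
public:
    // Returns the index of `s`, adding it to the table if it is new.
    std::size_t Insert(const std::string& s) {
        auto it = indices_.find(s);
        if (it != indices_.end()) return it->second;
        const std::size_t index = strings_.size();
        strings_.push_back(s);
        indices_.emplace(s, index);
        return index;
    }

    // Returns the index of `s`, or kNotFound if it was never inserted.
    std::size_t Find(const std::string& s) const {
        auto it = indices_.find(s);
        if (it == indices_.end()) return kNotFound;
        return it->second;
    }

    // Returns the string stored at `index` (e.g. for error messages).
    const std::string& Get(std::size_t index) const {
        assert(index < strings_.size());
        return strings_[index];
    }

private:
    std::unordered_map<std::string, std::size_t> indices_;  // string -> index
    std::vector<std::string> strings_;                      // index -> string
};
```

The idea is to call Insert once per string when it is first seen, keep the returned index, and compare indices from then on.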

It's nice, and it's especially nice to see people testing their code. But how would st.Find(sz1_dup) == index1 speed up the comparison when st.Find uses HashFunction(string), which calculates the hash over the whole string? Unless you always calculate indices for all strings and then use these instead of the strings themselves. I am just asking the question based on the example code.

Also, wouldn't it be easier to use std::set, since you are already using the STL anyway? It might be slower, of course. What I mean is something like this:

I suppose the sample code could have been a bit clearer. The point was indeed to use the Find/Insert functions only once, when a string is first encountered, and then use the indices in the strings' place. The way I use it: while parsing the script files, all named primitives are inserted into the table, and then throughout the rest of the AST validation and code generation, I only use the indices to compare strings. I don't touch the actual strings unless there is a compilation error and I need to dump some human-readable info on the error.

@mrjones
Also, wouldn't it be easier to use std::set, since you are already using the STL anyway? It might be slower, of course.

I actually hadn't thought about using std::set for this. The only issue I can see is that it keeps all its entries sorted, which is not a requirement for my specific case. This would incur some unneeded overhead, but then again, since each string only needs to be looked up once, it might be negligible. Whatever floats your boat.
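A rough sketch of what a std::set-based version could look like, for comparison. This is not the code from either poster. SetStringTable and Handle are hypothetical names, and using the address of the stored element as the handle is an assumption; it relies on the fact that pointers to std::set elements stay valid until that element is erased.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical handle type: the address of the string stored inside the set.
// Two handles are equal exactly when they refer to the same stored string.
using Handle = const std::string*;

class SetStringTable {
public:
    // Inserts `s` if needed and returns the handle of the stored copy.
    Handle Insert(const std::string& s) {
        return &*strings_.insert(s).first;
    }

    // Returns the handle for `s`, or nullptr if it was never inserted.
    Handle Find(const std::string& s) const {
        auto it = strings_.find(s);
        if (it == strings_.end()) return nullptr;
        return &*it;
    }

    // Returns the text behind a handle (e.g. for error messages).
    static const std::string& Get(Handle h) {
        assert(h != nullptr);
        return *h;
    }

private:
    std::set<std::string> strings_;  // kept sorted; lookups are O(log n)
};
```

Comparing two handles is a single pointer comparison, so the sorting cost is only paid in Insert and Find.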

This is a great optimization for compilers and similar projects, but it can be a nightmare for debugging. Not long ago I had to track down an obscure bug in a C preprocessor that had a string (token) table like this, and working with numbers instead of readable strings made the job a lot harder.

So I'd recommend either using char*, as JarkkoL already suggested, with additional methods to get the index of a string when you need a more compact representation (like when writing out to a file), or just making IndexedString a struct with an additional string pointer in debug mode.
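The second suggestion could look roughly like the sketch below. The field names, the MakeIndexedString helper, and the use of the standard NDEBUG macro to detect a debug build are all assumptions; the real IndexedString from the article may be laid out differently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical handle: the table index, plus (in debug builds only) a pointer
// to the text so the debugger shows something readable.
struct IndexedString {
    std::size_t index;
#ifndef NDEBUG
    const char* debugText;  // must point at storage that outlives the handle
#endif
};

// Builds a handle; `text` is only kept when NDEBUG is not defined.
inline IndexedString MakeIndexedString(std::size_t index, const char* text) {
    assert(text != nullptr);
    IndexedString s;
    s.index = index;
#ifndef NDEBUG
    s.debugText = text;
#else
    (void)text;  // unused in release builds
#endif
    return s;
}

// Comparison still only looks at the index, so release and debug builds
// behave the same.
inline bool operator==(const IndexedString& a, const IndexedString& b) {
    return a.index == b.index;
}
```

In a release build the struct is just the index, so the extra pointer costs nothing where it matters.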

I actually wanted to use wchar_t strings, but I'm currently using flex/bison to generate the lexer and parser, and both of them choke on non-ANSI symbols, so I stuck with char* for the time being. If I were able to use wchar_t, though, I expect modifying the class to support it would be trivial.

@.oisyn
.edit: And why does Get() return a std::string, rather than a const char *?

A laziness/convenience combo. I mentioned earlier that I only touch the actual strings when an error has occurred and I need to dump some info. I return a std::string so I can start concatenating the compiler error to the string immediately, and that's the only reason.

Hey, those are good suggestions - I think I'll incorporate them in my code :yes: I can't really do the memcmp though, as I need case-insensitive comparison for my use case. :sad:

Do you need to store the string as case-sensitive?

Because, if all strings are hashed as upper case, I suppose you could just convert the search string to upper case before searching for it.

I guess that would be another improvement tip if you can use it. Hash all strings as upper (or lower) case. Some people prefer upper case over lower, because they think the upper case conversion functions are slightly faster. I haven't really searched for proof of it though.

You can use this tip when all of your strings are filenames or file paths on a case-insensitive file system, since you don't have to care about case there. I've seen MMO games like World of Warcraft and Warhammer store all their game asset file paths as upper case, to improve hash lookups.

Well, I need to save out the strings in some cases, and I need them to preserve case (well, at least preserve the case of the first string passed to it). For some cases, doing it as all upper or all lower would be fine though... I guess I could store a mixed-case AND an all upper version of each string... more memory, but would allow the memcmp when doing lookups...
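A minimal sketch of that idea, assuming ASCII-only strings: the upper-case version is used as the lookup key, and the first-seen casing is kept separately for saving out. CaseInsensitiveTable, ToUpperAscii, and kNotFound are hypothetical names, and std::unordered_map is used here in place of a hand-written hash table.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sentinel returned when a string is not in the table.
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Returns an upper-case copy of `s`. Assumes ASCII; std::toupper depends on
// the current C locale for anything else.
std::string ToUpperAscii(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

class CaseInsensitiveTable {
public:
    // Looks up `s` ignoring case; stores it with its original casing if new.
    std::size_t Insert(const std::string& s) {
        std::string key = ToUpperAscii(s);
        auto it = indices_.find(key);
        if (it != indices_.end()) return it->second;
        const std::size_t index = originals_.size();
        originals_.push_back(s);
        indices_.emplace(std::move(key), index);
        return index;
    }

    // Returns the index of `s` ignoring case, or kNotFound.
    std::size_t Find(const std::string& s) const {
        auto it = indices_.find(ToUpperAscii(s));
        if (it == indices_.end()) return kNotFound;
        return it->second;
    }

    // Returns the string with the casing of the first Insert.
    const std::string& Get(std::size_t index) const {
        assert(index < originals_.size());
        return originals_[index];
    }

private:
    std::unordered_map<std::string, std::size_t> indices_;  // upper-case key -> index
    std::vector<std::string> originals_;                    // index -> first-seen casing
};
```

This does store two versions of every string, which is exactly the memory trade-off being discussed.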

I wouldn't store two strings just for that. I don't think it's worth it.

The memcmp() check is only done after the hash index is checked, so it's only useful when you have collisions. Collisions should be avoided as much as possible, so I don't think it's worth the effort for something that shouldn't happen most of the time.
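To make the order of checks concrete, here is a small sketch of a bucketed lookup where memcmp() only runs for entries whose stored hash already matches. BucketTable and its layout are assumptions, and the FNV-1a body of HashFunction is just a stand-in, since the article's actual hash routine is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Stand-in hash (32-bit FNV-1a); the article's HashFunction may differ.
std::uint32_t HashFunction(const char* data, std::size_t length) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < length; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 16777619u;
    }
    return h;
}

// Hypothetical fixed-size bucket table storing the full hash next to each string.
class BucketTable {
public:
    explicit BucketTable(std::size_t bucketCount) : buckets_(bucketCount) {
        assert(bucketCount > 0);
    }

    bool Contains(const char* data, std::size_t length) const {
        const std::uint32_t hash = HashFunction(data, length);
        for (const Entry& e : buckets_[hash % buckets_.size()]) {
            // Cheap integer checks first; memcmp only runs when both the
            // full hash and the length already match.
            if (e.hash == hash && e.text.size() == length &&
                std::memcmp(e.text.data(), data, length) == 0) {
                return true;
            }
        }
        return false;
    }

    void Insert(const char* data, std::size_t length) {
        if (Contains(data, length)) return;
        const std::uint32_t hash = HashFunction(data, length);
        buckets_[hash % buckets_.size()].push_back(
            Entry{hash, std::string(data, length)});
    }

private:
    struct Entry {
        std::uint32_t hash;
        std::string text;
    };
    std::vector<std::vector<Entry>> buckets_;
};
```

With a decent hash, most lookups never get past the integer comparisons, which is why the memcmp() cost rarely matters.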

If it were possible to have a collision-free hash routine, then you could get rid of all the code after the hash index lookup, and just return the string. All the code here is just meant to speed up collision checking.

Mostly no, they are mostly arbitrary names used as IDs for all sorts of things, and a lot of them are read from (and written to) XML files. Using strings for IDs makes it very easy to work with when doing rapid prototyping (like when I tried making games in 2-3 hours for TigJamUK).