Should be migrated to meta; as the question and answer both specifically deal with SO implementation, and the accepted answer is from @JeffAtwood.
–
casperOne♦Nov 18 '11 at 20:21

15

@casperOne Do you think Jeff is not allowed some non-meta reputation? The question is about "how can one do something like this", not specifically "how is this done here".
–
Paŭlo EbermannNov 19 '11 at 13:05

@PaŭloEbermann: It's not about Jeff getting some non-meta reputation (how much reputation he has is really not my concern); the question body specifically referenced StackOverflow's implementation hence the rationale for it being on meta.
–
casperOne♦Nov 22 '11 at 14:04

This is great. The only change I have made so far is to change "if (i == maxlen) break;" to become "if (sb.Length == maxlen) break;" just in case there are lots of invalid characters in the string I am passing in.
–
Tom ChantlerMay 30 '11 at 18:21

2

A minor optimisation: if (prevdash) sb.Length -= 1; return sb.ToString(); instead of the last if statement.
–
Mark HurdFeb 21 '12 at 10:34

8

@Dommer "if (sb.Length == maxlen) break;" is buggy: if the character at position maxlen-1 is "ß", it gets converted to "ss", so sb.Length == maxlen will never be true. It is better to test for (sb.Length >= maxlen) instead.
–
Henrik StenbækMar 29 '12 at 13:14
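
To illustrate the point in the comment above, here is a minimal Python sketch (not the C# code under discussion). The function name `truncate_slug`, the `strict_equal` flag, and the one-entry expansion table are hypothetical and exist only for this demonstration. It shows how an equality test can be skipped over when one input character expands to two output characters.

```python
def truncate_slug(chars, maxlen, strict_equal):
    """Append (possibly expanded) characters until the length limit is hit."""
    expansions = {"ß": "ss"}  # hypothetical one-entry lookup table
    out = []
    length = 0
    for ch in chars:
        piece = expansions.get(ch, ch)
        out.append(piece)
        length += len(piece)
        if strict_equal:
            # Length jumps from 4 to 6 on "ß", so this is never true for maxlen=5.
            if length == maxlen:
                break
        elif length >= maxlen:
            break
    return "".join(out)


print(truncate_slug("straße-und-weg", 5, strict_equal=True))   # strasse-und-weg
print(truncate_slug("straße-und-weg", 5, strict_equal=False))  # strass
```

Note that even with the `>=` test the result can overshoot the limit by one character, so a final trim to `maxlen` may still be wanted.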

The hyphens were appended in such a way that one could be added, and then need removing as it was the last character in the string. That is, we never want “my-slug-”. This means an extra string allocation to remove it on this edge case. I’ve worked around this by delay-hyphening. If you compare my code to Jeff’s the logic for this is easy to follow.

His approach is purely lookup based and missed a lot of characters I found in examples while researching on Stack Overflow. To counter this, I first perform a normalisation pass (AKA collation, mentioned in the Meta Stack Overflow question Non US-ASCII characters dropped from full (profile) URL), and then ignore any characters outside the acceptable ranges. This works most of the time...

... For when it doesn’t I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these I’ve got a manual list of exceptions that is doubtless full of holes, but it is better than nothing. The normalisation code was inspired by Jon Hanna’s great post in Stack Overflow question How can I remove accents on a string?.
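
The three ideas described above (a normalisation pass, a fallback lookup table, and delayed hyphens) can be sketched roughly as follows. This is an illustrative Python version, not the C# code from the answer; `make_slug`, `FALLBACKS`, and the table's contents are assumptions made for the example, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical fallback table for characters that do not decompose to
# ASCII under normalisation. Deliberately incomplete.
FALLBACKS = {"ß": "ss", "æ": "ae", "ø": "oe", "đ": "d", "ł": "l"}


def make_slug(title, maxlen=80):
    out = []
    length = 0
    pending_dash = False  # a separator was seen but has not been written yet
    for ch in unicodedata.normalize("NFKD", title.lower()):
        if unicodedata.combining(ch):
            continue  # drop the accent marks that NFKD split off
        piece = FALLBACKS.get(ch, ch)
        if all("a" <= c <= "z" or "0" <= c <= "9" for c in piece):
            # Delayed hyphen: only emit it once a real character follows,
            # so the slug can never start or end with "-".
            dash = "-" if pending_dash and out else ""
            if length + len(dash) + len(piece) > maxlen:
                break
            out.append(dash + piece)
            length += len(dash) + len(piece)
            pending_dash = False
        else:
            pending_dash = True  # anything else acts as a separator
    return "".join(out)


print(make_slug("Crème Brûlée & Straße!"))  # creme-brulee-strasse
```

Because the hyphen is only written when the next acceptable character arrives, no trailing hyphen has to be removed afterwards, and checking the length before appending avoids overshooting `maxlen` when a character expands.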

+1 This is great Dan. I also added a comment on your blog about possibly changing if (i == maxlen) break; to be if (sb.Length == maxlen) break; instead so that if you pass in a string with a lot of whitespace/invalid characters you can still get a slug of the desired length, whereas the code as it stands might end up massively truncating it (e.g. consider the case where you start with 80 spaces...). And a rough benchmark of 10,000,000 iterations against Jeff's code showed it to be roughly the same speed.
–
Tom ChantlerNov 17 '11 at 23:34

Thanks, responded on my blog and fixed the code there and above. Also thanks for benchmarking the code. For those interested it was on a par with Jeff's.
–
DanHFeb 12 '12 at 0:29

It seems like there are some problems with Slug.Create(). First, uppercase versions of ÆØÅ are not properly converted: Æ and Ø get ignored, while Å is translated to "a". Normally you would convert “å” to “aa”, “ø” to “oe” and “æ” to “ae”. Second, "if (sb.Length == maxlen) break;" is buggy: if the character at position maxlen-1 is "ß", (sb.Length == maxlen) will never be true; it is better to test for (sb.Length >= maxlen) instead. I’m surprised that you cut at an arbitrary position rather than at the last “-”; cutting there would save you from ending with an unwanted word, as if you had cut “to assert” after the last "s".
–
Henrik StenbækMar 29 '12 at 13:09
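
The suggestion to cut at the last "-" rather than at an arbitrary position could look something like this in Python. The helper name `cut_at_word_boundary` is hypothetical, and the sketch assumes the slug is already built and contains no doubled hyphens.

```python
def cut_at_word_boundary(slug, maxlen):
    """Shorten a finished slug without leaving a partial word at the end."""
    if len(slug) <= maxlen:
        return slug
    head = slug[:maxlen]
    # If the next character is a hyphen, the cut already falls between words.
    if slug[maxlen] == "-":
        return head
    cut = head.rfind("-")
    # With no hyphen to fall back to, keep the hard cut.
    return head[:cut] if cut > 0 else head


print(cut_at_word_boundary("how-to-assert-things", 9))   # how-to
print(cut_at_word_boundary("how-to-assert-things", 13))  # how-to-assert
```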

What about funny characters? What are you going to do about those? Umlauts? Punctuation? These need to be considered. Basically, I would use a white-list approach, as opposed to the black-list approaches above: Describe which characters you will allow, which characters you will convert (to what?) and then change the rest to something meaningful (""). I doubt you can do this in one regex... Why not just loop through the characters?

downcase turns the string to lowercase, strip removes leading and trailing whitespace, the first gsub call globally substitutes spaces with dashes, and the second removes everything that isn't a letter or a dash.
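
For readers who do not use Ruby, the same four steps can be approximated in Python. This is a rough equivalent rather than a translation of the answer's code; `simple_slug` is a made-up name, and the sketch assumes "letter" means ASCII a-z.

```python
import re


def simple_slug(title):
    lowered = title.lower().strip()          # downcase, then strip
    dashed = lowered.replace(" ", "-")       # first gsub: spaces become dashes
    return re.sub(r"[^a-z-]", "", dashed)    # second gsub: drop everything else


print(simple_slug("  Hello World!  "))  # hello-world
```

Like the approach it mirrors, this drops accented letters and digits entirely and can leave doubled dashes where punctuation sat between spaces.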

There is a small Ruby on Rails plugin called PermalinkFu, that does this. The escape method does the transformation into a string that is suitable for a URL. Have a look at the code; that method is quite simple.

To remove non-ASCII characters it uses the iconv lib to translate to 'ascii//ignore//translit' from 'utf-8'. Spaces are then turned into dashes, everything is downcased, etc.