I have no way to link to a single news update I have made. I've considered making target="" insertions, but I find those annoying. Also, they don't work on all platforms (read: Hiptop). Fortunately, my corpus is regular, so breaking it into different files is no problem, but what should I call them?

I was thinking "title_with_underscores" was good, unless I have conflicting titles. Which I might. Or no titles, which I do. I don't really want to use the date, because it's not unique. I could use some of the first words in the message body, but they're not unique either. This gradually leads me to some kind of "form a string, check for collisions, loop" plan, I think.
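A minimal sketch of that form-a-string, check-for-collisions, loop plan. The function names, the numeric suffix, and the "untitled" fallback are assumptions made for illustration, not anything settled in this thread.

```python
import re

def slugify(text):
    """Lowercase the text and join its alphanumeric runs with underscores."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "_".join(words)

def unique_slug(text, taken):
    """Return a slug for `text` that is not already in the set `taken`.

    On a collision, append _2, _3, ... until the name is free
    (the suffix scheme is an assumption, any tie-breaker would do).
    """
    base = slugify(text) or "untitled"  # hypothetical fallback for empty titles
    candidate = base
    counter = 2
    while candidate in taken:
        candidate = f"{base}_{counter}"
        counter += 1
    taken.add(candidate)
    return candidate
```

Calling `unique_slug` once per entry with a shared `taken` set gives every file a distinct name, whether the string comes from a title or from the first words of the body.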

Longest word sounds contrived. Amazon uses something called statistically improbable phrases. I don't know what their algorithm is (it's apparently proprietary), but Language Log has analyzed it and doesn't like it much.

Do you not have any sort of timestamp on these entries? Maybe you could go back and give each of them a rough month according to your memory. I think a consistent "datatype" (a [date, string] pair) for everything is a much better idea than something less consistent. How the string is generated will differ depending on whether you gave something a title. I would slug (entitle) everything that doesn't have a title with the first several words, which is what LiveJournal does when you comment on something without a title, if I recall correctly. Then your datatype will also have consistent meaning: it will be [creationDate:date, entryTitle:string]. As for how to slug the documents, I'll think of something if you want. I don't think it's necessary for each entryTitle to be unique, but it is somewhat more interesting to think about the case where it is. In the non-unique case, I would take the first sentence that has four or more words (to rule out starting off with a greeting or short exclamation).
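The [creationDate, entryTitle] pair and the "first sentence of four or more words" rule could be sketched roughly like this. The names `Entry`, `first_long_sentence`, and `make_entry`, the five-word cutoff, and storing the date as a plain string are all assumptions for the sake of the example.

```python
import re
from typing import NamedTuple

class Entry(NamedTuple):
    creation_date: str  # assumption: kept as a string, at whatever precision is known
    entry_title: str

def first_long_sentence(body, min_words=4):
    """Return the first sentence with at least `min_words` words, or None."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        if len(sentence.split()) >= min_words:
            return sentence
    return None

def make_entry(date, title, body, n_words=5):
    """Keep a real title if there is one; otherwise derive one from the body."""
    if title:
        return Entry(date, title)
    sentence = first_long_sentence(body) or body
    return Entry(date, " ".join(sentence.split()[:n_words]))
```

For example, `make_entry("2004", "", "Hmph. Well then! I finally fixed the news page today.")` skips the two short openers and titles the entry "I finally fixed the news".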

Why is it contrived? Does everything have to be complex for you? It's effectively a hashcode, and the longer it is the less likely there will be a collision (although since the domain is words, that doesn't entirely hold), but still. It's a quick hash that takes a millisecond to compute.
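For what it's worth, the longest-word idea really is only a couple of lines. This is a sketch under the assumption that a "word" is a run of letters or digits and that ties go to the earliest word; the function name is made up.

```python
import re

def longest_word(body):
    """Return the longest word in `body`, or "" if it contains no words.

    Ties are broken in favour of the word that appears first.
    """
    words = re.findall(r"[a-z0-9]+", body.lower())
    return max(words, key=len, default="")
```

Two entries that share a longest word would still collide, so this would need the same collision check as any other naming scheme.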

True, if the first sentence of the post covers the full context of the post then your idea is better.

Like I said, I got a year on all of them. If I assigned a month I'd be making it up. Manually. I don't want to. When I record a date, I record the day and month. I do not record the time of day.

I think titling untitled things with the first several words is a good plan. I'm actually leaning away from including small words now. Instead, I can add things like "well" and "hmph" to my list of words to remove and hopefully pick out some good ones. I might have pithy four-word sentences. You don't know.
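Dropping small and filler words before picking the title words might look something like this. The stop list here is a made-up starting point, and the four-word limit and the function name are assumptions too.

```python
import re

# Hypothetical stop list: a few small words plus fillers like "well" and "hmph".
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "i", "it", "is",
              "well", "hmph"}

def title_from_body(body, n_words=4):
    """Build an underscore-joined title from the first `n_words` kept words."""
    # Drop apostrophes first so "don't" stays one word ("dont").
    words = re.findall(r"[a-z0-9]+", body.lower().replace("'", ""))
    kept = [w for w in words if w not in STOP_WORDS]
    return "_".join(kept[:n_words])
```

With this list, "Well, hmph. The server is down again and I am annoyed." becomes "server_down_again_am", which suggests the list would need tuning against the real corpus.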

I just want the date+title string to be unique. I think a good format is