Web 2.0, blogging, and tags all go hand in hand. However, while RPC standards exist for blogs and the pinheads boggle over the true definition of a “blog,” no one has a cast-iron standard for tags. Depending on where you go and whom you ask, tags are implemented differently, and even defined differently. Even more importantly, tags were meant to be universal and compatible: a medium for sharing and conveying info across the internet, the very embodiment of a semantic web. Unfortunately, they’re not. Far from it: tags create more discord and confusion than they resolve.

To Space or Not to Space, that is the Question

This one is probably the most obvious obstacle and the most destructive when it comes to tallying tag popularity or making those pretty tag clouds: Can tags have spaces in them or not?! If tags don’t/shouldn’t have spaces, then what do you do with multi-word tags that you just can’t shorten? Do you replace the spaces with underscores, dashes, or just take ’em out? Does it matter?

Yesterday we were discussing how best to implement the tagging feature in the upcoming blogging engine, Habari, and this topic caused quite a lot of confusion. It’s an important question. What it means is you can have a single tag split across 4 or 5 different tags – for no good reason. If you thought having www. and no www. in a domain name made things confusing, you should probably sit down now. Take, for instance, the tags “Software, Windows Vista, Microsoft.” Depending on which site you enter these tags into, they’ll take quite a few different forms – each with a different meaning to another website!

That’s assuming you already know what form the site accepts and what it filters out. Suppose you were used to Del.icio.us and just found Technorati’s tagging feature – do you put “Windows Vista” in quotes or do you type it as WindowsVista? Or do you use underscores instead? Talk about semantic web!

Obviously the need for spaces in tags is an important one. Whether it’s “Semantic Web” or “Ford Interceptor” that you need to tag, it’s rather different from “Windows AND Vista” and “Ford AND Interceptor” – and it gets worse if you have a search engine that places OR in there instead of AND. Much worse. The big question is, why doesn’t such a standard already exist? It’s obvious that Web 2.0 is all about connecting ideas and bringing articles, content, and readers together. But anyone looking at the tagging process would immediately assume it’s about the exact opposite: splitting up content, making things difficult to find, and purposely making bloggers’ lives miserable.

With Habari, so far we’ve gone through all the forms, and at the moment we’re at number 3 for compatibility and familiarity’s sake. But that may change – hence the need for a visible, tangible tagging standard. The only problem is, tagging isn’t some new concept. A tagging standard isn’t something that we can just whip up and serve on a platter.

What about the noun/verb argument? Look at the tags for this post: “Blogs, Blogging … Tags, Tagging” We just don’t know what people will search for – and we try to cover all the bases. But then you have so many possibilities! Code, Coding; Design, Designing; Research, Researching. For every pair there is one word more likely than the other. But people like to have all the bases covered, hence all the clutter. Tagging is fun, but only if done the right way.

Something this prevalent and widespread needs years of discussion, negotiation, and failure between the big companies before they can come to a conclusion. It’s going to be something that del.icio.us and Technorati and all the other major players agree on – which is practically impossible.

Del.icio.us is arguably the “tagging leader” in Web 2.0, but their budget is far smaller than that of commercial competitors like Technorati, and their ideas are also much older, even outdated, given that they were the original players in the game. Spaces are important; maybe they can agree on that. But what about delimiters? UTW uses commas as delimiters, while Technorati and Del.icio.us use spaces. But if spaces are part of a tag, then you have to enclose the tag in quotes – but what if your tags require quotes? [1]
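To make the delimiter problem concrete, here is a minimal sketch of the two input conventions described above: comma-delimited, and space-delimited with double quotes around multi-word tags. The function names are hypothetical and this is not how any of the sites mentioned actually parse input; it only shows that the same three tags need different input depending on the convention.

```python
import re


def parse_comma_tags(text):
    """Comma-delimited input: spaces inside a tag are kept as-is."""
    return [part.strip() for part in text.split(",") if part.strip()]


# Either a double-quoted run (group 1) or a run of non-whitespace (group 2).
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_space_tags(text):
    """Space-delimited input: double quotes group a multi-word tag."""
    tags = []
    for quoted, bare in _TOKEN.findall(text):
        tag = (quoted or bare).strip()
        if tag:
            tags.append(tag)
    return tags


if __name__ == "__main__":
    print(parse_comma_tags("Software, Windows Vista, Microsoft"))
    print(parse_space_tags('Software "Windows Vista" Microsoft'))
    # Forget the quotes and the multi-word tag silently splits in two:
    print(parse_space_tags("Software Windows Vista Microsoft"))
```

Note that the quoted form has no way to express a tag that itself contains a double quote, which is exactly the corner the footnote worries about.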

Basically, it’s too late for a tagging standard that will be used unanimously throughout the web. A truly semantic web most certainly won’t ever exist because of the reluctance to change and the unwillingness to compromise and accept defeat. A semantic web requires objective analysis of methods and data, culminating in honestly evaluated options, and immediate acceptance of the outcome. But that’s never going to happen.

[1] We can’t conceivably think of a tag that would actually require quotes, but you never know what might happen. What if C# is replaced with C”? No one considered the octothorpe a viable tag element – then again, it’s not a real octothorpe but a sharp.

45 thoughts on “The Need for Creating Tag Standards”

I’ve personally given up on the concept of a semantic web. While highly appealing, it’s also very, very highly unlikely & border-line impossible.

No matter what you do, you’re not going to get Company A to openly admit that its opinion is “inferior” to Company B’s – and then to follow Company B’s model. It’s just asking for too much.

Compromise is nice, but it’s only theoretical. Some people are just entirely wrong/out-dated, just like del.icio.us in the example you gave above. They had a great idea, they started the tagging madness, kudos to them. But sometimes, you just have to go with the flow. Change isn’t always bad, but no one will ever accept that – big companies especially.

The question of how to delimit tags has been solved. On the web, it’s called XML. These questions about commas, quoting rules, all that shit has been hashed out for years, and the way you handle it in 2007 is with XML.

This says nothing for the linguistic problem of noun/verb, but the technology has been solved in concept at least.

I just want to start by saying I think it’s pretty cute that there aren’t any tags on this article.

Tags: No Tags.

Heh.

Anyway, this problem can be solved so easily. First off, tags are simply words. There isn’t much to screw up there, verbs, nouns, it doesn’t matter, people use the same basic words to describe most things. To address your small concerns with “complex” tags:

You might want to look up fulltext search implementations. I wrote one of these in a day and it handled most of what you’re talking about. It was for a help system, granted, but the basic concepts are easily adapted to tagging and semantic comparison.

To anyone else saying “XML!” That’s worthless. XML is a meta format. In this case, the data that we’re trying to handle is no longer meta – it is the point of this discussion. Shouting, “XML will save us!” betrays two things: a lack of knowledge of the problem space, and a lack of experience implementing even basically similar algorithms.

Seems to me the “original” tags are HTML META tags…. However, they were abused and are not as relevant now…. Whatever cohesive standard might one day be developed will have to somehow keep such abuse from happening again.

I think there’s a difference between semantic tagging standards (librarianship) and tag system standards (syntax and representation). It’s very important not to confuse the two (IMHO).

We can let statistics and search engines deal with “blog”, “blogs”, “blogging” etc. I don’t believe there is any need to enforce standards in this area. Any attempt to do so will fail. You can’t force everybody to behave in the same way, and presenting them with a drop-down list of tags to choose from isn’t tagging at all.

Where I do agree with you is the space/underscore issue. There really ought to be an XML standard for this stuff. What do you suggest?

The technical part of tags is covered by the rel-tag microformat, and that’s about as far as a standard for tags can go, I think. It can extend to other formats, but you can’t standardize the content of a tag – only the technological format of that content.
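For readers who haven’t met rel-tag: as I understand the microformat, a tag is an ordinary link carrying rel="tag", and the tag itself is the last segment of the link’s URL path, not the link text. Below is a rough sketch of a reader for that, using only Python’s standard library. The class and function names are made up for illustration, and decoding “+” as a space is my reading of the convention rather than a guarantee.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote_plus


class RelTagExtractor(HTMLParser):
    """Collects tags from <a rel="tag" href="..."> links."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="tag nofollow".
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "tag" in rels and href:
            # The tag is the last path segment, not the visible link text.
            segment = urlsplit(href).path.rstrip("/").rsplit("/", 1)[-1]
            if segment:
                self.tags.append(unquote_plus(segment))


def extract_rel_tags(html):
    parser = RelTagExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Fed a snippet such as `<a href="http://example.com/tag/Windows+Vista" rel="tag">Vista stuff</a>`, this yields the single tag “Windows Vista”. The format says nothing about whether that tag means the same thing as “WindowsVista” somewhere else.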

To merge a “WindowsVista” tag, a “Windows Vista” tag and a “Vista” tag, you have to use some kind of artificial intelligence that analyzes the content of the tags and merges them. That, or an approach similar to the one Wikipedia takes to different names for the same thing. How such a merge can be shared semantically, however, is something that’s still to be standardized.

@Issac: “tags are simply words” and “remove all non-alphanumeric characters from each tag before processing.” will not work when the words are not in a Western language/encoding such as English/latin-1.

Encodings such as UTF-8 alone won’t help you either; you need to know the language that the person doing the tagging was using or thinking in. You can’t just lowercase or ‘normalize’ tags, for example by removing characters, without knowing more about which characters are significant. XML won’t help either, although it does have a place in its syntax to at least record the language: xml:lang.
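A quick sketch of why the “remove all non-alphanumeric characters” rule breaks down outside ASCII. Both function names are hypothetical. The second function is a Unicode-aware variant that is less destructive, but it is still not language-aware, which is the point being made above.

```python
import re
import unicodedata


def ascii_strip(tag):
    """The naive rule: drop everything outside A-Z, a-z, 0-9."""
    return re.sub(r"[^A-Za-z0-9]", "", tag)


def unicode_key(tag):
    """NFKC-normalize, case-fold, and keep characters Python counts as
    alphanumeric. Still lossy: it knows nothing about the tagger's language."""
    folded = unicodedata.normalize("NFKC", tag).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


if __name__ == "__main__":
    print(repr(ascii_strip("日本語")))     # '' : the whole tag is gone
    print(repr(ascii_strip("Caf\u00e9")))  # 'Caf' : the accented letter is dropped
    print(repr(unicode_key("日本語")))     # '日本語'
    print(repr(unicode_key("Caf\u00e9")))  # 'café'
```

Even the second version makes choices that are wrong for somebody: case folding turns “I” into “i” (a Turkish writer would expect a dotless ı) and “ß” into “ss”, and filtering on `isalnum()` discards combining marks that some scripts rely on.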

“…tags were meant to be universal and compatible: a medium of sharing and conveying info across the blogosphere – the very embodiment of a semantic web.”

No offense, but … where the hell are you getting this? I seem to recall the early tag implementations at Flickr and Del.icio.us specifically described by their designers as a means of personal organization, i.e. NOT universal. The only thing that matters in a tagging implementation is that the *tagger* be able to find his information according to his own mnemonic. Everything else is secondary.

The premise of this article is completely flawed, mixing up SemWeb with unstructured labels, and assuming that the only thing that can help is vendor sports among the moneyed “big companies” and “major players”. Bah.

XML is not the solution; it’s a language and protocol of its own. It’s basically a human-readable format whose primary purpose is to let two machines communicate with one another – humans don’t fit into the equation.

Verbs vs Nouns should be solved with lemmatizing or tag synonyms.
The rest is just a matter of better input elements. Tags should be tokens, and tokens should be allowed to contain most anything. If your parser is lazy and doesn’t grok that “Multiword Tag” is one token, fix your parser or your input method. The syntax doesn’t matter.
‘Normalizing’ “This is a Multi-Word Tag” to “this-is-a-multi-word-tag” is so 1990.

I’m a library student, and I’m very interested in the potential of tagging as an adjunct to traditional cataloging methods. But I’ve noticed the same problem. Every site and every user has their own scheme that works for them, but doesn’t necessarily work for anyone else.

Example: let’s say I’m on del.icio.us and I want to look for stuff on Monty Python. There’s plenty there. But one user might have it tagged as “humor”, another as “comedy”, a third as “MontyPython”, a fourth as “Pythons”, another as “funny”. And others probably use words I haven’t thought of.

The only way to really do this and have it be useful for everyone, methinks, would be with a controlled vocabulary. Make a list of a thousand or two carefully selected words, and don’t let anyone use anything not on the list. Any changes have to be approved by whoever’s in charge of the standard.

Of course, even with this, there’d still be a lot of noise. But restricting the vocabulary would at least cut it down considerably.

@heisencat: But if you use controlled vocabulary, how is tagging any different from Categories?

Already the line between them is so fine it’s almost invisible, the only real difference being that tags can be added on-the-fly and on whims, but this way, you’re just giving end-users the ability to categorize, not tag.

Surely the idea of tagging is to allow categories to evolve on their own without having a team-of-1000s to “fix the controlled vocabulary”?

Better clustering agents and more user input is what is required – the real question is: will the capacity of all the world’s computers be enough to correctly handle the multi-dimensional input space and then correctly reduce that multi-dimensional space in a coherent way that assists people (rather than hindering them) ..

As someone who writes tagging software for a living (albeit not for the usual case of web tagging, but for email – something in dire need of help!), I’ve seen this go a few ways. Here’s a quick summary of the feedback I’ve got over the last 6 months or so, from people that have used the software:

– Tags have to be very user self-centered. A fixed hierarchy won’t work for all – and as soon as you lose some people then the usefulness quickly drops off.

– We tried a few formats, including XML. What won out was the simplest thing that could possibly work, i.e. a ‘comma’. Our tagline looks like this (and is picked up / indexed if this comment is read in my RSS reader):

Tags: tag1, tag2

– We’ve had some success with ‘Aliasing’, where users can match up their tags with others that come in. It is typical to have a few aliases for each tag, of course, depending on the type of tag. The aliases are often very personal and would have been hard to predict with AI.

Soundex and normalization don’t work. Windows 2.0 is different from Windows 20, and Rupert is different from Robert. The first pair, when normalized, becomes the same; the second pair, when run through a Soundex filter, becomes the same.
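Both collisions mentioned above are easy to reproduce. What follows is a small sketch, assuming the classic American Soundex rules and a “strip everything but letters and digits” normalizer; neither function is taken from the software being described.

```python
import re

# Classic American Soundex letter groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Four-character Soundex code for an ASCII word ('' if it has no letters)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def strip_key(tag):
    """Naive normalization: lowercase, then drop everything but a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))               # R163 R163
    print(strip_key("Windows 2.0"), strip_key("Windows 20"))  # windows20 windows20
```

Each pair collapses to a single key, which is presumably why user-defined aliases worked better than automatic matching in this case.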

Computer Guru: The difference between categories and tags, as I see it, is that something can be tagged with many different words, while a category is one and only one. A category needs to describe the whole site, book, etc., and therefore nothing really fits the category. With tags, a site, book, etc. isn’t put into one specific box but rather into many, and it’s the combination of boxes that describes it. With enough tags to choose from, the description that the tags combine into matches the site, book, etc. exactly.

I think there is a fundamental point that is being overlooked in this discussion. The disconnect is that “tags” should fundamentally remain an unstructured construct, with as few limitations as possible: they are simply a container for identification data as asserted by humans. In this model, tags are a free-form extension of whatever humans choose to label something, apart from any structured categorization. This unstructured labeling is what makes it attractive, flexible, and powerful. As this topic suggests, there are issues in resolving tags that, whilst literally different, are contextually equivalent. I believe this to be the critical juncture. Perhaps the solution lies not in heaping on more standards, but in improving the manner in which tags are processed by consumers.

Computers haven’t any notion of language context, so we must begin to bridge the gap. One way is by implementing ontologies and employing markup using OWL, and investigating the use of tools to parse tags represented as concepts in our ontologies, such as OpenCyc.

@Pelle: Exactly. Tags are categories, except that you can use more than one at a time.

There’s a long-standing debate within the field of cataloging over how you delineate categories. For example, let’s say you have a Western novel… written in German by a German. Do you put it with Westerns or with German literature? Or, more controversially, what do you do with a book on, say, intelligent design? Do you put it with books on evolution, on Christianity, or on pseudoscience?

Or: I presume there are a lot of science fiction fans here. How do you define what is science fiction? Is The X-Files science fiction? Is Firefly? What about Neal Stephenson’s Baroque Cycle?

If you can only use one category per item, it’s hard to figure out where to place things that cross boundaries. Tagging allows you to do that. But, if you allow users to create new categories whenever they want, you end up with chaos.

Oh, and drk: With a controlled vocabulary, you reduce the problem of searching among all these categories to a manageable level… without the effort of building the Semantic Web.

heisencat: I don’t see why you can’t use more than one category at once. With library books, toads, butterfly collections, etc. maybe you can’t, but with cyberdata, we can and we do.

Most bloggers mark posts with multiple categories. For instance, this very article is in 3 categories, because it does indeed fit in all three (in our opinion, I guess). The difference is that a tag is a very specific category. Maybe a tag is like a glove and a category is like a mitten sort of thing.

David Ing, I’m actually very interested in what you were saying, and I especially liked the statistics – please share if you have any more! I just downloaded Taglocity for Outlook, nifty add-in! Though one could argue that the need for such a program has been greatly diminished with Office 2007 and its new, powerful instant search feature, it’s definitely a nice thing to have.

the input might be designed to be a controlled vocabulary to narrow the search space – but the search space should be as unconstrained as possible – “anything goes” might mean that “everything stays” but repeated tagging usage should prefer the most popular tags (and, by default, the most popular “tag equivalences”)

one of the reasons why I prefer the tagging (informal) approach to a top-down (librarian approach) is that new categories can evolve in their own space without intervention – but right now clustering the search space to create novel categories is the problem – if we can get the algorithms right – then the internet can help us lots (I’d have given my right arm for a corpus as large as the internet when I was working in AI)

I call it “collective emergent categorisation” – and it seems that it might just work in the same way as human categorisation – as long as we can get the software to do the high-level dimensional reduction that human brains do so well – and which computers are totally useless at unless given lots of memory and cpu cycles ..

hey – blue sky thinking right? right now this is all so new nobody knows where it might lead …

Heisencat: I don’t think we have to be afraid of getting too many categories – the more the better, and if we don’t count possible spam, the nature of language limits the possible tags to a finite number. If you just have enough people tagging, you will have enough data to make something good out of it – with a threshold and an AI to filter the tags and add structure to them. But the tags themselves should be freely set and dynamic, because only then can they adapt to the fast pace of the internet.

Analyses can also go deeper than just having a normal threshold and a little AI: add geographical data and you can get the exact tags that the people in your area like. Let’s not limit the possibilities – let’s make the best use of them we can!

Computer Guru: The line between tags and categories is not fixed as I see it – they are in a way the same. Tags are the modern incarnation of categories. The categories of this blog could as well be tags – they would probably have other names but the meaning would be the same.

@Computer Guru > “Though one could argue that the need for such a program has been greatly diminished with Office 2007 and its new, powerful instant search feature, it’s definitely a nice thing to have.”

Well, it’s not for everyone, but the way I see it is that even though http://www.google.com exists, http://del.icio.us still has a purpose and a use – sometimes just indexing by the words alone won’t be enough for the vast amount of info that gets thrown at us each day; tags provide a context, especially when shared. That super fast search is only useful if you get good results out of the 100’s returned…

I’ll try to get some stats together and put on my personal blog here: http://www.from9till2.com although I can’t promise it won’t be dull :-)

To create a semantic web, you need semantics. There needs to be an element that understands that certain words have similar or related meanings. It’s too difficult to put every conceivable keyword on an article. At some level, the search algorithm needs to recognize that “humor”, “comedy”, and “funny” are related and give you a match if you search on any of these. Only then will this be a semantic web. What we have now is closer to a syntactic web, where the spelling of the word is more important than its meaning.
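As a toy illustration of that “element that understands related meanings”, here is a sketch in which the search step expands a query through hand-maintained synonym groups before looking anything up. The group contents, the index layout (tag mapped to a set of item ids) and all the names are assumptions made up for the example; a real system would need a far larger, probably learned, set of relations.

```python
# Hypothetical, hand-maintained groups of tags treated as equivalent.
SYNONYM_GROUPS = [
    {"humor", "comedy", "funny"},
]


def build_lookup(groups):
    """Map every word to the full group it belongs to."""
    lookup = {}
    for group in groups:
        frozen = frozenset(group)
        for word in group:
            lookup[word] = frozen
    return lookup


def search(index, query, lookup):
    """Return items tagged with the query or any of its synonyms.

    `index` maps a lowercase tag to the set of item ids carrying it.
    """
    key = query.casefold()
    related = lookup.get(key, frozenset({key}))
    results = set()
    for tag in related:
        results |= index.get(tag, set())
    return results


if __name__ == "__main__":
    index = {"humor": {"sketch-1"}, "comedy": {"film-2"}, "python": {"docs-3"}}
    lookup = build_lookup(SYNONYM_GROUPS)
    print(sorted(search(index, "Funny", lookup)))  # ['film-2', 'sketch-1']
```

The taggers stay free to write whatever they like; only the lookup has to know which spellings mean the same thing.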

I have a hard time understanding the problem with spaces with respect to the Windows Vista example mentioned.

imho, windowsvista would be two separate tags attached to articles/blogs related to the software and I would take it one step further at least and add microsoft.

my reason for this is it affords me the opportunity to refine my search, I can look for articles about any windows os with the tags microsoft and windows, instead of having to type microsoft, windows vista, windows xp, windows 2003 (you get the idea). if I want articles about a specific windows os, all I have to do is add tag(s) that concentrate the search.

I can understand how this kind of flexibility can be abused by excessively adding tags to articles.

Great post and discussion. I’ve been exploring the need for tagging standards for a while now, and you can certainly see from the discussion above that there are many different perspectives on the issue. Controlled vocabularies, tagging standards, microformats, cataloging best practices, authoritative sources, user rating systems — all are interesting concepts that may or may not be applicable or appropriate for the masses.

As a fellow techno-geek with a background in Library Science, I’ve come to think that there’s no “right” way of tagging, especially in the loosely coupled web we live in today. Who are we to impose some Draconian system on our users? Some would argue that putting more structure around the tagging process is the way to go. I see some value in this, but if we make it too hard for someone to tag a URI, tagging will remain with us tech-savvy geeks and never reach the masses. At the bare minimum, I think the tagging systems need some consistency in the following areas:

Spell checking. FF2.0 brings this into the fold, but spelling errors in tags result in bloated and inaccurate tag clouds.

Better equivalence detection capabilities – singular vs. plural, synonyms, typos, etc.
Better tag management – for example, right now, on del.icio.us, I can only delete one tag at a time. Tag replacement helps me get rid of equivalents, but only one at a time.

First of all, tagging is to be embraced by everyone. As we’ve seen with HTML, few actually abide by standards. There’s just no way to make a completely robust standard that handles all the edge cases we throw at it and yet still remains usable for the average lazy/dumb user.

Back in 1996, people were saying we needed to standardize mp3 filenames as Group_Name-Album_Title-[03]-Track_Title.mp3 Yeah, didn’t really catch on. Now we index by ID3 tag so filename is moot. So now how do we index webpages and other information if our metadata is effed up?

The solution is intelligent code.

Soundex and AI stemming get us halfway there. Maybe a source data set is the necessary piece (Hello, Google!). User input will always be varied and as we realize, it’s too late for standards to work. Let computers do the thinking for us.

Surely a programming ninja could write a tag parser in under a week…

And on the input side? Do we have users type double-quotes or delimit by commas? I say let people input however they want. If it’s tagged with the phrase “windows vista” I probably also want that item retrieved when I’m looking for just vista. The item shouldn’t be hidden from me solely because I didn’t input the other word. Obviously, sorting by context and relevance is necessary, but that’s what computers do. Users shouldn’t have to understand the backend architecture in order to use the product correctly.
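One way to get that behaviour, sketched under simple assumptions: index every tag both as a whole phrase and under each of its words, then rank exact phrase matches ahead of word-level matches. `TagIndex` and its methods are invented for this example, and the ranking is deliberately crude.

```python
from collections import defaultdict


class TagIndex:
    """Indexes tags by whole phrase and by individual word."""

    def __init__(self):
        self._by_phrase = defaultdict(set)
        self._by_word = defaultdict(set)

    @staticmethod
    def _normalize(text):
        # Case-fold and collapse runs of whitespace.
        return " ".join(text.casefold().split())

    def add(self, item, tag):
        phrase = self._normalize(tag)
        self._by_phrase[phrase].add(item)
        for word in phrase.split():
            self._by_word[word].add(item)

    def search(self, query):
        """Exact phrase matches first, then items whose tags cover every query word."""
        phrase = self._normalize(query)
        words = phrase.split()
        if not words:
            return []
        exact = self._by_phrase.get(phrase, set())
        covered = set.intersection(*(self._by_word.get(w, set()) for w in words))
        return sorted(exact) + sorted(covered - exact)


if __name__ == "__main__":
    idx = TagIndex()
    idx.add("post-1", "Windows Vista")
    idx.add("post-2", "vista")
    idx.add("post-3", "Windows XP")
    print(idx.search("vista"))          # ['post-2', 'post-1']
    print(idx.search("windows vista"))  # ['post-1']
```

The user never needs to know whether the site stored “windows vista” as one tag or two; the cost is that relevance ranking becomes the system’s problem.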

Okay, so you’ve got a chaotic system of user-defined category names, and the chaos bugs you, you want to have some way of organizing it without throwing it out altogether and switching to a keyword system standardized by experts. Isn’t this just another job for a “search engine”?

Dominic Sayers makes the point about mixing up standards for technical implementation and tag descriptor that I wanted to make.

Michal Migurski makes the point about there being no need for a standard implementation between different services I wanted to make. And also the point about getting confused as to what the semantic Web is.

The whole point of tags is that they implement a folksonomy – /. has its haha and its yes, no, maybe. Those are tags defined by Slashdotters and they have meaning in the context of Slashdot. If you take tags out of context they lose their meaning. And the difference between ‘tags’ and ‘tagging’ is obvious to anybody who knows the first thing about the English language. Users are likely to think about their tags in different contexts leading to duplicate tags for each lexeme of a word (this post is about the process of tagging information: tagging, this post is about standardising tags: tags).

They are for me to use as I see fit, my way of ordering information of interest to me. The tag clouds are fun but I’m not really interested in iPhones or ParisHilton. I can deliberately share tags within discipline or interest areas with people who have a shared interest, we can develop our own tags. Connotea is the science tagging space and it is a mixture of exact science tags and a whole range of interesting and frequently irrelevant tags.

There is no such thing as an incorrect tag. It’s really about the user, as an earlier entry implied: the purpose is not to codify information for everyone, just for yourself and those with similar tags.

I would have to be very very bored to look at all the material tagged “Windows Vista”…

the first tag is the ISO code of the language of the document if not english
the following tags are localization (country, city…) if content is localized
pages about software are tagged with a “license-” tag (license-gpl, license-mit…)

Oliver, that convention may be great, but the whole point is only you use it on an everyday basis. So long as others aren’t indexing what they find/like with the same convention, it’s rather pointless since you’ll only ever have your content to look at.

A tagging standard definitely needs to be developed. However, I’ve heard this over and over again. Why doesn’t someone set forth a proposal for a standard instead of writing a long blog post about it? ;).

Regardless, you laid things out quite well.

The only problem I have with tags is dialectal differences. Things like slang and whatnot. Someone might tag something as “cool” to denote that it’s interesting, while someone else might search “cool” to look for something related to coldness or cool colors if you’re thinking Flickr etc. I think it’d be important for a tagging standard to include the approximate latitude and longitude, or some sort of locational information, with each individual tag, denoting where it originated. That way, when searching using tags, you can additionally weight the value of various tags based on the distance of the searcher from the origin of the tag.

I came here looking for a technical standard, since I am about to build a tagging system. I need to decide whether spaces are allowed, allowed characters in general, max length of a tag etc. While the discussion about how to discover equivalent tags is interesting, interoperability on a technical level is step one.

We should start by asking what requirements we want of a tag standard. What is the purpose of tagging? Fast personal categorization, semantics, knowledge sharing, interoperability for future p2p distributed settings? Should tags be used for only describing aboutness of a resource or should we allow other characteristics and qualities? There are many questions to ask. I am currently developing a socio-semantic collaborative tagging system. It lets users create a shared tag ontology and create relations between the tags. http://www.fuzzzy.com. Check it out.

you bring up very pertinent issues.. however, reading your article made me feel that you are mostly covering issues from the company standpoint. Delicious or technorati et al. are agents to help the end user find content – based on the *tags* the end user enters.. this journey from Intent to Content isn’t achieved just by the search engine.. the user is creating the content, the end user, that is. Who is going to teach gazillions of users the nuances of linguistics, noun/verb, commas, delimiters, – or _ .. etc etc.. especially when blogging software makes it so easy for the end user to create multiple tags, and not only that, store them for future use.. I am referring to the “manage tags” link. Unless the *problem* is captured at the source level itself, I don’t see much improvement at all..

Can you imagine teaching end users how to tag their photos and posts – especially when there exist many users who write like: will u b home 2morrow?