On Search: Metadata

In the Web’s early years, the overwhelming favorite among search engines
was Yahoo.
Today it’s Google.
Neither has actually had better text search technology than the
competition.
They won because they used metadata effectively to make
their services more useful.
In this ninth On Search episode, a survey of what metadata is,
where it comes from, and how to use it.

Metadata is technically “information about information” and you
can start a fistfight in the bar at any XML or Content-Management conference
about what’s data and what’s metadata.
In the context of search, metadata is anything that you know about the
documents you’re searching beyond the words they contain.
With
descriptive markup,
it’s easy enough to store a document’s metadata right
inside it (consider HTML’s <META> tag).
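
To make that concrete, here is a minimal sketch of pulling such in-document metadata back out, using Python's standard `html.parser` module. The sample page, its field values, and the helper names `MetaTagCollector` and `extract_meta` are all invented for illustration.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects name/content pairs from <meta> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of the named <meta> fields found in html_text."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.metadata


# Hypothetical sample page, invented for illustration.
SAMPLE_PAGE = """
<html><head>
<title>The Asses of the British Isles</title>
<META NAME="keywords" CONTENT="donkey, ass, equine">
<META NAME="author" CONTENT="A. N. Editor">
</head><body>...</body></html>
"""

print(extract_meta(SAMPLE_PAGE))
```

A search engine that indexed the `keywords` field would find this page for “donkey” even though the word never appears in the title.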

Yahoo ·
Back when everyone searched at Yahoo, the usual result list looked quite a
bit different.
If I typed in “donkey,” before the pointers to Web pages there would
be a few pointers to categories in the Yahoo taxonomy that contained the word
“Donkey.”

This worked really well, because if the Yahoo editor had classified
Diseases of the Horse Family or The Asses of the British
Isles under a donkey-related category, I’d find them even though
“donkey” wasn’t in the title.

In effect, Yahoo maintained one useful piece of metadata about each page
in the engine: What is this about?
This is a real value-add for the searcher.

Google ·
Google, like Yahoo, maintains one key metadata field about each item it
indexes: the well-known PageRank, essentially a measure of how many other
pages point to it.
They make use of it very simply, to order the result list with high
PageRanks at the top.

Conclusions? ·
Google seized search leadership from Yahoo;
can we conclude that it’s more
important to know how popular something is than to know what it’s about?
If you’d told me that ten years ago I would have had a hard time
believing it, but the evidence seems pretty compelling.
Note that Google actually does have some subject metadata via their
integration with the Open Directory Project,
but they don’t push it that hard, and the volunteer-staffed,
highly-political, AOL-semi-orphan ODP is a fairly weak reed to lean on
anyhow.

On the other hand, Google has always been way more focused on search than
Yahoo has, and isn’t always trying to get in front of you with stock
prices and news and weather and so on.
More important, even if it turns out that popularity is the key thing for
Internet search, the Internet is a very special place, and it’s quite
unlikely that popularity is the killer metadatum for the whole universe of
search applications.

I believe, though, in the other obvious conclusion: that the number-one
way to make search work better is to bring some metadata to bear on the
problem.
This really shouldn’t be surprising:
As I’ve discussed
before, it’s really hard to
make search engines act much smarter than they do today.
So instead, let’s reinforce them with externally-supplied metadata.

Where Does Metadata Come From? ·
Those Yahoo and Google metadata offerings, while really quite
different, have one important thing in common: both are expensive.
Yahoo has for years employed a team of editors to sort websites into their
subject hierarchy by hand.
And Google’s immense rooms full of machines humming away computing
PageRanks twenty-four hours a day are a legend in our industry.

In my experience, this is typical. Put another way: There is no cheap
metadata.
Of course, if we could use computers to compute the metadata like Google does,
that would be immensely cheaper than having employees do it.
And a lot of smart people have invested a lot of effort and money into the
problem of deriving metadata from data, but it’s a hard one.
(Still, we should be on the lookout for opportunities; more later.)

Many people in the content-management and knowledge-management trades have
noticed this, and concluded that the trick is to gather metadata upstream.
Remember how Microsoft Word, out of the box, used to pop up a dialog every
time you created a new document and encourage you to provide a little
metadata?
Most people immediately said “Make this go away!” and I don’t think
Word has done this (by default) for years.

Historically, the difficulty of collecting metadata at source has been
generally large enough to outweigh the (potentially huge) benefits from
collecting it.
But I for one am not ready to give up on this approach.
There are, after all, domains where metadata is at the core of the business
proposition, and the process works there.
For example, the editorial staff who produce the Wall Street
Journal add metadata as they go along, identifying people, companies,
stock ticker symbols, and so on.

If You Collect Metadata By Hand ·
The most important lesson I’ve learned is: Don’t try to
collect too much. You might, just might, get people, when they’re
interacting with your intranet, to label their information by project and
title; but more than a couple of fields and people will just bypass the
process.

This is harder than it looks.
When you decide in principle that metadata should be collected, it will
develop that many stakeholders have short-lists of the fields they need
to make this worthwhile.
You can easily end up with a “short” list of a dozen
or more fields that constitute the “absolute minimum” that people
think you must have.
And if you adopt it, you’re dead, because except in special circumstances
(e.g. the WSJ),
people just will not take the time to do this.

Automatic Metadata ·
Obviously, there are some metadata items the computer will give you for
free: a filename, created/modified dates, who created it, what kind of file
(HTML, Excel, PowerPoint), how big it is.
These can be handy for search applications and since they’re free, you
should collect them and make them available.
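
Here is a rough sketch of harvesting those freebies using nothing but Python's standard library. The function name `file_metadata` and the dictionary keys are my own invention. I have left out “who created it” and the creation date because how you get at those depends on the operating system.

```python
import mimetypes
import os
from datetime import datetime, timezone


def file_metadata(path):
    """Return the metadata the filesystem hands over for free."""
    info = os.stat(path)
    # guess_type works from the filename extension only; it does not
    # look inside the file.
    guessed_type, _encoding = mimetypes.guess_type(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        "content_type": guessed_type or "application/octet-stream",
    }
```

Run that over a document tree on a batch basis and you have several searchable fields per file without asking anyone to type anything.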

The second category of machine-generated metadata is what
“Autocategorization” software produces. It comes from companies like
(in alphabetical order) Autonomy, Gavagai, Semio, Stratify, and Vivisimo; they
all promise to take your raw data and either generate or fill-in a subject
taxonomy telling you what it’s about.

Sometimes they work, sometimes they don’t, and sometimes it can be
puzzling figuring out whether they’re going to work or not.
But they are not an exception to the no-cheap-metadata rule; this is
software that’s generally expensive to buy and expensive to deploy.

Don’t Neglect Your Logfiles ·
There’s one kind of automatic metadata that I think doesn’t get the
respect it deserves: the contents of your logfiles.
Here’s the most obvious example: unless you’ve been throwing away
your internal Web server log files, you already know which are the most
popular items on the Intranet.
It wouldn’t be that hard to boil them down (occasionally, on a batch basis,
this doesn’t need to be real-time) and develop your own internal
“PopRank” based on what gets downloaded the most.
It might not be as sexy as PageRank, but if I search the Intranet for
material on expense policies, you can bet I’m going to find a lot, and
if two or three stand out because they’re the ones everyone ends up
reading, you might save a lot of people a lot of time.
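
A minimal sketch of that boiling-down, assuming your server writes the widely used Common Log Format; if yours writes something else, the regular expression will need adjusting. The function name `pop_rank` and the log line in the comment are made up for illustration, and counting only successful GETs is my assumption about what “downloaded” should mean.

```python
import re
from collections import Counter

# Picks the request and status fields out of a Common Log Format line.
# Hypothetical example line:
# 10.0.0.5 - alice [10/Jun/2003:09:00:00 -0700] "GET /hr/expenses.html HTTP/1.0" 200 5120
REQUEST_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def pop_rank(log_lines):
    """Count successful GET requests per path; more hits means more popular."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are malformed or in another format
        if match["method"] == "GET" and match["status"] == "200":
            hits[match["path"]] += 1
    return hits
```

Feed it the lines of a log file and `pop_rank(lines).most_common(10)` gives you the ten most-fetched paths. A real deployment would probably also want to filter out crawlers and collapse repeat visits from one person, which this sketch does not attempt.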

Care, Feeding, and Using ·
Once you’ve got some metadata, since it’s expensive, you should
take good care of it.
This almost always means putting it in a relational database.
As I mentioned above, debates over the meta-ness of data can get religious,
but in practice, I’ve observed that while data itself (for example XML
or video) often resists being forced into rows and columns, metadata usually
lines up happily.
Even ongoing has a little MySQL database sitting off to the side of all the
XML-encoded entries, tracking a bunch of useful facts about them, including
some (e.g. the title) that are replicated inside the data.
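
To show how happily metadata lines up in rows and columns, here is a small sketch. It uses SQLite through Python's standard `sqlite3` module rather than MySQL, purely so that it runs with no server to set up. The table layout, column names, and helper functions are assumptions of mine, not a description of what ongoing actually stores.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS document_metadata (
    uri       TEXT PRIMARY KEY,
    title     TEXT NOT NULL,
    author    TEXT,
    modified  TEXT,               -- ISO 8601 date string
    pop_rank  INTEGER NOT NULL DEFAULT 0
)
"""


def open_metadata_store(path=":memory:"):
    """Open (or create) the metadata database and make sure the table exists."""
    connection = sqlite3.connect(path)
    connection.execute(SCHEMA)
    return connection


def upsert_metadata(connection, uri, title, author=None, modified=None, pop_rank=0):
    """Insert a row, or overwrite the existing row for the same URI."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            "INSERT OR REPLACE INTO document_metadata "
            "(uri, title, author, modified, pop_rank) VALUES (?, ?, ?, ?, ?)",
            (uri, title, author, modified, pop_rank),
        )


def find_by_author(connection, author):
    """Return (uri, title) pairs for one author, most recently modified first."""
    rows = connection.execute(
        "SELECT uri, title FROM document_metadata "
        "WHERE author = ? ORDER BY modified DESC",
        (author,),
    )
    return rows.fetchall()
```

Because the dates are stored as ISO 8601 strings, sorting them as text also sorts them chronologically, which keeps the query simple.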

And of course you’ll want to put this goodness to work.
One obvious way is to have a query screen, so that people can search for
resources by author, date, title, and so on, not just brute-force full-text.
But what you’d really like is to learn from Yahoo and Google, and have
the metadata just there, silently helping.
For example, to use in ranking your results.
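
One way that silent help might look, as a sketch only: take whatever relevance scores the full-text engine produces and nudge them with a popularity count. The function name `rank_results` and the particular weighting are my own assumptions; there are many reasonable ways to blend the two signals, and I make no claim that this resembles what Yahoo or Google do.

```python
import math


def rank_results(text_matches, popularity):
    """Order full-text hits so that popular documents float toward the top.

    text_matches: dict mapping URI -> text relevance score (higher is better)
    popularity:   dict mapping URI -> download count (missing means zero)
    """
    def combined_score(uri):
        # log1p damps the popularity signal so that a hugely popular page
        # cannot completely drown out text relevance.
        return text_matches[uri] * (1.0 + math.log1p(popularity.get(uri, 0)))

    return sorted(text_matches, key=combined_score, reverse=True)
```

The searcher never sees the popularity field and never has to fill in a query form; the result list just comes back in a more useful order.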

Another thing you could do <commercial-plug>is call up
Antarctica; our Visual Net product
takes metadata and gives search a Graphical User Interface just like your
personal computer has.</commercial-plug>

In the API ·
This means that if you’re going to design an API for a search engine
(something I plan to do eventually in this series) you’re going to need
to include entry-points not just for searching and adding words to the
full-text index, but also for adding, maintaining, and using the metadata
that drives the search.
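
Just to show the shape of the thing, here is a toy in-memory sketch with both kinds of entry point side by side. Everything in it (the class `TinySearchEngine`, its method names, the exact-match filtering) is hypothetical and deliberately naive. It is not the API this series will eventually propose.

```python
from collections import defaultdict


class TinySearchEngine:
    """Toy in-memory engine with entry points for both words and metadata."""

    def __init__(self):
        self._postings = defaultdict(set)   # word -> set of URIs
        self._metadata = defaultdict(dict)  # URI -> {field: value}

    def add_document(self, uri, text):
        """Add a document's words to the full-text index."""
        for word in text.lower().split():
            self._postings[word].add(uri)

    def set_metadata(self, uri, **fields):
        """Add or update metadata fields for a document."""
        self._metadata[uri].update(fields)

    def get_metadata(self, uri):
        """Return a copy of the metadata stored for a document."""
        return dict(self._metadata.get(uri, {}))

    def search(self, word, **required_fields):
        """Return URIs containing the word whose metadata matches every filter."""
        hits = self._postings.get(word.lower(), set())
        return sorted(
            uri for uri in hits
            if all(self._metadata.get(uri, {}).get(field) == value
                   for field, value in required_fields.items())
        )
```

With this in hand, `engine.search("travel", project="hr")` would return only the travel documents labeled with the `hr` project, which is the kind of query a words-only API cannot express.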

The Web and the Semantic Web ·
One of the Web’s distinguishing features is that there’s a big
gaping hole where the metadata ought to be.
The Web has resources, identified by URI, and you can ask for
“representations,” which come with some metadata, but the metadata is
about the representation, not the resource.
This is probably a bit abstract for those who don’t
wrestle professionally with
Web Architecture, so an example’s in order:
Suppose you read an online news story from your desktop computer at 9AM.
You get a Web page with some metadata telling you that it’s in HTML and
is in English and ISO-8859-1-encoded and can’t be cached and so on.
Suppose, at noon, on the road, you hit the same story from the
minibrowser in your cellphone.
The server cleverly notices this is a small-screen device and sends the same
information in WAP or simplified HTML or some such thing, with metadata
saying what it is (which is completely different from the metadata you got
with the PC Browser version).

So, given a URI, the Web has no built-in way to ask questions about it,
for example “What is this about?” or “When does it expire?”
or “Is this suitable for children?” or “Is this good?”

The Semantic Web project is trying to make the whole Web smarter and more
machine-readable, and obviously this is never going to happen without
metadata.
So a lot of really smart people are working hard to develop good ways to
encode, organize, and interchange metadata keyed by URIs.
Of course, these people’s dreams aren’t about mere search, they’re
about managing your schedule and your medical treatments and your shopping
and your supply chain.
All of which is fine; but if the Semantic Web ever takes off, there is going
to be a whole lot more metadata available about a whole lot of stuff.

As a side-effect, I expect that all the search services of the world will
become a lot richer, a lot smarter, and a lot more fun to use.
But we’re not there yet.

A Word On Our Sponsor ·
This is a sponsored essay. It is brought to you by
the local power
company, who arranged a complete power failure in Antarctica’s offices
this afternoon, so I took advantage of battery power to type this in.
Power’s back, it’s back to work we go.