Information Retrieval, Or: How I Stopped Worrying And Learned To Love The Data

By Steve Gerencser on March 5, 2012

Lately people have become more and more aware of just how much information is collected about them. Leading the pack is Google. Ad re-targeting, personalized web search, instant pairing translations, and universal search are all ways that Information Retrieval is being used and studied to improve Google’s products. But Google certainly isn’t the first, simply the most honest about doing it.

IR is used all around us every day to analyze, interpret and suggest products and actions. From grocery stores to search engines, it shows up in ways you would never expect.

IR: It’s Older Than You Think

Information Retrieval was first described in an essay by Vannevar Bush titled Mechanization and the Record in 1939. He theorized that a machine could be used to achieve a higher level of knowledge organization by combining lower level technologies. By using a piece of furniture he dubbed the memex, multiple screen viewers and microfilm reels could be searched quickly and easily to recover the information the user was searching for.

In July 1945, an expanded version of Mechanization and the Record, titled As We May Think, was printed in The Atlantic. The ideas presented at the time were simple compared to current technology, but that one paper held the seed of hypertext: it let users link pages of information, creating a map of the user’s thought process. As We May Think can legitimately be called the true origin of the internet that arrived nearly 45 years later, regardless of Al Gore’s comments.

Google, Learn All That Is Learnable

In 1998 two Stanford University graduate students published The Anatomy of a Large-Scale Hypertextual Web Search Engine. With this paper, and their creation known as Google, Sergey Brin and Larry Page stepped out to the forefront of information retrieval. Google was originally created to provide a better way to gather, organize and make available the millions, eventually billions, of documents on the World Wide Web.

The core of Google’s success was the use of human signals to help determine the relevance and accuracy of the search results. Links became the key signal about whether a site was worthy or not because they required more than just the site owner to say “this is a good page”. As Google’s index grew, it became more and more capable of learning what the user wanted nearly as quickly as the user realized they wanted it.
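To make the idea of links as votes concrete, here is a minimal sketch of link-based ranking in the spirit of the PageRank idea from Brin and Page's paper. It is a simplified illustration, not Google's actual implementation: the function name, the damping value of 0.85, the iteration count and the four-page example web are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified link-based ranking by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline rank regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: nobody links to "orphan".
example_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(example_links)
```

In this toy web, "home" ends up with the highest score because every other page links to it, while "orphan" is left with only the baseline score because nothing links to it. The site owner's own opinion of the page never enters the calculation.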

The second key to Google’s success was its voracious appetite for data. The index infrastructure was created from the beginning to scale as quickly as possible. This allowed the search engine to gather as much data as possible as quickly as possible and made other search engines scramble to keep up. Magically, as Google gathered more data, their results got better.

Artificial intelligence requires data. As the amount of data increases it becomes easier and easier to see patterns in actions and logic and that allows a machine to make better decisions. To say that Google’s algorithms are intelligent may be a stretch, but they are closer to true machine intelligence than nearly anything else available in the public sector. What can be said is that the sheer volume of data in Google’s servers made it the smartest search engine on the internet. It knew nearly everything.

Would You Like One Of Our Frequent Shopper Cards Ma’am?

Long before Google was even a dream, grocery stores were at the forefront of retail IR. Often referred to as frequent shopper cards or discount cards, these simple cards are used to tie all of the purchases from a single shopper together. Over and over the data collected by stores was used to refine their sales process and to target their customers with ever more accurate offers.

The information gathered isn’t tied to just a frequent shopper card. Don’t use one? If you use a credit card your data is also tagged to your shopper ID. Use a credit card one time and cash with your frequent shopper card the next? Again, your data is tied together.
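How might purchases get tied together across payment methods? The sketch below is one hypothetical way to do it, not a description of any retailer's real system. It assumes that each transaction records whatever identifiers were presented (a loyalty card number, a credit card number, or nothing for anonymous cash) and that any two identifiers seen in the same transaction belong to the same shopper. The field names and sample data are invented for illustration.

```python
from collections import defaultdict


def link_shoppers(transactions):
    """Group purchases into shopper profiles using shared identifiers.

    Hypothetical approach: identifiers that appear together in one
    transaction are merged into the same profile (union-find).
    """
    parent = {}

    def find(identifier):
        # Walk up to the representative identifier for this shopper.
        parent.setdefault(identifier, identifier)
        while parent[identifier] != identifier:
            parent[identifier] = parent[parent[identifier]]
            identifier = parent[identifier]
        return identifier

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # First pass: merge identifiers that were used together.
    for txn in transactions:
        ids = txn["identifiers"]
        for other in ids[1:]:
            union(ids[0], other)

    # Second pass: collect the items bought under each merged profile.
    profiles = defaultdict(list)
    for txn in transactions:
        ids = txn["identifiers"]
        if not ids:
            continue  # anonymous cash purchase, nothing to link
        profiles[find(ids[0])].extend(txn["items"])
    return list(profiles.values())


# Invented sample data for illustration only.
example_transactions = [
    {"identifiers": ["card:1234"], "items": ["diapers"]},
    {"identifiers": ["loyalty:A7", "card:1234"], "items": ["lotion"]},
    {"identifiers": ["loyalty:A7"], "items": ["vitamins"]},
    {"identifiers": ["card:9999"], "items": ["bread"]},
    {"identifiers": [], "items": ["gum"]},
]
profiles = link_shoppers(example_transactions)
```

Here the credit card purchase, the purchase with both cards, and the cash purchase with the loyalty card all land in one profile, because the second transaction connects the two identifiers. The other credit card stays in its own profile, and the anonymous cash purchase is not linked to anyone.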

A recent article about Target in the New York Times clearly illustrates just how much data is being mined and used to predict shopping habits. Andrew Pole works at retail shopping giant Target as a statistician. His assignment was to sift through the vast amount of data Target had acquired over the years to find a way to predict when shoppers were about to become new parents. People are creatures of habit, so it takes a life-altering event to make a shopper change their habits, and few things are more life altering than having a child.

An incident in the article demonstrates just how accurate information retrieval can be when paired with other types of research such as habit formation. In a mailer designed to target potentially pregnant women, a 15-year-old girl received a collection of advertisements for maternity clothing, nursery furniture and pictures of smiling babies. Her irate father accused Target of trying to encourage his daughter to become pregnant, only to admit a few days later that he had just found out that she was pregnant. Target knew before he did.

If You Don’t Want Us to Know About It, Don’t Do It

In 2009 the CEO of Google admitted that Google’s real goal was to know everything about everyone and everything when he said, “If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.” This was really the first time a major corporation openly admitted that they would gather any and all data they could gather on everyone they came into contact with.

In the past, companies such as Target held the fact that they collected and stored all of their customer data as a closely guarded secret. They understood that if someone suddenly got a stack of pregnancy related advertising before they had told anyone that they were pregnant, they may justifiably become upset and never shop there again.

Google doesn’t seem to care about public perception in many cases. The lofty goal of gathering all of the world’s data outweighs any potential backlash from people concerned about security. This attitude goes even further and challenges many laws around the world in an effort to do what their corporate culture feels is right.

This came into sharp focus again on March 1st when Google presented the unified privacy policy across their entire collection of corporate holdings. Previously, a user of a Google product could opt to not share their user data across Google’s collection of platforms. Want to use Google Analytics but not share your site’s data with the AdWords group? You could opt out. This is no longer the case. If you choose to use one Google product, all Google products will have access to your habits and data.

Has Anything Really Changed?

We now have privacy rights groups, government agencies and private users up in arms about how their information is gathered and used. But has anything really changed?

Companies have been collecting information about us for decades. But they have always been very careful to keep that fact a secret. When people know that they are being watched, they behave differently, unless you happen to be a reality TV star and live with cameras in your life 24-7. The only difference is that the curtain has been pulled back and people have become far more aware of just how important their personal data and habits are to companies. We are more aware that we are being watched and cataloged, and thanks to Google, companies are beginning to admit that they need this information to provide us with better services.

Learn To Love the Data

To think that we can somehow turn back the clock and put the IR genie back in the bottle is fanciful at best. Instead, we should be more aware of how much value our data has and leverage that knowledge for better services from the companies that we choose to shop with and work with. Google Analytics and the multiple keyword research tools are fantastic tools when you use them to their full extent.

Some companies already understand that the information playing field has tilted again and are quickly becoming the new game changers. Wolfram|Alpha is at the leading edge of this newest change. Wolfram|Alpha Pro allows you to analyze your data, any data in nearly any format, using their hardware and software. No longer is IR analysis the private domain of corporations with buildings full of computers. Anyone can do their own research using any data they can gather.

The world has changed again and rather than demanding that companies such as Google stop gathering our data, we should be demanding that they give us better access to the data they have harvested from us. We should demand that they level the playing field. Instead of a closed repository where they dole out the data they deem necessary, we should be getting full, unrestricted access to that data. It is ours, after all.

Steve built his first website in 1995 (It's actually still out there). In the years since then, he's transitioned from web development to direct marketing, through SEO and paid traffic, and specializes in helping website and business owners find the ROI in their business. When not helping clients at Steam Driven Media, you can find him creating works of art in wood or pulling a tractor out of the mud.

Hmmm… Thanks, Steve. I came here hoping for anti-NSA kryptonite but instead, if I am hearing you correctly, I have learned that any mad-scientist can become a self-made NSA operative in the comfort of his mother’s basement?…Especially given the age of this article… 🙂