Xerox Helps Google Fill in Some Search Gaps: From Pre-Web to Post Panda

If Google had launched in the early 90s, it might have come out with technology that could be used to search some of the electronic databases of the day, prior to the World Wide Web, such as Lexis or Dialog. It would have developed ways to visualize results from those systems in useful ways, and custom user interfaces. It might have developed a progress bar that would show you that your search was taking place, and the system hadn’t failed, back when searches took more than milliseconds.

If Google got its start before a WWW had a place in front of its name in a browser address bar, it might have developed very similar technology to what it’s working on today, but with a slightly different approach that can be sensed when reading through a number of Web-based patents from a company like Xerox.

Google was assigned 94 patents from Xerox, 90 granted and 4 pending, as indicated by an assignment recorded by the United States Patent and Trademark Office (USPTO) last week, on February 16th, 2012. The execution date of the assignment is November 10, 2011. The USPTO assignment database doesn’t include any details of the transaction, such as financial terms.

My last post linking Google and Xerox together was titled “Xerox Brings Patent Infringement Suit Against Google, Yahoo, and YouTube.” A look at the PACER records for the case (1:10-cv-00136-UNA) in US District Court for the District of Delaware shows it being closed on December 15th, 2011. The case docket includes a stipulation between Google and Xerox dismissing Google from the case on 11/11/11, the day after the assignment of these patents was executed. It appears that the assignment of the patents might have been related in some way to the stipulation, though the patents Xerox claimed were being infringed upon by Google and YouTube weren’t included in the assignment.

While the patent filings include a number outside of search and information retrieval, such as a few involving handheld devices, printing over a network, distributed networking systems, optical character recognition, and workflow processes, many of the patents do seem related to search based services that Google provides.

A number of the patents involved focus upon reviews and collaborative filtering of those reviews, caching of webpages in part and in whole, managing online documents, and what seems to be a large family of patents with the same or similar names that focus upon comparing and determining the quality of documents. Reading through a number of those, I was reminded that today is the one-year anniversary of Google’s announcement of their Panda algorithm.

The patents that focus upon document quality could potentially influence some aspects of the quality scoring of web pages that might be classified based upon an algorithmic machine learning approach such as Panda. Here’s the abstract from one of those patents:

Text, images, and/or graphics of electronic documents should be organized and laid out in a two-dimensional format for presentation to the viewer. The best such layout depends upon the content present, the creator’s intent, the output device, and the viewer’s interests. To analyze the qualitative nature of the layout in quantifiable terms, the electronic document is measured using various quantifiable factors; such as, balance, uniformity, white space management, alignment, consistency, legibility, etc.; that impact a qualitative nature of a document. Such quantifiable factors are then used to quantize the aesthetics, ease of use, eye-catching ability, interest, communicability, comfort, and convenience of the document.
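To make the idea in that abstract a little more concrete, here’s a minimal sketch of how a few of the named factors (white space, balance, alignment) might be turned into numbers and combined. This is my own illustration, not the method claimed in the patent: the Box representation, the three formulas, the 50% white space target, and the weights are all assumptions chosen for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangular content block on a page (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def white_space_fraction(boxes, page_w, page_h):
    """Fraction of the page not covered by content (assumes boxes don't overlap)."""
    covered = sum(b.area for b in boxes)
    return max(0.0, 1.0 - covered / (page_w * page_h))


def horizontal_balance(boxes, page_w):
    """1.0 when the area-weighted centre of the content sits on the page's
    vertical midline, falling linearly to 0.0 at either edge."""
    total = sum(b.area for b in boxes)
    if total == 0:
        return 1.0
    centre = sum((b.x + b.width / 2) * b.area for b in boxes) / total
    return 1.0 - abs(centre - page_w / 2) / (page_w / 2)


def alignment(boxes):
    """Share of boxes whose (rounded) left edge matches the most common left edge."""
    if not boxes:
        return 1.0
    lefts = [round(b.x) for b in boxes]
    most_common = max(set(lefts), key=lefts.count)
    return lefts.count(most_common) / len(lefts)


def layout_score(boxes, page_w, page_h, weights=(0.4, 0.4, 0.2)):
    """Weighted combination of the three factors, in the range 0.0 to 1.0.

    The weights and the 50% white space target are arbitrary assumptions.
    """
    ws = white_space_fraction(boxes, page_w, page_h)
    ws_score = 1.0 - abs(ws - 0.5) / 0.5  # best at exactly half the page empty
    factors = (ws_score, horizontal_balance(boxes, page_w), alignment(boxes))
    return sum(w * f for w, f in zip(weights, factors))


if __name__ == "__main__":
    centred = [Box(25, 0, 50, 50), Box(25, 50, 50, 50)]
    lopsided = [Box(0, 0, 50, 100)]
    print(layout_score(centred, 100, 100))
    print(layout_score(lopsided, 100, 100))
```

The point of a sketch like this is only that each factor can be computed from the page itself, without reference to a set of example pages. A real system would presumably use many more factors and far more careful definitions.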

I haven’t had the chance to read through all of these and pick them apart, and will probably be doing that as time permits, but I thought that might be easier with more eyeballs on the patent filings. Here are the granted and pending patents that were included in the USPTO assignment:

Takeaways

Google has been acquiring a large number of pending and granted patents from other companies in the past couple of years. A number of those covered a very wide range of technologies, from sensor technology for driverless cars, to fiber optics networking processes and devices, to computer and database architecture, and more.

This acquisition seems a little more focused upon some of the core search technologies that Google is best known for, from some fairly old patents still focused upon search, to some newer patents that might help Google with its move towards improving its processes for reviews and recommendations and determining quality scores for documents on the Web. For anyone interested in how Google is evolving towards machine learning processes to rank web pages, there can be some value in spending some time going through these patents.

A machine learning system is often only as good as the data set that it uses to start out with. What I liked about a number of the patents involved in this transaction, like the document quality ones, is that they set out some baselines for defining quality that wouldn’t be so dependent upon different seed sets of “quality” pages. Without those, I think you run a greater risk of lowering the rankings of pages that don’t fall close enough to the mold of the sites you included in your seed set, yet which might still provide quality content, and a quality user experience.

A lot of the patents listed relate to document management systems, including pages, images, vector spaces, etc., which seems to be Xerox’s domain. That makes sense, since these patents came from a company that has had a huge impact on digital production hardware like printers and photocopiers. Xerox, next to Adobe, is one of the most influential software developers.
What I am trying to say is that Google needs to use its trade partners, like Xerox, to improve its services.

Many of these patents are indeed a way of looking at documents in a manner that is very different from how Google might when trying to analyze them for search. I think the very different approach adds a level of sophistication that Google hasn’t had the chance to develop. It’s not really certain how Google might use them, but with approaches like Google’s Panda update, it seems the search engine is focused upon understanding how the layouts of pages might influence how people view and use them.

I wish Google well. I hear from everywhere that they are trying to help people design a new, better digital world, and many voices say Google is trying to fight SEO spammers. But I’m not sure the changes they have been making won’t bring them more of a mess. They are trying to make searches more targeted and personalized, which appears to work against them.

Every algorithm change and every new ranking approach usually has some way that it might be manipulated and abused by people looking to do so. The way to try to combat that is to raise the cost of manipulation in time, expense, and effort to the point where spamming becomes more expensive than not spamming.

[…] intellectual property from some of the most well known names in the technology field (including Xerox, IBM, Hewlett Packard, and other acquisitions), by acquiring 36 granted patents from Unisys […]