Full-text Patent Search: Is it a choice between quality and quantity?

In previous articles, we’ve looked at the role that bibliographic (front page) data has to play in patent research and I’ve highlighted some of the challenges database producers face in terms of improving the quality of the raw data through standardization, normalization and error correction.

Now I want to look beyond the front page and explore why full-text is both growing in importance and increasing in scope. I’ll conclude by pointing out some features that users should look for in full-text patent research solutions, along with the potential pitfalls inherent in searching full-text.

Let’s begin with a definition. As someone who has specialized in this subject for over 30 years, I regard full-text as the entire document, including the bibliographic information, Title, Abstract, Description, Claims and Citations, but excluding any drawings, figures and diagrams. It does, though, include any text “embedded” within elements such as tables and charts and, where appropriate, Examiner Search Reports.

We take full-text patent databases for granted these days, but we only have to go back a little more than a decade to find that full-text collections were very rare indeed. Interestingly, LexisNexis (or, more accurately, its predecessor, Mead Data Central) launched the world’s first online full-text patent database. LexPat™, as it was known, was launched in 1983 and at the time consisted of US (issued) patents back to 1975. It was advertised in the ABA Journal as “A patent attorney’s dream come true”.

While other versions of US granted patents would emerge, there was a gap of over 15 years before the first batch of non-US (i.e. international) full-text patent databases became widely available. MicroPatent, then owned by IHI Holdings, created a searchable full-text file of PCT (Patent Cooperation Treaty) patent applications from 1978 onwards, using Optical Character Recognition (OCR) technology. While this was a major step forward, OCR was then, and still is, an imperfect technology. The scanning process introduced errors into the data, so that characters, punctuation, paragraphs and other “mark-up” weren’t always interpreted correctly. This could lead to incorrect documents being returned or, more importantly, to correct ones not being found. Moreover, the WO/PCT file was published in several languages, including English, French and German (subsequent rule changes mean that PCT publications can now be in any one of 10 languages), so anyone searching the file using only English search terms ran the risk of missing a critical document. Nevertheless, the creation of the WO/PCT full-text database by MicroPatent was a huge advance, and the company’s decision to license the file to two of the scientific community’s most important online hosts, Dialog and STN, paved the way for others to follow.
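To make the OCR problem concrete, here is a minimal sketch in Python of how a single misread character can hide a relevant document from an exact keyword search, and how a tolerant comparison can recover it. The sample sentences, the function names and the 0.8 similarity threshold are my own illustrative assumptions; this is not how MicroPatent or any other database producer actually handled the problem.

```python
from difflib import SequenceMatcher


def exact_match(term: str, text: str) -> bool:
    """Return True only if the term appears as a whole word in the text."""
    return term.lower() in text.lower().split()


def fuzzy_match(term: str, text: str, threshold: float = 0.8) -> bool:
    """Return True if any word in the text is similar enough to the term.

    The threshold is an assumed value chosen for illustration; a real
    system would need to tune it to avoid false hits.
    """
    term = term.lower()
    return any(
        SequenceMatcher(None, term, word).ratio() >= threshold
        for word in text.lower().split()
    )


# Hypothetical text as printed, and the same text after OCR has
# misread the letter "m" as the pair "rn" (a classic OCR confusion).
clean_text = "a semiconductor device comprising a gate electrode"
ocr_text = "a serniconductor device cornprising a gate electrode"

found_in_clean = exact_match("semiconductor", clean_text)
found_in_ocr = exact_match("semiconductor", ocr_text)
recovered_in_ocr = fuzzy_match("semiconductor", ocr_text)

print(found_in_clean, found_in_ocr, recovered_in_ocr)
```

The exact search finds the term in the clean text but not in the OCR text, which is precisely the "correct document not found" risk described above. The tolerant search recovers it, at the cost of possibly returning near-miss words that a searcher would then have to review.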

Over the next decade, more and more full-text patent databases became available, most of them created by the privately owned Dutch company Univentio BV, which was acquired by LexisNexis in 2005. Univentio released another, richer version of the WO/PCT file, followed rapidly by full-text databases of European Patent Office applications and granted patents, French, German and other collections, as well as what I believe remains the only database of British granted patents. The original LexPat database was extended back to 1836, and US published applications came online in 2001. Moreover, Univentio included English machine translations of the full-text from French, German, Spanish and other languages. Today, there are more than 30 full-text patent databases available on platforms such as TotalPatent®, and the quality of both the OCR and the machine translations has improved markedly since those early days.