Over
the last few years, there has been an explosion in access to online
files containing the complete texts of patents. These types of
files present some unique advantages to both regular and infrequent searchers,
but they also have some significant pitfalls for the unwary.
This short study considers the content of files containing the full texts
of patent applications published under the Patent Co-operation Treaty
(PCT).

OPERATION OF THE PATENT CO-OPERATION TREATY

Before discussing
the files, it is useful to understand how the PCT operates in general terms.
The PCT is administered by the International Bureau (IB) of the World Intellectual
Property Organization (WIPO). Despite the fact that published PCT documents
are often loosely referred to as "world patents," this term is highly misleading.
It is not part of the role of the IB to grant patents at all—this is still
handled by the various national and regional patent offices around the
world. The function of the IB is to try to make the patent system more
accessible to all countries, by providing a common "front-end" to the patenting
process. Once the Bureau deals with the initial formalities, these so-called
"international patent applications" may be forwarded to one or more national
offices for the second stage, namely substantive examination that may lead
to a granted patent.

One of the striking
aspects of the PCT is its multilingualism. This is a deliberate policy,
in order to help the would-be patentee to defer the considerable costs
of translating an application into the languages required by the world's
examining offices.

FILINGS CHRONOLOGY

The normal sequence
of events for filing an application under the PCT, using the so-called
Chapter I procedure, is as follows:

1. Initial application—usually
in the applicant's home country in their national language.

2. At 12 months—file
an international patent application under the PCT, again in the national
language, at a "national receiving office," usually a department within
the national patent office.

3. At 18 months—the
international patent application is published, in one of seven official
languages (English, French, German, Japanese, Russian, Spanish, or Chinese).

4. At 20 or 21
months—transfer to national patent offices for further processing to grant.

Stages (1) and
(2) are both entirely confidential, but stage (3) generates a published
document, which is the source material for the full-text files under discussion
here. These documents are numbered in an annual series, with a two-letter
"WO" prefix identifying the publishing authority, WIPO; hence the first
document published in 1999 was WO 99/00001. The WO documents are, broadly
speaking, indicative
of an applicant's desire to obtain patent protection for his or her invention
in a large number of countries around the world. As such, they have become
a highly significant source of early warning intelligence for the patent
searcher.
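
The numbering scheme just described can be sketched in a few lines of
Python. This is only an illustration of the "WO yy/nnnnn" pattern quoted
above: the function names are hypothetical, and the rule for expanding the
two-digit year (78 to 99 read as 19xx, anything lower as 20xx) is an
assumption based on the files discussed here starting in 1978.

```python
import re
from typing import NamedTuple


class WONumber(NamedTuple):
    """A WO publication number split into its year and serial parts."""
    year: int
    serial: int


# Matches the style used in the text, e.g. "WO 99/00001" or "WO 00/79858".
_WO_PATTERN = re.compile(r"^WO\s*(\d{2})/(\d{5})$")


def parse_wo_number(text: str) -> WONumber:
    """Parse a two-digit-year WO number (hypothetical helper)."""
    match = _WO_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a WO yy/nnnnn number: {text!r}")
    yy = int(match.group(1))
    serial = int(match.group(2))
    # Assumption: years 78-99 belong to the 1900s, 00-77 to the 2000s.
    year = 1900 + yy if yy >= 78 else 2000 + yy
    return WONumber(year, serial)


def format_wo_number(number: WONumber) -> str:
    """Render a WONumber back into the 'WO yy/nnnnn' style."""
    return f"WO {number.year % 100:02d}/{number.serial:05d}"


if __name__ == "__main__":
    first_of_1999 = parse_wo_number("WO 99/00001")
    print(first_of_1999, format_wo_number(first_of_1999))
```

Because the serial restarts each year, the serial of the last number
allocated in a year gives an upper bound on that year's publications.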

There are two points
in the process where an applicant may be required to file a translation
of his or her patent application. This will depend upon whether the applicant
has one of the seven PCT languages as his or her own national language.
Consider, for example, a Swedish company that intends to obtain a patent
in Japan via the PCT. At stage (1), its original ("priority") filing will
be in Swedish. At stage (2), it is again permitted to use Swedish in order
to lodge its international application, but by the time it reaches stage
(3), it must provide a translation into one of the official publication
languages; typically, this would be into English. Eventually, if it decides
to proceed to full examination, it will be required to provide a further
translation, this time into Japanese, so that the Japanese Patent Office
can proceed with the case.

By contrast, the
situation for applicants in states that already use one of the seven languages
of the PCT as their official language is somewhat simpler. For example,
an Austrian company would lodge a priority document in German (stage 1),
followed by the same text at stage (2), with the same text publishing at
stage (3), and it would only be faced with any translation costs when it
came to stage (4) when, in common with the Swedish applicant, it would
be obliged to provide a Japanese translation if it chose to proceed in
Japan.
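
The Swedish and Austrian examples can be condensed into a small sketch.
It is a simplification under stated assumptions, not a statement of PCT
rules: the function name is hypothetical, English is taken as the default
publication language for applicants outside the seven, and a national-phase
translation is assumed to be needed only when the target office works in a
language other than the publication language.

```python
# The seven official publication languages named in the text.
PUBLICATION_LANGUAGES = frozenset(
    {"English", "French", "German", "Japanese", "Russian", "Spanish", "Chinese"}
)


def translations_required(filing_language, target_office_language,
                          publication_choice="English"):
    """List (stage, language) pairs at which a translation must be filed.

    Hypothetical helper modelling only the two translation points
    described in the text: publication (stage 3) and entry into a
    national office (stage 4).
    """
    if publication_choice not in PUBLICATION_LANGUAGES:
        raise ValueError(f"{publication_choice} is not a publication language")

    steps = []
    if filing_language in PUBLICATION_LANGUAGES:
        # The filing text can be published as it stands.
        publication_language = filing_language
    else:
        publication_language = publication_choice
        steps.append(("stage 3", publication_language))

    # Assumption: no further translation if the office reads the published text.
    if target_office_language != publication_language:
        steps.append(("stage 4", target_office_language))
    return steps


if __name__ == "__main__":
    print("Swedish applicant: ", translations_required("Swedish", "Japanese"))
    print("Austrian applicant:", translations_required("German", "Japanese"))
```

The Swedish applicant faces two translations and the Austrian applicant
one, which is the disparity the next paragraph turns to.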

This disparity
in processes makes it important to distinguish between the language used
at stage (2), which is referred to as the "filing language" of the PCT
application, and that used at stage (3), which is the "publication language"
of the actual WO-prefix documents which enter the databases. Unfortunately,
some of the earlier attempts to provide searchable files of PCT documents
did not retain this distinction, and merely created a single LA (language)
field that was effectively useless. There are several candidate files for
this survey, referred to in the following discussion by the file number
in the first column of the accompanying table.

In addition to
MicroPatent, Dialog, STN, and Questel*Orbit, Aurigin Inc. has announced
the availability of an intranet file of PCT specifications under its Aureka
system, apparently covering from 1978 to the present, and further files
are believed to be pending from Delphion and Univentio, but no further details
are known at this time.

FILE PREPARATION

A key issue in
considering these files is the question of how complete they are for the
claimed time span. This in turn relates to the method by which the file
data are collected and turned into machine-readable files.

MicroPatent, Dialog,
and STN all use a dataset originated by MicroPatent LLC, which is prepared
by Optical Character Recognition (OCR) scanning of original paper texts.
This has two immediate consequences. From the point of view of searchability
of the resultant text files, OCR is a vastly improved technology, but is
still far from 100% perfect. The most notable handicap is in the handling
of tabular data. Patent specifications frequently use tables to illustrate
test results or comparative examples, and it is quite common for the
OCR version of these tables to be so mangled that the individual words
within the table columns become unusable as discrete search terms. This
is a pity, as column headings or cell entries may contain potentially valuable
text, not used elsewhere in the specification.

The second principal
cause for concern is the treatment of non-Roman scripts. As outlined above,
some of the published PCT applications are issued in Japanese, Chinese,
or Russian. As far as I know, MicroPatent does not attempt to include these
texts in its file at all. The searchable elements for these records are
thereby reduced to basic bibliographic fields, including a title and an
abstract. All of the benefit of increased recall inherent in a full-text
file is lost for these records. The resultant file contains only full texts
for documents in English, French, German, and Spanish. The Bluesheet for
Dialog's load of the data claims that "additional content is provided by
WIPO from 1997 forward," but gives no details about whether that content
relates to additional full texts or simply abstracts and bibliographic
records.

In July 2001, Aurigin
announced that it had completed its own OCR version of the PCT specifications,
back to 1978. A press release stated that the file would "extend full-text
searchability to all English-language publications since 1978. Currently,
Aureka provides full-text searchability to all (English) publications since
1993, and about 45% of pubs in the range 1978-1992." (my emphasis). From
this, it is not clear whether any non-English specifications have been
scanned at all. If not, it leaves substantial gaps in the file over both
time periods.

FILE COMPLETENESS

It is by no means
straightforward to derive information concerning how complete these files
are. For example, the accompanying table shows the official statistics
issued by WIPO concerning publication under the PCT during 2000.

However, examination
of the paper PCT Gazette for 2000 clearly shows that the highest allocated
number was WO 00/79858, i.e. a total of 79,858 published documents, which
is 89 fewer than the official WIPO figure. What has happened to the
missing 89 documents? It sometimes occurs that a WO
publication number is allocated but that the document is withdrawn from
publication at the last moment. If this publication number is included
in the number sequence but classed as "non-published," it would result
in the highest publication number being higher than the total of the publication
statistics. But what is found is the reverse—the publication numbers are
lower than the publication statistics. One possible explanation is that
the publication statistics include a number of delayed search reports (carrying
the Kind of Document code suffix A3 but the same publication number as
the earlier corresponding specification). But this explanation is not supported
by the WIPO statement that specifically refers to the figures as "WO pamphlets,"
i.e. the entire specifications. Despite approaches to WIPO, at the time
of writing no satisfactory explanation for the missing 89 documents has
been found.

However, the mystery
deepens on examining the contents of the STN version of the MicroPatent
file. The individual statistics by publication language were derived from
a set of search statements "2000/PY AND CC/LA," in which "CC" was replaced
by the corresponding two-letter code for publication language. Taken together,
these add up to 78,492 documents (1,455 short of the official WIPO total,
Grand Total 1). However, using the single search term "2000/PY" results
in yet another figure, of 79,856 (Grand Total 2, still 91 short, but only
2 different from the total derived from publication numbers). The most
serious shortfall is clearly in the Japanese language documents, where
at least 1,250 appear not to have even a bibliographic entry in the file.
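
The competing totals are easier to compare when laid side by side. The
following sketch simply re-does the arithmetic quoted above. The official
WIPO total is not typed in from the table but derived from the text
(highest WO number plus the 89 "missing" documents), and the two-letter
language codes used to build the search statements are an assumption;
the codes the STN file actually uses may differ.

```python
# Figures quoted in the text for PCT publications in 2000.
HIGHEST_WO_NUMBER = 79_858        # WO 00/79858 in the paper PCT Gazette
MISSING_FROM_SERIES = 89          # gap between the Gazette and WIPO statistics
OFFICIAL_WIPO_TOTAL = HIGHEST_WO_NUMBER + MISSING_FROM_SERIES  # derived value

SUM_OF_LANGUAGE_QUERIES = 78_492  # Grand Total 1
SINGLE_YEAR_QUERY = 79_856        # Grand Total 2

# Assumption: ISO 639-1 style codes stand in for the vendor's own codes.
ASSUMED_LANGUAGE_CODES = ("EN", "FR", "DE", "JA", "RU", "ES", "ZH")


def language_queries(year, codes=ASSUMED_LANGUAGE_CODES):
    """Build one 'year/PY AND CC/LA' statement per language code."""
    return [f"{year}/PY AND {code}/LA" for code in codes]


def shortfall(observed, expected=OFFICIAL_WIPO_TOTAL):
    """Number of documents by which a count falls short of the expected total."""
    return expected - observed


if __name__ == "__main__":
    for query in language_queries(2000):
        print(query)
    print("official total (derived):", OFFICIAL_WIPO_TOTAL)
    print("short by language sum:   ", shortfall(SUM_OF_LANGUAGE_QUERIES))
    print("short by year query:     ", shortfall(SINGLE_YEAR_QUERY))
    print("year query vs. numbering:", HIGHEST_WO_NUMBER - SINGLE_YEAR_QUERY)
    print("found by year only:      ", SINGLE_YEAR_QUERY - SUM_OF_LANGUAGE_QUERIES)
```

The last line shows how many records are retrieved by publication year but
by none of the language statements. It is plain subtraction and does not
in itself explain why those records are missed.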

From the point
of view of the searcher, it is bad enough that any documents are missing.
The situation is then exacerbated by the treatment policy for non-Roman
script documents, which are not processed to include full texts at all.
This reduces the actual texts available for searching by a further 6,410
based on the missing Japanese, Chinese, and Russian texts, with one further
English and five German texts also disappearing.

If we assume that
the explanation for the missing documents is based on problems at source,
then this will affect all the file versions based on the MicroPatent dataset
(MicroPatent, Dialog, and STN).

The fourth file,
WOTEXT, based on EPO data, is derived differently. The selection policy for
the file limits the possibility to English, French, or German source
documents; the accompanying table gives the corresponding statistics for
year 2000 publications. Clearly, the
overall numbers are lower as a result of the missing Spanish and non-Roman
text documents, but in addition it can be seen that only between 80 and
90% of all the potentially available documents have been selected for addition
to the file. Questel*Orbit, to its credit, has never pretended that this
file is a complete collection of PCT texts, and always advises that it
is best used in conjunction with the bibliographic PCTPAT file. The Questel*Orbit
print command is designed to operate from within the PCTPAT file in order
to retrieve full texts on-the-fly from the WOTEXT file. If there is no
corresponding full text, the search results display only the bibliographic
entry from the PCTPAT file.

The further missing
10-20% is due to the selection policy for the addition of texts, which
is based upon that used for the European Patent Office's internal search
files. This policy gives preference to the inclusion of an English-language
full text whenever such a member is available in the patent family. For
example, if an English-language PCT application was the first member of
a new family to be published (the "basic"), there is a high probability
that the text of this document would be added to the EPO search files as
the "master text." However, if there were a granted U.S. patent (in English)
published quickly (say in 15 months), followed by a German-language PCT
case (an "equivalent") at 18 months, the U.S. text would be adopted for
the search files. Due to this policy, it is virtually impossible to predict
what proportion of the WO texts in any given period will be chosen for
inclusion in the database. The situation has changed since mid-2000, when
WIPO started supplying the EPO with full texts in XML format; from this
point onwards, all texts in English, French or German were added to the
internal EPO files, and carried through to the WOTEXT file.
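
The effect of the pre-2000 selection policy can be illustrated with a
deliberately simplified sketch. It is not the EPO's actual procedure: the
class and function names are hypothetical, and the rule coded here (take
the earliest-published English member of the family, otherwise the
earliest member of any language) is an assumption chosen only to reproduce
the two examples given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyMember:
    """One published document in a patent family (illustrative only)."""
    publication: str
    language: str
    months_after_priority: int


def choose_master_text(family):
    """Pick the family member whose text would be loaded as the master text.

    Assumed rule: prefer English-language members; among the candidates,
    take the one published first.
    """
    if not family:
        raise ValueError("family must contain at least one member")
    english = [m for m in family if m.language == "English"]
    candidates = english or list(family)
    return min(candidates, key=lambda m: m.months_after_priority)


if __name__ == "__main__":
    family = [
        FamilyMember("granted U.S. patent", "English", 15),
        FamilyMember("WO application", "German", 18),
    ]
    chosen = choose_master_text(family)
    print("master text:", chosen.publication)
    print("WO text loaded:", chosen.publication == "WO application")
```

Under this assumed rule the German-language WO text is passed over in
favor of the U.S. text, so whether a given WO document reaches the file
depends on what else its family happens to contain.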

CONSEQUENCES FOR THE SEARCHER

The purpose of
this article is not simply to criticize the file producers, who are doing
the best they can with a complex set of data, but to raise awareness of
the problem for the unwary searcher. Anyone coming to these files who bases
his or her understanding of the content upon the publicity material distributed
by the vendor will be hard-pressed to discern the danger of missing prior
art.

Consider searchers
who approach the MicroPatent file, either in its Web site version or via
one of the commercial vendors. Before they have even touched a keyboard,
they are limiting their search to approximately 90% of the documents that
they might think are included (this is not good news: lest anyone consider
otherwise, the 80/20 rule does not apply in patentability searching!).
This 90% proportion, moreover, will only continue for as long as the Roman
script languages (English, French, German and Spanish) maintain their dominance
within the PCT. Some time ago, the South Korean patent office was accepted
as a so-called International Preliminary Examination Authority under the
PCT. In my opinion, the next logical step will be to accept Korean as the
eighth official publication language. Given the prolific rate at which
national Korean patent applications are published, this would be a popular
move and would create considerable downward pressure on the dominance of
English within the system. In turn, this would cause even bigger holes to
appear in the full-text databases reliant upon Roman scripts.

The second hurdle
arises if the unwary searcher forgets multilingualism entirely, and starts
to use only English search terms in his or her strategy. If
the user chooses the MicroPatent file, the maximum possible recall drops
immediately to around 56,019/79,858, or just over 70%. Should he or she choose
the WOTEXT file (against advice), his or her recall plummets to 46,985/79,858,
or 59% of the entire theoretical search file. For the reason given above,
in future years users may not even be able to maintain that level, as English
loses its dominance.
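
The two recall ceilings quoted above are straightforward ratios, and the
short sketch below reproduces them from the figures in the text. The
variable and function names are illustrative only.

```python
# Figures quoted in the text for year 2000 publications.
TOTAL_PUBLISHED = 79_858       # highest WO number allocated in 2000
ENGLISH_IN_MICROPATENT = 56_019
ENGLISH_IN_WOTEXT = 46_985


def max_recall(searchable_texts, total=TOTAL_PUBLISHED):
    """Upper bound on recall for an English-only full-text search."""
    return searchable_texts / total


if __name__ == "__main__":
    for name, count in (("MicroPatent", ENGLISH_IN_MICROPATENT),
                        ("WOTEXT", ENGLISH_IN_WOTEXT)):
        print(f"{name}: {max_recall(count):.1%} of {TOTAL_PUBLISHED:,} documents")
```

These are best-case ceilings: they assume every English text in the file
is retrievable, which the OCR problems described earlier make unlikely.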

The third hurdle
is that these files (excluding WOTEXT) include a document title and abstract
in English, for all cases. It is therefore possible to retrieve certain
non-English texts by searching in English, provided that your search terms
happen to appear in the abstract and that the abstract is included in the
basic index (or its equivalent). This can give the misleading impression
that all the documents in the same language as the single fortuitous hit
have in fact been searched.

To summarize, there
are dangers in treating these tools as comprehensive subject-matter sources
for patentability searching when the search relies upon words. At the present time,
there is only one reliable, language-independent method for searching these
important texts by subject in their entirety—namely, patent classification.
The files remain useful tools for searching other bibliographic elements,
such as inventor or assignee, or when knowingly limiting a search to the
(not-very-informative) applicant's abstracts and titles, but such a method
negates the advantages of full text and can equally well be done in one
of the several PCT bibliographic files, such as PCTPAT or PATOS-WO. The
problem of multilingualism is not going to go away—if anything, it will
get worse—and the challenges that this brings to all searchers, whether
of patents or anything else, remain formidable.