Tag: USPTO

Data

In my experience, no one has quite realised how amazing this link is. It hosts (courtesy of Google) bulk downloads of patent and trademark data from the US Patent and Trademark Office.

Just think about this for a second.

Here you can download images of most commercial logos used between 1870(!) and the present day. Back in the day, doing image processing and machine learning, I would have given my right arm for such a data set.

Moreover, you get access (eventually) to the text of most US patent publications. Considering there are over 8 million of these, and considering that most new and exciting technologies are the subject of a patent application, this represents a treasure trove of information on human innovation.

Although we are limited to US-based patent publications, this is not a problem. The US is the world’s primary patent jurisdiction – many companies only patent in the US, and most inventions of importance (in modern times) will be protected in the US. At this point we are also not looking for precise legal data – the accuracy of these downloads is not ideal. Instead, we are looking at “Big Data” (buzzword cringe) – general patterns and statistical gists drawn from “messy” and incomplete datasets.

Storage

Initially, I started with ten years’ worth of patent publications: 2001 to 2011. The data from 2001 onwards is pretty reliable; I have been told the OCR data from earlier patent publications is near useless.

An average year is around 60 GBytes of data (zipped!). Hence, we need a large hard drive.

Download

I have an unlimited package on BT Infinity (hurray!). A great help in downloading the data is a little Firefox plugin called FlashGot. Install it, select the links of the files you want to download, right-click and choose “FlashGot Selection”. This basically sets off a little wget script that fetches each of the links. I set it going just before bed – when I wake up the files are on my hard drive.
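If you would rather script the overnight batch yourself, the same idea can be sketched in Python. Note that the base URL and file names below are placeholders, not real download links – substitute the actual links gathered from the bulk-download page.

```python
import urllib.request
from pathlib import Path

# Hypothetical base URL - replace with the real bulk-download links
# collected from the Google hosting page.
BASE_URL = "https://example.com/uspto-bulk/"

def fetch_all(names, dest="Patent Downloads"):
    """Download each named ZIP into dest, skipping files already present."""
    out = Path(dest)
    out.mkdir(exist_ok=True)
    for name in names:
        target = out / name
        if not target.exists():  # resume-friendly: skip finished files
            urllib.request.urlretrieve(BASE_URL + name, target)

if __name__ == "__main__":
    # Set it going before bed, much like the FlashGot/wget approach
    fetch_all(["20010607.ZIP", "20010614.ZIP"])
```

This is only a sketch of the batch-download idea; FlashGot handles retries and parallelism that this simple loop does not.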

[TIF Files for the drawings – name format is [Publication Number]-[Date]-D[XXXXX].TIF where XXXXX is the drawing number e.g. US20010002518A1-20010607-D00012.TIF] – Size ~20kB

[Update: this structure varies a little 2004+ – there are a few extra layers directories between the original zipped folder and the actual XML.]
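The drawing filename convention above is regular enough to pull apart programmatically. A minimal sketch using the standard `re` module (the pattern is my own reading of the format, so treat it as an assumption):

```python
import re

# Assumed pattern for [Publication Number]-[Date]-D[XXXXX].TIF
DRAWING_PATTERN = re.compile(
    r"(?P<pubnum>[A-Z0-9]+)-(?P<date>\d{8})-D(?P<drawing>\d{5})\.TIF$"
)

match = DRAWING_PATTERN.match("US20010002518A1-20010607-D00012.TIF")
if match:
    print(match.group("pubnum"))   # US20010002518A1
    print(match.group("date"))     # 20010607
    print(match.group("drawing"))  # 00012
```

Being able to split out the publication number and date like this makes it easy to group drawings by document later on.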

The original ZIPs

ZIP Files & Python

Python is my programming language of choice. It is simple and powerful. Any speed disadvantage is not really felt for large scale, overnight batch processing (and most modern machines are well up to the task).

Ideally I would like to work with the ZIP files directly, without unzipping the data. For one-level ZIP files (e.g. the 20010607.ZIP files above) we can use ‘zipfile’, a built-in Python module. For example, the following short script ‘walks’ through our ‘Patent Downloads’ directory above and prints out information about each first-level ZIP file.
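A minimal version of that script might look like this (the ‘Patent Downloads’ root path is an assumption – point it at wherever your ZIPs live):

```python
import os
import zipfile

def list_zip_contents(root):
    """Walk root and yield (zip_path, member_name, size_in_bytes)
    for every member of every first-level ZIP file found."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.upper().endswith(".ZIP"):
                path = os.path.join(dirpath, filename)
                with zipfile.ZipFile(path) as zf:
                    for info in zf.infolist():
                        yield path, info.filename, info.file_size

if __name__ == "__main__":
    for path, member, size in list_zip_contents("Patent Downloads"):
        print(path, member, size)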

This code is based on that helpfully provided at PythonCentral.io. It lists all the files in the ZIP file. Now we have a start at a way to access the patent data files.

However, more work is needed. We come up against a problem when we hit the second-level of ZIP files (e.g. US20010002518A1-20010607.ZIP). These cannot be manipulated again recursively with zipfile. We need to think of a way around this so we can actually access the XML.
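One likely workaround – which I have yet to test properly on the real data – is to read the inner ZIP into memory as raw bytes and wrap it in a file-like object, which zipfile will then accept:

```python
import io
import zipfile

def read_nested(outer_path, inner_name):
    """Return the member names of a second-level (nested) ZIP,
    without unzipping anything to disk."""
    with zipfile.ZipFile(outer_path) as outer:
        inner_bytes = outer.read(inner_name)  # inner ZIP as raw bytes
        with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
            return inner.namelist()
```

The catch is memory: each inner ZIP is held in RAM while it is read, which should be fine for individual patent documents but is worth watching on bigger archives.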

As a rough example of the scale we are talking about – a scan through 2001 to 2009 listing the second-level ZIP file names took about 2 minutes and created a plain-text document 121.9 MB in size.

Next Time

Unfortunately, this is all for now as my washing machine is leaking and the kids are screaming.

Next time, I will be looking into whether zip_open works to access second-level (nested) ZIP files, or whether we need to automate an unzip operation (if our hard drive can take it).

We will also get started on the XML processing within Python using either minidom or ElementTree.