Finding records in PhysioBank

Among the most frequent questions asked by visitors to PhysioNet are requests
for data with specific combinations of characteristics. For example, “Which
records include three or more ECG signals and a respiration signal, are at
least two hours long, and are from male patients between the ages of 60 and
70?” This and many similar questions are readily answered using the
PhysioBank Index. (See how to do this below.)

This package includes sources for the PhysioBank Record Search (the web client
pbsearch, and the server pbsqsd) as well as a command-line client
(pbsqc), a stand-alone search engine (pbsqs), and a set of
plugins. The software runs on all popular platforms; it can work with a local
copy of the Index, or it can read the Index directly from the PhysioNet web
server or a mirror via HTTP. This package may be useful if you wish to
customize or automate your searches; familiarize yourself with the
web-based PhysioBank Record Search first.

Standard command-line tools such as grep, cut, and uniq

You can also search the Index using Unix (POSIX) utilities for manipulating
text files, as described below. These are standard
components of GNU/Linux, Mac OS X, and all other Unix and Unix-like platforms;
Windows users can get them by
installing Cygwin.

If you wish to index your own PhysioBank-compatible records, the portable
sources for the PhysioBank Index generator are available in the
pbindex software package.

Downloading the PhysioBank Index

You do not need to download the Index in order to use the recommended
PhysioBank Record Search. If
you wish to use another method for searching the Index, you may need to
download it.

Don't attempt to load the Index directly in your web browser, since its large
size will cause problems for most browsers!
(wfdbcat is part of the WFDB
Software Package; curl and rsync are available separately. All three
are open-source and run on all popular platforms.)

Contents and format of the Index

Each line of the PhysioBank Index describes one signal, annotation file, or
other feature of a single PhysioBank record; there are about 860,000 lines in
the Index. All lines pertaining to any given record are consecutive, and the
records appear in dictionary order. Here is a sample from the Index:

Each entry (line) of the Index contains up to seven tab-separated
fields (columns) that describe a signal, annotation set,
or feature associated with the record. For entries describing
features (such as the first five lines in the example above), these columns are
(from left to right):

Record name

In each entry in the example above, the record name is edb/e0103, i.e.,
record e0103 of the collection named
edb. (The file DBS contains short descriptions
of each collection; edb is the European ST-T Database.)

Class (feature type)

One of:

AgeSex: The next two columns contain the subject's age and gender.
Ages over 89 are protected health information (PHI), and are
stated as 90; unknown age is stated as -1. Ages of infants less than 1 year
old may be shown as 0, or as a decimal fraction of a year (e.g., 0.3). Gender
is stated as M or F, or ? if unknown. If neither age nor gender is known, this
entry is omitted from the Index.

Diagn: The next column contains one or more free-text
diagnoses. This entry appears only if diagnoses have been tagged in
the .hea file for the record; in many cases, as in the
example above, diagnoses appear as Info since they have not
been tagged.

Infon: Untagged information collected from
the .hea file for the record.

Medsn: One or more medications. As
for Diag, this entry appears only if medications have been
tagged in the .hea file; otherwise, they appear
as Info.

A sequence number n is affixed to each entry for any given class
if more than one such entry is possible in a single record (e.g., Info1,
Info2, etc.); this is done even if only a single entry is actually
present. The sequence numbers are in increasing order but not necessarily
contiguous, as the example above illustrates. Note that entries
describing features are present only for a relatively small number of
records.

Feature value(s)

A string providing the value for the feature named in the previous column.
Feature values may include spaces but not tabs. (An AgeSex entry has
two feature value fields, which are separated by a tab.)

In entries describing signal and annotation sets, the columns are (from left to
right):

a category of annotations (either AnnM for machine-derived
annotations, or AnnR for reference annotations)

A sequence number n is affixed to each entry for any given class
if more than one entry is possible in a single record (e.g., ECG1,
ECG2, etc.); this is done even if only a single entry is actually
present.

Signal or annotator name

The index contains an entry for each signal and for each
annotation file. In addition, each type of annotation present in a
given annotation file is counted separately and listed in its own
annotation subset entry. The last eight lines in the example
above are annotation subset entries, summarizing
the '(N', 'N', 's', etc. annotations present
within the atr annotation file. See
PhysioBank Annotations for a
glossary of commonly used annotations.

Sampling frequency (Hz)

The number of samples per second. Annotations contain timestamps with
a time resolution equal to the reciprocal of the sampling frequency.

Gain (of signals) or number (of annotations)

Gain is expressed as adu per physical unit. An adu is
one analog-to-digital converter unit (the quantization step, which is
the smallest measurable difference between samples). An amplitude
resolution of 200 adu/mV, as for the ECG signals in the example above, means
that two unscaled samples that differ by 20 units represent a potential
difference of 0.1 millivolt.

Duration

Length of the signal or annotation set, in seconds.

Time intervals

Interval(s) during which samples or annotations have been recorded, if this
column is present. The times shown for the beginning and end of each interval
are the elapsed times in seconds from the beginning of the record.

Some records have both annotated and unannotated segments; in these cases,
the length of the annotation set is shorter than the length of the
signals. In annotation subset entries, the duration and time interval reflect
the times of the first and last annotations of the associated type, and are
generally shorter than the entire annotation set.

In most cases, signals are present throughout, and the last column is omitted
in entries that describe such signals. The MIMIC II Waveform Database is an
exception to this rule, since many of its signals have been recorded in only a
subset of segments; in these cases, the length of each such signal is less than
that of the entire record, and there may be more than one time interval shown.
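
Because the duration column is in seconds, a requirement such as “at least
two hours long” becomes a numeric test on the sixth column (2 × 3600 = 7200).
The following is a minimal sketch, not part of the original tutorial; the
file sample-index and the records in it are invented for illustration and
merely follow the column layout described above.

```shell
# Invented sample lines: record, class, name, frequency, gain, duration.
printf 'demo/r1\tECG1\tMLII\t250\t200\t7200\n'   >sample-index
printf 'demo/r2\tECG1\tV5\t250\t200\t1800\n'    >>sample-index
printf 'demo/r3\tResp1\tRESP\t125\t100\t9000\n' >>sample-index

# Print the record name of every entry whose duration (column 6)
# is at least two hours, i.e. 7200 seconds.
long_records=$(awk -F'\t' '$6 >= 7200 { print $1 }' sample-index)
echo "$long_records"
```

On the real Index, the same awk test could be applied to physiobank-index
after first selecting the signal class of interest with grep.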

As for the signal and annotation set lines, the first two columns are the
record name and class (data type). The first four feature lines shown above
illustrate diagnoses, medications, and two lines of free-text information;
the data appear in the third column. The final feature line contains the
age (in years) in the third column, and the sex (M, F, or ?) in the fourth
column. If the subject’s age is over 89, it is shown as 90 (since ages over
89 are protected health information); if the age was not recorded, it is
shown as -1.
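
Since AgeSex entries keep the age and sex in separate tab-delimited columns,
they can be tested field by field. The sketch below is an illustration only:
the file sample-agesex and its four records are invented, and the layout
(record, AgeSex, age, sex) is assumed from the description above.

```shell
# Invented AgeSex entries: record, class, age, sex.
printf 'demo/r1\tAgeSex\t65\tM\n'  >sample-agesex
printf 'demo/r2\tAgeSex\t72\tM\n' >>sample-agesex
printf 'demo/r3\tAgeSex\t61\tF\n' >>sample-agesex
printf 'demo/r4\tAgeSex\t-1\tM\n' >>sample-agesex

# Select male subjects aged 60 to 70; an unknown age (-1) fails the test.
men_60_70=$(awk -F'\t' \
  '$2 == "AgeSex" && $3 >= 60 && $3 <= 70 && $4 == "M" { print $1 }' \
  sample-agesex)
echo "$men_60_70"   # prints: demo/r1
```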

Using PhysioBank Record Search to search the
PhysioBank Index

In this section, we'll answer the question at the top of this page using
PhysioBank Record Search. Please follow along in your web browser. Note
that your results may vary from those shown below if additional records have
been added and indexed since this tutorial was written.

PhysioBank Record Search is controlled from your browser. You can open it
from the PhysioNet menu button at the top left corner of most pages on
PhysioNet (choose PhysioBank → PhysioBank Search). To follow this
exercise, click here
to open it in another browser tab or window. You should then be able to go
back and forth between this page and the PhysioBank Record Search page.

The upper section of the page, below the PhysioBank Record Search
heading, contains a form for composing simple queries,
searches that are defined by a Subject, a Relationship, and a Value.
Once you have performed any simple queries, the Results section
opens beneath the upper section; it contains the results of your queries
and additional controls for combining and manipulating them. Near the
bottom of the page, on-line help for PhysioBank Record Search appears
below the heading How to search for records in PhysioBank. Read
the on-line help to familiarize yourself with the controls.

If you have used PhysioBank Record Search within the past week or so on
this computer, you will see two buttons in the upper section of the
page, labeled “Restore previous session” and “Discard previous results”.
Unless you wish to keep the results of your earlier searches, discard
them before continuing.

To answer the example question (“Which records include three or more ECG
signals and a respiration signal, are at least two hours long, and are from
male patients between the ages of 60 and 70?”), our strategy will be to
decompose it into simple queries, collect the results of those queries, and
combine them. At each step, we'll be able to see how many PhysioBank records
fit the simple queries.

Let’s begin. The first simple query will find all records that contain three
or more ECG signals. From the Subject menu, select '(#) ECG'. The
notation '(#)' indicates that we can specify a minimum number of signals of
this type (ECG) in the Name/# box below the Subject menu; type
3 into that box. From the Relationship menu, choose '?' (“defined”),
since for this simple query, we only want to know if the 3 ECGs
exist. It is unnecessary to enter anything in the Value box (if you do,
it will be ignored since the Relationship is '?'). Once the query has been set
up, click on Get List.

Immediately below the Results heading, you should now see a line that
looks like this:

☐ A [25372] ECG-3 ?

“A” is the tag for the list of results of this query; the
bracketed number (which, as noted above, may vary) is the number of PhysioBank
records that match the criterion that you have just defined, and the criterion
itself appears as a link (“ECG-3 ?”). Click the link (on the search page; the
facsimile shown above is not a working link) to view the results if you wish (a
list of 25372 record names, probably beginning with
challenge/2009/test-set-a/101a/101a).

For the second simple query, let's find all records that include a respiration
signal. We might construct a query such as 'Resp ?', but since we also want
records that are at least two hours long, we can include that constraint as an
element of this query. Select '(#) Respiration' as the Subject, '>=' as the
Relationship, and type '2:0:0' (i.e., two hours) as the Value. Don’t forget
to erase the '3' that may be left in the Name/# box from the previous query;
it’s unnecessary (but harmless) to type a '1' in its place.

After clicking Get List, a second checkbox and result appear. New results
appear above old ones, so you should now see this:

☐ B [38507] Resp >= 2:0:0
☐ A [25372] ECG-3 ?

There are many records that satisfy this criterion, too.

If you forgot to erase the '3' before clicking Get List,
list B will be much shorter, since it will contain only records that
include 3 or more respiration signals. Such records do exist (for
example, some have thoracic and abdominal impedance plethysmograms and
a simultaneously recorded nasal thermistor signal), but they are
relatively uncommon.

Next, let’s find records from male subjects ('sex = M'). By now, you may be
becoming familiar with the steps of creating a simple query: select the
subject, then the relationship, then type a value, and click Get List.

☐ C [9962] sex = M
☐ B [38507] Resp >= 2:0:0
☐ A [25372] ECG-3 ?

You may be surprised that list C has only 9962 records from male subjects,
given that list B has about 4 times as many records. For many records, however,
information about the subject’s gender (and, as we shall see, age) is not
available.

To restrict the search to a range of ages (60-70), we'll use two simple
queries ('age >= 60' and 'age <= 70'). When you have run these, your
results should look like this:

The final step is to combine all of these results. The 'And' button combines
two or more selected lists, producing a new list containing those records that
belong to all of its input lists. Select each of the five lists now by clicking
on its checkbox, then click 'And' to generate list F:

The final list contains 131 records, all from the 'mimic2wdb/matched/' data
collection (the MIMIC II Waveform Database Matched Subset). You can view or
download the list of records if you wish by clicking on the link next to F,
or you can examine the records using the PhysioBank ATM. If you wish to do
this, select list F by clicking its checkbox, then click on 'Choose'. The
PhysioBank ATM will appear in place of the PhysioBank Record Search page,
and the first record belonging to list F will be preselected as input.
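
The 'And' operation is a set intersection, so lists that have been downloaded
can also be combined offline. The following is a sketch under assumptions:
the files list-A and list-B are hypothetical downloads containing one record
name per line, and the names in them are invented.

```shell
# Hypothetical downloaded lists, one record name per line.
printf 'demo/r1\ndemo/r2\ndemo/r3\n' >list-A
printf 'demo/r3\ndemo/r1\ndemo/r4\n' >list-B

# comm requires sorted input. The -12 option suppresses the lines unique
# to either file, leaving only the names that appear in both.
sort list-A >list-A.sorted
sort list-B >list-B.sorted
both=$(comm -12 list-A.sorted list-B.sorted)
echo "$both"
```

Intersecting more than two lists works the same way, by feeding the result
of one comm invocation into the next.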

If you are unfamiliar with the ATM, read its on-line help (visible below
the How to use the PhysioBank ATM heading), then click '*' (in the
ATM control panel under Navigation) to dismiss the help and display the
first 10 seconds of the first record. You can use any of the ATM's controls
to examine the record as you wish; when you are ready, click on 'Next record'
to view the next record in list F.

The '+' and '-' buttons in the ATM control panel are active only while you
are examining search results. Use them to mark individual records of interest.
When you click on either of them, a '+' or '-' appears after the record's name
in the ATM's Record menu (control panel, upper left). Mark at least one
record now.

After reviewing as many records as you wish, return to the search page
(for example, by clicking the PhysioBank Record Search button in the ATM's
page header). When you do, you will see that one or two new lists have been
created below list F. These lists, tagged as F+ and F-, contain only those
records that you have marked using the ATM's '+' and '-' buttons. You can
select list F again and click 'Choose' to return to the ATM if you wish to
mark additional records, and these lists will be updated as you do so. Although
lists F+ and F- are described as “accepted from F” and “rejected from F”,
you can use these lists in any way you wish. Note, however, that a list
made by combining them with other lists will not update itself automatically
if you make changes in them later on; to get updated results, repeat the
actions you used to combine the lists initially.

If you don’t need to refer back to a list, select it and click on Erase to
discard it. This helps to avoid confusion if you use PhysioBank Record Search
frequently.

Your results are retained for about a week; after that, they will be removed.
Download them if you wish to keep them (especially if you have invested effort
in choosing individual records using the ATM). When you return, your previous
results are identified using a browser cookie (pbs_id); if you use more than
one browser, or more than one computer, you will have different cookies (hence
different sets of results) for each one unless you synchronize your cookies.

Using standard command-line utilities to search the PhysioBank Index

Begin by downloading the Index using any of the methods above. Open a terminal emulator window and navigate to the directory in which
you saved physiobank-index.

There are five records in PhysioBank that include a left ventricular stroke
volume signal, which is labelled SV. Finding them is simple: type

Most of these results are records containing supraventricular tachyarrhythmias
(annotated as '(SVTA'), and others contain SV in comments.
These are easily ignored, but it's also possible to improve the search
using either

grep $'\tSV\t' physiobank-index

(if you are using the bash shell), or

grep -P '\tSV\t' physiobank-index

(if you are using GNU grep). Either of these commands interprets the
sequence \t as a tab character in the search pattern, so that
the results contain only lines from the index in which SV appears
surrounded by tabs (i.e., in a column by itself):
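
The difference between the loose and the tab-delimited search can be tried on
a few invented lines. This sketch also shows a third variant that depends on
neither bash nor GNU grep: a literal tab is captured with printf and placed
in the pattern. The file sample-sv, its records, and its class names are
made up for illustration.

```shell
# Invented entries: a signal named SV, an '(SVTA' annotation subset,
# and a comment that merely mentions SV.
printf 'demo/r1\tSV1\tSV\t125\t100\t3600\n'     >sample-sv
printf 'demo/r2\tAnnR1\t(SVTA\t250\t12\t1800\n' >>sample-sv
printf 'demo/r3\tInfo1\tSV noted in comments\n' >>sample-sv

# Loose search: counts every line containing "SV" anywhere.
loose=$(grep -c SV sample-sv)

# Strict search: "SV" must be surrounded by tab characters.
tab=$(printf '\t')
strict=$(grep -c "${tab}SV${tab}" sample-sv)
echo "$loose $strict"   # prints: 3 1
```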

Getting (re)acquainted with the command line

If you've ever used any version of Unix, or even MS-DOS, the examples on
this page may look familiar. If not, consult any introductory
book or on-line tutorial about Unix or GNU/Linux. Here are a few places
to start:

The necessary shell (command-line interpreter, such
as bash, csh, ksh, or sh) and command-line
tools (grep, cut, sort, etc.) are standard
components of GNU/Linux, Mac OS X, and all other Unix and Unix-like platforms;
Windows users can get them by
installing Cygwin.

After nearly 30 years, Kernighan and Pike's
The Unix Programming Environment remains the best
introduction to this approach of tackling problems using tools that each do one
job well, and work well together. Used copies are far less expensive than
new.

If we want to find records that have at least 3 ECG signals,
we can look for ECG3:

grep ECG3 physiobank-index

This results in a very long list of records that quickly scrolls off the
screen. If we want to know how long the list is, we can use wc
to count the lines:

grep ECG3 physiobank-index | wc -l

(The pipe symbol, '|', connects a pair of commands; it means
“take the standard output of the command on the left and feed it to the standard input of the command on the right”.)
When this page was written, there were 25,372 recordings with at least 3
ECG signals in PhysioBank. We can save the entire list by redirecting
the standard output into a file, like this:

grep ECG3 physiobank-index >ECG3-records

The '>' collects the standard output of the command, which would
otherwise be shown in the terminal window, and saves it in a file
(ECG3-records).
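
A long list is often easier to digest as a summary. As a sketch, the pipeline
below counts matching entries per collection with cut, sort, and uniq. It
assumes that the collection is the part of the record name before the first
'/', and it runs on an invented file, sample-ecg, rather than on the real
Index.

```shell
# Invented entries from two made-up collections.
printf 'demo/r1\tECG3\tV5\t250\t200\t3600\n'   >sample-ecg
printf 'demo/r2\tECG3\tV5\t250\t200\t3600\n'  >>sample-ecg
printf 'other/x1\tECG3\tIII\t125\t100\t600\n' >>sample-ecg

# Keep the record name, reduce it to its first path component, then
# count repeats; awk tidies the padded "count name" output of uniq -c.
per_collection=$(grep ECG3 sample-ecg | cut -f 1 | cut -d / -f 1 \
  | sort | uniq -c | awk '{ print $2, $1 }')
echo "$per_collection"
```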

Suppose what we really want are the longest such recordings. Here's how
to find the 3 longest cases:

grep ECG3 physiobank-index | cut -f 1,6 | sort -nr -k2 | head -3

(This command uses pipes to chain four commands together, each one
reading the output of the previous one; cut selects the first and sixth
fields — the record name and the duration — from each line output
by grep; sort rearranges the lines
in reverse numerical order of the second field output by cut; and
head discards all but the first three lines output by sort.)
The output lists 3
recordings, each containing over 400 hours of ECG3:

There is a caveat, however: these recordings are all from the MIMIC II
database, and the signals are not necessarily continuous; in fact, they
may not even be simultaneously available. To find a set of long records
with at least 3 continuous, simultaneous ECG signals, we can exclude the MIMIC
databases and the similar Challenge 2009 database from the search:

(Here the \ characters indicate the command continues on the
following line.) The results are:

ltstdb/s30691 85860
ltstdb/s30731 85845
ltstdb/s30801 85821
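
One way to express this kind of exclusion is with grep -v, which prints only
the lines that do not match. The sketch below is an illustration on invented
data, not the exact command used to produce the results above; in particular,
the prefixes '^mimic' and '^challenge/2009/' are assumptions about how the
excluded collections are named.

```shell
# Invented entries; the first two stand in for excluded collections.
printf 'mimic2wdb/demo-a\tECG3\tV\t125\t50\t1500000\n'     >sample-long
printf 'challenge/2009/demo-b\tECG3\tV\t125\t50\t900000\n' >>sample-long
printf 'demo/r1\tECG3\tV5\t250\t200\t85000\n'              >>sample-long
printf 'demo/r2\tECG3\tV5\t250\t200\t86000\n'              >>sample-long

# Drop the excluded collections, then rank what remains by duration.
longest=$(grep ECG3 sample-long \
  | grep -v -e '^mimic' -e '^challenge/2009/' \
  | cut -f 1,6 | sort -nr -k2 | head -3)
echo "$longest"
```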

These examples illustrate the flexibility of using standard command-line tools
to search within the PhysioBank Index. If these tools are already familiar,
it’s easy to perform much more complex searches, including many that would be
very difficult to perform using a relational database and SQL.

As of January 2012, over 36,000 record sets from over
50 collections are included in the PhysioBank Index. Many record sets
include two or more records, and some records belong to more than one
collection, so the number of record names in the index is nearly 73,000.

The MIMIC Database and the
MIMIC II Waveform Database consist
of record sets (pairs of records acquired simultaneously from each subject: a
waveform record of signals sampled at 125 or 500 Hz, and
a numerics record of vital signs sampled once per second or once per
minute).

Records (or excerpts of records) may belong to more than one data collection:

Several important subsets of the MIMIC II Waveform Database are also indexed as
independent records, so that records belonging to them appear more than once in
the PhysioBank Index. These include the MIMIC II Waveform Database version 2, containing older versions of about
25% of the current (version 3) MIMIC II Waveform Database; the
MIMIC II Waveform Database
Matched Subset, containing records that have been matched and time-aligned
with those in the MIMIC II
Clinical Database; and excerpts of MIMIC II Waveform Database records used
in the 2009
and 2010 PhysioNet/CinC Challenges. In
addition, records containing estimates of heart rate, blood pressure, and
signal quality are associated with about half of the version 2 waveform records.

Single-collection indices

Each file listed below contains an index for a PhysioBank data collection
(or part of one). These files are concatenated to form the PhysioBank Index.