The infinite MNIST dataset

Formerly known as MNIST8M.

1. Background

This code produces an infinite supply of digit images derived from the
well-known MNIST dataset using pseudo-random deformations and translations.
This is a streamlined version of the code used for the experiments reported
in (Loosli, Canu, Bottou, 2007).
A subset of the examples generated by this code is known as MNIST8M.
Unfortunately, the original MNIST8M files have been deleted from the NEC
servers. However, you can use InfiMNIST to regenerate these files, or to
generate much larger files if you prefer. You can even use this code to
generate deformed MNIST examples on the fly.

Each InfiMNIST example is identified by a long integer index that determines the source
of the example and the transformations applied to the pattern. The examples
numbered 0 to 9999 are the standard MNIST testing examples. The examples
numbered 10000 to 69999 are the standard MNIST training examples. Each
example with index i >= 70000 is generated by applying a pseudo-random
transformation to the MNIST training example numbered 10000+((i-10000)%60000).
Because the pseudo-random transformations are deterministically derived
from the example number, this is similar to having a file containing about
one trillion distinct MNIST examples.
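
The index layout described above can be sketched in a few lines of C. This is
only an illustration of the arithmetic: the names source_index, is_deformed,
TEST_COUNT and TRAIN_COUNT are hypothetical and are not part of infimnist.h.

```c
#include <assert.h>

/* Hypothetical constants: sizes of the standard MNIST test and training sets. */
enum { TEST_COUNT = 10000, TRAIN_COUNT = 60000 };

/* Returns the index of the original MNIST example that example <index> is
   derived from. Test examples (0 to 9999) and training examples
   (10000 to 69999) map to themselves; larger indices map back to a training
   example using the formula 10000+((i-10000)%60000). */
long source_index(long index)
{
    assert(index >= 0);
    if (index < TEST_COUNT)
        return index;
    return TEST_COUNT + (index - TEST_COUNT) % TRAIN_COUNT;
}

/* Returns nonzero when example <index> is a pseudo-randomly deformed
   example rather than an original MNIST example. */
int is_deformed(long index)
{
    assert(index >= 0);
    return index >= TEST_COUNT + TRAIN_COUNT;
}
```

For instance, index 70000 is the first deformed example and is derived from
training example 10000, while index 70001 is derived from training example 10001.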

The supplied makefiles are standard and should work
on nearly all machines. Adjusting the CFLAGS variable
may yield better performance.

Linux/Unix/Cygwin: Unpack the archive and type make.

Windows: Unpack the archive and type nmake /f NMakefile in an MSVC shell.

4. Using the InfiMNIST executable

Synopsis:

$ infimnist [-d <datadir>] <format> <first> <last>

Option -d <datadir> specifies the location of the
six data files. The default is the directory named data
in the current directory.
Arguments <first> and <last> give the first and last
index of the range of examples written to the standard output.
Argument <format> selects the format of the produced
data. Any unambiguous prefix of the following formats
is recognized:

Generating files containing the MNIST8M training set with a format similar to the standard MNIST files. This is intended to provide exactly the same data as the original MNIST8M dataset. A bug in releases 1.1 and 1.2 prevents this from happening on 64-bit machines. This bug was fixed in release 1.3.

Generating a LibSVM-compatible MNIST8M file. This file is expected to be identical to the MNIST8M file saved on the LibSVM web site.

$ infimnist svm 10000 8109999 > mnist8m-libsvm.txt
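
To give an idea of what a LibSVM-style line looks like, here is a minimal
sketch that formats one label and one 784-byte pattern as sparse text. It
assumes the usual LibSVM convention (label, then index:value pairs with
1-based indices, zero features omitted) and raw byte values. The exact text
emitted by infimnist may differ in scaling or spacing, and the name
format_libsvm_line is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { PATTERN_SIZE = 784 };   /* bytes per image, in row major order */

/* Hypothetical helper: writes "<label> <index>:<value> ...\n" into <buf>,
   skipping zero pixels and numbering features from 1.
   Returns the number of characters written, or -1 if <buf> is too small. */
int format_libsvm_line(char *buf, size_t size, int label,
                       const unsigned char *pattern)
{
    size_t used;
    int n, i;

    assert(buf && pattern && size > 0);
    n = snprintf(buf, size, "%d", label);
    if (n < 0 || (size_t)n >= size)
        return -1;
    used = (size_t)n;
    for (i = 0; i < PATTERN_SIZE; i++) {
        if (pattern[i] == 0)
            continue;                      /* sparse format: omit zeros */
        n = snprintf(buf + used, size - used, " %d:%d", i + 1, (int)pattern[i]);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }
    if (used + 2 > size)                   /* room for '\n' and '\0' */
        return -1;
    buf[used++] = '\n';
    buf[used] = '\0';
    return (int)used;
}
```

A buffer of about 8KB is enough for any single pattern under these
assumptions, since each of the 784 features needs at most 8 characters.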

5. Using InfiMNIST as a library

Files infimnist.h and infimnist.c form a self-contained
library that you can use to generate an infinite amount
of MNIST-like examples on the fly. Its interface is
documented by the comments in file infimnist.h,
reproduced below:

/* Function <infimnist_create> creates the infimnist_t data structure that
   contains the digit data (about 450MB) and caches up to about 1GB worth of
   deformed digit images. The argument <datadir> points to the directory
   containing the data files. Setting it to NULL implicitly selects the
   directory named "data" in the current directory. */
infimnist_t *infimnist_create(const char *datadir);

/* Function <infimnist_destroy> destroys the data structure
   and returns its memory to the heap. */
void infimnist_destroy(infimnist_t*);

/* Function <infimnist_get_label> returns the label (0 to 9)
   associated with example <index>. */
int infimnist_get_label(infimnist_t*, long index);

/* Function <infimnist_get_pattern> returns the image associated with the
   example numbered <index>. The image takes the form of a vector of 784
   unsigned bytes organized in row major order. Each byte takes a value
   ranging from 0 (white) to 255 (black). There is no need to free the
   resulting pointer as it directly points into the pattern cache. These
   vectors may be automatically deallocated in the future. However, at any
   time, you can safely access the last ten vectors returned by this
   function. */
const unsigned char *infimnist_get_pattern(infimnist_t*, long index);
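
Because only the last ten returned vectors are guaranteed to remain valid,
code that keeps patterns around for longer may want to copy them. The sketch
below shows one way to do that, together with row major pixel access and a
crude text rendering. It assumes the 784 bytes form a 28x28 image, as in
standard MNIST. The helpers copy_pattern, pixel_at and render_ascii are
hypothetical and are not part of the library.

```c
#include <assert.h>
#include <string.h>

enum { SIDE = 28, PATTERN_SIZE = SIDE * SIDE };   /* 784 bytes per image */

/* Copies a pattern into caller-owned storage so that it remains valid
   regardless of what happens to the pattern cache later. */
void copy_pattern(unsigned char dst[PATTERN_SIZE], const unsigned char *src)
{
    assert(dst && src);
    memcpy(dst, src, PATTERN_SIZE);
}

/* Returns the pixel at <row>, <col> (both 0 to 27) in row major order.
   Values range from 0 (white) to 255 (black). */
unsigned char pixel_at(const unsigned char *pattern, int row, int col)
{
    assert(pattern);
    assert(row >= 0 && row < SIDE && col >= 0 && col < SIDE);
    return pattern[row * SIDE + col];
}

/* Renders the pattern as 28 text lines, using '#' for dark pixels
   (value >= 128) and '.' otherwise. The buffer <out> must hold at
   least 28*29+1 characters. */
void render_ascii(const unsigned char *pattern, char *out)
{
    int row, col;

    assert(pattern && out);
    for (row = 0; row < SIDE; row++) {
        for (col = 0; col < SIDE; col++)
            *out++ = pixel_at(pattern, row, col) >= 128 ? '#' : '.';
        *out++ = '\n';
    }
    *out = '\0';
}
```

With the library linked in, a caller could obtain a handle p from
infimnist_create(NULL), pass infimnist_get_pattern(p, 70000) as the <src>
argument of copy_pattern, and release the handle with infimnist_destroy(p)
once done.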