Learn how one team developed algorithms to automatically identify tissues from big whole-slide images

The primary goal of the Tumor Proliferation Assessment Challenge 2016 (TUPAC16) was to develop algorithms that automatically predict breast cancer tumor proliferation scores. In this challenge, the training set consisted of 500 whole-slide images that were scored (1, 2, or 3) by pathologists based on mitosis counts. A higher proliferation score indicates a worse prognosis, because higher tumor proliferation rates are correlated with worse outcomes. The tissue samples were stained with hematoxylin and eosin (H&E).

Tissue identification in whole-slide images can be an important precursor to deep learning. Deep learning is computationally expensive and medical whole-slide images are enormous. Typically, a large portion of a slide isn’t useful, such as the background, shadows, water, smudges, and pen marks. You can use preprocessing to rapidly reduce the quantity and increase the quality of the image data to be analyzed. This can lead to faster, more accurate model training.

In this series, we’ll look at whole-slide image processing and will describe various filters that can be used to increase the accuracy of tissue identification. After determining a useful set of filters for tissue segmentation, we’ll divide slides into tiles and determine sets of tiles that typically represent good tissue samples.

The solution should demonstrate high performance, flexibility, and accuracy. Filters should be easy to combine, chain, and modify. Tile scoring should be easy to modify for accurate tile selection. The solution should offer the ability to view filter, tile, and score results across large, unique datasets. The solution should also have the ability to work in a batch mode, where all of the image files and intermediary files are written to the file system, and in a dynamic mode, where high-scoring tissue tiles can be retrieved from the original WSI files without requiring any intermediary files.

In summary, we will scale down whole-slide images (WSI), apply filters to these scaled-down images for tissue segmentation, break the slides into tiles, score the tiles, and then retrieve the top tiles based on their scores.

Five steps

Setup

This project heavily uses Python 3, an ideal language for image processing. OpenSlide is used for reading WSI files, Pillow for basic image manipulation in Python, and NumPy for fast, concise, powerful processing of images as NumPy arrays. Scikit-image is heavily used for a wide variety of image functionality, such as morphology, thresholding, and edge detection.

In this tutorial, we use scikit-image filters (hysteresis thresholding) that are not present in the latest released version of scikit-image at the time of this writing. We can install scikit-image from source, as described in its README file.

Whole-slide imaging background

A whole-slide image is a digital representation of a microscopic slide, typically captured at a very high level of magnification such as 20x or 40x. As a result of this high magnification, whole-slide images are typically very large. The maximum file size for a single whole-slide image in our training data set is 3.4 GB, with an average over 1 GB.

WSI example slide

A whole-slide image is created by a microscope that scans a slide and combines smaller images into a large image. Techniques include combining scanned square tiles into a whole-slide image and combining scanned strips into a resulting whole-slide image. Occasionally, the smaller constituent images can be
visually discerned, as in the shaded area at the top of the following slide.

Combining smaller images into a whole-slide image

A fairly unusual feature of whole-slide images is their very large size. For our training dataset of 500 images, the width varied from 19,920 pixels to 198,220 pixels, with an average of 101,688 pixels. The height varied from 13,347 pixels to 256,256 pixels, with an average of 73,154 pixels. Total pixel counts varied from 369,356,640 to 35,621,634,048 pixels, with an average of 7,670,709,628 pixels. The 500 training images take up a total of 525 GB of storage space.

Training image sizes

The following image shows a histogram distribution of the training image sizes in megapixels.

Distribution of images based on number of pixels

We can use the OpenSlide project to read a variety of whole-slide image formats, including the Aperio *.svs slide format of our training image set. This is a pyramidal, tiled format, where the massive slide is composed of a large number of constituent tiles.
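To make the pyramidal structure concrete, here is a small arithmetic sketch of per-level dimensions. The downsample factors shown are typical of an Aperio .svs pyramid and are an assumption, as is the choice of level-0 size (Slide #1's dimensions, discussed later in this article).

```python
# Per-level dimensions of a pyramidal slide: each level stores the same
# image downsampled by a fixed factor. The factors below are typical of
# an Aperio .svs pyramid (an assumption); the level-0 size is Slide #1's.
level0_width, level0_height = 130304, 247552
downsamples = [1, 4, 16, 32]

for ds in downsamples:
    print(f"downsample {ds:>2}: {level0_width // ds} x {level0_height // ds}")
```

With OpenSlide, the corresponding information is available through a slide's `level_count`, `level_dimensions`, and `level_downsamples` properties.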

The deepzoom_multiserver.py script starts a web interface on port 5000 and displays the image files at the specified file system location (the WSI_DIRECTORY value, which could be a location such as ~/git/python-wsi-preprocessing/data/). If image files exist in subdirectories, they are also displayed in the list of available slides.

If this viewing application is installed on a server that also hosts the whole-slide image repository, this offers a convenient mechanism for you to view the slides without requiring local storage space.

OpenSlide available slides

The following image shows the initial view of one of the whole-slide images viewed in a web browser.

OpenSlide whole-slide image

Using this web interface, the whole-slide image can be zoomed to the highest magnification, revealing fine details at the tile level. Zooming and scrolling operations make it relatively easy to visually peruse the whole-slide image.

OpenSlide whole-slide image zoomed

Scale down images

To develop a set of filters that can be applied to an entire set of large whole-slide images, the first two issues we confront are the size of the data and the format of the data. As mentioned, for our training dataset, the average .svs file size is over 1 GB, and we have a total of 500 images. Additionally, the .svs format is fairly unusual and typically can't be displayed by default by common applications and operating systems. Therefore, we will develop some code to overcome these issues. Using OpenSlide and Python, we'll convert the training data set to smaller images in a common format, thus reformulating a big data problem as a small data problem. Before filtering at the whole-slide level, we shrink the width and height by a factor of 32, which means filtering can be performed on 1/1024 of the original image data. Converting the 500 .svs files to .png files at 1/32 scale takes approximately 12 minutes on a typical MacBook Pro.
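A minimal sketch of the scaling step, using Pillow on a synthetic in-memory image in place of a slide read through OpenSlide (the project's actual conversion code lives in wsi/slide.py):

```python
from PIL import Image

SCALE_FACTOR = 32

def scale_image(img, scale=SCALE_FACTOR):
    """Shrink width and height by `scale`, leaving 1/(scale*scale) of the pixels."""
    w, h = img.size
    return img.resize((w // scale, h // scale), Image.BILINEAR)

# Synthetic stand-in for a slide; a real conversion would read the .svs
# with OpenSlide and save the scaled result as a .png file.
big = Image.new("RGB", (3200, 1600), (255, 240, 240))
small = scale_image(big)
print(small.size)  # (100, 50)
```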

In the wsi/slide.py file, we have many functions that can be used in relation to the original svs images. Of particular importance are the following functions:

- open_slide() uses OpenSlide to read in a .svs file.
- show_slide() opens a WSI .svs file and displays a scaled-down version of the slide on the screen.
- slide_info() displays metadata associated with all of the .svs files.
- slide_stats() looks at all images and summarizes size information about the set of slides. It also generates a variety of charts for a visual representation of the slide statistics.
- training_slide_to_image() converts a single .svs slide to a smaller image in a more common format such as .jpg or .png.
- singleprocess_training_slides_to_images() converts all .svs slides to smaller images, and multiprocess_training_slides_to_images() uses multiple processes (one process per core) to speed up the slide conversion.

For the last three functions, when an image is saved, a thumbnail image is also saved. By default, the thumbnail has a maximum height or width of 300 pixels and is in .jpg format.

One of the first actions you can take to become more familiar with the training data set is to look at the metadata associated with each image, which you can do with the slide_info() function. The following code shows a sample of this metadata for Slide #1:

The most important metadata for our purposes is that the slide has a width of 130,304 pixels and a height of 247,552 pixels. Note that these values are displayed as width followed by height. For most of our image processing, we will be using NumPy arrays, where rows (height) are followed by columns (width).

If we visually look over the metadata associated with other images in the training dataset, we see that the slides are not consistent in their various properties such as the number of levels contained in the .svs files. The metadata implies that the data set comes from a variety of sources. The variability in the slides, especially regarding issues such as H&E staining and pen marks on the slides, needs to be considered during our filter development.

If we call the slide_stats() function, in addition to the charts, we obtain a table of pixel statistics, as shown below.

Training images statistics

| Attribute  | Size                  | Slide # |
|------------|-----------------------|---------|
| Max width  | 198,220 pixels        | 10      |
| Max height | 256,256 pixels        | 387     |
| Max size   | 35,621,634,048 pixels | 387     |
| Min width  | 19,920 pixels         | 112     |
| Min height | 13,347 pixels         | 108     |
| Min size   | 369,356,640 pixels    | 112     |
| Avg width  | 101,688 pixels        |         |
| Avg height | 73,154 pixels         |         |
| Avg size   | 7,670,709,629 pixels  |         |

The wsi/slide.py file contains constants that can be used to control various image conversion settings. For example, the SCALE_FACTOR constant controls the factor by which the slides will be scaled down. Its default value is 32, meaning that the height and width will be scaled down by a factor of 32. This means that when we perform filtering, it will be performed on an image 1/1024 the size of the original high-resolution image. The DEST_TRAIN_EXT constant controls the output format. We will use the default format, .png.
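The arithmetic behind the 1/1024 figure, applied to an average-sized training slide:

```python
SCALE_FACTOR = 32  # default value in wsi/slide.py

# Scaling width and height each by 32 reduces the pixel count by 32 * 32.
area_factor = SCALE_FACTOR * SCALE_FACTOR
print(area_factor)  # 1024

# Average training slide, 101,688 x 73,154 pixels, after scaling:
scaled_w = 101688 // SCALE_FACTOR
scaled_h = 73154 // SCALE_FACTOR
print(scaled_w, scaled_h)  # 3177 2286
```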

On macOS, we obtained the following conversion times for singleprocess_training_slides_to_images() and multiprocess_training_slides_to_images() on the 500-image training set:

Training image dataset conversion times

| Format | Processes      | Time   |
|--------|----------------|--------|
| jpg    | single process | 26m09s |
| jpg    | multi process  | 10m21s |
| png    | single process | 42m59s |
| png    | multi process  | 11m58s |

After calling multiprocess_training_slides_to_images() using the .png format, we have 500 scaled-down
whole-slide images in lossless .png format that we will examine in greater detail in relation to our filters.

Image saving, displaying, and conversions

To load, save, and display images, we use the Python Pillow
package. In particular, we use the Image module, which contains an Image class used to represent an image. The wsi/slide.py file contains an open_image() function to open an image stored in the file system. The get_training_image_path() function takes a slide number and returns the path to the corresponding training image file, meaning the scaled-down .png file that we created by calling multiprocess_training_slides_to_images().
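As a sketch of how such a path helper might look (the file-naming scheme below is hypothetical; the real get_training_image_path() in wsi/slide.py encodes additional details, such as the scaled dimensions, in the filename):

```python
import os

SRC_TRAIN_DIR = "data/training_png"  # hypothetical output directory
SCALE_FACTOR = 32

def get_training_image_path(slide_number):
    """Return the path to the scaled-down training image for a slide number.
    The naming scheme here is illustrative, not the project's exact one."""
    filename = f"TUPAC-TR-{slide_number:03d}-{SCALE_FACTOR}x.png"
    return os.path.join(SRC_TRAIN_DIR, filename)

print(get_training_image_path(4))
```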

If we want to convert a single .svs WSI file to a scaled-down .png (without converting all .svs files),
open that .png image file as a PIL Image, and display the image to the screen, we can do the following.

To mathematically manipulate the images, we use NumPy arrays. The wsi/util.py file contains a
pil_to_np_rgb() function that converts a PIL Image to a 3-dimensional NumPy array in RGB format. The first dimension represents the number of rows, the second dimension represents the number of columns, and the third dimension represents the channel (red, green, and blue).

rgb = util.pil_to_np_rgb(img)
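The conversion itself can be approximated in a few lines with Pillow and NumPy (a sketch; the project's pil_to_np_rgb() also prints timing and array information via np_info()):

```python
import numpy as np
from PIL import Image

def pil_to_np_rgb(pil_img):
    """Convert a PIL Image to a (rows, columns, channels) uint8 NumPy array."""
    return np.asarray(pil_img.convert("RGB"))

# Example with 1385 rows x 1810 columns; note PIL sizes are (width, height)
# while the resulting NumPy shape is (rows, columns, channels).
img = Image.new("RGB", (1810, 1385), (200, 180, 180))
rgb = pil_to_np_rgb(img)
print(rgb.shape, rgb.dtype)  # (1385, 1810, 3) uint8
```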

For convenience, the display_img() function can be used to display a NumPy array image. Text can be added to the displayed image, which can be very useful when visually comparing the results of multiple filters.

When performing operations on NumPy arrays, functions in the wsi/filter.py file will often utilize the util.np_info() function to display information about the NumPy array and the amount of time required to perform the operation. For example, the previous call to pil_to_np_rgb() internally calls np_info():

We see that the PIL-to-NumPy array conversion took 0.16s. The type of the NumPy array is uint8, which means that each pixel is represented by a red, green, and blue unsigned integer value from 0 to 255. The image has a height of 1385 pixels, a width of 1810 pixels, and three channels (representing red, green, and blue).

We can obtain additional information about NumPy arrays by setting the util.ADDITIONAL_NP_STATS constant to True. If we rerun the previous code with ADDITIONAL_NP_STATS = True, we see the following:

The minimum value is 2, the maximum value is 255, the mean value is 182.62, and binary is false, meaning that the image is not a binary image. A binary image is an image that consists of only two values (True or False, 1.0 or 0.0, 255 or 0). Binary images are produced by actions such as thresholding.
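A sketch of how these statistics can be computed, including the two-value check that flags binary images (the field names below are illustrative, not necessarily those printed by np_info()):

```python
import numpy as np

def describe(arr):
    """Rough equivalent of the extra stats reported with ADDITIONAL_NP_STATS."""
    binary = np.unique(arr).size <= 2  # at most two distinct values
    return {
        "min": arr.min(),
        "max": arr.max(),
        "mean": round(float(arr.mean()), 2),
        "binary": binary,
    }

rgb = np.array([[2, 128], [200, 255]], dtype=np.uint8)
print(describe(rgb))   # four distinct values, so not binary

mask = rgb > 100       # thresholding yields a binary (boolean) array
print(describe(mask))
```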

When interacting with NumPy image processing code, the information provided by np_info() can be extremely useful. For example, some functions return boolean NumPy arrays, other functions return float NumPy arrays, and other functions might return uint8 NumPy arrays. Before performing actions on a NumPy array, it’s usually necessary to know the data type of the array and the nature of the data in that array. For performance reasons, normally ADDITIONAL_NP_STATS should be set to False.

The wsi/util.py file contains an np_to_pil() function that converts a NumPy array to a PIL image.
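A sketch of the reverse conversion, assuming boolean arrays map to 0/255 and float arrays hold values in [0, 1] (the real np_to_pil() may differ in details):

```python
import numpy as np
from PIL import Image

def np_to_pil(np_img):
    """Convert a NumPy image array back to a PIL Image."""
    if np_img.dtype == bool:
        np_img = np_img.astype(np.uint8) * 255          # True -> 255, False -> 0
    elif np_img.dtype != np.uint8:
        np_img = (np_img * 255).astype(np.uint8)        # assume floats in [0, 1]
    return Image.fromarray(np_img)

mask = np.eye(4, dtype=bool)        # a small binary image
pil = np_to_pil(mask)
print(pil.size, pil.mode)  # (4, 4) L
```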

If we have a PIL image, saving the image to the file system can be accomplished by calling the image’s save() function.

img.save(path)

Summary

This first article in the automatic identification of tissues from big whole-slide images series gave you an overview of the challenge and explained how to set up the development environment and scale down the images. Part 2 of this series will cover applying filters for tissue segmentation.