Learn how one team developed algorithms to automatically identify tissues from big whole-slide images.

In the first article in this series, we gave an overview of this project and explained how we scaled down the images. Next, we will investigate image filters and will determine a set of filters that can be utilized for effective tissue segmentation with our data set. We will mask out non-tissue by setting non-tissue pixels to 0 for their red, green, and blue channels. For our particular data set, our mask will AND together a green channel mask, a grays mask, a red pen mask, a green pen mask,
and a blue pen mask. Following this, we will mask out small objects from the images.

The filtering approach that we develop here has several benefits:

All relevant filters are centralized in a single file, wsi/filter.py, for convenience.

Filters return results in a standard format, and the returned datatype can easily be changed (boolean, uint8, float).

Critical filter debug information (for example, shape, type, and processing time) is output to the console.

Filter results can be easily viewed across the entire data set or subsets of the data set.

Multiprocessing is used for increased performance.

Filters can easily be combined, strung together, or otherwise modified.

Filters

Let’s take a look at several ways that our images can be filtered. Filters are represented by functions
in the wsi/filter.py file and have filter_ prepended to the function names.

RGB to grayscale

A very common task in image processing is to convert an RGB image to a grayscale image. In this process, the three color channels are replaced by a single grayscale channel. The grayscale pixel value is computed by combining the red, green, and blue values in set percentages. The filter_rgb_to_grayscale() function multiplies the red value by 21.25%, the green value by 71.54%, and the blue value by 7.21%, and these values are added together to obtain the grayscale pixel value.

Although the PIL Image convert("L") function can also be used to convert an RGB image to a grayscale image, we instead use the filter_rgb_to_grayscale() function because having a reference to the RGB image as a NumPy array can often be very useful during image processing.

The following code opens a slide as a PIL image, converts this to an RGB NumPy array, and then converts this to a grayscale NumPy array.
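As a minimal sketch of this step, filter_rgb_to_grayscale() might look like the following, using the channel weights described above (the slide-loading step is shown as a comment with a placeholder path):

```python
import numpy as np
# Loading a slide (placeholder path):
#   from PIL import Image
#   rgb = np.asarray(Image.open("slide.png"))

def filter_rgb_to_grayscale(np_img):
    """Combine the R, G, B channels using the 21.25%/71.54%/7.21%
    weights to produce a 2-dimensional uint8 grayscale array."""
    grayscale = np.dot(np_img[..., :3], [0.2125, 0.7154, 0.0721])
    return grayscale.astype(np.uint8)
```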

In the console, we see that the grayscale image is a two-dimensional NumPy array because the 3 color channels have been combined into a single grayscale channel. The data type is uint8 and pixels are represented by integer values between 0 and 255.

Complement

In our whole-slide image training set, the slide backgrounds are illuminated by white light, which means that a uint8 pixel in the background of a grayscale image is usually close to or equal to 255. However, conceptually and mathematically it is often useful to have background values close to or equal to 0. For example, this is useful in thresholding, where we might ask if a pixel value is above a particular threshold value. This can also be useful in masking out a background of 0 values from an image.

The filter_complement() function inverts the values, and therefore, the colors in the NumPy array representation of an image. In the following code, we use the filter_complement() function to invert the previously obtained grayscale image.
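For a uint8 image, the inversion itself is simple; a sketch of filter_complement() might be:

```python
import numpy as np

def filter_complement(np_img):
    """Invert a uint8 image so that background values near 255
    become values near 0."""
    return 255 - np_img
```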

Thresholding

Basic threshold

With basic thresholding, a binary image is generated, where each value in the resulting NumPy array indicates whether the corresponding pixel in the original image is above a particular threshold value. So, a pixel with a value of 160 with a threshold of 150 would generate a True (or 255, or 1.0), and a pixel with a value of 140 with a threshold of 150 would generate a False (or 0, or 0.0).

In the following code, we apply a basic threshold with a threshold value of 100 to the grayscale complement of the original image.
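A basic threshold reduces to a single comparison; as a sketch (the helper name filter_threshold is hypothetical here):

```python
import numpy as np

def filter_threshold(np_img, threshold):
    """Return a boolean mask that is True where the pixel value
    exceeds the threshold and False otherwise."""
    return np_img > threshold
```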

Hysteresis threshold

Hysteresis thresholding is a two-level threshold. The top-level threshold is treated in a similar fashion to basic thresholding. For the bottom level, a pixel must exceed the bottom-level threshold and be connected to a region that exceeds the top-level threshold. This process typically results in much better thresholding than basic thresholding. Reasonable values for the top and bottom thresholds can be determined through experimentation.

The filter_hysteresis_threshold() function uses default bottom and top threshold values of 50 and 100. The default array output type from this function is uint8. Because the output of this function is a binary image, the values in the output array will be either 255 or 0. The output type of this function can be specified using the output_type parameter. Note that when performing masking, it is typically more useful to have a NumPy array of boolean values.

In the following code, we perform a hysteresis threshold on the complement of the grayscale image.
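One way to sketch filter_hysteresis_threshold() is as a wrapper around scikit-image's apply_hysteresis_threshold(), with the default values and output types described above:

```python
import numpy as np
import skimage.filters as sk_filters

def filter_hysteresis_threshold(np_img, low=50, high=100, output_type="uint8"):
    """Keep pixels above `high`, plus pixels above `low` that are
    connected to a pixel above `high`. Returns bool or uint8 (0/255)."""
    mask = sk_filters.apply_hysteresis_threshold(np_img, low, high)
    if output_type == "bool":
        return mask
    return mask.astype(np.uint8) * 255
```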

In the displayed image, the result is a binary image. All pixel values are either white (255) or black (0). The red display text in the corner can be ignored because it is for informational purposes only and is not present when we save the images to the file system.

Notice that the shadow area along the top edge of the slide makes it through the hysteresis threshold filter even though conceptually it is background and should not be treated as tissue.

Hysteresis threshold filter

In the following output, we see the console output from our filter operations.

Otsu's method

Another common thresholding technique is Otsu's method, which automatically determines a single global threshold value from the image histogram, so no threshold values need to be specified manually. Next, we apply Otsu's method and compare the result with hysteresis thresholding.
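Otsu thresholding can be sketched as a wrapper around scikit-image's threshold_otsu(); the filter_otsu_threshold name follows the article's filter_ naming convention but is assumed here:

```python
import numpy as np
import skimage.filters as sk_filters

def filter_otsu_threshold(np_img):
    """Compute Otsu's global threshold from the image histogram
    and apply it, returning a uint8 binary image (0/255)."""
    otsu_value = sk_filters.threshold_otsu(np_img)
    return (np_img > otsu_value).astype(np.uint8) * 255
```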

In the resulting image, we see that Otsu's method generates results roughly similar to those of hysteresis thresholding. However, Otsu's method is less aggressive in terms of what it lets through for the tissue in the upper-left area of the slide. The background shadow area at the top of the slide passes through the filter in much the same way as with hysteresis thresholding. Most of the slides in the training set do not have such a pronounced shadow area, but it would be nice to have an image processing solution that treats the shadow area as background.

Otsu threshold filter

In terms of performance, thresholding using Otsu’s method is very fast, as we see in the console output.

Contrast

For an image, suppose we have a histogram of the number of pixels (pixel counts on the y-axis) plotted against the range of possible pixel values (x-axis, 0 to 255). Contrast is a measure of the difference in intensities. An image with low contrast is typically dull, and details are not clearly seen visually. An image with high contrast is typically sharp, and details can clearly be discerned. Increasing the contrast in an image can be used to bring out various details in the image.

Contrast stretching

One form of increasing the contrast in an image is contrast stretching. Suppose that all intensities in an image occur between 100 and 150 on a scale from 0 to 255. If we rescale the intensities so that 100 now corresponds to 0 and 150 corresponds to 255 and we linearly rescale the intensities between these points, we have increased the contrast in the image and differences in detail can more clearly be seen. This is contrast stretching.

As an example, in the following code, we perform contrast stretching with a low pixel value of 100 and a high pixel value of 200 on the complement of the grayscale image.
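Contrast stretching with fixed low and high pixel values can be sketched with scikit-image's rescale_intensity(); the filter_contrast_stretch name and its defaults are assumptions:

```python
import numpy as np
import skimage.exposure as sk_exposure

def filter_contrast_stretch(np_img, low=100, high=200):
    """Linearly rescale intensities so that `low` maps to 0 and
    `high` maps to 255; values outside the range are clipped."""
    return sk_exposure.rescale_intensity(np_img, in_range=(low, high))
```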

Histogram equalization

Histogram equalization is another technique that can be used to increase contrast in an image. However, unlike contrast stretching, which has a linear distribution of the resulting intensities, the histogram equalization transformation is based on probabilities and is non-linear. For more information about histogram equalization, see https://en.wikipedia.org/wiki/Histogram_equalization.

For example, the following code displays the grayscale image. We increase contrast in the grayscale image using histogram equalization and display the resulting image.
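As a sketch, histogram equalization can use scikit-image's equalize_hist(), which returns floats in the range 0 to 1 (the wrapper name is assumed):

```python
import numpy as np
import skimage.exposure as sk_exposure

def filter_histogram_equalization(np_img):
    """Equalize the histogram; equalize_hist returns floats in [0, 1],
    which are rescaled here to uint8 (0 to 255)."""
    return (sk_exposure.equalize_hist(np_img) * 255).astype(np.uint8)
```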

Adaptive equalization

Rather than applying a single transformation to all pixels in an image, adaptive histogram equalization applies transformations to local regions in an image. As a result, adaptive equalization allows contrast to be enhanced to different extents in different regions based on the regions’ intensity histograms. For more information about adaptive equalization, see
https://en.wikipedia.org/wiki/Adaptive_histogram_equalization.

The filter_adaptive_equalization() function uses the scikit-image contrast limited adaptive histogram
equalization (CLAHE) implementation. In the following code, we apply adaptive equalization to the grayscale image and display both the grayscale image and the image after adaptive equalization for comparison.
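A possible sketch of filter_adaptive_equalization() using scikit-image's equalize_adapthist() (the clip_limit default here is an assumption):

```python
import numpy as np
import skimage.exposure as sk_exposure

def filter_adaptive_equalization(np_img, clip_limit=0.01):
    """Contrast limited adaptive histogram equalization (CLAHE);
    equalize_adapthist returns floats in [0, 1], rescaled to uint8."""
    equalized = sk_exposure.equalize_adapthist(np_img, clip_limit=clip_limit)
    return (equalized * 255).astype(np.uint8)
```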

Color

The WSI tissue samples in the training data set have been H&E stained. Eosin stains basic structures such as most cytoplasm proteins with a pink tone. Hematoxylin stains acidic structures such as DNA and RNA with a purple tone. This means that cells tend to be stained pink, and particular areas of the cells such as the nuclei tend to be stained purple. However, note that appearance can vary greatly based on the types of cells that are stained and the amounts of stain applied.

As an example of staining differences, the following image shows a slide that has pink and purple staining next to another slide where all of the tissue appears purple.

Pink and purple slide

Purple slide

Another factor regarding color is that many slides have been marked with red, green, and blue pens. Whereas in general we would like our filters to include pink and purple colors because these typically indicate stained tissue, we would like our filters to exclude red, green, and blue colors because these typically indicate pen marks on the slides that are not tissue.

The following image shows an example of a slide that has been marked with red pen and some green pen.

Slide marked with red and green pen

Developing color filters that can be used to filter tissue areas can be fairly challenging for a variety of reasons, including:

Filters need to be general enough to work across all slides in the dataset.

Filters should handle issues such as variations in shadows and lighting.

The amount of H&E (purple and pink) staining can vary greatly from slide to slide.

Pen mark colors (red, green, and blue) vary due to issues such as lighting and pen marks over tissue.

There can be color overlap between stained tissue and pen marks, so we need to balance how aggressively stain colors are inclusively filtered and how pen colors are exclusively filtered.

RGB to HED

The scikit-image skimage.color package features an rgb2hed() function that performs color deconvolution on the original RGB image to create HED (Hematoxylin, Eosin, Diaminobenzidine) channels. The filter_rgb_to_hed() function encapsulates rgb2hed(). The filter_hed_to_hematoxylin() and filter_hed_to_eosin() functions read the hematoxylin and eosin channels and rescale the resulting 2-dimensional NumPy arrays (for example, 0 to 255 for uint8) to increase contrast.

The following code converts the RGB image to an HED image. It then obtains the hematoxylin and eosin channels and displays the resulting images.
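A sketch of these three functions, assuming rescale_intensity() is used for the uint8 contrast rescaling:

```python
import numpy as np
import skimage.color as sk_color
import skimage.exposure as sk_exposure

def filter_rgb_to_hed(rgb):
    """Color deconvolution from RGB into HED space."""
    return sk_color.rgb2hed(rgb)

def filter_hed_to_hematoxylin(hed):
    """Rescale the hematoxylin channel to uint8 (0 to 255) for contrast."""
    return sk_exposure.rescale_intensity(
        hed[:, :, 0], out_range=(0.0, 255.0)).astype(np.uint8)

def filter_hed_to_eosin(hed):
    """Rescale the eosin channel to uint8 (0 to 255) for contrast."""
    return sk_exposure.rescale_intensity(
        hed[:, :, 1], out_range=(0.0, 255.0)).astype(np.uint8)
```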

Notice that the hematoxylin channel does fairly well at detecting the purple areas of the original slide,
which could potentially be used to narrow in on tissue with cell nuclei and thus on regions that can be inspected for mitoses. Both the hematoxylin and eosin channel filters include the background in the resulting image, which is rather unfortunate in terms of differentiating tissue from non-tissue. Also, notice in the eosin channel that the red pen is considered to be part of the eosin stain spectrum.

Green channel filter

If we look at an RGB color wheel, we see that purple and pink are next to each other. On the other side of the color wheel, we have yellow and green. Because green is one of our 3 NumPy array RGB color channels, filtering out pixels that have a high green channel value can be one way to potentially filter out parts of the slide that are not pink or purple. This includes the white background, because white also has a high green channel value along with high red and blue channel values.

We’ll use the default green threshold value of 200 for the filter_green_channel() function, meaning that any pixels with green channel values of 200 or greater will be rejected.
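The core of filter_green_channel() can be sketched as a single boolean comparison:

```python
import numpy as np

def filter_green_channel(rgb, green_thresh=200):
    """Boolean mask that is True for pixels to keep; pixels with a
    green channel value of green_thresh or greater are rejected."""
    return rgb[:, :, 1] < green_thresh
```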

The green channel filter does a decent job of differentiating the tissue from the white background. However, notice that the shadow area at the top of the slide is not excluded by the filter.

Original slide

Green channel filter

A filter such as the green channel filter most likely would be used in conjunction with other filters for masking purposes. As a result, the default output type for the green channel filter is bool, as we see in the console output. If another output type is wanted, this can be set with the function’s output_type parameter.

Grays filter

Next, let’s utilize a filter that can filter out the annoying shadow area at the top of slide #2. Notice that the shadow area consists of a gradient of dark-to-light grays. A gray pixel has red, green, and blue channel values that are close together. The filter_grays() function filters out pixels that have red, blue, and green values that are within a certain tolerance of each other. The default tolerance for filter_grays() is 15. The grays filter also filters out white and black pixels because they have similar red, green, and blue values.
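A sketch of filter_grays(), comparing the pairwise channel differences against the tolerance:

```python
import numpy as np

def filter_grays(rgb, tolerance=15):
    """Boolean mask that is True for pixels to keep; 'gray' pixels, whose
    red, green, and blue values are all within `tolerance` of each other,
    are rejected. This also rejects white and black pixels."""
    rgb = rgb.astype(int)  # avoid uint8 wraparound when subtracting
    rg_close = abs(rgb[:, :, 0] - rgb[:, :, 1]) < tolerance
    rb_close = abs(rgb[:, :, 0] - rgb[:, :, 2]) < tolerance
    gb_close = abs(rgb[:, :, 1] - rgb[:, :, 2]) < tolerance
    return ~(rg_close & rb_close & gb_close)
```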

Notice that in addition to filtering out the white background, the grays filter has indeed filtered out the shadow area at the top of the slide.

Original slide

Grays filter

Like the green channel filter, the default type of the returned array is bool because the grays filter will typically be used in combination with other filters. Because the grays filter is fast, it offers a potentially low-cost way to filter out shadows from the slides during preprocessing.

Red filter

Next, let’s turn our attention to filtering out shades of red, which can be used to filter out a significant amount of the red pen color. The red pen consists of a wide variety of closely related red shades. Certain shades are reddish, others are maroonish, and others are pinkish, for example. These color gradations are a result of a variety of factors, such as the amount of ink, lighting, shadowing, and tissue under the pen marks.

The filter_red() function filters out reddish colors through a red channel lower threshold value, a green channel upper threshold value, and a blue channel upper threshold value. The generated mask is based on a pixel being above the red channel threshold value and below the green and blue channel threshold values. One way to determine these values is to display the slide image in a web browser and use a tool such as the Chrome ColorPick Eyedropper to click on a red pen pixel to determine the approximate red, green, and blue values.

In this example with slide #4, we use a red threshold value of 150, a green threshold value of 80, and a blue threshold value of 90. In addition, to help us visualize the filter results, we apply the red filter to the original RGB image as a mask and also apply the inverse of the red filter to the original image as a mask.
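A sketch of filter_red(), returning a boolean mask that is False for reddish pixels:

```python
import numpy as np

def filter_red(rgb, red_lower_thresh, green_upper_thresh, blue_upper_thresh):
    """Boolean mask that is True for pixels to keep; reddish pixels
    (red above its lower threshold, green and blue below their upper
    thresholds) are rejected."""
    r = rgb[:, :, 0] > red_lower_thresh
    g = rgb[:, :, 1] < green_upper_thresh
    b = rgb[:, :, 2] < blue_upper_thresh
    return ~(r & g & b)
```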

Red pen filter

Next, let’s turn our attention to a more inclusive red pen filter that handles more shades of red. Because the filter_red() function returns a boolean array result, we can combine multiple sets of filter_red() threshold values (red_lower_thresh, green_upper_thresh, blue_upper_thresh) using boolean operators such as &. We can determine these values using a color picker tool such as the Chrome ColorPick Eyedropper, as mentioned previously. In addition to determining various shades of red pen on a single slide, shades of red pen from other slides should be identified and included. Note that we need to be careful with pinkish shades of red due to the similarity of these shades to eosin staining.

Using the color picker technique, the filter_red_pen() function uses the following sets of red threshold values.
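As a sketch of the ANDing pattern, the following combines the (150, 80, 90) set from the example above with two hypothetical placeholder sets; the real sets come from color-picking the training slides:

```python
import numpy as np

def filter_red(rgb, red_lower_thresh, green_upper_thresh, blue_upper_thresh):
    """True for pixels to keep; reddish pixels are rejected."""
    r = rgb[:, :, 0] > red_lower_thresh
    g = rgb[:, :, 1] < green_upper_thresh
    b = rgb[:, :, 2] < blue_upper_thresh
    return ~(r & g & b)

def filter_red_pen(rgb):
    """AND together one 'not this shade of red' mask per threshold set,
    so the combined mask rejects every listed shade."""
    thresholds = [
        (150, 80, 90),   # set from the slide #4 example above
        (110, 20, 30),   # hypothetical placeholder for a darker shade
        (185, 65, 105),  # hypothetical placeholder for a lighter shade
    ]
    mask = np.ones(rgb.shape[:2], dtype=bool)
    for red_lower, green_upper, blue_upper in thresholds:
        mask &= filter_red(rgb, red_lower, green_upper, blue_upper)
    return mask
```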

Compared with using a single set of red threshold values, we can see that the red pen filter is significantly more inclusive in terms of the shades of red that are accepted. As a result, more red pen is filtered. However, notice that some of the pinkish-red from eosin-stained tissue is also included as a result of this more aggressive filtering.

Not red pen

Red pen

Even though the red pen filter ANDs nine sets of red filter results together, we see that the performance is excellent.

Blue filter

If we visually examine the 500 slides in the training data set, we see that several of the slides have been marked with blue pen. Rather than blue lines, many of the blue marks consist of blue dots surrounding particular areas of interest on the slides, although this is not always the case. Some of the slides also have blue pen lines. Once again, the blue pen marks consist of several gradations of blue.

We’ll start by creating a filter to filter out blue. The filter_blue() function operates in a similar way as the filter_red() function. It takes a red channel upper threshold value, a green channel upper threshold value, and a blue channel lower threshold value. The generated mask is based on a pixel being below the red channel threshold value, below the green channel threshold value, and above the blue channel threshold value.

Once again, we’ll apply the results of the blue filter and the inverse of the blue filter as masks to the original RGB image to help visualize the filter results.
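A sketch of filter_blue(); any specific threshold values shown with it here are illustrative rather than taken from the article:

```python
import numpy as np

def filter_blue(rgb, red_upper_thresh, green_upper_thresh, blue_lower_thresh):
    """Boolean mask that is True for pixels to keep; bluish pixels
    (red and green below their upper thresholds, blue above its lower
    threshold) are rejected."""
    r = rgb[:, :, 0] < red_upper_thresh
    g = rgb[:, :, 1] < green_upper_thresh
    b = rgb[:, :, 2] > blue_lower_thresh
    return ~(r & g & b)
```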

Blue pen filter

In filter_blue_pen(), we AND together various blue shade ranges using filter_blue() with
sets of red, green, and blue threshold values to create a blue pen filter that filters out various shades of blue.

The filter_blue() example filters out 0.45% of the slide pixels and the filter_blue_pen() example filters out 0.69% of the slide pixels.

filter_blue: 0.45%
filter_blue_pen: 0.69%

Green filter

We utilize the filter_green() function to filter green color shades. If we examine the green pen marks on the slides with a color picker tool, the green and blue channel values appear to track together. As a result, this function has a red channel upper threshold value, a green channel lower threshold value, and a blue channel lower threshold value.

Using a red upper threshold of 150, a green lower threshold of 160, and a blue lower threshold of 140, we see that much of the green ink above the background is filtered out, but most of the green ink above the tissue is not filtered out.
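A sketch of filter_green() using the threshold values mentioned above as defaults (whether these are the function's actual defaults is an assumption):

```python
import numpy as np

def filter_green(rgb, red_upper_thresh=150, green_lower_thresh=160,
                 blue_lower_thresh=140):
    """Boolean mask that is True for pixels to keep; greenish pixels
    (red below its upper threshold, green and blue above their lower
    thresholds, since green and blue track together) are rejected."""
    r = rgb[:, :, 0] < red_upper_thresh
    g = rgb[:, :, 1] > green_lower_thresh
    b = rgb[:, :, 2] > blue_lower_thresh
    return ~(r & g & b)
```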

K-means segmentation

The scikit-image library contains functionality that allows for image segmentation using k-means clustering based on location and color. This allows regions of similarly colored pixels to be grouped together. These regions are colored based on the average color of the pixels in the individual regions. This could potentially be used to filter regions based on their colors, where we could filter on pink shades for eosin-stained tissue and purple shades for hematoxylin-stained tissue.

The filter_kmeans_segmentation() function has a default value of 800 segments. We’ll increase this to 3000 using the n_segments parameter. In the following example, we’ll perform k-means segmentation on the original image. In addition, we’ll create a threshold using Otsu’s method and apply the resulting mask to the original image. We’ll then perform k-means segmentation on that image.
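A sketch of filter_kmeans_segmentation() using scikit-image's SLIC implementation, with label2rgb() recoloring each segment by its average color (the compactness value is an assumption):

```python
import numpy as np
import skimage.color as sk_color
import skimage.segmentation as sk_segmentation

def filter_kmeans_segmentation(rgb, n_segments=800, compactness=10):
    """SLIC k-means segmentation based on color and location; each
    resulting segment is recolored with its average color."""
    labels = sk_segmentation.slic(rgb, n_segments=n_segments,
                                  compactness=compactness)
    return sk_color.label2rgb(labels, rgb, kind="avg").astype(np.uint8)
```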

Note that there are a couple of practical difficulties in terms of implementing automated tissue detection using k-means segmentation. To begin with, due to the variation in tissue stain colors across the image data set, it can be difficult to filter on “pinkish” and “purplish” colors across all of the slides. In addition, the k-means segmentation technique is very computationally expensive. As we can see in the console output, the compute time increases with the number of segments. For 3000 segments, we have a filter time of approximately 20 seconds, whereas all operations that we have seen up to this point are sub-second. If we use the default value of 800 segments, the compute time for the k-means segmentation filter is approximately 7 seconds.

The scikit-image library also makes it possible to combine similarly colored regions. One way to do this with the k-means segmentation results is to build a region adjacency graph (RAG) and combine regions based on a threshold value. The filter_rag_threshold() function performs k-means segmentation, builds the RAG, and allows us to pass in the RAG threshold value.

In the following code, we perform k-means segmentation, build a RAG, and apply different RAG thresholds to combine similar regions.

RGB to HSV

Comparing hematoxylin and eosin staining can be challenging in the RGB color space. One way to simplify
this comparison is to convert to a different color space such as HSV (Hue-Saturation-Value).
The scikit-image skimage.color package features an rgb2hsv() function that converts an RGB image
to an HSV image. The filter_rgb_to_hsv() function wraps this scikit-image function. In the HSV color model, the hue is represented by 360 degrees. Purple has a hue of 270 and pink has a hue of 330. We discuss hematoxylin and eosin stain comparison in our later discussion of tile scoring, where we favor hematoxylin-stained tissue over eosin-stained tissue.

As an example, in the wsi/tiles.py file, the display_image_with_rgb_and_hsv_histograms() function takes in an image as a NumPy array in RGB color space and displays the image along with its RGB and HSV histograms. Internally, this function utilizes the filter_rgb_to_hsv() function.

Here we see slide #2 along with its RGB and HSV histograms. Notice that the HSV hue histogram
columns have additionally been colored based on their corresponding hue values to aid in
visual inspection.

Slide 2 RGB and HSV histograms

Summary

This second article in the automatic identification of tissues from big whole-slide images series showed how we investigated image filters and determined a set of filters that can be utilized for effective tissue segmentation with our data set. We masked out non-tissue by setting non-tissue pixels to 0 for their red, green, and blue channels. In Part 3 of this series, we'll discuss morphology operators, combining filters, and applying filters to multiple images.