High-content image analysis captures many cellular parameters, but current methods for interpreting the resulting multidimensional data assume a normal distribution, which is rarely seen in biological data sets. We describe a novel statistically based approach that collapses a set of cellular measurements into a single value, permitting a simplified and unbiased comparison of heterogeneous cellular populations. Differences in multiple cellular responses between two populations are measured using nonparametric Kolmogorov-Smirnov (KS) statistics. This method can be used to study cellular functions, to identify novel target genes and pharmacodynamic biomarkers, and to characterize drug mechanisms of action.

Early high-throughput imaging experiments were capable of measuring only one or a few parameters. A single “reporter” (e.g., average fluorescence intensity) in a cell-based screen yields a limited amount of information and often introduces bias arising from differences in cellular factors such as size, cell cycle stage, and metabolism. High-content multidimensional image analysis addresses this limitation by collecting information on several cellular parameters. This report describes a simple method for delivering an unbiased interpretation of high-content measurements that incorporates diverse cellular measurements and addresses the challenges of analyzing these types of data sets.

Cellular populations rarely fit normal (Gaussian) distributions (see Supplementary Figure S1), limiting the usefulness of parametric statistical tools. Our methodology takes advantage of Kolmogorov-Smirnov (KS) nonparametric statistical analysis, which does not assume a normal distribution, yet provides statistically significant results (1). In contrast, applying statistical tools that assume a bell-curve distribution when that assumption does not hold may lead to improper interpretation of the results (see Supplementary Figure S3). The same limitation applies to principal component analysis (2-4) and other clustering methodologies (5-7) that rely on a mean calculated on the basis of a normal data distribution.
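The contrast between a mean-based comparison and the KS statistic can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the function name `ks_statistic`, the log-normal "control" population, and the bimodal "treated" population are hypothetical and are not taken from the experiments reported here. The treated values are shifted so that both populations have the same mean, which leaves a mean-based comparison with nothing to detect, while the KS D-value still registers the difference in shape.

```python
import random
from bisect import bisect_right
from statistics import fmean


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D-value: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


rng = random.Random(0)  # fixed seed; the data are synthetic
n = 2000

# Hypothetical control population: right-skewed (log-normal) values.
control = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]

# Hypothetical treated population: two distinct subpopulations.
treated = [rng.gauss(0.3, 0.1) if i % 2 == 0 else rng.gauss(1.9, 0.1)
           for i in range(n)]

# Shift the treated values so that both sample means coincide.
shift = fmean(control) - fmean(treated)
treated = [x + shift for x in treated]

mean_gap = abs(fmean(control) - fmean(treated))
d_value = ks_statistic(control, treated)

print(f"difference in means: {mean_gap:.2e}")
print(f"KS D-value:          {d_value:.3f}")
```

The same D-value is available from standard statistics libraries (for example, `scipy.stats.ks_2samp` in SciPy); the pure-Python version is used here only to keep the sketch free of dependencies.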

Interpretation of cellular changes induced by small-molecule or small interfering RNA (siRNA) treatment is enhanced when considering multiple cellular parameters, in addition to measurements of specific reporters. Typically, KS statistics are applied to individual descriptors (8,9); we expanded this methodology by combining KS values of paired cellular populations. This analysis examines the entire N-dimensional descriptor space, where N is the number of parameters selected for consideration.

KS analysis makes no assumption about distribution type and determines the maximal difference between the cumulative distributions of two populations for a single parameter, yielding a dimensionless D-value (8,9) (see Supplementary Figure S2). We determined the significance of drug treatments and RNA interference (RNAi)–based knockdown of target genes on cell populations by establishing critical threshold values for the KS statistic in both treated and untreated cell populations, as described in the Supplementary Materials. To convert multiple KS scores into a single readout, we generated a cumulative score (CS), defined as the Euclidean distance in a multidimensional space of descriptors (see Supplementary Figure S4). The CS is a unitless score that allows a direct comparison of cumulative differences for a set of descriptors between the target and its control data sets. The CS ascribes equal weight to the relative calculated differences measured for each parameter, reflecting an unbiased multidimensional result. Thus, the CS leverages the power of multidimensional high-content analysis while yielding a simple output that describes the observed global cellular changes.
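One way to read the definition above is that each descriptor contributes its own D-value and the CS is the length of the resulting vector, CS = sqrt(D_1^2 + D_2^2 + ... + D_N^2). The following minimal sketch implements that reading. It is an assumption-laden illustration, not the implementation used in this study: the function names, the descriptor names, and the toy per-cell values are all hypothetical, and the threshold step described in the Supplementary Materials is not included.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D-value: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the largest gap occurs at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def cumulative_score(control, treated):
    """Combine per-descriptor D-values into one equally weighted score.

    `control` and `treated` map a descriptor name to per-cell values.
    Returns the Euclidean norm of the D-values and the D-values themselves.
    """
    d_values = {name: ks_statistic(control[name], treated[name])
                for name in control}
    cs = math.sqrt(sum(d * d for d in d_values.values()))
    return cs, d_values


# Toy per-cell measurements (hypothetical descriptors and values).
control = {
    "cell_area": [10, 11, 12, 13],
    "nuclear_area": [5, 6, 7, 8],
    "tubulin_intensity": [1, 2, 3, 4],
}
treated = {
    "cell_area": [14, 15, 16, 17],      # fully separated from control
    "nuclear_area": [5, 6, 7, 8],       # unchanged
    "tubulin_intensity": [3, 4, 5, 6],  # partially shifted
}

cs, d_values = cumulative_score(control, treated)
print(d_values)  # {'cell_area': 1.0, 'nuclear_area': 0.0, 'tubulin_intensity': 0.5}
print(f"CS = {cs:.3f}")  # CS = 1.118
```

Because every D-value lies between 0 and 1, a CS computed this way over N descriptors is bounded by sqrt(N), so scores are comparable only between analyses that use the same descriptor set.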

By integrating information from multiple descriptors across all drugs tested, a CS value circumvents the limitations associated with choosing an individual parameter. In cells stained for tubulin and DNA, multiple descriptors detail the diversity of phenotypic changes following drug treatment (Figure 1A). Each drug shows distinct effects on the cellular parameters, and no single descriptor effectively captures this diversity across all drugs (Figure 1B). Taxol, a microtubule stabilizer, causes its most apparent changes by reducing the size of cells and increasing the level of tubulin expression in a dose-dependent manner (Figure 1B). Etoposide, a topoisomerase II inhibitor that stabilizes the covalent complexes of topoisomerase II with DNA, manifests its effects only as a change in cell shape, increasing both cell and nuclear area (Figure 1B). Staurosporine, a broad kinase inhibitor, and nocodazole, which promotes microtubule depolymerization, both reduce the ratio of nuclear to cell area, as well as tubulin expression (Figure 1B). Thus, gauging only tubulin intensity and cell area would not be sufficient to capture global phenotypic changes within cells. However, taking into account data from all descriptors in the form of a CS allows us to monitor global cellular phenotypic changes. Furthermore, CS-based quantitative high-content, high-throughput screening (HTS) can be used to generate robust dose-response curves to assess the efficacy of drugs (Figure 1C).

Even in cases where the nodes of biological networks are well-defined, it is advantageous to consider other phenotypic cell changes in addition to the specific pathway reporter. The experimental results in Figure 2 illustrate how additional descriptors capture the phenotypic diversity of changes in cells treated with siRNAs against various cellular targets.

Glucose transporter 1 (Glut1) expression was chosen as the primary readout in this experiment. RNAi-based knockdown of hyperpolarization-activated cyclic nucleotide-gated potassium channel 2 (HCN2) and of polo-like kinase 1 (PLK1) produces equivalent changes in the average Glut1 fluorescence intensity as compared with luciferase, a control gene not expressed in mammalian cells (Figure 2A). With the inclusion of additional phenotypic measurements, it is clear that the effects of knocking down the two genes are biologically very different (Figure 2B). Closer inspection identifies differences in the distribution of Glut1 within cells treated with PLK1 siRNA, distinct from the membrane/cytoplasmic localization observed in cells treated with siRNA to either luciferase or HCN2, suggesting a potential effect on the intracellular trafficking of Glut1. A further difference between the genes can be observed in the CS of shape between luciferase control or HCN2 siRNA–treated cells and PLK1 siRNA–treated cells, providing additional phenotypic separation. As such, this method addresses one of the major needs in high-content, high-throughput drug discovery platforms: to quantitatively and effectively screen for mechanistic changes in the phenotypes of cells, assisting in target identification and prioritization.

In conclusion, we describe a novel and simple application of the KS method to resolve high-content imaging data with improved statistical significance as compared with standard parametric statistical tools. We show that the CS method provides a simpler output that allows differentiation of individual responses within complex high-content data sets. We demonstrate application of this methodology to compound profiling and a subgenome-wide siRNA screen. The use of multiple phenotypic measures and nonparametric tools in high-content imaging reveals both expected and unexpected biological information that can be used to derive phenotypic signatures of compounds, to obtain information on compound potencies, to characterize hits from siRNA screens, and to uncover novel mechanisms of action of small molecules.

Acknowledgments

We thank Irena Pak and Bonnie Howell (Merck & Co., Inc.; NJ and PA, USA) for help with data analysis and scientific discussions.