The performance of most methods for cancer diagnosis using gene expression data depends heavily on careful model selection. Least squares for classification requires no model selection. However, a major drawback prevents its successful application to microarray data classification: a lack of robustness to outliers. In this paper we cast linear regression as a constrained l1-norm minimization problem to greatly alleviate this sensitivity to outliers, hence the name l1 least square. Numerical experiments show that l1 least square can match the best performance achieved by support vector machines (SVMs) with careful model selection.

With its high-throughput capability of simultaneously measuring the relative expression levels of tens of thousands of genes, the DNA microarray technique has the potential to provide more accurate and objective cancer diagnosis than the traditional histopathological approach. Success, however, depends greatly on the supervised learning algorithm selected to classify the gene expression data.

Many well-established methods are available for gene expression profile classification. According to Lee et al. (2005), they fall into four categories: (1) classical methods, such as Fisher's linear discriminant analysis, logistic regression, K-nearest neighbors, and generalized partial least squares; (2) classification trees and aggregation methods, such as CART, random forests, bagging, and boosting; (3) machine learning methods, such as neural networks and support vector machines (SVMs); and (4) generalized methods, such as flexible discriminant analysis, mixture discriminant analysis, and the shrunken centroid method. The performance of many of these methods, however, relies on a careful choice of model parameters, which is typically made via a model selection procedure such as cross-validation. For example, the model parameters of SVMs include the kernel parameters and the penalty parameter C. A recent controversy over the relative performance of SVMs and random forests exemplifies the importance of model selection. Diaz-Uriarte et al. (2006) conclude that random forests outperform SVMs, while Statnikov et al. (2008) reach the opposite conclusion. The main difference between the two studies is that model selection is carefully designed in the latter but not in the former. The episode also suggests that model selection may be an obstacle to the wider application of SVMs to gene expression profile classification. Since classification performance is a nonconvex function of the model parameters, finding optimal parameters via model selection is usually difficult.

Least squares for classification, on the other hand, requires no model selection. Consider a general classification problem with N classes. A linear model is built for each class k

y_k = w_k^T x + w_k0,  k = 1, 2, ..., N. (1)

The N equations can be grouped into

y = W x̃ (2)

where y = [y_1, y_2, ..., y_N]^T, W is the matrix whose kth row is [w_k^T, w_k0], and x̃ = [x^T, 1]^T. Consider a training dataset {(x_i, t_i), i = 1, 2, ..., n}, where t_i is the 1-of-N binary coding vector of the label of the ith feature vector x_i, i.e., a vector of zeros except for a 1 in the kth position if x_i belongs to category k. Denote by X the feature matrix whose ith row is [x_i^T, 1], and by T the target matrix whose ith row is t_i^T. The linear regression model in (2) can be fitted simultaneously to each of the columns of T, and the solution has the form

Ŵ^T = (X^T X)^{-1} X^T T. (3)

The above approach, however, is very sensitive to outliers, especially for multicategory classification (N ≥ 3). Furthermore, when least squares for classification is applied to gene expression data, these problems become more severe due to the curse of dimensionality caused by the large number of genes in each sample.
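To make this baseline concrete, the following sketch (a minimal illustration with hypothetical names, not the authors' code) fits the indicator-response model (1)-(3) with NumPy and classifies a sample by its largest fitted response:

```python
import numpy as np

def fit_ls_classifier(X, labels, N):
    """Fit the indicator-response least squares model (1)-(3).

    X: (n, d) feature matrix; labels: integer class ids in 0..N-1.
    Returns the (d+1, N) coefficient matrix, one column per class.
    """
    n = X.shape[0]
    Xt = np.hstack([X, np.ones((n, 1))])    # rows [x_i^T, 1]
    T = np.eye(N)[labels]                   # 1-of-N target matrix
    # Solution of (3); lstsq falls back to the pseudoinverse
    # when X^T X is singular (e.g., when n << d).
    W_hat_T, *_ = np.linalg.lstsq(Xt, T, rcond=None)
    return W_hat_T

def predict_ls(W_hat_T, X):
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xt @ W_hat_T, axis=1)  # arg max_k y_k
```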

Inspired by the recent progress in sparse signal recovery via l1-norm minimization (Candès et al., 2006; Candès and Tao, 2006; Donoho, 2006), we propose a new approach to overcome the major drawback of least squares for classification by casting the linear regression problem as a constrained l1-norm minimization problem. The resulting sparse solution is much less sensitive to both outliers and the curse of dimensionality. In addition, multicategory classification is realized via the one-versus-rest (OVR) and one-versus-one (OVO) approaches, which decompose the original multicategory problem into a series of binary problems. The new method is validated by comparing its cancer diagnosis performance with that of SVMs.

Methods

Binary l1 Least Square

Consider a training dataset {(x_i, y_i); i = 1, ..., n}, x_i ∈ R^d, y_i ∈ {-1, +1}, where x_i is the ith sample, a d-dimensional column vector of gene expression values with d the number of genes, and y_i is the label of the ith sample. The two classes are described by a linear model

y = [x^T, 1] w (4)

for any sample x. Applying the linear model to the training
dataset, we have

y_i = [x_i^T, 1] w,  i = 1, 2, ..., n (5)

The n equations can be grouped into

y = Xw (6)

where y = [y_1, y_2, ..., y_n]^T, and X is the n × (d+1) matrix whose ith row is [x_i^T, 1]. Since the number of samples is much smaller than the number of genes, i.e., n << d, the system in (6) is underdetermined. The solution is obtained by casting the original problem as the following constrained l1-norm minimization problem

min ||w||_1 subject to Xw = y (7)

Denote by ŵ the solution to (7). Then for any sample x, the label can simply be assigned as sign([x^T, 1] ŵ).
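Problem (7) can be solved as a linear program by the standard variable-splitting trick: introduce auxiliary variables u with -u ≤ w ≤ u and minimize Σ_j u_j. The sketch below is a minimal reading of (7) using SciPy's linprog, with our own helper names; it is not the authors' MATLAB implementation:

```python
import numpy as np
from scipy.optimize import linprog

def l1_least_square_binary(X, y):
    """Solve min ||w||_1 subject to Xw = y as a linear program.

    X: (n, d) gene expression matrix; y: labels in {-1, +1}.
    Returns w_hat of length d + 1 (the last entry is the bias).
    """
    n, d = X.shape
    Xt = np.hstack([X, np.ones((n, 1))])      # rows [x_i^T, 1]
    p = d + 1
    # Variables z = [w; u]; minimize sum(u) subject to -u <= w <= u.
    c = np.concatenate([np.zeros(p), np.ones(p)])
    I = np.eye(p)
    A_ub = np.block([[I, -I], [-I, -I]])      #  w - u <= 0, -w - u <= 0
    b_ub = np.zeros(2 * p)
    A_eq = np.hstack([Xt, np.zeros((n, p))])  # equality constraint Xw = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * (2 * p), method="highs")
    return res.x[:p]

def predict_binary(w_hat, X):
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xt @ w_hat)                # sign([x^T, 1] w_hat)
```

Since n << d, the program has 2(d+1) variables, so an efficient solver such as HiGHS (method="highs") is advisable.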

Multicategory l1 Least Square: OVR

Consider a multicategory training dataset {(x_i, y_i); i = 1, ..., n}, x_i ∈ R^d, y_i ∈ {1, 2, ..., N}, where N is the number of categories. The OVR approach builds, for each class, a binary classifier separating it from the remaining classes. The N linear models are defined as

D_k(x) = [x^T, 1] w_k,  k = 1, 2, ..., N (8)

For category k, after changing the labels of the samples belonging to category k to +1 and all other labels to -1, we apply the linear model to the training dataset

y_k = X w_k,  k = 1, 2, ..., N, (9)

where y_k is a label vector whose entries are +1 or -1. As before, the N underdetermined systems can be solved via the following N constrained l1-norm minimization problems

min ||w_k||_1 subject to X w_k = y_k (10)

where k = 1, 2, ..., N.

Denote by ŵ_k the solution to (10). Then for any sample x, the label is determined by

arg max_{k=1,2,...,N} D_k(x),  where D_k(x) = [x^T, 1] ŵ_k. (11)
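A minimal OVR wrapper, reusing l1_least_square_binary from the sketch above (a sketch under our own naming; labels are assumed to be integers 1..N, as in the text):

```python
import numpy as np

def l1_least_square_ovr(X, labels, N):
    """Fit N one-versus-rest classifiers by solving (10) for each class."""
    W = []
    for k in range(1, N + 1):
        y_k = np.where(labels == k, 1.0, -1.0)  # class k vs. the rest
        W.append(l1_least_square_binary(X, y_k))
    return np.column_stack(W)                    # (d+1, N)

def predict_ovr(W, X):
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1 + np.argmax(Xt @ W, axis=1)         # arg max_k D_k(x)
```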

Multicategory l1 Least Square: OVO

In the OVO approach, a binary classifier is constructed for each pair of classes. The linear model for class i against class j is given by

D_{i,j}(x) = [x^T, 1] w_{i,j} (12)

For the samples of categories i and j, after changing their labels to +1 and -1, respectively, applying the linear model gives rise to

y_{i,j} = X_{i,j} w_{i,j} (13)

where y_{i,j} is a vector whose entries are +1 or -1, and X_{i,j} is the matrix whose kth row is [x_k^T, 1], with x_k belonging to either category i or j. The underdetermined system is solved by

min ||w_{i,j}||_1 subject to X_{i,j} w_{i,j} = y_{i,j} (14)

Since D_{j,i} = -D_{i,j}, only one classifier per pair is needed, so the number of classifiers is N(N-1)/2, compared with N in the OVR approach.

Denote by ŵ_{i,j} the solution to (14). For any sample x, we calculate

D_i(x) = Σ_{j=1, j≠i}^{N} sign(D_{i,j}(x)) (15)

with D_{i,j}(x) = [x^T, 1] ŵ_{i,j}. The label of x is determined by

arg max_{i=1,2,...,N} D_i(x) (16)
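The OVO counterpart, again reusing the binary solver from the earlier sketch; the voting in (15)-(16) is equivalent to counting pairwise wins, which is what this sketch (our names) does:

```python
import numpy as np

def l1_least_square_ovo(X, labels, N):
    """Fit one classifier per pair (i, j), i < j, by solving (14)."""
    W = {}
    for i in range(1, N + 1):
        for j in range(i + 1, N + 1):
            mask = (labels == i) | (labels == j)
            y_ij = np.where(labels[mask] == i, 1.0, -1.0)
            W[(i, j)] = l1_least_square_binary(X[mask], y_ij)
    return W

def predict_ovo(W, X, N):
    Xt = np.hstack([X, np.ones((X.shape[0], 1))])
    votes = np.zeros((X.shape[0], N))
    for (i, j), w in W.items():
        s = np.sign(Xt @ w)              # sign(D_{i,j}(x))
        votes[:, i - 1] += (s > 0)       # vote for class i ...
        votes[:, j - 1] += (s < 0)       # ... or class j (D_{j,i} = -D_{i,j})
    return 1 + np.argmax(votes, axis=1)  # arg max_i D_i(x)
```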

Numerical Experiment

A numerical experiment is carefully designed to validate the cancer diagnosis performance of l1 least square on gene expression data. The performance metric is classification accuracy obtained by 10-fold stratified cross-validation. MATLAB R14 is used to implement the new method. The results are compared with the binary SVM (Vapnik, 1998) and several popular multicategory SVM variants, including OVR-SVM (Kressel, 1999), OVO-SVM (Kressel, 1999), DAGSVM (Platt et al., 2000), the method of Weston and Watkins (WW) (Weston and Watkins, 1999), and the method of Crammer and Singer (CS) (Crammer and Singer, 2000).

The results for SVMs are obtained with GEMS (Gene Expression Model Selector), software with a graphical user interface for the classification of gene expression data, freely available at http://www.gems-system.org/. GEMS was used by Statnikov et al. (2005) in a comprehensive study of the performance of multiple classifiers for cancer diagnosis from gene expression data. For model selection, polynomial kernels of order p ∈ {1, 2, 3} are used, and the penalty parameter is chosen from C ∈ {10^(-3+0.5n), n = 0, 1, ..., 6}.
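For readers reproducing a comparable SVM baseline outside GEMS, a hedged sketch of the same search grid with scikit-learn (our substitution for the GEMS GUI; X and labels are placeholders):

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Grid mirroring the text: polynomial orders p in {1, 2, 3}
# and C in {10^(-3 + 0.5n), n = 0, ..., 6}.
param_grid = {
    "kernel": ["poly"],
    "degree": [1, 2, 3],
    "C": [10 ** (-3 + 0.5 * n) for n in range(7)],
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
# search.fit(X, labels)  # X: (n, d) expression matrix, labels: class ids
```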

Six datasets are used in the experiment, drawn from the eleven datasets used by Statnikov et al. (2005). They are available on the GEMS website in both the GEMS format and as MATLAB .mat files. For ease of comparison and reference, we adopt the dataset names used by Statnikov et al. (2005). Information about the datasets is summarized below.

• DLBCL (Shipp et al., 2002): This binary dataset comes from a study of gene expression in two lymphomas: diffuse large B-cell lymphoma and follicular lymphoma. Each sample contains 5469 genes. The number of samples is 77.

• Brain_Tumor2 (Nutt et al., 2003): This dataset contains four types of malignant glioma: classic glioblastomas, classic anaplastic oligodendrogliomas, non-classic glioblastomas, and non-classic anaplastic oligodendrogliomas. The dataset has 50 samples, and the number of genes is 10367.

All the datasets are normalized by rescaling the gene expression values to be between 0 and 1.
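For reference, the rescaling is a one-line min-max normalization; whether the original work scaled per gene or per dataset is not stated, so per-gene scaling is our assumption:

```python
import numpy as np

def minmax_per_gene(X):
    """Rescale each gene (column) of X to [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant genes
```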

Two methods are used in this experiment to study the impact of gene selection on classification performance: the Kruskal-Wallis nonparametric one-way ANOVA (KW) (Gibbons, 2003) and the ratio of between-class to within-class sums of squares (BW) (Dudoit et al., 2002).
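Both criteria score each gene independently, and the top-ranked genes are kept. A sketch of the standard definitions (our implementation; SciPy's kruskal is used for KW, and constant genes should be filtered beforehand):

```python
import numpy as np
from scipy.stats import kruskal

def kw_scores(X, labels):
    """Kruskal-Wallis H statistic per gene (larger = more discriminative)."""
    groups = [X[labels == c] for c in np.unique(labels)]
    return np.array([kruskal(*[g[:, j] for g in groups]).statistic
                     for j in range(X.shape[1])])

def bw_scores(X, labels):
    """Ratio of between-class to within-class sums of squares per gene."""
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)  # avoid division by zero
```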

Results

Classification without Gene Selection

Table 1 shows the classification accuracy obtained by 10-fold stratified cross-validation for both l1 least square and SVMs. The SVM results differ slightly from those reported by Statnikov et al. (2005), where the same datasets are also used. A possible explanation is that the data partition used for cross-validation in our study differs from that of Statnikov et al. (2005).

Table 1: Performance without gene selection.

For the binary datasets Prostate_Tumor and DLBCL, the performance of l1 least square is slightly below that of SVMs. Note that the SVM results are obtained with careful model selection via cross-validation, whereas our method needs no model selection and is fully automatic. In addition, just as with SVMs, when applied to binary datasets the multicategory l1 least square classifiers reduce to the binary classifier under both the OVO and OVR approaches.

When applied to multicategory datasets, OVR-l1 least square closely matches the best performance achieved by SVMs. For both SVMs and l1 least square, the OVO approach performs much worse than the OVR approach on the 9_Tumors dataset.

Classification with Gene Selection

Table 2 shows the best performance achieved by OVR-l1 least square and SVMs when the gene selection methods KW and BW are used. Both l1 least square and SVMs perform slightly better than without gene selection (Table 1). The improvement ranges from 0 to 9% for SVMs, but only from 0 to 3.48% for OVR-l1 least square. Again, the performance of OVR-l1 least square is comparable to that of SVMs.

Table 2: Performance with gene selection.

Discussion

The success of l1 least square may lie in the sparsity of the linear model coefficient vector obtained from l1-norm minimization. Figure 1 shows the model coefficient vector w, i.e., the l1 least square solution for the binary dataset DLBCL. The sparsity suggests that genes with larger absolute coefficients play more important roles in classification. As a result, classification performance does not depend on all of the genes, especially those with very small absolute coefficients. This sparsity has the potential to greatly alleviate the curse of dimensionality and to increase robustness to outliers.

Another implication of sparsity is that genes with larger absolute coefficients may correspond to biological markers; hence, sparsity could also be used for gene selection. We performed a small experiment to examine this possibility. The binary dataset DLBCL is used to fit the l1 least square model. Gene selection is done by choosing the M genes with the largest absolute coefficients. A binary SVM is then used to classify the gene-selected data, and the results are compared with the KW and BW gene selection methods. Figure 2 shows the performance of the three gene selection methods for M = 10, 20, 30, 40, and 50. The new method significantly outperforms both KW and BW when a small number of genes is selected.

Figure 1: The sparse coefficient vector.

Figure 2: The performance of the three gene selection methods.
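The selection rule itself is a one-liner on top of the binary solver from the Methods sketches (M and the helper name are ours):

```python
import numpy as np

def select_genes_by_coefficient(X, y, M):
    """Pick the M genes with the largest |w| from the l1 least square fit."""
    w_hat = l1_least_square_binary(X, y)  # from the earlier sketch
    coeffs = np.abs(w_hat[:-1])           # drop the bias term
    return np.argsort(coeffs)[::-1][:M]   # indices of the top-M genes
```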

The above gene selection approach is similar in spirit to the lasso (Tibshirani, 1996), formulated as follows

min ||Xw - y||_2^2 subject to ||w||_1 ≤ t (17)

where X, w, and y follow the definitions given for binary l1 least square, and t is the model parameter of the lasso. The lasso can also be used for classification by replacing (7) with (17) in the binary case, (10) with

min ||X w_k - y_k||_2^2 subject to ||w_k||_1 ≤ t_k (18)

for the multicategory OVR approach, and (14) with

min ||X_{i,j} w_{i,j} - y_{i,j}||_2^2 subject to ||w_{i,j}||_1 ≤ t_{i,j} (19)

for the multicategory OVO approach.
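For comparison, the penalized form of the lasso (equivalent, for a suitable choice of parameter, to the constrained form in (17)) is available off the shelf; a sketch with scikit-learn (our choice of library), where alpha is the tuning parameter that still requires selection:

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_binary_classifier(X, y, alpha=0.01):
    """Lasso analogue of (7)/(17): l1-penalized least squares.

    alpha is a model parameter requiring selection, unlike the
    parameter-free l1 least square formulation in (7).
    """
    model = Lasso(alpha=alpha, fit_intercept=True, max_iter=10000)
    model.fit(X, np.asarray(y, dtype=float))  # y in {-1, +1}
    return model

def lasso_predict(model, X):
    return np.sign(model.predict(X))
```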

Similarly, l1 least square regression can be replaced with the Dantzig selector (Candès and Tao, 2007), given below for binary classification

min ||w||_1 subject to ||X^T(y - Xw)||_∞ ≤ (1 + t^{-1}) √(2 log d) σ (20)

where t is a model parameter and σ is the noise standard deviation. The Dantzig selector for multicategory classification can be defined similarly.
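Like (7), the Dantzig selector is a linear program: split the l1 norm with auxiliary variables u, and turn the l∞ constraint into a two-sided bound on X^T(y - Xw). A sketch under our own naming, with lam standing for the right-hand side of (20):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """min ||w||_1 s.t. ||X^T (y - Xw)||_inf <= lam, as a linear program."""
    n, p = X.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])  # minimize sum(u)
    I = np.eye(p)
    G = X.T @ X
    A_ub = np.block([
        [I, -I],                  #  w - u <= 0
        [-I, -I],                 # -w - u <= 0
        [G, np.zeros((p, p))],    #  X^T X w <= lam + X^T y
        [-G, np.zeros((p, p))],   # -X^T X w <= lam - X^T y
    ])
    Xty = X.T @ y
    b_ub = np.concatenate([np.zeros(2 * p), lam + Xty, lam - Xty])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    return res.x[:p]
```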

Both the lasso and the Dantzig selector for classification, however, still require optimized model parameters found via a model selection procedure such as cross-validation.

Conclusion

In this paper we have described a specialized regression method for cancer diagnosis using gene expression data. The new approach, called l1 least square, casts linear regression as a constrained l1-norm minimization problem to overcome the major drawback of least squares for classification: a lack of robustness to outliers. Besides the binary classifier, multicategory l1 least square classifiers based on the OVO and OVR approaches are also proposed.

Numerical experiments show that OVR-l1 least square can match the best performance achieved by SVMs with careful model selection. The main advantage of l1 least square over other methods, including SVMs, is that it requires no model selection; as a result, the method is fully automatic. l1 least square also has the potential to be used for gene selection.

The l1 least square classifier may become a promising automatic cancer diagnosis tool by consistently distinguishing gene expression profile classes. Genes with large absolute regression coefficients may serve as candidate biological markers for further investigation.