) can generate respective cluster indicators representing different clustering results: the left one can satisfy both constraints of X^l and X^u, but the right one can only satisfy X^u. For semi-supervised feature selection, we want to select f over f'.

SSE = Σ_{j=1}^{K} Σ_{x_k ∈ C_j} |x_k − μ_j|²    (2.1)
where μ_j denotes the mean (centroid) of those instances in cluster C_j.
K-means is an iterative algorithm that locally minimizes the SSE criterion.
It assumes each cluster has a hyper-spherical structure. “K-means” denotes
the process of assigning each data point, x_k, to the cluster with the nearest
mean. The k-means algorithm starts with initial K centroids, then it assigns
each remaining point to the nearest centroid, updates the cluster centroids,
and repeats the process until the K centroids do not change (convergence).
There are two versions of k-means: One version originates from Forgy [17] and
the other version from Macqueen [36]. The diﬀerence between the two is when
to update the cluster centroids. In Forgy’s k-means [17], cluster centroids are
re-computed after all the data points have been assigned to their nearest
centroids. In Macqueen’s k-means [36], the cluster centroids are re-computed
after each data assignment. Since k-means is a greedy algorithm, it is only
guaranteed to ﬁnd a local minimum, the solution of which is dependent on
the initial assignments. To avoid a local optimum, one typically applies random
restarts and picks the clustering solution with the best SSE. One can refer
to [47, 4] for other ways to deal with the initialization problem.
Standard k-means uses the Euclidean distance to measure dissimilarity between the data points. Note that one can easily create various variants of k-means by modifying this distance metric (e.g., using other L_p norm distances) to one more appropriate for the data. For example, on text data, a more suitable measure is the cosine similarity. One can also replace the SSE objective function with other criteria to create other clustering algorithms.
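As a concrete illustration of the procedure just described, the following minimal NumPy sketch implements Forgy-style batch k-means with random restarts; the function name, the initialization by sampling K data points, and the toy data are illustrative choices rather than anything prescribed in the text.

import numpy as np

def kmeans(X, K, n_restarts=10, max_iter=100, seed=0):
    # Forgy-style k-means: centroids are recomputed only after all points have
    # been assigned; the best of several random restarts (lowest SSE) is kept.
    rng = np.random.default_rng(seed)
    best_sse, best = np.inf, None
    for _ in range(n_restarts):
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # Assign every point to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(K)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        sse = ((X - centroids[labels]) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (labels, centroids)
    return best[0], best[1], best_sse

# Toy usage: two well-separated clusters.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids, sse = kmeans(X, K=2)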
2.2.2 Finite Mixture Clustering
A ﬁnite mixture model assumes that data are generated from a mixture
of K component density functions, in which p(x_k|θ_j) represents the density function of component j for all j's, where θ_j is the parameter (to be estimated) for cluster j. The probability density of data x_k is expressed by
p(x_k) = Σ_{j=1}^{K} α_j p(x_k|θ_j)    (2.2)
where the α_j's are the mixing proportions of the components (subject to α_j ≥ 0 and Σ_{j=1}^{K} α_j = 1). The log-likelihood of the M observed data points is then given by
L = Σ_{k=1}^{M} log p(x_k)

)^k for k executions of the algorithm.
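To make Equation (2.2) and the associated log-likelihood concrete, here is a minimal sketch that evaluates a finite mixture density and the log-likelihood of a dataset. Assuming Gaussian component densities is an illustrative choice, not something the text requires.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, alphas, means, covs):
    # p(x) = sum_j alpha_j * p(x | theta_j), Eq. (2.2), with Gaussian components.
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, means, covs))

def log_likelihood(X, alphas, means, covs):
    # L = sum_k log p(x_k) over the M observed data points.
    return sum(np.log(mixture_density(x, alphas, means, covs)) for x in X)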
The third and largest complexity class of practical importance is BPP, for polynomial time algorithms with bounded probability of error. Unlike RP and ZPP, BPP allows a randomized algorithm to commit both false positive and false negative errors. This class encompasses algorithms that accept good inputs a majority of the time and reject bad inputs a majority of the time. More formally, a language L ∈ BPP if some randomized algorithm R accepts s ∈ L with probability 1/2 + 1/
FIGURE 3.8: The randomized variable elimination algorithm [20].
r is then computed via dynamic programming by minimizing the aggregate
cost of removing all N − r irrelevant variables. Note that n represents the
number of remaining variables, while N denotes the total number of variables
in the original problem.
On each iteration, RVE selects k_opt(n, r) variables at random for removal. The learning algorithm is then trained on the remaining n − k_opt(n, r) inputs, and a hypothesis h'
Σ_{j=1}^{F} ∫_{G_j} (E[g_j|D_k] − g_j)² p(g_j|D_k) dg_j    (5.1)
Currently, for an instance i we have the class label c_i available and perhaps also some subset of the feature values obs(i) (observed values for instance i). Let mis(i) be the subset of feature values currently missing for instance i, and x_f be a particular feature whose value is missing for instance i. If we assume that for the (k + 1)-th sampling step this missing feature value is measured and a value of x was obtained, then the new dataset, denoted (D_k, x_if = x), has the value x for the feature f for instance i. The new mean squared error would be
Σ_{j=1}^{F} (E[g_j|D_k, x_if = x] − E[g_j|D_k])² p(x_if = x|D_k)
where A is independent of i and f. That is, in order to minimize the predicted
mean squared error if the missing value at (i, f) is measured, it is suﬃcient
to maximize the sum of the squared diﬀerences between the Bayes estimates
of g before and after the value at (i, f) is measured, averaged over the
possible outcomes.
Therefore, to minimize the predicted mean squared error, the objective
function to be maximized is
B(i, f) = Σ_{x ∈ X_f} Σ_{j=1}^{F} ( ĝ_j(D_k, x_if = x) − ĝ_j(D_k) )² p(x_if = x|D_k)    (5.4)
where ĝ_j(D_k) is any reasonable estimate of g_j from dataset D_k.
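The following sketch shows how the benefit score of Equation (5.4) could be computed once the building blocks are available; estimate_relevances, augment, and cond_prob are hypothetical helper functions standing in for whichever relevance estimator and probability model is actually used.

def benefit(i, f, feature_values, D_k, g_hat_current,
            estimate_relevances, augment, cond_prob):
    # B(i, f), Eq. (5.4): expected squared change of the relevance estimates
    # if the missing value at (i, f) were acquired.
    #   estimate_relevances(D) -> vector of relevance estimates g_hat
    #   augment(D, i, f, x)    -> copy of D with value x filled in at (i, f)
    #   cond_prob(i, f, x, D)  -> p(x_if = x | D)
    total = 0.0
    for x in feature_values:                      # possible values of feature f
        g_hat_new = estimate_relevances(augment(D_k, i, f, x))
        change = sum((gn - gc) ** 2 for gn, gc in zip(g_hat_new, g_hat_current))
        total += change * cond_prob(i, f, x, D_k)
    return total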
5.4 Implementation of the Active Sampling Algorithm
Algorithm 5.2.1 for active feature value acquisition is general and can be
used with any measure for feature relevance for which the squared-error loss
is reasonable. That is, the choice for the function EstimateRelevances(D) in
the pseudocode can be any measure of feature relevance that can be estimated from a dataset with missing values.
In addition, the implementation of the benefit criterion introduced above also requires the computation of the conditional probabilities p(x_if = x|D_k).
Although our active sampling algorithm is quite general, we implemented
it for a particular choice of the model for data generation (i.e., the joint class-
and-feature distribution), which we present below. We then explain how the
conditional probabilities and feature relevances can be computed given the
joint distribution.
Our model is applicable for problems with categorical valued features. That
is, we assume that every feature x_f takes on a discrete set of values A_f = {1, . . . , V_f}.
5.4.1 Data Generation Model: Class-Conditional Mixture of
Product Distributions
We assume that each class-conditional feature distribution is a mixture of
M product distributions over the features. (Although for our implementation
it is not necessary that the number of components is constant across classes,
we make this assumption for simplicity.) That is, the class-conditional feature
distribution for class c ∈ C is
P(x_1 = x_1, . . . , x_F = x_F |c) = Σ_{m=1}^{M} α_cm Π_{f=1}^{F} Π_{x=1}^{V_f} θ_cmfx^{δ(x, x_f)}    (5.6)
where p(c) is the class probability. The class-and-feature joint distribution is completely specified by the parameters α, θ, and the class probabilities.
Before we describe how the α and θ parameters can be estimated from a
dataset with missing values, we will explain how feature relevances and the
conditional probability p(x_if = x|D_k) are calculated if the parameters are known.
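A minimal sketch of evaluating the class-conditional feature distribution of Equation (5.6) follows; the array layout chosen for the α and θ parameters is an assumption made purely for illustration.

import numpy as np

def class_conditional_prob(x, c, alpha, theta):
    # P(x_1, ..., x_F | c) for a class-conditional mixture of M product distributions.
    #   alpha[c, m]       : mixing proportion of component m for class c
    #   theta[c, m, f, v] : probability that feature f takes value v in component m of class c
    #   x                 : length-F vector of integer-coded feature values
    total = 0.0
    for m in range(alpha.shape[1]):
        prod = 1.0
        for f, v in enumerate(x):
            prod *= theta[c, m, f, v]
        total += alpha[c, m] * prod
    return total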
5.4.2 Calculation of Feature Relevances
We use the mutual information between a feature and the class variable as
our measure of the relevance of that feature. That is,
g_f = I(x_f; c) = H(x_f) − H(x_f|c)    (5.7)
Although we are aware of the shortcomings of mutual information as a fea-
ture relevance measure, especially for problems where there are inter-feature
correlations, we chose it because it is easy to interpret and to compute given
the joint class-and-feature distribution. We did not use approaches such as
Relief [10] and SIMBA [7], which provide feature weights (that can be in-
terpreted as relevances), because they do not easily generalize to data with
missing values.
The entropies in Equation 5.7 can be computed as follows:
H(x_f) = −

Σ_{m=1}^{C} α_cm θ_cmfx    (5.11)
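For illustration, the relevance of Equation (5.7) can be computed from any joint table p(c, x_f = x); the sketch below assumes such a table is given as a NumPy array and is not tied to the mixture model above.

import numpy as np

def entropy(p):
    # Shannon entropy of a probability vector, with 0 log 0 treated as 0.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mutual_information(joint):
    # I(x_f; c) = H(x_f) - H(x_f | c), with joint[c, x] = p(c, x_f = x).
    p_x = joint.sum(axis=0)                       # p(x_f = x)
    p_c = joint.sum(axis=1)                       # p(c)
    h_cond = sum(p_c[c] * entropy(joint[c] / p_c[c])
                 for c in range(joint.shape[0]) if p_c[c] > 0)
    return entropy(p_x) - h_cond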
5.4.3 Calculation of Conditional Probabilities
Since the instances in the dataset D are assumed to be drawn independently,
we have
p(x_if = x|D_k) = p(x_if = x | x_obs(i) = x_obs(i), c_i)
= p(x_if = x, x_obs(i) = x_obs(i) | c_i) / p(x_obs(i) = x_obs(i) | c_i)    (5.12)
where, as before, x_obs(i) are features that are observed for instance i that take on values x_obs(i), and c_i is the class label for instance i.
Therefore, the conditional probability in Equation 5.12 can be written in
terms of the parameters of the joint distribution as
p(x_if = x|D_k) =

Σ_i N_i^T N_i, where N_i is the unit normal vector to a piece
of the decision border. Eigenvalues and eigenvectors of such a Σ_EDBFM matrix turn out to be λ_1 = λ_2 = 0.5, u_1 = [1 0], u_2 = [0 1], suggesting that the two dimensions have the same discriminative power, while it is clear that projecting on the first dimension results in a smaller accuracy loss than projecting on the second dimension. Exploiting only the normal vectors, we do not fully consider the geometry of the decision border, which greatly depends on the statistics of the classification problem. Indeed, if in this example we consider a square instead of a rectangle, we obtain the same Σ_EDBFM matrix.
By defining the Σ_EDBFM matrix as a weighted sum of normal vectors, where each normal vector is weighted by the length of the related segment of decision border over the total length of the decision border, we get λ_1 = 0.8, λ_2 = 0.2, u_1 = [1 0], u_2 = [0 1]; hence the first dimension is correctly found to be four times more important than the second one.
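The rectangle example can be reproduced numerically. The sketch below assumes the weighted matrix is built as a weighted sum of outer products of the unit normals, Σ w_i N_i N_i^T, which is one common way to realize the weighting described above.

import numpy as np

# Unit normals to the two axis-aligned pieces of the decision border,
# weighted by their share of the total border length (0.8 and 0.2).
normals = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.array([0.8, 0.2])

# Weighted decision-border feature matrix: sum_i w_i * N_i N_i^T.
sigma = sum(w * np.outer(n, n) for w, n in zip(weights, normals))

eigvals, eigvecs = np.linalg.eigh(sigma)
print(eigvals)   # [0.2, 0.8]: the first dimension is four times as important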
In order to take into account the statistics of the problem, normal vectors
should be appropriately weighted. We now give the following general form of the BVQ-based feature extraction (BVQFE) algorithm:
BVQ-based Feature Extraction
1. Train the LVQ {(m_1, l_1), . . . , (m_Q, l_Q)}, m_i ∈ R^N, l_i ∈ Y, on a training set TS by using the BVQ algorithm;
2. set the elements of the matrix Σ_BVQFM to 0;
3. set w_tot to 0;
4. for each pair y_i, y_j ∈ Y, where i ≠ j, do
   1. set the elements of the matrix Σ_BVQFM^ij to 0;
   2. for each pair m_k, m_z ∈ M defining a piece of decision border, where l_k = y_i and l_z = y_j, do
      1. calculate the unit normal vector to the decision border as: N_kz = (m_k − m_z) / ‖m_k − m_z‖

Σ_{i=1}^{M} k(x − x_i),
where k(.) is the kernel function. Different forms of the kernel can be chosen. In the following, we consider the uniform hypercubic window, that is, k(x − x_i) = Δ^{−N} over an N-dimensional hypercube of side Δ centered on the training sample x_i (i = 1, 2, . . . , M) and k(x − x_i) = 0 elsewhere. With this choice, after some manipulations, we get
ŵ_λ(Δ) = Σ_{i=1}^{M} δ( d(x_i, S_λ) ≤ Δ/2 ),    (6.6)
where d(x_i, S_λ) is the Euclidean distance between x_i and the piece of decision border S_λ; that is, we can approximate the true weights by counting how many training samples fall “on” (i.e., at a distance less than Δ/2 from) each piece of decision border S_λ. In [10, 4] it is proposed to weight the normal vectors by the volumes of the decision border. It is simple to see that this method is a special case of the previous one. In fact, when p(x) = p is constant along each piece of decision border, Equation (6.5) becomes
Σ̂_BVQFM = (1/p) Σ_Λ

Σ_{s∈t} (y_s − ȳ)², where the sum and mean are taken over all observations s in node t, and N(t) is the number of observations in node t.
For classiﬁcation I(t) = Gini(t), where Gini(t) is the Gini index of node t:
Gini(t) = Σ_{i≠j} p^t_i p^t_j    (7.2)
and p^t_i is the proportion of observations in t whose response label equals i (y = i), and i and j run through all response class numbers. The Gini index is
in the same family of functions as cross-entropy, −

Σ_{j=1}^{n_i} p(F_i = j) imp(y|F_i = j)
where imp(y) is the impurity of class values before the split, imp(y|F_i = j) is the impurity after the split on F_i = j, n_i is the number of values of feature F_i, and p(F_i = j) is the (prior) probability of the feature value j. By subtracting
the expected impurity of the splits from the impurity of unpartitioned in-
stances we measure gain in the purity of class values resulting from the split.
Larger values of W(F_i) imply purer splits and therefore better features. We
cannot directly apply these measures to numerical features, but we can use
any discretization technique and then evaluate the discretized features.
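As a small illustration of the impurity-gain idea, the sketch below scores a single discretized feature with the Gini impurity; the names and the toy data are illustrative.

from collections import Counter

def gini(labels):
    # Gini impurity: 1 - sum_c p(c)^2.
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def impurity_gain(feature_values, labels, impurity=gini):
    # W(F_i): impurity of y minus the expected impurity after splitting on F_i.
    n = len(labels)
    expected = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        expected += (len(subset) / n) * impurity(subset)
    return impurity(labels) - expected

# Toy usage with an already discretized feature.
print(impurity_gain(['a', 'a', 'b', 'b', 'b', 'a'], [1, 1, 0, 0, 1, 1]))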
9.2.1 Impurity Measures in Classiﬁcation
There are several impurity-based measures for classification problems. Two well-known impurity measures are entropy and the Gini index. With entropy we get the information gain measure, also referred to as mutual information
due to its symmetry:
Gain(F_i) = H_Y − H_{Y|F_i} = H_Y + H_{F_i} − H_{Y F_i} = I(F_i; Y) = I(Y; F_i)    (9.1)
where H_Y is the class entropy, and H_{Y|F_i} is the conditional class entropy given the value of feature F_i. Gini-index gain [1] is obtained by the difference
between the prior and the expected posterior Gini-indices:
Gini(F_i) = Σ_{j=1}^{n_i} p(F_i = j)² Ginim(F_i)    (9.7)
where Ginim(F_i) is strongly related to Gini(F_i) from Equation (9.2):
Ginim(F_i) = Σ_{j=1}^{n_i} [ ( p(F_i = j)² / Σ_j p(F_i = j)² ) Σ_{c=1}^{C} p(y = c|F_i = j)² ] − Σ_{c=1}^{C} p(y = c)²    (9.8)
The only difference between Ginim(F_i) and Gini(F_i) is that instead of the factor p(F_i = j)² / Σ_j p(F_i = j)² in Equation (9.2) we have p(F_i = j) / Σ_j p(F_i = j) = p(F_i = j). However, the crucial difference between the myopic Relief, defined by Equation (9.7), and Gini(F_i) is in the factor in front of Ginim in Equation (9.7):

Σ_{L∈LS} Diff(L, x_k, x_l)    (9.11)
It is simply a normalized sum of diﬀerences over the literal space LS. It
estimates the logical similarity of two instances relative to the background
knowledge.
Both the total distance Diff_T and the estimates W depend on the definition of Diff (Diff_A is an asymmetric version of Diff). Table 9.1 shows the definitions of Diff and Diff_A.
The first two columns represent the coverage of literal L over the instances x_k and x_l, respectively. The coverage denotes the truth value of some partially built clause Cl
Σ_{c=1}^{C} P(y = c|x) C_{c,u}
where P(y = c|x) is the probability of the class c given instance x. The task
of a learner is therefore to estimate these conditional probabilities. Feature
evaluation measures need not be cost-sensitive for decision tree building, as
shown by [1, 6]. However, cost-sensitivity is a desired property of an algorithm
that tries to rank or weight features according to their importance. We present
the best solutions for a cost-sensitive ReliefF from [25].
There are diﬀerent techniques for incorporating cost information into learn-
ing. The key idea is to use the expected cost of misclassifying an instance with
class c and then change the probability estimates:
ε_c = (1 / (1 − p(y = c))) Σ_{u=1, u≠c}^{C} p(y = u) C_{c,u}
p′(y = c) = p(y = c) ε_c / Σ_{u=1}^{C} p(y = u) ε_u    (9.12)
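A direct sketch of the reweighting in Equation (9.12) follows; the cost matrix layout (cost[c, u] = cost of misclassifying class c as u) and the toy numbers are assumptions for illustration.

import numpy as np

def cost_sensitive_priors(p, cost):
    # Eq. (9.12): eps_c = (1/(1 - p_c)) * sum_{u != c} p_u * cost[c, u],
    # then p'(c) = p_c * eps_c / sum_u p_u * eps_u.
    p = np.asarray(p, dtype=float)
    eps = np.array([np.sum(np.delete(p, c) * np.delete(cost[c], c)) / (1.0 - p[c])
                    for c in range(len(p))])
    p_new = p * eps
    return p_new / p_new.sum()

priors = [0.7, 0.2, 0.1]
cost = np.array([[0, 1, 5], [1, 0, 1], [10, 1, 0]])
print(cost_sensitive_priors(priors, cost))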
Using probabilities (9.12) in the impurity based functions Gain (9.1) and Gini
(9.2), we get their cost-sensitive variations. Similarly, we can use (9.12) in
ReliefF; we only have to replace the 9th line in Algorithm 9.3.1 with
else W[i] = W[i] + p′ Σ_{u=1}^{C} α_u    (9.13)
The use of p̄(y = c) instead of p(y = c) in the 9th line in Algorithm 9.3.1 also enables ReliefF to successfully use cost information. For two-class problems, ReliefF, ReliefF with p′

Σ_{k=1} r_F(y_k, b(y_k))    (9.16)
Note that as M →∞, the problem space is densely covered with instances;
therefore, the nearest hit comes from the same characteristic region as the
randomly selected instance and its contribution in Algorithm 9.2.1 is 0.
We interpret Relief’s weights W(F) as the contribution (responsibility) of
each feature to the explanation of the predictions. The actual quality evalua-
tions for the features in the given problem are approximations of these ideal
weights, which occur only with an abundance of data.
For ReliefF this property is somehow diﬀerent. Recall that in this algorithm
we search nearest misses from each of the classes and weight their contribu-
tions with prior probabilities. This weighting is also reﬂected in the feature
evaluation when M → ∞. Let p(y = c) represent the prior probability of
the class value c, and under the same conditions as for Relief, r_F(y_k, b_u(y_k)) be the responsibility of feature F for the change of y_k to the class u. Then ReliefF behaves as
lim_{M→∞} W(F) = (1/m)

Σ_{j=1}^{C} w_lj = 1,  0 ≤ w_lj ≤ 1    (10.11)
(10.11)
where W is a k N weight matrix and the other notations are the same as
in (10.1).
In a similar fashion, (10.10) can be reduced to three subproblems that are
solved iteratively.
The subproblem P
1
is solved by
u_{i,l} = 1 if

Σ_{i=1}^{N} û_{i,l} d(x_{i,j}, z_{l,j})    (10.14)
and N is the number of features with D_lj > 0.
In subspace clustering, if D_lj = 0, we cannot simply assign a weight 0 to feature j in cluster l. D_lj = 0 means all values of feature j are the same in cluster l. In fact, D_lj = 0 indicates that feature j may be an important feature in identifying cluster l. D_lj = 0 often occurs in real-world data such as text data and supplier transaction data. To solve this problem, we can simply add a small constant σ to the distance function to make ŵ_lj always computable, i.e.,
D_lj =

Σ_{n=1}^{M} 1(x_n ∈ N(x_0))    (11.2)
where 1() is an indicator function such that it returns 1 when its argument
is true, and 0 otherwise.
A particular nearest neighbor method is defined by how the neighborhood N(x_0) is specified. K nearest neighbor methods (K-NN) define the region at x_0 to be the one that contains exactly the K closest training points to x_0
measurement variables:
D_p(x_0, x) = { Σ_{i=1}^{N} |x_{0i} − x_i|^p }^{1/p}

Σ_{j=1}^{C} Pr(j|x_0) (Pr(j|x) − Pr(j|x_0))².

The idea behind this metric is that if the value of x for which D(x_0, x) is small is selected, then the expectation
(11.10) will be minimized.
This metric is linked to the theory of the two-class case developed in [13].
However, a major concern with the above metric is that it has a cancellation
eﬀect when all classes are equally likely [11]. This limitation can be avoided by
considering the Chi-squared distance [9] D(x, x_0) = Σ_{j=1}^{C} [Pr(j|x) − Pr(j|x_0)]², which measures the distance between the query x_0 and the point x, in terms of the difference between the class posterior probabilities at the two points. Furthermore, by multiplying it by 1/Pr(j|x_0) we obtain the following weighted Chi-squared distance:
D(x, x_0) = Σ_{j=1}^{C} [Pr(j|x) − Pr(j|x_0)]² / Pr(j|x_0)

r_i(z) = Σ_{j=1}^{C} [Pr(j|z) − Pr(j|x_i = z_i)]² / Pr(j|x_i = z_i)    (11.12)
r_i(z) represents the ability of feature i to predict the Pr(j|z)'s at x_i = z_i. The closer Pr(j|x_i = z_i) is to Pr(j|z), the more information feature i carries for
predicting the class posterior probabilities locally at z.
We can now define a measure of feature relevance for x_0 as
r̄_i(x_0) = (1/K) Σ_{z ∈ N(x_0)} r_i(z)    (11.13)
where N(x_0) denotes the neighborhood of x_0 containing the K nearest training points, according to a given metric. r̄_i measures how well on average the class posterior probabilities can be approximated along input feature i within a local neighborhood of x_0. Small r̄_i implies that the class posterior probabilities will be well approximated along dimension i in the vicinity of x_0. Note that r̄_i(x_0) is a function of both the test point x_0 and the dimension i, thereby making r̄_i(x_0) a local relevance measure in dimension i.
The relative relevance, as a weighting scheme, can then be given by w_i(x_0) = R_i(x_0)^t / Σ_{l=1}^{N} R_l(x_0)^t, where t = 1, 2, giving rise to linear and quadratic weightings respectively, and R_i(x_0) = max_j {r̄_j(x_0)} − r̄_i(x_0). In [7], the following expo-
nential weighting scheme was proposed:
w_i(x_0) = exp(cR_i(x_0)) /

Σ_{i=1}^{N} w_i (x_i − y_i)²    (11.15)
The weights w_i enable the neighborhood to elongate less important feature dimensions and, at the same time, to constrict the most influential ones. Note
dimensions and, at the same time, to constrict the most inﬂuential ones. Note
that the technique is query-based because the weights depend on the query
[1].
Since both Pr(j|z) and Pr(j|x_i = z_i) in (11.12) are unknown, we must estimate them using the training data {x_n, y_n}_{n=1}^{M} in order for the relevance
measure (11.13) to be useful in practice. Here y_n ∈ {1, . . . , C}. The quantity Pr(j|z) is estimated by considering a neighborhood N_1(z) centered at z:
P̂r(j|z) = Σ_{n=1}^{M} 1(x_n ∈ N_1(z)) 1(y_n = j) / Σ_{n=1}^{M} 1(x_n ∈ N_1(z))    (11.16)
where 1() is an indicator function such that it returns 1 when its argument
is true, and 0 otherwise.
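Equation (11.16) amounts to a class-frequency count inside the neighborhood N_1(z); the sketch below picks N_1(z) as the K1 nearest training points, which is an illustrative choice.

import numpy as np

def estimate_posterior(z, X, y, n_classes, K1):
    # Eq. (11.16): fraction of the K1 nearest training points to z belonging to each class.
    dists = np.linalg.norm(X - z, axis=1)
    neighbors = np.argsort(dists)[:K1]            # indices defining N_1(z)
    counts = np.bincount(y[neighbors], minlength=n_classes)
    return counts / counts.sum()

# Toy usage: two classes in two dimensions.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9], [0.9, 1.0]])
y = np.array([0, 0, 1, 1, 1])
print(estimate_posterior(np.array([1.0, 1.0]), X, y, n_classes=2, K1=3))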
To compute Pr(j|x_i = z) = E[Pr(j|x)|x_i = z], we introduce an additional variable g_j such that g_j|x = 1 if y = j, and 0 otherwise, where j ∈ {1, . . . , C}. We then have Pr(j|x) = E[g_j|x], from which it is not hard to show that Pr(j|x_i = z) = E[g_j|x_i = z]. However, since there may not be any data at x_i = z, the data from the neighborhood of z along dimension i are used to estimate E[g_j|x_i = z], a strategy suggested in [8]. In detail, by noticing g_j = 1(y = j), the estimate can be computed from
P̂r(j|x_i = z_i) = Σ_{x_n ∈ N_2(z)} 1(|x_ni − z_i| ≤ Δ_i) 1(y_n = j) / Σ_{x_n ∈ N_2(z)} 1(|x_ni − z_i| ≤ Δ_i)    (11.17)
where N_2(z) is a neighborhood centered at z (larger than N_1(z)), and the value of Δ_i is chosen so that the interval contains a fixed number L of points:

Σ_i α_i y_i x_i^T x − b, and the coefficients α_i are the solutions of a convex quadratic problem, defined over the hypercube [0, C]^l. The parameter b is also computed from the data. In general, the solution will have a number of coefficients α_i equal to zero, and since there is a coefficient α_i associated to each data point, only the data points corresponding to non-zero α_i will influence the solution. These points are the
support vectors. Intuitively, the support vectors are the data points that lie
at the border between the two classes, and a small number of support vectors
indicates that the two classes can be well separated.
This technique can be extended to allow for non-linear decision surfaces.
This is done by mapping the input vectors into a higher dimensional fea-
ture space, φ : R^N → R^{N′}, and by formulating the linear classification problem in the feature space. Therefore, f(x) can be expressed as f(x) = Σ_i α_i y_i φ^T(x_i) φ(x) − b.
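The decision function just given is easy to evaluate once the coefficients are known. The sketch below assumes the α_i, the support vectors, and b come from an already-solved SVM quadratic program, and uses an RBF kernel purely as an example.

import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # One possible kernel: K(a, b) = exp(-gamma * ||a - b||^2).
    return np.exp(-gamma * np.sum((a - b) ** 2))

def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # f(x) = sum_i alpha_i * y_i * K(x_i, x) - b, over the support vectors only.
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, labels, support_vectors)) - b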
If one were given a function K(x, y) = φ^T(x)φ(y), one could learn and use the maximum margin hyperplane in feature space without having to compute explicitly the image of points in R^{N′}. It has been proved (Mercer's Theorem) that for each continuous positive definite function K(x, y) there exists a mapping φ such that K(x, y) = φ^T(x)φ(y), ∀x, y ∈ R^N. By making use of such a function K (kernel function), the equation for f(x) can be rewritten as
f(x) =

Σ_{i=1}^{N} exp(A R_i(x_0))    (11.22)
where A is a parameter that can be chosen to maximize (minimize) the in-
fluence of R_j on w_j. When A = 0 we have w_j = 1/N, thereby ignoring any difference between the R_j's. On the other hand, when A is large a change in R_j will be exponentially reflected in w_j. Thus, (11.22) can be used as weights associated with features for weighted distance computation:
D(x, y) = √( Σ_{i=1}^{N} (s_i − d_i)² )    (11.27)
where s is in the training set, and equal weights are assigned to the feature
dimensions. In the following we show that the weighting schemes implemented
by LaMaNNa increase the margin in the space transformed by the weights.
For lack of space we omit the proofs. The interested reader should see [6].
Consider the gradient vector n_d = ∇_d f = (∂f_d/∂x_1, . . . , ∂f_d/∂x_N) computed
with respect to x at point d. Our local measure of relevance for feature j is
then given by
R_j(s) = |e_j^T n_d| = |n_{d,j}|
and w_j(s) is defined as in (11.21) or (11.22), with Σ_{j=1}^{N} w_j(s) = 1.
Let
D_w²(s, d) = Σ_{i=1}^{N} w_i(s) (s_i − d_i)²    (11.28)
be the squared weighted Euclidean distance between s and d. The main result
is summarized in the following theorem.
Theorem 1
Let s ∈ R^N be a sample point and d ∈ R^N the nearest foot of the perpendicular on the separating surface f(x) = 0. Define D²(s, d) = (1/N)

Σ_i |x_i|, which is consistent with the distance function used in the original Relief algorithm.
Other distance functions can also be used. Note that ρ_n > 0 if and only if x_n is correctly classified by 1-NN. One natural idea is to scale each feature such that the averaged margin in a weighted feature space is maximized:
max_w Σ_{n=1, o_n=1}^{N} ( ‖x_n − x_{s_n1}‖_w − ‖x_n − x_{s_n2}‖_w ), which can be easily
optimized by using the conclusion drawn in Section 12.2. Of course, we do not know the set S = {s_n}_{n=1}^{N} and the vector o. However, if we assume the elements of {s_n}_{n=1}^{N} and o are random variables, we can proceed by deriving the probability distributions of the unobserved data. We first make a guess on the weight vector w. By using the pairwise distances that have been computed when searching for the nearest hits and misses, the probability of the i-th data point being the nearest miss of x_n can be defined as
P_m(i|x_n, w) = f(‖x_n − x_i‖_w) / Σ_{x_i ∈ D∖x_n} f(‖x_n − x_i‖_w)    (12.3)
where f() is a kernel function. One commonly used example is f(d) =
exp(−d/σ), where the kernel width σ is a user-deﬁned parameter. Through-
out the chapter, the exponential kernel is used. Other kernel functions can
also be used, and the descriptions of their properties can be found in [1].
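As an illustration of Equation (12.3), the sketch below computes the nearest-miss probabilities for one instance with the exponential kernel; restricting the candidates to points from other classes and using the weighted L1 distance are assumptions consistent with the surrounding text rather than details stated here.

import numpy as np

def miss_probabilities(n, X, y, w, sigma=1.0):
    # P_m(i | x_n, w) for every candidate miss i (points with a different label),
    # using f(d) = exp(-d / sigma) and the weighted L1 distance ||.||_w.
    dist_w = lambda a, b: np.sum(w * np.abs(a - b))
    miss_idx = [i for i in range(len(X)) if y[i] != y[n]]
    scores = np.array([np.exp(-dist_w(X[n], X[i]) / sigma) for i in miss_idx])
    return dict(zip(miss_idx, scores / scores.sum()))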
Now we are ready to derive the following iterative algorithm. Although
we adopt the idea of the EM algorithm that treats unobserved data as ran-
dom variables, it should be noted that the following method is not an EM
algorithm since the objective function is not a likelihood. For brevity of nota-
tion, we define α_{i,n} = P_m(i|x_n, w^{(t)}), β_{i,n} = P_h(i|x_n, w^{(t)}), γ_n = 1 − P_o(o_n = 0|T, w^{(t)}), J = {w : ‖w‖_2 = 1, w ≥ 0}, m_{n,i} = ‖x_n − x_i‖ if i ∈ M_n, and h_{n,i} = ‖x_n − x_i‖ if i ∈ H_n.
Step 1: After the t-th iteration, the Q function is calculated as
Q(w|w^{(t)}) = E_{S,o}[C(w)] = Σ^{N}

Σ_{c∈Y, c≠y(x)} [ P(c) / (1 − P(y(x))) ] ( |x^{(i)} − NM_c^{(i)}(x)| − |x^{(i)} − NH^{(i)}(x)| )    (12.6)
where Y = {1, . . . , C} is the label space, NM_c(x) is the nearest miss of x from class c, and P(c) is the a priori probability of class c. By using the conclusions drawn in Section 12.2, it can be shown that ReliefF is equivalent to defining a sample margin as
ρ =

Σ_{i=1} a_i = a^∗.
THEOREM 12.3
Online I-Relief converges when the learning rate is appropriately selected. If
both algorithms converge, I-Relief and online I-Relief converge to the same
solution.
PROOF The proof of the ﬁrst part of the theorem can be easily done
by recognizing that the above formulation has the same form as the Robbins-Monro stochastic approximation algorithm [13]. The conditions on the learning rate η^{(t)}: lim_{t→+∞} η^{(t)} = 0,

∫ ∏_i Pr(y_i|θ_k) Pr(θ_k|α_k, β_k) dθ_k    (14.7)
which is the marginal probability for the category of the documents.
It is straightforward to show that PIP(F, k) in Equation (14.2) is equivalent
to PIP(F, k) in Equation (14.3) if we assume that the prior probability density
for the models is uniform, e.g., Pr(M_l) ∝ 1.
In the example above, the posterior inclusion probability for feature F_1 is given by
Pr(F_1|y_k) = Pr(M_{(1,1)}|data) + Pr(M_{(1,0)}|data) = l_{0F_1k} / (l_{0F_1k} + l_{F_1k})
To get a single “bag of words” for all categories we compute the weighted
average of PIP(F, k) over all categories:
PIP(F) = Σ_{k=1}

Pr(y = k|F) log Pr(y = k|F)
where {1, . . . , C} is the set of categories and F̄ denotes the absence of word F. It measures the decrease in entropy when the feature is present versus when the feature is absent.
14.2.5 Bi-Normal Separation (BNS)
The Bi-Normal Separation score, introduced in [4], is defined as
BNS(F, k) = | Φ^{−1}(n_{kF}/n_k) − Φ^{−1}(n_{k̄F}/n_{k̄}) |
where Φ is the standard normal distribution and Φ^{−1} is its corresponding inverse. Φ^{−1}(0) is set equal to 0.0005 to avoid numerical problems, following [4]. By averaging over all categories, we get a score that selects a
single set of words for all categories:
BNS(x) =

Σ_{k=1}^{C} Pr(y = k) χ²(F, k)
14.2.7 Odds Ratio
The Odds Ratio measures the odds of word F occurring in documents in category k divided by the odds of word F occurring in documents not in category k. Reference [12] found this to be the best score among eleven scores for a naïve Bayes classifier. For category k and word F, the oddsRatio is given
by
OddsRatio(F, k) = [ (n_kF + 0.1)/(n_kF̄ + 0.1) ] / [ (n_k̄F + 0.1)/(n_k̄F̄ + 0.1) ]
where we add the constant 0.1 to avoid numerical problems. By averaging
over all categories we get
OddsRatio(F) = Σ_{k=1}^{C} Pr(y = k) OddsRatio(F, k)
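Both scores reduce to simple count arithmetic. The sketch below computes BNS and the smoothed odds ratio from document counts; the clipping used to avoid Φ^{−1}(0) and the exact placement of the 0.1 constants are assumptions made for illustration rather than the chapter's exact recipe.

import numpy as np
from scipy.stats import norm

def bns(n_kF, n_k, n_notkF, n_notk, eps=0.0005):
    # Bi-Normal Separation: |Phi^{-1}(rate in k) - Phi^{-1}(rate outside k)|.
    tpr = np.clip(n_kF / n_k, eps, 1 - eps)
    fpr = np.clip(n_notkF / n_notk, eps, 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

def odds_ratio(n_kF, n_k, n_notkF, n_notk, s=0.1):
    # Smoothed odds of F inside category k divided by its odds outside k.
    odds_in = (n_kF + s) / (n_k - n_kF + s)
    odds_out = (n_notkF + s) / (n_notk - n_notkF + s)
    return odds_in / odds_out

# Example: F appears in 40 of 100 documents of category k and 10 of 900 others.
print(bns(40, 100, 10, 900), odds_ratio(40, 100, 10, 900))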
14.2.8 Word Frequency
This is the simplest of the feature selection scores. In the study in [14]
they show that word frequency is the third best after information gain and
χ². They also point out that there is a strong correlation between these two scores and word frequency. For each category k, word frequency (WF) for word F is the number of documents in category k that contain word F, i.e., WF(F, k) = n_kF.
Averaging over all categories we get a score for each F:
WF(F) = Σ_{k=1}^{C}

Σ_{(x_1, x_2) ∈ C_ML} ‖F^T(x_1 − x_2)‖²    (15.10)
It is clear that β = 0 is equivalent to using only cannot-link constraints and
β = 1 is equivalent to using only must-link constraints. In our experiments,
we varied the value of β from 0 to 1 with a stepsize of 0.1. The performances
of the clustering results, measured by NMI, are plotted in Figure 15.3. In the
ﬁgure, the x-axis denotes the diﬀerent values of parameter β and the y-axis
the clustering performance measured by NMI.