This paper discusses a new term-weighting method for text analysis from the viewpoint of supervised learning. The term frequency–inverse document frequency measure (tf-idf measure) is a well-known weighting method for information retrieval, and it can also be used for text analysis. However, it is an empirical weighting method whose effectiveness has not been clarified from a theoretical viewpoint. Therefore, a more effective weighting measure may exist for document classification problems. In this study, we propose an optimal weighting method for document classification from the viewpoint of supervised learning. Because it exploits training data, the proposed measure is more suitable for the text classification problem than the tf-idf measure. The effectiveness of our proposal is clarified by simulation experiments on the classification of newspaper articles and of customer reviews posted on a web site.

Owing to the development of information technology, the effectiveness of knowledge discovery from enormous document data has been suggested in much of the literature on this subject (Hearst, 1999). There are many web sites where customers can post free comments about merchandise that they have bought and used, and the number of customer reviews on the internet is increasing day by day. It has therefore become easy to obtain a large amount of document data and analyze it for several purposes. Customer reviews consist not only of free comments but also of customer information and the degree of satisfaction with items as metadata. Analysis using this metadata is more helpful for knowledge discovery than using text data alone. Text mining techniques have been developed for the purpose of extracting such information, and various methods have been proposed in this research field, for example, the vector space model (Manning et al., 2008; Mikawa et al., 2012) and probabilistic models (Hofmann, 1999; Bishop, 2006).

In this paper, the vector space model is the focus for document analysis. To construct a vector space model, the documents are separated into terms or words (morphemes) by morphological analysis (Nagata, 1994). Each document is then represented by a vector whose elements express the frequency of appearance of each word. Because the vector space is built from word-frequency information, a document vector model has two remarkable characteristics: high dimensionality and sparseness. Generally speaking, tens of thousands of words or more must be treated to represent a document vector using the effective words appearing in all documents.

As mentioned above, an enormous number of words appear across the whole document set. In addition, the term frequency of each word varies widely with document length; consequently, the performance of text analysis depends on the term frequencies of the words appearing in each document, and hence on the lengths of the documents. To avoid this, several weighting approaches for individual words have been proposed, for instance tf-idf weighting (Salton et al., 1988), the probability-weighted amount of information (PWI) (Aizawa, 2000, 2003), and mutual information (McCallum et al., 1998). Among these, tf-idf weighting is one of the most famous. However, it was proposed for information retrieval and its effectiveness has only been shown empirically; its theoretical optimality has not been proved. In addition, it does not use metadata or side information for weighting each word. Nowadays such information is easy to obtain and use, and exploiting it can be expected to improve the performance of the analysis.

From the above discussion, the purpose of this study is to propose a new weighting method for each word from the viewpoint of supervised learning. We show how to estimate an optimal word weighting by solving a maximization problem. The effectiveness of this method is clarified by case experiments applying it to customer reviews posted on web sites and to newspaper articles used as benchmark data.

In section 2, the basic formulation of the vector space model and previously proposed weighting methods are explained. In section 3, the proposed word-weighting method and the way to estimate it are explained. Section 4 describes the simulation experiments conducted to clarify the effectiveness of our proposal and the results acquired from them. Finally, the conclusions of this study are stated in section 5.

Let the set of documents be Δ={d1, d2 ,…, dD}, and the set of categories to which each document in training data belongs be C ={c1, c2 ,…, cN}. Here, D and N are the numbers of documents and categories respectively.

All documents in Δ are separated into morphemes by morphological analysis. By selecting valid words from the resulting morphemes using frequencies, mutual information, or other criteria, a word set is constructed to represent documents in the vector space. Let the word set obtained from all documents in Δ be Σ = {w1, w2, …, wW}.

Then each document can be expressed as a point in the vector space. Here, W is the number of distinct valid words appearing in Δ, which is equivalent to the dimension of the vector space. Let the frequency of word wj in document di be vij; then di can be expressed as the following W-dimensional vector:
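Written out, the vector implied by these definitions is:

```latex
d_i = \left( v_{i1},\, v_{i2},\, \ldots,\, v_{iW} \right)^T
```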

where T denotes the transpose of a vector. That is, the document space is constructed by regarding each component of a vector as the frequency information of the corresponding word.

By expressing a document as a vector, a similarity or distance metric between documents can be measured in the vector space. The distance between document vectors di and dj can be expressed using the traditional Euclidean distance as follows:
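With the frequency components defined above, the Euclidean distance takes the standard form:

```latex
\mathrm{dist}(d_i, d_j) = \sqrt{\sum_{k=1}^{W} \left( v_{ik} - v_{jk} \right)^2}
```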

However, the Euclidean distance sometimes does not work effectively for document data. For example, many elements of a document vector are almost 0, a property called "sparseness." If only a few words differ between two documents, those documents are judged to have high similarity.

On the other hand, the cosine similarity measure between document vectors shows asymptotically good performance in a sparse vector space (Goto et al., 2007; Goto et al., 2008).
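The contrast between the two measures on sparse vectors can be illustrated with a small sketch; the frequency vectors below are hypothetical toy values, not the paper's data.

```python
import math

# Two sparse word-frequency vectors with the same word profile but
# different lengths, and a third with a disjoint vocabulary.
d1 = [3, 0, 0, 1, 0, 0, 0, 0]
d2 = [6, 0, 0, 2, 0, 0, 0, 0]   # same profile as d1, twice the length
d3 = [0, 0, 4, 0, 0, 0, 1, 0]   # no words in common with d1

def euclid(a, b):
    # Euclidean distance between two frequency vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # cosine similarity: inner product over the product of norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Euclidean distance penalizes the length difference between d1 and d2
# even though their word profiles are identical in direction ...
print(euclid(d1, d2), euclid(d1, d3))
# ... while cosine similarity rates d1 and d2 as maximally similar.
print(cosine(d1, d2), cosine(d1, d3))  # 1.0 and 0.0
```

Here cosine similarity is insensitive to document length, which is the property motivating its use in the sparse document space.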

The weighting can be expressed as the Hadamard product of the document vector di and a weighting vector f. Here, let the weighting vector f be

Here, fk is the weighting function for the word wk. Using this, let the weighted document vector d*i be

where d*i = f ⊙ di denotes the Hadamard product of f and di.
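The Hadamard product is simply an element-wise product, as the following minimal sketch shows; the frequency and weight values are hypothetical.

```python
# A minimal sketch of the weighting d*_i = f ⊙ d_i as an element-wise
# (Hadamard) product; all values are toy examples.
d_i = [2.0, 0.0, 1.0, 3.0]   # word frequencies v_i1 ... v_i4
f = [0.5, 1.0, 2.0, 1.0]     # word weights f_1 ... f_4

# element-wise multiplication of weights and frequencies
d_star = [fk * vk for fk, vk in zip(f, d_i)]
print(d_star)  # [1.0, 0.0, 2.0, 3.0]
```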

There are several weighting methods for text data analysis. As mentioned above, the most famous is tf-idf weighting, which is calculated as the product of tf (term frequency, that is, vij) and idf (inverse document frequency). Here, tf represents the frequency of each term, and idf is a monotonically decreasing function of the rate of appearance of a term across documents. Let df(wk) be the number of documents in which the word wk appears. Then the idf weight for the word wk is given by
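A standard form of this function (variants add smoothing constants) is:

```latex
\mathrm{idf}(w_k) = \log \frac{D}{\mathrm{df}(w_k)}
```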

Setting the weighting function fk to idf(wk), tf-idf weighting can be expressed as follows:
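Each element of the resulting weighted vector is the product vij · idf(wj). A small self-contained sketch, using the common log(D/df) form of idf on a hypothetical toy corpus, is:

```python
import math

# Toy corpus of three "documents" (word lists); all data are illustrative.
docs = [
    ["price", "good", "good"],
    ["price", "bad"],
    ["shipping", "bad"],
]
D = len(docs)
vocab = sorted({w for d in docs for w in d})

def df(word):
    # number of documents in which the word appears
    return sum(1 for d in docs if word in d)

def tf_idf(doc):
    # tf-idf vector: term frequency times log(D / document frequency)
    return [doc.count(w) * math.log(D / df(w)) for w in vocab]

weights = tf_idf(docs[0])
print(dict(zip(vocab, weights)))
```

Note how "good", which appears in only one document, receives a higher weight than "price", which appears in two of the three.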

As mentioned above, tf-idf weighting makes no use of side information or metadata when calculating each word's weight. In the basic text classification problem, each training document carries the information of which category it belongs to. Therefore, making the most of this information when weighting each word can improve classification performance.

In this section, we propose an optimal weighting method and its estimation for text classification, and we derive the expression of the optimal weight as the solution of a maximization problem.

To formulate the learning algorithm that estimates the optimal weighting from the training data set, the centroid of each category is defined as in equation (8). The centroid can be calculated using the training data with known categories. Let the centroid vector of category cn (n = 1, 2, …, N) be gn = (gn1, gn2, …, gnW)T. Then the centroid vector gn is given by
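Consistent with these definitions, the centroid is the mean of the document vectors in the category:

```latex
g_n = \frac{1}{|c_n|} \sum_{d_i \in c_n} d_i
```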

Here, |cn| is the number of documents contained in category cn. In the same way as equation (5), the weighted centroid of category cn is given by
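Analogous to the weighted document vector, this is:

```latex
g_n^* = f \odot g_n
```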

Training data should normally lie near the centroid of their own category, because a new document is classified by its distance from the centroids. Accordingly, the optimization of the weight should maximize the similarity between each training datum and its centroid.

From the above discussion, the optimal weighting vector f is estimated by maximizing the cosine similarity between the weighted document vector d*i and the weighted centroid vector g*n. If the weighting vector is obtained by maximizing the similarity between each datum and its category's centroid, the proposed weighting method can be expected to perform well in classifying data into the proper categories. Our optimal weighting vector f is thus given by
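Consistent with this description, the estimate can be written as:

```latex
\hat{f} = \arg\max_{f} \sum_{n=1}^{N} \sum_{d_i \in c_n}
  \frac{{d_i^*}^{T} g_n^*}{\left\| d_i^* \right\| \, \left\| g_n^* \right\|}
```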

Here, f = (f1, f2, …, fW)T is the weighting vector, and it satisfies
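Consistent with the determinant condition |M| = 1 used in the sequel, the constraint can be written as:

```latex
\prod_{k=1}^{W} (f_k)^2 = 1
```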

Equation (10) can be expressed as a matrix operation using the diagonal matrix whose elements are (fk)2. With this matrix, the optimal weight of each word can be estimated so as to maximize the cosine similarity between each training datum di and its category's centroid gn.

To solve the above problem, we introduce the diagonal metric matrix M = [mk] (k = 1, 2, …, W) with elements mk = (fk)2, that is,
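In symbols:

```latex
M = \mathrm{diag}\!\left( (f_1)^2,\, (f_2)^2,\, \ldots,\, (f_W)^2 \right)
```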

Here, M satisfies |M| = 1, where |M| is the determinant of M.

From the above discussion, equation (10) can be rewritten as follows:
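A matrix form consistent with the cosine similarity of equation (10) is:

```latex
\hat{M} = \arg\max_{M} \sum_{n=1}^{N} \sum_{d_i \in c_n}
  \frac{d_i^{T} M g_n}{\sqrt{d_i^{T} M d_i}\,\sqrt{g_n^{T} M g_n}}
```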

subject to |M| = 1.

From the above formulation, the following theorem is obtained.

Theorem 1: The metric matrix M that satisfies equation (11) is given by
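Consistent with the appendix derivation, where a_k = Σ_n Σ_{d_i ∈ c_n} v_{ik} g_{nk} and A = diag(a_1, …, a_W), the optimal diagonal elements take the form:

```latex
\hat{M} = \mathrm{diag}\!\left( \hat{m}_1, \ldots, \hat{m}_W \right),
\qquad
\hat{m}_k = \frac{|A|^{1/W}}{a_k}
```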

(For the proof, see APPENDIX.)

From the above discussion, the similarity between a document vector and a centroid can be calculated using the metric matrix given by equation (12) as follows:

After calculating the similarity between a test datum and all centroids by equation (13), the test datum is classified into the most similar category.

In this section, simulation experiments are conducted to verify the effectiveness of our method using real Japanese document data. The experiments are performed on two data sets, i.e., customer reviews posted on a web site and newspaper articles with pre-assigned categories. A suitable weight is estimated by learning from these data, and the effectiveness is confirmed by the performance in classifying test data into the correct categories.

The basic process of experiments is as follows.

Step1: Calculate all centroids from the training data.

Step2: Learn the metric matrix from the training data using equation (12).

Step3: Using equation (13), calculate the similarity between each test datum and every category's centroid, and classify the test datum into the most similar category.
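The three steps above can be sketched as follows. This is a minimal illustration assuming the learned diagonal metric takes the closed form suggested by the appendix derivation, m_k = |A|^(1/W) / a_k with a_k = Σ_n Σ_{d_i ∈ c_n} v_{ik} g_{nk}; the data, category names, and dimension are hypothetical toy values, not the paper's data.

```python
import math

# Toy training set: category -> list of word-frequency vectors.
train = {
    "pos": [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0]],
    "neg": [[0.0, 1.0, 3.0], [0.0, 2.0, 2.0]],
}
W = 3  # vocabulary size (dimension of the vector space)

def centroid(vectors):
    # Step 1: per-category centroid (component-wise mean).
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(W)]

g = {c: centroid(vs) for c, vs in train.items()}

# Step 2: learn the diagonal metric M = diag(m_1, ..., m_W), assuming
# the closed form m_k = |A|^(1/W) / a_k (a reconstruction, see above).
a = [sum(v[k] * g[c][k] for c, vs in train.items() for v in vs)
     for k in range(W)]
det_A = math.prod(a)            # |A| for diagonal A
m = [det_A ** (1.0 / W) / a_k for a_k in a]

def similarity(x, y):
    # Step 3: cosine similarity under the learned metric M.
    dot = sum(mk * xk * yk for mk, xk, yk in zip(m, x, y))
    nx = math.sqrt(sum(mk * xk * xk for mk, xk in zip(m, x)))
    ny = math.sqrt(sum(mk * yk * yk for mk, yk in zip(m, y)))
    return dot / (nx * ny)

def classify(x):
    # Assign the test vector to the most similar centroid.
    return max(g, key=lambda c: similarity(x, g[c]))

print(classify([4.0, 1.0, 0.0]))  # prints "pos"
```

Note that the learned weights satisfy the constraint |M| = 1 by construction, since the product of the m_k equals |A| divided by the product of the a_k.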

As mentioned above, two different types of data are used in this experiment. The first is a set of articles from the Mainichi newspaper in 2005, which are used as benchmark data for document classification. The second is a set of customer reviews posted on a web site, used to assess the performance of our method on real data. The Mainichi newspaper articles fall into several categories (economics, sports, politics, and so on); here, three and five categories were extracted at random. As mentioned above, each customer review consists not only of text data but also of the customer's degree of satisfaction; in this experiment we use the two categories corresponding to the highest and lowest degrees of satisfaction.

For comparison, experiments were also performed with the common cosine measure using only term frequency (the tf measure) and with the tf-idf measure. The evaluation criterion is the classification accuracy rate, i.e., the ratio of the number of documents classified into the correct categories to the total number of documents.

The results of the experiments are shown in Figure 1, Figure 2, and Figure 3. Figure 1 and Figure 2 show the cases of newspaper articles: the former is the case of three categories and the latter that of five. Figure 3 shows the case of customer reviews.

From Figure 1 and Figure 2, the proposed method is basically superior to the tf-idf and tf weighting methods. However, from Figure 3, the proposed method and the other two methods show almost the same performance. That is, the proposed method is more effective when newspaper articles are used.

From these results, the proposed method is more suitable than the conventional methods when newspaper articles are used. In the proposed method, the category information of the training data is taken into consideration as a weighting parameter, and this property works well for text classification. tf-idf weighting, by contrast, does not use each category's characteristics. As a result, the performance is improved.

However, the performance of the proposed method, whose solution can be regarded as optimal, is not improved drastically in the customer review classification. This is because customers write their comments freely about the merchandise they bought and used. As a consequence, many invalid words may appear, and the tendencies of the words do not differ much between the categories. The degree of freedom of customer-review text is too high for the proposed method to estimate such a complex statistical structure. For these reasons, the proposed method does not work effectively on this data, and improving it to obtain better performance on customer reviews (real data) is left as future work.

Solving for M under the constraint |M| = 1 is equivalent to solving the following maximization:

In the following, to simplify the calculation, each document d*i and centroid g*n is normalized by its norm; in other words, ||d*i|| = ||g*n|| = 1 (the general case is discussed at the end of this appendix). Equation (14) then becomes

Here, M is a diagonal matrix and |M| = 1. Therefore,

Therefore,

is derived.

To maximize equation (15) under the constraint of equation (16), the method of Lagrange multipliers is used. Let the Lagrange multiplier be λ; then the Lagrange function can be defined as
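A form consistent with the objective (15) and constraint (16), where |M| is the product of the diagonal elements m_k, is:

```latex
L(M, \lambda) = \sum_{n=1}^{N} \sum_{d_i \in c_n} d_i^{T} M g_n
  \;-\; \lambda \left( \prod_{k=1}^{W} m_k - 1 \right)
```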

Partially differentiating L with respect to mk and setting the derivatives to 0,

is derived. Solving equation (18) for mk,

is derived.

Here, let M⁻¹ be

Then, from equation (19),

is

Here, define the matrix A = [akl], and let ak be

Then A is

Therefore,

is derived. From equation (22), A = λM⁻¹ is acquired. By the properties of M,

is derived. Therefore, λ becomes

Accordingly, from equations (19) and (25),

becomes

Here, because A is a diagonal matrix, |A| is given by

Consequently,

is derived.

Accordingly, from equations (26) and (28), each

becomes

The proof is complete. □

In the case ||d*i|| ≠ 1, ||g*n|| ≠ 1, the extension is straightforward.

1. Aizawa, A. (2000). The Feature Quantity: An Information Theoretic Perspective of Tfidf-like Measures. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 104–111.

11. Nagata, M. (1994). A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* Best Search Algorithm. In Proceedings of the 15th International Conference on Computational Linguistics, pp. 201–207.