Abstract

In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics central to it, and its secondary categories, which represent topics the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be tackled explicitly. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassification involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating preferential text categorization in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification, using state-of-the-art learning technology such as SVMs and kernel-based methods. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form "for document d_i, category c′ is preferred to category c′′"; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
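To make the preferential-form training data concrete, the following is a minimal sketch (not the paper's actual algorithm) of a category-ranking perceptron in the style of Crammer and Singer's online category-ranking family: each training constraint "for document d_i, category c′ is preferred to category c′′" triggers an update whenever the current per-category weight vectors score the dispreferred category at least as high as the preferred one. The feature matrix, the preference triples, and all names here are illustrative assumptions.

```python
import numpy as np

def train_pref_perceptron(X, prefs, n_cats, epochs=10):
    """Perceptron-style learner driven by category preferences.

    X       : (n_docs, n_feats) document feature matrix (illustrative).
    prefs   : triples (doc_idx, preferred_cat, dispreferred_cat), i.e.
              "for document d_i, category c' is preferred to c''".
    n_cats  : number of categories; one weight vector is kept per category.
    """
    W = np.zeros((n_cats, X.shape[1]))
    for _ in range(epochs):
        for i, c_pref, c_disp in prefs:
            x = X[i]
            # If the dispreferred category scores at least as high,
            # push the preferred category's scorer toward x and the
            # dispreferred one away from it.
            if W[c_pref] @ x <= W[c_disp] @ x:
                W[c_pref] += x
                W[c_disp] -= x
    return W

# Toy example: two documents, two categories. Document 0 prefers
# category 0 over 1 (e.g., its primary over a non-category); document 1
# prefers category 1 over 0.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
prefs = [(0, 0, 1), (1, 1, 0)]
W = train_pref_perceptron(X, prefs, n_cats=2)
```

After training, ranking the categories by `W @ x` for each document satisfies the stated preferences; richer preference sets (primary ≻ secondary ≻ non-category) plug into the same loop.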

Acknowledgements

We thank Tiziano Fagni for indexing the WIPO-alpha collection and Andrea Esuli for useful discussions on Kendall distance. Thanks also to Lijuan Cai, Shantanu Godbole, Juho Rousu, Sunita Sarawagi, Domonkos Tikk, and S. Vishwanathan for clarifying the details of their experiments. This work has been partially supported by the project “Tecniche di classificazione automatica per brevetti”, funded by the University of Padova.
