ECML/PKDD-2002 Tutorial

Text Mining and Internet Content Filtering

In recent years, we have witnessed an impressive growth in the
availability of information in electronic format, mostly in the form of
text, due to the Internet and the increasing number and size of digital
and corporate libraries. This overwhelming amount of text is hard for
an average human being to consume, which creates an information
overload problem. Just as traditional Data Mining (or more properly,
Knowledge Discovery in Databases, KDD) is about finding patterns in
data, Text Data Mining (Text Mining, TM for short) is about uncovering
patterns in data when the data is text. In other words, the goal of TM
is to turn the information buried in text into valuable knowledge that
alleviates information overload.

TM is an emerging research and development field that addresses the
information overload problem by borrowing techniques from data mining,
machine learning, information retrieval, natural-language understanding,
case-based reasoning, statistics, and knowledge management to help
people gain rapid insight into large quantities of semi-structured or
unstructured text. TM includes several text processing and
classification techniques, such as text categorization, clustering and
retrieval, and information extraction, but it also involves the
development of new methods for analyzing, digesting and presenting
information.

A prototypical application of TM techniques is Internet information
filtering. The ease of Internet-based information publishing and
communication makes it prone to misuse. For instance, websites devoted
to pornography, racism, terrorism, etc. are accessed daily by
impressionable minors. Also, Internet email users have to put up with
intrusive unsolicited bulk email, which makes email less valuable and
more expensive as a means of communication. Internet filtering through
TM techniques is a promising field of work that will provide the
Internet community with more accurate and cheaper systems for limiting
minors' access to illegal and offensive Internet content, and for
alleviating the unsolicited bulk email problem.

The goal of this tutorial is to familiarize the audience with the
emerging area of Text Mining in a practical way. This goal will be
achieved by making the concepts of the field concrete through two Text
Categorization [39, 40, 33] applications focused on Internet
information filtering: the detection of offensive websites [6], and the
detection of unsolicited bulk email (see e.g. [2, 11, 21, 29, 30]).
Being relatively simple, these applications will allow the audience to
understand the main topics in Text Mining.
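
To give a flavor of what a Text Categorization task such as unsolicited
bulk email detection involves, the following is a minimal sketch in
plain Java of a bag-of-words multinomial Naive Bayes categorizer with
add-one smoothing. It is not the method used in the tutorial or in the
cited systems; the class name TinyTextCategorizer, the labels "spam" and
"legitimate", and the four training messages are all hypothetical and
chosen only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy multinomial Naive Bayes text categorizer (illustrative sketch only). */
public class TinyTextCategorizer {
    // Word counts per class, total tokens per class, documents per class.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercase the text and split on anything that is not a letter or digit. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Add one labeled document to the training counts. */
    public void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split may yield a leading empty token
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Return the label with the highest log-posterior, or null if untrained. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: fraction of training documents with this label.
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int c = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing over the training vocabulary.
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyTextCategorizer model = new TinyTextCategorizer();
        // Hypothetical training messages, for illustration only.
        model.train("spam", "win money now cheap offer");
        model.train("spam", "cheap pills offer click now");
        model.train("legitimate", "meeting agenda for the project review");
        model.train("legitimate", "please review the attached project report");
        System.out.println(model.classify("cheap offer click now"));
        System.out.println(model.classify("project meeting report"));
    }
}
```

A realistic filter would add steps this sketch leaves out, such as
feature selection, cost-sensitive evaluation, and training on a real
message collection.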

The tutorial is of interest to both researchers and practitioners of
KDD and machine learning (and thus, to those attending ECML or
PKDD). Researchers will get a practical overview of the TM field from
the point of view of an applied, interactive KDD process. Practitioners
will get a better understanding of the specific problems of KDD when the
data is text, and their relation to the recurrent problems in KDD.

A basic knowledge of machine learning and KDD is recommended.
Familiarity with the Java programming language is helpful.

The tutorial is divided into two main parts. The first part is an
overview of TM topics, focusing on the specific problems of TM in
relation to KDD. The concepts will be covered in a task-oriented
fashion, in which a number of supervised and unsupervised learning
tasks will be reviewed. The second part will make the TM concepts
concrete through a detailed analysis of the two previously mentioned
Internet filtering tasks. For the detection of offensive websites, an
operational system will be quickly built by reusing a number of
open-source tools, including the Muffin proxy system and the Waikato
Environment for Knowledge Analysis (WEKA) learning library.

In particular, the tutorial will cover the following topics:

1. TM: what is it and what is it not? This section will
cover introductory topics (see e.g. [17, 20, 37]), will state the main
specific problems in TM (in relation to KDD), and will include a review
of hot Text Mining applications.

José María Gómez Hidalgo is a lecturer and researcher at the Computer
Science School of the Universidad Europea CEES, in Madrid, Spain. He
has been carrying out research in the area of Natural Language
Engineering for around eight years, during which he has taken part in
several R&D projects, most of them involving text content analysis,
user profiling, information filtering and related topics. In 2002/03 he
will be leading a team at the Universidad Europea CEES in a European
Commission funded R&D project, called POESIA, focused on the
development of an offensive web content filtering tool. He has
published a number of research reports and articles related to the
topics covered in the tutorial (including [21, 34, 13, 23, 22, 9, 35,
26]).

José María has been a lecturer for seven years at the Computer Science
Schools of the Universidad Complutense de Madrid, Colegio Universitario
Domingo de Soto, and Universidad Europea CEES. He has also given
several courses at the request of corporate firms. In the present term,
he is teaching a Natural Language Processing course at the Universidad
Europea CEES, among others.

[34] L.A. Ureña, M. de Buenaga, and J.M. Gómez
Hidalgo. Integrating linguistic resources in TC through WSD. Computers
and the Humanities, May 2001.

[35] L.A. Ureña, J.M. Gómez, and M. de Buenaga.
Information retrieval by means of word sense disambiguation. In
Proceedings of the TSD 2000 Third International Workshop on TEXT, SPEECH
and DIALOGUE, 2000.