\section{Introduction}
\label{sec:intro}
Minimally supervised extraction of tuples of entities involved in a particular relation has been the subject of considerable recent investigation~\cite{ref:dipre,ref:snowball,ref:espresso}. These methods start from a small set of known tuples in the relation of interest, and bootstrap extractors that can find more tuples in the relation from text. For example, if the relation of interest is (corporate) ``acquisition'', such a learned extractor may include patterns that extract the pair (\texttt{Acme Corp.}, \texttt{XYZ Inc.}) from the sentence \emph{``XYZ Inc. was acquired by Acme Corp. for 10 million in cash.''}
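To make the example concrete, the following toy sketch shows what a single learned surface pattern might look like. It is purely illustrative and not the actual pattern representation of any of the cited systems; real bootstrapped extractors learn many such patterns automatically from seed tuples.

```python
import re

# Toy illustration: one hand-written surface pattern for the
# "acquisition" relation.  Named groups capture the two entities.
PATTERN = re.compile(
    r"(?P<acquired>[A-Z][\w.]*(?: [A-Z][\w.]*)*) was acquired by "
    r"(?P<acquirer>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)

def extract_acquisitions(sentence):
    """Return (acquirer, acquired) pairs matched by the pattern."""
    return [(m.group("acquirer"), m.group("acquired"))
            for m in PATTERN.finditer(sentence)]

print(extract_acquisitions(
    "XYZ Inc. was acquired by Acme Corp. for 10 million in cash."))
# → [('Acme Corp.', 'XYZ Inc.')]
```

A bootstrapping system would start from seed pairs such as (\texttt{Acme Corp.}, \texttt{XYZ Inc.}), find sentences mentioning both entities, and induce new patterns from those contexts.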
Such extractors for specific relations are useful in question answering and text mining applications, but they do not provide an inventory of the relations that may involve entities of a given type and that would help us characterize specific entities of that type. In this study, we investigate methods that, starting from a small seed set, learn to extract a wide range of entity \emph{attributes}. Such attributes are expressed in natural language, for instance, with relational nouns (``the revenue of $X$'') or reduced clauses (``a company headquartered in\ldots''). In contrast to previous methods that seek specific relationships like ``headquarters of'' or ``instance of'' and their values~\cite{ref:snowball,ref:pantel-ravi04}, we seek to identify all the commonly used attributes of an entity type, such as the \emph{capital}, \emph{population}, and \emph{GDP} attributes of countries.
Attribute extraction is of potential value in search and text mining. \newcite{ref:pasca} shows the prevalence of attribute mentions in Web queries, discusses how recognizing attributes can improve search results, and considers attribute extraction as a building block for the automatic creation of knowledge bases from text. \newcite{ref:probst_attr_val} state that applications in the business domain can greatly benefit from knowledge of product attributes and their values.
We adopt a semi-supervised bootstrapping approach to avoid the cost of annotating a large training set for supervised learning. Earlier systems that use bootstrapping for extraction differ in the types of resources used and the types of decisions made during the extraction process. Large unlabeled collections, including (portions of) the Web, are used either for estimating the confidence of candidate extractions or as a source of further instantiations of the relations of interest. KnowItAll~\cite{ref:knowitall} crawls the Web looking for fixed patterns corresponding to a desired relation. Since the Web provides redundant evidence for many relations of interest, the system can afford to favor precision over recall. However, when operating on smaller corpora (for example, product reviews), such methods can fail to produce useful results because of their low recall. Other systems such as DIPRE~\cite{ref:dipre} and SnowBall~\cite{ref:snowball} assume functional relationships, for instance that each organization has a single location. Since some attributes express one-to-many relationships, we cannot rely on functionality in the present work.
Our methods build on the work of \newcite{ref:espresso} and use co-training to recognize attributes of a new entity type. Following \newcite{ref:collins99unlabeled}, we split features into two sets: \emph{content} features that depend only on the candidate tuple, and \emph{context} features that depend only on the context in which the candidate tuple occurs. Given these two sets and an initial set of seed features, the method iterates between finding relevant tuples and finding relevant patterns. By expressing the problem in terms of features rather than fixed expressions, we are able to improve recall while maintaining high precision. After the iterative steps, the list of resulting candidate tuples is re-ranked using additional features. A measure based on mutual information is used to select patterns and tuples that are highly correlated with each other.
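For concreteness, one standard instantiation of such a correlation measure, used by \newcite{ref:espresso}, is the pointwise mutual information between a pattern $p$ and a tuple $t$ (the exact scoring functions we use are defined later in the paper):
\[
\mathrm{pmi}(p, t) = \log \frac{|p, t|}{|p, \ast|\,|\ast, t|},
\]
where $|p, t|$ is the frequency with which pattern $p$ extracts tuple $t$, and $|p, \ast|$ and $|\ast, t|$ are the marginal frequencies of the pattern and the tuple, respectively. Patterns and tuples with high mutual association reinforce each other across the co-training iterations.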
The rest of the paper is organized as follows. Previous work is discussed in Section~\ref{sec:rel_work}. The methods are presented in Section~\ref{sec:arch}, and experimental results in Section~\ref{sec:results}. We conclude with a summary and discussion of future directions.