Unintended Information Revelation

Project Overview

With an increasing number of documents being generated by different individuals
and departments in a business or government and placed on its website, there
is a risk of the business or government department releasing information that
is not intended for public consumption or that is inconsistent with its
overall goals, objectives, and operations. While manually identifying such
unintentionally revealed information is possible, it is neither efficient nor
effective. The goal of Unintended Information Revelation (UIR) is to explore
automated solutions that identify candidate document subsets that reveal more
information together than their member documents do separately. These subsets
can then be manually reviewed, and action can be taken to withhold, if
desired, the information snippet in question. The UIR problem refers to a
phenomenon where the information synthesized from multiple documents is more
than the sum of the information provided by the individual documents.
Snippets of information from stand-alone documents may seem innocuous by
themselves. However, when put together, the synthesized information may reveal
more than the authors intended.

The proposed solution takes the view that the concepts and associations in a
particular domain can be represented as a probabilistic network with the nodes
representing concepts and edges between them representing associations. A
domain expert or automated learning from sample documents can assign weights to
the nodes and edges to quantify the relevance of these concepts and
associations to a particular domain. A document from this domain is viewed as a
sub-graph of this probabilistic network, and its footprint in the domain
map is a measure of the information it reveals. Footprints can be computed for
document subsets from the concepts and associations that occur in them and
the underlying network in the domain map, and then compared with the
footprints of the member documents. The claim is that document subsets that
reveal more information have footprints larger than the sum of the footprints
of their member documents.
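
The footprint comparison described above can be illustrated with a small
Python sketch. It is not the project's implementation: the concept names, the
weights, and the footprint formula (the summed weights of the concepts present
plus the weights of every association whose two concepts are both present) are
all assumptions made for illustration, as are the helper names footprint and
revealing_subsets.

```python
from itertools import combinations

# Hypothetical domain map: weights on concepts (nodes) and associations (edges).
# All names and numbers here are invented for illustration.
CONCEPT_WEIGHTS = {"project_x": 0.5, "vendor_a": 0.25, "site_b": 0.25}
ASSOCIATION_WEIGHTS = {
    frozenset({"project_x", "vendor_a"}): 0.5,
    frozenset({"project_x", "site_b"}): 0.75,
    frozenset({"vendor_a", "site_b"}): 1.0,
}


def footprint(concepts):
    """Assumed footprint: weights of known concepts plus weights of every
    association whose two endpoints are both among those concepts."""
    known = {c for c in concepts if c in CONCEPT_WEIGHTS}
    node_total = sum(CONCEPT_WEIGHTS[c] for c in known)
    edge_total = sum(w for edge, w in ASSOCIATION_WEIGHTS.items() if edge <= known)
    return node_total + edge_total


def revealing_subsets(documents, size=2):
    """Return (names, combined, separate) for each subset of `size` documents
    whose combined footprint exceeds the sum of its members' footprints."""
    flagged = []
    for names in combinations(sorted(documents), size):
        merged = set().union(*(documents[n] for n in names))
        combined = footprint(merged)
        separate = sum(footprint(documents[n]) for n in names)
        if combined > separate:
            flagged.append((names, combined, separate))
    return flagged


# Each document is reduced to the set of concepts assumed to occur in it.
documents = {
    "doc1": {"project_x", "vendor_a"},
    "doc2": {"site_b"},
    "doc3": {"project_x"},
}

for names, combined, separate in revealing_subsets(documents):
    print(names, "combined:", combined, "separate:", separate)
```

With these made-up weights, doc1 and doc2 are flagged together because their
union activates associations that neither document contains on its own, while
doc1 and doc3 are not, since doc3 adds no concept that doc1 lacks. A real
system would obtain the concepts and associations through information
extraction rather than from hand-written sets.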

A combination of techniques from Natural Language Processing (NLP), Information
Extraction (IE), and Information Retrieval is being explored to solve the UIR
problem. The probabilistic network for a domain, once generated, is a treasure
trove for analysts to mine information on the business or government
department's operations. The concepts and associations extracted using
Information Extraction can also prepare the website for the Semantic Web
proposed by the World Wide Web Consortium.