Abstract

With rapidly growing interconnected collections of cultural materials, we need new approaches to information organization. We propose that schematic models which describe the content of documents rather than descriptions about the documents are the key for this new generation of descriptive systems. We explore process-oriented models implemented with linguistic frames as an approach to organizing a collection of rich content such as descriptions of history. We show how linguistic frames can implement the state-change model and how those frames might be applied to the organization of content such as history texts and complex materials such as historical newspapers. We focus on verb frames because we base our approach on state changes in entities which are described with verbs. We propose that systems of information organization for rich content such as historical texts be based on a "fabric" of entities and events. For instance, we describe incorporating into the fabric complex entities such as those with multiple parts. We also describe disaggregating complex events with sequences of related events which we call flows, as well as propose a flexible approach to grouping events and a schematic description of people including their mental states. Our approach is a companion to the original content and explicitly allows for versioning, metadata, and hierarchies of entity classes, as well as partonomies, functionality and instances. While our focus here is on history, the structures we define should be able to be applied across a variety of fields and they should be useful as targets for text mining. In a subsequent paper, we extend the fabric to use discourse elements to describe the relationships among events.

1. Process-Oriented Information Organization for Rich Content

1.1. Focus on Entities and Events

Information organization has traditionally been concerned with document-level description. With burgeoning amounts of rich content now available, we need better support for direct access to that content. We propose that schematic models which describe the content of documents rather than descriptions about the documents are the key for this new generation of descriptive systems. We call this a model-oriented approach because we propose that events be modeled as inter-related threads [1, 2, 3, 8].

Only relatively recently have attempts been made to describe events systematically. Yet events are integral to software engineering which uses formalisms such as state-machines, Petri Nets, state charts, transition networks, and programming languages. Events have also played a growing role in formal approaches to narrative (e.g., [7, 16]), in document preservation metadata (e.g., [15]), in linguistics (e.g., [9]), and in cognitive science (e.g., [24]). Some formal models for events are beginning to be applied to the study of history; these include text processing of historical newspapers [4], models for interactive timelines [2], analysis of named events [21, 22], and specification of events using semantic web notation [18, 23].

[3] applied our qualitative process model for entities and events [8] to the description of scientific research reports. That model defined an event as the state of one entity causing a state-change in one or more other entities. The original definition has been expanded to cases when processes create or destroy entities [5]. While events may be included in ontologies, such static descriptions are limited compared to connected events in a threaded narrative or events in an executable script.

1.2. Causal Relationships in Narrative Timelines

While traditional timelines present separate events without context or inter-connection, we have developed timelines which tell a story or provide an explanation for how the events are related. Some of this work was based on models of plots in narratives (e.g., [2, 16]). [2] developed an interactive narrative timeline to present an overview of several explanations for the outbreak of the American Civil War. The interface supports the presentation and browsing of several different causal threads (Figure 1).

Figure 1. In the interactive timeline from [2], the user is given a guided tour through several explanations of the causes of the American Civil War as threads through a set of events (highlighted). Text in the boxes on the right describes the events and the thread which connects them.

In addition, [2] discussed how the state-change approach to causal attributions of [8] might be used in narratives, with a mostly hard-wired representation of events. A schematic representation will enhance the ability to interact with and extend the descriptions of historical events. In addition, the schematic representations could incorporate different perspectives and would allow transitions among them. They would also provide links to supporting evidence for perspectives and allow personalization of explanations and narrative presentations (see [6]).

1.3. Frame Semantics

In this paper, we explore how linguistic frames can implement the state-change model and how those frames might be applied to the organization of content such as history texts and complex materials such as historical newspapers. Linguistic frames are structures which bring together the related aspects of a concept. The theory of Frame Semantics [11] proposes that meaning is constructed by the activation and interconnection of frames. According to this theory, frames are typically activated by linguistic units. Thus, the verb "to give" activates the "giving" frame. Several linguistic units may activate the same frame; "to donate" would also activate the "giving" frame. Once the frame has been activated, a person knows to expect to find core frame elements: a donor, a recipient, and a theme (i.e., something being transferred between them). In addition, there may be non-core elements such as the purpose or place of the action. The frame elements are, roughly, semantic roles with the frame unifying them. Semantic roles have been recognized as fundamental to information science (e.g., [14]). In order to explore the theory of frame semantics, a large collection of frames and associated linguistic units was developed in the FrameNet project [20]. The frames were constructed by examining examples in a variety of texts for elements which were consistently associated with a frame. These accompanying elements were then clustered into semantic roles.
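The relationship between linguistic units, frames, and frame elements described above can be sketched as a small data structure. This is a minimal illustration under our own assumptions: the `Frame` class, the `LEXICAL_UNITS` table, and the helper functions are hypothetical names and are not part of the FrameNet data or software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """A frame with its core and non-core frame elements (semantic roles)."""
    name: str
    core: tuple
    non_core: tuple = ()


# Hand-built entry for illustration only; not loaded from FrameNet.
GIVING = Frame(
    name="giving",
    core=("Donor", "Recipient", "Theme"),
    non_core=("Purpose", "Place"),
)

# Several linguistic units may activate the same frame.
LEXICAL_UNITS = {"give": GIVING, "donate": GIVING}


def activate(verb):
    """Return the frame activated by a verb, or None if the verb is unknown."""
    return LEXICAL_UNITS.get(verb)


def missing_core(frame, fillers):
    """List the core elements that a set of role fillers leaves unfilled."""
    return [role for role in frame.core if role not in fillers]


# Hypothetical fillers for a sentence that mentions no recipient.
fillers = {"Donor": "a benefactor", "Theme": "a collection of letters"}
unfilled = missing_core(activate("donate"), fillers)
```

Once a frame is activated, the unfilled core elements indicate what a reader, or an indexing tool, should still expect to find in the surrounding text.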

1.4. Application to Information Organization

We propose that systems of information organization for rich content such as historical texts be based on a "fabric" of entities and events. In developing the entity-event fabric, we explore the use of verb frames from the FrameNet project. Thus, we envision an incremental development in which the fabric initially is used for relatively simple applications (e.g., Figure 1) and its scope is gradually extended. As described in [6], this fabric would also be extended with discourse elements. Eventually, the schematic environments could be used for interactive text generation such as question answering systems, tutoring systems, and conversational historical agents (see [6]). This work in indexing historical texts stems from previous work on text mining historical newspapers and on designing new types of timelines (see Figure 1). This work also complements our research on conceptual modeling of scientific research reports [3, 5] even though historical texts are much less structured than scientific research reports.

1.5. Road Map

Section 2 of this paper describes the use of FrameNet for coding texts. Section 3 considers complex entities. Section 4 examines several types of hooks into the fabric. Section 5 discusses next generation information organization. Section 6 is the conclusion.

2. Process Descriptions with FrameNet Verb Frames

As described above, the FrameNet project [20] has developed a set of frames for terms from everyday English. Linguistic verb frames generally have several advantages as a representation for processes. First, natural language texts can be interpreted with frames. Second, while there are many frames, they are finite and akin to a controlled vocabulary. Third, verb frames interweave with and place constraints on entities. The FrameNet set of frames in particular is the result of much attention to consistency. FrameNet frames have been used for annotating short passages but not previously proposed for large-scale information organization. We propose using FrameNet frames as a platform for indexing and representing rich corpora; that is, we use the semantic roles associated with the frames rather than emphasizing the underlying cognitive theory.1

In the current FrameNet corpus, about 10,000 lexical units are mapped to about 1,100 frames.2 The frames could be used to enhance tagging history corpora and we discuss some of the possibilities here. Schematic modeling explicates relationships that may be difficult to extract on the fly from unstructured texts.

As an example of how frames could be used, consider the statement:

President Lincoln signed the Emancipation Proclamation and freed the slaves.

The core for the FrameNet freeing_from_confinement frame is:

An Agent brings about the end of a Theme's captivity at a Location of Confinement

The frame has three core elements: Agent, Theme, and Location of Confinement. Thus the freeing_from_confinement frame links the Agent [Emancipation Proclamation], the Theme [the slaves], and the Location of Confinement [slavery]. The statement declares that there is a change in the slaves' state of confinement.

A second frame, sign_agreement, has the core:

A Signatory [Lincoln] signs an Agreement [Emancipation Proclamation], thereby taking on a commitment encoded in the Agreement [emancipation].

The sign_agreement frame would be linked to the freeing_from_confinement frame through the latter's non-core frame element Means. Further, we could link to the physical pen and inkwell used for the signing directly by way of the non-core Instrument frame element for the sign_agreement frame (see Figure 1 in [6]). We can also note that Lincoln did not himself take on the commitment encoded in the agreement but did that in his role as President.
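The linkage between the two frames in this example can be sketched as nested frame instances, where the filler of the Means element is itself a frame instance. The `FrameInstance` class and the `linked_frames` helper are hypothetical names introduced for illustration; the role fillers are those given in the text.

```python
from dataclasses import dataclass, field


@dataclass
class FrameInstance:
    """One use of a frame, with fillers for its frame elements."""
    frame: str
    roles: dict = field(default_factory=dict)


signing = FrameInstance(
    "sign_agreement",
    {"Signatory": "Lincoln", "Agreement": "Emancipation Proclamation"},
)

freeing = FrameInstance(
    "freeing_from_confinement",
    {
        "Agent": "Emancipation Proclamation",
        "Theme": "the slaves",
        "Location of Confinement": "slavery",
        # Non-core element whose filler is another frame instance.
        "Means": signing,
    },
)


def linked_frames(instance):
    """Yield frame instances reachable through role fillers, depth first."""
    for filler in instance.roles.values():
        if isinstance(filler, FrameInstance):
            yield filler
            yield from linked_frames(filler)


reachable = [f.frame for f in linked_frames(freeing)]
```

An Instrument filler for the signing (e.g., the pen) could be attached to `signing.roles` in the same way.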

Refinements in the FrameNet frames may be needed in some instances to describe events adequately. For instance, the freeing_from_confinement frame seems to contemplate freeing from a physical location and may be too concrete for the act of freeing from the institution of slavery. Similarly, the Emancipation Proclamation is not an agreement.

Frames are a way of developing consistent ways of organizing and defining entities and verbs (i.e., processes) to describe the complexity of the world. Because FrameNet frames reflect everyday usage of native speakers, that collection of frames is not fully developed for technical or domain-specific texts. Fortunately for the application to history, most history texts do not use highly domain-specific language. Frames for many domain-specific topics remain to be determined so any frame descriptions we need in those areas will be ad hoc.3

3. Complex Entities

Entities are objects, locations, or concepts. The entities associated with historical events are complex and are generally richer than those usually considered in thesauri and classification systems. In our approach entities are associated with specific attributes. Our entities are somewhat related to frames and schemas from artificial intelligence or structured objects and classes from object-oriented programming languages. However, we emphasize the development of standardized templates for those schemas. Our approach emphasizes descriptors and qualitative states for attributes because those are well suited for mapping verbal descriptions in natural language which are essentially categories. Many verbs describe state changes in attributes and in contexts. Moreover, verbs help to define the attributes of entities which are recognized by speakers of a language. Those attributes are captured as semantic roles in the FrameNet frames.

With the wide range of entities in historical materials, we cast a big net and allow many gradations such as different religious groups or different geographic areas. In some cases, these entities are formally recognized; in other cases, they are ad hoc. The latter may include sets of largely independent-but-related entities that can also be treated as single complex entities.

An automobile is a complex entity. It has attributes as a whole but is composed of parts that are themselves entities. Similarly, an organism is a complex entity (see [5]). The components need to be modeled along with their relationship to the whole. Parts are often identified based on belonging to subsystems and serving specific functions. The "part-of" relationships form partonomies. Moreover, in complex entities, there may be several layers. A state change in one of the low-level components may affect the entire entity. For example, a mutation in DNA may produce illness in an entire organism. Despite the complexity, there need to be standard and extensible ways of describing the relationships. This is beginning to be done for biological systems [25]. While the interactions among components can be very complex, people tend to use relatively simple qualitative models in analyzing phenomena [12].
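A layered partonomy with qualitative states might be sketched as follows. The `Entity` class, the `apply_change` function, and the effect rules are assumptions made for illustration. In particular, the intermediate "cell" layer and the rule table are hypothetical and greatly simplify how a change in a part may affect the whole.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Entity:
    """An entity with a qualitative state, its parts, and its enclosing whole."""
    name: str
    state: str = "normal"
    parts: list = field(default_factory=list)
    whole: Optional["Entity"] = field(default=None, repr=False)

    def add_part(self, part):
        """Register a part-of relationship and return the part."""
        part.whole = self
        self.parts.append(part)
        return part


def apply_change(entity, new_state, effects):
    """Change an entity's state and follow effect rules up the partonomy.

    `effects` maps (part name, part state) to the state induced in the whole.
    Propagation stops at the first layer for which no rule applies.
    """
    trail = []
    node, state = entity, new_state
    while node is not None and state is not None:
        node.state = state
        trail.append((node.name, state))
        state = effects.get((node.name, state))
        node = node.whole
    return trail


organism = Entity("organism")
cell = organism.add_part(Entity("cell"))
dna = cell.add_part(Entity("DNA"))

# Hypothetical qualitative rules, for illustration only.
effects = {
    ("DNA", "mutated"): "malfunctioning",
    ("cell", "malfunctioning"): "ill",
}
trail = apply_change(dna, "mutated", effects)
```

Keeping the rules in a table separate from the partonomy reflects the point that people tend to reason with simple qualitative models even when the underlying interactions are complex.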

Complex entities may also evolve in stages across time. Some stages, such as those in an insect's lifespan, are highly predictable and readily treated as states. In other cases, the stages might be roles a person takes on at different times. Stages of some entities, such as political movements (called "trends" in [2]), are less predictable and less structured.

4. Hooks and Knots in the Fabric

4.1. Events, Flows, and the Entity-Event Fabric

Following [3, 8], an event can be viewed as a state change triggered by the state of another entity. In this view, a simple causal relationship is integral to the event. We adopt the term "flow" for a set of events which are related and follow one another (cf. [5]). They can be modeled with formalisms such as state machines, Petri nets, programs, flow-charts, workflows, and activity diagrams. Similarly, connected threads or scripts of events are familiar from cognitive psychology, and may also be used to model flows.

There are different types of flows. They may follow a single prescribed sequence, they may also have branching points, or they may be relatively ad hoc collections of events. Many complex activities combine all of these. For instance, a court trial has a strict overall structure but parameters such as the number of witnesses called may vary and the statements of the witnesses may be unpredictable. High-level descriptions can usually be broken down into lower-level, more detailed components. For instance, a statement that one army conquered another might be broken out into the flow describing how one army landed, attacked, and defeated the other. And, each of those statements could probably be broken down further.
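The decomposition of a high-level event into a flow of more detailed events can be sketched as a tree that is expanded to a chosen level of detail. The `Event` class and `expand` function are hypothetical names, and the two lowest-level events in the example are placeholders we have added to show a second level of decomposition.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """An event that may be broken down into an ordered flow of sub-events."""
    label: str
    sub_events: list = field(default_factory=list)


def expand(event, depth):
    """Return the flow for an event, descending at most `depth` levels."""
    if depth == 0 or not event.sub_events:
        return [event.label]
    flow = []
    for sub in event.sub_events:
        flow.extend(expand(sub, depth - 1))
    return flow


conquest = Event(
    "army A conquered army B",
    [
        Event("army A landed"),
        Event(
            "army A attacked army B",
            # Placeholder sub-events, not taken from any source.
            [Event("army A advanced"), Event("army A engaged army B")],
        ),
        Event("army A defeated army B"),
    ],
)

summary = expand(conquest, 0)
flow = expand(conquest, 1)
```

This sketch covers only a single prescribed sequence; branching points and ad hoc collections of events would need a richer structure, such as a state machine or Petri net.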

We envision a fabric of interconnected entities and events. Initially, the entities and events in the fabric would be drawn from a limited number of sources but eventually they could come from many sources and be interwoven. The events would be interconnected at several levels. For instance, we might say that Hannibal attacked Rome and might also go into detail about those attacks. It is impossible to specify all the events in history and there will therefore be gaps in the fabric, but there is generally coherence in historical texts. To capture that coherence, we extend the fabric with discourse elements in [6]. For instance, the events in Figure 1 are connected by causal threads. In other cases, events are interlinked and interwoven with explanations which provide elaboration and clarification to the event descriptions.

4.2. Grouping and Scenarios

We need a flexible approach to grouping events. The specification of scenarios is a particularly important type of grouping in our model. A scenario is a set of inter-related entities and events. Grouping by scenario keeps all of the relevant components connected. It is related to the notion of a narrative (see [6]) but narratives are focused on presentations while scenarios are part of the fabric and may be considered a source from which the narratives are drawn.

Scenarios can be largely independent from other parts of the fabric. We may have a good understanding of the synchronization of events grouped in a scenario but not much about how that set of events relates to the broader fabric. As a stand-alone unit, the events in the scenario could be executed in order with the entities acquiring the successive states. Moreover, some entities are highly transient; thus, they may be defined just within the scope of the scenario or situation.
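Executing a scenario as a stand-alone unit might be sketched as below, treating each event as the state of one entity triggering a state change in another. The `Event` class, the `run_scenario` function, and the state labels are assumptions for illustration; the example reuses the Emancipation Proclamation statement in a deliberately simplified form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A state change in one entity triggered by the state of another."""
    trigger: tuple  # (entity, state) that must hold for the event to occur
    change: tuple   # (entity, new state) produced by the event


def run_scenario(initial_states, events):
    """Execute events in order and return the successive state snapshots."""
    current = dict(initial_states)
    history = [dict(current)]
    for event in events:
        entity, required = event.trigger
        if current.get(entity) != required:
            raise ValueError(f"trigger not satisfied: {entity} is not {required}")
        target, new_state = event.change
        current[target] = new_state
        history.append(dict(current))
    return history


# Entities are defined only within the scope of this scenario.
initial = {"Lincoln": "President", "Proclamation": "unsigned", "slaves": "enslaved"}
events = [
    Event(("Lincoln", "President"), ("Proclamation", "signed")),
    Event(("Proclamation", "signed"), ("slaves", "freed")),
]
history = run_scenario(initial, events)
```

Because the scenario carries its own entities and initial states, it can be run without knowing how it connects to the broader fabric.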

4.3. Schematic Description of People

People are entities in our framework; they are extremely complex (see Section 3). Our descriptive requirements are far beyond anything offered by current standards such as FOAF. In this section we describe some possible approaches for developing structured descriptions of people. These could have a broad range of possible applications such as providing a formal structure for lifelogs. Similar complex descriptions will also be needed for other very complex entities such as organizations and institutions.

Biologically, people can be described as organisms with partonomies and stages of development (see Section 3). For describing typical states and behaviors, we might adopt Maslow's hierarchy of needs (see [10]). This could describe attributes such as hunger which motivate behaviors and other types of physical states such as diseases. For effective information organization, we might share some structures adopted by the user modeling community.

We have a rich language for describing mental activities and internal events such as emotions, expectations, opinions, beliefs, goals, intentions, and strategies. Emotions are readily described with states and potentially other mental activities can be described with states as well. A "goal" is the ambition to place one or more entities into specific states. Goals are accomplished through plans. In some cases, the plan may be very specific or it may simply be a strategy which evolves and adapts according to conditions. As such, the plan can be modeled as a flow with decision points.

Mental events have triggered much of history but are primarily internal and subjective. Higher-level psychological needs and dynamics are even more controversial. Nonetheless we can adopt a descriptive system so long as we document its use and are consistent in our application.

Because individuals engage in a changing set of social relationships, we need flexible mechanisms for defining those relationships. For instance, couples may act as a single composite entity in some situations, but in other situations those individuals may act separately. Eventually, we might also model knowledge, social norms, cultural standards and folk psychologies.

5. Next Generation Information Organization

While traditional information organization has been oriented to indexing information resources, there is an increasing need to support direct interaction with full text and other types of rich content. Our approach could be considered as a type of semantic markup (e.g., [17]). However, our approach is better considered as a companion to the original content rather than as markup with its limitations. For instance, our approach explicitly allows for versioning.

Initially, our approach may usefully be deployed on a small scale in ad hoc situations. This approach can be extended to cover large collections of rich content although the sheer scale of that task makes it challenging. One advantage compared to some previous projects is that we are primarily focused on description rather than inference. With inference, small errors can be easily amplified.

5.1. Fixed Definitions versus Language as Used

By employing the semantic structures of natural language such as those captured by FrameNet as the basis for information organization we accept that language changes and the concepts people use are malleable. While it is possible to employ frames as a fixed dictionary of definitions, the FrameNet frames are actually a snapshot of how people use language. Those definitions can evolve. Ideally, we would track and be able to explain the changes.

5.2. Metadata

All components of the discourse model should have associated metadata. The components need many of the same elements that documents do. Those elements would likely resemble Dublin Core with elements such as creator (i.e., who authored the component) and date of creation. Further, there should be values for beginning and end times of state changes although in some cases those may be uncertain or relative to other events. Different versions of events need to be kept distinct. We need to note the author and endorsers of each version. And, we should include a record describing the evidence for the version (see [6]). Finally, there should be links to descriptive standards which are used such as for the collection of frames, directories and catalogs of entity instances, and classification systems for entity classes.
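A metadata record along these lines might be sketched as follows. The field names are our own assumptions, loosely modeled on the elements listed above rather than on the Dublin Core schema itself, and the creator and endorser identifiers in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventVersion:
    """Metadata for one version of an event description."""
    event_id: str
    version: int
    creator: str                      # who authored this version
    date_created: str                 # ISO 8601 date
    begin: Optional[str] = None       # start of the state change, if known
    end: Optional[str] = None         # end of the state change, if known
    endorsers: list = field(default_factory=list)
    evidence: list = field(default_factory=list)
    standards: dict = field(default_factory=dict)


def latest(versions, event_id):
    """Return the highest-numbered version of an event, or None."""
    candidates = [v for v in versions if v.event_id == event_id]
    return max(candidates, key=lambda v: v.version, default=None)


versions = [
    EventVersion("signing-of-proclamation", 1, "annotator-1", "2010-05-01"),
    EventVersion(
        "signing-of-proclamation", 2, "annotator-2", "2010-06-15",
        begin="1863-01-01", end="1863-01-01",
        endorsers=["annotator-1"],
        standards={"frames": "FrameNet"},
    ),
]
current = latest(versions, "signing-of-proclamation")
```

Versions are kept as separate records rather than overwritten, so that competing descriptions and their endorsers remain distinct. Uncertain or relative times would need a richer representation than the optional date strings used here.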

5.3. Hierarchies for Entity Classes

We envision that entities would be organized by classification systems. While FrameNet incorporates clustering of noun frames, at this point classification systems are better developed. Classification systems use "type-of" or "kind-of" hierarchies and they typically provide inheritance of attributes across levels of abstraction. Because the content we are considering is so varied, there would not be a single overarching classification system but a family of classification frameworks. These might be related to Ranganathan's faceted classifications [19] 4 though broader and deeper than anything currently implemented.
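Inheritance of attributes across a "kind-of" hierarchy can be sketched with a simple parent table. The class names, attributes, and the `inherited_attributes` function are hypothetical and stand in for whichever classification frameworks are adopted.

```python
# Hypothetical fragment of a "kind-of" hierarchy, for illustration only.
CLASSES = {
    "vehicle": {"parent": None, "attributes": {"carries": "people or goods"}},
    "motor vehicle": {"parent": "vehicle", "attributes": {"power": "engine"}},
    "truck": {"parent": "motor vehicle", "attributes": {"carries": "goods"}},
}


def inherited_attributes(name, classes=CLASSES):
    """Merge attributes from the most general class down to the named class."""
    chain = []
    while name is not None:
        chain.append(name)
        name = classes[name]["parent"]
    merged = {}
    # Apply general classes first so that specific classes can override them.
    for cls in reversed(chain):
        merged.update(classes[cls]["attributes"])
    return merged


truck_attributes = inherited_attributes("truck")
```

A family of classification frameworks could be represented as several such tables, with an entity pointing to a class in each.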

5.4. Partonomies and Functionality

Complex entities can usually be subdivided into parts. Sometimes these would be physical parts and sometimes they are sub-assemblies each with a specific function. Partonomies are important for all types of complex systems whether they are mechanical systems such as automobiles or social systems such as organizations. The importance of partonomies for describing biological systems is increasingly recognized [25]. They are integral to all types of system analysis. Moreover, partonomies are interrelated with class hierarchies; for example, some parts are shared across trucks, cars, and bicycles while others are not.

Functionality describes how a part of a system contributes to the entire system. Functionality is a type of normative expectation for the contexts of use of an entity (i.e., the verbs and entities associated with an entity). For example, social organizations have parts, often related to functionality. The utility of understanding functionality is illustrated by the Australian "series" system for archives which organizes materials by functionality [13]. An extended process-oriented information organization model might link verbs to entities in a way similar to the way that methods are associated with entity classes in object-oriented programming languages.

5.5. Directories of Instances

Classification systems and partonomies generally describe entities in the abstract. We also need systems for describing entity instances which fill specific positions, slots, or roles. For example, for a full understanding of Lincoln signing the Emancipation Proclamation, it is necessary to know that Lincoln was President and what some of the powers of a President are. [4] developed a directory for the structure of the top layers of the US federal government and for office-holders at positions within the government across several years. While people may fail to follow organizational structures and rules, it is still useful to know the normative expectations.
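A directory of instances that fill positions over time might be sketched as a list of dated records with a lookup by position and date. This is not the directory developed in [4]; the record layout and the `holder_on` function are assumptions for illustration.

```python
from datetime import date

# Each record: (position, holder, start of term, end of term).
DIRECTORY = [
    ("President of the United States", "James Buchanan",
     date(1857, 3, 4), date(1861, 3, 4)),
    ("President of the United States", "Abraham Lincoln",
     date(1861, 3, 4), date(1865, 4, 15)),
]


def holder_on(position, day, directory=DIRECTORY):
    """Return who held a position on a given day, or None if not recorded."""
    for pos, person, start, end in directory:
        # Terms are treated as half-open intervals: start <= day < end.
        if pos == position and start <= day < end:
            return person
    return None


signer_role = holder_on("President of the United States", date(1863, 1, 1))
```

Such a lookup lets an event description refer to the position (President) while the directory resolves it to the person who filled that position at the time of the event.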

Many types of instance information are needed to model the wide range of interrelated topics covered in a daily newspaper. We call organized presentations of such material directories, catalogs, gazetteers, or almanacs. There are already many sets of instance records, such as genealogical, military, and government records, as well as meteorological, sports, and financial databases, but they need to be unified and cross-referenced. This could probably be done with template-enhanced RDF and could include typical processes associated with the fields of the templates. For instance, members of an organization are expected to perform certain activities based on their roles.

6. Conclusion

There has been relatively little investigation of systematic approaches to the organization of and interaction with digital collections of historical materials. As increasing amounts of rich content are becoming available there is a critical need to develop better ways to organize and support access to that content. Having large amounts of historical data available will allow the study of history to focus more on local communities and a broader range of individuals. We have proposed a new approach to the organization of such information based on schematic modeling of entities and events. We have drawn on software engineering, system dynamics, cognitive science, and linguistics as inspirations for modeling. Specifically, we have proposed that events serve as the underlying fabric for history descriptions and that a systematic structural description will provide more clarity than traditional text descriptions. We envision a progression of increasingly ambitious applications of the fabric. The initial versions would map simple secondary texts into graphical displays such as shown earlier in Figure 1. Later steps would coordinate several texts and, ultimately, readers and authors might engage directly with the fabric. In [6] we describe discourse structures which may help to support such applications.

Notes

1 The semantic roles of verb frames can be applied to the state-change model of [4, 9]. The verb defines the dimensions associated with the state change. "To be born" or "to die" describe state changes in the dimension of being alive.

2 FrameNet also includes features such as inheritance of some frames by others and attributes of frame elements such as semantic typing which may be useful but we do not consider using them in this proposal.

3 We can view science as an approach to finding good sets of entities and frames which apply across specified situations.

About the Author

Robert (Bob) B. Allen is now a Visiting Foreign Research Scholar at the University of Tsukuba, Japan. In the past several years, his research has emphasized specification and interaction with narrative chains such as those from stories, from scientific explanations, and from narrative history. Dr. Allen has been Program Chair or General Chair of several major conferences. He was Editor in Chief of the ACM Transactions on Information Systems for 10 years and Chair of the Publications Board of the ACM. He was a Visiting Professorial Fellow at Victoria University of Wellington and he has worked at the iSchools at Drexel and the University of Maryland. Earlier, he was Senior Scientist in the Information Science Research Group at Bellcore and was a Member of Technical Staff in Research at Bell Laboratories. He received his PhD in Social and Cognitive Experimental Psychology from UCSD.