Security is becoming a weak point of energy and communications infrastructures, commercial stores, conference centers, airports and, in general, any site with high volumes of foot traffic. Practically any crowded place is vulnerable, and the associated risks should be controlled and minimized as far as possible. Access control and rapid response to potential dangers are properties that every security system for such environments should have. The INDECT project aims to develop new tools and techniques that will help potential end users improve their methods for crime detection and prevention, thereby offering more security to the citizens of the European Union.

In the context of the INDECT project, work package 4 is responsible for the Extraction of Information for Crime Prevention by Combining Web Derived Knowledge and Unstructured Data. This document describes the first deliverable of the work package, which gives an overview of the main methodology, describes the XML data corpus schema, and presents the methodology for the collection, cleaning and unified representation of large volumes of textual data from various sources: news reports, weblogs, chat, etc.

This section provides an overview of deliverable 4.1 and the list of participants and their roles, as well as a thorough description of the annotation schemes used in corpora that are available publicly or under licence.

The aim of work package 4 (WP4) is the development of key technologies that facilitate the building of an intelligence gathering system by combining and extending the current state-of-the-art methods in Natural Language Processing (NLP). One of the goals of WP4 is to propose NLP and machine learning methods that learn relationships between people and organizations through websites and social networks. Key requirements for the development of such methods are: (1) the identification of entities, their relationships and the events in which they participate, and (2) the labelling of the entities, relationships and events in a corpus that will be used as a means both for developing and for evaluating these methods.

In this report, we provide an overview and a thorough review of the annotation schemes used to accomplish the above goals. Based on our review, we propose a new annotation scheme able to extend the current schemes. The WP4 annotation scheme is used for the tagging of the XML data corpus that is being developed within work package 4. Our general objectives can be summarised as follows:

Our first objective is the study and critical review of the annotation schemes employed so far for the development and evaluation of methods for entity resolution, co-reference resolution and entity attributes identification.

Based on the first objective, our second goal is to propose a new annotation scheme that builds upon the strengths of the current state-of-the-art. Additionally, the new annotation scheme should be extensible and modifiable according to the requirements of the project.

Given an XML data corpus extracted from forums and social networks related to specific threats (e.g. hooliganism, terrorism, vandalism, etc.), the annotation and knowledge representation scheme should provide the following information:

• The different entity types according to the requirements of the project.

• The grouping of all references to an entity together.

• The relationships between different entities.

• The events in which entities participate.

Additionally, the annotation and knowledge representation scheme should be extensible to include new semantic information.

The WP4-annotation & knowledge representation scheme allows the identification of several types of entities and groups all references to the same entity into one class, while at the same time allowing the identification of relationships and events.

The inclusion of a multi-layered ontology ensures the consistency of the annotation, and allows the satisfaction of the requirements of extensibility and modifiability of the current scheme.

2.1.2.2 WP4-annotation & knowledge representation scheme applications

The WP4-annotation & knowledge representation scheme facilitates the use of inference mechanisms such as transitivity to allow the development of search engines that go beyond simple keyword search. This is accomplished by the use of a multi-layered ontology.

Additionally, the rich annotation offers a benchmark for the evaluation of NLP methods as well as a significant resource for their development and fine-tuning.

In this report we focus on the annotation schemes used in a set of 6 corpora that are available publicly or under license. These datasets/annotation schemes are the following:

•Automatic Content Extraction (ACE)

The first dataset is the Automatic Content Extraction Dataset (release: LDC2007E63) [2]. This dataset is provided by the Linguistic Data Consortium [1] under license. This dataset has been produced using a variety of sources, such as news, broadcast conversations, etc. Table 1.1 provides an overview of the dataset properties. More importantly, ACE annotation also focuses on co-reference resolution, identifying relations between entities, and the events in which these participate.

The goal of the 2009 Knowledge Base Population track (KBP) [3] is to augment an existing knowledge representation with information about entities that is discovered from a collection of documents. A snapshot of Wikipedia infoboxes is used as the original knowledge source. The document collection consists of newswire articles, on the order of 1 million. The reference knowledge base includes hundreds of thousands of entities based on articles from an October 2008 dump of English Wikipedia. The annotation scheme in KBP focuses on the identification of the entity types Person (PER), Organization (ORG), and Geo-Political Entity (GPE).

•NetFlix

NetFlix [9] is a movie rental site that started a competition to improve upon its movie recommendation engine. The movie rating data contain over 100 million ratings from 480 thousand randomly-chosen, anonymous NetFlix customers over 17 thousand movie titles. Clearly, NetFlix focuses on a domain-specific task; hence its annotation is well-suited to this domain.

The purpose of this section is to present the annotation scheme employed by the Automatic Content Extraction (ACE) project [2]. The vast amount of electronic information, most of it on the web, provides a huge resource that can be exploited to enhance the development of natural language understanding applications.

However, in order to take advantage of this potential, it is essential to develop technology that extracts content from human language automatically. This is the objective of the ACE project, i.e. the development of content extraction technology that supports the automatic processing and exploitation of language data in text form. Language data is derived from a variety of sources such as newswire, forums, blogs, etc. The ACE scheme supports a large number of Natural Language Processing (NLP) applications by extracting and representing language content, i.e. the meaning conveyed by the data.

The specific objective of the ACE project is to develop technology to automatically infer from human language data the following:

■ The named entities being mentioned in text.

■ The relations that exist among the identified entities.

■ The events in which the identified entities participate.

■ All references to an entity and its properties.

It should also be mentioned that the ACE data sources include audio and image data in addition to pure text. In addition to English, ACE has also released datasets for Chinese and Arabic. Based on the above, the ACE project consists of the following four tasks:

1. Entity Detection & Tracking (EDT)

2. Relation Detection & Characterization (RDC)

3. Event Detection & Characterization (EDC)

4. Entity Linking Tracking (LNK)

In the following sections, we describe each of the tasks in terms of: (1) the language data that they annotate, (2) the categorization scheme employed to organize the annotated data, and (3) the exact annotation used in text. For each task and type of data annotation we provide a number of examples to allow the comprehension of the annotation framework.

The goal of the ACE EDT task is the recognition of entities, not just names. This means that all mentions of an entity, i.e. a name, a description, or a pronoun, are identified and then classified into equivalence classes. Therefore, co-reference resolution (the Entity Linking Tracking task) is important. ACE classifies entities into one of seven main types, which are further divided into more specific subtypes. These seven main types are the following:

■ Persons (PER)

■ Organizations (ORG)

■ Location (LOC)

■ Facility (FAC)

■ Geographical/Social/Political (GPE)

■ Vehicle (VEH)

■ Weapon (WEA)

3.1.1.1 Persons (PER)

This type is used to annotate entities that refer to a distinct person or a set of people. For instance, a person might be specified by his/her name (e.g. George Robertson), occupation (e.g. the lawyer), family relation (e.g. uncle), pronoun (e.g. he), or any combination of these. The Persons type is further subdivided into the following subtypes:

This subtype is used to classify an entity which refers to a group of people, unless the group can also be characterized as an Organization or GPE, i.e. a group not represented by a formal organization (e.g. the ancient Greeks). For example:

The mention of an organization or a group of organizations in a given document gives rise to an entity type of Organization. Note that an Organization entity must have been established in a formal manner. Some examples of organizations are firms, government units, sports teams, music groups and others. This entity type is divided into the following subtypes:

This subtype refers to entities that are related to governmental affairs, politics, or the state. Note that the entire government of a GPE is excluded from this subtype and should be classified as GPE.ORG as we will see later. This subtype also includes military organizations that are connected to the government of a GPE. Some examples are the following:

This subtype refers to organizations, which primarily focus on providing entertainment services, but excludes giant organizations such as Disney, which is a commercial organization. Some examples are the following:

[The York Theater Company] decided to increase the number of plays. . . [Beatles] was one of the most famous music groups that . . .

This subtype includes organizations that focus on organizing or participating in athletic events. These events can be professional, amateur or scholastic. This subtype also includes groups whose sports are board or card games. Some examples are the following:

[The International Football Federation] set new rules . . . [Manchester United] has lost the European championship, because . . .

Finally, it should be noted that many organizations might fit more than one subtype. In these cases, ACE annotators assign organizations to the most specific subtype.

This type includes composite entities that consist of a population, a government, a physical location, and a nation (or province, state, county, city, etc.). All the mentions of these aspects are marked as GPE. For example, in the phrase the people of U.K., there are two mentions that are marked, i.e. people and U.K. This is because these mentions are co-referenced, as they refer to different aspects of a single GPE. The government of a country is also treated as a reference to the same entity represented by the name of the country. Thus, Greece and Greece's government are mentions of the same entity. Note, however, that specific units within a government are tagged as organizations. The GPE type is divided into the following subtypes:

This subtype includes groupings of GPE that can function as political entities (e.g. [the European Union]).

It should be noted that: (1) non-political clusters of GPEs are marked as Locations (e.g. [Northern Italy]), (2) coalitions of governments are marked as Organizations (e.g. [NATO]). Additionally, each GPE entity is associated with a role that can be Person, Organization, Location, or GPE. This judgment depends on the relations that the entity enters into.

Places that refer to geographical or astronomical regions and do not constitute political entities give rise to Location entities. For example, the Ouse river, Mount Everest, or the solar system are Location entities. This type is further divided into the following subtypes:

In ACE a facility is defined as a functional, primarily man-made structure. This type includes buildings such as airports, stadiums, factories, museums, prisons, etc. They can be considered as artifacts of the domains of civil engineering or architecture. Facilities are further subdivided into the following:

Vehicle (VEH) & Weapon (WEA) are two types which are also included in ACE 2008.

However, the entity task guidelines do not describe these types.

3.1.2 Relation Detection & Characterization (RDC)

The goal of this task is to identify and characterize the relations between two target entities that have been identified in the EDT task. Each relation takes two arguments, i.e. the entities that participate in the relation. Each identified relation has to be assigned one of the seven class types. These types and their subtypes are the following:

All other social relationships that do not fit into the above subtypes are assigned to PER-SOC.Other. For example:

PER-SOC.Other ([George's flatmates], [George])

3.1.2.3 Employment/Membership/Subsidiary (Tag: EMP-ORG)

This relation includes employment relations between PERs and an ORG or GPE, subsidiary relations between ORGs and GPEs, and membership relations between one of PER, ORG, GPE and an ORG. It has the following subtypes:

A Discourse relation captures part-whole or membership relations, which are established only for the purposes of the discourse. The group entity referred to is not an official entity relevant to world knowledge. For example:

The goal of this task is to identify and characterize events according to five predefined types. Each event is tagged by its textual anchor, full extent, and participating entities. For each event type there is a salient entity. A salient entity can be the object of the event (Object Salient Events), or the agent of the event (Agent Salient Events). Table 2.1 shows this classification. In the following examples, square brackets are used to denote the extent of an event, curly brackets are used to denote the anchor of an event, while parentheses are used to identify the salient entity.

The goal of the Entity Linking task is to group all references to an entity and its properties together. While an Entity is an object or set of objects in the world that can be referenced by their name, a nominal phrase, or a pronoun, a Composite Entity results from linking an Entity to all attributive mentions of its properties.

All specific and generic entities are linked with the predicates and other attributive mentions that ascribe properties to them. This ensures that each composite entity consists of a set of strings, which either refer to or describe a given entity in text. The following relations are examined for entity linking.

Cross-type Metonymy can happen when a composite entity consists of EDT entities that can be assigned to different EDT types depending on the context. One example is that of ORGs and the FACs they occupy. While in the EDT stage these two aspects are tagged separately (ORG & FAC) depending on context, in this stage entities of different types are grouped together into a composite entity by creating links between them when they refer to different aspects of the same underlying object. For example:

[The White House] announced yesterday that. . .

[John Smith reports from the White House park] . . .

In this example, the first mention of White House is of type EDT.ORG. However, the second mention is of type EDT.FAC. Each of these mentions will be linked, since they evoke different aspects of the same underlying entity.

The goal of the KBP track at the 2009 Text Analysis Conference is to evaluate the ability of automated systems to discover information about named entities and to incorporate this information in a knowledge source [3]. KBP consists of two related tasks: Entity Linking, where names must be aligned to entities in a knowledge base, and Slot Filling, which involves mining attributes of entities from text.

In contrast to ACE, KBP focuses on the following types of entities:

■ Person (PER)

■ Organization (ORG)

■ Geo-Political Entity (GPE)

The description of the KBP scheme does not provide any details regarding the categorization of the top-level types into more specific ones. However, as in the ACE evaluation, GPEs include inhabited locations with a government, such as cities and countries. Wikipedia infoboxes are the basis for the reference knowledge base; an infobox is a data structure that allows the description of a target entity through a set of desired attributes called slots. There is one generic infobox for each entity type.

Table 2.2 shows these generic infoboxes and their slots, which include the attributes of entities. As can be observed, KBP provides a richer scheme than ACE in terms of entity attributes and relations.

On the other hand, ACE provides a clear classification of relation types, which ensures consistency and avoids duplications. In the next section, we present the advantages and disadvantages of using infoboxes as a knowledge representation scheme as opposed to having a fixed set of relations or events. Based on that discussion, we propose an extended version of ACE that includes infoboxes in the next chapter.

As has already been mentioned in the first chapter, NetFlix is a movie rental site that started a competition to improve upon its movie recommendation engine. The movie rating data contain over 100 million ratings from 480 thousand randomly-chosen, anonymous NetFlix customers over 17 thousand movie titles. The ratings are on an integer scale from 1 to 5 stars. The training data consist of one file per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format: CustomerID, Rating, Date.
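The file layout described above can be read with a few lines of code. The following sketch parses one such per-movie file; the sample lines are illustrative values in the documented format, not actual NetFlix data.

```python
from datetime import date

def parse_movie_file(lines):
    """Parse one NetFlix-style training file into (movie_id, ratings).

    The first line holds the movie id followed by a colon; every later
    line is "CustomerID,Rating,Date" as described in the text.
    """
    lines = iter(lines)
    movie_id = int(next(lines).rstrip().rstrip(":"))
    ratings = []
    for line in lines:
        customer_id, rating, day = line.strip().split(",")
        ratings.append((int(customer_id), int(rating), date.fromisoformat(day)))
    return movie_id, ratings

# Illustrative sample in the documented format.
sample = [
    "1:",
    "1488844,3,2005-09-06",
    "822109,5,2005-05-13",
]
movie_id, ratings = parse_movie_file(sample)
# movie_id == 1; ratings[0] == (1488844, 3, date(2005, 9, 6))
```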

In the introductory section we also mentioned that the WePS workshop [4, 5] focused on two tasks, i.e. clustering web pages to solve the ambiguity of search results, and extracting 18 kinds of attribute values for target individuals whose names appear on a set of web pages.

The WePS development data consist of 47 ambiguous names and up to 100 manually clustered search results for each name. The test data consist of 30 datasets, where each one corresponds to one ambiguous name. The sources used to obtain the names were Wikipedia biographies, ACL'08 committee members and US census data. On average, there are 18.64 different people per name, but the predominant person for a given name accounts for half of the documents. A sample cluster set for the target Abby Watkins is given below:

<?xml version="1.0" encoding="UTF-8"?>
<clustering name="Abby Watkins">
  <entity id="7">
    <doc rank="111"/>
  </entity>
  <entity id="2">
    <doc rank="81"/>
  </entity>
  <entity id="0">
    <doc rank="21"/>
  </entity>
  <entity id="14">
    <doc rank="99"/>
    <doc rank="36"/>
    <doc rank="52"/>
  </entity>
</clustering>
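A clustering file of this form maps directly onto a dictionary from entity id to document ranks. The following sketch parses such a file with the Python standard library; the embedded XML repeats the sample above.

```python
import xml.etree.ElementTree as ET

# The WePS-style clustering sample from the text.
xml_text = """<clustering name="Abby Watkins">
  <entity id="7"><doc rank="111"/></entity>
  <entity id="2"><doc rank="81"/></entity>
  <entity id="0"><doc rank="21"/></entity>
  <entity id="14"><doc rank="99"/><doc rank="36"/><doc rank="52"/></entity>
</clustering>"""

root = ET.fromstring(xml_text)
# Map each entity id to the list of search-result ranks in its cluster.
clusters = {
    entity.get("id"): [int(doc.get("rank")) for doc in entity.findall("doc")]
    for entity in root.findall("entity")
}
# clusters["14"] == [99, 36, 52]
```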

For the second task the organizers distributed the target Web pages in their original format (i.e. HTML), and the participants were expected to extract attribute values from each page. The individual names associated with a particular page were given, and the attribute values for that person were to be extracted. Web pages containing multiple individuals sharing the same name were not given. Table 2.3 lists the attributes used in the task and the annotation scheme.

It is apparent that the annotation scheme of ACE provides a rich scheme for the identification, grouping of entities and the discovery of the relations and events, in which they participate. However, ACE does not include a knowledge base, which would enhance its extensibility and modifiability according to the domain or genre of interest.

The extensibility of ACE to specific domains of interest is essential, since it would allow the development of methods focusing on domain-specific threats, such as hooliganism, vandalism, terrorism, and others.

KBP has a significant advantage over ACE's annotation and knowledge representation scheme in that it can be easily extended. This is a consequence of the use of Wikipedia infoboxes, in which one can introduce new slot names describing attributes, relations, or events related to an entity.

However, infoboxes are not the ideal representation scheme, since they can introduce duplication and loss of integrity. This is arguably a major problem with this representation. The following example illustrates it:

In Table 2.4, we observe that although both of the entities refer to presidents of the United States, the corresponding infoboxes differ in two slots, i.e. education-University Degree, and website-URL, even though each slot pair refers to the same underlying concept. For example, education-university degree refers to the education someone has received, while website-URL refers to his/her official website. This inconsistency has been caused by the same property that offers extensibility, i.e. the ability to add new slot names to the created infoboxes.
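One way to mitigate this inconsistency is to map variant slot names onto canonical ones before comparing infoboxes. The sketch below is illustrative only: the alias table and the infobox contents are invented examples, not entries from Table 2.4.

```python
# Hypothetical alias table: "university degree" and "education", and
# "url" and "website", each denote one underlying property.
CANONICAL_SLOTS = {
    "education": "education",
    "university degree": "education",
    "website": "website",
    "url": "website",
}

def normalise_infobox(infobox):
    """Rewrite slot names to canonical ones so equivalent infoboxes align."""
    return {
        CANONICAL_SLOTS.get(slot.lower(), slot.lower()): value
        for slot, value in infobox.items()
    }

# Two illustrative infoboxes using different slot names for the same concepts.
a = {"Education": "University A", "Website": "example.org"}
b = {"University Degree": "University B", "URL": "example.org"}
# After normalisation both expose the slots "education" and "website".
```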

To summarize, this section has provided an overview of the annotation & knowledge representation schemes used in ACE, KBP, NetFlix and WePS. It is apparent that ACE provides a rich scheme, which however is not easily extensible or modifiable, as it lacks structural relationships between objects of interest, while at the same time a knowledge representation scheme is absent.

In contrast, KBP is essentially a subset of ACE in terms of annotation. However, KBP uses a knowledge base and Wikipedia infoboxes as a means to represent knowledge. This allows for an easily extensible and modifiable scheme, yet it introduces duplications and does not ensure integrity. In the next section, we aim to overcome the above limitations by proposing an extended annotation scheme based on ACE, which includes the use of an ontology.

In this chapter we outline the deficiencies of KBP and ACE, proposing an extension of the ACE annotation as the WP4 annotation and knowledge representation scheme. We also argue that a clear and consistent ontology design is a necessity for any application that requires sophisticated search and reasoning, and for overall efficiency in knowledge management. We propose the use of the Proton ontology [8]; this choice was motivated by the fact that Proton already conforms to the ACE annotation guidelines.

D4.1 focuses on the analysis of security-related data from websites, blogs, chats and other social media. The project aims to analyse data related to hooliganism, terrorism and other types of crime. The AGH team (Prof. Wieslaw Lubaszewski) has initiated the task of data collection. This section describes the ongoing effort and the methodology employed. It does not include the actual data, as this is currently being collected. The current effort is directed towards collecting data on football hooliganism and the sale of human organs. In parallel, the Ostrava team (Mr Adam Nemcek) has also started work on data collection on similar topics.

The current data collection activity follows this methodology:

• Only highly relevant data will be collected, to ensure that machine learning systems trained on the data will not be swamped by noise.

• The data will be multi-lingual, covering a number of different languages. Currently, data is being collected in Polish and Czech.

• Specialised crawlers will be used to support the two points above and to lessen the need for manual filtering. Both Ostrava and AGH have already built their specialised crawlers. In addition, open source crawlers are also available.

• A subset of the collected data will be annotated using the annotation scheme described in this report. This annotation will be detailed, in that it will identify all relevant potential threats, the participants, the locations, the times, and the connections between the entities involved. End users (i.e. the police) will be asked to verify the correctness of the annotation where necessary.

Data from websites, blogs and social networks, especially user forums, do not always follow strict HTML standards. Such data is usually ill-formed and requires cleaning and preprocessing before it becomes usable by any natural language processing pipeline. However, manual cleaning of such data is neither feasible nor acceptable, as the NLP systems developed within the project need to be robust enough to handle such data.

For the above reasons, we propose to employ standard supervised machine learning methods to automatically convert ill-formed data into a well formed corpus.

Italian officials say two train cars filled with liquefied natural gas have derailed and exploded in western Italy, killing at least 14 people.

</p>

</body> </html>

It can be observed that the tags Sender, Date, Time, Text have been inserted in the HTML code to allow the recognition of entities, dates, and text within a blog entry.

The above pair constitutes a single training example. A number of such pairs will be collected to form a training set for a supervised machine learning system such as an SVM. The task of the SVM is to predict the location of different tags (e.g. sender, recipient, posting date, etc.) at specific points in the input. This can be formulated as a binary decision problem. A separate SVM will need to be trained for each tag, and the SVMs can be used in a pipeline to generate all the missing tags.
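The per-tag binary formulation above can be sketched with scikit-learn. This is only a minimal illustration of the training step for one tag's classifier: the feature set and the toy examples are invented for the sketch, and a real system would use far richer features and data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    # Minimal context features for position i; illustrative only.
    return {
        "token": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_digit": tokens[i].isdigit(),
    }

def train_tag_detector(examples):
    """Train one binary SVM for one tag: does the tag open at this token?

    `examples` is a list of (tokens, labels) pairs, where labels[i] is
    True at positions where the tag should be inserted.
    """
    X, y = [], []
    for tokens, labels in examples:
        for i in range(len(tokens)):
            X.append(token_features(tokens, i))
            y.append(labels[i])
    model = make_pipeline(DictVectorizer(), LinearSVC())
    model.fit(X, y)
    return model

# Toy training data for a hypothetical <Date> tag detector.
examples = [
    (["Date", ":", "2009"], [True, False, False]),
    (["From", ":", "Bob"], [False, False, False]),
]
model = train_tag_detector(examples)
```

One such model would be trained per tag, and the trained models applied in sequence to insert all missing tags.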

The aim of the new annotation scheme is to build upon the strengths of the ACE annotation scheme and the KBP annotation & knowledge representation scheme. As mentioned in section 2.2, ACE provides a clear classification of relation types, which ensures consistency and avoids duplications.

This should be the primary characteristic of the new annotation scheme. Secondly, the ACE annotation already defines a subclass relationship. The Wikipedia infoboxes which are used as the knowledge base for KBP are sets of subject-attribute-value triples that list the key aspects of the article's subject. However, using infoboxes as a knowledge representation scheme has the following disadvantages:

■ Multiple templates exist for the same class

■ Multiple attribute names exist for the same property

■ Attributes lack domains or datatypes

However, the infobox classes and attributes can be mapped to the corresponding ACE entity and relation annotations, so we can view KBP as a subset of ACE. For the purposes of this project, combining the good features of both annotation schemes seems to be the way ahead.

ACE has clearly defined guidelines for events, which the KBP annotation does not address, while the infoboxes can be easily extended. However, there is no clear ontology defined by either scheme, so an ontology based upon the ACE annotation scheme should be implemented.

A better-defined ontology is necessary for the following reasons:

Query capabilities

One key advantage of using an ontology is that we can go beyond keyword queries and ask SQL-like queries. Ontologies allow various kinds of inference mechanisms, such as transitivity, to support sophisticated queries. For example, to answer Which person got the golden boot at the 2006 FIFA world cup? we might have to realize that footballer is a subclass of person.
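The subclass reasoning involved can be made concrete with a toy transitive-closure check: a query for "Person" should also match entities typed with any subclass of person. The class hierarchy and entity names below are invented for illustration, not taken from any ontology file.

```python
# Hypothetical subclass hierarchy (child -> parent).
SUBCLASS_OF = {"Footballer": "Athlete", "Athlete": "Person", "Person": "Agent"}

def is_a(cls, ancestor):
    """Walk up the hierarchy transitively: is `cls` a subclass of `ancestor`?"""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False

# Hypothetical typed entities extracted from text.
entities = {"Miroslav Klose": "Footballer", "FIFA": "Organization"}

def query(ancestor):
    """Return all entities whose type is (transitively) a subclass of `ancestor`."""
    return [name for name, cls in entities.items() if is_a(cls, ancestor)]

# query("Person") matches the Footballer via Footballer -> Athlete -> Person.
```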

An ontology can be a catalyst for the acquisition of further knowledge and for the largely automated maintenance and growth of the knowledge base. As knowledge is ever changing and growing, automated extension is a desirable characteristic which could be built into the ontology. There are many existing ontologies that use automated knowledge acquisition or extension based on the current ontology relations. For example, the Gene Ontology (GO) [6] generates more detailed concepts from existing GO concepts by utilizing syntactic relations among the existing concepts.

For example, the hyponymy relation between the two concepts chemokine binding and C-C chemokine binding can be inferred from the hyponymy relation between the subconcepts chemokine and C-C chemokine. In other words, one way to expand an ontology is to build upon the relationships between the terms in the existing ontology, based on syntactic, dependency and semantic information extracted from the original text containing these terms.

Another such example is the CROSSMARC [7] ontology, in which new instances of the existing concepts are learned from a domain-specific corpus using machine learning approaches. Initially, the domain-specific corpus is annotated with existing concept instances automatically, using the existing ontology. To identify new instances, a single Hidden Markov Model (HMM) is trained for each set of instances of a particular concept. The HMM parameters are calculated from the annotated domain-specific corpus using maximum likelihood estimation. Simply put, the HMM learns the context in which the instances occur and uses it to detect new instances belonging to the training instance concept.
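The maximum likelihood estimation step mentioned above amounts to counting tag transitions and word emissions in the annotated corpus and normalising the counts. The sketch below illustrates this under simplifying assumptions (no smoothing, lowercased words, a generic "O" background tag); tag and word values are invented.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Maximum-likelihood HMM parameters from a concept-annotated corpus.

    Each sentence is a list of (word, tag) pairs; tags mark concept
    instances, with "O" for background tokens. Smoothing is omitted.
    """
    transitions, emissions, tag_totals = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            transitions[(prev, tag)] += 1
            emissions[(tag, word.lower())] += 1
            tag_totals[tag] += 1
            prev = tag
    # Normalise counts into conditional probabilities.
    trans_prob = {
        (p, t): c / sum(c2 for (p2, _), c2 in transitions.items() if p2 == p)
        for (p, t), c in transitions.items()
    }
    emit_prob = {(t, w): c / tag_totals[t] for (t, w), c in emissions.items()}
    return trans_prob, emit_prob

# Tiny illustrative corpus annotated with a hypothetical BRAND concept.
corpus = [
    [("Canon", "BRAND"), ("camera", "O")],
    [("Nikon", "BRAND"), ("lens", "O")],
]
trans, emit = estimate_hmm(corpus)
```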

An ontology can also be an enabler for semantic search on the web, for detecting entities and relations in web pages, and for reasoning about them in expressive logics. For example, probabilities can be attached to the concepts and properties during ontology building, thus allowing us to reason using probabilistic logic. This kind of extension alleviates the problem of reasoning when only partial information about a concept or instance exists in the ontology, and allows reasoning with partial and imprecise information.

The central idea is to create an ontology compatible with the ACE annotation. This can be done if the top layer of the ontology reflects the entities defined in ACE. In addition to the ACE entities, the top layer also contains a separate class for events and other properties we are interested in, such as TimeInterval to denote a timestamp.

An ontology that satisfies the above requirements is the Proton ontology. It was developed to be compliant with, among others, the ACE annotation scheme. Proton [8] is divided into four modules: system, top, upper and knowledge management. Figure 3.1 shows the four modules with the classes they contain.

The knowledge management module contains 38 classes of slightly specialized entities that are specific to typical knowledge management tasks and applications.

As can be observed in Figure 3.1, all of the ACE entity types are incorporated in the top module. Proton was designed to be general purpose and domain independent. The top layer starts with very basic entity classes:

These are further specialized into generally defined entities: meetings, military conflicts, employment (job) positions, commercial, government, and other organizations, people, and various locations. It also covers numbers, time, money, and other specific values.

Additionally, the featured entity types have their characteristic attributes and relations defined for them (e.g. the subRegionOf property for Location-s, hasPosition for Person-s, locatedIn for Organization-s, hasMember for Group-s, etc.). Specialization of the classes is achieved with the help of the upper layer, for example mountain as a specific type of location and user as a subclass of agent. Separating the ontology into two layers allows for domain-specific extensions.

The top module contains the most general classes, as per the requirements of the project. The subclasses of these classes belong to the upper module. For example, the top class Happening includes the subclasses Event and Situation; Situation is further specialized into JobPosition and Role. Figure 3.2 shows the top module classes.

The design is object oriented: subclasses inherit the properties of their superclasses. For example, Person inherits properties from Agent and Object. Apart from the inherited properties, it also has its own, such as hasPosition (which relates a Person to a JobPosition) and hasRelative (which relates a Person to another Person). The hasRelative relationship is of a bidirectional many-to-many type. Figure 3.3 shows the specialization of this relationship.
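The inheritance behaviour described above can be sketched as follows. Only hasPosition and hasRelative come from the text; the other property names are hypothetical placeholders:

```python
# Each class lists its direct superclasses and its own properties;
# the effective property set is the union over the superclass chain.
CLASS_DEFS = {
    "Object": {"supers": [], "props": {"partOf"}},          # placeholder property
    "Agent": {"supers": ["Object"], "props": {"involvedIn"}},  # placeholder property
    "Person": {"supers": ["Agent"], "props": {"hasPosition", "hasRelative"}},
}

def effective_properties(cls):
    """Collect own and inherited properties, object-oriented style."""
    props = set(CLASS_DEFS[cls]["props"])
    for sup in CLASS_DEFS[cls]["supers"]:
        props |= effective_properties(sup)
    return props

# hasRelative is bidirectional many-to-many: store both directions.
relatives = set()

def add_relative(a, b):
    relatives.add((a, b))
    relatives.add((b, a))

add_relative("PersonA", "PersonB")
print(sorted(effective_properties("Person")))
```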

The PROTON architecture also incorporates events into the top layer. The top-layer class Happening has Event, Situation, TimeInterval and JobPosition as its subclasses, and thus covers both static happenings (Situation) and dynamic ones (Event).

Dynamic events include subclasses like Accident, MilitaryConflict and SportsEvent. Static happenings include holding a position, such as board member or manager. The rationale is that both static and dynamic happenings have a temporal marker associated with them; for example, a sports event has a start time and an end time. Building the ontology in this way allows a user to search, for example, for all U.K. prime ministers before 1980. The knowledge management module further contains specializations that are specific to knowledge management tasks. For example, the top class Agent contains the InformationSource subclass, which belongs to the knowledge management module. The instance e-commerce of InformationSource contains a collection of documents relating to activities and entities concerning electronic commerce.
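Because every happening, static or dynamic, carries a time interval, temporal queries become straightforward. A toy sketch of the "all U.K. prime ministers before 1980" query mentioned above (the tenure data here is illustrative; the tuple representation is our own simplification of JobPosition instances):

```python
# Each JobPosition instance is a static happening with a time interval
# (holder, role, start year, end year).
positions = [
    ("Harold Wilson", "UK Prime Minister", 1974, 1976),
    ("James Callaghan", "UK Prime Minister", 1976, 1979),
    ("Margaret Thatcher", "UK Prime Minister", 1979, 1990),
]

def holders_before(role, year):
    """Return everyone whose tenure in `role` started before `year`."""
    return [holder for holder, r, start, _end in positions
            if r == role and start < year]

print(holders_before("UK Prime Minister", 1980))
```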

This section provides an example of the WP4 annotation and knowledge representation scheme on three texts extracted from the web. The first text fragment is a weblog on hooliganism; the second is a news report on violent events between hooligans of UK football teams, while the third is a partial transcript of a conversation between terrorists.

The goal in this section is to demonstrate the feasibility of extending the ACE annotation scheme and the associated ontology to new genres. In the following example, entities and their corresponding references are annotated with their corresponding ACE tag.

A new subclass of Event, say Hooliganism, can be introduced to handle events related to hooligan activities. Hooliganism can be further specialized into subclasses indicating the seriousness of the event, for example: minor, severe, critical and others. Let us assume that the events E1 to E3 are less harmful than E4 to E7, since the agents in E1 to E3 did not actually execute their attack. Their corresponding types are shown in Table 4.1.

The extension of the ontology is straightforward, since PROTON already defines an Event class. Hooliganism and its subclasses (children) can be added under the Event node in PROTON. This means that the top layer remains the same, while the new subclasses can be directly added to the upper layer of the PROTON hierarchy.
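This extension step can be sketched with a simple parent-pointer representation of the upper layer. All class names except Event are the hypothetical additions discussed above:

```python
# Upper-layer extension: new event subclasses attach under the existing
# top-layer Event class; the top layer itself is untouched.
upper_layer = {"Event": None}  # Event is the top-layer attachment point

def add_subclass(hierarchy, new_cls, parent):
    """Attach a new class under an existing parent, rejecting unknown parents."""
    if parent not in hierarchy:
        raise KeyError(f"unknown parent class: {parent}")
    hierarchy[new_cls] = parent

add_subclass(upper_layer, "Hooliganism", "Event")
for severity in ("MinorHooliganism", "SevereHooliganism", "CriticalHooliganism"):
    add_subclass(upper_layer, severity, "Hooliganism")
```

Rejecting unknown parents is what prevents the duplication and integrity problems that a flat, schema-less scheme would allow.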

Additionally, each of the identified event types in the ontology can be assigned a variety of attributes, which indicate, for example, whether an event was completed or not, or weights illustrating the severity of the event. The latter allows the development of methods which rank or filter events according to their severity.

Omar Khyam [PER.Individual]: We had five Bengalis [GPE.NAT] last year. Guess how we [PER.Group] got them [GPE.NAT] in. From Bangladesh [GPE.NAT] all the way across India [GPE.NAT] into Pakistan [GPE.NAT]... we [PER.Group] bribed the guy [PER.Individual]. You know when you [PER.Individual] go to the check-in, it would all be set up.

Omar Khyam [PER.Individual]: Yeah, just walk straight through bruv normal, just act as if you are a Pakistani [GPE.NAT].

Shazad Tanweer [PER.Individual]: I live in Faisalbad [GPE.NAT]

Omar Khyam [PER.Individual]: That's not a problem

Omar Khyam [PER.Individual]: All right bruv [PER.Individual]. Get your parents to pick you up. Or your family ... And that way you will breeze through the airport seriously. Even if they [ORG.GOV] are following you [PER.Individual] - it doesn't really count. Chill out, proper chill out ... until we [PER.Group] contact you and then we'll pick you [PER.Individual] up.

o E1: [Guess how we {got} them {in}. From Bangladesh all the way across India into Pakistan]

o E2: [Guess how we {got} them {in} From Bangladesh all the way across India into Pakistan]

o E3: [We {bribed} the guy.]

o E4: [We {bribed} the guy.]

o E5: [when you {go to the check-in}, it would all be set up.]

o E6: [Even if they {are following} you]

o E7: [Even if they {are following} you]

Event ID   Event Type
E1         Transportation.Illegal
E2         Transportation.Illegal
E3         FinancialTransaction.Illegal
E4         FinancialTransaction.Illegal
E5         Transportation.Legal
E6         LawEnforcement.Tracking
E7         LawEnforcement.Tracking

Table 4.2 Event types for identified events

In the same vein as in the previous example, we can extend our ontology with different types and subtypes of events. For example, it is apparent that the first two events refer to an illegal transportation, the next two refer to illegal financial transactions, the fifth refers to a legal transportation, and the last two refer to law enforcement activities. The corresponding event types which extend the PROTON ontology are shown in Table 4.2.

Since the ontology we propose is based on the ACE annotation scheme, we can easily incorporate publicly available datasets whose annotations either are directly compliant with ACE or can be mapped to it. In cases where a new dataset introduces a new entity type, it can be plugged into the most relevant position in the ontology hierarchy. Suppose we did not have a specific category for the entity "asteroid". Since the ontology design starts very general (system module) and spans to specific instances (upper module), we could still plug "asteroid" in as an instance of the "object" class. In the worst case, we can fit any new entity under the "entity" class in the system module.
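The fallback placement strategy described above can be sketched as follows (the hierarchy and candidate parent lists are illustrative, not the full PROTON class tree):

```python
# From most to least specific: attach a new entity type under the first
# candidate parent that exists in the ontology; "Entity" always exists.
CLASS_TREE = {"Entity": None, "Object": "Entity", "Product": "Object"}

def place_new_type(hierarchy, new_type, preferred_parents):
    """Attach new_type under the first known parent, else under Entity."""
    for parent in preferred_parents:
        if parent in hierarchy:
            hierarchy[new_type] = parent
            return parent
    hierarchy[new_type] = "Entity"
    return "Entity"

# No "CelestialBody" class exists, so "Asteroid" falls back to "Object".
print(place_new_type(CLASS_TREE, "Asteroid", ["CelestialBody", "Object"]))
```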

•NetFlix mapping

Regarding NetFlix, we can view the data as stating that X commented on movie Y. In the ontology, a movie falls under the class Movie, a subclass of MediaProduct, which is itself a subclass of Product. The comment, meanwhile, is a static happening, since it was given at a specific time by X.
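A sketch of this mapping for a single, hypothetical rating record follows. The property names `performs`, `about`, `hasRating` and `hasTimeInterval` are our own illustrative labels, not defined PROTON properties:

```python
def netflix_record_to_statements(user_id, movie_title, rating, date):
    """Map one rating record to PROTON-style statements (a simplification)."""
    event = f"Comment/{user_id}/{movie_title}"
    return [
        (movie_title, "rdf:type", "Movie"),      # Movie < MediaProduct < Product
        (event, "rdf:type", "Situation"),        # the comment is a static happening
        (event, "hasTimeInterval", date),        # every happening is time-stamped
        (event, "hasRating", rating),
        (user_id, "performs", event),
        (event, "about", movie_title),
    ]

statements = netflix_record_to_statements("user42", "SomeMovie", 4, "2005-09-01")
```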

•WePS mapping

Similarly, with the WePS dataset we can view the annotation as stating that X owns document Y. Since the documents are clustered per name, we can model each person as owning the corresponding web page (an artifact).

•KBP mapping

Finally, as we have already mentioned, KBP can be considered as a subset of ACE, hence the infobox name slots can be easily mapped to their corresponding events or relations included in PROTON.
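A sketch of such a slot-to-relation mapping follows. The slot names and their targets are illustrative examples, not the full KBP inventory or an official correspondence table:

```python
# Illustrative mapping from KBP infobox slot names to PROTON-style
# relations; unknown slots fall back to a generic placeholder relation.
SLOT_TO_RELATION = {
    "per:spouse": "hasRelative",
    "per:employee_of": "hasPosition",
    "per:cities_of_residence": "locatedIn",
}

def map_slot(slot):
    """Resolve a KBP slot to an ontology relation, with a generic fallback."""
    return SLOT_TO_RELATION.get(slot, "relatedTo")

print(map_slot("per:spouse"))
```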

It is well understood that the KBP annotation has problems regarding consistency and clarity in its definitions. The ACE annotation, in contrast, has the clean and consistent design we are looking for. However, the knowledge we are trying to maintain can change and evolve over time, so an extensible framework for knowledge representation is essential. A multi-layered ontology such as PROTON seems to be the way forward. An ontology additionally allows the use of more powerful and expressive queries.

This report has provided a thorough overview of the current state of the art in annotation schemes employed for the identification of entities and the attributes that characterize them. The survey part focused on the annotation schemes used in publicly and under-licence available datasets.

In particular, we presented the ACE scheme, which annotates a number of different entity types, the relations between them, and the events in which they participate. Following that, we presented the KBP annotation and knowledge representation scheme, which in terms of annotation coverage can be considered a subset of ACE. Additionally, two smaller annotation schemes were discussed, i.e. NetFlix and WePS-2.

Based on the critical survey we proposed a new annotation & knowledge representation scheme that extends ACE, so that the new annotation scheme has the following properties:

•It is extensible, in order to fit the requirements of the project.

This is particularly useful in the early stages of the project, where the requirements are not fully specified.

Extensibility is achieved by using an ontology, which allows the addition of new entities, relations, and events, while at the same time avoiding duplication and ensuring integrity (as opposed to the KBP scheme).
