RATIO

Hackathon 8-10 October 2018

Results

1 Background of the Unconference/Hackathon

During the SPP RATIO Kick-Off conference held in April this year, it was proposed to organize a group meeting in which the members of the RATIO projects would have the opportunity to discuss ideas and work together on concrete problems and challenges. The idea was to have a common dataset on which to perform different tasks and, at the same time, a common basis for discussion. The shared tasks should not be too specific, but clear enough to be reusable by different groups.

In response to that initiative, a three-day event (8-10 October) was planned at CITEC in Bielefeld. The intention of this event was to start working on tasks that could be of interest to several groups in the RATIO program, and to lay the basis for organizing shared tasks that would give visibility to the RATIO program.

The event was conceived as an “unconference” that gave space for discussion and brainstorming, and that could become a “hackathon” if some groups decided to actually implement their ideas.

The initial proposal for the shared task was focused on the ranking of argument pairs according to six argument quality dimensions and the relations among them: which argument is stronger, more convincing, more believable, more important, clearer to understand and better justified? The plan was to release the corresponding dataset before the event.

In order to collect the data for this task, two crowdsourcing1 questionnaires were designed in which pairs of arguments with the same stance towards a given topic were presented. The arguments were taken from the microText corpus. Each questionnaire contained fifty argument pairs: one included only pro argument pairs and the other only con argument pairs. The raters were asked to select which of the two arguments was “better” in each dimension (i.e., either argument1 or argument2). Each questionnaire was rated three times (i.e., three judgments in crowdsourcing terms)2. Inter-rater agreement was calculated with Fleiss' kappa coefficient. However, no agreement was found among the three ratings, so it was decided not to use that dataset.
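The report does not show how the agreement coefficient was computed. As an illustration only, Fleiss' kappa for such repeated pairwise judgments can be computed from a matrix that counts, for each item, how many raters chose each category; the function below is a sketch of the standard formula (the function name and matrix layout are our own, not part of the original setup):

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for a matrix of shape (n_items, n_categories),
    where ratings[i, j] is the number of raters who assigned item i
    to category j. Every item must be rated by the same number of raters."""
    n_items, _ = ratings.shape
    n_raters = ratings.sum(axis=1)[0]
    # Proportion of all assignments that fall into each category
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    # Observed agreement per item
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()          # mean observed agreement
    p_e = np.square(p_j).sum()  # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

With three raters per argument pair and two categories (argument1 vs. argument2), a kappa near zero or below, as apparently happened here, indicates agreement no better than chance.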

Unconference topic: “Same Side Classification”

As a result of the lack of agreement on the previously suggested dataset, a new topic for the unconference was chosen: "Same Side Classification". Given a pair of arguments towards a claim, can we decide whether they are on the same side, i.e., whether both are pro or both are con the claim?

Dataset: The args.me corpus

A larger corpus composed of around 1,500 debate topics from debatewise.org was provided by Benno Stein and Henning Wachsmuth from their project args.me. Pairs of arguments were formed from the arguments of the args.me corpus. For each pair, the topic is given alongside two arguments on that topic, together with a label stating whether the arguments are on the same side or not (i.e., whether the two arguments have the same stance). The resulting dataset has around 12,000 argument pairs "on the same side" and around the same number of argument pairs NOT "on the same side". A similar dataset was formed from the arguments of the microText corpus. The args.me and microText datasets, and a script for testing classifiers using k-fold cross-validation (80-20), were provided.
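The exact pairing procedure is not detailed in the report. As a hypothetical sketch, labeled same-side/different-side pairs can be derived from a stance-annotated corpus as follows (the field names `topic`, `text` and `stance` are assumptions for illustration):

```python
from itertools import combinations

def build_pairs(arguments):
    """arguments: list of dicts with keys 'topic', 'text' and 'stance'
    ('pro' or 'con'). Returns (topic, text_a, text_b, same_side) tuples
    for every pair of arguments that address the same topic."""
    by_topic = {}
    for arg in arguments:
        by_topic.setdefault(arg["topic"], []).append(arg)
    pairs = []
    for topic, args in by_topic.items():
        for a, b in combinations(args, 2):
            pairs.append((topic, a["text"], b["text"],
                          a["stance"] == b["stance"]))
    return pairs
```

In practice, the resulting pairs would likely be subsampled to obtain the roughly balanced ~12,000/~12,000 split described above.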

Resources

Different resources were used for internal communication among the participants, such as:

Slack: for posting documents and messages in different communication channels.

GitLab: the RATIO project space set up in the GitLab of Bielefeld University was used as a repository of documents and code.

E-mail lists: for communicating with all members via e-mail.

Google docs: some groups used them as a means of working on shared documents and slides.

2 Description of the Event

First day

On Monday the 8th of October, the "unconference/hackathon" started at 9:00 am with a short welcome note to the participants by Prof. Philipp Cimiano. Then, the participants introduced themselves and briefly mentioned their areas of interest. In the next session, the proposed topic for the task and the available data were described by Olivia Sanchez-Graillet and Christian Witte in order to frame the tasks. Henning Wachsmuth kindly supported us with his expertise, providing more details on the args.me corpus, which had been shared before the event via the above-mentioned repository, and on one of his papers, which was used by the argument quality group.

We continued with the organization of the working groups. After deliberation, four groups were suggested: 1) baselines for the "same side classification" task, 2) argument representation and frameworks, 3) argument quality, and 4) argument retrieval. However, since there were only two people in the argument retrieval group, those two participants decided to join other groups. The final groups were therefore: 1) baselines for the "same side classification" task (five members), 2) argument representation and frameworks (nine members) and 3) argument quality (ten members). Afterwards, each group started working in the rooms assigned to them; all participants met only during the coffee breaks and lunch. We worked until 5:30 pm on that day.

Second day

We continued working in groups during the morning and afternoon. On that day we had our social events: first we visited the Computermuseum in Paderborn, afterwards we took a short walk through the Bielefeld city center, and finally we had dinner in a pizza restaurant in Bielefeld. We had a nice evening together!

Third day

In the morning we had the group presentation session, in which each group had 20 minutes to report their work. After the coffee break we met again in our groups in order to establish the next steps of our work. Finally, after lunch, we had the closing session in which each group explained the possible next steps of their work.

As next steps, the baseline group proposed to write a paper about the task carried out, to define a reasonable baseline as well as some related approaches, and to establish milestones for that work. The representation group expressed their interest in writing a survey that could be published. Finally, the group working on the quality of arguments suggested the development of a platform for argument quality evaluation. The platform could be developed as a shared effort of members of some of the projects participating in RATIO; coordination of tasks and provision of the infrastructure necessary to host the platform would be required.

In this session, we also commented on general issues about the organization of this event as well as the organization of future events, such as a summer school in argumentation technology and the annual RATIO meeting. Other suggestions were: instead of calling the event a hackathon, call it an unconference, a "project meeting" or something more appropriate to its purpose; start later on the first day of the event, considering the travel time to Bielefeld; have shorter presentations and smaller working groups; and stick to the given topic rather than changing it a few days before the event. For future meetings, it might be convenient to have a common site where participants can suggest working tasks for the event, form groups and join them before the event takes place, so that participants know the topic and related tasks in advance and can prepare and bring research questions and ideas to the meeting.

At the end of the session we held a vote to select the best group work and gave prizes to the winners.

A more detailed summary of the results of each working group is provided below. The slides of the group presentations are also available in the Slack channel, in GitLab and at Hackathon_Slides.

3 Summary of the group reports and closing session

The "Baselines for the same side classification task" Group

Some baselines for predicting whether two arguments are on the same side or not were implemented. An SVM was used on the dataset provided for the task. The method used simple semantic, lexical and syntactic features (e.g., semantic features: sentiment and named entities; lexical features: token and lemma 1-3-grams; syntactic features: POS 1-3-grams).

Baselines with and without the topic were compared. The best results were achieved when considering the topic and using syntactic and lexical features: an F1 score of 0.57 and an accuracy of 0.61.
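The group's actual implementation is not included in this report. As an illustration only, a minimal baseline in the spirit described, encoding the topic together with both arguments and using lexical 1-3-gram TF-IDF features with a linear SVM via scikit-learn, might look like this (the toy pairs and the `[SEP]` marker are our own assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the args.me pair data: (topic, arg_a, arg_b, same_side).
pairs = [
    ("uniforms", "Uniforms reduce bullying.", "They level social differences.", 1),
    ("uniforms", "Uniforms reduce bullying.", "Uniforms limit self-expression.", 0),
    ("zoos", "Zoos protect endangered species.", "Breeding programs save animals.", 1),
    ("zoos", "Zoos protect endangered species.", "Captivity harms animal welfare.", 0),
]

# Join topic and both arguments into one string; a separator token keeps
# the three parts from bleeding into each other's n-grams.
X = [f"{t} [SEP] {a} [SEP] {b}" for t, a, b, _ in pairs]
y = [label for *_, label in pairs]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),  # lexical 1-3-gram features
    LinearSVC(),
)
model.fit(X, y)
predictions = model.predict(X)
```

The reported baseline additionally used sentiment, named-entity and POS n-gram features, which would be concatenated to the TF-IDF matrix (e.g. with a `FeatureUnion`), and was evaluated with the provided k-fold cross-validation script rather than on the training pairs.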

Next steps:

Describe the task in more detail

Improve the results

Publish it

The "Argument Representation and Frameworks" Group

The group reviewed different argumentation frameworks and analyzed two possible options for argument representation. They proposed a feature-based analysis of arguments that considered Ethos, Pathos and Logos.

The group concluded that the representation of arguments, argumentation frameworks and argumentation tools depends on the discipline, objects of investigation, goals and research interests. The baseline systems for argument analysis can be enhanced with additional features and background knowledge.

Next steps:

Implementation of the approach for argument analysis and stance detection: selecting an appropriate framework, experimenting with different representations as input, and enriching the data by adding other features or knowledge bases.

Evaluation of the appropriateness of the chosen framework and the performance of the representations.

Investigate whether there could be any improvement from using the derived information over the baseline system.

Publish the results.

Coordination would be needed for:

Finding and collecting the requirements and features needed for argument representation, and reviewing existing representation projects, including their strengths and weaknesses.

Moderation of the discussion.

Reviewing the argumentation from different disciplines.

Creating a common vocabulary/glossary of terms/concepts used in the argumentation domain.

The "Argument Quality" Group

Different argument quality dimensions were analyzed based on the taxonomy proposed by Wachsmuth et al. (2017). The three main argument quality aspects, each associated with several quality dimensions, are: a) logical quality in terms of the cogency of an argument; b) rhetorical quality in terms of the persuasive effect of an argument or argumentation; and c) dialectical quality in terms of the reasonableness of the argumentation for resolving issues. The quality dimensions and sub-dimensions associated with these aspects are:

Cogency: local acceptability, local relevance and local sufficiency.

Effectiveness: credibility, emotional appeal, clarity, appropriateness and arrangement.

Reasonableness: global acceptability, global relevance and global sufficiency.

From this analysis, the group concluded that the development of a platform for argument quality evaluation would be of great interest for the argumentation technology community. The platform could be developed as the shared effort of those members of the different projects of RATIO who would like to work on it. For that purpose, coordination in terms of organization and assignment of tasks and project follow-ups would be required, together with the provision of the infrastructure necessary to host the platform and to carry out its further maintenance.

Some ideas for initial features of the platform are:

The platform would provide a framework for evaluating the different argument quality dimensions3 of the given argument datasets.

The users would be able to provide their data and classifiers to the platform.

The users could upload their classifiers to predict different types of argument quality dimensions, and these would be evaluated on a test dataset that resides in the platform and is unknown to the users.

It was suggested to review other existing related platforms such as TIRA (by Benno Stein's group) and argugrader (by Chris Reed et al.). The former is used for the evaluation of shared tasks, whilst the latter performs analysis and grading of arguments.

2 In the crowdsourcing platform used, one judgment can be made by different people. In this experiment, the permanence of a rater in the main task depended on her performance on the quality test questions.

3 Different types of argumentation and argumentation granularities should also be considered.