ArgMining22-Shared-Task

Shared task predicting validity and novelty

View the Project on GitHub phhei/ArgsValidNovel

Shared task on predicting validity and novelty of arguments

In recent years, there has been increased interest in understanding how to assess the quality of arguments systematically. Wachsmuth et al. proposed a framework for quality assessment consisting of the following top dimensions: logic, rhetoric, and dialectic. Regarding the dimension of logic, there has been some work to assess the quality of an argument or conclusion automatically.

Recently, there has also been interest in the generation of conclusions or arguments. To guide the process of automatically generating a conclusion, our assumption is that we need metrics that can be computed automatically to estimate the suitability and quality of a given conclusion. One important objective is that the conclusion is valid, that is, that the conclusion “follows” from the premise. At the same time, it is easy to produce conclusions that “follow” from the premise simply by repeating (parts of) the premise in the conclusion, trivially generating a “valid” but vacuous conclusion. It is therefore important to assess whether conclusions/arguments are not only valid, but also novel.

We define validity as requiring the existence of logical inferences that link the premise to the conclusion. In contrast, novelty requires the presence of novel premise-related content and/or a combination of the content in the premises in a way that goes beyond what is stated in the premise. Hence, a conclusion that is valid but not novel could be a repetition, a paraphrase, or a summary of the premise; only a novel conclusion offers a piece of information that extends what is already covered by the premise – whether it supports or contests the premise.


Tasks

We divide the task of Validity-Novelty-Prediction into two subtasks.

  1. Task A: The first subtask is a binary classification task along the dimensions of validity and novelty, classifying a conclusion as being valid/novel or not given a textual premise.
  2. Task B: The second subtask consists of a comparison of two conclusions in terms of validity / novelty.

Participants can choose whether to address Task A or Task B, or both.

Subtask A: binary novelty-validity-classification

Given a premise and a conclusion in natural language, the task is to predict:

  1. whether the given conclusion is valid with respect to the premise
  2. whether the given conclusion is novel with respect to the premise

Hence, we expect two binary decisions as output.

Example: Vegetarianism

Premise: The notion of man’s dominion over animals need not be thought of as a blank check for man to exploit animals. Indeed, it may be appropriate to connect the notion of “dominion” to “stewardship” over animals. Yet, humans can be good stewards of animals while continuing to eat them. It is merely necessary that humans maintain balance, order, and sustainability in the animal kingdom. But, again, this does not require the abandonment of meat-eating.

| Conclusion | Validity | Novelty |
|---|---|---|
| Two-party systems are more stable | no | no |
| Man’s “dominion” over animals does not imply abandoning meat. | yes | no |
| The idea of “dominionism” is unnecessary. | no | yes |
| Dominion over animals can and should be used responsibly | yes | yes |
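The intuition that a conclusion repeating the premise is valid but not novel can be illustrated with a toy word-overlap score. This is purely illustrative and not part of the task definition; competitive systems used learned models rather than surface overlap.

```python
def token_overlap(premise: str, conclusion: str) -> float:
    """Jaccard overlap between the word sets of premise and conclusion.

    A score near 1.0 suggests the conclusion mostly repeats the premise
    (plausibly valid, but not novel); a score near 0.0 suggests content
    unrelated to the premise.
    """
    p = set(premise.lower().split())
    c = set(conclusion.lower().split())
    if not p or not c:
        return 0.0
    return len(p & c) / len(p | c)
```

For example, a verbatim repetition of the premise scores 1.0, while a conclusion sharing no words with the premise scores 0.0.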

Datasets & Evaluation

Please read the Data Description beforehand.

If you use the data, please cite our overview paper.

Evaluation: we consider the macro F1 score, counting an instance as correctly predicted only if both its validity and its novelty labels are predicted correctly.
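One possible reading of this metric (an assumption on our part: the validity/novelty pair is treated as a single joint label, and macro F1 is computed over the joint classes, so an instance only contributes a true positive when both dimensions are correct) can be sketched in pure Python as:

```python
def joint_macro_f1(gold, pred):
    """Macro F1 over joint (validity, novelty) labels.

    gold, pred: lists of (valid, novel) pairs, e.g. (1, 0).
    An instance counts as a true positive only for its exact joint class,
    so both dimensions must be correct for it to help any class's F1.
    """
    classes = set(gold) | set(pred)
    f1s = []
    for c in sorted(classes):
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The official scoring script may differ in details (e.g. handling of classes absent from the gold data); consult the overview paper for the authoritative definition.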

Subtask B: recognizing relative validity / novelty

Given a premise and two conclusions A and B in natural language, the task is to predict:

  1. whether conclusion A is better than conclusion B in terms of validity
  2. whether conclusion A is better than conclusion B in terms of novelty

There are three possible labels for this task: better/worse/tie.

Example: US offshore oil drilling

Premise: These large ships release significant pollution into the oceans, and carry some risk of hitting the shore, and causing a spill.

| Conclusion A | Conclusion B | Validity | Novelty |
|---|---|---|---|
| Transporting offshore oil to shores by ship has environmental costs. | Need for water does not qualify water as a right. | A > B | A > B |
| Oil drilling releases significant pollutants into the ocean | Transporting offshore oil to shores by ship has environmental costs. | A = B | A < B |

Datasets & Evaluation

Please read the Data Description beforehand.

If you use the data, please cite our overview paper.

Evaluation: We require the following format for each instance and each aspect (validity / novelty).

We consider the average of the macro F1 scores for validity and novelty.
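Under the assumption that each aspect is scored as a standard macro F1 over the three relative labels (better / worse / tie) and the two scores are then averaged, the Task B score can be sketched as:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over all labels occurring in gold or pred."""
    labels = set(gold) | set(pred)
    f1s = []
    for lab in sorted(labels):
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def task_b_score(gold_val, pred_val, gold_nov, pred_nov):
    """Mean of the per-aspect macro F1 scores (validity and novelty)."""
    return (macro_f1(gold_val, pred_val) + macro_f1(gold_nov, pred_nov)) / 2
```

Again, the official scoring script is authoritative; this sketch only makes the stated averaging explicit.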

Timeline

System Description Paper Submission

Please find more info here

Terms and Conditions

By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, email us, and we will consider if an exception can be made.

Results

Table entries are ranked by the main evaluation metric.

Task A

| Team | mF1 Valid&Novel | mF1 Valid | mF1 Novel |
|---|---|---|---|
| CLTeamL-3 | 45.16 | 74.64 | 61.75 |
| AXiS@EdUni-1 | 43.27 | 69.8 | 62.43 |
| ACCEPT-1 | 43.13 | 59.2 | 70.0 |
| CLTeamL-5 | 43.1 | 74.64 | 58.95 |
| CSS(*) | 42.4 | 70.76 | 59.86 |
| AXiS@EdUni-2 | 39.74 | 66.69 | 61.63 |
| CLTeamL-2 | 38.7 | 65.03 | 61.75 |
| CLTeamL-1 | 35.32 | 74.64 | 46.07 |
| CLTeamL-4 | 33.11 | 56.74 | 58.95 |
| ACCEPT-3 | 30.13 | 58.63 | 56.81 |
| ACCEPT-2 | 29.92 | 56.8 | 48.1 |
| NLP@UIT | 25.89 | 61.72 | 43.36 |
| RoBERTa | 23.9 | 59.96 | 36.12 |
| CSS | 21.08 | 51.61 | 43.75 |
| Harshad | 17.35 | 56.31 | 39.0 |

CSS(*): post-deadline submission of CSS, after an output formatting bug was detected and corrected

Task B

| Team | mean of mF1 Val & mF1 Nov | mF1 Valid | mF1 Novel |
|---|---|---|---|
| NLP@UIT | 41.5 | 44.6 | 38.39 |
| AXiS@EdUni | 29.16 | 32.47 | 25.86 |
| RoBERTa | 21.46 | 19.82 | 23.09 |

Task Organizers

Newsletter/ Google-Group: https://groups.google.com/g/argmining2022-shared-task