Shared Task on Predicting Validity and Novelty
In recent years, there has been growing interest in how to assess the quality of arguments systematically. Wachsmuth et al. proposed a framework for quality assessment consisting of the following top-level dimensions: logic, rhetoric, and dialectic. Regarding the dimension of logic, there has been some work on automatically assessing the quality of an argument or conclusion.
Recently, there has also been interest in the generation of conclusions or arguments. Our assumption is that, to guide the process of automatically generating a conclusion, we need metrics that can be computed automatically to estimate the suitability and quality of a candidate conclusion. One important objective is that the conclusion is valid, that is, that it “follows” from the premise. At the same time, it is easy to produce conclusions that “follow” from the premise by repeating (parts of) the premise in the conclusion, trivially generating a “valid” but vacuous conclusion. It is therefore important to assess whether conclusions/arguments are not only valid, but also novel.
We define validity as requiring the existence of logical inferences that link the premise to the conclusion. In contrast, novelty requires the presence of novel premise-related content and/or a combination of the contents of the premise in a way that goes beyond what is stated in the premise. Hence, a conclusion that is valid but not novel could be a repetition, a paraphrase, or a summary of the premise; only a novel conclusion offers a piece of information that extends what is already covered by the premise – whether it supports or contests the premise.
We divide the task of Validity-Novelty-Prediction into two subtasks.
Participants can choose whether to address Task A or Task B, or both.
Task A: Given a premise and a conclusion in natural language, the task is to predict (a) whether the conclusion is valid with respect to the premise and (b) whether it is novel. Hence, we expect two binary decisions as output.
Premise: The notion of man’s dominion over animals need not be thought of as a blank check for man to exploit animals. Indeed, it may be appropriate to connect the notion of “dominion” to “stewardship” over animals. Yet, humans can be good stewards of animals while continuing to eat them. It is merely necessary that humans maintain balance, order, and sustainability in the animal kingdom. But, again, this does not require the abandonment of meat-eating.
Conclusion | Validity | Novelty |
---|---|---|
Two-party systems are more stable | no | no |
Man’s “dominion” over animals does not imply abandoning meat. | yes | no |
The idea of “dominionism” is unnecessary. | no | yes |
Dominion over animals can and should be used responsibly | yes | yes |
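As an illustration of the expected output, the sketch below produces the two binary decisions with two premise-conclusion pair classifiers. This is not the official baseline; the checkpoint names are placeholders (assumptions), and the models would have to be fine-tuned on the task's training data.

```python
# Minimal sketch of producing the two binary decisions for Task A.
# Assumptions: two sequence-pair classifiers fine-tuned on the shared-task
# training data; the checkpoint names below are placeholders, not released models.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def predict_binary(model_name: str, premise: str, conclusion: str) -> int:
    """Return 1 ("yes") or 0 ("no") for a premise-conclusion pair."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    inputs = tokenizer(premise, conclusion, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # assumes label index 1 = "yes" and 0 = "no" in the fine-tuned model
    return int(logits.argmax(dim=-1).item())

premise = ("The notion of man's dominion over animals need not be thought of "
           "as a blank check for man to exploit animals. ...")  # truncated
conclusion = "Man's dominion over animals does not imply abandoning meat."

validity = predict_binary("your-validity-checkpoint", premise, conclusion)  # placeholder name
novelty = predict_binary("your-novelty-checkpoint", premise, conclusion)    # placeholder name
print({"validity": validity, "novelty": novelty})
```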
Please read the Data Description beforehand.
If you use the data, please cite our overview paper.
Evaluation: we use the macro F1-score, counting an instance as correctly predicted only if both its validity and its novelty label are predicted correctly.
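Under one plausible reading of this metric, an instance is scored via its joint (validity, novelty) label, so that it counts as correct only if both decisions are correct. A minimal sketch with scikit-learn and toy labels (the official evaluation script remains authoritative):

```python
# Sketch of the Task A metric under one plausible reading of the description:
# an instance counts as correct only if validity AND novelty are both correct,
# i.e. the joint (validity, novelty) label is scored.
from sklearn.metrics import f1_score

def joint_macro_f1(val_true, nov_true, val_pred, nov_pred):
    gold = [f"{v}-{n}" for v, n in zip(val_true, nov_true)]
    pred = [f"{v}-{n}" for v, n in zip(val_pred, nov_pred)]
    return f1_score(gold, pred, average="macro")

# Toy example: 1 = "yes", 0 = "no"
print(joint_macro_f1([1, 1, 0, 0], [1, 0, 1, 0],
                     [1, 1, 0, 0], [1, 0, 0, 0]))
```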
Task B: Given a premise and two conclusions A and B in natural language, the task is to predict, for validity and for novelty separately, whether conclusion A is better than, worse than, or as good as conclusion B. Hence, there are three possible labels per aspect: better / worse / tie.
Premise: These large ships release significant pollution into the oceans, and carry some risk of hitting the shore, and causing a spill.
Conclusion A | Conclusion B | Validity | Novelty |
---|---|---|---|
Transporting offshore oil to shores by ship has environmental costs. | Need for water does not qualify water as a right. | A > B | A > B |
Oil drilling releases significant pollutants into the ocean | Transporting offshore oil to shores by ship has environmental costs. | A = B | A < B |
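One possible (unofficial) way to derive these relative labels is to score each conclusion separately for a given aspect and compare the scores; the tie margin in the sketch below is an arbitrary assumption.

```python
# Sketch: score each conclusion independently (e.g. with the Task A classifiers)
# and compare the scores per aspect; the tie margin is an arbitrary assumption.
def compare(score_a: float, score_b: float, tie_margin: float = 0.05) -> str:
    """Map two per-conclusion scores for one aspect to 'better', 'worse' or 'tie'."""
    if abs(score_a - score_b) <= tie_margin:
        return "tie"
    return "better" if score_a > score_b else "worse"

# e.g. validity scores for conclusions A and B of one instance
print(compare(0.81, 0.42))  # -> "better": A is judged more valid than B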
Please read the Data Description beforehand.
If you use the data, please cite our overview paper.
Evaluation: we require a prediction for each instance and each aspect (validity / novelty). We consider the average of the macro F1-scores for validity and novelty.
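A minimal sketch of this metric, computing the macro F1 separately for the relative validity and novelty labels and averaging the two, using scikit-learn and toy data:

```python
# Sketch of the Task B metric as described above: macro F1 is computed
# separately for the relative validity and novelty labels, then averaged.
from sklearn.metrics import f1_score

def task_b_score(val_true, val_pred, nov_true, nov_pred):
    f1_val = f1_score(val_true, val_pred, average="macro")
    f1_nov = f1_score(nov_true, nov_pred, average="macro")
    return (f1_val + f1_nov) / 2, f1_val, f1_nov

# Toy example with the three possible labels
gold_val = ["better", "tie", "worse"]
pred_val = ["better", "worse", "worse"]
gold_nov = ["better", "worse", "tie"]
pred_nov = ["better", "worse", "tie"]
print(task_b_score(gold_val, pred_val, gold_nov, pred_nov))
```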
Please find more info here.
By participating in this task you agree to these terms and conditions. If, however, one or more of these conditions is a concern for you, email us, and we will consider if an exception can be made.
Table entries are ranked by the main evaluation metric of the respective subtask.

Results for Task A:
Team | mF1 Valid&Novel | mF1 Valid | mF1 Novel |
---|---|---|---|
CLTeamL-3 | 45.16 | 74.64 | 61.75 |
AXiS@EdUni-1 | 43.27 | 69.8 | 62.43 |
ACCEPT-1 | 43.13 | 59.2 | 70.0 |
CLTeamL-5 | 43.1 | 74.64 | 58.95 |
CSS(*) | 42.4 | 70.76 | 59.86 |
AXiS@EdUni-2 | 39.74 | 66.69 | 61.63 |
CLTeamL-2 | 38.7 | 65.03 | 61.75 |
CLTeamL-1 | 35.32 | 74.64 | 46.07 |
CLTeamL-4 | 33.11 | 56.74 | 58.95 |
ACCEPT-3 | 30.13 | 58.63 | 56.81 |
ACCEPT-2 | 29.92 | 56.8 | 48.1 |
NLP@UIT | 25.89 | 61.72 | 43.36 |
RoBERTa | 23.9 | 59.96 | 36.12 |
CSS | 21.08 | 51.61 | 43.75 |
Harshad | 17.35 | 56.31 | 39.0 |
CSS(*): post-deadline submission of CSS, after an output-formatting bug was detected and corrected.
Results for Task B:

Team | mean of mF1 Valid & mF1 Novel | mF1 Valid | mF1 Novel |
---|---|---|---|
NLP@UIT | 41.5 | 44.6 | 38.39 |
AXiS@EdUni | 29.16 | 32.47 | 25.86 |
RoBERTa | 21.46 | 19.82 | 23.09 |
Newsletter / Google Group: https://groups.google.com/g/argmining2022-shared-task