Overview
United Nations resolutions encode collective reasoning at scale: negotiated positions, implicit premises, and carefully structured conclusions. This shared task evaluates how well modern systems can recover these underlying argumentative structures from text.
- What you do: predict paragraph-level labels and argumentative relations on a held-out test set.
- Who can participate: anyone (students are very much welcome).
- Model policy: systems must rely exclusively on open-weight models with at most 8B parameters; closed or commercial models are not permitted.
Tasks
The shared task consists of two subtasks aligned with the workshop theme “Understanding and evaluating arguments in both human and machine reasoning.”
Subtask 1: Argumentative Paragraph Classification
For each paragraph, predict (a) whether it is preambular or operative, and (b) which of 141 predefined tags apply, treated as a multi-label classification problem.
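As a rough sketch of the multi-label formulation in (b), per-paragraph tag sets can be encoded against a fixed tag inventory with scikit-learn's MultiLabelBinarizer. The tag names below are invented placeholders; the real inventory is the 141-tag set distributed with the task data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder tag inventory; the real 141 tags come from the released CSV file.
tag_inventory = ["recalling", "urging", "requesting"]

# Hypothetical per-paragraph predictions for Subtask 1:
# a coarse type plus a subset of tags for each paragraph.
predictions = [
    {"type": "preambular", "tags": ["recalling"]},
    {"type": "operative", "tags": ["urging", "requesting"]},
]

# Encode the tag subsets as a binary indicator matrix (paragraphs x tags),
# the usual representation for multi-label classification.
mlb = MultiLabelBinarizer(classes=tag_inventory)
tag_matrix = mlb.fit_transform([p["tags"] for p in predictions])
print(tag_matrix)  # [[1 0 0], [0 1 1]]
```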
Subtask 2: Argumentative Relation Prediction
Given a paragraph, predict which other paragraphs it is related to (indices), and label each link with one or more relation types: contradictive, supporting, complemental, modifying.
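To make the expected output concrete, a single paragraph's relation prediction might look like the sketch below. Representing each link as a list of relation strings is an assumption; defer to the released JSON schema for the exact value format.

```python
# Hypothetical Subtask 2 prediction for one paragraph: keys are the indices of
# related paragraphs, values are the predicted relation type(s) for each link.
# The value format (a list of relation strings) is an assumption; JSON
# serialization will turn the integer keys into strings.
matched_paras = {
    3: ["supporting"],
    7: ["modifying", "complemental"],
}

# The relation labels allowed by the task description.
RELATION_TYPES = {"contradictive", "supporting", "complemental", "modifying"}
assert all(r in RELATION_TYPES for rels in matched_paras.values() for r in rels)
```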
Data
We provide a training set and a held-out test set, both in a JSON format that follows a common schema for easy processing and reproducible development. We encourage participants to explore the data and design their systems accordingly. To make the task more accessible to non-French speakers, we also provide English translations of the dataset.
- Training data: drawn from the UN-RES dataset (Gao et al., 2025): 2,695 parsed UN resolutions in French, plus machine-generated English translations produced with Helsinki-NLP/opus-mt-fr-en.
- Test data: 45 parsed documents (resolutions and recommendations) from the UNESCO International Bureau of Education’s International Conference on Education (1934–2008), each containing up to three resolutions, annotated at paragraph level in French. Machine-generated English translations of the test set are provided, produced with gpt-4.1-mini.

Below is an example from the test set:

```json
{
  "TEXT_ID": "ICPE-25-1962_RES1-FR_res_54",
  "RECOMMENDATION": 54,
  "TITLE": "LA PLANIFICATION DE L'ÉDUCATION",
  "METADATA": {
    "structure": {
      "doc_title": "ICPE-25-1962_RES1-FR",
      "nb_paras": 58,
      "preambular_para": [],
      "operative_para": [],
      "think": ""
    }
  },
  "body": {
    "paras": [
      {
        "para_number": 1,
        "para": "La Conférence internationale de l'instruction publique, Convoquée à...",
        "type": null,
        "tags": [],
        "matched_paras": {},
        "think": "",
        "para_en": "The International Conference on Education, convened in..."
      },
      ...
    ]
  }
}
```

All submissions must fill in the values for the following fields:
- METADATA.structure:
  - preambular_paras: list of paragraph indices (int) classified as preambular
  - operative_paras: list of paragraph indices (int) classified as operative
  - think: string describing the reasoning process for how paragraphs are classified into preambular and operative (e.g., LLM thinking output)
- body.paras (one entry per paragraph):
  - type: "preambular" or "operative"
  - tags: list of tag labels (strings)
  - matched_paras: dictionary whose keys are the indices (int) of paragraphs linked by content or reference, and whose values are relation types ("contradictive", "supporting", "complemental", "modifying")
  - think: string describing the reasoning process for how tags are assigned to the paragraph (e.g., LLM thinking output)
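As a minimal sketch of filling in these fields, the snippet below loads one test document, writes placeholder predictions into the required slots, and saves the result. Here predict_type, predict_tags, and predict_relations are hypothetical stand-ins for your actual system, the file path is a placeholder, and the key names follow the example above (double-check them against the released schema).

```python
import json

def predict_type(paragraph_text: str) -> str:
    """Hypothetical stand-in for your Subtask 1 type classifier."""
    return "preambular"

def predict_tags(paragraph_text: str) -> list[str]:
    """Hypothetical stand-in for your Subtask 1 tag classifier."""
    return []

def predict_relations(paragraph_text: str, all_paras: list[dict]) -> dict:
    """Hypothetical stand-in for your Subtask 2 relation predictor."""
    return {}

# Placeholder path; use the actual test-set file(s) from the task release.
with open("test_document.json", encoding="utf-8") as f:
    doc = json.load(f)

paras = doc["body"]["paras"]
for para in paras:
    para["type"] = predict_type(para["para"])
    para["tags"] = predict_tags(para["para"])
    para["matched_paras"] = predict_relations(para["para"], paras)
    para["think"] = "reasoning trace for this paragraph (model thinking output)"

# Key names follow the JSON example above; verify against the released schema.
structure = doc["METADATA"]["structure"]
structure["preambular_para"] = [p["para_number"] for p in paras if p["type"] == "preambular"]
structure["operative_para"] = [p["para_number"] for p in paras if p["type"] == "operative"]
structure["think"] = "reasoning trace for the document-level split"

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(doc, f, ensure_ascii=False, indent=2)
```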
Participants must enable the thinking mode of their LLMs to reason about the relationships between paragraphs.
Tags are provided in a separate CSV file named evaluation_dimensions_updated.csv, along with dimensional and categorical metadata that participants may use in their systems.
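To inspect the tag inventory and its metadata before building a system, a short pandas sketch like the following can be used; no column names are assumed here, it simply shows what the file contains.

```python
import pandas as pd

# Load the tag file shipped with the task data.
tags_df = pd.read_csv("evaluation_dimensions_updated.csv")

# Inspect the available columns (dimensional/categorical metadata) before
# deciding how to use them in your system.
print(tags_df.columns.tolist())
print(len(tags_df), "rows")
print(tags_df.head())
```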
Licensing note: the training data are distributed under a restricted UN license; by participating, teams agree not to redistribute the training data publicly.
Evaluation
Systems are evaluated using a combination of automated metrics and empirical auditing.
- Automated metric: F1 scores for classification performance, computed with scikit-learn (see the sketch at the end of this section).
- Empirical metric: LLM-as-a-judge scoring on a 0–100 scale, using an open-weight LLM with a fixed prompt to assess reasoning quality.
Final ranking is based on the average of both metrics. We will update the leaderboard live during the evaluation phase.
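For reference, F1 on the multi-label tag predictions can be computed with scikit-learn roughly as below; the binary indicator encoding and the averaging schemes shown are assumptions, not the official scoring script.

```python
from sklearn.metrics import f1_score

# Toy binary indicator matrices (paragraphs x tags) for gold and predicted
# tag assignments; the real evaluation uses the full 141-tag inventory.
y_true = [[1, 0, 0],
          [0, 1, 1],
          [0, 0, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

# Micro- and macro-averaged F1; the averaging used by the organizers is not
# specified here, so both are shown.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```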
Submission
Participants submit predictions for the test set in the required JSON format.
Submission package
- Predictions: strict JSON outputs conforming to the schema
- System paper: non-anonymous paper of at most 6 pages (ACL format), excluding references; unlimited optional appendices
- Code: include a link to a public repository (e.g., GitHub) in the paper
Compress your filled-out JSON test set and system paper into a single ZIP file for upload.
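A minimal packaging sketch, assuming the filled-out test set has been saved as submission.json and the paper as system_paper.pdf (both names are placeholders):

```python
import zipfile

# Placeholder file names; replace with your actual prediction file(s) and paper.
files = ["submission.json", "system_paper.pdf"]

with zipfile.ZipFile("team_name_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in files:
        zf.write(path)
```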
Allowed techniques are flexible (e.g., in-context learning, retrieval-augmented generation), but only open-weight LLMs with at most 8B parameters may be used. Please also include a team name in your system paper for the leaderboard announcement.
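As one possible setup under this policy, the sketch below runs in-context paragraph-type classification with an open-weight instruction-tuned model via the transformers library. Qwen/Qwen2.5-7B-Instruct is used purely as an example of a checkpoint under 8B parameters; handling of thinking-mode output is model-specific and omitted here.

```python
from transformers import pipeline

# Example open-weight checkpoint under 8B parameters; any compliant model works.
generator = pipeline("text-generation",
                     model="Qwen/Qwen2.5-7B-Instruct",
                     device_map="auto")

paragraph = "The International Conference on Education, convened in..."
messages = [
    {"role": "system", "content": "You classify UN resolution paragraphs."},
    {"role": "user", "content": (
        "Label the following paragraph as 'preambular' or 'operative' "
        "and explain your reasoning briefly.\n\n" + paragraph
    )},
]

# Chat-formatted generation; keep the model's reasoning for the 'think' fields.
output = generator(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```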
Upload Your Submission
Please submit your ZIP file via OpenReview at the following link: [TBA]
Important dates
All deadlines are 11:59 PM UTC-12:00 (“anywhere on Earth”).
Organizers
University of Zurich, Zurich, Switzerland.
- Yingqiang Gao — Postdoctoral Researcher, Linguistic Research Infrastructure
- Anastassia Shaitarova — Postdoctoral Researcher, Department of Computational Linguistics
- Reto Gubelmann — Research Group Leader, Digital Society Initiative & Department of Computational Linguistics
- Patrick Montjouridès — Postdoctoral Researcher, Institute of Education
FAQ
- Can I use closed-source or commercial models? No. Submissions must rely exclusively on open-weight models with at most 8B parameters.
- Can I use external data? The task does not impose strict constraints on the use of external or unsupervised training data, but please document everything you use in your system paper.
- How do I submit? Compress your JSON predictions for the test set and your system paper into a single ZIP file and upload it via OpenReview (link above) before the deadline.
- What if my JSON does not match the schema? Non-conforming submissions will not be evaluated.
- Where do I ask questions? Email any of the organizers.