Overview
United Nations resolutions encode collective reasoning at scale: negotiated positions, implicit premises, and carefully structured conclusions. This shared task evaluates how well modern systems can recover these underlying argumentative structures from text.
- What you do: predict paragraph-level labels and argumentative relations on a held-out test set.
- Who can participate: anyone (students are very much welcome).
- Model policy: systems must rely exclusively on open-weight models with at most 8B parameters; closed or commercial models are not permitted.
Tasks
The shared task consists of two subtasks aligned with the workshop theme “Understanding and evaluating arguments in both human and machine reasoning.”
Subtask 1: Argumentative Paragraph Classification
For each paragraph, predict (a) whether it is preambular or operative, and (b) which of 141 predefined tags apply, treated as a multi-label classification problem.
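As a rough sketch of the multi-label formulation in (b), per-paragraph tag sets can be encoded against a fixed tag inventory with scikit-learn's MultiLabelBinarizer. The tag names below are invented placeholders; the real inventory is the 141-tag set distributed with the task data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder tag inventory; the real 141 tags come from the released CSV file.
tag_inventory = ["recalling", "urging", "requesting"]

# Hypothetical per-paragraph predictions for Subtask 1:
# a coarse type plus a subset of tags for each paragraph.
predictions = [
    {"type": "preambular", "tags": ["recalling"]},
    {"type": "operative", "tags": ["urging", "requesting"]},
]

# Encode the tag subsets as a binary indicator matrix (paragraphs x tags),
# the usual representation for multi-label classification.
mlb = MultiLabelBinarizer(classes=tag_inventory)
tag_matrix = mlb.fit_transform([p["tags"] for p in predictions])
print(tag_matrix)  # [[1 0 0], [0 1 1]]
```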
Subtask 2: Argumentative Relation Prediction
Given a paragraph, predict which other paragraphs it is related to (indices), and label each link with one or more relation types: contradictive, supporting, complemental, modifying.
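To make the expected output concrete, a single paragraph's relation prediction might look like the sketch below. Representing each link as a list of relation strings is an assumption; defer to the released JSON schema for the exact value format.

```python
# Hypothetical Subtask 2 prediction for one paragraph: keys are the indices of
# related paragraphs, values are the predicted relation type(s) for each link.
# The value format (a list of relation strings) is an assumption; JSON
# serialization will turn the integer keys into strings.
matched_paras = {
    3: ["supporting"],
    7: ["modifying", "complemental"],
}

# The relation labels allowed by the task description.
RELATION_TYPES = {"contradictive", "supporting", "complemental", "modifying"}
assert all(r in RELATION_TYPES for rels in matched_paras.values() for r in rels)
```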
Data
We provide a training set and a held-out test set, both in a JSON format that follows a common schema for easy processing and reproducible development. We encourage participants to explore the data and design their systems accordingly. To make the task more accessible to non-French speakers, we also provide English translations of the dataset.
- Training data: drawn from the UN-RES dataset (Gao et al., 2025): 2,695 parsed UN resolutions in French, plus machine-generated English translations produced with Helsinki-NLP/opus-mt-fr-en.
- Test data: 45 parsed documents (resolutions and recommendations) from the UNESCO International Bureau of Education’s International Conference on Education (1934–2008), each containing up to three resolutions, annotated at paragraph level in French. Machine-generated English translations of the test set are provided, produced with gpt-4.1-mini.

Below is an example from the test set:

```json
{
  "TEXT_ID": "ICPE-25-1962_RES1-FR_res_54",
  "RECOMMENDATION": 54,
  "TITLE": "LA PLANIFICATION DE L'ÉDUCATION",
  "METADATA": {
    "structure": {
      "doc_title": "ICPE-25-1962_RES1-FR",
      "nb_paras": 58,
      "preambular_para": [],
      "operative_para": [],
      "think": ""
    }
  },
  "body": {
    "paras": [
      {
        "para_number": 1,
        "para": "La Conférence internationale de l'instruction publique, Convoquée à...",
        "type": null,
        "tags": [],
        "matched_paras": {},
        "think": "",
        "para_en": "The International Conference on Education, convened in..."
      },
      ...
    ]
  }
}
```

All submissions must fill in the values for the following fields:
- METADATA.structure:
  - preambular_paras: list of paragraph indices (int) classified as preambular
  - operative_paras: list of paragraph indices (int) classified as operative
  - think: string describing the reasoning process for how paragraphs are classified into preambular and operative (e.g., LLM thinking output)
- body.paras (one entry per paragraph):
  - type: "preambular" or "operative"
  - tags: list of tag labels (strings)
  - matched_paras: dictionary whose keys are the indices (int) of paragraphs linked by content or reference, and whose values are relation types ("contradictive", "supporting", "complemental", "modifying")
  - think: string describing the reasoning process for how tags are assigned to the paragraph (e.g., LLM thinking output)
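As a minimal sketch of filling in these fields, the snippet below loads one test document, writes placeholder predictions into the required slots, and saves the result. Here predict_type, predict_tags, and predict_relations are hypothetical stand-ins for your actual system, the file path is a placeholder, and the key names follow the example above (double-check them against the released schema).

```python
import json

def predict_type(paragraph_text: str) -> str:
    """Hypothetical stand-in for your Subtask 1 type classifier."""
    return "preambular"

def predict_tags(paragraph_text: str) -> list[str]:
    """Hypothetical stand-in for your Subtask 1 tag classifier."""
    return []

def predict_relations(paragraph_text: str, all_paras: list[dict]) -> dict:
    """Hypothetical stand-in for your Subtask 2 relation predictor."""
    return {}

# Placeholder path; use the actual test-set file(s) from the task release.
with open("test_document.json", encoding="utf-8") as f:
    doc = json.load(f)

paras = doc["body"]["paras"]
for para in paras:
    para["type"] = predict_type(para["para"])
    para["tags"] = predict_tags(para["para"])
    para["matched_paras"] = predict_relations(para["para"], paras)
    para["think"] = "reasoning trace for this paragraph (model thinking output)"

# Key names follow the JSON example above; verify against the released schema.
structure = doc["METADATA"]["structure"]
structure["preambular_para"] = [p["para_number"] for p in paras if p["type"] == "preambular"]
structure["operative_para"] = [p["para_number"] for p in paras if p["type"] == "operative"]
structure["think"] = "reasoning trace for the document-level split"

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(doc, f, ensure_ascii=False, indent=2)
```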
Participants must enable the thinking mode of their LLMs to reason about the relationships between paragraphs.
Tags are provided in a separate CSV file named evaluation_dimensions_updated.csv, along with dimensional and categorical metadata that participants may use in their systems.
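To inspect the tag inventory and its metadata before building a system, a short pandas sketch like the following can be used; no column names are assumed here, it simply shows what the file contains.

```python
import pandas as pd

# Load the tag file shipped with the task data.
tags_df = pd.read_csv("evaluation_dimensions_updated.csv")

# Inspect the available columns (dimensional/categorical metadata) before
# deciding how to use them in your system.
print(tags_df.columns.tolist())
print(len(tags_df), "rows")
print(tags_df.head())
```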
Licensing note: the training data are distributed under a restricted UN license; by participating, teams agree not to redistribute the training data publicly.
Evaluation
Systems are evaluated using a combination of automated metrics and empirical auditing.
- Automated metric: F1 scores for classification performance, computed with scikit-learn (see the sketch at the end of this section).
- Empirical metric: LLM-as-a-judge scoring on a 0–100 scale, using an open-weight LLM with a fixed prompt to assess reasoning quality.
Final ranking is based on the average of both metrics. We will update the leaderboard live during the evaluation phase.
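For reference, F1 on the multi-label tag predictions can be computed with scikit-learn roughly as below; the binary indicator encoding and the averaging schemes shown are assumptions, not the official scoring script.

```python
from sklearn.metrics import f1_score

# Toy binary indicator matrices (paragraphs x tags) for gold and predicted
# tag assignments; the real evaluation uses the full 141-tag inventory.
y_true = [[1, 0, 0],
          [0, 1, 1],
          [0, 0, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]

# Micro- and macro-averaged F1; the averaging used by the organizers is not
# specified here, so both are shown.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```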
Submission
Participants submit predictions for the test set in the required JSON format.
Submission package
- Predictions: strict JSON outputs conforming to the schema
- System paper: non-anonymous paper of at most 6 pages (ACL format), excluding references; unlimited optional appendices
- Code: include a link to a public repository (e.g., GitHub) in the paper
Compress your filled-out JSON test set and system paper into a single ZIP file for upload.
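A minimal packaging sketch, assuming the filled-out test set has been saved as submission.json and the paper as system_paper.pdf (both names are placeholders):

```python
import zipfile

# Placeholder file names; replace with your actual prediction file(s) and paper.
files = ["submission.json", "system_paper.pdf"]

with zipfile.ZipFile("team_name_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in files:
        zf.write(path)
```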
Allowed techniques are flexible (e.g., in-context learning, retrieval-augmented generation), but only open-weight LLMs with at most 8B parameters may be used. Please also include a team name in your system paper for the leaderboard announcement.
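As one possible setup under this policy, the sketch below runs in-context paragraph-type classification with an open-weight instruction-tuned model via the transformers library. Qwen/Qwen2.5-7B-Instruct is used purely as an example of a checkpoint under 8B parameters; handling of thinking-mode output is model-specific and omitted here.

```python
from transformers import pipeline

# Example open-weight checkpoint under 8B parameters; any compliant model works.
generator = pipeline("text-generation",
                     model="Qwen/Qwen2.5-7B-Instruct",
                     device_map="auto")

paragraph = "The International Conference on Education, convened in..."
messages = [
    {"role": "system", "content": "You classify UN resolution paragraphs."},
    {"role": "user", "content": (
        "Label the following paragraph as 'preambular' or 'operative' "
        "and explain your reasoning briefly.\n\n" + paragraph
    )},
]

# Chat-formatted generation; keep the model's reasoning for the 'think' fields.
output = generator(messages, max_new_tokens=256)
print(output[0]["generated_text"][-1]["content"])
```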
Upload Your Submission
Please submit your ZIP file via OpenReview at the following link: [TBA]
Important dates
All deadlines are 11:59 PM UTC-12:00 (“anywhere on Earth”).
Organizers
University of Zurich, Zurich, Switzerland.
- Yingqiang Gao — Postdoctoral Researcher, Linguistic Research Infrastructure
- Anastassia Shaitarova — Postdoctoral Researcher, Department of Computational Linguistics
- Reto Gubelmann — Research Group Leader, Digital Society Initiative & Department of Computational Linguistics
- Patrick Montjouridès — Postdoctoral Researcher, Institute of Education
FAQ
- Can I use closed-source or commercial models? No. Submissions must rely exclusively on open-weight models with at most 8B parameters.
- Can I use external data? The task does not impose strict constraints on the use of external or unsupervised training data, but please document everything you use in your system paper.
- How do I submit? Compress your JSON predictions for the test set and your system paper into a single ZIP file and upload it via OpenReview (link above) before the deadline.
- What if my JSON does not match the schema? Non-conforming submissions will not be evaluated.
- Where do I ask questions? Email any of the organizers.