Training an LLM to write high-quality unit tests, without any ground-truth unit test labels, by pitting a test generator and a code generator against each other in an adversarial RL loop.
Collecting high-quality unit tests at scale is expensive. Instruction–code pairs are not. UTRL leverages this asymmetry: it converts an instruction–code dataset into a training signal for unit test generation through two interlocking rewards.
Reward the test generator when its tests flag code solutions drawn from the code generator while the ground-truth code still passes them. This drives the tests toward sharp, discriminative edge cases.
Reward the functional correctness of the generated test cases, but clip the denominator at a minimum of τ so the model cannot game the reward by emitting only 2–3 trivial tests copied from the prompt (a sketch of both rewards follows).
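A minimal sketch of how these two rewards could be computed for a single task. The passes helper, the exact aggregation, and the default τ = 10 are illustrative assumptions, not the paper's precise formulas:

from typing import Callable, List

# Assumed helper interface: passes(code, test_case) runs one test case against a
# candidate solution in a sandbox and returns True iff it passes. Hypothetical.
Passes = Callable[[str, str], bool]

def discrimination_reward(tests: List[str], gt_code: str,
                          sampled_codes: List[str], passes: Passes) -> float:
    # Only tests that the ground-truth code passes carry trustworthy signal.
    valid_tests = [t for t in tests if passes(gt_code, t)]
    if not valid_tests or not sampled_codes:
        return 0.0
    # Fraction of the code generator's solutions that at least one valid test fails.
    flagged = sum(any(not passes(c, t) for t in valid_tests) for c in sampled_codes)
    return flagged / len(sampled_codes)

def validity_reward(tests: List[str], gt_code: str, passes: Passes, tau: int = 10) -> float:
    # Fraction of generated test cases the ground-truth code passes, with the
    # denominator clipped below by tau: emitting only 2-3 trivial tests caps the reward.
    if not tests:
        return 0.0
    n_correct = sum(passes(gt_code, t) for t in tests)
    return n_correct / max(len(tests), tau)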
The two generators are trained in alternation (sketched below). In each iteration, the code generator is pushed to produce solutions that are harder to distinguish from ground truth, and the test generator is pushed to invent increasingly precise failure modes. Iteration 2's discrimination reward starts 25 points below iteration 1's saturation level, confirming that the code generator genuinely improved, and then climbs back up as the tests adapt.
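A structural sketch of that alternation, with the RL update routines and reward functions injected as callables; their names and signatures are assumptions for illustration, not the paper's implementation:

def utrl_alternation(test_gen, code_gen, tasks, *, update_test_gen, update_code_gen,
                     test_reward, code_reward, n_iterations=2):
    # Each update_* callable is assumed to run one RL phase (e.g. a policy-gradient
    # loop) with the given reward function and return the improved model.
    for _ in range(n_iterations):
        # Phase A: sharpen the test generator against the *current* code generator,
        # using the discrimination + validity rewards described above.
        test_gen = update_test_gen(test_gen, tasks,
                                   reward=lambda tests, task: test_reward(tests, task, code_gen))
        # Phase B: harden the code generator against the freshly trained tests,
        # which reopens the discrimination problem for the next iteration.
        code_gen = update_code_gen(code_gen, tasks,
                                   reward=lambda code, task: code_reward(code, task, test_gen))
    return test_gen, code_gen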
Evaluating a generated unit test is subtle: we want tests that are both discriminative (rank good code above bad code) and comprehensive (behave the way a rigorously curated ground-truth test suite would). We introduce two complementary metrics that probe these two axes separately.
Quantifies the discriminativeness of the unit test.
Step 1. Sample N candidate code solutions for a given programming task.
Step 2. Use the generated unit test to select the solution that passes the most test cases.
Step 3. Evaluate the selected solution with the human-written ground-truth (GT) unit test; the resulting accuracy, averaged over tasks, is the discrimination score (sketched below).
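A sketch of this metric, assuming a hypothetical run_tests(code, tests) helper that executes the test cases in a sandbox and returns how many the code passes:

from typing import Callable, List, Sequence

RunTests = Callable[[str, Sequence[str]], int]  # assumed: returns the number of test cases passed

def best_of_n_accuracy(candidates_per_task: List[List[str]],
                       gen_tests_per_task: List[Sequence[str]],
                       gt_tests_per_task: List[Sequence[str]],
                       run_tests: RunTests) -> float:
    correct = 0
    for cands, gen_tests, gt_tests in zip(candidates_per_task, gen_tests_per_task,
                                          gt_tests_per_task):
        # Step 2: pick the candidate that passes the most generated test cases.
        selected = max(cands, key=lambda c: run_tests(c, gen_tests))
        # Step 3: count it as correct only if it passes every ground-truth test case.
        correct += int(run_tests(selected, gt_tests) == len(gt_tests))
    return correct / len(candidates_per_task)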
Quantifies how closely the generated test replicates GT evaluation.
Step 1. Sample N code solutions for a task.
Step 2. Score every solution twice — once with the generated test, once with the GT test — yielding two score vectors.
Step 3. Report Spearman's rank correlation between the two vectors. A higher correlation means the generated test induces the same code ranking as a comprehensive GT suite.
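A sketch of this metric using scipy, with the same assumed run_tests helper as above; note that the correlation is undefined (NaN) if either score vector is constant:

from scipy.stats import spearmanr

def agreement_with_gt(candidates, gen_tests, gt_tests, run_tests):
    # Score every candidate twice: once with the generated tests, once with the GT tests.
    gen_scores = [run_tests(c, gen_tests) for c in candidates]
    gt_scores = [run_tests(c, gt_tests) for c in candidates]
    # Spearman's rank correlation between the two score vectors.
    rho, _pvalue = spearmanr(gen_scores, gt_scores)
    return rho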
We evaluate on 945 competitive programming tasks from TACO and 511 from LiveCodeBench-v2, using the two metrics above.
Yes, and it beats frontier models: a small Qwen3-4B trained with UTRL generates unit tests that surpass those produced by GPT-4.1, GPT-4o, and SFT baselines across both evaluation axes.
Yes — it reaches parity with RL that uses ground-truth unit tests. The code generator trained against UTRL's evolving tests hits 15.3% pass@1, essentially matching the 15.9% upper bound from RL with human-written tests.
Two different failure modes. How the validity reward is designed matters at least as much as the discrimination signal itself.
Yes — iteration 2 produces a harder curriculum. In iteration 1, the test generator saturates against the initial code distribution. Training the code generator against those tests makes its outputs harder to distinguish from ground truth, which reopens the discrimination problem for the next iteration.
Yes. UTRL effectively improves Llama-3.1-8B-Instruct with only 50 training steps, lifting its unit test quality well above the instruction-tuned baseline.
Unit tests are the dominant verifiable reward in code-focused RL, but they have been a bottleneck — their curation cost limits how far we can scale RL-for-code. UTRL shows that high-quality tests can emerge from the instruction–code data we already have, via adversarial self-play. The same loop also trains a strong code generator as a byproduct.
For broader deployment, the recipe is: (i) collect instruction–code pairs (easy), (ii) run UTRL, (iii) use the resulting unit test generator as a verifier for further RL — or as a standalone code evaluator.
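For step (iii), a hedged sketch of using the trained test generator as a verifier reward; test_gen.generate and run_tests are assumed interfaces, not a released UTRL API:

def verifier_reward(code: str, instruction: str, test_gen, run_tests) -> float:
    # Reward a candidate solution by the fraction of generated test cases it passes.
    tests = test_gen.generate(instruction)   # assumed: returns a list of test cases
    if not tests:
        return 0.0
    return run_tests(code, tests) / len(tests)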
@article{lee2025learning,
title = {Learning to Generate Unit Test via Adversarial Reinforcement Learning},
author = {Lee, Dongjun and Hwang, Changho and Lee, Kimin},
journal = {arXiv preprint arXiv:2508.21107},
year = {2025}
}