UTRL · ICLR 2026

Learning to Generate Unit Test via
Adversarial Reinforcement Learning

1KAIST  ·  2Microsoft Research

Training an LLM to write high-quality unit tests — without any ground-truth unit test labels — by pitting a test generator and a code generator against each other in an adversarial RL loop.

TL;DR
A unit test generator learns to expose faults in code sampled from a code generator, while the code generator learns to pass those tests. Neither side needs human-written unit tests — the only supervision is an instruction–code pair. The result: Qwen3-4B trained with UTRL produces unit tests that beat GPT-4.1 and GPT-4o, and the co-trained code generator matches RL with ground-truth tests.

01 · Method


Collecting high-quality unit tests at scale is expensive. Instruction–code pairs are not. UTRL leverages this asymmetry: it converts an instruction–code dataset into a training signal for unit test generation through two interlocking rewards.

R_disc · discrimination

Reward the test generator when its test cases fail code solutions sampled from the code generator while the ground-truth code still passes them. This drives tests toward sharp, discriminative edge cases.

R_valid · validity (clipped)

Reward the fraction of functionally valid test cases (those the ground-truth code passes), but clip the denominator at τ so the model cannot game the ratio by emitting only 2–3 trivial tests copied from the prompt.
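As a concrete illustration, a minimal Python sketch of the clipped validity reward; the function names and exact normalization are our assumptions, not the paper's reference implementation:

def validity_reward(test_cases, passes_ground_truth, tau=12):
    """Clipped validity reward (illustrative sketch).

    passes_ground_truth(t) -> True if the ground-truth code C* passes
    test case t, i.e. t is functionally valid.
    """
    if not test_cases:
        return 0.0
    num_valid = sum(passes_ground_truth(t) for t in test_cases)
    # Clipping the denominator at tau: emitting only 2-3 valid-but-trivial
    # tests no longer yields a ratio near 1, so the generator must produce
    # a reasonably sized suite before a high validity ratio pays off.
    return num_valid / max(len(test_cases), tau)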

Discrimination reward
Computing the discrimination reward for a generated unit test T. Step 1. Each test case in T is executed against the ground-truth code C*; only the test cases that pass (here T1, T3, T5) are kept as functionally valid, so invalid test cases cannot contaminate the reward. Step 2. M = 6 code solutions C1…C6 are sampled from the code generator. A code solution is counted as "detected" if it fails at least one valid test case. In this example 4 out of 6 solutions are detected (C1, C3, C4, C5), giving a discrimination reward of 4 ⁄ 6 = 0.667.
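The same computation as a minimal Python sketch; helper names such as run_test are illustrative, not the paper's actual interface:

def discrimination_reward(test_cases, ground_truth_code, sampled_solutions, run_test):
    """Discrimination reward for one generated test suite (illustrative sketch).

    run_test(code, t) -> True if `code` passes test case `t`.
    """
    # Step 1: keep only test cases that the ground-truth code C* passes.
    valid_tests = [t for t in test_cases if run_test(ground_truth_code, t)]
    if not valid_tests or not sampled_solutions:
        return 0.0
    # Step 2: a sampled solution counts as "detected" if it fails at least
    # one valid test case; the reward is the detected fraction (e.g. 4/6 = 0.667).
    detected = sum(
        any(not run_test(code, t) for t in valid_tests)
        for code in sampled_solutions
    )
    return detected / len(sampled_solutions)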

The adversarial loop

The two generators alternate. Each iteration, the code generator is pushed to produce solutions harder to distinguish from ground truth, and the test generator is pushed to invent increasingly precise failure modes. Iteration 2's discrimination reward starts 25 points below iteration 1's saturation level — confirming the code generator genuinely improved — and then climbs back up as tests adapt.

UTRL algorithm
UTRL alternates between training the unit test generator (Step 1) and the code generator (Step 2).
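In pseudocode, one round of this alternation looks roughly like the sketch below, reusing the reward sketches above. The sampler and rl_update calls are placeholders for whatever policy-gradient machinery is actually used, and the two rewards are combined as a simple sum purely for illustration:

# Schematic sketch of the UTRL alternation (all names are placeholders).
for iteration in range(num_iterations):
    # Step 1: RL on the unit test generator against the current code generator.
    for instruction, gt_code in instruction_code_pairs:
        tests = test_generator.sample(instruction)
        codes = code_generator.sample(instruction, n=M)   # M candidate solutions
        reward = (discrimination_reward(tests, gt_code, codes, run_test)
                  + validity_reward(tests, lambda t: run_test(gt_code, t)))
        rl_update(test_generator, instruction, tests, reward)

    # Step 2: RL on the code generator against the updated test generator.
    for instruction, gt_code in instruction_code_pairs:
        code = code_generator.sample(instruction)
        tests = test_generator.sample(instruction)
        # Schematic reward: fraction of generated test cases the code passes.
        reward = sum(run_test(code, t) for t in tests) / max(len(tests), 1)
        rl_update(code_generator, instruction, code, reward)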

02 · Evaluation metrics for unit tests


Evaluating a generated unit test is subtle: we want tests that are both discriminative (rank good code above bad code) and comprehensive (behave the way a rigorously curated ground-truth test suite would). We introduce two complementary metrics that probe these two axes separately.

Metric 1 · Best-of-N improvement

Quantifies the discriminativeness of the unit test.

Step 1. Sample N candidate code solutions for a given programming task.
Step 2. Use the generated unit test to select the solution that passes the most test cases.
Step 3. Evaluate the selected solution with the human-written ground-truth unit test and report code score / accuracy.
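Put as code, a minimal sketch of this protocol for a single task (helper names are illustrative):

def best_of_n_code_score(candidates, generated_tests, gt_tests, run_test):
    """Best-of-N improvement for one task (illustrative sketch).

    candidates: N code solutions sampled for the task.
    """
    # Step 2: select the candidate that passes the most generated test cases.
    best = max(candidates,
               key=lambda code: sum(run_test(code, t) for t in generated_tests))
    # Step 3: evaluate the selected candidate on the ground-truth suite.
    passed = sum(run_test(best, t) for t in gt_tests)
    return passed / len(gt_tests)   # code score in [0, 1]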

Metric 2 · Unit test fidelity

Quantifies how closely the generated test replicates GT evaluation.

Step 1. Sample N code solutions for a task.
Step 2. Score every solution twice — once with the generated test, once with the GT test — yielding two score vectors.
Step 3. Report Spearman's rank correlation between the two vectors. A higher correlation means the generated test induces the same code ranking as a comprehensive GT suite.
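A minimal sketch of the fidelity metric for a single task; scipy's spearmanr computes the rank correlation, and the remaining names are illustrative:

from scipy.stats import spearmanr

def unit_test_fidelity(candidates, generated_tests, gt_tests, run_test):
    """Spearman correlation between generated-test and GT-test scores (sketch)."""
    def score(code, tests):
        return sum(run_test(code, t) for t in tests) / len(tests)
    gen_scores = [score(c, generated_tests) for c in candidates]
    gt_scores = [score(c, gt_tests) for c in candidates]
    # Higher rho means the generated tests rank code the same way the GT suite does.
    rho, _ = spearmanr(gen_scores, gt_scores)
    return rho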

03 · Experiments


We evaluate on 945 competitive programming tasks from TACO and 511 from LiveCodeBench-v2, using the two metrics above.

Q1 · Unit test quality
Does UTRL actually make an LLM better at writing unit tests?

Yes — and it beats frontier models. A small Qwen3-4B trained with UTRL generates unit tests that surpass GPT-4.1, GPT-4o, and SFT baselines across both evaluation axes.

  • Over base model: Best-of-N code score on Qwen3-8B jumps from 0.430 to 0.578 (+34%). Code accuracy: 9.8% → 14.9%.
  • Over frontier LLMs: UTRL-4B's unit tests beat those of GPT-4.1 and GPT-4o — despite the model having only 4B parameters.
  • Over SFT baselines: SFT on human-written unit tests reaches a Best-of-N score of only 0.458, and SFT on teacher-distilled reasoning+tests still lags behind. RL generalizes where SFT memorizes.
  • Unit test fidelity: UTRL-4B reaches 0.794, and UTRL-14B reaches 0.827 — matching or surpassing GPT-4.1 (0.800) and far above SFT-with-GT-tests (0.566).
Best-of-N improvement across unit test generators
Best-of-N selection using generated unit tests as evaluators. UTRL (green) dominates across 4 different code generators.
Comparison against SFT baselines
Head-to-head against SFT baselines. UTRL outperforms both SFT on ground-truth unit tests (D_UT) and SFT on teacher-distilled reasoning+tests (D_reason+UT) in both Best-of-N improvement and unit test fidelity — without ever using unit test labels.
Unit test fidelity comparison
Unit test fidelity (Spearman's R with GT evaluation). UTRL-trained models produce tests that rank code in the same order as rigorously curated GT tests, surpassing GPT-4.1.
Q2 · Code generator side-effect
Does the adversarial loop also train a useful code generator?

Yes — it reaches parity with RL that uses ground-truth unit tests. The code generator trained against UTRL's evolving tests hits 15.3% pass@1, essentially matching the 15.9% upper bound from RL with human-written tests.

  • RL with GPT-4o-generated tests saturates under 12%. A strong-but-static test distribution is not enough — adaptive tests are what drive continued improvement.
  • SFT on ground-truth code collapses to 3.6% pass@1 on unseen eval tasks, showing that direct imitation overfits while adversarial RL generalizes.
  • This means UTRL is a viable path to training code generators even in domains where high-quality unit tests are infeasible to collect.
Pass@1 accuracy of code generators
UTRL code generator (green) nearly matches RL with GT unit tests (yellow), and clearly beats both GPT-4o-test RL (blue) and SFT (dashed).
Q3 · Ablation on validity reward
What happens without the validity reward — or without its clipping?

Two different failure modes. Validity design matters at least as much as the discrimination signal itself.

  • Drop R_valid entirely: the generator emits tests where >50% are functionally invalid. Discrimination reward stays high but tests are unreliable as evaluators.
  • Keep R_valid but remove clipping: the unit test generator collapses — it learns to emit only the 2–3 trivial test cases copied from the prompt, which maximizes the validity ratio. Coverage dies.
  • Clipped validity (τ = 12): forces the model to keep producing enough tests that a high validity ratio actually reflects real reasoning effort.
Ablation on validity reward
Effect of the validity reward and its clipped denominator. Without R_valid, the generated tests are riddled with invalid cases; without clipping, the model collapses to just a few trivial tests. The full UTRL design keeps both count and validity high.
Q4 · Iterative training
Does alternating between the two generators keep producing gains?

Yes — iteration 2 produces a harder curriculum. In iteration 1, the test generator saturates against the initial code distribution. Training the code generator against those tests makes its outputs harder to distinguish from ground truth, which reopens the discrimination problem for the next iteration.

  • Iteration 1 saturates. Trained against the initial code generator, the discrimination reward plateaus around 0.626 after ~50 steps with only a ~0.02 gain — the test generator has learned everything it can from easy code.
  • Iteration 2 starts harder. After the code generator is updated via RL against iteration 1's tests, its outputs are substantially closer to ground truth. Re-evaluating on this new distribution, the discrimination reward drops from 0.626 to 0.375 — a real, measurable increase in difficulty.
  • The test generator recovers — on a harder task. Continued training lifts the reward from 0.375 to 0.447, and the resulting iteration-2 tests yield better Best-of-N performance than iteration 1, even surpassing GPT-4.1.
Effect of iterative training on discrimination reward
Discrimination reward across iterations. Iteration 1 saturates quickly; after updating the code generator, iteration 2 starts from a lower reward (harder task) and climbs meaningfully — evidence that the adversarial loop is generating genuine curriculum, not just more compute.
Q5 · Does UTRL work on other LLM families?
It also works when applied to Llama-3.1-8B-Instruct

Yes. UTRL effectively improves Llama-3.1-8B-Instruct with only 50 training steps, lifting its unit test quality well above the instruction-tuned baseline.

  • Best-of-N score on Qwen3-8B code: 0.380 → 0.494 (+30%).
  • Code accuracy: 7.7% → 11.9% on Qwen3-8B, 7.8% → 11.6% on Qwen3-4B.
  • The framework is not tied to a specific base model — it relies only on instruction–code pairs and a verifiable reward.

04 · Why this matters


Unit tests are the dominant verifiable reward in code-focused RL, but they have been a bottleneck — their curation cost limits how far we can scale RL-for-code. UTRL shows that high-quality tests can emerge from the instruction–code data we already have, via adversarial self-play. The same loop also trains a strong code generator as a byproduct.

For broader deployment, the recipe is: (i) collect instruction–code pairs (easy), (ii) run UTRL, (iii) use the resulting unit test generator as a verifier for further RL — or as a standalone code evaluator.
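Step (iii) amounts to treating the trained test generator as a reward function. A hedged sketch, with every name a hypothetical placeholder rather than a real API:

def verifier_reward(instruction, candidate_code, test_generator, run_test):
    """Use a UTRL-trained test generator as a verifiable reward (sketch)."""
    tests = test_generator.sample(instruction)   # hypothetical API
    if not tests:
        return 0.0
    # Fraction of generated test cases the candidate passes; plug this in as
    # the RL reward, or report it directly as a standalone evaluation score.
    return sum(run_test(candidate_code, t) for t in tests) / len(tests)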

05 · Citation


@article{lee2025learning,
  title   = {Learning to Generate Unit Test via Adversarial Reinforcement Learning},
  author  = {Lee, Dongjun and Hwang, Changho and Lee, Kimin},
  journal = {arXiv preprint arXiv:2508.21107},
  year    = {2025}
}