Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate unit test generation, yet methods for training LLMs to produce high-quality unit tests remain underexplored.
High-quality tests must do more than execute successfully: they should detect subtle faults and reliably separate better code solutions from near-correct ones. Therefore, unit test generation is an open-ended task with no single fixed answer. Moreover, while code solutions can be collected at large scale from diverse LLM instruction-tuning datasets, high-quality unit tests are typically not available at scale.
If we can reliably define rewards for arbitrary unit tests, reinforcement learning (RL) becomes a better fit for training LLMs to generate unit tests than imitating reference unit tests. Based on this idea, we propose UTRL, an adversarial reinforcement learning framework in which a unit test generator and a code generator are trained to provide reliable reward signals to each other.
Our key insight is to define rewards for unit test generation based on the relationship between unit test generation and code generation.
From the perspective of code generation, the generated code should pass every test case in the unit test. Conversely, from the perspective of unit test generation, the generated unit test should be able to detect subtle errors in arbitrary code solutions, and at the same time each test case in the unit test should be functionally correct and executable.
Inspired by this relationship, we train the unit test generator and the code generator in an adversarial manner, where (1) the unit test generator is trained to generate unit tests that effectively reveal faults in the code solutions produced by the code generator; and (2) the code generator is trained to maximize a code reward, encouraging it to produce solutions that pass the unit tests generated by the unit test generator. In the following section, we provide details of the rewards for training the unit test generator, where we use the weighted sum of two reward terms: a discrimination reward and a validity reward.
This reward measures whether generated unit tests can distinguish generated code solutions (sampled from a code generator LLM) from ground-truth solutions. In short, it quantifies how well tests expose subtle errors across multiple candidate code solutions generated by code generator LLMs.
The figure below illustrates how the discrimination reward is computed for a generated unit test. The unit test consists of multiple test cases (five in the example), and each test case is executed against the ground-truth code solution C*. A test case that passes under the ground-truth solution (test cases 1, 3, and 5 in the example) is considered valid; otherwise it is considered faulty (test cases 2 and 4 in the example) and is discarded from the discrimination reward computation. The discrimination reward then counts how many LLM-generated code solutions (C1 ~ C6 in the example) fail at least one valid test case (test case 1, 3, or 5). In the figure, since C1, C3, C4, and C5 each fail at least one valid test case, the discrimination reward is 4/6.
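The computation described above can be sketched in a few lines of Python. Here `run_test` is a hypothetical helper (not part of the paper) that returns `True` when a code solution passes a test case; the demo at the bottom reproduces the figure's example.

```python
def discrimination_reward(test_cases, gt_solution, candidate_solutions, run_test):
    """Fraction of candidate solutions that fail at least one valid test case.

    run_test(solution, test_case) -> True iff the solution passes the test case.
    """
    # Keep only test cases that pass on the ground-truth solution.
    valid = [t for t in test_cases if run_test(gt_solution, t)]
    if not valid or not candidate_solutions:
        return 0.0
    failed = sum(
        any(not run_test(c, t) for t in valid) for c in candidate_solutions
    )
    return failed / len(candidate_solutions)


# Toy reproduction of the figure: C* validates tests 1, 3, 5;
# C2 and C6 pass all valid tests, every other candidate fails test 1.
def run_test(sol, t):
    if sol == "C*":
        return t in {1, 3, 5}
    if sol in {"C2", "C6"}:
        return True
    return t != 1

reward = discrimination_reward(
    [1, 2, 3, 4, 5], "C*", ["C1", "C2", "C3", "C4", "C5", "C6"], run_test
)  # 4/6, matching the figure
```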
This reward measures how many generated test cases are functionally correct. It is computed by executing each test case in the generated unit test against the ground-truth code solution and checking whether it runs successfully without runtime errors or failed assertions. This reward encourages the unit test generator to produce functionally correct and executable test cases, which are essential for reliable code evaluation.
However, we observe that defining the validity reward as the ratio of valid test cases to all test cases biases the model toward generating unit tests with only a small number of valid test cases (e.g., the 2~3 trivial test cases provided in the problem description). We therefore introduce a clipping term in the validity reward. Specifically, we clip the denominator of the validity reward so that unit tests with fewer than \(\tau\) test cases receive a validity reward proportional to the absolute number of valid test cases, thereby preventing unit tests with a small number of trivial test cases from receiving high validity rewards.
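A minimal sketch of the clipped validity reward follows; `run_test` is again a hypothetical pass/fail helper, and the default `tau=5` is an illustrative value, not the paper's setting.

```python
def validity_reward(test_cases, gt_solution, run_test, tau=5):
    """Ratio of valid test cases, with the denominator clipped below by tau.

    Without clipping, 2 valid tests out of 2 would score 1.0; with tau=5 the
    same unit test scores 2/5, discouraging tiny trivial test suites while
    leaving suites with >= tau test cases unaffected.
    """
    n_valid = sum(run_test(gt_solution, t) for t in test_cases)
    return n_valid / max(len(test_cases), tau)
```

The `max(len(test_cases), tau)` denominator is exactly what makes the reward proportional to the absolute number of valid tests for short suites.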
Based on these rewards, we alternate between training the unit test generator and the code generator. The unit test generator is optimized to maximize the weighted sum of the discrimination and validity rewards, while the code generator is optimized to generate code solutions that pass the unit tests produced by the unit test generator.
By repeating these steps, the code generator produces higher-quality (near-perfect) code solutions, which in turn pose a more challenging discrimination task for the unit test generator, driving it to generate higher-quality unit tests.
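The alternation can be sketched as a simple loop. The `sample`/`update` interfaces, the reward callables, and the four candidate solutions per problem are all illustrative assumptions, not the paper's actual training API.

```python
def utrl_train(test_gen, code_gen, problems, test_reward, code_reward, rounds=3):
    """Alternating adversarial training sketch (hypothetical interfaces).

    test_gen.sample(p) / code_gen.sample(p) draw a unit test / code solution;
    *.update(p, output, reward) performs one RL (e.g. policy-gradient) step.
    test_reward folds the weighted sum of discrimination and validity rewards;
    code_reward scores a solution against the generated unit tests.
    """
    for _ in range(rounds):
        # (1) Train the unit test generator against the frozen code generator.
        for p in problems:
            tests = test_gen.sample(p)
            candidates = [code_gen.sample(p) for _ in range(4)]  # 4 is illustrative
            test_gen.update(p, tests, test_reward(p, tests, candidates))
        # (2) Train the code generator against the frozen test generator.
        for p in problems:
            code = code_gen.sample(p)
            tests = test_gen.sample(p)
            code_gen.update(p, code, code_reward(p, code, tests))
```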
Ultimately, the quality of a unit test should be judged by how well it captures subtly faulty code. Well-known proxy metrics such as Line Coverage and Mutation Score are useful, but they mainly reflect whether tests touch specific execution paths. They are limited in evaluating whether a test can catch highly subtle failures, such as code solutions that still pass more than 90% of oracle test cases. To enable a more thorough evaluation of generated unit tests, we propose two metrics for LLM-generated unit tests.
We sample candidate code solutions from diverse code LLMs (Qwen3-4B, 8B, 14B, GPT-4o), drawing 32 candidates per task. For each task, we use the generated unit tests to select a single best candidate, then evaluate the selected code with human-curated oracle unit tests. This metric captures how well generated tests identify high-quality solutions in practice.
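The selection-then-evaluation protocol can be sketched as follows, using a hypothetical pass/fail helper and representing a task as a (candidates, generated tests, oracle tests) triple; the "passes the most generated tests" tie-breaking rule is our simplifying assumption.

```python
def select_best(candidates, test_cases, run_test):
    """Return the candidate passing the most generated test cases (ties: first)."""
    return max(candidates, key=lambda c: sum(run_test(c, t) for t in test_cases))


def best_of_n_accuracy(tasks, run_test, run_oracle):
    """Fraction of tasks where the test-selected candidate passes all oracle tests.

    tasks: iterable of (candidates, generated_tests, oracle_tests) triples.
    """
    correct = 0
    for candidates, gen_tests, oracle_tests in tasks:
        best = select_best(candidates, gen_tests, run_test)
        correct += all(run_oracle(best, t) for t in oracle_tests)
    return correct / len(tasks)
```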
Unit Test Fidelity measures alignment between scores induced by generated unit tests and scores induced by human-curated unit tests. For each problem, we collect score vectors over hundreds of sampled code solutions, then compute Spearman's rank correlation \(\rho\) between the oracle-score vector and the generated-test score vector. Higher correlation indicates that generated tests induce code evaluation behavior closer to oracle tests.
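Fidelity thus reduces to a rank correlation between two score vectors. A stdlib-only sketch (equivalent to `scipy.stats.spearmanr`, computed as Pearson correlation on average ranks):

```python
def _ranks(xs):
    """1-based ranks, with ties assigned the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over the tie group
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Because only ranks matter, any monotone transformation of the scores leaves the fidelity unchanged.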
Please check our paper for more details.
@article{lee2025learning,
title={Learning to generate unit test via adversarial reinforcement learning},
author={Lee, Dongjun and Hwang, Changho and Lee, Kimin},
journal={arXiv preprint arXiv:2508.21107},
year={2025}
}