Running FLARE on FormulationBench¶

FLARE consumes pairs of MILP formulations — a markdown description and a runnable Gurobi script per formulation — and decides whether one is a constructive reformulation of the other. The FormulationBench dataset ships both artifacts for every formulation, so it is the easiest way to drive FLARE end-to-end.

Prerequisites¶

formulation-bench and milp-flare installed in the same environment.
Docker running and the flare-agent:latest image built (see Docker).
A harness credential available on the host (see Agent Harnesses).

Verifying a dataset pair¶

from pathlib import Path

from formulation_bench import Dataset
from milp_flare import FLARE, FormulationInput
from milp_flare.harness import ClaudeCodeHarness

ds = Dataset.load()
p1 = ds.problem("p1")
a = p1.formulation("a")
b = p1.formulation("b")

harness = ClaudeCodeHarness(model="claude-opus-4-7", effort="medium")
flare = FLARE(harness=harness)

a_in = FormulationInput(
    formulation_md=a.render_markdown(), solve_py=a.gen_solve_py()
)
b_in = FormulationInput(
    formulation_md=b.render_markdown(), solve_py=b.gen_solve_py()
)

result = flare.verify(a_in, b_in, output_path=Path("runs/p1_a_b"))

print("is_reformulation:", result.is_reformulation)
print("duration_s:", result.duration_s)
print("cost_usd:", result.cost_usd)

FormulationInput carries the two artifacts the agent needs. Formulation.render_markdown() and Formulation.gen_solve_py() produce them directly from the dataset — see the FormulationBench API reference.

Lean definitions¶

FLARE writes Lean files against the same MILPFormulation / MILPReformulation structures used by FormulationBench. They are documented in the FormulationBench docs:

A copy of Common.lean (and a minimal Lake skeleton) is bundled with milp_flare and copied into the agent working directory at runtime.

Inspecting the run artifacts¶

output_path is populated with everything FLARE produced:

runs/p1_a_b/
├── config.json            # Harness + model configuration
├── result.json            # Final verdict, token usage, cost
└── wd/                    # Agent working directory (bind-mounted into the container)
    ├── A/
    │   ├── formulation.md       # Input written by FLARE
    │   ├── solve.py             # Input written by FLARE
    │   └── Formulation.lean     # Output written by the agent
    ├── B/
    │   ├── formulation.md
    │   ├── solve.py
    │   └── Formulation.lean
    ├── Reformulation.lean       # The proof produced by the agent
    ├── agent_output.jsonl       # Stream of agent turns (tail -f live)
    └── compile_log.txt          # Output of the post-hoc Lean compile

result.json records the individual sub-checks behind the verdict: whether each Formulation.lean was written and compiled, whether Reformulation.lean contains a def : MILPReformulation, whether it compiled, and whether it is sorry-free.

Using FLARE on a non-dataset pair¶

FormulationInput does not depend on formulation_bench — build the two inputs yourself from any markdown + solve.py pair and pass them to FLARE.verify. The markdown should follow the formulation template documented in FormulationBench.