Running FLARE on FormulationBench¶
FLARE consumes pairs of MILP formulations — a markdown description and a runnable Gurobi script per formulation — and decides whether one is a constructive reformulation of the other. The FormulationBench dataset ships both artifacts for every formulation, so it is the easiest way to drive FLARE end-to-end.
Prerequisites¶
formulation-benchandmilp-flareinstalled in the same environment.Docker running and the
flare-agent:latestimage built (see Docker).A harness credential available on the host (see Agent Harnesses).
Verifying a dataset pair¶
from pathlib import Path
from formulation_bench import Dataset
from milp_flare import FLARE, FormulationInput
from milp_flare.harness import ClaudeCodeHarness
ds = Dataset.load()
p1 = ds.problem("p1")
a = p1.formulation("a")
b = p1.formulation("b")
harness = ClaudeCodeHarness(model="claude-opus-4-7", effort="medium")
flare = FLARE(harness=harness)
a_in = FormulationInput(
formulation_md=a.render_markdown(), solve_py=a.gen_solve_py()
)
b_in = FormulationInput(
formulation_md=b.render_markdown(), solve_py=b.gen_solve_py()
)
result = flare.verify(a_in, b_in, output_path=Path("runs/p1_a_b"))
print("is_reformulation:", result.is_reformulation)
print("duration_s:", result.duration_s)
print("cost_usd:", result.cost_usd)
FormulationInput carries the two artifacts the agent needs.
Formulation.render_markdown() and Formulation.gen_solve_py()
produce them directly from the dataset — see the
FormulationBench API reference.
Lean definitions¶
FLARE writes Lean files against the same MILPFormulation /
MILPReformulation structures used by FormulationBench. They are
documented in the FormulationBench docs:
A copy of Common.lean (and a minimal Lake skeleton) is bundled with
milp_flare and copied into the agent working directory at runtime.
Inspecting the run artifacts¶
output_path is populated with everything FLARE produced:
runs/p1_a_b/
├── config.json # Harness + model configuration
├── result.json # Final verdict, token usage, cost
└── wd/ # Agent working directory (bind-mounted into the container)
├── A/
│ ├── formulation.md # Input written by FLARE
│ ├── solve.py # Input written by FLARE
│ └── Formulation.lean # Output written by the agent
├── B/
│ ├── formulation.md
│ ├── solve.py
│ └── Formulation.lean
├── Reformulation.lean # The proof produced by the agent
├── agent_output.jsonl # Stream of agent turns (tail -f live)
└── compile_log.txt # Output of the post-hoc Lean compile
result.json records the individual sub-checks behind the verdict:
whether each Formulation.lean was written and compiled, whether
Reformulation.lean contains a def : MILPReformulation, whether it
compiled, and whether it is sorry-free.
Using FLARE on a non-dataset pair¶
FormulationInput does not depend on formulation_bench — build the
two inputs yourself from any markdown + solve.py pair and pass them
to FLARE.verify. The markdown should follow the formulation template
documented in
FormulationBench.