A benchmark framework for evaluating how well language models create structured plans and faithfully execute them step by step.