Now Recruiting
Infra Software Engineering Expert
About Polymath
Polymath is an applied research lab building the simulation environments that the next generation of AI agents will be trained and evaluated in. We partner with leading model labs to push the frontier of long-horizon agent capabilities — environments where agents must plan, use tools, recover from errors, and work autonomously for hours or days at a time.
We recently announced our $8M seed round, led by Base10 with participation from Y Combinator and a roster of angels. We're a small team of researchers, engineers, and operators, and we ship.
About the role
You'll design and build reinforcement learning environments and tasks that push frontier coding models past their current limits. Polymath partners with senior practitioners to author the high-fidelity scenarios AI systems train on and are evaluated against; in this role, that means standing up realistic software systems (containers, clusters, networks, services) and turning them into well-scoped tasks an agent can attempt, fail at, and learn from. Engagements range from authoring a single environment to longer-running collaborations on task suites and grading rubrics.
The interesting judgment calls here are technical and pedagogical at once. What does a non-trivial Kubernetes debugging task look like when the agent has shell access? How do you instrument a distributed system so success and failure are unambiguous? Where's the line between a task that's tractable and one that's actually hard? You'll be drawing on real systems engineering experience — Docker, Kubernetes, Terraform, networking, infra — to construct problems that look like the work senior engineers actually do.
Responsibilities
- Design and build RL environments and tasks that target real software engineering and systems work
- Containerize and orchestrate the underlying services, networks, and infrastructure each task depends on
- Specify clear success criteria, failure modes, and grading signals for agent attempts
- Stress-test your own environments against current frontier coding models and iterate on difficulty
- Collaborate with Polymath researchers and other domain experts on task suites and rubric design
- Document environments and tasks so other engineers and graders can extend them
Who we're looking for
You're a good fit if you:
- Have substantial professional experience in software engineering and the full software development lifecycle
- Are fluent with containerization, networking, and infrastructure as everyday tools
- Have hands-on depth with Docker, Kubernetes, and Terraform
- Bring a systems engineering mindset — comfortable reasoning across services, hosts, and network boundaries
- Can scope a hard technical problem into a task with crisp inputs, outputs, and grading signals
- Are interested in how frontier coding models actually behave, and in building environments that expose their weaknesses
Nice-to-haves
- Depth in performance engineering or distributed systems
- Background in cybersecurity, including offensive or defensive tooling
- DevOps or SRE experience running production systems at scale
- ML engineering experience, especially around training or evaluating large models
- Operating systems internals expertise