New York, New York, United States

Staff Software Engineer - Agent Runtime & Infrastructure

Job Description:

The Role

You'll own two critical workstreams: the agent runtime and backend infrastructure behind every trade our agent fleet places, and the migration of model hosting and agent deployment to fully in-house infrastructure. This is staff-level ownership, from architecture through 3am incident response.

What You'll Build

Agent Runtime & Backend (50%)

Plugin runtime — per-agent position tracking, trailing stop execution, and DSL state management
Scanner gateway and rules engine — YAML-configurable evaluation layer between signals and execution
Centralised profit-trailing service — protecting open positions even when agents are offline
Execution layer — the MCP server bridging agents to 48+ platform tools, including position creation, market data, and exchange state
Real-time data pipelines — enriched intelligence flowing through Redis, Postgres, and ClickHouse

Model & Agent Hosting Migration (30%)

Migrate agent deployment to fully owned infrastructure — isolated workspaces, cron scheduling, state persistence, and one-command skill deployment
Lead the move from external LLM APIs to self-hosted inference — own the decision and the execution
Build agent telemetry to capture every trade decision, scanner evaluation, and signal score across the fleet
Zero-downtime CI/CD pipelines for shipping updates to 50+ live agents without exposing open positions

Infrastructure & Operations (20%)

Monitoring and alerting for agent failures, orphaned positions, and state corruption
Cloud infrastructure management on AWS/EKS with infrastructure-as-code
Own incident response: in a live trading system, every minute of downtime puts real capital at risk

What You Bring

Must-haves:

Strong production backend engineering experience in at least two of: Go, Python, Node.js/TypeScript (Go preferred)
Experience building backend services from scratch — APIs, job scheduling, state management, distributed systems
Solid understanding of real-time, low-latency systems — websockets, sub-second evaluation, condition-based triggers
Production experience with Postgres, Redis, and an analytics DB such as ClickHouse or BigQuery
Kubernetes experience — deploying, scaling, and debugging on AWS EKS
You have owned a system end-to-end — designed, built, deployed, operated, and fixed it under pressure

Strong plus:

Experience with LLM infrastructure — model serving, inference optimisation, vLLM, TGI, or managed endpoints
Background in trading systems, exchange APIs, or fintech where uptime has direct financial consequences
Onchain infrastructure experience — wallet operations, RPC nodes, DEX integration
Experience building multi-agent platforms or CI/CD pipelines for live trading systems

This is not a DevOps role. You'll spend 80% of your time writing code that ships to production — because at our stage, the best person to operate a system is the person who built it. If you are a backend engineer who wants to build the foundational infrastructure for a new category of autonomous financial software, this is your role.

Required Skills:

Ethereum, Solidity