New York, New York, United States

Staff Software Engineer - Agent Runtime & Infrastructure

Job Description:

The Role

You'll own two critical workstreams: the agent runtime and backend infrastructure that power every trade our agent fleet makes, and the migration of model hosting and agent deployment to fully in-house infrastructure. This is staff-level ownership, from architecture through to 3am incident response.

What You'll Build

Agent Runtime & Backend (50%)

  • Plugin runtime — per-agent position tracking, trailing stop execution, and DSL state management
  • Scanner gateway and rules engine — YAML-configurable evaluation layer between signals and execution
  • Centralised profit-trailing service — protecting open positions even when agents are offline
  • Execution layer — the MCP server bridging agents to 48+ platform tools, including position creation, market data, and exchange state
  • Real-time data pipelines — enriched intelligence flowing through Redis, Postgres, and ClickHouse

Model & Agent Hosting Migration (30%)

  • Migrate agent deployment to fully owned infrastructure — isolated workspaces, cron scheduling, state persistence, and one-command skill deployment
  • Lead the move from external LLM APIs to self-hosted inference — own the decision and the execution
  • Build agent telemetry to capture every trade decision, scanner evaluation, and signal score across the fleet
  • Zero-downtime CI/CD pipelines for shipping updates to 50+ live agents without exposing open positions

Infrastructure & Operations (20%)

  • Monitoring and alerting for agent failures, orphaned positions, and state corruption
  • Cloud infrastructure management on AWS/EKS with infrastructure-as-code
  • Own incident response — in a live trading system, every minute of downtime is real capital at risk

What You Bring

Must-haves:

  • Strong production backend engineering in at least two of: Go, Python, Node.js/TypeScript — Go preferred
  • Experience building backend services from scratch — APIs, job scheduling, state management, distributed systems
  • Solid understanding of real-time, low-latency systems — WebSockets, sub-second evaluation, condition-based triggers
  • Production experience with Postgres, Redis, and an analytics DB such as ClickHouse or BigQuery
  • Kubernetes experience — deploying, scaling, and debugging on AWS EKS
  • You have owned a system end-to-end — designed, built, deployed, operated, and fixed it under pressure

Strong plus:

  • Experience with LLM infrastructure — model serving, inference optimisation, vLLM, TGI, or managed endpoints
  • Background in trading systems, exchange APIs, or fintech where uptime has direct financial consequences
  • Onchain infrastructure experience — wallet operations, RPC nodes, DEX integration
  • Experience building multi-agent platforms or CI/CD pipelines for live trading systems

This is not a DevOps role. You'll spend 80% of your time writing code that ships to production — because at our stage, the best person to operate a system is the person who built it. If you are a backend engineer who wants to build the foundational infrastructure for a new category of autonomous financial software, this is your role.

Required Skills:

Ethereum, Solidity