Senior Staff Software Engineer, Reliability

About the job

About hireworks
hireworks is building a community of top talent in key international markets by unlocking unparalleled access to positions at leading U.S.-based companies. As your employer, hireworks will ensure you have a seamless interview, onboarding, and employee experience, providing ongoing support and resources along the way. Established in 2023, hireworks is forging corp-to-corp relationships with leading U.S.-based organizations looking to grow their teams with best-in-class talent around the world. Working with hireworks means gaining access to a network of local peers and mentors, as well as career opportunities through our client network.

About our client
Our client is building artificial intelligence to make the physical world more responsive. The company is pioneering what it calls the Recognition Economy, a future where repetitive tasks disappear and being recognized unlocks seamless access, comfort, and personalized experiences across everyday environments. From transforming parking into a frictionless drive-in, drive-out experience for millions of users to expanding its intelligence layer across industries such as retail and hospitality, the company is developing technology that makes real-world interactions more intuitive and efficient. As the organization continues to grow, it is looking for builders, innovators, and problem solvers who want to help shape the next generation of intelligent infrastructure for physical spaces.

Position Overview
Our client is seeking a Senior Staff Software Engineer focused on Reliability to own reliability across their entire platform and drive the practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure. In this role, you will build reliability from first principles, architecting failover systems, implementing chaos engineering, and improving our observability foundation to maintain 99.9%+ uptime as we scale to new markets.

As the technical owner of our reliability posture, you will tackle challenges like external service failover, dependency mirroring, and database replication, working alongside highly technical teams across the organization to influence architecture decisions and establish company-wide reliability standards. You will join the Product Foundations team, playing a key role in building the foundational infrastructure that powers the future of mobility commerce.

What You'll Do

  • Own the overall reliability posture for the platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
  • Design and implement automatic failover mechanisms for critical external dependencies, such as Twilio for SMS/voice and Stripe for payments, using circuit breakers, retry policies, and degraded-mode operations
  • Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
  • Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation
  • Implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies while building service health dashboards that show customer impact
  • Own the incident management process, including workflows, tooling, post-mortem culture, runbook automation, and initiatives to reduce mean time to recovery (MTTR) from detection to resolution
  • Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
  • Build and maintain local mirrors for critical dependencies with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

About You

  • 8+ years of engineering experience including software engineering, reliability engineering, SRE practices, or production operations at scale
  • Demonstrate expert-level reliability engineering skills including hands-on experience with multi-region architectures, failover automation, circuit breakers, chaos engineering, and disaster recovery
  • Bring production observability expertise, with deep experience implementing monitoring, alerting, tracing, and logging systems at scale, specifically with Datadog or similar APM platforms in high-load environments
  • Apply strong systems thinking with proven ability to design resilient distributed systems that gracefully handle failures, network partitions, and external dependency outages
  • Demonstrate database and data systems knowledge including replication strategies, backup/restore procedures, connection pooling, query optimization, and experience with both relational and NoSQL databases
  • Leverage cloud platform expertise with production experience operating and ensuring reliability of systems on AWS including multi-region deployments, load balancing, and DNS-based failover
  • Possess experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity – context engineering in particular
  • Exhibit excellent technical communication with ability to influence technical decisions across teams, document complex systems, conduct post-mortems, and establish reliability standards organization-wide
  • Demonstrate expert-level Java and/or Scala proficiency with strong understanding of JVM performance, concurrency, and operational characteristics

Our Stack

  • Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)
  • Datastores: MySQL, PostgreSQL, Snowflake
  • Cloud: AWS
  • Version control: Git & GitHub
  • AI Tooling: GitHub Copilot
  • Observability: Datadog

Benefits
hireworks is cultivating a growing community of top talent across Colombia, Argentina, Brazil, and Bulgaria. In addition to unlocking access to positions at top-tier U.S.-based companies, we offer a variety of benefits to enhance your experience:
  • Competitive Pay – compensation that reflects your experience and accomplishments.
  • Remote Flexibility – work from anywhere within your local country (Colombia, Argentina or Brazil), with the option to use co-working space as available locally.
  • Paid Time Off – ample vacation days to rest and recharge.
  • Public Holidays – all local federal holidays are fully paid days off.