Senior Staff Software Engineer, Reliability

Buenos Aires, Argentina

About the job

About hireworks
hireworks is building a community of top talent in key international markets by unlocking unparalleled access to positions at leading U.S.-based companies. As your employer, hireworks will ensure you have a seamless interview, onboarding, and employee experience, providing ongoing support and resources along the way. Established in 2023, hireworks is forging corp-to-corp relationships with leading U.S.-based organizations looking to grow their teams with best-in-class talent around the world. Working with hireworks means unlocking access to a network of local peers and mentors, as well as career opportunities through our client network.

About our client
Our client is building artificial intelligence to make the physical world more responsive. The company is pioneering what it calls the Recognition Economy, a future where repetitive tasks disappear and being recognized unlocks seamless access, comfort, and personalized experiences across everyday environments. From transforming parking into a frictionless drive-in, drive-out experience for millions of users to expanding its intelligence layer across industries such as retail and hospitality, the company is developing technology that makes real-world interactions more intuitive and efficient. As the organization continues to grow, it is looking for builders, innovators, and problem solvers who want to help shape the next generation of intelligent infrastructure for physical spaces.

Position Overview
Our client is seeking a Senior Staff Software Engineer focused on Reliability to own reliability across their entire platform and drive the comprehensive practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure. In this role, you will build reliability from first principles, architecting failover systems, implementing chaos engineering, and improving our observability foundation to maintain 99.9%+ uptime as we scale to new markets.

As the technical owner of our reliability posture, you will tackle challenges like external service failover, dependency mirroring, and database replication, working alongside highly technical teams across the organization to influence architecture decisions and establish company-wide reliability standards. You will join the Product Foundations team, playing a key role in building the foundational infrastructure that powers the future of mobility commerce.

What You'll Do

Own the overall reliability posture for the platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic failover mechanisms for critical external dependencies, such as Twilio for SMS/voice and Stripe for payments, with circuit breakers, retry policies, and degraded-mode operations
Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation
Implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies while building service health dashboards that show customer impact
Own the incident management process, including workflows, tooling, post-mortem culture, and runbook automation, and lead initiatives to reduce mean time to recovery (MTTR) from detection to resolution
Drive adoption of resilience patterns across all services including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

About You

8+ years of engineering experience including software engineering, reliability engineering, SRE practices, or production operations at scale
Demonstrate expert-level reliability engineering skills including hands-on experience with multi-region architectures, failover automation, circuit breakers, chaos engineering, and disaster recovery
Bring production observability expertise, with deep experience implementing monitoring, alerting, tracing, and logging systems at scale, specifically with Datadog or similar APM platforms in high-load environments
Apply strong systems thinking with proven ability to design resilient distributed systems that gracefully handle failures, network partitions, and external dependency outages
Demonstrate database and data systems knowledge including replication strategies, backup/restore procedures, connection pooling, query optimization, and experience with both relational and NoSQL databases
Leverage cloud platform expertise with production experience operating and ensuring reliability of systems on AWS including multi-region deployments, load balancing, and DNS-based failover
Possess experience using AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools to enhance productivity, with context engineering in particular
Exhibit excellent technical communication with ability to influence technical decisions across teams, document complex systems, conduct post-mortems, and establish reliability standards organization-wide
Demonstrate expert-level Java and/or Scala proficiency with strong understanding of JVM performance, concurrency, and operational characteristics

Our Stack

Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)
Datastores: MySQL, PostgreSQL, Snowflake
Cloud: AWS
Version control: Git & GitHub
AI Tooling: Copilot on GitHub
Observability: Datadog

Benefits

hireworks is cultivating a growing community of top talent across Colombia, Argentina, Brazil and Bulgaria. In addition to unlocking access to positions at top-tier U.S.-based companies, we offer a variety of benefits to enhance your experience:

Competitive Pay – compensation that reflects your experience and accomplishments.
Remote Flexibility – work from anywhere within your local country (Colombia, Argentina or Brazil), with the option to use co-working space as available locally.
Paid Time Off – ample vacation days to rest and recharge.
Public Holidays – all local federal holidays are fully paid days off.