Site Reliability Engineer (SRE)

Role Summary

As a Site Reliability Engineer at Nulogy, you will plan, build and maintain highly performant and available infrastructure for our industry-leading suite of solutions. Our supply chain customers rely on our applications to make critical, real-time decisions on the manufacturing shop floor. This requires us to deliver our services with the highest levels of availability, scalability and security.

Our company maintains Elite-level DORA metrics through infrastructure as code and CI/CD pipelines. The SRE team works closely with our Product Development teams to design and implement new infrastructure projects, optimize existing infrastructure for performance and cost, ensure fast build and deployment times, and minimize downtime.

Role and Responsibilities

Maintain Kubernetes clusters, including the Helm charts that drive daily deployments.
Oversee monitoring/metrics (performance and financial) infrastructure.
Improve our DR architecture and orchestrate regular DR drills.
Assist engineering teams with critical issues as they arise.
Recommend cost reductions (in AWS), and help identify and address scalability issues across the platform as a whole.
Recommend security enhancements for triage/risk assessment.
Optimize Nulogy's CI/CD infrastructure, test and deployment processes, and performance bottlenecks in deployments (which happen multiple times any given workday).
Maintain and improve the structure of all of Nulogy's infrastructure as code (e.g. Terraform, CloudFormation, Helm).
Propose projects that provide value and improve infrastructure and practices.
Participate in team on-call rotation.
Write and maintain documentation such as references, how-to guides, architectural decision records, and tutorials.

Experience and Skill Requirements

5 years of professional experience in AWS (particularly EKS, RDS, S3, Kafka, GuardDuty, in addition to general service-agnostic principles).
Expertise in Kubernetes support and maintenance, covering both the cluster/infrastructure itself and the work needed to ensure optimal pod health.
Expertise in tactics to sustain optimal performance of relational databases such as PostgreSQL or MySQL.
Expertise in building CI/CD pipelines (currently Buildkite).
Expertise in working with infrastructure as code (e.g. Terraform), including best practices and configuration maintenance.
Expertise in working with Docker/containers.
Familiarity with, if not expertise in, security best practices (e.g. OWASP).
Experience in audited environments such as SOC 2, ISO 27001, PCI, HITRUST, etc.
Experience with developer tooling (shell scripting, optimizing local developer environments, etc.).
Aptitude for working in a mission-critical computing space (i.e. being methodical, data-driven and detail-oriented).