Job Opening: Site Reliability Engineer (SRE)

***W2 only***

Position: Site Reliability Engineer (SRE)

Work Authorization: All Work Authorizations

Location: Reston, VA

Contract: 24 months

Description: Site Reliability Engineer (SRE) roles and responsibilities

The SRE role bridges the Development Engineer role and the Production Engineer role with a mixture of development, test, deploy, and support skills that contribute to application reliability and resiliency. The SRE approaches problems as an Engineer and looks to automate processes with code or tools to detect and prevent identified software reliability issues. The SRE role splits time between runtime support issues (toil) and development automation work (dev). The skills are organized by Development, Support, and Common areas.

Software Development and Configuration

The following SRE skills are used to improve reliability of an application/service while it is in development:

Required Skills:

  • 8-10 years overall experience
  • Hands-On experience in at least one language: Java (required); Python (3-4 years, a working understanding rather than active coding)
  • Hands-On experience with automated testing tools (JMeter, JUnit, Mockito, Postman)
  • Hands-On experience with a source code management system such as Git or SVN, including pull, push, branch, commit, and merge functions
  • Hands-On experience creating, configuring, and maintaining cloud-based applications and infrastructure for the rapid development and monitoring of applications and services on AWS (EC2, Fargate, CloudFormation, RDS, ElastiCache, S3)
  • Experience with cloud migrations with reliability and availability as the core focus
  • Experience implementing SRE at the team or enterprise level, including hands-on adoption of SRE practices and improvement of the associated metrics
  • Hands-On experience with monitoring tools (Splunk, Dynatrace), including development and customization of dashboards
  • Hands-On experience with the build, deploy, and packaging process and its best practices. Familiarity with DevOps automation tools (UCD, Jenkins, Maven, SonarQube, Chef, Ansible, Puppet)
  • Scripting skills for automation (Linux bash and Windows)

General Required Skills:

  • Ability to diagnose and optimize software code for reliability and resiliency
  • Knowledge of the incident management process and reporting tools (ServiceNow, Jira Service Desk)
  • Good communication and documentation skills. An SRE must document their work, collect and document tribal knowledge (the good stuff in people's heads), and make it accessible to others.
  • Experience triaging incidents and conducting RCAs (Root Cause Analysis)

Nice-to-Have Skills:

  • Ability to diagnose technical problems, isolate and debug issues, formulate creative solutions, analyze alternative approaches, and implement a timely solution.
  • Experience providing alternatives and estimates for implementing a fix or automation to improve reliability.
  • Ability to juggle several tasks at a time and to adjust frequently to new or higher-priority tasks.
  • Experience with a modern RDBMS or NoSQL database, such as Postgres, MySQL, DB2, Oracle, MongoDB, or Cloudant