Production Support/Management
Job Description:
About the Role:
We are seeking a highly motivated Production Support Engineer with 2+ years of experience to ensure the continuous and efficient operation of our production systems. In this role, you will be responsible for monitoring, troubleshooting, and resolving production issues in real time, as well as improving the overall stability and performance of our services.
You will work closely with development, QA, and operations teams to address incidents, identify root causes, and implement long-term solutions. If you thrive in high-pressure environments and enjoy problem-solving, this could be a perfect fit for you.
Key Responsibilities:
- Monitor the health, performance, and availability of production systems and services
- Diagnose and resolve production issues quickly, minimizing downtime and impact on end-users
- Provide on-call support for production incidents and manage issue escalation as necessary
- Collaborate with development teams to investigate root causes of production issues and propose solutions
- Perform system health checks and regular system maintenance tasks to ensure optimal performance
- Implement monitoring tools and alerting systems to proactively identify potential issues before they impact users
- Deploy bug fixes, patches, and system upgrades in production environments
- Document issues, resolution steps, and operational procedures for knowledge sharing
- Assist in post-incident reviews and implement improvements based on lessons learned
- Help implement change management processes to ensure smooth and controlled deployments
- Ensure adherence to SLAs (Service Level Agreements) for incident response and resolution times
Qualifications:
Required:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field
- 2+ years of experience in production support or operations management in a tech environment
- Familiarity with Linux/Unix or Windows server administration
- Strong experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Nagios, New Relic)
- Ability to work with log aggregation and analysis tools (e.g., ELK Stack, Splunk)
- Proficiency in troubleshooting application, infrastructure, and network issues
- Experience with databases (e.g., MySQL, PostgreSQL, MongoDB)
- Knowledge of incident management tools (e.g., JIRA, ServiceNow)
- Strong understanding of cloud platforms (e.g., AWS, Azure, GCP) and cloud infrastructure
- Familiarity with CI/CD pipelines and deployment automation tools
Preferred:
- Experience in automation and scripting (e.g., Python, Bash or other shell scripting)
- Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes
- Experience in load balancing, scaling, and disaster recovery practices
- Knowledge of ITIL or other IT operations frameworks
- Experience in release management and deployment strategies
Required Skills:
Production Support, GCP, Operations, Incident Management, Disaster Recovery, CI/CD, Shell Scripting, Analysis, Splunk, Steps, Escalation, Pipelines, Lessons, Root, ServiceNow, Azure, ITIL, Bash, Load, Checks, Windows Server, Operations Management, Unix, AWS, Reviews, Change Management, Kubernetes, Infrastructure, Availability, Automation, PostgreSQL, Information Technology, MongoDB, Databases, Docker, Linux, Computer Science, Troubleshooting, Windows, Administration, JIRA, MySQL, Maintenance, Engineering, Python, Science, Management