About the job
Senior Data Engineer
Skills we are looking for, with examples of day-to-day tasks
Linux system administration - Managing virtual machines on-premises and in the cloud, from initial provisioning and automation to optimisation for specific workloads, for example disabling swap for Kubernetes or tuning the kernel for Elasticsearch. Monitoring, debugging and network management are also vital to your success as a data DevOps engineer, since terabytes of data flow through our systems daily.
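To give a flavour of that tuning work, a minimal sketch of the two examples above (paths and values follow common upstream defaults; your hosts may differ):

    # Disable swap, which the kubelet traditionally expects to be off.
    sudo swapoff -a
    sudo sed -i '/ swap / s/^/#/' /etc/fstab          # keep it off across reboots
    # Raise the mmap limit; Elasticsearch documents 262144 as its minimum.
    sudo sysctl -w vm.max_map_count=262144
    echo 'vm.max_map_count=262144' | sudo tee /etc/sysctl.d/99-elasticsearch.conf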
Kubernetes management - kubectl, Kustomize, Helm charts. Kubernetes is our preferred tool for managing all infrastructure: it provides a good level of abstraction to standardise on, and with community Helm charts some of the work is already available from official sources. More often than not, however, these charts need to be extended and improved for our specialised use cases. We are looking for someone with a passion for Kubernetes who enjoys the flexibility it provides, but who also understands that non-standard use cases require continual improvement; for example, most of our services are stateful, and while Kubernetes was engineered for stateless systems, it is steadily evolving towards better stateful application management.
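A typical day-to-day pattern is layering our own values on top of a community chart; a minimal sketch (the chart, namespace and values file here are illustrative, not our actual setup):

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm upgrade --install kafka bitnami/kafka \
      --namespace streaming --create-namespace \
      -f values-prod.yaml    # local overrides layered on the upstream chart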
Infrastructure as code to automate on-premises and cloud infrastructure, because flexibility is key when a Spark cluster needs to be scaled to different configurations for large batch processing. The aim is to empower data engineers to scale clusters dynamically via configuration files and simple deployment methods, so they can focus on processing challenges rather than infrastructure challenges.
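In practice that can be as simple as a data engineer changing one value and redeploying; a hedged sketch (the chart path, value name and file are assumptions):

    # spark-cluster.yaml contains, among other settings:  workerReplicas: 12
    helm upgrade --install spark ./charts/spark -f spark-cluster.yaml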
Operations for the high availability, deployment and upgrades of distributed systems such as Kafka, Kubernetes, Spark, Trino, Flink and Airflow. Your focus will be empowering data scientists and data engineers by taking initial proofs of concept through to fully functional deployments, as well as tuning and monitoring these systems.
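Upgrades of these services lean on Kubernetes' rolling-update machinery; an illustrative sequence (service, namespace and chart names are assumptions):

    kubectl -n streaming rollout status statefulset/kafka      # confirm a healthy baseline
    helm upgrade kafka bitnami/kafka -n streaming -f values-prod.yaml
    kubectl -n streaming rollout status statefulset/kafka --timeout=10m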
Building and deploying systems for metrics tracking, monitoring and debug logging. Kubernetes Helm charts often ship with integrations for exporting metrics and logs to systems such as Prometheus and Elasticsearch, but sometimes a custom integration is required, or metrics need to be extended to debug a system or improve its performance. You'll work closely with the people who need these extensions, and you'll learn more about how these systems function along the journey.
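Many community charts expose these integrations as a values toggle, so the first step is often just flipping a flag; a sketch (the exact flag name varies per chart and is an assumption here):

    helm repo add trino https://trinodb.github.io/charts
    helm upgrade --install trino trino/trino \
      --set serviceMonitor.enabled=true      # emit Prometheus scrape targets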
CI/CD build systems to ensure our teams can deploy frequently and safely. We are looking for experience in automating Docker and Java builds and deploying them to production systems. Having checks and balances in place before deployment is important, since we want to minimise downtime during system upgrade rollouts and avoid integration issues. Because we have teams across the globe, there are no downtime windows and the system needs to remain available at all times.
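The checks-before-deploy flow, reduced to its essence (the registry, tag scheme and build tool are assumptions):

    ./gradlew test                                          # gate: no image without green tests
    docker build -t registry.example.com/app:"${GIT_SHA}" .
    docker push registry.example.com/app:"${GIT_SHA}"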
Troubleshooting networking (TCP/IP, Calico/Weave, VLANs, tcpdump, routing, etc.) on-premises, in the cloud, and within and between Kubernetes clusters.
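Typical first moves when a pod can't reach a peer, sketched with assumed names and addresses:

    kubectl -n data exec -it worker-0 -- ip route            # routes as the pod sees them
    sudo tcpdump -i any -nn host 10.42.0.17 and port 9092    # capture on the node itself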
Security aware: design systems with security in mind rather than retrofitting solutions after the fact. Keep up to date with the security landscape and identify important risks before they become a problem. The Log4j vulnerability gave everyone in the security industry a run for their money; as similar threats become known, we sometimes need to deprioritise other tasks to roll out images with security fixes and actively detect these threats in an ever-connected world.
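When a CVE like that lands, the response is roughly scan, rebuild, roll out; an illustrative sequence (the scanner choice, registry, container name and tags are assumptions):

    trivy image registry.example.com/app:1.4.2                         # confirm exposure
    kubectl -n prod set image deployment/app app=registry.example.com/app:1.4.3
    kubectl -n prod rollout status deployment/app                      # watch the fix land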
User access and credential management. Liaising with the relevant teams inside Cartrack to provision and grant access to systems based on function. For example, a data scientist has a different permissions and access structure to that of a data engineer; it is vital to understand these differences and engineer the systems accordingly, to remain compliant with the legislation in each operating territory.
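On the Kubernetes side this often comes down to RBAC; a minimal sketch of separating the two roles (the namespace, role and group names are assumptions):

    kubectl -n analytics create role notebook-user \
      --verb=get,list,watch --resource=pods,services     # read-only for data scientists
    kubectl -n analytics create rolebinding ds-team \
      --role=notebook-user --group=data-scientists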
Responding to incidents and requests, troubleshooting with teammates spread across the globe (Asia, Europe and South Africa). Things sometimes do go wrong, so being flexible enough to respond to incidents and remediate them as quickly as possible is vital; for example, quickly detecting a disk-space issue on a database server and adding extra disk so others can resume their tasks without too much interruption.
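That disk example can be a one-line remediation when volumes live behind Kubernetes; a sketch (the claim name and size are assumptions, and the StorageClass must allow volume expansion):

    kubectl -n db patch pvc postgres-data \
      -p '{"spec":{"resources":{"requests":{"storage":"500Gi"}}}}'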
You have
Required experience:
- Kubernetes cluster deployment and administration.
- Deep understanding of Kubernetes storage management and abstractions (see the sketch after this list).
- Understanding of how Helm charts work, their limitations, and how to improve upon them.
- Linux system provisioning and administration (Nutanix, VMware, KVM).
- System and cluster monitoring (Prometheus and/or Elasticsearch).
- Understanding of how Docker works, and familiarity with writing Dockerfiles and automating the build process.
- Network management and troubleshooting of Kubernetes, cloud and on-premises systems.
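On the storage point: day to day, those abstractions look like claims against a named class, with the cluster mapping the class to the actual disks underneath; a minimal sketch (the class, namespace and size are assumptions):

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: es-data
      namespace: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
    EOF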
Interests or extra skills that can be refined on the job:
- Open source database administration (Postgres, MySQL).
- Distributed computing engines (Spark, Trino, Presto, Flink).
- Object storage management and administration.
- CI/CD pipeline development for docker and/or code builds (Java).
- System automation (Terraform, Cloudformation, Ansible, Python scripting, Shell scripting).
- Event based systems (Kafka, Elasticsearch).
- Cloud computing (Architecture, Operations, Cost management).
- User and access management (LDAP, IAM policies, Databases).
- Platform security, monitoring and continual auditing and improvement.
- Development environment and scheduler management (Jupyterhub, Airflow).
- Automation: as you refine a repeatable process, automate it to free up your time for new tasks, so you can keep improving and evolving your skill set (see the sketch below).
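A hedged example of that last point, using one of the tools listed above (the inventory, playbook and host group are assumptions):

    # The manual runbook becomes a playbook anyone on the team can run:
    ansible-playbook -i inventory/prod playbooks/provision-vm.yml -l new-spark-nodes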
Qualifications & Experience:
Bachelor's Degree or Advanced Diploma in IT, Computer Science or Engineering, with 3 years of experience in a software/technology environment where you were responsible for architecting and deploying distributed systems.