Wait, What Do You Do?

Key Responsibilities

Build and maintain observability stacks using Prometheus and Grafana; define SLIs, SLOs, SLAs, and error budgets.
Own incident response: on-call rotation, triage, mitigation, and blameless post-mortems.
Automate repetitive operational tasks and eliminate toil through scripting and tooling (Python, Bash, Go).
Design, deploy, and maintain highly available infrastructure on AWS using Terraform and Ansible for infrastructure-as-code workflows.
Manage and optimize Kubernetes clusters (EKS) and containerized workloads with Docker to support a microservices architecture.
Collaborate with engineering teams during design reviews to embed reliability and scalability requirements.
Monitor capacity and performance trends; proactively identify and resolve bottlenecks.
Maintain and improve CI/CD pipelines and deployment automation.

Qualifications Required

2–8 years of experience in Site Reliability Engineering, DevOps, or a closely related discipline.
Working knowledge of monitoring and logging tools such as Prometheus, Grafana, Dynatrace or Datadog, OpenSearch, and VictoriaMetrics.
Experience tracking and monitoring SLAs for all critical services.
Experience with Linux systems administration.
Hands-on experience with Kubernetes and Docker in production environments.
Proficiency with AWS services (EC2, EKS, RDS, S3, VPC, IAM, CloudWatch).
Experience with Infrastructure-as-Code tools such as Terraform or Ansible.
Strong scripting skills in Python or Bash.
Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, GitLab CI).
Familiarity with GitOps workflows (e.g., ArgoCD, Rancher).

Preferred

Site Reliability Engineer