> job detail
Site Reliability Engineer
clearwateranalytics · Office - Noida
// classified as
Other (Adjacent or hard to classify.)
posted
44d ago
location
Office - Noida
languages
bash, python
tools
aws, datadog, docker
> stack
bash, python, aws, datadog, docker, grafana, kubernetes, s3, terraform
> description
Key Responsibilities
- Build and maintain observability stacks using Prometheus and Grafana; define SLOs, SLIs, SLAs and error budgets.
- Own incident response: on-call rotation, triage, mitigation, and blameless post-mortems.
- Automate repetitive operational tasks and eliminate toil through scripting and tooling (Python, Bash, Go).
- Design, deploy, and maintain highly available infrastructure on AWS using Terraform and Ansible for infrastructure-as-code workflows.
- Manage and optimize Kubernetes clusters (EKS) and containerized workloads with Docker to support microservices architecture.
- Collaborate with engineering teams during design reviews to embed reliability and scalability requirements.
- Monitor capacity and performance trends; proactively identify and resolve bottlenecks.
- Maintain and improve CI/CD pipelines and deployment automation.
Qualifications Required
- 2–8 years of experience in Site Reliability Engineering, DevOps, or a closely related discipline.
- Working knowledge of monitoring and logging tools such as Prometheus, Grafana, Dynatrace or Datadog, OpenSearch, and VictoriaMetrics.
- Experience tracking and monitoring SLAs for all critical services.
- Experience with Linux systems administration.
- Hands-on experience with Kubernetes and Docker in production environments.
- Proficiency with AWS services (EC2, EKS, RDS, S3, VPC, IAM, CloudWatch).
- Experience with Infrastructure-as-Code tools such as Terraform or Ansible.
- Strong scripting skills in Python or Bash.
- Familiarity with CI/CD tools (e.g., GitHub Actions, Jenkins, GitLab CI).
- Familiarity with GitOps workflows (e.g., Argo CD, Rancher).
Preferred
- Experience in financial services, FinTech, or other regulated industries.
- Knowledge of service mesh technologies (Istio, Linkerd).
- Familiarity with distributed tracing tools (Jaeger, OpenTelemetry).
- AWS certifications (Solutions Architect, DevOps Engineer, or equivalent).
- Experience with cost optimization strategies in cloud environments.