Wait, What Do You Do?

Site Reliability Engineer (SRE) – Job Description

Job Title: Site Reliability Engineer (SRE)
Experience: 4–8 Years
Location: Delhi NCR
Employment Type: Full-Time

Job Summary

We are looking for a Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of our cloud infrastructure and applications. The ideal candidate will have strong experience in cloud platforms, Kubernetes, automation, monitoring, incident management, and DevOps practices.

Key Responsibilities

- Maintain and improve the reliability, availability, and performance of production systems.
- Design, implement, and manage monitoring, alerting, and observability solutions.
- Manage and support Kubernetes clusters and containerized workloads.
- Automate operational tasks using Infrastructure as Code (Terraform, ARM, Bicep, etc.).
- Collaborate with development teams to improve application resilience and deployment processes.
- Perform root cause analysis (RCA) for incidents and implement preventive measures.
- Define and monitor SLIs, SLOs, and error budgets.
- Manage CI/CD pipelines and deployment automation.
- Support disaster recovery (DR), backup, and business continuity planning.
- Participate in on-call support and incident response activities.
- Optimize cloud infrastructure for performance, security, and cost efficiency.

Required Skills

- Strong experience with Azure, AWS, or GCP.
- Hands-on experience with Kubernetes (AKS/EKS/GKE).
- Experience with Terraform, Infrastructure as Code, and automation.
- Strong Linux and networking fundamentals.
- Experience with GitLab CI/CD, Azure DevOps, or Jenkins.
- Monitoring and observability tools such as Prometheus, Grafana, ELK, Datadog, Azure Monitor.
- Scripting experience in Python, Bash, or PowerShell.
- Knowledge of incident management, problem management, and change management processes.
- Experience with databases, caching solutions, and messaging platforms is desirable.

Preferred Qualifications

- Azure Administrator, Azure DevOps, Kubernetes (CKA/CKAD), or similar certifications.
- Experience with microservices architecture and cloud-native technologies.
- Understanding of security best practices and compliance requirements.

Nice to Have

- Service Mesh (Istio/Kiali)
- Kafka, Redis, MongoDB, PostgreSQL
- Azure APIM, Application Gateway, WAF
- Disaster Recovery and High Availability architecture

Key Metrics

- Platform Availability (99.9%+)
- MTTR (Mean Time to Recovery)
- Incident Reduction
- Deployment Success Rate
- Infrastructure Automation Coverage
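
To make the availability and MTTR metrics concrete, here is a minimal sketch of how they are commonly computed. It is an illustration only, not a description of our internal tooling: the 30-day window, the function names, and the sample incident durations are all assumptions chosen for the example.

```python
# Illustrative sketch only: the window length, function names, and incident
# durations below are hypothetical and not taken from the job description.

def downtime_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability in the window for a given target."""
    total_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def mean_time_to_recovery(recovery_minutes: list[float]) -> float:
    """Average time to restore service across a set of incidents."""
    if not recovery_minutes:
        raise ValueError("at least one incident is required")
    return sum(recovery_minutes) / len(recovery_minutes)


# A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
budget = downtime_budget_minutes(0.999)

# Hypothetical incident recovery times, in minutes.
incidents = [12.0, 8.5, 15.0]
mttr = mean_time_to_recovery(incidents)
remaining = budget - sum(incidents)

print(f"Downtime budget: {budget:.1f} min")
print(f"MTTR: {mttr:.1f} min")
print(f"Budget remaining: {remaining:.1f} min")
```

In this sketch the error budget is simply the downtime budget minus the downtime already consumed; a real setup would usually derive both from monitoring data rather than a hand-written list.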

Site Reliability Engineering (SRE)