> job detail
Site Reliability Engineer II
American Express · Chennai, TN, India
// classified as
Other (Adjacent or hard to classify.)
posted
1d ago
location
Chennai, TN, India
languages
java
tools
aws, azure, docker
> stack
java, aws, azure, docker, elasticsearch, kubernetes, mysql, nosql, postgresql
> description
Site Reliability Engineer II collaborates with engineering teams to enhance system resilience, scalability, and performance through feature development, automation, architectural design, resiliency testing, and disaster recovery planning, while promoting best practices for continuous improvement.
Key Responsibilities
- Monitor application and infrastructure health using enterprise monitoring and observability tools, including ELF, to ensure availability, performance, and reliability of enterprise platforms
- Configure, tune, and maintain alerting mechanisms in ELF, aligned to service health indicators and SLOs, to enable timely incident detection and reduce noise and false positives
- Develop and maintain dashboards providing visibility into system performance, availability, reliability trends, and key operational metrics
- Analyze metrics, logs, and distributed traces across application and infrastructure layers to proactively identify issues and support effective root cause analysis (RCA)
- Own and execute blameless RCAs for production incidents, identify corrective and preventive actions, and track them to closure
- Implement minor code fixes, configuration updates, and reliability enhancements as part of incident remediation and preventive measures
- Collaborate with application development and platform teams to review defects, propose fixes, and improve overall service reliability
- Participate in Agile sprint planning ceremonies, backlog grooming, estimation, and delivery of SRE-owned work items
- Drive reliability improvements through sprint-based commitments, including automation, operational fixes, and platform enhancements
- Participate in Disaster Recovery (DR) planning, testing, and execution to ensure resilience of business-critical services
- Perform regular system patching and maintenance activities in line with organizational security, compliance, and audit requirements
- Support ITIL-based Incident, Problem, and Change Management processes, including planning, documentation, approvals, execution, and post-implementation validation
- Monitor network performance and troubleshoot connectivity, latency, and access-related issues impacting platform traffic
- Participate in certificate lifecycle management, including provisioning, renewal, validation, and troubleshooting of SSL/TLS certificates
- Maintain and manage service accounts (Service IDs), including access provisioning, credential rotation, and compliance with security policies
- Drive automation and operational toil reduction using scripting, CI/CD pipelines, and platform tooling to improve reliability and scalability
- Maintain accurate documentation of system configurations, runbooks, SOPs, platform operational guidelines, and troubleshooting procedures, and generate reports on system performance, incidents, and resolutions
Education and Knowledge
- At least 5 years of relevant experience in application development, maintenance, and production support, with hands-on exposure to Java and distributed systems in enterprise environments
- Bachelor's degree in Computer Science, Information Technology, Engineering, or equivalent practical experience; advanced degree is a plus
- Strong knowledge of operating systems and application runtimes such as Java and .NET
- Knowledge of distributed systems and service-based architectures from an operations and reliability perspective
- Strong knowledge of modern observability stacks and platforms, including Splunk, Elasticsearch, Prometheus, and Grafana
- Knowledge of observability practices including logging, monitoring, tracing, and performance analysis
- Knowledge of RDBMS and NoSQL databases including MySQL, PostgreSQL, Couchbase, HBase, and Cassandra
- Knowledge of scripting and automation using languages such as PowerShell and Python
- Basic understanding of AI, analytics, or AIOps platforms from an operational perspective is a plus
Work Experience
- Experience in Incident, Problem, and Change Management using ServiceNow or similar ITSM tools
- Experience supporting production systems in large-scale enterprise environments with a focus on reliability and availability
- Experience in system administration, infrastructure operations, and network troubleshooting
- Experience with CI/CD pipeline implementation and support using tools such as Jenkins, GitHub Actions, XL Release (XLR), or similar
- Experience managing and troubleshooting technology infrastructure and services, including servers, networks, and cloud platforms
- Knowledge of cloud-based Site Reliability Engineering (SRE) practices with hands-on experience on public cloud platforms such as AWS, Azure, or Google Cloud Platform
- Knowledge of containerization and orchestration technologies such as Docker and Kubernetes, and microservices-based architectures
- Experience using enterprise monitoring and alerting platforms such as ELF
- Exposure to AI-assisted monitoring, automation, or AIOps tools is a plus
- Experience accessing and managing remote systems using tools such as RDP and Citrix
- Proficiency in connecting to and administering servers via SSH (Secure Shell)
- Knowledge of core networking concepts including ports, protocols, firewalls, and secure remote access
Licenses & Certifications
- Certification in at least one programming language or runtime such as Java, .NET, or Python
- Certification in containerization and orchestration technologies (Docker, Kubernetes, OpenShift) is a plus
- Public cloud certification in AWS or GCP is a plus
- Certification or training related to AI platforms, analytics platforms, or AIOps is a plus