Wait, What Do You Do?

Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities.
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability.
Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes.
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale.
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization.
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention.
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements.
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability.
Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries.

8+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration.

Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations.
Experience leading reliability efforts for enterprise-scale or globally distributed systems.
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers.
Demonstrated ability to mentor senior engineers and influence engineering culture at scale.
Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks).
Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred).
Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards.
Deep experience in observability, incident management, and production operations at scale.
Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles.
Experience leveraging data platforms (Kusto, Power BI, telemetry pipelines) to drive operational insights and decision-making.

Principal Service Reliability Engineer