> job detail

Principal Site Reliability Engineer

SonicWall · USA-Remote
// classified as
Other (Adjacent or hard to classify.)
posted
2d ago
location
USA-Remote
languages
python, shell
tools
aws, datadog, dynamodb
> stack
python, shell, aws, datadog, dynamodb, grafana, postgresql, redis, terraform
> description
<div class="content-intro"><p><a href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D4126915226%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttps%25253A%25252F%25252Fwww.sonicwall.com%25252F%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253DQiWUyjNRdG%25252F5Rg2xvGXa1CYzr9V6BNnIgriz33Hk4vw%25253D%2526reserved%253D0%26a%3DSonicWall&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067920926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=GhrhwS9jae1nk4KBGBzKfBcXS7BDBmkWDO87fbJLQxQ%3D&amp;reserved=0"><strong>SonicWall</strong></a>&nbsp;is a cybersecurity forerunner&nbsp;with more than 30 years of expertise and is recognized as a leading partner-first company, ensuring our partners and their customers are never alone&nbsp;in the fight against cybercrime. With the ability to build, scale and manage security across the cloud, hybrid and traditional environments in real-time, SonicWall provides relentless security against the most evasive cyberattacks across endless exposure points for increasingly remote, mobile and cloud-enabled users. With its own threat research center, SonicWall can quickly and economically provide purpose-built security solutions to enable any organization—enterprise, government agencies and SMBs—around the world. 
For more information, visit&nbsp;<a href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D2643321098%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttp%25253A%25252F%25252Fwww.sonicwall.com%25252F%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253DXI%25252FDFdF08xnLTx4H7uCLIOH1N6kWKxObFvd9KJwK2Lc%25253D%2526reserved%253D0%26a%3Dwww.sonicwall.com&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067946534%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=4g9iPS3CN7E0hNoG8nisr9%2F%2B1iThtyJLQkZfTTFjk38%3D&amp;reserved=0"><strong>www.sonicwall.com</strong></a>&nbsp;or follow us on&nbsp;<a 
href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D1423517997%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttps%25253A%25252F%25252Ftwitter.com%25252FSonicWall%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253Dw1QKMhTH8DPkWXgLMTEWRWld7QF7Vby9hPJfzftCYj0%25253D%2526reserved%253D0%26a%3DTwitter&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067957822%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=DvfduakDCq8KXqJ%2BZlrMiXuosZ4Ij6hueiSMBKx9lA4%3D&amp;reserved=0"><strong>Twitter</strong></a>,&nbsp;<a 
href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D1291489902%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttps%25253A%25252F%25252Fwww.linkedin.com%25252Fcompany%25252Fsonicwall%25252F%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253DAqnGVNrPMjGMxmC54q1sp9PlJyB5Cwer4xH4LgrWvd8%25253D%2526reserved%253D0%26a%3DLinkedIn&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067969397%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=WjyCv%2FrHK4zlqjVcPHKXbMuh4gdziEANXtaPUrwTHvA%3D&amp;reserved=0"><strong>LinkedIn</strong></a>,&nbsp;<a 
href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D2712840885%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttps%25253A%25252F%25252Fwww.facebook.com%25252FSonicWall%25252F%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253DMQpoYcVevI0sLgxX7PsLuGFbW0C62qmRerKqcgHO6K4%25253D%2526reserved%253D0%26a%3DFacebook&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067980581%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=6t54RMMOc1GoFFeEy5HvdzorLYF3h6iUrSGzvd02wPQ%3D&amp;reserved=0"><strong>Facebook</strong></a>&nbsp;and&nbsp;<a 
href="https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fc212.net%2Fc%2Flink%2F%3Ft%3D0%26l%3Den%26o%3D4263951-1%26h%3D3922727505%26u%3Dhttps%253A%252F%252Fnam04.safelinks.protection.outlook.com%252F%253Furl%253Dhttps%25253A%25252F%25252Fwww.instagram.com%25252Fsonicwall_inc%25252F%25253Fhl%25253Den%2526data%253D05%25257C01%25257Cbfitzgerald%252540SonicWall.com%25257Ca6c16b82afc749239f1c08dbda330d94%25257C84fe6f401cbc473083288018b2af88bc%25257C1%25257C0%25257C638343685070400977%25257CUnknown%25257CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%25253D%25257C3000%25257C%25257C%25257C%2526sdata%253DBzB2YhsUzF35BWtGReUzu8wEsOqNssaVdVdDPhaSJZs%25253D%2526reserved%253D0%26a%3DInstagram&amp;data=05%7C02%7CNikhilaR%40SonicWall.com%7Cfd3ae83762384be031d508dcdf3d2ddf%7C84fe6f401cbc473083288018b2af88bc%7C1%7C0%7C638630701067991196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&amp;sdata=96EW%2BjKqH%2B2E3efQ71gSCjqPEt9NO%2BRD4B2JoZFbgYk%3D&amp;reserved=0"><strong>Instagram</strong></a>.</p></div><p>As a <strong>Principal Site Reliability Engineer</strong>, you will own the reliability, scalability, and operational excellence of our Cloud-based services. You will define and enforce reliability standards, drive the adoption of SRE practices across engineering teams, and build the systems and tooling that keep our production infrastructure healthy. 
We follow a DevOps model: Development and Operations teams are integrated, and the SRE function acts as the reliability layer — setting Service Level Objectives, managing error budgets, and continuously reducing toil through engineering.</p> <p><strong><u>Key Responsibilities:</u></strong></p> <ul> <li>Define, publish, and continuously refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for all critical services, partnering with product and engineering leadership.</li> <li>Own the error budget framework: track consumption, enforce error budget policies, and drive reliability investments when budgets are at risk.</li> <li>Lead the design and implementation of comprehensive observability platforms — metrics, structured logging, and distributed tracing — to ensure full visibility into production systems.</li> <li>Drive toil reduction initiatives by identifying and automating repetitive, manual operational work, targeting measurable reduction in operational burden across teams.</li> <li>Design and execute chaos engineering programs to proactively uncover reliability weaknesses in our infrastructure and services before they impact customers.</li> <li>Lead blameless postmortem culture: facilitate incident retrospectives, extract systemic learnings, and track corrective action items to completion.</li> <li>Build and improve on-call incident response processes, runbooks, and escalation paths; manage and optimize on-call rotation health to prevent burnout.</li> <li>Help design, build, and support infrastructure and security technologies within the cloud that offer resiliency, observability, and optimized cost.</li> <li>Develop solutions for automated deployment of software and services on our production infrastructure hosted on AWS, applying reliability engineering principles throughout.</li> <li>Shape how mission-critical enterprise software solutions are developed and deployed using optimized CI/CD pipelines that 
embed reliability and quality gates.</li> <li>Develop management solutions for services across multiple cloud platforms and data centers, with a focus on fault tolerance and graceful degradation.</li> <li>Collaborate with developers to bring new features and services into production using production-readiness reviews and launch checklists.</li> <li>Champion reliability engineering best practices across the organization, embedding SRE principles into the software development lifecycle.</li> <li>Mentor team members on SRE philosophy, technical decision-making, code reviews, and cloud engineering best practices.</li> <li>Participate in roadmap planning, identify areas of improvement, and perform technology evaluation and selection.</li> </ul> <p><strong><u>Required Skills and Qualifications:</u></strong></p> <ul> <li>7+ years of experience in scalable, distributed systems architecture.</li> <li>3+ years of hands-on Site Reliability Engineering experience, including ownership of SLOs and error budget management.</li> <li>4+ years of experience with cloud platforms, including AWS.</li> <li>4+ years of experience in infrastructure as code (Terraform, AWS CDK).</li> <li>5+ years of experience in scripting using Python, Shell, or a similar language.</li> <li>3+ years of experience with containerization technologies, including Docker.</li> <li>4+ years of experience with orchestration technologies, including Kubernetes.</li> <li>Demonstrated experience designing and operating observability stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, or equivalent).</li> <li>Experience with incident management platforms and on-call tooling (e.g., PagerDuty, OpsGenie).</li> <li>Experience defining and implementing automated service deployments, including provisions for networking, security, reliability, management, reporting, and configuration management.</li> <li>Experience with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos, Gremlin, or 
equivalent).</li> <li>Experience managing databases (PostgreSQL, Redis, DynamoDB, MongoDB).</li> <li>In-depth understanding of best practices for deployment automation and production-readiness reviews.</li> <li>Experience using Git in a team environment (merge requests, branching, pushes, and pulls).</li> <li>CS degree or equivalent experience.</li> </ul> <p><strong><u>Preferred Skills:</u></strong></p> <ul> <li>Familiarity with Google SRE principles and the concepts outlined in the Google SRE Book.</li> <li>In-depth understanding of networking, including routing, naming, security, network performance, and network failure modes.</li> <li>In-depth understanding of the HTTP protocol and experience diagnosing distributed system latency issues.</li> <li>Experience with distributed tracing frameworks (Jaeger, Zipkin, AWS X-Ray).</li> <li>Experience implementing AIOps or ML-based anomaly detection and alerting systems.</li> <li>Experience with instrumentation and management of automated deployments.</li> <li>Experience resolving customer-facing production issues under time pressure.</li> <li>Experience working with CI/CD processes and building pipelines with embedded reliability gates.</li> <li>Experience with capacity planning, load testing, and performance benchmarking at scale.</li> <li>Experience working with distributed, cross-functional teams.</li> </ul><div class="content-conclusion"><p>SonicWall is an equal opportunity employer.</p> <p>We are committed to creating a diverse environment and are an equal opportunity employer. 
All qualified applicants receive consideration for employment without regard to race, color, ethnicity, religion, sex, gender, gender identity and expression, sexual orientation, national origin, disability, age, marital status, veteran status, pregnancy, or any other basis prohibited by applicable law.<br><br>At SonicWall, we pride ourselves on recruiting a diverse mix of talented people and providing active security solutions in 100+ countries.</p> <p><span style="font-size: 10pt;"><a href="https://www.sonicwall.com/legal/job-applicant-privacy-notice/" target="_blank">Applicant Privacy Notice</a></span></p></div>