Senior Site Reliability Engineer

Red Hat, Inc.

Pune, India | ₹20,000–₹50,000/mo (≈ AED 880–2.2K/mo) | Today

India, Python, Java, C, C++, AWS, GCP, Azure, Kubernetes, OpenShift, Ansible, Puppet, Chef, DNS, HTTP, Docker, Golang, Unix, Linux, Prometheus, TCP/IP Networking, Full Time

Skills Required

Python, Java, AWS, Azure, Docker, Kubernetes, Git, Agile, ERP, Communication

Job Description

As a Senior Site Reliability Engineer (SRE) at Red Hat, you will be responsible for developing, scaling, and operating the managed cloud services for OpenShift, Red Hat's enterprise Kubernetes distribution. Your role will involve enabling customer self-service, improving the sustainability of the monitoring system, and driving automation to eliminate manual work. You will have the opportunity to tackle unique challenges at scale while applying your expertise in coding, operations, and large-scale distributed system design.

**Key Responsibilities:**

- Contribute code to enhance the scalability and reliability of the service
- Collaborate on software testing and participate in peer reviews to raise the quality of the codebase
- Mentor and share knowledge with peers to foster their development
- Engage in a regular on-call rotation, including weekends and holidays as needed
- Practice sustainable incident response and participate in blameless postmortems
- Address customer issues escalated from the Red Hat Global Support team
- Work within an agile team to enhance SRE software, support colleagues, and engage in self-improvement
- Use LLMs (e.g., Google Gemini) for tasks like brainstorming solutions, summarizing technical documentation, and solving problems more efficiently
- Employ AI-assisted development tools (e.g., GitHub Copilot) for code generation and auto-completion to accelerate development cycles
- Participate in AI-assisted code reviews to identify bugs and security vulnerabilities and to check adherence to coding standards
- Collaborate with cross-functional teams to identify opportunities for AI integration within the software development lifecycle

**Qualifications Required:**

- Bachelor's degree in Computer Science or a related technical field; hands-on experience in Site Reliability Engineering may be considered in lieu of degree requirements
- Proficiency in programming languages like Python, Golang, Java, C, or C++
- Experience with public clouds such as AWS, GCP, or Azure
- Ability to troubleshoot and solve problems collaboratively in a team setting
- Familiarity with troubleshooting SaaS or PaaS offerings and working with complex distributed systems
- Basic understanding of Unix/Linux operating systems

**Desired Skills:**

- 5+ years managing Linux servers at cloud providers like AWS, GCE, or Azure
- 3+ years of enterprise systems monitoring experience; knowledge of Prometheus is beneficial
- 3+ years with enterprise configuration management software like Ansible, Puppet, or Chef
- 2+ years programming with an object-oriented language; Golang, Java, or Python preferred
- Experience delivering a hosted service and troubleshooting system issues
- Understanding of TCP/IP networking, DNS, HTTP, and Kubernetes
- Strong communication skills and customer interaction experience
- Familiarity with Kubernetes and Docker-based containers is a plus

At Red Hat, you will be part of a global team that values transparency, teamwork, and continuous improvement. Your individual contributions will be recognized, providing visibility for career growth opportunities. Join Red Hat in delivering high-performing Linux, cloud, container, and Kubernetes technologies through an open and inclusive environment that encourages innovation and collaboration.