Site Reliability Engineer (SRE)

PrimeGate for Communications and IT

Riyadh, Saudi Arabia | SAR 16,667-25,000/mo | Yesterday

Saudi Arabia | IT & Technology | Full Time

Skills Required

Python, AWS, Azure, Docker, Kubernetes, Git, DevOps, Communication

Job Description

<div><h3>About the Role</h3><p>We are looking for a Site Reliability Engineer (SRE) with solid experience running production systems and working closely with development teams. The ideal candidate is comfortable with Linux, containers, Kubernetes, and CI/CD pipelines, and has a strong focus on reliability, monitoring, and incident handling. You will help keep our services stable, observable, and scalable while collaborating with engineers across the stack.</p><h3>Responsibilities</h3><ul><li>Operate and maintain production systems with a focus on reliability, availability, and performance.</li><li>Work with Docker and Kubernetes to deploy, update, and troubleshoot services.</li><li>Configure and optimize Kubernetes resources (pods, deployments, services, ingress, config maps, secrets, etc.).</li><li>Implement and maintain monitoring, logging, and alerting for applications and infrastructure.</li><li>Build and improve CI/CD pipelines in collaboration with development and DevOps teams.</li><li>Create and maintain dashboards for key service metrics (latency, error rate, throughput, resource usage).</li><li>Participate in incident response: investigate issues, identify root cause, and propose fixes and improvements.</li><li>Work closely with backend developers to improve service reliability, resilience, and observability.</li><li>Contribute to capacity planning and performance tuning of services and infrastructure.</li><li>Automate repetitive operational tasks using scripts or small tools.</li><li>Document runbooks, procedures, and best practices for operating services in production.</li></ul><h3>Must-Have Qualifications</h3><ul><li>3–5 years of professional experience in an SRE, DevOps, or infrastructure-focused engineering role.</li><li>Strong understanding of Linux systems (shell, processes, networking, permissions, logs).</li><li>Hands‑on experience with Docker and Kubernetes in real environments.</li><li>Practical experience with:<ul><li>Kubernetes deployments, 
services, ingress, config maps, and secrets</li><li>Basic troubleshooting inside a cluster (pods failing, crashes, restarts, resource issues)</li></ul></li><li>Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK/EFK, Application Insights, or similar).</li><li>Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, or similar).</li><li>Ability to read and modify pipeline definitions and understand build → test → deploy flows.</li><li>Basic programming/scripting skills in at least one language (e.g., Python, Bash, PowerShell, Go, etc.).</li><li>Understanding of core reliability concepts such as SLIs, SLOs, uptime, latency, and availability.</li><li>Experience troubleshooting production issues using logs, metrics, and dashboards.</li><li>Good communication skills and ability to collaborate with developers, QA, and product teams.</li></ul><h3>Nice-to-Have</h3><ul><li>Experience with at least one major cloud platform (Azure, AWS, Alibaba Cloud, or GCP).</li><li>Experience with infrastructure as code (Terraform, Bicep, Pulumi, Helm, etc.).</li><li>Experience with ingress controllers, API gateways, or service mesh.</li><li>Familiarity with security best practices (secrets management, TLS/certificates, RBAC on Kubernetes or cloud).</li><li>Experience participating in on‑call rotations and using incident management tools (PagerDuty, Opsgenie, etc.).</li><li>Experience contributing to post‑incident reviews and implementing follow‑up improvements.</li></ul><h3>Experience</h3><p>3–5 years</p></div>