JobsAisle

Infrastructure & Site Reliability Engineer – Datacentre AI Engineering – Riyadh, KSA

Qualcomm

Riyadh, Saudi Arabia · SAR 16,667–25,000/mo · Today
Saudi Arabia · IT & Technology · Full Time

Skills Required

Python · Git · DevOps · Machine Learning

Job Description

<div><h3>Company</h3><p>Qualcomm Middle East Information Technology Company LLC</p><h3>Job Area</h3><p>Engineering Group > Software Test Engineering</p><h3>General Summary</h3><p>Qualcomm is growing its presence in Riyadh and is hiring Data Centre Engineers to support our expanding infrastructure across the region. As Saudi Arabia accelerates its digital transformation under Vision 2030, Qualcomm is investing in world‑class computing and data centre capabilities to power AI, cloud, and advanced connectivity at scale. This is a unique opportunity to work in a fast‑growing technology hub, supporting critical environments and helping shape the future of data centre operations in the Kingdom and beyond.</p><h3>About The Role</h3><p>We are looking for a Site Reliability Engineer or Senior Engineer – Datacentre AI Engineering at Qualcomm Technologies, Inc., located in Riyadh, Saudi Arabia.</p><p>The role focuses on the design, operation, and continuous improvement of large‑scale AI inference systems in a data centre environment. The engineer will support critical AI use cases by ensuring Qualcomm’s AI infrastructure is reliable, scalable, and production‑ready for advanced machine‑learning workloads.</p><p>The role requires strong systems and software engineering fundamentals, hands‑on execution, and the ability to work independently on complex problem areas while collaborating closely with cross‑functional teams across hardware, software, and machine learning.</p><h3>Ideal Candidates</h3><p>2–8 years of experience.</p><h3>Key Responsibilities</h3><ul><li>AI Infrastructure: Design, deploy, and operate large‑scale AI inference systems supporting critical AI workloads. Ensure reliability, availability, and scalability of Qualcomm data centre AI clusters.
Develop and maintain software tools and support infrastructure around AI software stacks.</li><li>AI & ML Engineering: Analyze software requirements and collaborate with architecture and hardware engineers to support AI workloads. Build, deploy, and operate components supporting LLM inference, agentic AI workflows, and AI services. Work with models, systems, and software teams to improve model performance on AI100 deployments. Identify and implement optimizations for workloads running on multi‑SoC and multi‑card systems.</li><li>Site Reliability Engineering (SRE): Apply SRE fundamentals including monitoring, alerting, incident response, and performance optimization. Support production ML systems using MLOps tools and operational best practices. Contribute to incident reviews, operational documentation, and continuous reliability improvements.</li><li>Observability & Tooling: Build and maintain observability tools, dashboards, and alerts to monitor system health and reliability. Monitor infrastructure and services using tools such as Prometheus, Grafana, CloudWatch, and custom telemetry. Create and maintain technical documentation, runbooks, and knowledge‑base articles.</li><li>Automation & CI/CD: Develop automation to reduce manual operational tasks and improve system reliability. Support CI/CD pipelines for AI service and agent deployment.
Apply Infrastructure‑as‑Code practices using tools such as Terraform and Ansible.</li></ul><h3>Required Skillset</h3><ul><li><b>AI & Deep Learning</b><ul><li>Experience working with AI/ML workloads such as LLMs, NLP, Vision, Audio, or Recommendation systems.</li><li>Understanding of ML inference concepts including batching, token streaming, and performance considerations.</li><li>Hands‑on experience with PyTorch and familiarity with modern ML frameworks.</li><li>Familiarity with distributed inference, checkpointing, and accelerator‑based compute environments.</li></ul></li><li><b>AI Operations</b><ul><li>Experience supporting AI or ML applications in production environments.</li><li>Familiarity with LLM inference pipelines and AI service operations.</li></ul></li><li><b>Programming & Software Design</b><ul><li>Strong programming skills in Python with experience building and supporting production systems.</li><li>Experience with scripting and automation using Python and Bash.</li><li>Familiarity with configuration management and orchestration tools.</li></ul></li><li><b>Systems & Infrastructure</b><ul><li>Strong Linux fundamentals including shell, containers, system services, and networking basics (DNS, TLS, HTTP/gRPC).</li><li>Experience working with cluster schedulers such as Slurm or equivalent systems.</li><li>Experience operating distributed systems with high availability and fault tolerance.</li></ul></li><li><b>Observability & Monitoring</b><ul><li>Hands‑on experience with monitoring and logging tools such as Prometheus, Grafana, ELK, or Loki.</li><li>Understanding of incident management, service health metrics, and system reliability monitoring.</li></ul></li><li><b>DevOps & SRE Practices</b><ul><li>Solid understanding of SDLC, release processes, and operational reliability practices.</li><li>Familiarity with CI/CD pipelines and