JobsAisle

Site Reliability Engineer

Lucidya

Riyadh, Saudi Arabia | AED 7,000-18,000/mo | SAR 7.1K-18.4K/mo | Yesterday
Saudi Arabia | IT & Technology | Full Time

Skills Required

Python, React, AWS, Azure, Docker, Kubernetes, Git, DevOps

Job Description

About Lucidya

Lucidya is an AI-native platform for customer experience (CX) intelligence that manages entire customer lifecycles autonomously, from initial engagement through retention and growth.

Unlike platforms that only surface insights and leave the action to you, Lucidya closes the loop with proprietary NLU technology built in-house and trained on millions of multilingual conversations. This enables marketing, support, CX, and research teams to deliver personalized experiences that drive measurable improvements in customer satisfaction, retention, and lifetime value.

As we continue scaling globally, the reliability, performance, and resilience of our infrastructure become mission-critical to everything we do.

Why this role matters

At Lucidya, our platform processes massive volumes of real-time customer data. Any downtime, latency, or instability directly impacts our customers’ ability to make decisions and serve their own users.

This role exists to make sure that doesn’t happen.

As a Site Reliability Engineer, you’ll sit at the heart of our platform’s stability, owning the reliability of our cloud infrastructure and ensuring it scales seamlessly as we grow. You won’t just react to issues; you’ll anticipate them, design systems that prevent them, and build automation that removes them entirely.

If you enjoy solving complex infrastructure challenges, eliminating inefficiencies, and building systems that “just work” - this is where you’ll thrive.

What You’ll Do

You’ll be responsible for outcomes, not just tasks.
Here’s what success looks like in this role:

You’ll make reliability the default
- You’ll design and maintain infrastructure that is highly available, fault-tolerant, and scalable
- You’ll proactively identify and eliminate single points of failure before they become incidents
- You’ll ensure our production systems remain stable, even under increasing scale and load

You’ll own and optimize our cloud environments
- You’ll manage and continuously improve workloads across AWS, GCP, or Azure
- You’ll use Infrastructure as Code (Terraform) to standardize and scale infrastructure
- You’ll optimize resource usage to balance performance and cost

You’ll run and improve Kubernetes in production
- You’ll operate and scale Kubernetes clusters (EKS, GKE, etc.) with confidence
- You’ll troubleshoot issues quickly and ensure smooth deployments and upgrades
- You’ll ensure our containerized workloads perform reliably at scale

You’ll build strong observability and respond to incidents
- You’ll implement and refine monitoring systems using tools like Prometheus, Grafana, Datadog, or ELK
- You’ll define alerting that is meaningful, not noisy
- You’ll respond to incidents, lead root cause analysis, and ensure we learn from every failure

You’ll automate everything that shouldn’t be manual
- You’ll write scripts and build tooling to eliminate repetitive operational work
- You’ll continuously improve infrastructure efficiency through automation
- You’ll promote a culture where manual work is a temporary state, not the norm

You’ll collaborate to improve the entire system
- You’ll work closely with DevOps and engineering teams to solve performance bottlenecks
- You’ll contribute to CI/CD improvements and deployment reliability
- You’ll help shape reliability best practices across the organization

What success looks like (First 90 Days)

First 30 days:
- You’ve built a strong understanding of our infrastructure, systems, and workflows
- You’re contributing to day-to-day operations with support from the team
- You’ve started identifying areas for improvement in automation and reliability

By 90 days:
- You’re independently managing infrastructure tasks and troubleshooting issues
- You’re actively contributing to reliability and scalability improvements
- You’ve taken ownership of parts of our infrastructure and are improving them

Who You Are

This is what will make you successful in this role:
- You’ve spent ~3 years working in SRE, DevOps, or infrastructure engineering, and you’ve seen what breaks at scale
- You’re comfortable working in cloud environments like AWS, GCP, or Azure, and you understand how distributed systems behave
- You’ve worked hands-on with Kubernetes in production and know how to troubleshoot it when things go wrong
- You don’t just fix issues - you ask why they happened and make sure they don’t happen again

Technically, you likely:
- Use Terraform (or similar IaC tools) to manage infrastructure
- Work confidently with Docker and Kubernetes
- Write scripts in Python, Bash, or similar to automate workflows
- Understand CI/CD pipelines (Jenkins, GitHub Actions, Bitbucket, etc.)
- Have a solid grasp of networking, load balancing, and high-availability design

When it comes to monitoring:
- You’ve implemented tools like Prometheus, Grafana, Datadog, or ELK
- You know the difference between useful alerts and noise
- You focus on signals that actually drive action

What sets you apart:
- You take ownership - you don’t wait to be told something is broken
- You’re calm under pressure and methodical during incidents
- You simplify complexity instead of ad