Cloud Site Reliability Engineer

Cornerstone OnDemand

Industry: Computer Software

Mumbai, India

We are seeking a highly skilled Site Reliability Engineer (SRE) with strong experience in Kubernetes troubleshooting and incident response, deep knowledge of monitoring and alerting systems, and solid experience in CI/CD pipeline design and maintenance. You will play a key role in building and maintaining reliable infrastructure, enhancing observability, and ensuring uptime for mission-critical systems.

In this role, you will...

  • Diagnose and resolve issues in Kubernetes clusters, including deployments, pod failures, networking issues, and autoscaling.
  • Lead incident management efforts, including on-call response, root cause analysis, and continuous improvement of incident playbooks.
  • Design and maintain monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and ELK (Elasticsearch, Logstash, Kibana).
  • Set up and manage Kibana dashboards and maintain the ELK stack to ensure high availability and performance of logging infrastructure.
  • Integrate metrics, logs, and traces into a unified observability platform.
  • Build and maintain alerting pipelines that cut alert noise and improve the signal-to-noise ratio for production incidents.
  • Contribute to infrastructure automation using tools such as Terraform and Helm.
  • Set up and support CI/CD pipelines for automated testing, deployment, and rollback across multiple environments.
  • Participate in shift rotations and continuously improve observability and response systems.

You've Got What It Takes If You Have...

  • 2+ years in an SRE, DevOps, or Infrastructure Engineer role.
  • Bachelor's degree in computer science, IT, or related technical field.
  • Hands-on experience with AWS and GCP.
  • Deep hands-on experience with Kubernetes (EKS, AKS, GKE).
  • Strong understanding of Linux internals, container orchestration, and microservice architecture.
  • Hands-on experience with monitoring/logging tools:
      • Prometheus, Grafana, InfluxDB
      • ELK stack (Elasticsearch, Logstash, Kibana)
  • Proficient in incident response and alerting tools such as PagerDuty.
  • Basic knowledge of:
      • Kafka - topic monitoring, consumer health
      • ElastiCache / Redis - caching patterns and troubleshooting
      • InfluxDB - time-series metrics storage
  • Experience writing and maintaining automation scripts in Bash, Python, or Go.

#LI-Onsite
