Join us as we pursue our disruptive new vision to make machine data accessible, usable and valuable to everyone. We are a company filled with people who are passionate about our product and seek to deliver the best experience for our customers. At Splunk, we’re committed to our work, customers, having fun and most importantly to each other’s success. Learn more about Splunk careers and how you can become a part of our journey!
The Splunk Observability Suite is a new generation of cloud applications for microservices and distributed applications. We work on new, world-class tools to monitor and observe microservice-based applications. Site Reliability Engineers at Splunk are hybrid Software/Systems Engineers whose overarching goal is to ensure that production services are always up and running reliably.
As a Software Engineer - Infrastructure, you will help us run one of the largest and most sophisticated cloud-scale, big data systems in the world. You will be responsible for improving operational efficiency, optimal utilization and system resiliency for a real-time streaming analytics platform. You are passionate about automation, infrastructure-as-code, and getting rid of tedious, manual tasks.
- Automate and operationalize cloud provider infrastructure using Terraform, Kubernetes, Helm and Istio.
- Monitor capacity & utilization and work closely with the infrastructure team to orchestrate scale-up/down of backend services.
- Own & operate critical back-end open-source services like Cassandra, Kafka, and Zookeeper.
- Build tools and design processes that help improve observability and system resiliency.
- Triage site availability incidents and proactively work towards reducing MTTR for customer-impacting incidents.
- Partner with service owners to implement service level metrics & service level objectives that act as service-level health indicators.
- Establish design patterns for monitoring, benchmarking and deploying new features for the backend services.
- Coding experience in one or more of Python, Go or Java.
- Infrastructure-as-code experience with one or more of Terraform, Ansible, Puppet or Salt.
- Strong experience with modern application development workflows and version control platforms like GitHub, GitLab or Bitbucket.
- Strong working knowledge of Docker containers and cloud platforms (AWS, GCP and/or Azure).
- Strong working knowledge of orchestration engines and package management, including Kubernetes, Helm, and Istio.
- Experience operating one or more OSS technologies like Kafka, Cassandra, Zookeeper; other backends and streaming systems a plus.
- Extensive understanding of Unix/Linux systems from kernel to shell and beyond (system libraries, file systems, and client-server protocols).
- 8+ years of experience as a Site Reliability Engineer, Production Engineer or Backend Software Engineer for web-scale or similar platforms.
- BS degree in Computer Science or a related technical field, or equivalent practical experience.