Job Details
At Juniper, we believe the network is the single greatest vehicle for knowledge, understanding, and human advancement the world has ever known. To achieve real outcomes, we know that experience is the most important requirement for networking teams and the people they serve. Delivering an experience-first, AI-Native Network pivots on the creativity and commitment of our people. It requires a consistent and committed practice, something we call the Juniper Way.
Juniper is seeking a full-time SRE to join our talented team and support high-quality technology solutions that revolutionize wireless and wired networks, powered by Artificial Intelligence in the cloud. Juniper provides services through SaaS applications to several enterprises, including Fortune 100 and Fortune 500 customers.
You will be responsible for maintaining and improving the company's production environment for rapid scaling and outstanding performance. You will maintain stellar cloud uptime and reliability. Your primary responsibilities will be incident management and release management in cloud instances in various regions.
Responsibilities
- Maintain system availability, health and service levels (SLAs, SLOs) of the large-scale cloud infrastructure, running in AWS and GCP.
- Support infrastructure components, data streaming frameworks and databases, such as Kubernetes, Flink, Storm, Spark, Kafka, Cassandra, Elasticsearch, Redis, Postgres, ArangoDB, and many others.
- Monitor, troubleshoot, and analyze failures, and support software engineers in debugging production issues across microservices and distributed platforms. Work with the development team to resolve the issues found.
- Join the on-call rotation and resolve issues in a 24x7 multi-cloud (AWS/GCP) environment.
- Monitor metrics and performance of applications and cloud infrastructure.
- Handle the entire incident management lifecycle, including reporting, analyzing, and handling incidents through closure, and writing RCAs.
- Write and update runbooks for knowledge-driven automated processes and bots.
- Perform capacity planning based on performance, usage, and utilization stats.
- Follow SRE best practices and procedures.
Qualifications
- Bachelor's degree in computer science or computer engineering or equivalent.
- 1+ years of hands-on experience with AWS or GCP, EC2 (GCE), IAM, S3 (GS), Docker, Kubernetes pods, Jenkins, Prometheus, CloudWatch (Stackdriver), Linux, Ansible.
- 1+ years of experience deploying code and infrastructure in AWS or GCP using continuous integration/continuous delivery (CI/CD) tools in production environments.
- 1+ years of administration experience with distributed computation and streaming frameworks, such as Kafka, Cassandra, Elasticsearch, Flink, Storm, and Spark, and cloud services such as EMR, Dataproc, ElastiCache, AWS RDS, GCP SQL, or similar.
- 1+ years of automation using Python, Golang, and/or Rust, and shell scripting.
- 1+ years of prior experience developing metrics to monitor the health of infrastructure and applications.
Preferred Qualifications
- Good understanding of Terraform, CloudFormation, or another IaC tool is preferred.
- Any open-source development experience.
- AI Ops/Gen AI experience.
- Automation using workflow services such as GitHub Actions, Google Workflows, Jenkins, GitLab, Slack, and Confluence/Jira.
- Microservices release operations experience.
About the Company
Juniper Networks
Sunnyvale, CA, United States
Juniper Networks is leading the revolution in networking, making it one of the most exciting technology companies in Silicon Valley today.