Job Type
Full Time
Job Details
To get the best candidate experience, please consider applying for a maximum of 3 roles within 12 months to ensure you are not duplicating efforts.
Job Category: Software Engineering
About Salesforce
We’re Salesforce, the Customer Company, inspiring the future of business with AI + Data + CRM. Leading with our core values, we help companies across every industry blaze new trails and connect with customers in a whole new way. And we empower you to be a Trailblazer, too: driving your performance and career growth, charting new paths, and improving the state of the world. If you believe in business as the greatest platform for change and in companies doing well and doing good, you’ve come to the right place.
About the team
The Reliability and Incident Automation team builds tools and products that underpin Reliability, Service Ownership and Incident Management at Slack. We seek diverse perspectives and strategies with a focus on how to keep Slack reliable, empower service owners and learn from incidents. We collaborate with product and infrastructure engineering teams to continuously improve shared technology and processes, and maintain incident management as a foundational skill set of all engineering teams at Slack.
Slack has a positive, diverse, and supportive culture. We want people who are curious, inventive, and inspired to do their best work every single day. In our work together we aim to be smart, humble, hardworking and, above all, collaborative. If this sounds like a good fit for you, please apply and connect with our team.
What you will be doing:
- Lead engineering development on internal products and tools with a focus on prototyping and iteration for high velocity. Engage with teams and users to build features that have a delightful user experience and make their lives better.
- Build tooling and services that handle failure gracefully, without interrupting incident response, in an environment that requires rock-solid reliability. These tools interact with a variety of external systems, such as Observability, Monitoring, Alerting and Ticketing, to provide real-time information to incident responders.
- Provide mentorship and guide the team forward through technical expertise.
- Facilitate and participate in incident investigations and reviews (aka postmortems) for major incidents at Slack and drive program improvements for Incident Analysis and Review across Slack Engineering.
- Run training and workshops to teach Incident Responders and Commanders across Slack about the principles of incident management and the tactical ways in which we perform incident response. Be a peer and mentor to engineers who are new to on-call work and various roles in incident response.
- Be a service owner for the software and tooling we write and develop. You will participate in an on-call rotation, assist with triage, address production issues, and respond to incidents. Participate as an Incident Commander at Slack.
- You have 7+ years of experience in Reliability, Incident Management and/or operating distributed systems at scale.
- You have experience with functional or imperative programming languages — e.g., PHP, Python, Ruby, or Go.
- You write understandable, testable code with an eye towards maintainability.
- You are a strong communicator with a positive attitude and empathy. Explaining complex technical concepts to designers, support, and other engineers is no problem for you.
- You possess strong computer science fundamentals: data structures, algorithms, programming languages, distributed systems, and information retrieval.
- Strong UX and design sensibilities, and a desire to sweat the small stuff.
- Self-awareness and a desire to continually improve.
- Experience with large scale distributed systems and cloud-based environments.
- You enjoy helping onboard new team members, mentoring, and teaching others.
- You have a Bachelor's degree in Computer Science, Engineering or related field, or equivalent training, fellowship, or work experience.
- You are passionate about Site Reliability Engineering (SRE), Resilience Engineering and Learning from Incidents.
- Experience building tools or applications with Python and Go.
- Curiosity for gaining valuable insights via analytics and metrics.
- You have experience in responding to and coordinating incidents in previous roles.
About the Company
Salesforce
San Francisco, CA, United States