Applies full understanding of the business, the customer, and the solutions the business offers to effectively design, develop, and implement operational capabilities, tools, and processes that enable highly available, scalable, and reliable customer experiences.
Uses deep knowledge of operations engineering, connected services, and information technology, together with knowledge of industry best practices, to innovate and influence operational approaches and solutions.
Serves as the Technical Lead and works on significant assignments that are broad in scope and complexity and cover a wide range of issues.
Exercises independent judgment in the selection of methods and techniques used to deliver operational solutions.
Creates formal internal and external networks outside of own area of expertise to leverage and adopt ideas, technologies, and best practices that help the organization move fast.
Support the migration to AWS with end-to-end automation and no manual intervention anywhere in the flow.
Manage all operational aspects of Production and Pre-Production environments in AWS and traditional data centers.
Work with development on designing, testing, and implementing data objects in support of critical applications.
Implement, monitor, and test backup and resiliency methods.
Develop the monitoring architecture and implement monitoring agents, dashboards, escalations, and alerts.
Work closely with Product Development on the operational aspects of each release and on resiliency patterns, ensuring that the customer experience is monitored, measured, and improved release over release.
Provide Tier-2 support and participate in 24x7 on-call incident escalation rotations.
Use proven skills and knowledge to troubleshoot and promptly resolve application, performance, systems, and infrastructure incidents.
Develop and drive incident management processes, playbooks, and stakeholder communication mechanisms.
Coach and mentor other Site Reliability Engineers.
Incident management reports, including initial problem analysis, management status, resolution, and follow-up defect reporting
Technical documentation on supported applications & operational tools
Application Deployment Plan and implementation
Configuration of monitoring agents at the software layer and development of meaningful alerts and escalation procedures
Responses to monitoring alerts according to defined playbooks and procedures
Participation in Root Cause Analysis (RCA) processes
Implementation of business operations standards
Suggestions for process improvements and enhanced operational efficiencies
Implementation of monitoring agents
Management of application deployment processes
Management of RCA processes for a specific application
Implementation of improved operational processes
Real-time application dashboards showing overall health of the system
Code reviews of operational solutions
Facilitation of operational readiness document creation
Review and development of performance and capacity plans (operational capacity and load requirements)
Specifications for onboarding new offerings, including troubleshooting, patch processes, cross-organizational incident management processes, security breach response plans, etc.
Implementation plans for application disaster recovery, migration, rollback, expansion, routine deployments, and system upgrades
Metrics reporting on application performance, availability, reliability, etc.
Design reviews of operational approaches and solutions
Contributions to Operational Standards and Requirements