Platform Operations Engineer
NY, United States
Experience levels: Mid-Level, Senior
- Manage all Digital Infrastructure backend systems and services hosted on AWS, including user access, network connectivity, Linux/Windows systems, databases, and application management.
- Deploy updates and patches to servers and connected client systems during off-hours maintenance windows.
- Identify, troubleshoot and resolve both server and client issues by analyzing logs from all digital infrastructure components.
- Set up and continually improve monitoring and alerting metrics for all supported platforms.
- Proactively review key operating metrics and system status to ensure all systems run within recommended operating conditions.
- Participate in designing and implementing mechanisms for redundancy, failover, and disaster recovery.
- Develop tools and scripts to automate routine tasks.
- Collaborate with NOC, DevOps, and Engineering teams to harden, streamline, and document operating processes.
- Work closely with the Head of Digital Infrastructure to improve operability, supportability, usability, and visibility of the digital infrastructure.
- Assist in continuous improvement of operational processes for better utilization of underlying cloud resources.
- At least 5 years of hands-on experience operating production digital infrastructure, with strong scripting and system administration skills on both Linux and Windows.
- At least 3 years of AWS administration experience, including but not limited to OpsWorks, VPC, EC2/ECS, S3, RDS, IAM, ES, and EMR
- Working knowledge of the Advanced Message Queuing Protocol (AMQP) and the Extensible Messaging and Presence Protocol (XMPP)
- Working knowledge of modern operations tools for monitoring and centralized logging.
- Experience with automation and configuration management using Chef and Ansible
- Ability to use a variety of open source technologies and integrate them with cloud services
- Experience in managing PostgreSQL, MySQL, MS SQL and NoSQL clusters
- Working knowledge of securing data and ensuring operational redundancy in a cloud environment
- Ability to evaluate system and application logs, error messages, and stack traces to quickly identify and solve production problems
- Understanding of best practices for data center operations in an always-up, always-available setup
- Ability to create and maintain up-to-date infrastructure documentation covering systems, networks, databases, and their interactions
- Ability to adhere to established operations procedures and policies
- Ability to create clear step-by-step knowledge base documents that the NOC can follow to resolve known issues
- Participate in 24x7 on-call rotations
- Bachelor's degree in a relevant field