New York City, NY, United States Posted 7 days ago
Manage all Digital Infrastructure-related backend systems and services currently residing on AWS Cloud, including user access, network connectivity, Linux/Windows systems, databases, and application management.
Deploy updates and patches to servers as well as connected client systems in off-hours maintenance windows.
Identify, troubleshoot and resolve both server and client issues by analyzing logs from all digital infrastructure components.
Set up and continue to improve monitoring/alerting metrics for all supported platforms.
Proactively review key operating metrics and status to ensure all systems are running under recommended operational conditions.
Participate in designing and implementing mechanisms for redundancy, failover, and disaster recovery.
Develop tools and scripts to automate routine tasks.
Collaborate with NOC, DevOps, and Engineering teams to harden, streamline, and document operating processes.
Work closely with the Head of Digital Infrastructure to improve operability, supportability, usability, and visibility of the digital infrastructure.
Assist in continuous improvement of operational processes for better utilization of underlying cloud resources.
At least 5 years of direct working experience in operating production digital infrastructure with strong scripting and system administration skills for both Linux and Windows operating systems.
At least 3 years of AWS administration experience, including but not limited to OpsWorks, VPC, EC2/ECS, S3, RDS, IAM, ES and EMR services
Working knowledge of the Advanced Message Queuing Protocol (AMQP) and the Extensible Messaging and Presence Protocol (XMPP)
Working knowledge of modern system operating tools for monitoring and centralized logging.
Experience with automation and configuration management using Chef and Ansible
Ability to use a variety of open source technologies and integrate them with cloud services
Experience in managing PostgreSQL, MySQL, MS SQL and NoSQL clusters
Working knowledge of securing data and ensuring operational redundancy in a cloud environment
Ability to evaluate system and application logs, error messages, and stack traces to quickly identify and solve production problems
Understanding of best practices and data center operations in an always-up, always-available setup
Ability to create and maintain up-to-date infrastructure documentation including systems, networks, databases, and their interactions
Ability to adhere to established operations procedures and policies
Ability to create clear step-by-step knowledge base documents for the NOC to follow and resolve known issues