The Cloud Operations Monitoring team ensures that a solid monitoring practice and strategy are in place and integrated with the IT Service Management (ITSM) framework for automatic alerting, tracking, escalation, and resolution of issues impacting Sumtotal services. This team will lead efforts to review current functionality, identify and remediate gaps, and evaluate new and replacement capabilities to improve alignment with organizational strategy and the direction towards operational intelligence and cognitive IT.
This role will be responsible for Sumtotal’s enterprise-wide strategy, implementation, and operations for end-to-end monitoring of applications and infrastructure, utilizing and enhancing the currently deployed technologies. This role requires hands-on experience implementing monitoring across the application landscape. The resource will identify key business transactions in collaboration with the business and implement monitoring that raises proactive alerts on system disruptions and outages before they create business impact.
- Design, implement, and operate software tools and processes for end-to-end monitoring of all technology assets.
- Lead efforts to design, integrate, and implement monitoring systems for core infrastructure, cloud infrastructure, applications, performance monitoring, and synthetic monitoring.
- Resolve complex technical issues and drive innovation that improves system availability, resiliency, and performance.
- Build end-to-end monitoring, detection, and prevention tools for both internal information technology assets and external product/go-to-market environments.
- Define, drive, and implement architecture strategy for end-to-end monitoring.
- Partner with other technology teams, including application development, testing services, network engineering, security, and operations, to establish an end-to-end monitoring strategy.
- Engineer monitoring solutions, operations, and policies to drive proactive alerts and notifications.
- Work with operational teams to solve problems proactively through automation.
Knowledge, Skills, and Abilities
- Ability to architect end-to-end monitoring tool solutions for new deployments
- In-depth knowledge of event management (defining the progression from events to alerts, warnings, and alarms), alert suppression logic, and event correlation
- Hands-on experience in integrating EMS and ITSM tools (bidirectional API integration)
- Hands-on experience in implementing and managing monitoring tools such as Catchpoint, New Relic, Dynatrace, AppDynamics, and AWS CloudWatch; event aggregation tools such as BigPanda; dashboarding tools such as Grafana; time series databases such as InfluxDB; and logging tools such as Splunk and ELK
- Working knowledge of holistic application monitoring with drill-down into the platform layers (infrastructure, data, middleware, applications, etc.)
- Knowledge of SCOM and SolarWinds monitoring tools is a plus.
- Working knowledge of automated QA tools such as Selenium
- Strong experience with Windows, Unix, and Linux systems
- Working knowledge of operating systems (primarily Windows and Linux), networking fundamentals, and communication protocols such as WMI, RPC, and SNMP
- Strong application and database experience in an operations or architecture role
- Desire and ability to thrive in a fast-paced, highly demanding, dynamic business and information technology environment
- Excellent communication skills and experience in driving cross-department initiatives to achieve organizational objectives
- Strong communication, presentation, and business and technical writing skills
Education, Training and Minimum Qualifications
- Degree in Computer Science or equivalent experience.
- 5+ years of experience in a monitoring engineering or related role
- Minimum of 10 years of system administration of enterprise monitoring systems.
- Minimum of 10 years of experience with management products such as BMC ProactiveNet/Patrol, OEM, Microsoft SCOM, and SolarWinds.
- Minimum of 3 years of experience with application performance monitoring (APM) and real user monitoring (RUM) tools
- Advanced knowledge of enterprise monitoring metrics, reporting, logging, and best practices.
- Experience with SNMP traps.