About This Role:
We are hiring a hands-on DevOps Engineer to manage and support production-grade cloud infrastructure for KIBO’s commerce platform. This role focuses on Kubernetes (EKS), Terraform, and real-time production troubleshooting in a 24/7 on-call environment.
KIBO is a composable digital commerce platform for B2C, D2C, and B2B organizations that want to reduce complexity in their businesses and deliver modern customer experiences. KIBO is the only modular, modern commerce platform that supports experiences spanning B2B and B2C Commerce, Order Management, and Subscriptions. Companies like Ace Hardware, Zwilling, Jelly Belly, Nivel, and Honey Birdette trust KIBO to bring simplicity and sophistication to commerce operations and deliver experiences that drive value.
KIBO's cutting-edge solution is MACH Alliance Certified and has been recognized by Forrester, Gartner, IDC, Internet Retailer, and TrustRadius. KIBO has been named a leader in The Forrester Wave™: Order Management Systems, Q1 2025 and in the IDC MarketScape report “Worldwide Enterprise Headless Digital Commerce Applications 2024 Vendor Assessment”.
By joining KIBO, you will be part of a team of Kibonauts all over the world in a remote-friendly environment. Whether your job is to build, sell, or support KIBO’s commerce solutions, we tackle challenges together with the approach of trust, growth mindset, and customer obsession. If you’re seeking a unique challenge with amazing growth potential, then come work with us!
Responsibilities:
- Manage and operate production-grade Kubernetes clusters (EKS preferred), ensuring high availability and scalability
- Troubleshoot real-time production issues across distributed systems and microservices
- Diagnose and resolve issues such as:
  - Pod failures (CrashLoopBackOff, Pending, OOMKilled)
  - Node failures, autoscaling, and resource constraints
  - Networking, ingress, and service connectivity issues
- Build, maintain, and debug infrastructure using Terraform (modules, remote state, locking, drift handling)
- Implement and enhance monitoring & alerting systems using Prometheus, Grafana, and related tools
- Perform root cause analysis (RCA) for incidents and drive permanent fixes to improve system reliability
- Participate in a 24/7 on-call rotation, owning incidents and resolving them independently
- Collaborate with engineering teams to improve system performance, resilience, and deployment processes
- Automate deployments, infrastructure provisioning, and operational workflows to reduce manual effort
- Ensure adherence to security best practices across infrastructure and deployments