You solved the platform.
You abstracted the infra.
Engineers still can't troubleshoot it.
How OpsWorker started
Before writing a line of code, Ar spent months validating the idea. He interviewed engineers and tech leaders across 20+ companies - from large enterprises to fast-moving product startups - at every level: individual contributors, platform leads, VPs, CTOs.
The pattern was consistent. The investigation burden was real, widespread, and expensive, especially for complex incidents. And it was not just about incidents. It was about the daily friction of operating complex systems - understanding what changed, why a workload is behaving differently, what dependency is introducing risk, why an alert is firing again after it was supposedly resolved last week.
During that process, Ar also discussed the issue with Nune Isabekyan, founder of Powerdata GmbH and a cloud architecture expert with deep hands-on experience in exactly the systems OpsWorker would need to reason about. A few sessions of focused brainstorming made the direction clear. Nune joined as co-founder and CTO, and OpsWorker was born.
The problem we could not ignore
Engineers still struggle to troubleshoot the platforms built for them, and that is not a failure of effort. Platform teams have spent years doing everything right - standardizing, abstracting, building internal developer platforms, writing runbooks, enforcing observability standards. And it still does not close the gap.
When something breaks in a complex cloud-native environment, or when a developer needs to understand why their workload is behaving differently in production, the knowledge required to investigate is deep, contextual, and hard to transfer. You cannot document your way out of it. You cannot platform-engineer your way out of it either.
After a decade building and operating multi-cloud, multi-cluster Kubernetes environments at scale - platforms running tens of thousands of microservices - Ar Kobian had tried most of the approaches. As platform owner and engineering lead, his work was not just building the infrastructure but ensuring software development teams could actually use it: understand it, troubleshoot it, provision against it without becoming infrastructure experts themselves.
The wall was not the complexity. That is expected. The wall was everything that happens when something goes wrong inside that complexity - or when an engineer needs to understand it, or explain it, or act on it under pressure.
Shifting the problem left or right does not make it disappear. Making software engineers infrastructure-literate at scale takes years and rarely sticks. Every platform evolution reopens the knowledge gap. And every incident, every slow workload, every unexplained degradation still ends the same way: find the right engineer and have them figure it out manually.
When the LLM era began, we saw something different. Not another abstraction layer. Not another runbook or self-service portal. The possibility of an AI layer that could investigate, explain, and guide - meeting engineers where they are, in the moment the system is breaking, with the specific answer they need right now.
That was the change worth building.
The team

A decade building and operating multi-cloud, multi-cluster Kubernetes environments at scale. Former platform owner and engineering lead running infrastructure for tens of thousands of microservices. Built OpsWorker to solve the problem he lived every day: closing the knowledge gap between platform complexity and engineering reality.
What we built
OpsWorker is an AI SRE Production Intelligence platform for engineering teams running Kubernetes in production.
When an alert fires - from Prometheus, Datadog, CloudWatch, or any webhook-compatible monitoring tool - OpsWorker begins investigating immediately, without waiting for an engineer to open a terminal. It examines pods, services, deployments, logs, events, configurations, and resource relationships in parallel. It evaluates multiple hypotheses. It correlates what happened in the infrastructure with what changed in recent deployments.
Then it delivers a root-cause analysis with specific remediation steps - including copy-paste kubectl commands - to Slack in under two minutes.
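To make the flow above concrete, here is a minimal sketch of the general pattern: several independent checks run concurrently against an incoming alert, and the resulting findings are ranked into a short report. It is not OpsWorker's implementation. Every name in it (Finding, check_pods, check_events, check_recent_deployments, investigate, format_report), the alert fields, and the confidence scores are hypothetical, and the checks return canned data instead of querying a real cluster.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    source: str    # which check produced this finding
    summary: str   # human-readable observation
    score: float   # hypothetical confidence in [0, 1]


# Each check is a stand-in: a real one would query the cluster or a log store.
async def check_pods(alert: dict) -> Finding:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return Finding("pods", f"pods in {alert['namespace']} are restarting", 0.6)


async def check_events(alert: dict) -> Finding:
    await asyncio.sleep(0)
    return Finding("events", "OOMKilled events observed", 0.8)


async def check_recent_deployments(alert: dict) -> Finding:
    await asyncio.sleep(0)
    return Finding("deployments", "no rollout in the last hour", 0.2)


async def investigate(alert: dict) -> list[Finding]:
    # Run all checks concurrently, then rank findings by confidence.
    checks = [
        check_pods(alert),
        check_events(alert),
        check_recent_deployments(alert),
    ]
    findings = await asyncio.gather(*checks)
    return sorted(findings, key=lambda f: f.score, reverse=True)


def format_report(alert: dict, findings: list[Finding]) -> str:
    # The top-ranked finding is presented as the most likely cause.
    top = findings[0]
    lines = [
        f"Alert: {alert['name']}",
        f"Most likely cause: {top.summary} ({top.source})",
    ]
    lines += [f"- {f.source}: {f.summary} (score {f.score:.1f})" for f in findings[1:]]
    return "\n".join(lines)


alert = {"name": "HighMemoryUsage", "namespace": "checkout"}
findings = asyncio.run(investigate(alert))
report = format_report(alert, findings)
print(report)
```

Running the checks with asyncio.gather keeps the total wait close to the slowest single check instead of the sum of all of them, which is the point of investigating in parallel. In a real system the report string would be posted to a chat channel instead of printed.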
No new agents to deploy on day one. No new dashboards to learn. The investigation arrives where your team already works.
Beyond incidents, OpsWorker builds a living memory of your production systems - so every investigation makes the next one faster, and institutional knowledge stops walking out the door when engineers change teams or companies.
Under two minutes from alert received to root-cause analysis delivered in Slack
Less investigation time compared to manual workflows
No human trigger required
Why we work this way
We are not building another observability platform. We are not replacing your existing monitoring stack. We are filling the gap that every tool in this space has left open: the investigation itself.
We know what it costs to operate complex systems at scale - the alert fatigue, the debugging sessions that outlast the fixes, the tribal knowledge that lives in the heads of two senior engineers and nowhere else. That experience shapes everything we build.
We are honest about what OpsWorker does and does not do. It investigates and recommends. It does not execute remediations without engineer approval - that boundary is intentional. Trust is built through transparency, not through claims that sound better than they are.
What's next
Building a knowledge graph that connects incidents, deployments, configuration changes, and investigation outcomes across time. OpsWorker will not just solve today's alert - it will recognize patterns, predict risk, and retain context across weeks and months.
Expanding beyond root-cause analysis into guided remediation workflows. Engineers will approve recommended fixes, and OpsWorker will execute them safely - with rollback plans, validation checks, and full audit trails.
Shifting left from incident response to risk detection. OpsWorker will identify configuration drift, resource contention, deployment anomalies, and dependency vulnerabilities before they trigger alerts - giving teams time to act before systems break.
Vision
The goal is not to replace engineers. It is to give them leverage.
We believe the future of platform engineering is not about adding more tools or building bigger abstractions. It is about reducing the cognitive load on the people operating those platforms - meeting them in the moment they need help, with the exact context and answer required to act confidently.
OpsWorker exists to close the gap between platform complexity and human understanding. To turn weeks of learning into minutes of clarity. To make institutional knowledge durable and accessible. To let engineers focus on building instead of firefighting.
That is the future we are building toward.
Join us early
OpsWorker is in active development, and we are working with frontier companies and teams that run Kubernetes and microservices in production.
If you want to reduce firefighting, accelerate root-cause analysis, and give your team a smarter way to operate complex cloud-native systems, we would like to talk.
