<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts on Shahram Anver</title>
    <link>https://shahramanver.com/posts/</link>
    <description>Recent content in Posts on Shahram Anver</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 04 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://shahramanver.com/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>AI agent learning beats demo flashiness</title>
      <link>https://shahramanver.com/posts/ai-agent-learning-beats-demo-flashiness/</link>
      <pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/ai-agent-learning-beats-demo-flashiness/</guid>
      <description>Every AI agent demo looks the same now.
What matters is how well the agent learns in its specific domain.
The best teams see production as a team sport, not a place for individual heroics. When one engineer figures out that the OOM spike is always the Redis sidecar, we want all engineers to know. But&amp;hellip; it&amp;rsquo;s never been easy making that happen.
Learning is the real value of AI agents in production — not just MTTR reduction or alert triaging.</description>
    </item>
    
    <item>
      <title>Abundance mindset with AI SRE competitors</title>
      <link>https://shahramanver.com/posts/abundance-mindset-with-ai-sre-competitors/</link>
      <pubDate>Wed, 04 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/abundance-mindset-with-ai-sre-competitors/</guid>
      <description>At re:Invent, I walked by a guy wearing a Deductive AI jacket. They&amp;rsquo;re a direct competitor, so I stopped and introduced myself.
Turns out it was Rakesh, their CEO.
Within five minutes, we were excitedly swapping notes on AI SRE like old friends.
Same thing happened with Stephen from incident.io - he reached out and we planned 30 minutes for coffee. We ended up talking for over an hour before he realized he was late to his next meeting.</description>
    </item>
    
    <item>
      <title>The coding interview needs to die</title>
      <link>https://shahramanver.com/posts/the-coding-interview-needs-to-die/</link>
      <pubDate>Wed, 04 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/the-coding-interview-needs-to-die/</guid>
      <description>The coding interview as we know it needs to die, but no one at the table knew what replaces it.
At a dinner we hosted with engineers from Anthropic, OpenAI, TogetherAI, Apple, and other startups, this was the most contentious topic: how do you evaluate engineering as a craft when AI writes the code?
Some basics are now even more important, like:
First-principles thinking, like asking why before reaching for a new pattern or tool.</description>
    </item>
    
    <item>
      <title>Two years in America</title>
      <link>https://shahramanver.com/posts/two-years-in-america/</link>
      <pubDate>Sun, 04 Jan 2026 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/two-years-in-america/</guid>
      <description>It&amp;rsquo;s almost 2026, which means I&amp;rsquo;ve been in the US for nearly two years now.
I thought I knew what to expect. America is always in the news… the ambition, the chaos, the scale of everything.
Turns out it&amp;rsquo;s what I didn&amp;rsquo;t expect that defined my time here.
A few weeks after we moved, I took my daughters to the park. Ice cream truck pulls up. I queue with them, excited to give them their first American ice cream truck experience.</description>
    </item>
    
    <item>
      <title>Hallucination rate is the wrong question</title>
      <link>https://shahramanver.com/posts/hallucination-rate-is-the-wrong-question/</link>
      <pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/hallucination-rate-is-the-wrong-question/</guid>
      <description>What&amp;rsquo;s your hallucination rate?
I get this question constantly. And for a while, I tried to answer it with benchmarks, percentages, confidence intervals.
None of it moved the needle.
Turns out the question isn&amp;rsquo;t really &amp;ldquo;how often does your agent lie?&amp;rdquo; It&amp;rsquo;s &amp;ldquo;should I trust this thing?&amp;rdquo;
And that&amp;rsquo;s not something you answer with a number.
It&amp;rsquo;s something Fred answers.
Every org has a Fred. The senior engineer who&amp;rsquo;s picky, skeptical, hard to impress.</description>
    </item>
    
    <item>
      <title>Coding agents vs SRE agents are different beasts</title>
      <link>https://shahramanver.com/posts/coding-agents-vs-sre-agents/</link>
      <pubDate>Sat, 15 Nov 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/coding-agents-vs-sre-agents/</guid>
      <description>Coding agents and &amp;ldquo;AI SRE&amp;rdquo; are both agents, but they&amp;rsquo;re fundamentally different beasts.
A coding agent goes deep. It has to handle infinite types of requests (add feature X) against one context: your repo. An SRE agent goes wide. Finite types of asks (why is X down?), but the context sprawls across everything you run.
And that width is where the pain lives. To diagnose an issue:
Your K8s cluster isn&amp;rsquo;t useful without your logs.</description>
    </item>
    
    <item>
      <title>We&#39;re asking the wrong question about AI agents in production</title>
      <link>https://shahramanver.com/posts/wrong-question-about-ai-agents-in-production/</link>
      <pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/wrong-question-about-ai-agents-in-production/</guid>
      <description>We&amp;rsquo;re asking the wrong question about AI agents in production.
The debate right now is whether they&amp;rsquo;re only good for prototypes or whether they can actually ship to prod.
Both camps are loud. Both are right&amp;hellip; in their own environment.
The better question is what separates them.
I&amp;rsquo;ve seen two big factors.
1/ Tech stack. LLMs know Python, JavaScript, Go, and Java far better than Scala or Elixir. If your stack&amp;rsquo;s niche, you&amp;rsquo;re fighting uphill.</description>
    </item>
    
    <item>
      <title>Dogfooding is an experiment, not a mandate</title>
      <link>https://shahramanver.com/posts/dogfooding-is-an-experiment-not-a-mandate/</link>
      <pubDate>Sun, 05 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/dogfooding-is-an-experiment-not-a-mandate/</guid>
      <description>All of the best products had brutal amounts of dogfooding.
Slack. Kubernetes. Cloudflare.
But a mistake teams make is treating dogfooding like a compliance ritual, something everyone has to do.
That&amp;rsquo;s when it stops teaching you anything.
Dogfooding is an experiment, not a mandate.
The real question is: if people could easily leave, would they still choose your product?
At Gojek, when I ran ML platforms, we never forced adoption. Teams could use GCP&amp;rsquo;s stack or OSS alternatives, whatever they liked.</description>
    </item>
    
    <item>
      <title>Founders should code a little</title>
      <link>https://shahramanver.com/posts/founders-should-code-a-little/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/founders-should-code-a-little/</guid>
      <description>Founders should code.
Not all the time.
Not never.
A little.
Coding is crack for technical CEOs. It feels like progress, it&amp;rsquo;s an easy dopamine hit.
Shipping a PR is clean, measurable. Sending cold emails or sitting in ambiguity? Not so much.
So I used to treat coding like junk food and cut it out completely. I only focused on GTM and product direction, the &amp;ldquo;real&amp;rdquo; CEO work.
And that worked… until it didn&amp;rsquo;t.</description>
    </item>
    
    <item>
      <title>Kill the flies before fighting fires</title>
      <link>https://shahramanver.com/posts/kill-the-flies-before-fighting-fires/</link>
      <pubDate>Wed, 10 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/kill-the-flies-before-fighting-fires/</guid>
      <description>Kill the flies. Then you&amp;rsquo;ll finally have time for the fires.
Hot take: most teams don&amp;rsquo;t have an &amp;ldquo;incident response&amp;rdquo; problem. They have a noise economy problem. We celebrate the Friday-night P0 save and ignore the 300 pages that ate someone&amp;rsquo;s entire week. Half of those &amp;ldquo;real&amp;rdquo; alerts auto-resolve. That&amp;rsquo;s not resilience, it&amp;rsquo;s Stockholm syndrome.
If your on-call spends the week acknowledging PagerDuty, you&amp;rsquo;re not improving MTTR - you&amp;rsquo;re burning the team&amp;rsquo;s mean thinking time.</description>
    </item>
    
    <item>
      <title>Deploy AI agents with domain-confident teams first</title>
      <link>https://shahramanver.com/posts/deploy-ai-agents-with-domain-confident-teams/</link>
      <pubDate>Sun, 20 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/deploy-ai-agents-with-domain-confident-teams/</guid>
      <description>If you&amp;rsquo;re thinking about deploying AI agents, start with teams that have strong domain confidence. I&amp;rsquo;ve seen this pattern repeatedly and I suspect it&amp;rsquo;ll be an industry pattern through 2025.
These teams know their systems well. They can tell when an agent is genuinely helpful, and when it&amp;rsquo;s off. That confidence makes all the difference.
Early adopters tend to be the strongest teams. Low alert fatigue and high confidence in their systems.</description>
    </item>
    
    <item>
      <title>Engineering teams&#39; informal failure banks</title>
      <link>https://shahramanver.com/posts/engineering-teams-informal-failure-banks/</link>
      <pubDate>Thu, 10 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/engineering-teams-informal-failure-banks/</guid>
      <description>Every engineering team has an informal &amp;ldquo;failure bank&amp;rdquo; distributed across different engineers&amp;rsquo; memories. Engineer A knows all the edge cases of different configs. Engineer B has fought many battles with Kafka rebalancing. Engineer C knows all the quirks with Kubernetes autoscaling. This uneven distribution of troubleshooting effectiveness is both a strength and a vulnerability.
Last week an engineer told me how he saved his team from a multi-hour outage. Kubernetes pods were stuck in &amp;lsquo;Pending&amp;rsquo; with no error messages.</description>
    </item>
    
    <item>
      <title>Agent prompts evolved from yelling to specs</title>
      <link>https://shahramanver.com/posts/agent-prompts-evolved-from-yelling-to-specs/</link>
      <pubDate>Sat, 05 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/agent-prompts-evolved-from-yelling-to-specs/</guid>
      <description>I was looking at our agent prompts the other day and thought about how quickly the way we build the agent has changed.
In 2023, our prompts looked like emergency broadcasts: &amp;ldquo;ALWAYS LOOK AT RECENT DEPLOYMENTS FIRST!!!&amp;rdquo; and &amp;ldquo;JUST GIVE ME JSON, ONLY THE FACTS!!!&amp;rdquo; We were yelling at the models, pleading for them to just follow basic instructions.
Fast forward to today, and the prompts read more like technical specs with nuanced instructions.</description>
    </item>
    
    <item>
      <title>When your agent evals get too sentient</title>
      <link>https://shahramanver.com/posts/when-agent-evals-get-too-sentient/</link>
      <pubDate>Thu, 20 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/when-agent-evals-get-too-sentient/</guid>
      <description>Our agent evals got a little too sentient today. The agent detected it was in a simulated Kubernetes environment (Kwok) and refused to investigate further.
Interesting… do you:
Lie to the agent, insist it&amp;rsquo;s a real env, and keep going? Pat it on the back and call it a successful eval?
Earlier today it complained, and this is not a joke, that the system it is investigating isn&amp;rsquo;t its responsibility and we should escalate to another team.</description>
    </item>
    
    <item>
      <title>Small anomalies compound into catastrophic outages</title>
      <link>https://shahramanver.com/posts/small-anomalies-compound-into-catastrophic-outages/</link>
      <pubDate>Sat, 15 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/small-anomalies-compound-into-catastrophic-outages/</guid>
      <description>Most catastrophic outages don&amp;rsquo;t start with dramatic failures - they begin with small anomalies that compound over time. A minor latency increase, an occasional timeout, a query that&amp;rsquo;s slightly slower than usual. None trigger immediate action, but left unaddressed, they become the foundation for systemic failures.
These &amp;ldquo;paper cuts&amp;rdquo; are a bandwidth problem, not a detection problem. Engineers already know about these issues - they&amp;rsquo;re buried in dashboards, buried in logs, buried in the team&amp;rsquo;s collective memory.</description>
    </item>
    
    <item>
      <title>Cross-team debugging friction is the real killer</title>
      <link>https://shahramanver.com/posts/cross-team-debugging-friction/</link>
      <pubDate>Wed, 05 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/cross-team-debugging-friction/</guid>
      <description>The real productivity killer in production isn&amp;rsquo;t technical complexity - it&amp;rsquo;s the organizational friction of debugging issues across team boundaries.
Last week, a senior engineer at big-tech-company told me something that hit home: &amp;ldquo;When we spot an issue, we decide whether to band-aid it now or spend a quarter chasing the real fix.&amp;rdquo; This isn&amp;rsquo;t a technical limitation. It&amp;rsquo;s what happens when the coordination cost exceeds the fix cost.
Here&amp;rsquo;s a familiar pattern: An alert fires.</description>
    </item>
    
    <item>
      <title>I quit my job to tackle complexity in software systems</title>
      <link>https://shahramanver.com/posts/quitting-my-job/</link>
      <pubDate>Sat, 28 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>https://shahramanver.com/posts/quitting-my-job/</guid>
      <description>A few months ago I was on a call which led to quitting my job.
I wasn&amp;rsquo;t angry, the call went great. I actually loved my job, had great colleagues and was having a ball solving complex problems.
Instead, the call made me obsess about the new world that was coming and how we all needed to prepare for it. If engineering systems is your jam, read on.
It was a regular call with engineering leadership to review production issues from the past month.</description>
    </item>
    
  </channel>
</rss>
