I thought I was building the simplest agent possible. Instead I found the exact gap between an AI demo and an AI product: state, memory, grounding, and intent. This is the build log.

A calm calendar event sits above the waterline while a dense network of edge cases hides beneath it.

I thought this would take one evening. A Telegram bot, a Google Calendar connection, and an LLM in the middle to read what I typed. I send "lunch with Renga tomorrow 1 to 2," it pulls out a title, a date and a time, creates the event, done. The hello-world of agents.

Honestly, I was almost smiling at how simple the first version looked. Natural language in, calendar event out. On the happy path it worked, and I felt great for about a day. Then I started using it the way I actually use a calendar, and that is when the evening disappeared.

Then real humans started typing

The first failure was 11:30am-1:30pm. The bot asked me whether I meant am or pm. The am and pm were right there, glued to the numbers. The model had read it fine. The problem was downstream. I had the model emit a loose time string, and a hand-rolled regex tried to re-parse that string. The glued meridiem did not match the pattern I had imagined, so my own code flagged it as ambiguous and threw the question back at me.
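To make the failure concrete, here is a minimal sketch of how a regex can miss a glued meridiem. Both patterns are hypothetical stand-ins, not the bot's actual code: the first assumes whitespace before am/pm, the second makes it optional.

```python
import re

# Hypothetical brittle pattern: insists on whitespace before am/pm.
BRITTLE = re.compile(r"(\d{1,2})(?::(\d{2}))?\s+(am|pm)", re.IGNORECASE)
# Tolerant variant: whitespace before the meridiem is optional.
TOLERANT = re.compile(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)", re.IGNORECASE)


def find_times(pattern, text):
    """Return (hour, minute, meridiem) tuples for every match in text."""
    return [
        (int(hour), int(minute or 0), meridiem.lower())
        for hour, minute, meridiem in pattern.findall(text)
    ]


text = "11:30am-1:30pm"
print(find_times(BRITTLE, text))   # []
print(find_times(TOLERANT, text))  # [(11, 30, 'am'), (1, 30, 'pm')]
```

Loosening the pattern fixes this one input, but the next unusual format needs another patch, which is the shape of the problem described above.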

Then 1 to 2, which is how I describe lunch most of the time. My parser only knew how to find a single start time. It had no concept of a start and an end as one expression, so it either guessed a duration or asked me for more. And weirdly, each broken message looked tiny at first. A format issue. A range issue. But every one was the same shape underneath: the model understood the human, and my deterministic layer around it did not.

The ones that made me close the laptop

The edit flow is the one that made me close the laptop for the night. The bot would render a draft event on screen. I would look at it and say, "change the title to Lunch with Renga." It would throw the whole draft away, run a fresh extraction, find no time in my short edit request and complain that I had not given it one.

The time was sitting in the draft it had shown me one second earlier. But I never fed that draft back into the next turn. Every message hit the model with an empty context window. The agent had no idea a proposal was already on screen.

Architecture amnesia.

That is the bug that reframed the whole project for me. The agent was not in a conversation. It was answering isolated requests while I was sitting in a stateful one. The draft on screen was shared state, and I was the only one holding it.

The duration bug was the same disease in a different organ. A 2-hour lunch got snapped to 1 hour, not because the model misread it, but because I had an allowed-durations whitelist and 120 minutes was not in it. The user was clear, the model was right, Google Calendar could store it, and my validation layer silently overruled a correct answer.

The pain kept changing shape

Multi-day was next. "9am Saturday to 2pm Sunday" should resolve to a single event spanning about twenty-nine hours. My code collapsed it into one short same-day block, because I had only ever modeled a start plus a duration, never an explicit end that crosses midnight.
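A minimal sketch of an event model with an explicit end, so a span that crosses midnight needs no special case. The `Event` class, the dates and the UTC+05:30 offset are illustrative assumptions, not the bot's real schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed fixed offset, for illustration only.
TZ = timezone(timedelta(hours=5, minutes=30))


@dataclass
class Event:
    title: str
    start: datetime
    end: datetime  # explicit end, allowed to fall on a later day

    @property
    def duration(self) -> timedelta:
        # Derived from start and end instead of being stored separately.
        return self.end - self.start


trip = Event(
    title="Weekend trip",
    start=datetime(2025, 6, 7, 9, 0, tzinfo=TZ),   # Saturday 9am
    end=datetime(2025, 6, 8, 14, 0, tzinfo=TZ),    # Sunday 2pm
)
print(trip.duration)  # 1 day, 5:00:00
```

With start and end both stored, duration becomes a derived value, so a same-day block and a twenty-nine-hour span go through the same code path.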

Timezones were the dangerous one, because nothing looked wrong. The model resolved a time, the API returned a success, the confirmation looked clean. Then I opened Google Calendar and the event sat an hour off. That is a naive-versus-aware datetime bug, the kind that passes every surface check and quietly destroys trust. A calendar assistant cannot be almost right on time. Almost right is wrong.
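The difference is easy to see in Python's standard library. This is a sketch under assumptions: the UTC+05:30 offset and the `require_aware` helper are illustrative names, not part of the original bot.

```python
from datetime import datetime, timedelta, timezone

# Assumed user offset, for illustration only.
USER_TZ = timezone(timedelta(hours=5, minutes=30))

naive = datetime(2025, 6, 7, 13, 0)                  # 1pm, no timezone attached
aware = datetime(2025, 6, 7, 13, 0, tzinfo=USER_TZ)  # 1pm at a known offset

print(naive.isoformat())  # 2025-06-07T13:00:00
print(aware.isoformat())  # 2025-06-07T13:00:00+05:30


def require_aware(dt: datetime) -> datetime:
    """Refuse naive datetimes before they reach the calendar API."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        raise ValueError("datetime has no timezone; refusing to guess")
    return dt
```

The naive value serializes without an offset, so whatever receives it has to assume one. Failing loudly at the boundary turns a silent shift into a visible error.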

"Block my calendar" taught me about intent. If I forward an invite and say "block my calendar," I mean create a busy block for me, with no guests. My pipeline ran entity extraction on the forwarded invite, pulled the named people and added them as attendees. Locally correct extraction, globally the exact opposite of what I wanted. The named people were the subject of the event, not invitees, and nothing in my system encoded that distinction.

Conflicts were the last one. When a new event overlapped something on my calendar, the bot treated the overlap as a hard failure and stopped. But a conflict is not an error. I wanted a soft warning with the overlap surfaced, then my decision. Sometimes I am double-booking on purpose. The system's job is to tell me, not to refuse.
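One way to express "warn, do not refuse" is to have the overlap check return data rather than raise. The `Slot` type, `find_conflicts`, `propose` and the sample events below are hypothetical, sketched only to show the policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=5, minutes=30))  # assumed offset, illustration only


@dataclass
class Slot:
    title: str
    start: datetime
    end: datetime


def find_conflicts(new, existing):
    """Return existing slots that overlap `new`; an empty list means none."""
    # Two half-open intervals overlap when each starts before the other ends.
    return [e for e in existing if new.start < e.end and e.start < new.end]


def propose(new, existing):
    """Always return a proposal; overlaps become warnings, not exceptions."""
    warnings = [
        f"Overlaps '{c.title}' ({c.start:%H:%M} to {c.end:%H:%M})"
        for c in find_conflicts(new, existing)
    ]
    return {"event": new, "warnings": warnings, "needs_confirmation": bool(warnings)}


standup = Slot("Standup", datetime(2025, 6, 9, 13, 30, tzinfo=TZ),
               datetime(2025, 6, 9, 14, 0, tzinfo=TZ))
lunch = Slot("Lunch with Renga", datetime(2025, 6, 9, 13, 0, tzinfo=TZ),
             datetime(2025, 6, 9, 14, 0, tzinfo=TZ))

result = propose(lunch, [standup])
print(result["warnings"])  # ["Overlaps 'Standup' (13:30 to 14:00)"]
```

The caller still gets the event back, plus enough detail to ask the user whether the double-booking is intentional.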

I blamed the model. The model was fine.

For days my instinct was to reach up the stack. Maybe a bigger model. Maybe a different agent framework. Maybe re-host the whole thing and clean up the orchestration. Every one of those was the wrong layer. The model understood 11:30am-1:30pm. It understood the range. It understood "change the title." It understood "block my calendar." The intelligence was never the bottleneck.

I was the one taking the model's correct output and torturing it through a regex until it gave up. I had inverted the architecture. I made a probabilistic model produce a loose string, then handed authority to brittle deterministic code that rejected anything outside the shapes I had pre-imagined. The model was a component sitting behind my parser, when it should have been the thing doing the understanding.

That is a useful slap in the face. A smarter model would not have fixed the missing draft state. Re-hosting would not have fixed the timezone handling. A new framework would not have taught the system that "block my calendar" means no guests. None of my failures lived in the model. They lived in the plumbing.

Three stages: Understand, then Parse and Memory, which is the real problem, then Resolve and Commit. Each calendar scenario maps to its failure mode and its fix.

Where the work actually lives. Understanding is the easy 10 percent. Parsing, memory and intent are the 90 percent.

What I actually changed

I flipped the architecture. The model does the understanding end to end and emits structured, typed output. The deterministic code became the guardrail, not the boss. Four changes carried most of the weight.

Grounding. I inject a NOW anchor into every prompt: the current timestamp with timezone. The model resolves relative language ("tomorrow," "1 to 2," "next Saturday") into explicit timezone-aware start and end values in ISO 8601, not a free-text time string I re-parse later.
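A minimal sketch of the NOW anchor, assuming a plain-text prompt. The prompt wording, the `build_prompt` helper, the UTC+05:30 offset and the sample model reply are all illustrative assumptions rather than the bot's real prompt or output.

```python
from datetime import datetime, timedelta, timezone

USER_TZ = timezone(timedelta(hours=5, minutes=30))  # assumed user offset


def build_prompt(user_message: str, now: datetime) -> str:
    """Prefix the user's message with a timezone-aware NOW anchor."""
    if now.tzinfo is None:
        raise ValueError("NOW anchor must be timezone-aware")
    return (
        f"NOW: {now.isoformat()} ({now.strftime('%A')})\n"
        "Resolve relative dates against NOW. Return start and end as "
        "ISO 8601 datetimes with a UTC offset.\n"
        f"USER: {user_message}"
    )


now = datetime(2025, 6, 6, 10, 0, tzinfo=USER_TZ)  # a Friday
prompt = build_prompt("lunch with Renga tomorrow 1 to 2", now)
print(prompt)

# Hypothetical structured reply from the model for the message above.
reply = {
    "title": "Lunch with Renga",
    "start": "2025-06-07T13:00:00+05:30",
    "end": "2025-06-07T14:00:00+05:30",
}
start = datetime.fromisoformat(reply["start"])
end = datetime.fromisoformat(reply["end"])
print(end - start)  # 1:00:00
```

Because the reply already carries explicit offsets, the code only has to parse ISO 8601. It never has to interpret "tomorrow" or "1 to 2" itself.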

State. I pass the current proposal back into the next turn as an explicit context block. So "change the title" is an edit against the draft on screen, not a fresh extraction. The agent finally has turn-to-turn memory, because I gave it the state it was missing.
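Sketched in code, the idea is to serialize the draft into the next turn and merge back only the fields the model changes. The `CURRENT_DRAFT` label, both helper functions and the sample patch are assumptions made for illustration.

```python
import json


def build_turn(user_message, draft):
    """Compose the next-turn input, including the draft the user is looking at."""
    draft_block = json.dumps(draft, indent=2) if draft else "none"
    return (
        "CURRENT_DRAFT:\n"
        f"{draft_block}\n"
        "If CURRENT_DRAFT is not none, treat the message as an edit and "
        "return only the fields that change.\n"
        f"USER: {user_message}"
    )


def apply_patch(draft, patch):
    """Merge changed fields into the draft; untouched fields survive."""
    return {**draft, **patch}


draft = {
    "title": "Lunch",
    "start": "2025-06-07T13:00:00+05:30",
    "end": "2025-06-07T14:00:00+05:30",
}
turn = build_turn("change the title to Lunch with Renga", draft)

patch = {"title": "Lunch with Renga"}  # hypothetical model reply
updated = apply_patch(draft, patch)
print(updated["title"], updated["start"])
```

The start and end times never leave the draft, so an edit that mentions only the title can no longer trigger a complaint about a missing time.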

Constraints. I widened the duration whitelist and the event model so 120-minute lunches, multi-day spans and explicit cross-midnight end times are first-class. Validation now rejects genuinely impossible input. It does not overrule valid input.
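As a rough illustration of validation that rejects only impossible input, the hypothetical `validate` function below checks two things: that both ends carry a timezone and that the end follows the start. There is no list of allowed durations. The offset and sample dates are assumptions.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=5, minutes=30))  # assumed offset, illustration only


def validate(start: datetime, end: datetime) -> list:
    """Return a list of problems; an empty list means the event is acceptable."""
    problems = []
    if start.tzinfo is None or end.tzinfo is None:
        problems.append("start and end must be timezone-aware")
        return problems  # comparing naive with aware datetimes would raise
    if end <= start:
        problems.append("end must be after start")
    return problems


lunch_start = datetime(2025, 6, 7, 13, 0, tzinfo=TZ)

# A 2-hour lunch and a 29-hour span both pass untouched.
print(validate(lunch_start, lunch_start + timedelta(minutes=120)))  # []
print(validate(lunch_start, lunch_start + timedelta(hours=29)))     # []
# An end before the start is genuinely impossible and gets rejected.
print(validate(lunch_start, lunch_start - timedelta(hours=1)))      # ['end must be after start']
```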

Intent. Self-block is its own intent. Named people in a forwarded invite become the event's subject, attendees stay empty, and the model is told the difference explicitly. Conflicts resolve to warnings that surface the overlap and hand me the decision.
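One way to encode the distinction is to make intent a typed field and derive attendees from it, so extracted names cannot leak into the guest list. The enum values, the `Extraction` shape and the sample names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Intent(Enum):
    CREATE_EVENT = "create_event"
    SELF_BLOCK = "self_block"


@dataclass
class Extraction:
    intent: Intent
    title: str
    people: list = field(default_factory=list)  # names mentioned in the message


def attendees_for(extraction: Extraction) -> list:
    """Self-blocks never invite anyone; named people stay part of the subject."""
    if extraction.intent is Intent.SELF_BLOCK:
        return []
    return list(extraction.people)


block = Extraction(Intent.SELF_BLOCK, "Busy: design review", ["Asha", "Ravi"])
meeting = Extraction(Intent.CREATE_EVENT, "Design review", ["Asha", "Ravi"])

print(attendees_for(block))    # []
print(attendees_for(meeting))  # ['Asha', 'Ravi']
```

The same extracted names produce different attendee lists depending on intent, which is the distinction the original pipeline had nowhere to store.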

The old regex still exists, as a fallback for when the model returns something unusable. It is a safety net, not the authority. That single inversion, model as resolver with code as guardrail instead of code as parser with model as helper, is what fixed most of the list at once.
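The ordering can be sketched as "structured output first, legacy parser second". Here `parse_structured` and `resolve` are illustrative names, the JSON shape is assumed, and the legacy parser is passed in as a plain callable because its real implementation is not shown in this post.

```python
import json
from datetime import datetime


def parse_structured(raw: str):
    """Return (start, end) from the model's JSON, or None if it is unusable."""
    try:
        data = json.loads(raw)
        start = datetime.fromisoformat(data["start"])
        end = datetime.fromisoformat(data["end"])
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON, missing keys, or wrong types
    if start.tzinfo is None or end.tzinfo is None:
        return None  # treat timezone-less output as unusable
    return start, end


def resolve(raw_model_output: str, user_text: str, legacy_parser):
    """Trust the model's structured output; use the old parser only as a net."""
    result = parse_structured(raw_model_output)
    if result is not None:
        return result, "model"
    return legacy_parser(user_text), "fallback"


good = '{"start": "2025-06-07T13:00:00+05:30", "end": "2025-06-07T14:00:00+05:30"}'
times, source = resolve(good, "lunch tomorrow 1 to 2", legacy_parser=lambda text: None)
print(source)  # model

_, source = resolve("sorry, I could not parse that", "lunch tomorrow 1 to 2",
                    legacy_parser=lambda text: None)
print(source)  # fallback
```

Returning the source alongside the result makes it easy to log how often the safety net actually fires.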

Now it handles the whole set. Glued am/pm. 1 to 2. 2-hour lunches. 9am Saturday to 2pm Sunday. Timezones. "Block my calendar." Conflict warnings. From the outside it looks simple again. I type into Telegram for five seconds and the event appears. Nobody sees the dozen failure modes underneath. Nobody sees me fighting my own code. It earned its simplicity.

Why a calendar bot keeps me up at night

This is the simplest agent there is. A calendar bot. One tool. One user. A tiny schema. If even this took round after round of tuning across every scenario, then "just plug in an LLM and ship an agent" is a fantasy. The model was maybe 10 percent of the work. The other 90 percent was the unglamorous layer: state management, context passing, timezone grounding, intent disambiguation, constraint design, conflict policy and figuring out what the human actually meant.

"Block my calendar" is three words. A human resolves the scope instantly. For an agent, those three words only work if the system around the model knows who "my" is, what "block" does to availability, and that the people in the forwarded invite are the subject, not the guest list. That knowledge is not in the weights. It is in the architecture you build around them.

That is the gap I keep sitting with. Not the gap between people who can code and people who cannot. The deeper gap between a demo that fires once and an agent that holds up on the hundredth weird message. Most builders hit this wall on their second feature. The model understands the request. Turning that understanding into a reliable, stateful, intent-aware system is where the real work lives, and it is brutal to rebuild for every new agent.

This is exactly why we are building VibeModel as the Pattern Intelligence Layer. Nobody should have to rediscover state passing, grounding and intent disambiguation from scratch every time they wire a model into a real workflow. The hard parts should be handled underneath, so the model can do what it is good at and you do not spend your evenings learning that your parser is fighting your model.

I still like this calendar bot. Maybe more now than when the first version worked. And the small Telegram bot stayed with me, not because it is impressive, but because it is the cleanest proof I have of where agents actually break. The magic is not making a demo look good once. It is making the messy human message work again tomorrow.
