Ano's MCP server: 10 ms of code, 500 ms of network, and four bugs

8 min read #engineering #mcp #agents #performance

Ano is team chat with agents in the room. People and AI coworkers share the same channels, and a coworker that is going to be useful has to read the workspace and write back into it: post a message, open a thread, query a table, react. So we needed agents to reach Ano. One of the ways in is an MCP server.

We set out to make that server fast. We did, in the narrow sense. Then we measured the whole path and learned that the fast part was never the bottleneck. What actually took the time was everything around the protocol: not forking our app into a parallel “agent API,” and the eventing edge cases that only show up once real clients connect and start reconnecting at the worst moment.

This is a field report. Numbers, the architecture that paid off, and four bugs that did not.

The latency is the network. It is always the network.

Measured against a server running on the same machine, an MCP tool call round-trips in about 10 ms. The same operation over our CLI was about 170 ms, almost all of it process startup. On paper the MCP path is a 17x win, and if you stop measuring there you write a blog post about how fast your MCP server is.

So we kept measuring. Point both clients at production, an API server in Frankfurt, from a laptop in California, and the result is roughly 500 ms either way. The transatlantic round trip swallows the 160 ms difference whole. Against staging the gap is about 130 ms, against a regional prod node about 140 ms, and in every case it disappears under the speed of light.

The honest conclusion surprised us enough that we changed our own defaults. For agents reaching Ano from a terminal, we do not reach for MCP first. We use the CLI with a warm local daemon, because:

  • Over a 500 ms hop the daemon’s 10 ms warm path and MCP’s 10 ms are the same number, which is to say zero.
  • The CLI’s keys are long-lived. MCP’s OAuth tokens force a re-auth after seven days idle, which is correct for a third party and annoying for your own tooling.
  • A wedged daemon trips a circuit breaker and falls through to direct execution. It can never make a call slower than not having it. We learned to want that property after a stale daemon once added ten seconds to every command.
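The third bullet can be sketched as a per-call timeout plus a breaker. This is a minimal illustration, not our daemon client: DaemonBreaker, withTimeout, and every threshold below are hypothetical names and assumed values.

```typescript
// Hypothetical sketch, not Ano's daemon client. All thresholds are assumed values.
type Call<T> = () => Promise<T>;

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("daemon timeout")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

class DaemonBreaker {
  private failures = 0;
  private openUntil = 0;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;
  private readonly timeoutMs: number;
  private readonly now: () => number;

  constructor(maxFailures = 3, cooldownMs = 30_000, timeoutMs = 250, now: () => number = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  async run<T>(viaDaemon: Call<T>, direct: Call<T>): Promise<T> {
    // Breaker open: do not touch the daemon at all.
    if (this.now() < this.openUntil) return direct();
    try {
      const result = await withTimeout(viaDaemon(), this.timeoutMs);
      this.failures = 0;
      return result;
    } catch {
      // Count the failure and open the breaker once the threshold is hit.
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openUntil = this.now() + this.cooldownMs;
        this.failures = 0;
      }
      return direct(); // fall through to direct execution
    }
  }
}
```

In this sketch a wedged daemon costs at most timeoutMs per call until the breaker opens, and nothing after that until the cool-down ends. Falling through after a timeout can run a write twice if the daemon did finish the work, so a real client would need the op to be idempotent or would limit the timeout to connection setup.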

None of this makes the MCP server pointless. It makes it the right tool for the job it is actually for: letting a third-party client you do not control, Claude Desktop or anything else that speaks the protocol, connect to a workspace over standards. We optimized that path for correctness and for auth you can revoke, and we stopped chasing microseconds the network was going to eat anyway.

If there is one thing to take from this section: benchmark the whole path, from the client a real user runs to the row in Postgres, before you optimize any single hop. The hop you are proud of is probably not the one that matters.
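A whole-path benchmark does not need much tooling. The helper below is a hypothetical sketch (measure and percentile are illustrative names, not part of our CLI): it times the same call repeatedly and reports nearest-rank percentiles, so a local run and a run against production can be compared like for like.

```typescript
// Hypothetical helper for timing an end-to-end call; names are illustrative.

// Nearest-rank percentile over an ascending array of samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sortedMs.length);
  const index = Math.min(sortedMs.length - 1, Math.max(0, rank - 1));
  return sortedMs[index];
}

// Runs `call` sequentially and summarizes wall-clock latency in milliseconds.
async function measure(call: () => Promise<unknown>, runs = 20) {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await call(); // the whole path: client, network, server, database
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    max: samples[samples.length - 1],
  };
}
```

Run it from the machine a real user would use, once per client and once per environment, and compare the p50s before deciding which hop deserves the work.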

One op, three callers

Here is the decision I would make again without thinking.

Every capability in Ano is an “op”: a plain async function that takes a SQL handle, an auth context, and typed args, does the work, and returns a typed result. opSendMessage, opListChannels, opReadMessages, and so on. The op is the only place the work lives. Everything that can call it is a thin adapter:

// the op: the work lives here, once
async function opSendMessage(sql, ctx, args: SendArgs): Promise<Sent> {
  await assertCanPostToChannel(sql, ctx, args.channelId);
  const row = await insertMessage(sql, ctx, args);
  await emitApiEvent(sql, "message", row); // fans out to agents, below
  return row;
}

// caller 1: the MCP server tool (external clients, OAuth + Streamable HTTP)
registerTool("send_message", SendSchema, (args, ctx) => opSendMessage(sql, ctx, args));

// caller 2: the /mcp REST surface (our CLI + external coworkers, API key + scope)
router.post("/send_message", requireScope("write:messages"), async (c) =>
  c.json(await opSendMessage(sql, c.var.ctx, c.req.valid("json"))),
);

// caller 3: the in-app coworker's native tool (the agent loop)
defineTool({ name: "send_message", inputSchema: SendSchema,
  execute: (args, ctx) => opSendMessage(sql, ctx, args) });

Three transports, three auth models, one implementation. A human clicking send, a Claude Desktop tool call, our CLI, and an in-app coworker all run the same function with the same permission checks. There is no second “agent API” to drift from the first, because there is no second implementation. Add a capability once and every surface gets it. Tighten a permission once and every surface tightens.

The input schema is a Zod object defined next to the op. It is turned into JSON Schema exactly once and cached, not re-derived per call, and the same schema validates the REST body, the MCP tool arguments, and the agent’s tool input. A CI guard test fails the build if an op is defined and not registered, because the failure mode of “we shipped a tool the agent cannot see, or a tool with no scope” is the kind of thing you want a red build for, not a support ticket.
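Both ideas in that paragraph are small enough to sketch. The code below is an illustration under assumptions, not our build tooling: cachedJsonSchema and findUnregisteredOps are hypothetical names, the converter is passed in rather than imported so the sketch has no dependencies, and it assumes every op is expected on every surface.

```typescript
// Hypothetical sketch. `toJsonSchema` stands in for whatever converter turns
// the Zod object into JSON Schema; it is injected to keep this dependency-free.
const jsonSchemaCache = new WeakMap<object, unknown>();

function cachedJsonSchema<S extends object>(schema: S, toJsonSchema: (s: S) => unknown): unknown {
  if (!jsonSchemaCache.has(schema)) {
    // Derived once per schema object, then reused for every call.
    jsonSchemaCache.set(schema, toJsonSchema(schema));
  }
  return jsonSchemaCache.get(schema);
}

// CI guard: report every op that is missing from any surface.
// Assumes each surface exposes the set of op names it has registered.
function findUnregisteredOps(opNames: string[], surfaces: Record<string, Set<string>>): string[] {
  const problems: string[] = [];
  for (const op of opNames) {
    for (const [surface, registered] of Object.entries(surfaces)) {
      if (!registered.has(op)) problems.push(`${op} is not registered on ${surface}`);
    }
  }
  return problems;
}
```

A guard test would call findUnregisteredOps with the exported op names and fail when the returned list is not empty, printing the list as the failure message.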

The cost of this is real and worth saying out loud: one schema is now the public contract for three consumers at once. Loosen it carelessly and you have loosened it for an external client you will never meet. We treat op signatures the way you treat a published API, because they are one.

There is no “MCP integration.” There are three directions.

People say “we added MCP” as if it is one thing. For a chat app it is three relationships, each with a different transport and trust model, and conflating them is how you ship a security bug.

  1. Ano as an MCP server. External clients connect in over Streamable HTTP, the current web-standard transport, which Hono serves natively. Auth is OAuth 2.1 with PKCE, proxied to WorkOS for identity. This is the door for software you do not control.
  2. Ano pushing out. An “external coworker” is an agent you run yourself. It registers a webhook_url, and when something happens in a channel it is in, we POST the event to that URL, HMAC-signed. It reads and writes back through the /mcp REST surface with a scoped ano_cwk_ key. Push for liveness, pull for action.
  3. Ano as an MCP client. Ano connects out to other tools’ MCP servers, lists their tools, and calls them on a coworker’s behalf. Same protocol, opposite direction.

Most of the interesting failures live in direction 2, the eventing path, because it is the one with a network in the middle that you do not own and cannot retry into submission.

Four bugs

1. The SSE replay race

External coworkers can subscribe to /mcp/stream, an SSE endpoint. On reconnect they send Last-Event-ID and we replay whatever they missed, then keep the stream open with new events. Events come off a Postgres LISTEN/NOTIFY channel fed by an api_events table.

The first version did the obvious thing: run the replay query for everything since the last id, then subscribe to LISTEN. Any event that landed in the gap between the query returning and the subscription registering was gone. No error, no log, just a coworker that quietly never heard about one message. It passed every test, because tests do not reconnect a hundred clients during a deploy.

The fix is to invert the order: subscribe first, buffer, then run the replay, then flush. Boring once you see it. Invisible until a real client reconnects at the exact wrong millisecond, which at any real volume is constantly.
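The corrected order can be sketched as follows. This is a minimal illustration, not the handler we run: ApiEvent, Deps, and openStream are hypothetical names, and the duplicate check assumes event ids increase in the order events become visible.

```typescript
// Hypothetical shapes; the real event row and transport are not shown here.
interface ApiEvent { id: number; type: string; payload: unknown }

interface Deps {
  subscribe(onEvent: (e: ApiEvent) => void): () => void; // e.g. a LISTEN wrapper
  replaySince(lastId: number): Promise<ApiEvent[]>;      // rows with id > lastId, ascending
  send(e: ApiEvent): void;                               // write one SSE frame
}

async function openStream(lastEventId: number, deps: Deps): Promise<() => void> {
  const buffer: ApiEvent[] = [];
  let live = false;
  let lastSent = lastEventId;

  const deliver = (e: ApiEvent): void => {
    if (e.id <= lastSent) return; // already sent by the replay
    lastSent = e.id;
    deps.send(e);
  };

  // 1. Subscribe first, so nothing that lands during the replay is lost.
  const unsubscribe = deps.subscribe((e) => {
    if (live) deliver(e);
    else buffer.push(e);
  });

  // 2. Replay what the client missed while it was disconnected.
  let missed: ApiEvent[];
  try {
    missed = await deps.replaySince(lastEventId);
  } catch (err) {
    unsubscribe(); // do not leak the subscription if the replay fails
    throw err;
  }
  for (const e of missed) deliver(e);

  // 3. Flush the buffer, dropping anything the replay already covered, then go live.
  buffer.sort((a, b) => a.id - b.id);
  for (const e of buffer) deliver(e);
  live = true;

  return unsubscribe;
}
```

Step 3 runs without an await, so in a single-threaded runtime no notification can slip in between the flush and the switch to live delivery. An event that appears in both the replay and the buffer is sent once, because deliver ignores ids it has already passed.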

2. The retention time bomb

api_events is trimmed on a schedule, thirty days by default. The long-poll fallback, /mcp/events?since=<cursor>, paginates from the client’s last cursor with the moral equivalent of COALESCE(MAX(id) WHERE id > cursor, 0).

You can see it coming. A client disconnects for a month. Its cursor row gets trimmed. It reconnects, asks “what did I miss since cursor,” the COALESCE falls through to 0, and we cheerfully return the entire history of the channel in one page. The best-behaved client, the one that politely remembered where it was, gets a firehose for being away too long.

The fix is a clamped floor: never page from further back than about an hour. A long-disconnected client gets a small recent window and a signal to do a full resync, not a denial-of-service we mailed to ourselves.
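The clamp itself is a few lines once the floor is known. The sketch below is hypothetical: clampCursor and PageStart are illustrative names, and the query in the comment assumes api_events has a created_at column, which this post does not confirm.

```typescript
// Hypothetical sketch of the clamped floor; names are illustrative.
interface PageStart {
  afterId: number;         // page from events with id > afterId
  resyncRequired: boolean; // tells the client its cursor was too old to honor
}

// floorAfterId is the newest event id that is older than the allowed window.
// One assumed way to compute it (the `created_at` column name is a guess):
//   SELECT COALESCE(MAX(id), 0) FROM api_events
//   WHERE created_at < now() - interval '1 hour'
function clampCursor(cursor: number, floorAfterId: number): PageStart {
  // Treat missing or malformed cursors as "start of time".
  const safeCursor = Number.isFinite(cursor) && cursor >= 0 ? Math.floor(cursor) : 0;
  if (safeCursor >= floorAfterId) {
    return { afterId: safeCursor, resyncRequired: false };
  }
  // Cursor predates the window: serve only the recent window and flag a resync.
  return { afterId: floorAfterId, resyncRequired: true };
}
```

With this shape a trimmed or ancient cursor can never widen the page beyond the window, and the client learns from resyncRequired that it has a gap to fill some other way.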

3. The HMAC length check that is actually a timing oracle

Webhooks are signed t=<timestamp>,v1=<hmac>, where the HMAC is over <timestamp>.<rawBody>. Verification has to be a constant-time compare, because a compare that exits on the first mismatched byte leaks how much of a forged signature was right. Here is the recipe we ship in the agent guide:

import { createHmac, timingSafeEqual } from "node:crypto";

function verifyAnoSignature(rawBody, header, secret) {
  // Parse "t=<timestamp>,v1=<hmac>" into { t, v1 }.
  const parts = Object.fromEntries(header.split(",").map((kv) => kv.split("=")));
  if (!parts.t || !parts.v1) return false;
  const expected = createHmac("sha256", secret)
    .update(`${parts.t}.${rawBody}`)
    .digest("hex");
  // timingSafeEqual throws RangeError on unequal-length buffers.
  // That throw is itself a timing signal, so guard the length first.
  if (parts.v1.length !== expected.length) return false;
  const wanted = Buffer.from(expected, "hex");
  const received = Buffer.from(parts.v1, "hex");
  // Hex decoding stops at the first non-hex character, so a string of the
  // right length can still decode short. Check the decoded length too.
  if (received.length !== wanted.length) return false;
  return timingSafeEqual(wanted, received);
}

The trap is the length guard. timingSafeEqual throws on mismatched lengths, and if you let it throw you have built a constant-time compare with a length-dependent exception bolted to the front, which leaks the one thing you were trying to hide. You have to length-check before the compare, and return the same false you would for any other mismatch. Check the decoded buffers as well as the strings: Buffer.from(value, "hex") stops at the first non-hex character, so a signature of the right length with one stray character decodes short and would still throw. It is a few lines and it is the entire reason the function is worth writing.
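For testing a verifier locally it helps to have the sending side as well. The sketch below builds the header in the format described above. signAnoWebhook and isFresh are hypothetical names, not functions from the agent guide, and two things are assumptions rather than documented behavior: that the timestamp is in Unix seconds, and that a receiver would reject signatures older than five minutes.

```typescript
import { createHmac } from "node:crypto";

// Builds "t=<timestamp>,v1=<hmac>", with the HMAC over "<timestamp>.<rawBody>".
// Assumes the timestamp is Unix seconds; the post does not state the unit.
function signAnoWebhook(rawBody: string, secret: string, timestampSec: number): string {
  const mac = createHmac("sha256", secret)
    .update(`${timestampSec}.${rawBody}`)
    .digest("hex");
  return `t=${timestampSec},v1=${mac}`;
}

// Optional replay guard. The 300-second tolerance is an assumed value.
function isFresh(header: string, nowSec: number, toleranceSec = 300): boolean {
  const match = /(?:^|,)t=(\d+)(?:,|$)/.exec(header);
  if (!match) return false;
  return Math.abs(nowSec - Number(match[1])) <= toleranceSec;
}
```

Sign the exact bytes that go on the wire. If the body is re-serialized between signing and sending, the receiver computes a different HMAC and the check fails even with the right secret.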

4. The sender-skip that swallowed channel invites

You should not webhook a coworker about its own action. If a coworker posts a message, do not turn around and tell it about the message it just posted. So we filtered out the event’s actor:

const ACTOR_EVENT_TYPES = new Set(["message", "thread_reply", "dm", "reaction"]);
const senderId =
  ACTOR_EVENT_TYPES.has(event.type) && typeof event.payload.user_id === "string"
    ? event.payload.user_id
    : null;
const targets = senderId ? all.filter((t) => t.coworker_id !== senderId) : all;

The first version did not have that ACTOR_EVENT_TYPES set. It skipped the actor on every event type. Which seems fine until you look at channel_added: for that event, payload.user_id is not the actor, it is the affected user, the coworker that just got added to a channel. So the filter helpfully suppressed the one event a freshly invited agent most needs: “you are now in #support.” Agents were getting added to channels and never finding out.

The fix is the set above: only skip the author for events where the user in the payload is actually the author. The bug is a one-line omission and the lesson is that “skip the sender” is a different rule from “skip the affected user,” and an event payload that reuses one field for both will let you confuse them.
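A bug like this one is cheap to pin down with a regression test once the filter is a function. The sketch below wraps the same logic in a hypothetical selectTargets helper; the ApiEvent and Target shapes and the cwk_a and cwk_b ids are made up for illustration.

```typescript
// Hypothetical wrapper around the filter above, so it can be tested in isolation.
interface ApiEvent { type: string; payload: { user_id?: unknown } }
interface Target { coworker_id: string }

// Event types where payload.user_id is the author of the event.
const ACTOR_EVENT_TYPES = new Set(["message", "thread_reply", "dm", "reaction"]);

function selectTargets(event: ApiEvent, all: Target[]): Target[] {
  const senderId =
    ACTOR_EVENT_TYPES.has(event.type) && typeof event.payload.user_id === "string"
      ? event.payload.user_id
      : null;
  // Skip the author only when the payload's user really is the author.
  return senderId ? all.filter((t) => t.coworker_id !== senderId) : all;
}

const members: Target[] = [{ coworker_id: "cwk_a" }, { coworker_id: "cwk_b" }];

// cwk_a posted the message, so only cwk_b is notified.
const forMessage = selectTargets({ type: "message", payload: { user_id: "cwk_a" } }, members);

// cwk_a was added to the channel, so cwk_a must still be notified.
const forInvite = selectTargets({ type: "channel_added", payload: { user_id: "cwk_a" } }, members);
```

The second call is the regression: with the original skip-everything filter it would have returned only cwk_b.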

What it cost

The shared-op design has one blast radius: a careless change to an op or its schema is a change to three transports and an external contract at once. We pay for the deduplication with discipline, and mostly we are happy to.

The wide surface, reads and sends and tables and a live event stream, is a lot of contract to keep stable. The single-definition rule is what keeps it honest, but wide is still wide, and we feel it on every schema migration.

And the latency story stands. The MCP server’s own overhead is about 10 ms, and we are proud of that number the way you are proud of a tidy closet: it is nice, it is not why the house is worth anything. The thing worth anything is that an agent and a human run the same code against the same workspace with the same rules, and that the events tying them together survive a reconnect, a month offline, a forged signature, and an invite.

If you want to point an agent at Ano, get on the beta. If you want to build the layer underneath, we are hiring.