Periskope AI Agent on WhatsApp
AI Infrastructure · Periskope
I engineered an autonomous AI agent for WhatsApp that uses multi-turn tool calling to handle support tickets and automate customer workflows, sustaining 99.9% uptime under high concurrency.

The Challenge: Moving Beyond Static Chatbots
In the world of WhatsApp business communication, speed is everything. However, most chatbots are either too rigid or too 'hallucination-prone' to handle complex business workflows. When building the AI Agent for Periskope, I didn't just want a bot that could talk; I wanted an agent that could act.

The goal was to build a system that could read a message, decide if it actually required a response, fetch the relevant knowledge base context, and even perform actions like creating Jira tickets or adding internal CRM notes—all without human intervention.
The agent isn't just a wrapper around an LLM; it's a decision engine capable of executing up to 10 sequential tool calls before finalizing a response.
Architecting the Two-Tiered Brain
To balance cost and performance, I implemented a tiered model architecture. Every incoming message first hits our Router. Instead of making a high-cost GPT-4 call for every 'Hello', I used a lightweight Mistral 3B model via OpenRouter to classify the intent.
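The routing decision downstream of the classifier can be sketched as a pure function. The intent labels and field names below are illustrative assumptions, not Periskope's actual schema:

```typescript
// Hypothetical sketch of the router tier's decision step. The lightweight
// classifier (Mistral 3B via OpenRouter) is assumed to return one of these
// labels; the label names are made up for illustration.
type Intent = "noise" | "greeting_only" | "support_query" | "action_request";

interface RouteDecision {
  enqueue: boolean; // push to the BullMQ processing queue?
  reason: string;
}

function route(intent: Intent): RouteDecision {
  switch (intent) {
    case "noise":
    case "greeting_only":
      // Cheap exit: the expensive model is never invoked for these.
      return { enqueue: false, reason: `dropped: ${intent}` };
    case "support_query":
    case "action_request":
      return { enqueue: true, reason: `queued: ${intent}` };
  }
}
```

Keeping this decision as a small pure function made it trivial to unit-test the cheap-exit path independently of any model provider.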
If the message is noise or doesn't need a reply, the process ends there. If it's a valid query, it's pushed into a BullMQ queue with a 3-second debounce. This delay is critical—it allows us to bundle multiple rapid-fire messages from a customer into a single AI session, preventing the agent from responding to every sentence in a multi-message thought.
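The bundling semantics of the 3-second debounce can be illustrated with a small grouping function: any gap longer than the window starts a new AI session payload. This is a simplified in-memory sketch, not the BullMQ implementation itself:

```typescript
// Illustrative sketch of the 3-second debounce: rapid-fire messages are
// merged into one bundle, and a gap longer than the window starts a new one.
interface Msg {
  chatId: string;
  text: string;
  at: number; // unix ms
}

const DEBOUNCE_MS = 3000;

function bundle(messages: Msg[]): string[][] {
  const sorted = [...messages].sort((a, b) => a.at - b.at);
  const bundles: string[][] = [];
  let last = -Infinity;
  for (const m of sorted) {
    // Start a fresh bundle when the silence exceeds the debounce window.
    if (m.at - last > DEBOUNCE_MS) bundles.push([]);
    bundles[bundles.length - 1].push(m.text);
    last = m.at;
  }
  return bundles;
}
```

With this grouping, a customer typing three sentences in quick succession produces one agent session instead of three.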
The Orchestration Loop and Tool Calling
Once a message passes the router, the Processor (processor.ts) takes over. This is the core 'reasoning' loop. I configured the LLM to use function calling, giving it access to 9 built-in tools like `create_ai_ticket`, `fetch_additional_context`, and `send_attachment`.
The processor runs a loop that can iterate up to 10 times. In each turn, the model can either produce a final text response or request a tool execution. For example, it might call `get_chat_details` to see the user's history, then `fetch_additional_context` to check the FAQ, and finally `send_attachment` to provide a PDF manual.
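The shape of that reasoning loop can be sketched as follows. The model interface and tool plumbing are assumptions for illustration; only the 10-turn cap and the tool names mirror the actual system:

```typescript
// Minimal sketch of the processor's reasoning loop. Each turn, the model
// either returns final text or requests a tool; tool results are appended
// to the history and fed back for the next turn.
type ToolCall = { tool: string; args: Record<string, unknown> };
type ModelTurn = { text?: string; toolCall?: ToolCall };
type Model = (history: string[]) => ModelTurn;
type ToolRegistry = Record<string, (args: Record<string, unknown>) => string>;

const MAX_TURNS = 10;

function runAgent(model: Model, tools: ToolRegistry, userMsg: string): string {
  const history = [`user: ${userMsg}`];
  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const step = model(history);
    if (step.toolCall) {
      // Execute the requested tool and record its result for the next turn.
      const result = tools[step.toolCall.tool](step.toolCall.args);
      history.push(`tool:${step.toolCall.tool} -> ${result}`);
      continue;
    }
    return step.text ?? "";
  }
  // Safety valve: the loop never exceeds 10 sequential tool calls.
  return "Escalating to a human agent.";
}
```

The hard cap matters: without it, a model stuck re-requesting the same tool could burn tokens indefinitely.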
Autonomous agents thrive on context. By allowing the agent to 'think' through multiple tool steps, we reduced manual ticket creation by 40% for our early adopters.
State Management with Redis
State management was the biggest hurdle. WhatsApp is asynchronous, so I used a three-key Redis strategy to maintain session integrity. We store the raw conversation history, the AI metadata (token usage, tool counts), and the Periskope-specific metadata separately.
I set a strict 5-minute TTL (Time-To-Live) for sessions. This ensures the agent has a 'hot' memory of the current conversation, but doesn't get bogged down by context from three days ago. When a cache miss occurs, the system automatically reconstructs the session by merging the last 50 messages from the primary database.
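The key layout and the cache-miss rebuild can be sketched like this. The key naming scheme is a hypothetical example (the real keys may differ); the 5-minute TTL and the 50-message rebuild window come from the design above:

```typescript
// Sketch of the three-key Redis session layout. In production these would
// be SET ... EX 300 calls; the key pattern here is illustrative only.
const SESSION_TTL_S = 300; // 5-minute TTL for 'hot' memory
const REBUILD_WINDOW = 50; // messages merged from the primary DB on a miss

function sessionKeys(chatId: string) {
  return {
    history: `ai:session:${chatId}:history`, // raw conversation history
    aiMeta: `ai:session:${chatId}:ai_meta`, // token usage, tool counts
    appMeta: `ai:session:${chatId}:app_meta`, // Periskope-specific metadata
  };
}

// On a cache miss, reconstruct the session from the last N DB messages.
function rebuildHistory(dbMessages: string[]): string[] {
  return dbMessages.slice(-REBUILD_WINDOW);
}
```

Splitting the three keys means the heavyweight history blob can expire or be rebuilt without clobbering the cheap metadata counters.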
Solving for Race Conditions and Human Overrides
A major engineering challenge was handling 'The Human Factor.' If a customer support agent manually replies while the AI is thinking, the AI must immediately yield. I implemented a 'Snooze' mechanism: the Processor checks for new outgoing messages from human IDs before every tool iteration.
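The yield check itself reduces to a small predicate run before each iteration. The sender-ID convention below is an assumption for illustration:

```typescript
// Sketch of the 'Snooze' check: before every tool iteration, scan outgoing
// messages sent since the session started; any message from a non-AI sender
// means a human has taken over and the agent must yield.
interface OutgoingMsg {
  senderId: string;
  sentAt: number; // unix ms
}

const AI_SENDER_ID = "ai-agent"; // illustrative ID, not the real one

function shouldYield(outgoing: OutgoingMsg[], sessionStartedAt: number): boolean {
  return outgoing.some(
    (m) => m.senderId !== AI_SENDER_ID && m.sentAt > sessionStartedAt
  );
}
```

Because the check runs between tool calls rather than only at the start, a human reply landing mid-loop still aborts the response before it is sent.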

To handle scale, we deployed 3 parallel AI servers. Using RabbitMQ consistent hashing, I ensured that messages from the same conversation always land on the same server. This prevents race conditions where two different servers might try to reply to the same customer simultaneously.
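The pinning property can be demonstrated with a simple stable hash mapping a conversation ID to one of the three servers. RabbitMQ's consistent-hash exchange does this internally on the routing key; the FNV-1a stand-in below just illustrates the invariant:

```typescript
// Illustrative stand-in for RabbitMQ consistent hashing: a stable hash of
// the conversation ID always selects the same server, so no two servers
// ever process the same chat concurrently. (FNV-1a chosen for brevity.)
const SERVERS = 3;

function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function serverFor(chatId: string): number {
  return fnv1a(chatId) % SERVERS;
}
```

The determinism is the whole point: routing is a function of the chat ID alone, so no coordination between servers is needed to avoid duplicate replies.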
Performance and Scale
Scalability was baked in from day one. By offloading the heavy lifting to BullMQ workers, each server can handle 150 concurrent agent sessions. With our current 3-node setup, we comfortably process 450 simultaneous multi-turn conversations without latency spikes.
The separation of the Router and Processor also allowed us to scale each tier independently; we run the lightweight Mistral router on cheaper spot instances while reserving high-memory instances for the GPT-based processing logic.
Reflections and Future Improvements
Looking back, the strict separation of concerns between the router and processor was the best decision we made. It allowed us to swap models and tweak debouncing logic without touching the core tool-calling engine.
In the next iteration, I plan to implement 'Streaming Tool Use' to reduce the perceived latency for the end-user. Currently, the user waits for the entire loop to finish before seeing a message; by streaming the 'thinking' process as a 'typing...' indicator, we can make the interaction feel much more natural.
The architecture successfully handled over 100k messages in its first month with zero recorded race conditions.