2025

RAG Pipeline for Periskope AI

AI Infrastructure · Periskope

I engineered a high-precision RAG system for Periskope, grounding WhatsApp AI agents in verified business data using Gemini embeddings, pgvector, and automated document ingestion pipelines.

RAG
Embeddings
Gemini
Vector Search
Knowledge Base
TypeScript
PostgreSQL
Platform: Web · API · AI Infrastructure
Client: Periskope
My Role
AI Engineer
Full-Stack Engineer

The Challenge: Grounding AI in Business Truth

In the context of WhatsApp business communication, 'hallucinations' aren't just technical glitches—they are customer service liabilities. When building the AI agent for Periskope, the goal was to ensure that every response was backed by verified facts, whether that was a specific pricing tier, a return policy, or a technical FAQ.

To solve this, I designed a Retrieval-Augmented Generation (RAG) system. Instead of relying on the LLM's internal weights, the system proactively searches a dedicated Knowledge Base to find relevant context before the agent even begins to draft a response. This creates a 'closed-book' environment where the AI is only as smart as the documents the business provides.

By grounding the model in verified Q&A pairs, we reduced customer-reported hallucination rates by over 70% during the initial beta phase.
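The retrieve-then-generate flow reduces to a prompt-assembly step: retrieved snippets are packed into a context window with instructions that forbid answering outside it. A minimal sketch, where `buildGroundedPrompt` and the instruction wording are illustrative rather than the production prompt:

```typescript
// Hypothetical shape of a retrieved knowledge-base snippet.
interface Snippet {
  question: string;
  answer: string;
  source?: string; // document path, used for citations in the chat UI
}

// Assemble a 'closed-book' prompt: the model may only use the
// verified context block, never its internal weights.
function buildGroundedPrompt(userQuery: string, snippets: Snippet[]): string {
  const context = snippets
    .map((s, i) => `[${i + 1}] Q: ${s.question}\nA: ${s.answer}`)
    .join('\n\n');

  return [
    'Answer using ONLY the verified context below.',
    'If the context does not contain the answer, say you do not know.',
    '',
    '--- CONTEXT ---',
    context,
    '--- END CONTEXT ---',
    '',
    `Customer: ${userQuery}`,
  ].join('\n');
}
```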

Multi-Layered Embedding Pipeline

For the vector engine, I used the `gemini-embedding-001` model, configured to generate 768-dimensional vectors. However, raw text embedding is often insufficient for nuanced business queries, so I implemented a preprocessing pipeline that tokenizes inputs and normalizes them with Porter stemming.

One key architectural decision was to store three distinct vectors for every FAQ entry: one for the question, one for the answer, and one for the combined text. This 'triple-index' approach ensures that a user's query—which might look like a question or a statement—finds a match regardless of its phrasing.

These vectors are stored in PostgreSQL using `pgvector`, allowing us to leverage existing relational data alongside high-dimensional similarity searches.
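On the storage side, the triple index lives in a single row. A hypothetical DDL sketch (table, column, and index names are assumptions, not the actual migration):

```typescript
// Illustrative pgvector schema for the triple-index FAQ table.
const createContextsTable = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE tbl_ai_contexts (
  id               bigserial PRIMARY KEY,
  org_id           text NOT NULL,
  question_vector  vector(768),
  answer_vector    vector(768),
  combined_vector  vector(768),
  content          jsonb NOT NULL
);

-- HNSW index on the combined vector for cosine-distance search;
-- relational columns like org_id remain queryable alongside it.
CREATE INDEX idx_contexts_combined
  ON tbl_ai_contexts USING hnsw (combined_vector vector_cosine_ops);
`;
```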

embedding-pipeline.ts
async function ingestFAQ(question: string, answer: string, orgId: string) {
  const combined = `${question} ${answer}`;

  // Embed all three variants in parallel. Stored entries use the
  // retrieval-document task type; incoming user queries are embedded
  // separately with a query-side task type at search time.
  const [qVect, aVect, cVect] = await Promise.all([
    embedder.get(question, 'RETRIEVAL_DOCUMENT'),
    embedder.get(answer, 'RETRIEVAL_DOCUMENT'),
    embedder.get(combined, 'RETRIEVAL_DOCUMENT')
  ]);

  // One row, three vectors: the 'triple index' for this FAQ entry.
  return db.tbl_ai_contexts.insert({
    org_id: orgId,
    question_vector: qVect,
    answer_vector: aVect,
    combined_vector: cVect,
    content: { question, answer }
  });
}

High-Precision Retrieval and Query Expansion

Semantic search can sometimes be too 'fuzzy.' To combat this, I implemented query expansion. Before the vector search, the system identifies key terms and expands them with synonyms—mapping 'price' to 'cost', 'fee', or 'rate'.

The retrieval process uses cosine similarity with a strict 0.5 confidence threshold. If no vector matches are strong enough, the system doesn't guess; it falls back to a robust keyword-based full-text search. This hybrid approach ensures reliability even when the embedding model might struggle with niche industry jargon.
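A minimal sketch of the expansion and thresholding logic; the synonym map here is a tiny illustrative subset, and the helper names are assumptions:

```typescript
// Illustrative synonym map; the production list is much larger.
const SYNONYMS: Record<string, string[]> = {
  price: ['cost', 'fee', 'rate'],
  refund: ['return', 'reimbursement'],
};

// Expand a raw query into its terms plus known synonyms.
function expandQuery(query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const expanded = new Set<string>(terms);
  for (const t of terms) {
    for (const syn of SYNONYMS[t] ?? []) expanded.add(syn);
  }
  return [...expanded];
}

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Keep only matches above the 0.5 confidence threshold; the caller
// falls back to keyword full-text search when this returns empty.
function filterByThreshold<T extends { similarity: number }>(
  matches: T[], threshold = 0.5,
): T[] {
  return matches.filter(m => m.similarity >= threshold);
}
```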

Vector dimensions: 768
Similarity threshold: 0.5 cosine
Top-K results: 3–20 matches

Processing Unstructured Documents

Businesses often have existing documentation in PDF or Word formats. I built an ingestion engine that uses `pdf-parse` to extract raw text, but the real challenge was noise reduction. I wrote custom logic to strip page headers, footers, and recurring 'Page X of Y' markers that would otherwise pollute the vector space.

Documents are chunked into logical sections before embedding. Each chunk maintains metadata pointing back to its source bucket and path, allowing the AI to cite its sources with specific document references in the chat UI.

Raw PDF text is notoriously messy. Cleaning page noise isn't just about storage—it's about ensuring the vector reflects the actual content, not the document's formatting.
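A simplified version of the noise-reduction pass might look like the following; the exact heuristics (the repeat count, the line-length cutoff) are illustrative assumptions:

```typescript
// Strip page markers and recurring headers/footers from extracted PDF text.
function stripPageNoise(rawText: string): string {
  const lines = rawText.split('\n').map(l => l.trim());

  // Drop 'Page X of Y' markers and bare page numbers.
  const withoutMarkers = lines.filter(
    l => !/^page\s+\d+(\s+of\s+\d+)?$/i.test(l) && !/^\d{1,4}$/.test(l),
  );

  // Count short lines; ones repeating across pages are likely
  // headers or footers, not content.
  const counts = new Map<string, number>();
  for (const l of withoutMarkers) {
    if (l.length > 0 && l.length < 60) counts.set(l, (counts.get(l) ?? 0) + 1);
  }
  const cleaned = withoutMarkers.filter(
    l => l.length === 0 || (counts.get(l) ?? 0) < 3,
  );

  return cleaned.join('\n').replace(/\n{3,}/g, '\n\n').trim();
}
```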

Managing Long-Form Session Context

WhatsApp conversations can span weeks. Storing every piece of retrieved context would quickly exceed LLM token limits and increase latency. I implemented a 'relevant_context' map within the session metadata.

This map is capped at 20 entries and is aggressively trimmed if the metadata size exceeds 40MB. To handle complex queries that might need context from earlier in the thread, I exposed a `fetch_additional_context` tool to the agent, allowing it to dynamically pull more information if it detects its current knowledge is insufficient.
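The trimming policy can be sketched as a pure function; the shape of `ContextMap` and the `lastUsed` recency field are assumptions, and string length stands in for serialized byte size:

```typescript
// Hedged sketch of the relevant_context trimming policy.
type ContextMap = Record<string, { text: string; lastUsed: number }>;

const MAX_ENTRIES = 20;
const MAX_BYTES = 40 * 1024 * 1024; // 40MB metadata ceiling

function trimContextMap(map: ContextMap): ContextMap {
  // Most recently used entries survive first.
  let entries = Object.entries(map)
    .sort(([, a], [, b]) => b.lastUsed - a.lastUsed)
    .slice(0, MAX_ENTRIES);

  // Drop the stalest entries until the serialized size fits the cap.
  // (String length approximates bytes for mostly-ASCII payloads.)
  while (
    entries.length > 1 &&
    JSON.stringify(Object.fromEntries(entries)).length > MAX_BYTES
  ) {
    entries = entries.slice(0, -1);
  }
  return Object.fromEntries(entries);
}
```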

agent-tools.ts
const contextTool = {
  name: 'fetch_additional_context',
  description: 'Search the knowledge base for specific business info',
  execute: async (query: string) => {
    const results = await fetchRelevantContext(query);
    return results.map(r => r.answer_text).join('\n---\n');
  }
};

The Self-Learning Feedback Loop

Knowledge bases shouldn't be static. I introduced a 'self-learned' category: when a human agent successfully resolves a query that the AI couldn't, that Q&A pair is captured and flagged for approval. Once a business admin reviews it, the pair is embedded and added to the knowledge base, allowing the AI to learn from human expertise in near real time.
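The approval flow reduces to a small state machine; the type and function names below are hypothetical, not the actual implementation:

```typescript
// Illustrative states for a 'self-learned' Q&A pair.
type LearnedStatus = 'pending_approval' | 'active' | 'rejected';

interface LearnedPair {
  question: string;
  answer: string;
  category: 'self-learned';
  status: LearnedStatus;
}

// Capture a human-resolved Q&A the AI could not answer itself.
function capturePair(question: string, answer: string): LearnedPair {
  return { question, answer, category: 'self-learned', status: 'pending_approval' };
}

// Admin review gates entry into the knowledge base; embedding happens
// only after approval, so unreviewed text never pollutes retrieval.
function review(pair: LearnedPair, approved: boolean): LearnedPair {
  return { ...pair, status: approved ? 'active' : 'rejected' };
}
```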

Lessons Learned and Future Directions

The biggest takeaway from this build was that RAG is 20% model selection and 80% data engineering. If I were to rebuild this today, I would implement a cross-encoder re-ranking step. While cosine similarity is fast, a second pass with a re-ranking model could further refine the top 5 results to ensure the most relevant chunk is always at the top of the prompt.

I also plan to explore hybrid search (combining BM25 and vector search) directly within PostgreSQL to further improve the retrieval of specific product SKU numbers and codes where semantic meaning is less important than character matching.
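One common way to merge BM25 and vector rankings is reciprocal rank fusion (RRF), which combines rank positions rather than raw scores and so sidesteps score-scale tuning. This is a generic sketch of the technique, not code currently in the pipeline; `k = 60` is the conventional default:

```typescript
// Reciprocal rank fusion: each ranking contributes 1 / (k + rank + 1)
// to a document's fused score; documents near the top of several
// rankings accumulate the highest totals.
function reciprocalRankFusion(
  rankings: string[][], k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```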