I built this RAG Chat Assistant as a personal SaaS prototype because clients kept asking for 'chat with your documents' features and I wanted to understand the architecture deeply before implementing it professionally.
The pipeline works in two phases: indexing and querying. During indexing, uploaded documents (PDF or plain text) are split into overlapping chunks of ~512 tokens with a 50-token overlap to preserve context across boundaries. Each chunk is run through Hugging Face's all-MiniLM-L6-v2 sentence-transformer model to generate a 384-dimensional embedding vector. These vectors are stored in Supabase using the pgvector PostgreSQL extension alongside the original chunk text and metadata (document ID, page number, chunk index).
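Here's a condensed sketch of the indexing phase. It assumes the @xenova/transformers port of all-MiniLM-L6-v2 and a hypothetical `chunks` table; the word-based splitter is only an approximation of real token counting, which would use the model's tokenizer.

```ts
import { pipeline } from "@xenova/transformers";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// Split text into ~512-token chunks with a 50-token overlap.
// (Words stand in for tokens here; real code would use the model tokenizer.)
function chunkText(text: string, size = 512, overlap = 50): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
  }
  return chunks;
}

export async function indexDocument(docId: string, text: string) {
  // all-MiniLM-L6-v2 yields a 384-dimensional embedding per chunk.
  const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  for (const [i, chunk] of chunkText(text).entries()) {
    const output = await embed(chunk, { pooling: "mean", normalize: true });
    await supabase.from("chunks").insert({
      document_id: docId,
      chunk_index: i,
      content: chunk,
      embedding: Array.from(output.data as Float32Array), // pgvector(384) column
    });
  }
}
```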
During querying, the user's question is embedded with the same model, then a cosine similarity search retrieves the top-5 most relevant chunks from the vector store. These chunks are injected into a carefully engineered prompt that instructs GPT-4 to answer based only on the provided context and to cite the document sections it draws from. Responses stream back to the UI via the OpenAI streaming API.
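The query path, again as a sketch: `match_chunks` stands in for a Postgres function of the kind the Supabase pgvector docs describe (ordering rows by cosine distance and returning the top matches). The function name, column names, and prompt wording are illustrative, not the exact ones in this project.

```ts
import OpenAI from "openai";
import { pipeline } from "@xenova/transformers";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function answerQuestion(question: string) {
  // Embed the question with the same model used at indexing time.
  const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  const q = await embed(question, { pooling: "mean", normalize: true });

  // Cosine-similarity search via an assumed `match_chunks` Postgres function.
  const { data: matches } = await supabase.rpc("match_chunks", {
    query_embedding: Array.from(q.data as Float32Array),
    match_count: 5,
  });

  // Label each chunk so the model can cite it.
  const context = (matches ?? [])
    .map((m: any) => `[${m.document_id} p.${m.page_number}] ${m.content}`)
    .join("\n\n");

  // Stream the grounded answer token by token.
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    stream: true,
    messages: [
      {
        role: "system",
        content: "Answer only from the provided context and cite the bracketed section labels.",
      },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  });
  for await (const part of stream) {
    // Print each delta as it arrives; in the app this is forwarded to the client.
    process.stdout.write(part.choices[0]?.delta?.content ?? "");
  }
}
```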
The main technical challenges I solved: (1) choosing the right chunk size, since chunks that are too small lose context while chunks that are too large waste the context window; (2) handling PDFs with tables and images gracefully by falling back to OCR; (3) building a streaming chat UI that correctly handles partial JSON from the streaming endpoint (sketched below).
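Challenge (3) comes down to the fact that fetch delivers the response body in arbitrary byte chunks, so server-sent events (and the JSON inside them) can arrive split mid-line. The fix is to buffer until a newline boundary before parsing. A sketch of the client side, with a hypothetical `/api/chat` route and an OpenAI-style event shape:

```ts
// Client-side consumer for the streaming endpoint. The `/api/chat` route and
// event shape are assumptions; the buffering pattern is the point.
export async function streamAnswer(question: string, onToken: (t: string) => void) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ question }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Parse only complete lines; keep any trailing partial line buffered.
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice("data: ".length);
      if (payload === "[DONE]") continue;
      onToken(JSON.parse(payload).choices?.[0]?.delta?.content ?? "");
    }
  }
}
```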
Responsibilities
Designed and implemented the full RAG pipeline: chunking → embedding → vector storage → retrieval → generation
Integrated Hugging Face sentence-transformer models for document embedding
Set up Supabase with pgvector extension for semantic similarity search
Built streaming chat UI with Next.js and OpenAI streaming API
Engineered prompts for grounded, citation-aware responses (prompt shape sketched after this list)
Solved PDF parsing edge cases including tables and image-heavy documents
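For the prompt-engineering bullet above, the grounding prompt looks roughly like this; the wording is illustrative rather than the production text.

```ts
// Illustrative grounding prompt. The exact wording is an assumption; the
// essentials are: answer only from context, cite chunk labels, admit gaps.
function buildPrompt(context: string, question: string): string {
  return [
    "You are a document assistant. Answer using ONLY the context below.",
    "Cite the bracketed section labels (e.g. [doc-42 p.3]) for every claim.",
    "If the context does not contain the answer, say so instead of guessing.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```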