How I discovered that the hardest part of building an AI chatbot isn't the AI. It's turning 900+ messy Instagram videos into clean, structured data that the AI can actually use.
We generate more content today than at any point in human history. But generating content is only half the battle. Finding it when you actually need it is a completely different problem.
This became painfully obvious to me every time I tried to figure out what to eat for dinner in my hometown of Samarinda.
My brother is a food vlogger. Over the years, he has reviewed hundreds of local restaurants and shared them on Instagram. His videos contain incredible details: price ranges, atmosphere, specific dish recommendations, operational hours, and honest opinions about what's worth ordering.
But Instagram is built for a timeline, not an archive.
If I suddenly craved a specific type of soto late at night, finding his recommendation meant manually scrolling through months of old posts, watching videos one by one, and hoping I stumbled across the right one. And it wasn't just me: his followers regularly sent DMs asking, "Where was that bakso place you reviewed last month?"
The information existed. It was detailed, valuable, and created by someone who genuinely cared about food. But it was trapped inside a format that was impossible to search.
I decided to fix this.
The obvious approach would be to build a chatbot that answers food questions using an LLM. But I quickly realized that a pure LLM chatbot would hallucinate restaurant names, invent prices, and recommend places that don't exist.
If someone asked "Recommend 3 affordable bakso places open right now," a pure GPT-4 response would sound confident but be completely fabricated. The model has no ground truth about restaurants in Samarinda.
I needed a Retrieval-Augmented Generation (RAG) approach: give the LLM real data to work with, so its answers are grounded in facts, not imagination.
But that meant I first needed the data. And getting that data was the hardest part of the entire project.
Most people think building an AI application means writing prompts and calling APIs. For this project, 80% of the engineering effort went into the data pipeline: a four-step ETL process that turned 900+ messy Instagram posts into a clean, structured knowledge base.
The first challenge was data acquisition. Processing hundreds of videos manually is impossible, and building reliable extraction pipelines for social media requires navigating complex platform restrictions.
I engineered a systematic collection process to carefully gather over 900 of his Instagram videos over time. This approach allowed me to build a comprehensive raw dataset without compromising data integrity.
Once collected, each video was sent to the Azure Speech API with Indonesian language support for transcription. Out of the 900+ videos, around 700 were successfully transcribed. The remaining videos were music-only clips, ambient noise, or audio too noisy for reliable transcription.
Raw transcriptions from speech-to-text models are messy. They're full of typos, broken sentences, and filler content.
I used GPT-4o-mini to clean each transcript: fixing typos, restructuring broken sentences, and removing filler content while preserving the original meaning. This step was critical because bad transcriptions would poison every downstream process.
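The cleaned transcripts then needed to be parsed into structured data. I used GPT-4o-mini to extract the fields the rest of the system relies on, including food categories, locations, menu items, prices, facilities, and operating hours.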
The standardization of food categories was surprisingly difficult. My brother might say "mi bakso" in one video and "bakso campur" in another, both referring to similar foods but using different terms. I built a taxonomy of 227 standardized categories so the search system could understand that queries for "bakso" should also match "mi bakso", "bakso urat", and "bakso beranak."
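A minimal sketch of how a taxonomy lookup like this might work, assuming a hand-built table that maps each spoken variant to one canonical category. The names `CATEGORY_ALIASES`, `normalize_category`, and `matches_query` are hypothetical, and the handful of entries only stands in for the full 227-category taxonomy.

```python
from typing import Optional

# Hypothetical alias table: every spoken variant maps to one canonical category.
CATEGORY_ALIASES = {
    "bakso": "bakso",
    "mi bakso": "bakso",
    "bakso campur": "bakso",
    "bakso urat": "bakso",
    "bakso beranak": "bakso",
    "soto": "soto",
    "mie ayam": "mie ayam",
}


def normalize_category(raw: str) -> Optional[str]:
    """Return the canonical category for a raw label, or None if it is unknown."""
    key = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    return CATEGORY_ALIASES.get(key)


def matches_query(query_category: str, item_label: str) -> bool:
    """True when the query category and the item label share a canonical category."""
    canonical = normalize_category(query_category)
    return canonical is not None and canonical == normalize_category(item_label)
```

With a table like this, a query for "bakso" matches an item labelled "bakso urat", while an unknown label simply fails to match instead of raising an error.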
Extracted locations were enriched with the Google Places API to add precise Google Maps links, coordinates, and verified addresses. This was the final piece: transforming text-based location descriptions into one-tap navigation links.
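As an illustration of the last part of that step, here is a small sketch of turning enriched coordinates into a navigation link. It assumes the coordinates and an optional place ID have already been obtained from the enrichment step, and it uses the public Google Maps URLs search format. The function name `maps_link` is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def maps_link(lat: float, lng: float, place_id: Optional[str] = None) -> str:
    """Build a Google Maps search URL for a coordinate pair."""
    params = {"api": "1", "query": f"{lat},{lng}"}
    if place_id:
        # An optional place ID lets Maps resolve the exact listing.
        params["query_place_id"] = place_id
    return "https://www.google.com/maps/search/?" + urlencode(params)
```

The link opens the location directly in Google Maps, which is what makes one-tap navigation possible from a restaurant card.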
I built a run_pipeline.py orchestrator that could run all four steps sequentially, or any individual step independently:
python run_pipeline.py              # Run all steps
python run_pipeline.py --step 3     # Run only extraction
python run_pipeline.py --skip 1,2   # Skip transcription and cleaning
Each step checked for required credentials (Azure Speech Key, OpenAI API Key, Google Maps API Key) before executing, and the pipeline could be resumed from any failure point.
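The step selection and credential check could look roughly like the sketch below. It only assumes the `--step` and `--skip` flags shown above. The step names, the environment variable names, and the helper functions are assumptions for illustration, not the actual contents of run_pipeline.py.

```python
import argparse
import os

# Step numbers follow the order described in the text.
STEPS = {1: "transcribe", 2: "clean", 3: "extract", 4: "enrich"}

# Assumed environment variable names; the text only names the credentials.
REQUIRED_ENV = {
    1: ["AZURE_SPEECH_KEY"],
    2: ["OPENAI_API_KEY"],
    3: ["OPENAI_API_KEY"],
    4: ["GOOGLE_MAPS_API_KEY"],
}


def parse_skip(value: str) -> set:
    """Turn a string like '1,2' into the set {1, 2}."""
    return {int(part) for part in value.split(",") if part.strip()}


def select_steps(argv: list) -> list:
    """Return the ordered list of step numbers to run for the given CLI arguments."""
    parser = argparse.ArgumentParser(description="Run the ETL pipeline")
    parser.add_argument("--step", type=int, choices=sorted(STEPS), help="run only this step")
    parser.add_argument("--skip", type=parse_skip, default=set(), help="comma-separated steps to skip")
    args = parser.parse_args(argv)
    if args.step is not None:
        return [args.step]
    return [s for s in sorted(STEPS) if s not in args.skip]


def missing_credentials(steps: list, env=None) -> list:
    """List the credentials that the selected steps need but the environment lacks."""
    env = os.environ if env is None else env
    needed = {name for s in steps for name in REQUIRED_ENV[s]}
    return sorted(name for name in needed if not env.get(name))
```

Checking credentials only for the selected steps means a run that skips transcription does not fail just because the speech key is missing.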
With clean, structured data in hand, I built the retrieval-augmented generation service. But I quickly learned that naive RAG (just embedding documents and doing a single vector search) produces mediocre results. Real-world RAG requires multiple layers of intelligence.
The first version of my retrieval system used a simple approach: embed the user's query, search Qdrant for the top 10 similar documents, and pass them to the LLM. It worked... sometimes.
The problem was category specificity. When someone asked for "bakso places", a pure semantic search would return bakso restaurants, but also mie ayam places, soto places, and random food stalls that happened to mention bakso once. The results were relevant-ish but not precise.
I redesigned the retrieval into a multi-stage fallback system:
Category pre-filter: If the query mentions a known food category (mapped from 25+ keyword patterns), the Qdrant search is pre-filtered by kategori_makanan metadata. This dramatically improves precision for specific food queries.
Score threshold: Results below a minimum relevance score of 0.25 are filtered out. This threshold was calibrated by analyzing real query-result score distributions: bakso queries typically scored 0.55-0.62, while irrelevant results scored 0.05-0.15.
Post-filter fallback: If pre-filtering returns fewer than 3 results (because the category is sparse), the system runs an unfiltered search with 2x candidates and applies category matching as a post-filter. This catches edge cases where the Qdrant metadata doesn't exactly match the filter condition.
Scoreless fallback: If absolutely no results pass the score threshold, the system takes the top 10 results without score filtering, ensuring the user always gets something.
This cascading approach guarantees that every query gets a response, while returning the most relevant results available at each stage.
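The cascade above can be sketched in plain Python. This is an approximation under stated assumptions: `search` is a stand-in callable for the vector store query (not the Qdrant client API), `Hit` is a hypothetical result type, and the exact order of the fallbacks in the real service may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hit:
    name: str
    category: str
    score: float


# Stand-in for the vector store: (query, limit, category filter) -> hits sorted by score.
SearchFn = Callable[[str, int, Optional[str]], List[Hit]]

MIN_SCORE = 0.25   # relevance threshold quoted in the text
MIN_RESULTS = 3    # below this, the pre-filtered search counts as too sparse
TOP_K = 10


def retrieve(query: str, category: Optional[str], search: SearchFn) -> List[Hit]:
    # Stage 1: category pre-filter plus score threshold.
    if category is not None:
        hits = [h for h in search(query, TOP_K, category) if h.score >= MIN_SCORE]
        if len(hits) >= MIN_RESULTS:
            return hits
    # Stage 2: unfiltered search with 2x candidates, category applied as a post-filter.
    wide = search(query, 2 * TOP_K, None)
    passing = [h for h in wide if h.score >= MIN_SCORE]
    if category is not None:
        post = [h for h in passing if category in h.category]
        if post:
            return post[:TOP_K]
    if passing:
        return passing[:TOP_K]
    # Stage 3: scoreless fallback, so the user always gets something.
    return wide[:TOP_K]
```

The substring match in stage 2 is what lets a "bakso" query still pick up items stored as "mi bakso" or "bakso urat" when the exact-match pre-filter comes back sparse.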
Not all Instagram videos are restaurant reviews. Some are music clips, religious greetings, random vlogs, or audio recordings with no food content. Despite the ETL pipeline, some of these slipped through into the vector database.
I built a _is_gibberish_ringkasan() heuristic that filters out non-food content at retrieval time:
Among other signals, it checks the is_culinary_content flag set during the ETL process.
This was a lesson in data quality: no matter how good your pipeline is, some garbage will always leak through. You need defense-in-depth at every layer.
A restaurant recommendation at 7 AM is completely different from one at 10 PM. The system automatically detects time context from the user's query and the current time.
Even more interesting: the system supports future time references. When someone says "tempat sarapan besok pagi" (breakfast places tomorrow morning), the parse_future_time() function extracts the target datetime and evaluates restaurant operational hours against that future time, not the current time.
Restaurants that are open at the target time are sorted to the top of the results.
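A simplified sketch of this time-aware ranking follows. It is not the real parse_future_time(): the function `parse_target_time`, the keyword list, and the hour assigned to each keyword (for example 07:00 for "pagi") are assumptions chosen for illustration, and opening hours are assumed to be stored as a simple open/close pair.

```python
from datetime import datetime, time, timedelta

# Assumed representative hours for Indonesian time-of-day words.
TIME_WORDS = (("pagi", 7), ("siang", 12), ("sore", 16), ("malam", 19))


def parse_target_time(query: str, now: datetime) -> datetime:
    """Tiny stand-in for parse_future_time(): handles 'besok' and a few time-of-day words."""
    q = query.lower()
    target = now
    if "besok" in q:  # "tomorrow"
        target = target + timedelta(days=1)
    for word, hour in TIME_WORDS:
        if word in q:
            target = target.replace(hour=hour, minute=0, second=0, microsecond=0)
            break
    return target


def is_open_at(opens: time, closes: time, moment: datetime) -> bool:
    """Check opening hours, including ranges that wrap past midnight."""
    t = moment.time()
    if opens <= closes:
        return opens <= t < closes
    return t >= opens or t < closes


def sort_open_first(places: list, moment: datetime) -> list:
    """Stable sort: places open at `moment` come first, original order otherwise kept."""
    return sorted(places, key=lambda p: not is_open_at(p["opens"], p["closes"], moment))
```

Because the sort is stable, relevance order from retrieval is preserved inside the "open" and "closed" groups.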
Without streaming, the chatbot would go silent for 3-5 seconds while the LLM generated a complete response. That silence feels like the system is broken.
I implemented Server-Sent Events (SSE) streaming through FastAPI. The generate_response_stream() method yields tokens as they're generated by the LLM.
The response also includes structured RestaurantCard objects. Each restaurant card includes actionable data: the original Instagram link (for watching the actual review), a Google Maps link (for one-tap navigation), menu items, prices, facilities, and real-time open/closed status.
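To make the wire format concrete, here is a framework-free sketch of an SSE generator. The event names (`token`, `cards`, `done`), the payload shapes, and the function names are assumptions, not the actual output of generate_response_stream().

```python
import json
from typing import Iterable, Iterator, List


def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event name, one JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data, ensure_ascii=False)}\n\n"


def stream_response(tokens: Iterable[str], cards: List[dict]) -> Iterator[str]:
    """Yield text tokens as they arrive, then the restaurant cards, then an end marker."""
    for token in tokens:
        yield sse_event("token", {"text": token})
    yield sse_event("cards", {"restaurants": cards})
    yield sse_event("done", {})
```

In FastAPI, a generator like this can be returned through `StreamingResponse` with `media_type="text/event-stream"`, so the browser starts rendering text while the model is still generating.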
One subtle but critical bug I encountered: the LLM might mention restaurants in its text response that didn't appear in the restaurant cards, or vice versa.
The root cause was that the LLM received one pool of candidates (from the system prompt context), while the cards were generated from a differently filtered pool.
The fix was simple but non-obvious: use the EXACT same candidate pool for both the LLM context and the card generation. I take annotated[:requested_count] once, build the LLM prompt from this pool, and generate cards from this exact same list. The code comments explicitly mark this as CRITICAL FIX to prevent future regression.
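The idea can be shown in a few lines. This is a sketch assuming each candidate is a dict with `name` and `category` keys; the function name `build_answer_inputs` and the card fields are hypothetical.

```python
def build_answer_inputs(annotated: list, requested_count: int):
    """Slice the candidate pool once and derive both the LLM context and the cards from it."""
    pool = annotated[:requested_count]  # single source of truth for this response

    # Context lines that go into the LLM prompt.
    context = "\n".join(
        f"{i}. {r['name']} ({r['category']})" for i, r in enumerate(pool, 1)
    )
    # Cards shown in the UI, built from the very same list.
    cards = [{"name": r["name"], "maps_url": r.get("maps_url")} for r in pool]
    return context, cards
```

Since both outputs come from one slice, the LLM cannot be shown a restaurant that has no card, and no card can appear for a restaurant the LLM never saw.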
A single-turn chatbot is easy. Multi-turn conversations, where context carries over between questions, are hard.
When a user says "yang lebih murah" (something cheaper) after getting bakso recommendations, the RAG system needs to understand that "cheaper" refers to bakso places specifically.
I implemented query compression: the _compress_query_with_history() method sends the last 4 conversation exchanges to GPT-4o-mini and asks it to produce a standalone search query. The compressed query is then used for vector search instead of the raw user input.
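For example, if the user first asks for bakso recommendations and then says "yang lebih murah", the compressed query would be something like "bakso yang lebih murah" (cheaper bakso).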
This ensures the vector search always retrieves contextually relevant results, even when the user's input is ambiguous without conversation history.
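A rough sketch of query compression follows, with the LLM call abstracted behind a `complete` callable so the example runs without any API key. The prompt wording and the function names are assumptions. Only the window of 4 exchanges comes from the description above.

```python
from typing import Callable, List, Tuple

MAX_EXCHANGES = 4  # number of past exchanges sent to the model, as described in the text


def build_compression_prompt(history: List[Tuple[str, str]], user_input: str) -> str:
    """Build a prompt asking the model for a standalone search query."""
    lines = ["Rewrite the last user message as a standalone search query.", "Conversation:"]
    for user_msg, bot_msg in history[-MAX_EXCHANGES:]:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {bot_msg}")
    lines.append(f"User: {user_input}")
    lines.append("Standalone query:")
    return "\n".join(lines)


def compress_query(
    history: List[Tuple[str, str]],
    user_input: str,
    complete: Callable[[str], str],
) -> str:
    """Return a standalone query, falling back to the raw input when there is nothing to add."""
    if not history:
        return user_input  # first turn: nothing to compress
    result = complete(build_compression_prompt(history, user_input)).strip()
    return result or user_input  # guard against an empty model reply
```

Passing the model call in as a parameter also makes the compression step easy to test with a fake function.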
I thought innovation meant creating something entirely new: new models, new algorithms, new data. I thought the AI model was the most important part of an AI application. I thought data engineering was a boring prerequisite.
This project completely changed my perspective.
Innovation through accessibility. I didn't create any new information. Every restaurant review already existed on Instagram. The innovation was making that information findable, transforming an unsearchable timeline into a conversational search engine. Sometimes the most valuable thing you can do is take existing, trapped information and simply make it easier to access.
The data pipeline is the product. The chatbot UI took days. The RAG service took a week. The data pipeline (downloading videos, transcribing audio, cleaning transcripts, extracting entities, standardizing categories, enriching locations) took weeks. In AI systems, the model is rarely the bottleneck. Data quality is.
Defense-in-depth for data quality. No matter how good your ETL pipeline is, garbage leaks through. You need multiple filtering layers: at ingestion time, at retrieval time, and at display time. The gibberish detector, the relevance score threshold, the CDN URL sanitizer, and the deduplication logic each catch things the others miss.
Real-world search is not one-shot. A single vector search query is not enough for production use. Multi-stage retrieval with pre-filtering, post-filtering, and fallback strategies is essential for maintaining both precision and coverage.
There is an enormous amount of value hidden in plain sight: trapped in social media posts, buried in old videos, lost in unsearchable formats. Sometimes the best AI application isn't one that generates new content. It's one that unlocks the content that already exists.
Live Demo: food-recomendation-chatbot.vercel.app
For other projects, see Chatbot KKP PI, My-Jarvis-Gua, and VoiceInvoice.