How a childhood fascination with Iron Man led me to build a production-grade personal assistant, and why the 'God Prompt' approach failed spectacularly before function calling saved the project.
If I'm being completely honest, one of the main reasons I chose to study Computer Science was because of Tony Stark from Iron Man.
Like many tech enthusiasts, I was fascinated by the idea of JARVIS, an intelligent, omnipresent assistant capable of understanding deep context, managing complex information, and actively helping with everyday decisions. For years, it was just a cool sci-fi concept in my head.
But in 2026, I found myself with several months of free time. Instead of just taking a break, I decided to treat this time as a personal incubator. I wanted to challenge myself with an ambitious long-term goal: I was going to try to build my own JARVIS.
This is the story of how My-Jarvis-Gua started, how my first approach failed spectacularly, and what I learned about AI system design along the way.
To make this project genuinely meaningful, I didn't want to just build a cool tech demo. I needed it to solve actual problems I faced every day.
When I looked at how I managed my life, I realized it was completely fragmented:
Every day, I was manually entering data across a dozen different interfaces, switching between apps, and still losing information in the gaps between them. No single system gave me a complete picture of my life.
I wanted to explore a simple question:
"What if I could just talk to one system and have it handle everything?"
Instead of clicking through menus and filling out forms, what if I could just send a Telegram message saying "beli nasi goreng 15k" and have it automatically recorded, categorized, and reflected in a dashboard? What if I could send a voice note about my lunch and have the AI figure out the rest?
When I started coding the backend in Python, I took the most obvious route. I wrote a massive system prompt that tried to handle everything at once.
I told the AI:
"You are my personal assistant. You can record expenses, track health, manage vehicle maintenance, and organize tasks. Parse the user's message and respond accordingly."
It worked... terribly.
As I threw different types of information at it, from structured financial transactions to vague daily activity logs, the system became confused. The responses were unreliable. Sometimes it would "record" an expense by just saying "I've noted that down" without actually touching the database. Other times, it would hallucinate parameters that didn't exist.
The worst part was prompt fragility. Every time I tweaked the prompt to improve one behavior, something else would break. If I made it better at categorizing food expenses, it would suddenly forget how to handle income transactions.
This was my biggest technical bottleneck, but also my greatest learning moment.
I realized that the problem wasn't the AI model. The problem was that I was asking a single prompt to be an entire software system. An LLM is great at language understanding, but it's terrible at being a database, a state machine, and a business logic layer all at once.
The turning point came when I discovered OpenAI's function calling (tool use) capability.
Instead of asking the AI to "handle everything" in a single massive prompt, I could define explicit tools with strict JSON schemas. The AI's job was reduced to two things:
The actual execution (database writes, calculations, validations) happened in my Python code, where I had full control.
I defined 7 tools:
- create_expense - for recording new transactions
- list_expenses - with 9 filter parameters for flexible querying
- update_expense - partial updates on existing records
- delete_expense - soft deletion
- get_monthly_summary, get_yearly_summary, get_all_time_summary - financial analytics
Each tool had a strict JSON schema with "strict": true, meaning the AI couldn't hallucinate parameters. If the schema said amount must be a number, the AI had to send a number. Period.
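As a rough illustration, a strict tool definition for create_expense could look like the sketch below. It follows the Chat Completions tools layout, but the parameter names (amount, category, description) and their descriptions are assumptions, since the real schema isn't shown here. The helper is_strict_ready is also hypothetical; it only checks the two things strict mode expects from a schema: every property is listed as required, and additional properties are disallowed.

```python
import json

# Hypothetical schema for illustration; the real field list may differ.
CREATE_EXPENSE_TOOL = {
    "type": "function",
    "function": {
        "name": "create_expense",
        "description": "Record a new expense transaction.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number", "description": "Amount spent."},
                "category": {"type": "string", "description": "Expense category."},
                "description": {"type": "string", "description": "What was bought."},
            },
            # Strict mode expects every property to be required
            # and no extra properties to be allowed.
            "required": ["amount", "category", "description"],
            "additionalProperties": False,
        },
    },
}


def is_strict_ready(tool: dict) -> bool:
    """Check that a tool definition meets the strict-mode schema requirements."""
    function = tool["function"]
    parameters = function["parameters"]
    return (
        function.get("strict") is True
        and parameters.get("additionalProperties") is False
        and set(parameters.get("required", [])) == set(parameters["properties"])
    )


if __name__ == "__main__":
    print(json.dumps(CREATE_EXPENSE_TOOL, indent=2))
    print("strict-ready:", is_strict_ready(CREATE_EXPENSE_TOOL))
```

A check like this can run in a unit test so that a newly added tool can't quietly ship with a loose schema.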
This single architectural shift changed everything.
The system became dramatically more reliable. The AI stopped hallucinating database operations. And because each tool was just a Python function, I could test, debug, and iterate on them independently.
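To make "each tool was just a Python function" concrete, here is a minimal dispatch sketch. It assumes the model returns a tool name plus a JSON string of arguments, which is how Chat Completions tool calls arrive. The registry, the dispatch_tool_call helper, and the body of create_expense are all hypothetical stand-ins; a real implementation would write to the database instead of returning a dict.

```python
import json
from typing import Callable, Dict


def create_expense(user_id: str, amount: float, category: str, description: str) -> dict:
    """Hypothetical tool body; a real one would persist the record."""
    return {
        "status": "created",
        "user_id": user_id,
        "amount": amount,
        "category": category,
        "description": description,
    }


# Maps the tool name the model chooses to the Python function that runs it.
TOOL_REGISTRY: Dict[str, Callable[..., dict]] = {
    "create_expense": create_expense,
}


def dispatch_tool_call(user_id: str, name: str, arguments_json: str) -> dict:
    """Run the tool picked by the model; user_id comes from our code, never from the model."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"status": "error", "message": f"unknown tool: {name}"}
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return {"status": "error", "message": f"invalid arguments: {exc}"}
    return handler(user_id=user_id, **arguments)


if __name__ == "__main__":
    raw = '{"amount": 15000, "category": "food", "description": "nasi goreng"}'
    print(dispatch_tool_call("user-1", "create_expense", raw))
```

Because the dispatcher is plain Python, each tool can be tested with hand-written JSON and no model in the loop.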
After text chat was working, I wanted to push further. What if the system could understand voice messages too?
I integrated OpenAI's Whisper model for speech-to-text. The pipeline works like this:
- .ogg audio file.
- language="id" for optimized Indonesian recognition.
One critical design decision was the transcribe_safe() method. Whisper can fail: bad audio quality, background noise, empty recordings. Instead of letting the entire pipeline crash, this method returns a tuple of (text, error). The handler checks for errors and gracefully informs the user instead of showing a cryptic error message.
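The (text, error) pattern can be sketched as follows. This is not the project's actual method: here transcribe_safe is a plain function, and the transcribe argument is a hypothetical callable standing in for the real Whisper call so the example runs without network access.

```python
from typing import Callable, Optional, Tuple


def transcribe_safe(
    audio_path: str,
    transcribe: Callable[[str], str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, error_message) on failure."""
    try:
        text = transcribe(audio_path)
    except Exception as exc:  # any failure becomes a message, not a crash
        return None, f"transcription failed: {exc}"
    text = (text or "").strip()
    if not text:
        return None, "no speech detected"
    return text, None


def fake_whisper(path: str) -> str:
    """Stand-in transcriber used only for this demonstration."""
    return " beli kopi 25 ribu "


if __name__ == "__main__":
    text, error = transcribe_safe("voice.ogg", fake_whisper)
    if error:
        print("Sorry, I couldn't understand that voice note:", error)
    else:
        print("Transcribed:", text)
```

The handler only ever branches on error, so a new failure mode inside the transcriber doesn't require touching the calling code.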
The result was transformative. Recording a 3-second voice note saying "beli kopi 25 ribu" is far more natural than typing it out, especially on mobile. It made the bot feel genuinely useful in daily life rather than just a technical experiment.
One problem that took me an unexpectedly long time to solve was authentication.
The web dashboard uses Supabase Auth with JWT tokens, which is standard stuff. But the Telegram bot operates in a completely different world. Telegram sends webhooks with a chat_id, which is just a number. There's no JWT, no OAuth, no session cookie.
I needed a way to securely bridge these two authentication systems.
My solution was a one-time code verification flow:
- MYJARVIS-AB12CD with a 15-minute expiration.
- /connect MYJARVIS-AB12CD.
- telegram_chat_id to the user's profile.
This seems simple in hindsight, but getting the security model right was tricky.
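The code-generation and expiry parts of such a flow might look like this sketch. The MYJARVIS- prefix, six-character suffix, and 15-minute lifetime mirror the example above; the function names, the choice of uppercase letters plus digits, and the use of the secrets module are assumptions rather than the project's actual code.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

CODE_PREFIX = "MYJARVIS-"
CODE_ALPHABET = string.ascii_uppercase + string.digits
CODE_TTL = timedelta(minutes=15)


def generate_link_code(now: Optional[datetime] = None) -> Tuple[str, datetime]:
    """Create a random one-time code and the moment it expires."""
    now = now or datetime.now(timezone.utc)
    suffix = "".join(secrets.choice(CODE_ALPHABET) for _ in range(6))
    return CODE_PREFIX + suffix, now + CODE_TTL


def is_code_valid(
    submitted: str,
    stored: str,
    expires_at: datetime,
    now: Optional[datetime] = None,
) -> bool:
    """Accept the code only if it matches and has not expired."""
    now = now or datetime.now(timezone.utc)
    # compare_digest avoids leaking how many characters matched via timing.
    return now < expires_at and secrets.compare_digest(submitted, stored)


if __name__ == "__main__":
    code, expires_at = generate_link_code()
    print(code, "valid until", expires_at.isoformat())
    print("accepted:", is_code_valid(code, code, expires_at))
```

For the code to be truly one-time, the stored value also has to be deleted or marked as used once a /connect succeeds; that step is left out of this sketch.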
Because the Telegram bot has no JWT context, it must use Supabase's admin client (service role), which bypasses Row Level Security entirely. This means every single database query in the bot code must explicitly filter by user_id. A missing WHERE user_id = ? clause would expose one user's data to another.
To prevent this, I added an assert user_id check as the very first line of the AI service factory function. If somehow a handler forgot to pass the user ID, the system would crash immediately instead of silently leaking data.
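A fail-fast guard of this kind could be sketched as below. The AIService class and create_ai_service factory here are simplified stand-ins, not the project's real code. One deliberate variation from the assert described above: this sketch raises ValueError explicitly, because Python removes assert statements when run with the -O flag.

```python
from typing import Optional


class AIService:
    """Simplified stand-in for the real service; it only remembers its owner."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id


def create_ai_service(user_id: Optional[str]) -> AIService:
    """Refuse to build a service that isn't bound to a specific user."""
    if not user_id:
        # Crash loudly here rather than run unscoped queries later.
        raise ValueError("user_id is required to create an AIService")
    return AIService(user_id)


if __name__ == "__main__":
    service = create_ai_service("user-123")
    print("service bound to", service.user_id)
    try:
        create_ai_service(None)
    except ValueError as exc:
        print("rejected:", exc)
```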
Not everything needs AI. For users who prefer structured input, I built a guided 5-step expense flow using python-telegram-bot's ConversationHandler.
When a user types /addexpense, the bot walks them through:
- /skip.
- YYYY-MM-DD format, can /skip for today.
At any point, the user can type /cancel to abort.
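The date step, for example, reduces to a small pure function that the conversation handler can call. This is a sketch under assumptions: parse_date_step is a hypothetical name, and returning None to mean "ask again" is one possible convention, not necessarily the one the bot uses.

```python
from datetime import date, datetime
from typing import Optional


def parse_date_step(message: str, today: Optional[date] = None) -> Optional[date]:
    """Return the chosen date, today's date for /skip, or None if the input is invalid."""
    text = message.strip()
    if text == "/skip":
        return today or date.today()
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        # Not a real calendar date in YYYY-MM-DD form; the bot should re-prompt.
        return None


if __name__ == "__main__":
    print(parse_date_step("2026-03-14"))
    print(parse_date_step("/skip"))
    print(parse_date_step("next tuesday"))
```

Keeping the parsing separate from the Telegram handler means each step's validation can be tested without simulating a chat.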
This taught me an important lesson about handler registration order. In python-telegram-bot, handlers in the same group are checked in the order they're registered, and the first one that matches handles the update. If I registered the general text handler (the AI chat fallback) before the ConversationHandler, it would intercept messages that were meant for the active conversation state.
The fix was straightforward but non-obvious: always register ConversationHandler instances first, then command handlers, then the fallback text handler last. I spent hours debugging this before finding the root cause in the library documentation.
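The effect of ordering can be reproduced with a toy first-match dispatcher. This is a simplified model written for illustration, not python-telegram-bot's actual code: each "handler" is just a name and a predicate, and conversation_step stands in for an active conversation state that is waiting for a plain-text reply.

```python
from typing import Callable, List, Optional, Tuple

Handler = Tuple[str, Callable[[str], bool]]


def first_match(handlers: List[Handler], message: str) -> Optional[str]:
    """Return the name of the first handler whose predicate accepts the message."""
    for name, matches in handlers:
        if matches(message):
            return name
    return None


def conversation_step(message: str) -> bool:
    """Toy stand-in for an active conversation state expecting a text reply."""
    return not message.startswith("/")


def command(message: str) -> bool:
    return message.startswith("/")


def ai_fallback(message: str) -> bool:
    """Toy stand-in for the catch-all text handler that forwards to the AI."""
    return not message.startswith("/")


WRONG_ORDER: List[Handler] = [
    ("ai_fallback", ai_fallback),
    ("conversation", conversation_step),
    ("command", command),
]
RIGHT_ORDER: List[Handler] = [
    ("conversation", conversation_step),
    ("command", command),
    ("ai_fallback", ai_fallback),
]

if __name__ == "__main__":
    # The user is mid-flow and replies with an amount.
    print("wrong order ->", first_match(WRONG_ORDER, "25000"))
    print("right order ->", first_match(RIGHT_ORDER, "25000"))
```

In the wrong order the fallback swallows the reply, so the guided flow never advances, which matches the symptom described above.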
Beyond function calling, I implemented a semantic search layer using pgvector.
Every time a new expense is created or updated, the system generates a 1536-dimensional embedding vector from the transaction's metadata (type, amount, category, description, subcategory, payment method) using OpenAI's text-embedding-3-small model.
These embeddings are stored directly in the PostgreSQL expenses table alongside the structured data. A custom match_expense() SQL function performs cosine similarity search to find semantically related transactions.
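To show what a cosine similarity search computes, here is a pure-Python sketch of the ranking. The real work happens inside PostgreSQL; this version uses made-up 3-dimensional vectors instead of 1536-dimensional embeddings, and the match_expenses helper with its threshold and limit parameters is a hypothetical stand-in for the SQL function.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_expenses(
    query: Sequence[float],
    rows: List[Tuple[str, Optional[Sequence[float]]]],
    threshold: float = 0.5,
    limit: int = 5,
) -> List[Tuple[str, float]]:
    """Rank rows by similarity to the query, skipping rows without an embedding."""
    scored = [
        (description, cosine_similarity(query, embedding))
        for description, embedding in rows
        if embedding is not None
    ]
    relevant = [item for item in scored if item[1] >= threshold]
    return sorted(relevant, key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    # Toy vectors invented for the example; real embeddings come from the model.
    rows = [
        ("nasi goreng", [0.9, 0.1, 0.0]),
        ("bensin", [0.0, 0.1, 0.9]),
        ("not embedded yet", None),
    ]
    for description, score in match_expenses([1.0, 0.0, 0.0], rows):
        print(f"{description}: {score:.3f}")
```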
This means when a user asks "how much did I spend on food last week?", the system doesn't just do a keyword match on "food"; it understands that "nasi goreng", "makan siang", and "jajan" are all semantically related to food, even if the exact word was never used.
A key design decision was making embedding generation error-tolerant. The generate_for_expenses_safe() method wraps the entire embedding pipeline in a try/except. If the OpenAI API is down or the embedding fails for any reason, the expense is still saved; the embedding column just stays NULL. The main user flow is never blocked by a failed embedding.
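The error-tolerant pattern might look like the sketch below. The names are hypothetical: generate_embedding_safe stands in for the real method, embed is an injected callable replacing the OpenAI call, and a plain list plays the role of the expenses table.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


def generate_embedding_safe(
    text: str,
    embed: Callable[[str], List[float]],
) -> Optional[List[float]]:
    """Return the embedding, or None if anything goes wrong while creating it."""
    try:
        return embed(text)
    except Exception:
        logger.exception("embedding failed; saving expense without a vector")
        return None


def save_expense(expense: dict, embed: Callable[[str], List[float]], store: list) -> dict:
    """Save the expense whether or not the embedding could be generated."""
    text = f"{expense['type']} {expense['amount']} {expense['category']} {expense['description']}"
    record = {**expense, "embedding": generate_embedding_safe(text, embed)}
    store.append(record)
    return record


def broken_embed(text: str) -> List[float]:
    """Simulates the embedding API being unavailable."""
    raise ConnectionError("embedding API unavailable")


if __name__ == "__main__":
    table: list = []
    expense = {"type": "expense", "amount": 25000, "category": "food", "description": "kopi"}
    saved = save_expense(expense, broken_embed, table)
    print("rows saved:", len(table), "embedding:", saved["embedding"])
```

Rows saved with a missing embedding can be backfilled later by a separate job, since the structured data is already intact.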
After months of iteration, the backend settled into a clean, layered architecture:
- api/: FastAPI routes handling HTTP requests, authentication, and input validation.
- bot/: Telegram handlers with their own factory patterns and state management.
- services/: Business logic that's shared between the API and the bot; the same ExpenseService serves both interfaces without code duplication.
- repositories/: Pure database operations using the Supabase client.
- infrastructure/: Singleton clients for OpenAI and Supabase.
This separation paid enormous dividends. When I wanted to add voice support, I didn't have to touch the expense creation logic at all. I just wrote a new handler that transcribed audio and fed the text into the existing AIService.chat() method. Everything downstream (function calling, tool dispatch, database operations) worked automatically.
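The layering can be illustrated with a stripped-down sketch in which two entry points share one service. Only the ExpenseService name comes from the text above; the in-memory repository, the method signatures, and the two entry-point functions are assumptions made so the example runs without FastAPI, Telegram, or Supabase.

```python
from typing import Dict, List


class InMemoryExpenseRepository:
    """Stand-in for the repository layer; a real one would call the database client."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def insert(self, row: Dict) -> Dict:
        self._rows.append(row)
        return row

    def list_for_user(self, user_id: str) -> List[Dict]:
        # Every read is scoped to one user.
        return [row for row in self._rows if row["user_id"] == user_id]


class ExpenseService:
    """Business rules live here, so every interface gets the same behavior."""

    def __init__(self, repository: InMemoryExpenseRepository) -> None:
        self._repository = repository

    def create_expense(self, user_id: str, amount: float, description: str) -> Dict:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return self._repository.insert(
            {"user_id": user_id, "amount": amount, "description": description}
        )


def api_create_expense(service: ExpenseService, user_id: str, payload: Dict) -> Dict:
    """Hypothetical HTTP entry point: unpacks a request body."""
    return service.create_expense(user_id, payload["amount"], payload["description"])


def bot_create_expense(service: ExpenseService, user_id: str, amount: float, text: str) -> Dict:
    """Hypothetical bot entry point: passes values collected from chat."""
    return service.create_expense(user_id, amount, text)


if __name__ == "__main__":
    repository = InMemoryExpenseRepository()
    service = ExpenseService(repository)
    api_create_expense(service, "user-1", {"amount": 15000, "description": "nasi goreng"})
    bot_create_expense(service, "user-1", 25000, "kopi")
    print(repository.list_for_user("user-1"))
```

A voice handler fits the same shape: it is one more thin entry point that produces text and hands it to the existing service.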
AI Is the Easy Part - Connecting to GPT-4o and getting responses took an afternoon. Building reliable authentication, database schemas with soft deletes and vector search, secure cross-platform account linking, error-tolerant embedding pipelines, and a proper handler registration order took months. The "unglamorous infrastructure" is what makes AI actually useful in production.
Function Calling > Prompt Engineering - My "God Prompt" approach was fragile and unreliable. Function calling with strict schemas made the system predictable and testable. The lesson: don't ask the AI to be your software. Let it use your software.
Voice Is Underrated - Adding speech-to-text was technically simple (Whisper API is excellent), but it transformed the user experience more than any other feature. People use their phones while walking, cooking, driving. Text input is a barrier. Voice removes it.
Clean Architecture Enables Speed - When my codebase was a mess, adding a new feature meant rewriting half the system. After restructuring into clean layers (API → Service → Repository), I could add the entire voice pipeline in a single day because the business logic was already encapsulated.
Security Is Not Optional - Using an admin client that bypasses RLS was necessary for the Telegram bot, but it meant I had to be paranoid about user_id filtering. One missing WHERE clause could expose private financial data. I learned to treat security not as a feature to add later, but as an architectural constraint to design around from day one.
My-Jarvis-Gua is still actively developed. The finance module is complete, but the original vision is much bigger.
Future modules I'm planning:
Each new module will plug into the existing function calling framework as additional tools, which is exactly why I designed the architecture to be modular in the first place.
This project started as a childhood dream inspired by a movie. It became my most ambitious engineering challenge and my most effective learning platform.
The biggest lesson I learned isn't about pgvector, function calling, or Whisper transcription. It's that ambitious, intimidating goals become achievable when you break them down into meaningful, practical problems.
I started out wanting to build a sci-fi supercomputer. Along the way, I learned how to design clean architectures, build secure cross-platform authentication, orchestrate AI tool use, handle voice input gracefully, and create software that genuinely improves my everyday life.
It's not quite Tony Stark's JARVIS yet. But it's mine, it understands my voice, it manages my money, and it's getting smarter every day.