June 7, 2026

14 min read

Muhammad Fauza

Building My Own JARVIS: Why the Hardest Part Was Everything Around the AI

How a childhood fascination with Iron Man led me to build a production-grade personal assistant, and why the 'God Prompt' approach failed spectacularly before function calling saved the project.

#AI #OpenAI #Function Calling #System Architecture #Python #FastAPI #Telegram Bot

If I'm being completely honest, one of the main reasons I chose to study Computer Science was Tony Stark from Iron Man.

Like many tech enthusiasts, I was fascinated by the idea of JARVIS: an intelligent, omnipresent assistant capable of understanding deep context, managing complex information, and actively helping with everyday decisions. For years, it was just a cool sci-fi concept in my head.

But in 2026, I found myself with several months of free time. Instead of just taking a break, I decided to treat this time as a personal incubator. I wanted to challenge myself with an ambitious long-term goal: I was going to try and build my own JARVIS.

This is the story of how My-Jarvis-Gua started, how my first approach failed spectacularly, and what I learned about AI system design along the way.

The Problem: A Fragmented Digital Life

To make this project genuinely meaningful, I didn't want to just build a cool tech demo. I needed it to solve actual problems I faced every day.

When I looked at how I managed my life, I realized it was completely fragmented:

▸I tracked expenses in one budgeting app.
▸I logged vehicle maintenance in a spreadsheet.
▸I monitored health and fitness on my smartwatch.
▸I organized tasks in a dedicated to-do app.
▸I jotted down notes on sticky notes scattered across my desk.

Every day, I was manually entering data across a dozen different interfaces, switching between apps, and still losing information in the gaps between them. No single system gave me a complete picture of my life.

I wanted to explore a simple question:

"What if I could just talk to one system and have it handle everything?"

Instead of clicking through menus and filling out forms, what if I could just send a Telegram message saying "beli nasi goreng 15k" (Indonesian for "bought fried rice, 15k") and have it automatically recorded, categorized, and reflected in a dashboard? What if I could send a voice note about my lunch and have the AI figure out the rest?

My First Mistake: The "God Prompt"

When I started coding the backend in Python, I took the most obvious route. I wrote a massive system prompt that tried to handle everything at once.

I told the AI:

"You are my personal assistant. You can record expenses, track health, manage vehicle maintenance, and organize tasks. Parse the user's message and respond accordingly."

It worked... terribly.

As I threw different types of information at it, from structured financial transactions to vague daily activity logs, the system became confused. The responses were unreliable. Sometimes it would "record" an expense by just saying "I've noted that down" without actually touching the database. Other times, it would hallucinate parameters that didn't exist.

The worst part was prompt fragility. Every time I tweaked the prompt to improve one behavior, something else would break. If I made it better at categorizing food expenses, it would suddenly forget how to handle income transactions.

This was my biggest technical bottleneck, but also my greatest learning moment.

I realized that the problem wasn't the AI model. The problem was that I was asking a single prompt to be an entire software system. An LLM is great at language understanding, but it's terrible at being a database, a state machine, and a business logic layer all at once.

The Function Calling Breakthrough

The turning point came when I discovered OpenAI's function calling (tool use) capability.

Instead of asking the AI to "handle everything" in a single massive prompt, I could define explicit tools with strict JSON schemas. The AI's job was reduced to two things:

▸Understand what the user wants.
▸Decide which tool to call with what parameters.

The actual execution (database writes, calculations, validations) happened in my Python code, where I had full control.

I defined 7 tools:

▸create_expense - for recording new transactions
▸list_expenses - with 9 filter parameters for flexible querying
▸update_expense - partial updates on existing records
▸delete_expense - soft deletion
▸get_monthly_summary, get_yearly_summary, get_all_time_summary - financial analytics

Each tool had a strict JSON schema with "strict": true, meaning the arguments the AI produced had to conform to the schema, so it couldn't invent parameters that weren't defined. If the schema said amount must be a number, the AI had to send a number. Period.
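As a rough illustration, a definition for create_expense might look like the sketch below. It assumes the Chat Completions `tools` format, and the field names and descriptions are guesses rather than the project's actual schema. In strict mode, OpenAI expects every property to be listed in `required` and `additionalProperties` to be false, so an optional field is expressed as a nullable type.

```python
# Hypothetical tool definition; field names are assumptions for illustration.
CREATE_EXPENSE_TOOL = {
    "type": "function",
    "function": {
        "name": "create_expense",
        "description": "Record a new income or expense transaction.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "amount": {"type": "number", "description": "Positive amount"},
                "type": {"type": "string", "enum": ["income", "expense"]},
                "category": {"type": "string"},
                # Optional in practice, so it is modeled as nullable.
                "description": {"type": ["string", "null"]},
            },
            # Strict mode: all properties required, no extras allowed.
            "required": ["amount", "type", "category", "description"],
            "additionalProperties": False,
        },
    },
}
```

The schema constrains the shape of the arguments, not their correctness: a wrong but well-typed amount still has to be caught by validation in the Python code.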

This single architectural shift changed everything.

The system became dramatically more reliable. The AI stopped hallucinating database operations. And because each tool was just a Python function, I could test, debug, and iterate on them independently.
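The "each tool is just a Python function" idea can be sketched as a small registry that maps the tool name chosen by the model to a function. Everything here is illustrative: `TOOL_REGISTRY`, `dispatch_tool_call`, and the stand-in `create_expense` are hypothetical names, and the real project presumably calls its service layer instead.

```python
import json
from typing import Any, Callable, Dict


def create_expense(user_id: str, amount: float, category: str) -> Dict[str, Any]:
    """Stand-in for the real service call that writes to the database."""
    return {"ok": True, "user_id": user_id, "amount": amount, "category": category}


# Maps the tool name chosen by the model to the Python function that runs it.
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "create_expense": create_expense,
}


def dispatch_tool_call(user_id: str, name: str, arguments_json: str) -> Dict[str, Any]:
    """Run one tool call and return a result dict instead of raising."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return {"ok": False, "error": "arguments were not valid JSON"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    # user_id comes from the authenticated session, never from the model.
    args.pop("user_id", None)
    return handler(user_id=user_id, **args)
```

Because the dispatcher takes a plain name and a JSON string, it can be unit-tested without calling the model at all.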

Making It Hear: The Voice Pipeline

After text chat was working, I wanted to push further. What if the system could understand voice messages too?

I integrated OpenAI's Whisper model for speech-to-text. The pipeline works like this:

▸User records a voice message on Telegram.
▸The bot downloads the .ogg audio file.
▸Audio bytes are sent to Whisper with language="id" for optimized Indonesian recognition.
▸The transcribed text is immediately previewed to the user (so they can see what was understood).
▸The transcribed text is then fed into the exact same AI chat pipeline, including function calling.
▸The final response shows both the transcription and the AI's answer.

One critical design decision was the transcribe_safe() method. Whisper can fail: bad audio quality, background noise, empty recordings. Instead of letting the entire pipeline crash, this method returns a tuple of (text, error). The handler checks for errors and gracefully informs the user instead of showing a cryptic error message.
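A minimal sketch of that (text, error) pattern is shown below. It is written as a standalone function that takes the transcriber as a parameter so it runs without network access. In the real bot the `transcribe` callable would wrap the Whisper API call, and the error messages here are invented for illustration.

```python
from typing import Callable, Optional, Tuple


def transcribe_safe(
    audio: bytes, transcribe: Callable[[bytes], str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, error_message) on failure."""
    if not audio:
        return None, "The recording was empty."
    try:
        text = transcribe(audio).strip()
    except Exception as exc:  # any transcription failure becomes a message
        return None, f"Transcription failed: {exc}"
    if not text:
        return None, "No speech was detected."
    return text, None
```

The handler can then branch on the second element of the tuple and reply with a friendly message instead of letting an exception bubble up to the user.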

The result was transformative. Recording a 3-second voice note saying "beli kopi 25 ribu" ("bought coffee, 25 thousand") is far more natural than typing it out, especially on mobile. It made the bot feel genuinely useful in daily life rather than just a technical experiment.

The Authentication Puzzle: Connecting Two Worlds

One problem that took me an unexpectedly long time to solve was authentication.

The web dashboard uses Supabase Auth with JWTs, which is standard stuff. But the Telegram bot operates in a completely different world. Telegram sends webhooks with a chat_id, which is just a number. There's no JWT, no OAuth, no session cookie.

I needed a way to securely bridge these two authentication systems.

My solution was a one-time code verification flow:

▸User logs into the Next.js dashboard and clicks "Connect Telegram."
▸The API generates a random code like MYJARVIS-AB12CD with a 15-minute expiration.
▸User opens the Telegram bot and sends /connect MYJARVIS-AB12CD.
▸The bot verifies the code against the database, checks expiration, and links the telegram_chat_id to the user's profile.
▸The code is immediately invalidated after use.
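The flow above can be sketched with an in-memory dictionary standing in for the database table. The function names (`issue_code`, `redeem_code`) and the six-character uppercase suffix are assumptions based on the example code format. The real implementation stores codes in Supabase.

```python
import secrets
import string
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

ALPHABET = string.ascii_uppercase + string.digits
CODE_TTL = timedelta(minutes=15)

# In-memory stand-in for the database table: code -> (user_id, expires_at).
_pending: Dict[str, Tuple[str, datetime]] = {}


def issue_code(user_id: str, now: datetime) -> str:
    """Create a one-time code for a logged-in dashboard user."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(6))
    code = f"MYJARVIS-{suffix}"
    _pending[code] = (user_id, now + CODE_TTL)
    return code


def redeem_code(code: str, now: datetime) -> Optional[str]:
    """Return the user_id to link, or None if the code is unknown or expired."""
    entry = _pending.pop(code, None)  # pop makes every code single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    if now > expires_at:
        return None
    return user_id
```

Using `secrets` rather than `random` matters here, since the code is effectively a short-lived credential. With a real database, the lookup and invalidation would also need to happen atomically so two simultaneous /connect messages cannot both succeed.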

This seems simple in hindsight, but getting the security model right was tricky.

Because the Telegram bot has no JWT context, it must use Supabase's admin client (service role), which bypasses Row Level Security entirely. This means every single database query in the bot code must explicitly filter by user_id. A missing WHERE user_id = ? clause would expose one user's data to another.

To prevent this, I added an assert user_id check as the very first line of the AI service factory function. If somehow a handler forgot to pass the user ID, the system would crash immediately instead of silently leaking data.
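A fail-fast guard along those lines might look like the sketch below. The class and function names are hypothetical. One deliberate difference from the description above: this variant raises an explicit exception instead of using `assert`, because Python skips assert statements when run with the `-O` flag.

```python
class AIService:
    """Placeholder for the real service; it only remembers whose data it may touch."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id


def create_ai_service(user_id: str) -> AIService:
    # Guard first: with an admin client that bypasses RLS, a missing
    # user_id must stop the request rather than run unscoped queries.
    if not user_id:
        raise ValueError("user_id is required to build an AIService")
    return AIService(user_id)
```

Either form gives the same behavior in a normal run: a handler that forgets to pass the user ID fails loudly on the first line instead of reaching the database.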

The Conversation State Machine

Not everything needs AI. For users who prefer structured input, I built a guided 5-step expense flow using python-telegram-bot's ConversationHandler.

When a user types /addexpense, the bot walks them through:

▸Amount: "How much?" - validates it's a positive number.
▸Type: "Income or expense?" - enforces exactly two options.
▸Category: "What category?" - free text.
▸Description: "Any details?" - optional, can /skip.
▸Date: "When?" - validates YYYY-MM-DD format, can /skip for today.

At any point, the user can type /cancel to abort.
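The validation in the Amount and Date steps can be sketched as two small pure functions. The names `parse_amount` and `parse_date` are hypothetical, and returning None to mean "ask again" is an assumption about how the handler re-prompts.

```python
import math
from datetime import date, datetime
from typing import Optional


def parse_amount(text: str) -> Optional[float]:
    """Return a positive, finite amount, or None if the input is invalid."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    if not math.isfinite(value) or value <= 0:
        return None
    return value


def parse_date(text: str, today: date) -> Optional[date]:
    """Accept YYYY-MM-DD, or /skip to use today's date."""
    cleaned = text.strip()
    if cleaned == "/skip":
        return today
    try:
        return datetime.strptime(cleaned, "%Y-%m-%d").date()
    except ValueError:
        return None
```

Keeping these free of any Telegram objects means each conversation step only has to call a function and decide whether to advance or repeat the question.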

This taught me an important lesson about handler registration order. In python-telegram-bot, handlers in the same group are checked in the order they're registered, and the first one that matches handles the update. If I registered the general text handler (the AI chat fallback) before the ConversationHandler, it would intercept messages that were meant for the active conversation state.

The fix was straightforward but non-obvious: always register ConversationHandler instances first, then command handlers, then the fallback text handler last. I spent hours debugging this before finding the root cause in the library documentation.
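To see why the order matters, here is a toy model of first-match-wins dispatch. It is not python-telegram-bot code, only a simplified simulation: each "handler" is a name plus a predicate, and the dispatcher stops at the first predicate that matches.

```python
from typing import Callable, List, Optional, Tuple

# A handler is modeled as (name, predicate(text, conversation_active)).
Handler = Tuple[str, Callable[[str, bool], bool]]


def in_conversation(text: str, active: bool) -> bool:
    return active  # matches only while a guided flow is in progress


def is_command(text: str, active: bool) -> bool:
    return text.startswith("/")


def any_text(text: str, active: bool) -> bool:
    return True  # the AI chat fallback accepts everything


def dispatch(handlers: List[Handler], text: str, active: bool) -> Optional[str]:
    """Return the name of the first handler whose predicate matches."""
    for name, matches in handlers:
        if matches(text, active):
            return name
    return None


GOOD_ORDER: List[Handler] = [
    ("conversation", in_conversation),
    ("command", is_command),
    ("ai_fallback", any_text),
]
BAD_ORDER: List[Handler] = [
    ("ai_fallback", any_text),
    ("conversation", in_conversation),
    ("command", is_command),
]
```

With `GOOD_ORDER`, a reply of "15000" during an active /addexpense flow reaches the conversation handler. With `BAD_ORDER`, the same message is swallowed by the AI fallback and the guided flow never advances.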

The Embedding Layer: Teaching the Database to Understand Meaning

Beyond function calling, I implemented a semantic search layer using pgvector.

Every time a new expense is created or updated, the system generates a 1536-dimensional embedding vector from the transaction's metadata (type, amount, category, description, subcategory, payment method) using OpenAI's text-embedding-3-small model.

These embeddings are stored directly in the PostgreSQL expenses table alongside the structured data. A custom match_expense() SQL function performs cosine similarity search to find semantically related transactions.
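For intuition, the ranking that a cosine similarity search performs can be sketched in plain Python. This is only a conceptual illustration with tiny made-up vectors; the real search runs inside PostgreSQL, where pgvector's `<=>` operator returns cosine distance (1 minus similarity).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank (label, vector) rows by similarity to the query, best first."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because cosine similarity ignores vector length and compares direction only, two descriptions with similar meaning score close to 1 even when they share no words.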

This means when a user asks "how much did I spend on food last week?", the system doesn't just do a keyword match on "food". It understands that "nasi goreng" (fried rice), "makan siang" (lunch), and "jajan" (snacks) are all semantically related to food, even if the exact word was never used.

A key design decision was making embedding generation error-tolerant. The generate_for_expenses_safe() method wraps the entire embedding pipeline in a try/except. If the OpenAI API is down or the embedding fails for any reason, the expense is still saved; the embedding column just stays NULL. The main user flow is never blocked by a failed embedding.
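The shape of that error-tolerant wrapper can be sketched as follows. The names `generate_embedding_safe` and `save_expense` are hypothetical, and the embedding call is injected as a parameter so the example runs without the OpenAI API.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


def generate_embedding_safe(
    text: str, embed: Callable[[str], List[float]]
) -> Optional[List[float]]:
    """Return the embedding, or None if generation fails for any reason."""
    try:
        return embed(text)
    except Exception:
        logger.exception("Embedding failed; saving the expense without a vector")
        return None


def save_expense(expense: Dict, embed: Callable[[str], List[float]]) -> Dict:
    """Build the row to persist; the embedding column may legitimately be None."""
    row = dict(expense)
    row["embedding"] = generate_embedding_safe(expense.get("description", ""), embed)
    return row
```

Rows saved with a NULL embedding can be backfilled later by a separate job, so a temporary API outage only delays semantic search for those records.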

What the Architecture Actually Looks Like

After months of iteration, the backend settled into a clean, layered architecture:

▸API Layer (api/): FastAPI routes handling HTTP requests, authentication, and input validation.
▸Bot Layer (bot/): Telegram handlers with their own factory patterns and state management.
▸Service Layer (services/): Business logic that's shared between the API and the bot. The same ExpenseService serves both interfaces without code duplication.
▸Repository Layer (repositories/): Pure database operations using the Supabase client.
▸Infrastructure (infrastructure/): Singleton clients for OpenAI and Supabase.

This separation paid enormous dividends. When I wanted to add voice support, I didn't have to touch the expense creation logic at all. I just wrote a new handler that transcribed audio and fed the text into the existing AIService.chat() method. Everything downstream (function calling, tool dispatch, database operations) worked automatically.
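A stripped-down sketch of the layering shows how two entry points can share one service. Only `ExpenseService` is named in the text above. The in-memory repository and the two entry-point functions are hypothetical stand-ins for the Supabase repository, the FastAPI route, and the Telegram handler.

```python
from typing import Dict, List


class InMemoryExpenseRepository:
    """Repository layer stand-in: pure storage, no business rules."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def insert(self, row: Dict) -> Dict:
        self._rows.append(row)
        return row

    def list_for_user(self, user_id: str) -> List[Dict]:
        return [r for r in self._rows if r["user_id"] == user_id]


class ExpenseService:
    """Service layer: validation and rules shared by every interface."""

    def __init__(self, repo: InMemoryExpenseRepository) -> None:
        self._repo = repo

    def create(self, user_id: str, amount: float, category: str) -> Dict:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return self._repo.insert(
            {"user_id": user_id, "amount": amount, "category": category}
        )


def api_create_expense(service: ExpenseService, user_id: str, payload: Dict) -> Dict:
    """API layer entry point: receives an already-parsed JSON body."""
    return service.create(user_id, payload["amount"], payload["category"])


def bot_create_expense(
    service: ExpenseService, user_id: str, amount_text: str, category: str
) -> Dict:
    """Bot layer entry point: receives raw chat text."""
    return service.create(user_id, float(amount_text), category)
```

Both entry points only translate their input format and then call the same `create` method, which is why a new interface such as voice needs no changes below the handler.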

What This Project Taught Me

AI Is the Easy Part - Connecting to GPT-4o and getting responses took an afternoon. Building reliable authentication, database schemas with soft deletes and vector search, secure cross-platform account linking, error-tolerant embedding pipelines, and a proper handler registration order took months. The "unglamorous infrastructure" is what makes AI actually useful in production.

Function Calling > Prompt Engineering - My "God Prompt" approach was fragile and unreliable. Function calling with strict schemas made the system predictable and testable. The lesson: don't ask the AI to be your software. Let it use your software.

Voice Is Underrated - Adding speech-to-text was technically simple (Whisper API is excellent), but it transformed the user experience more than any other feature. People use their phones while walking, cooking, driving. Text input is a barrier. Voice removes it.

Clean Architecture Enables Speed - When my codebase was a mess, adding a new feature meant rewriting half the system. After restructuring into clean layers (API → Service → Repository), I could add the entire voice pipeline in a single day because the business logic was already encapsulated.

Security Is Not Optional - Using an admin client that bypasses RLS was necessary for the Telegram bot, but it meant I had to be paranoid about user_id filtering. One missing WHERE clause could expose private financial data. I learned to treat security not as a feature to add later, but as an architectural constraint to design around from day one.

What's Next

My-Jarvis-Gua is still actively developed. The finance module is complete, but the original vision is much bigger.

Future modules I'm planning:

▸Health Tracking: Food photo analysis with GPT-4 Vision for calorie estimation, weight trend monitoring, TDEE calculations.
▸Vehicle Maintenance: Automated service reminders based on mileage and time intervals.
▸Task Management: Natural language task creation with intelligent prioritization.
▸Cross-Domain Insights: Correlating patterns across domains, like whether my spending habits change when I skip workouts.

Each new module will plug into the existing function calling framework as additional tools, which is exactly why I designed the architecture to be modular in the first place.

The Bigger Picture

This project started as a childhood dream inspired by a movie. It became my most ambitious engineering challenge and my most effective learning platform.

The biggest lesson I learned isn't about pgvector, function calling, or Whisper transcription. It's that ambitious, intimidating goals become achievable when you break them down into meaningful, practical problems.

I started out wanting to build a sci-fi supercomputer. Along the way, I learned how to design clean architectures, build secure cross-platform authentication, orchestrate AI tool use, handle voice input gracefully, and create software that genuinely improves my everyday life.

It's not quite Tony Stark's JARVIS yet. But it's mine, it understands my voice, it manages my money, and it's getting smarter every day.

Muhammad Fauza

Fullstack & AI Engineer passionate about building intelligent systems. Sharing insights on web development, AI, and software engineering.
