June 7, 2026

14 min read

Muhammad Fauza

Why I Deleted Half My Code and Built a Better Product

How building a voice-powered receipt app for traditional market merchants taught me that the best engineering decision is often removing complexity, not adding it.

#AI #Multimodal #Gemini #PWA #Offline-First #Next.js #Product Design

I didn't plan to build VoiceInvoice. It started from a simple observation.

I was at a traditional market, watching a merchant try to serve three customers at once. She was weighing rice with one hand and scribbling numbers on a piece of paper with the other, doing mental arithmetic while customers kept adding items to their orders.

She made two mistakes in the receipts I saw. Not because she was bad at math, but because the situation was impossible: simultaneous physical tasks were competing for the same cognitive resources.

I watched other merchants too. Some used calculator apps, but typing on tiny phone screens while their hands were wet or dusty from handling goods was almost as slow as writing by hand. Others simply gave up on receipts entirely and did everything from memory.

That's when the question hit me:

"What if a merchant could just say what they sold and get a receipt automatically?"


Starting with the Wrong Architecture

Like most developers, my first instinct was to reach for the tools I already knew.

I designed a classic two-step AI pipeline:

▸Send the merchant's voice recording to OpenAI Whisper for speech-to-text transcription.
▸Take the transcribed text and send it to GPT-4 to parse it into structured JSON (item names, quantities, prices).

On paper, it looked elegant. Two specialized models, each doing what it's best at. The kind of architecture you'd proudly diagram in a technical blog post.

I built a working prototype. And it was... terrible.

The Three Problems I Didn't Anticipate

Problem 1: Latency Killed the Experience

On a WiFi connection in my apartment, the pipeline took 3-4 seconds. Acceptable for a demo.

But traditional market merchants don't use WiFi. They use mobile data, often 3G or spotty 4G. On real mobile networks, each API call added 1-2 seconds of network overhead. Two sequential API calls meant 6-8 seconds before the merchant saw any result.

For context: writing a receipt by hand takes about 15 seconds. If the AI system takes 8 seconds just to process, and then the merchant needs a few more seconds to review and save, the total time barely beats the manual process. The value proposition was gone.

Problem 2: Compounding Errors

Whisper is an excellent speech-to-text model. But it's not perfect, especially with Indonesian market terminology, informal pricing expressions, and background noise from busy market environments.

When Whisper misheard "lima belas ribu" (fifteen thousand) as "lima ribu" (five thousand), the downstream LLM would faithfully parse the wrong number into a perfectly formatted JSON object. The system would confidently produce an incorrect receipt, and the merchant would have no way of knowing unless they manually checked every number.

Two models in sequence meant two opportunities for errors to compound.

Problem 3: Two Points of Failure

If OpenAI's API had latency spikes, the entire pipeline slowed down. If the Whisper endpoint was temporarily unavailable, the system was completely broken, even if GPT-4 was working fine.

For merchants who might make 50-100 transactions per day, even a 2% failure rate would mean 1-2 broken transactions daily. That's enough to destroy trust in the system.
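
The reliability arithmetic above can be sketched in a few lines. The function names and the 99% per-step success figure are illustrative assumptions, not measurements from the project; only the 2% failure rate and the 50-100 daily transactions come from the text.

```typescript
// Probability that every step in a sequential chain succeeds,
// assuming the steps fail independently of each other.
function serialSuccessRate(stepSuccessRates: number[]): number {
  return stepSuccessRates.reduce((acc, p) => acc * p, 1);
}

// Expected number of failed transactions in one day.
function expectedFailuresPerDay(transactions: number, failureRate: number): number {
  return transactions * failureRate;
}

// Hypothetical example: two chained calls that each succeed 99% of the time.
const twoStep = serialSuccessRate([0.99, 0.99]);
console.log(twoStep.toFixed(4)); // prints "0.9801"

// The figures from the text: a 2% failure rate at 50 and 100 transactions per day.
console.log(expectedFailuresPerDay(50, 0.02), expectedFailuresPerDay(100, 0.02));
```

Under the independence assumption, chaining two calls always yields a lower combined success rate than either call alone, which is the structural cost of the two-step design.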

The Discovery That Changed Everything

While researching alternatives, I tested Google's Gemini 2.0 Flash, a multimodal model that can process audio, images, and text natively.

I sent it a raw audio file of someone saying "beras 5 kilo tujuh puluh ribu, gula 2 kilo dua puluh lima ribu" and asked it to return both the transcription and structured JSON.

It worked. In a single request.
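
A single-request payload of this kind might look roughly like the sketch below. It is not the project's actual handler: the prompt wording and the function name are hypothetical, and the field names (`contents`, `parts`, `inline_data`, `generationConfig.responseMimeType`) reflect the public generateContent REST examples as best understood here, so they should be checked against the current Gemini API documentation.

```typescript
// Request body shape; field names are an assumption based on public REST examples.
interface GenerateContentBody {
  contents: Array<{
    parts: Array<
      | { text: string }
      | { inline_data: { mime_type: string; data: string } }
    >;
  }>;
  generationConfig: { responseMimeType: string };
}

// Hypothetical prompt; the wording the project really uses is not shown in this post.
const RECEIPT_PROMPT =
  "Transcribe the audio, then return JSON with keys `transcription` and `items` " +
  "(each item: item_name, qty, unit, unit_price, subtotal).";

// Combine the instruction and the base64-encoded recording into one request body.
function buildReceiptRequest(audioBase64: string, mimeType: string): GenerateContentBody {
  return {
    contents: [
      {
        parts: [
          { text: RECEIPT_PROMPT },
          { inline_data: { mime_type: mimeType, data: audioBase64 } },
        ],
      },
    ],
    // Ask for a JSON response instead of free-form text.
    generationConfig: { responseMimeType: "application/json" },
  };
}
```

The body would then be sent in one POST to the model's generateContent endpoint; the transport and API-key handling are left out of this sketch.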

No intermediate transcription step. No second API call. The model heard the audio directly and produced exactly the output I needed:

{
  "transcription": "beras 5 kilo tujuh puluh ribu, gula 2 kilo dua puluh lima ribu",
  "items": [
    {"item_name": "beras", "qty": 5, "unit": "kg", "unit_price": 70000, "subtotal": 350000},
    {"item_name": "gula", "qty": 2, "unit": "kg", "unit_price": 25000, "subtotal": 50000}
  ]
}

The latency was roughly half of the two-step pipeline. One request instead of two. One point of failure instead of two. And because the model heard the original audio directly, instead of working from an intermediate transcription, it made fewer errors on ambiguous pricing expressions.
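
Because the response follows a fixed JSON shape, it can be sanity-checked before it reaches the merchant. The sketch below is an illustration, not part of the project as described: `checkReceipt` is a hypothetical helper, and the half-rupiah tolerance is an assumption for fractional quantities. It catches malformed output and inconsistent arithmetic, though not a misheard price that happens to be internally consistent.

```typescript
interface ReceiptItem {
  item_name: string;
  qty: number;
  unit: string;
  unit_price: number;
  subtotal: number;
}

interface Receipt {
  transcription: string;
  items: ReceiptItem[];
}

// Parse the model's JSON text and list any problems found (empty list = consistent).
function checkReceipt(jsonText: string): { receipt: Receipt | null; problems: string[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonText);
  } catch {
    return { receipt: null, problems: ["response is not valid JSON"] };
  }
  const candidate = parsed as Partial<Receipt> | null;
  if (!candidate || typeof candidate.transcription !== "string" || !Array.isArray(candidate.items)) {
    return { receipt: null, problems: ["missing transcription or items"] };
  }
  const problems: string[] = [];
  candidate.items.forEach((item, i) => {
    if (
      !item ||
      typeof item.qty !== "number" ||
      typeof item.unit_price !== "number" ||
      typeof item.subtotal !== "number"
    ) {
      problems.push(`item ${i}: non-numeric qty, unit_price, or subtotal`);
    } else if (Math.abs(item.qty * item.unit_price - item.subtotal) > 0.5) {
      // Assumed tolerance of half a rupiah to absorb floating-point noise.
      problems.push(`item ${i}: subtotal does not equal qty * unit_price`);
    }
  });
  return { receipt: candidate as Receipt, problems };
}
```

A non-empty `problems` list could be surfaced in the review screen so the merchant knows which line to double-check.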

I deleted the Whisper integration. I deleted the GPT-4 parsing logic. I replaced hundreds of lines of orchestration code with a single 84-line route handler.

That was the moment I learned:

Sometimes the best engineering decision is removing complexity, not adding it.


Making It Work Without Internet

After solving the AI pipeline, I faced the second major challenge: internet reliability.

Traditional markets in Indonesia often have poor mobile signal. Some areas get 3G at best. And even when signal is available, it can be intermittent, dropping in and out throughout the day.

For most web applications, "offline support" is a nice-to-have. For VoiceInvoice, offline was the primary mode of operation. A receipt system that only works when you have internet is useless for exactly the merchants who need it most.

I implemented an offline-first architecture using Dexie.js (a wrapper around the browser's IndexedDB):

▸When the merchant creates a receipt: The system first checks if the device is online.
▸If online: The invoice is saved directly to Supabase (PostgreSQL in the cloud).
▸If offline: The invoice is saved to IndexedDB on the device with a sync_status: 'pending_sync' flag.
▸When the device comes back online: A useOfflineSync hook listens for the browser's online event and automatically syncs all pending invoices to Supabase, one by one.
▸If syncing fails for any invoice: That invoice stays in pending_sync status and will be retried next time. Successfully synced invoices are marked as 'synced'.
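
The flow in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory `LocalInvoiceStore` stands in for the Dexie.js table, `Upload` is a hypothetical stand-in for the Supabase insert, and the `Invoice` fields other than `sync_status` are invented for the example.

```typescript
type SyncStatus = "pending_sync" | "synced";

interface Invoice {
  id: string;
  total: number;
  sync_status: SyncStatus;
}

// In-memory stand-in for the IndexedDB table (the real project uses Dexie.js).
class LocalInvoiceStore {
  private rows: Invoice[] = [];
  put(invoice: Invoice): void {
    this.rows = this.rows.filter((r) => r.id !== invoice.id);
    this.rows.push({ ...invoice });
  }
  pending(): Invoice[] {
    return this.rows.filter((r) => r.sync_status === "pending_sync");
  }
  markSynced(id: string): void {
    for (const row of this.rows) {
      if (row.id === id) row.sync_status = "synced";
    }
  }
}

// Hypothetical uploader: resolves on success, rejects on failure.
type Upload = (invoice: Invoice) => Promise<void>;

// Online: upload directly. Offline: queue locally with a pending_sync flag.
async function saveInvoice(
  draft: Omit<Invoice, "sync_status">,
  isOnline: boolean,
  store: LocalInvoiceStore,
  upload: Upload,
): Promise<SyncStatus> {
  if (isOnline) {
    await upload({ ...draft, sync_status: "synced" });
    return "synced";
  }
  store.put({ ...draft, sync_status: "pending_sync" });
  return "pending_sync";
}

// Upload pending invoices one by one; a failed invoice stays pending for the next attempt.
async function syncPending(store: LocalInvoiceStore, upload: Upload): Promise<number> {
  let synced = 0;
  for (const invoice of store.pending()) {
    try {
      await upload({ ...invoice, sync_status: "synced" });
      store.markSynced(invoice.id);
      synced += 1;
    } catch {
      // Leave sync_status as pending_sync so this invoice is retried later.
    }
  }
  return synced;
}
```

In this sketch an upload failure while online simply propagates to the caller; whether the project falls back to the local queue in that case is not stated in the post.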

A subtle but important detail: I added an isSyncingRef guard to prevent duplicate sync operations. Mobile browsers can fire multiple online events in rapid succession when signal is unstable, and without this guard, the same invoice could be posted to the server multiple times.
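
The idea behind such a guard can be shown without React. The sketch below swaps the ref for a plain closure variable, and `createSyncGuard` is a hypothetical name; in a hook, the same flag would presumably live in a `useRef` so it survives re-renders.

```typescript
// Wrap an async task so that overlapping calls are skipped instead of run twice.
// Returns true if the task ran, false if a run was already in progress.
function createSyncGuard(task: () => Promise<void>): () => Promise<boolean> {
  let isSyncing = false; // Plays the role of isSyncingRef.current.
  return async () => {
    if (isSyncing) return false;
    isSyncing = true;
    try {
      await task();
      return true;
    } finally {
      // Always release the guard, even if the task throws.
      isSyncing = false;
    }
  };
}
```

A guarded function like this could be registered as the `online` event listener, so a burst of events triggers only one sync pass at a time.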

Building this system taught me that offline synchronization has edge cases that are deceptively complex. Partial sync failures, potential duplicate submissions, and conflict resolution are all problems that seem simple until you encounter them in production. My current implementation is adequate for this project's scale, but I now understand why companies like Notion and Linear invest enormous engineering effort into their sync engines.

The Browser Compatibility Rabbit Hole

I assumed that if my app worked on Chrome, it would work everywhere. I was wrong.

When I tested on a real Android phone, everything worked perfectly. When I tested on an iPhone, the voice recording silently failed.

The issue: Chrome Android supports audio/webm;codecs=opus, but Safari iOS does not.

The MediaRecorder API doesn't throw an error for unsupported MIME types on every browser; some just produce empty audio files. The fix was a runtime capability check:

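// Prefer Opus-in-WebM; fall back to the generic WebM container if it is unsupported.
const mimeType = MediaRecorder.isTypeSupported('audio/webm;codecs=opus')
  ? 'audio/webm;codecs=opus'
  : 'audio/webm';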

Two lines of code. Hours of debugging. And the bug was completely invisible in the iOS simulator; it only appeared on physical devices.

This experience permanently changed my testing habits. Mobile web features must be tested on real hardware. There is no substitute.
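
The same capability check generalizes to a list of candidates. The sketch below is an extension, not the fix described above: the `audio/mp4` entry is an added assumption for browsers that record MP4 rather than WebM, and the support check is injected so the logic can be exercised outside a browser.

```typescript
// Candidate recording formats in order of preference.
// The audio/mp4 entry is an assumption, not something the original fix includes.
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

// `isSupported` is injected; in the app it would be
// (t) => MediaRecorder.isTypeSupported(t).
function pickMimeType(
  isSupported: (mimeType: string) => boolean,
  candidates: string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  for (const candidate of candidates) {
    if (isSupported(candidate)) return candidate;
  }
  // Nothing matched: let the browser fall back to its own default format.
  return undefined;
}
```

When the result is undefined, the recorder can be constructed without a `mimeType` option, and the format the browser actually chose can be read from the recorder's `mimeType` property.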

Live Transcription as a UX Affordance

When a merchant taps the record button and starts speaking, they need to know the system is actually "hearing" them. Silence and a spinning loader create anxiety, especially for users who aren't tech-savvy.

I implemented a dual recording strategy:

▸MediaRecorder API captures the actual audio bytes that get sent to Gemini.
▸Web Speech API runs simultaneously and provides a real-time transcription preview on screen.

The live preview is not used for the final receipt; Gemini processes the raw audio independently. The preview exists purely to build user confidence. When merchants see their words appearing on screen as they speak, they trust the system.

If Web Speech API isn't available (some browsers don't support it), the recording continues without the preview. No functionality is lost, just the visual feedback.
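
The graceful fallback could be implemented with a small feature-detection helper. The sketch below is an assumption about how that might look: the type and function names are hypothetical, and the globals object is passed in (in the app it would be `window`) so the logic stays testable.

```typescript
// Minimal shape of the recognizer this sketch needs; the full types live in DOM typings.
type RecognitionCtor = new () => { start(): void; stop(): void };

interface SpeechGlobals {
  SpeechRecognition?: RecognitionCtor;
  webkitSpeechRecognition?: RecognitionCtor;
}

// Return the recognizer constructor (standard or webkit-prefixed), or null if absent.
function findSpeechRecognition(globals: SpeechGlobals): RecognitionCtor | null {
  return globals.SpeechRecognition ?? globals.webkitSpeechRecognition ?? null;
}

// Start the optional live preview; audio capture never depends on the result.
function startPreview(globals: SpeechGlobals): { stop(): void } | null {
  const Ctor = findSpeechRecognition(globals);
  if (!Ctor) return null; // No preview available: continue with MediaRecorder only.
  const recognition = new Ctor();
  recognition.start();
  return recognition;
}
```

Keeping the preview in its own function means a missing or failing recognizer only removes the on-screen text, while the MediaRecorder capture path stays untouched.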

This taught me something about product design: perceived reliability is as important as actual reliability. A system that works perfectly but gives no feedback feels broken. A system that shows progress feels trustworthy.

The WhatsApp Integration

In Indonesia, WhatsApp is the universal communication tool, including for businesses. Many merchants already send order confirmations and receipts to customers via WhatsApp.

Instead of building a complex PDF generation system, I kept it simple. The NotaPreview component generates a plain-text formatted receipt and offers two actions:

▸Copy to clipboard - for pasting into any app.
▸Share via WhatsApp - opens wa.me/?text= with the pre-formatted receipt.

No PDF library. No print server. Just text that works everywhere.
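
A plain-text receipt and its share link need very little code. In the sketch below, the receipt layout and both function names are illustrative guesses rather than the NotaPreview component's actual output; the one firm requirement is that the message is URL-encoded before it is appended to wa.me/?text=.

```typescript
interface ReceiptLine {
  item_name: string;
  qty: number;
  unit: string;
  unit_price: number;
  subtotal: number;
}

// Render one line per item plus a total. The layout is a hypothetical example.
function formatReceipt(lines: ReceiptLine[]): string {
  const rows = lines.map(
    (l) => `${l.item_name} ${l.qty} ${l.unit} x ${l.unit_price} = ${l.subtotal}`,
  );
  const total = lines.reduce((sum, l) => sum + l.subtotal, 0);
  return [...rows, `TOTAL: ${total}`].join("\n");
}

// Build the share link; spaces and newlines must be percent-encoded.
function whatsappShareUrl(message: string): string {
  return `https://wa.me/?text=${encodeURIComponent(message)}`;
}
```

The same string returned by `formatReceipt` can feed both actions: the clipboard copy and the WhatsApp link.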

This was another lesson in designing for the user's actual workflow instead of the technically impressive solution.

What Changed My Perspective

Building VoiceInvoice fundamentally changed how I think about software engineering.

Before This Project

I believed:

▸Better AI products come from adding more models and more features.
▸Internet connectivity can be assumed in modern applications.
▸Technical innovation is the primary measure of a system's quality.

After This Project

I learned:

▸Simpler architectures can outperform more complex ones. Replacing a two-model pipeline with a single multimodal request made the system faster, cheaper, and more reliable.
▸Offline reliability can be more valuable than advanced functionality. For the target users, a system that works without internet is infinitely more useful than one with beautiful animations that breaks when signal drops.
▸Great products succeed when they fit naturally into users' daily workflows. The WhatsApp share button is technically trivial: it's just a URL. But it's the feature that makes the product genuinely useful, because it connects to what merchants are already doing.

The Biggest Lesson

The most valuable insight from this project isn't about Gemini or Dexie.js or MediaRecorder MIME types.

It's that good technology should adapt to the realities of its users, not the other way around.

I started this project thinking about AI architectures and model capabilities. I finished it thinking about wet hands gripping phones, unreliable 3G signals, and the cognitive load of serving three customers simultaneously.

The best engineering decisions I made weren't about choosing the right model or the right framework. They were about understanding that a market merchant doesn't care whether the system uses one AI model or five. They care that they can say "beras 5 kilo tujuh puluh ribu" and get a receipt they can WhatsApp to their customer before the next person in line gets impatient.

That shift in thinking, from "what's technically impressive" to "what actually helps", is something I'll carry into every project I build from here.

Muhammad Fauza

Fullstack & AI Engineer passionate about building intelligent systems. Sharing insights on web development, AI, and software engineering.

