How building a voice-powered receipt app for traditional market merchants taught me that the best engineering decision is often removing complexity, not adding it.
I didn't plan to build VoiceInvoice. It started from a simple observation.
I was at a traditional market, watching a merchant try to serve three customers at once. She was weighing rice with one hand and scribbling numbers on a piece of paper with the other, doing mental arithmetic while customers kept adding items to their orders.
She made two mistakes in the receipts I saw. Not because she was bad at math, but because the situation was impossible: simultaneous physical tasks competing for the same cognitive resources.
I watched other merchants too. Some used calculator apps, but typing on tiny phone screens while their hands were wet or dusty from handling goods was almost as slow as writing by hand. Others simply gave up on receipts entirely and did everything from memory.
That's when the question hit me:
"What if a merchant could just say what they sold and get a receipt automatically?"
Like most developers, my first instinct was to reach for the tools I already knew.
I designed a classic two-step AI pipeline: Whisper transcribes the audio to text, then GPT-4 parses the transcript into structured JSON.
On paper, it looked elegant. Two specialized models, each doing what they're best at. The kind of architecture you'd proudly diagram in a technical blog post.
I built a working prototype. And it was... terrible.
On a WiFi connection in my apartment, the pipeline took 3-4 seconds. Acceptable for a demo.
But traditional market merchants don't use WiFi. They use mobile data, often 3G or spotty 4G. On real mobile networks, each API call added 1-2 seconds of network overhead. Two sequential API calls meant 6-8 seconds before the merchant saw any result.
For context: writing a receipt by hand takes about 15 seconds. If the AI system takes 8 seconds just to process, and then the merchant needs a few more seconds to review and save, the total time barely beats the manual process. The value proposition was gone.
Whisper is an excellent speech-to-text model. But it's not perfect, especially with Indonesian market terminology, informal pricing expressions, and background noise from busy market environments.
When Whisper misheard "lima belas ribu" (fifteen thousand) as "lima ribu" (five thousand), the downstream LLM would faithfully parse the wrong number into a perfectly formatted JSON object. The system would confidently produce an incorrect receipt, and the merchant would have no way of knowing unless they manually checked every number.
Two models in sequence meant two opportunities for errors to compound.
If OpenAI's API had latency spikes, the entire pipeline slowed down. If the Whisper endpoint was temporarily unavailable, the system was completely broken, even if GPT-4 was working fine.
For merchants who might make 50-100 transactions per day, even a 2% failure rate would mean 1-2 broken transactions daily. That's enough to destroy trust in the system.
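The compounding effect can be made concrete with a few lines of arithmetic. This is a rough sketch, assuming the steps fail independently; the 99% per-step success rate is a hypothetical figure, not a measurement from the project. Only the 50-100 transactions per day and the 2% failure rate come from the text above.

```typescript
// Hypothetical per-step success rates, for illustration only.

/** Probability that every step of a sequential pipeline succeeds (assumes independent failures). */
function pipelineSuccessRate(stepSuccessRates: number[]): number {
  return stepSuccessRates.reduce((acc, p) => acc * p, 1);
}

/** Expected number of failed transactions per day at a given failure rate. */
function expectedDailyFailures(transactionsPerDay: number, failureRate: number): number {
  return transactionsPerDay * failureRate;
}

// Two chained steps versus a single step, each assumed to succeed 99% of the time.
const twoStepRate = pipelineSuccessRate([0.99, 0.99]);
const oneStepRate = pipelineSuccessRate([0.99]);

console.log(`two-step failure rate: ${((1 - twoStepRate) * 100).toFixed(2)}%`);
console.log(`one-step failure rate: ${((1 - oneStepRate) * 100).toFixed(2)}%`);

// The range from the text: 50-100 transactions per day at a 2% failure rate.
console.log(expectedDailyFailures(50, 0.02), expectedDailyFailures(100, 0.02));
```

Under these assumptions, chaining two steps roughly doubles the failure rate of one step, which is the whole argument against the pipeline in numeric form.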
While researching alternatives, I tested Google's Gemini 2.0 Flash, a multimodal model that can process audio, images, and text natively.
I sent it a raw audio file of someone saying "beras 5 kilo tujuh puluh ribu, gula 2 kilo dua puluh lima ribu" and asked it to return both the transcription and structured JSON.
It worked. In a single request.
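For readers who want to try this, here is a minimal sketch of what the body of such a single request might look like. It is not the project's actual handler. The field names (`contents`, `parts`, `inlineData`, `generationConfig.responseMimeType`) follow the public generateContent request format as I understand it and should be checked against the official documentation. The prompt text and the function name `buildReceiptRequest` are hypothetical.

```typescript
// Sketch only: request shape as I understand the public generateContent format.
interface GeminiPart {
  text?: string;
  inlineData?: { mimeType: string; data: string }; // data is base64-encoded audio
}

interface GeminiRequestBody {
  contents: { role: "user"; parts: GeminiPart[] }[];
  generationConfig: { responseMimeType: string };
}

// Hypothetical prompt; the wording used in the real app is not known.
const RECEIPT_PROMPT =
  "Transcribe this audio and return JSON with a 'transcription' string and an " +
  "'items' array of { item_name, qty, unit, unit_price, subtotal }.";

/** Builds one request that carries both the instruction and the raw audio. */
function buildReceiptRequest(audioBase64: string, mimeType: string): GeminiRequestBody {
  return {
    contents: [
      {
        role: "user",
        parts: [
          { text: RECEIPT_PROMPT },
          { inlineData: { mimeType, data: audioBase64 } },
        ],
      },
    ],
    // Ask for JSON so the reply can be parsed without stripping prose around it.
    generationConfig: { responseMimeType: "application/json" },
  };
}
```

The resulting object would be sent as JSON in a single POST to the model's generateContent endpoint, with the API key kept on the server side.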
No intermediate transcription step. No second API call. The model heard the audio directly and produced exactly the output I needed:
{
  "transcription": "beras 5 kilo tujuh puluh ribu, gula 2 kilo dua puluh lima ribu",
  "items": [
    {"item_name": "beras", "qty": 5, "unit": "kg", "unit_price": 70000, "subtotal": 350000},
    {"item_name": "gula", "qty": 2, "unit": "kg", "unit_price": 25000, "subtotal": 50000}
  ]
}
The latency was roughly half that of the two-step pipeline. One request instead of two. One point of failure instead of two. And because the model heard the original audio directly, instead of working from an intermediate transcription, it made fewer errors on ambiguous pricing expressions.
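Fewer errors is not zero errors, and a misheard price still arrives as well-formed JSON. One cheap safeguard is to check the response shape and the arithmetic before showing the receipt. The sketch below assumes the field names from the example output above. The function names and the idea of returning a list of problems are my own illustration, not something the project is known to do.

```typescript
interface ReceiptItem {
  item_name: string;
  qty: number;
  unit: string;
  unit_price: number;
  subtotal: number;
}

/** Type guard: true when the value has every field of a receipt item with the right type. */
function isReceiptItem(value: unknown): value is ReceiptItem {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.item_name === "string" &&
    typeof v.unit === "string" &&
    typeof v.qty === "number" &&
    typeof v.unit_price === "number" &&
    typeof v.subtotal === "number"
  );
}

/** Returns human-readable problems; an empty list means the response looks consistent. */
function findReceiptProblems(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["response is not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null) return ["response is not an object"];

  const body = parsed as Record<string, unknown>;
  const items = body.items;
  if (typeof body.transcription !== "string" || !Array.isArray(items)) {
    return ["response is missing transcription or items"];
  }

  const problems: string[] = [];
  items.forEach((item: unknown, index: number) => {
    if (!isReceiptItem(item)) {
      problems.push(`item ${index}: unexpected shape`);
    } else if (Math.abs(item.qty * item.unit_price - item.subtotal) > 0.5) {
      // Prices are whole rupiah, so anything beyond rounding noise is a real mismatch.
      problems.push(`item ${index} (${item.item_name}): subtotal does not equal qty * unit_price`);
    }
  });
  return problems;
}
```

A check like this cannot tell whether "lima belas ribu" was heard correctly, but it does flag rows where the numbers contradict each other, which gives the merchant a reason to look twice.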
I deleted the Whisper integration. I deleted the GPT-4 parsing logic. I replaced hundreds of lines of orchestration code with a single 84-line route handler.
That was the moment I learned:
"Sometimes the best engineering decision is removing complexity, not adding it."
After solving the AI pipeline, I faced the second major challenge: internet reliability.
Traditional markets in Indonesia often have poor mobile signal. Some areas get 3G at best. And even when signal is available, it can be intermittent, dropping in and out throughout the day.
For most web applications, "offline support" is a nice-to-have. For VoiceInvoice, offline was the primary mode of operation. A receipt system that only works when you have internet is useless for exactly the merchants who need it most.
I implemented an offline-first architecture using Dexie.js (a wrapper around the browser's IndexedDB):
1. Every invoice is first saved locally in IndexedDB with a sync_status: 'pending_sync' flag.
2. The useOfflineSync hook listens for the browser's online event and automatically syncs all pending invoices to Supabase, one by one.
3. An invoice that fails to sync keeps its pending_sync status and will be retried next time. Successfully synced invoices are marked as 'synced'.
A subtle but important detail: I added an isSyncingRef guard to prevent duplicate sync operations. Mobile browsers can fire multiple online events in rapid succession when signal is unstable, and without this guard, the same invoice could be posted to the server multiple times.
Building this system taught me that offline synchronization has edge cases that are deceptively complex. Partial sync failures, potential duplicate submissions, and conflict resolution are all problems that seem simple until you encounter them in production. My current implementation is adequate for this project's scale, but I now understand why companies like Notion and Linear invest enormous engineering effort into their sync engines.
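The guard and the retry behavior can be sketched in isolation. This is a simplified illustration, not the project's useOfflineSync hook. A plain array stands in for the Dexie table, `postInvoice` is a hypothetical uploader standing in for the Supabase call, and a closure variable plays the role of isSyncingRef.

```typescript
type SyncStatus = "pending_sync" | "synced";

interface LocalInvoice {
  id: number;
  sync_status: SyncStatus;
}

/**
 * Returns a sync function that uploads pending invoices one by one.
 * `invoices` is an in-memory stand-in for the local table;
 * `postInvoice` is a hypothetical uploader that rejects on failure.
 */
function createSyncer(
  invoices: LocalInvoice[],
  postInvoice: (invoice: LocalInvoice) => Promise<void>,
): () => Promise<number> {
  let isSyncing = false; // plays the role of isSyncingRef

  return async function syncPending(): Promise<number> {
    // A sync is already running, so this trigger is ignored.
    if (isSyncing) return 0;
    isSyncing = true;

    let syncedCount = 0;
    try {
      for (const invoice of invoices) {
        if (invoice.sync_status !== "pending_sync") continue;
        try {
          await postInvoice(invoice);
          invoice.sync_status = "synced";
          syncedCount += 1;
        } catch {
          // Leave the invoice as pending_sync so the next sync retries it.
        }
      }
    } finally {
      // Always release the guard, even if something unexpected throws.
      isSyncing = false;
    }
    return syncedCount;
  };
}
```

In a browser the returned function would be wired to the online event, for example `window.addEventListener("online", () => { void syncPending(); })`. Note that this guard only prevents overlapping runs in one tab. It does not cover the case where an upload succeeds on the server but the response is lost, which is one of the duplicate-submission edge cases mentioned above.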
I assumed that if my app worked on Chrome, it would work everywhere. I was wrong.
When I tested on a real Android phone, everything worked perfectly. When I tested on an iPhone, the voice recording silently failed.
The issue: Chrome on Android supports audio/webm;codecs=opus, but Safari on iOS does not.
The MediaRecorder API does not throw an error for unsupported MIME types on every browser; some just produce empty audio files. The fix was a runtime capability check:
MediaRecorder.isTypeSupported('audio/webm;codecs=opus') ? 'audio/webm;codecs=opus' : 'audio/webm'
Two lines of code. Hours of debugging. And the bug was completely invisible in the iOS simulator; it only appeared on physical devices.
This experience permanently changed my testing habits. Mobile web features must be tested on real hardware. There is no substitute.
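The capability check generalizes to trying a list of candidates in order. The sketch below is an assumption-laden variant of the fix, not the code from the app. The candidate list is mine, and audio/mp4 is included because, as far as I know, that is the container Safari's MediaRecorder records. The support check is passed in as a function so the logic can run outside a browser.

```typescript
// Assumed candidate order; adjust to whatever the backend can actually decode.
const CANDIDATE_MIME_TYPES: string[] = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

/**
 * Returns the first candidate the environment claims to support,
 * or undefined so the caller can let the browser pick its default.
 */
function pickRecordingMimeType(
  isTypeSupported: (mimeType: string) => boolean,
  candidates: string[] = CANDIDATE_MIME_TYPES,
): string | undefined {
  return candidates.find((mimeType) => isTypeSupported(mimeType));
}

// Browser usage (commented out because MediaRecorder does not exist in Node):
// const mimeType = pickRecordingMimeType((t) => MediaRecorder.isTypeSupported(t));
// const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```

Because some browsers produce an empty file instead of throwing, it may also be worth rejecting a recording whose blob size is zero before uploading it.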
When a merchant taps the record button and starts speaking, they need to know the system is actually "hearing" them. Silence and a spinning loader create anxiety, especially for users who aren't tech-savvy.
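I implemented a dual recording strategy: MediaRecorder captures the raw audio that is sent to Gemini, while the Web Speech API runs alongside it to show a live transcript preview.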
The live preview is not used for the final receipt; Gemini processes the raw audio independently. The preview exists purely to build user confidence. When merchants see their words appearing on screen as they speak, they trust the system.
If Web Speech API isn't available (some browsers don't support it), the recording continues without the preview. No functionality is lost, just the visual feedback.
This taught me something about product design: perceived reliability is as important as actual reliability. A system that works perfectly but gives no feedback feels broken. A system that shows progress feels trustworthy.
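The graceful fallback for the live preview comes down to feature detection. Here is a minimal sketch, assuming the recognizer is exposed as either `SpeechRecognition` or the prefixed `webkitSpeechRecognition`. The structural interface, the function names, and the `id-ID` locale are my assumptions, and the global object is passed in as a parameter so the logic can run without a browser.

```typescript
// Minimal structural type covering only the recognizer members this sketch touches.
interface LivePreviewRecognizer {
  interimResults: boolean;
  continuous: boolean;
  lang: string;
  start(): void;
}

type RecognizerCtor = new () => LivePreviewRecognizer;

/** Looks up the recognizer constructor, including the prefixed variant. */
function getRecognizerCtor(globalObject: Record<string, unknown>): RecognizerCtor | undefined {
  const ctor = globalObject["SpeechRecognition"] ?? globalObject["webkitSpeechRecognition"];
  return typeof ctor === "function" ? (ctor as unknown as RecognizerCtor) : undefined;
}

/**
 * Starts the live preview when the API exists.
 * Returns undefined otherwise, in which case recording simply continues without it.
 */
function startLivePreview(globalObject: Record<string, unknown>): LivePreviewRecognizer | undefined {
  const Ctor = getRecognizerCtor(globalObject);
  if (!Ctor) return undefined;

  const recognizer = new Ctor();
  recognizer.interimResults = true; // show words while the merchant is still speaking
  recognizer.continuous = true;
  recognizer.lang = "id-ID"; // assumption: Indonesian locale
  recognizer.start();
  return recognizer;
}

// Browser usage: startLivePreview(window as unknown as Record<string, unknown>);
```

Keeping the preview behind a single optional return value means the recording path never has to know whether the preview exists.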
In Indonesia, WhatsApp is the universal communication tool, including for businesses. Many merchants already send order confirmations and receipts to customers via WhatsApp.
Instead of building a complex PDF generation system, I kept it simple. The NotaPreview component generates a plain-text formatted receipt and offers two actions:
One of them opens wa.me/?text= with the pre-formatted receipt.
No PDF library. No print server. Just text that works everywhere.
This was another lesson in designing for the user's actual workflow instead of the technically impressive solution.
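The plain-text approach fits in a couple of small functions. This is an illustrative sketch, not the NotaPreview component. The receipt layout and function names are hypothetical; only the wa.me/?text= link format and the item fields come from the text above.

```typescript
interface ReceiptLine {
  item_name: string;
  qty: number;
  unit: string;
  unit_price: number;
  subtotal: number;
}

/** Formats items as plain text, one line per item plus a total (layout is an assumption). */
function formatReceiptText(lines: ReceiptLine[]): string {
  const rows = lines.map(
    (l) => `${l.item_name} ${l.qty} ${l.unit} x ${l.unit_price} = ${l.subtotal}`,
  );
  const total = lines.reduce((sum, l) => sum + l.subtotal, 0);
  return [...rows, `TOTAL: ${total}`].join("\n");
}

/** Builds a WhatsApp share link; the text must be URL-encoded so newlines survive. */
function buildWhatsAppShareUrl(receiptText: string): string {
  return `https://wa.me/?text=${encodeURIComponent(receiptText)}`;
}
```

Opening the resulting URL lets the merchant pick a chat and send the receipt as an ordinary message, so the customer needs nothing beyond WhatsApp itself.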
Building VoiceInvoice fundamentally changed how I think about software engineering.
I believed:
I learned:
The most valuable insight from this project isn't about Gemini or Dexie.js or MediaRecorder MIME types.
It's that good technology should adapt to the realities of its users, not the other way around.
I started this project thinking about AI architectures and model capabilities. I finished it thinking about wet hands gripping phones, unreliable 3G signals, and the cognitive load of serving three customers simultaneously.
The best engineering decisions I made weren't about choosing the right model or the right framework. They were about understanding that a market merchant doesn't care whether the system uses one AI model or five. They care that they can say "beras 5 kilo tujuh puluh ribu" and get a receipt they can WhatsApp to their customer before the next person in line gets impatient.
That shift in thinking, from "what's technically impressive" to "what actually helps", is something I'll carry into every project I build from here.