How a capstone project with 5 people, 10 AI agents, and a 10,000-line monolith taught me that the hardest part of building AI systems isn't the AI; it's the architecture, the teamwork, and the trust.
The idea sounded simple: build an AI system that predicts when industrial equipment will fail before it actually does.
Five months later, I was staring at a 10,000-line Python file containing 10 AI agents tangled together in ways that broke every time someone pushed code. The event loop crashed when the chatbot and REST API tried to run simultaneously. The database agent could technically execute arbitrary SQL through the LLM, a security nightmare we hadn't noticed for weeks.
This is the story of how Sentinel started as an ambitious capstone project, nearly collapsed under its own complexity, and eventually became a production-grade predictive maintenance platform through painful lessons about architecture, team leadership, and what it actually means to build AI for real users.
Our capstone project (Dicoding × Accenture ASAH Program) gave us freedom to choose our problem domain. As a team of 5 with varying skill levels, we needed something ambitious enough to challenge us but concrete enough to deliver.
We chose predictive maintenance because the cost asymmetry is staggering:
The question we set out to answer:
"Can we build a system where a maintenance engineer asks 'Which machines need attention this week?' in natural language and gets an answer backed by ML predictions, real sensor data, and actionable recommendations?"
The first component I built was the XGBoost failure prediction model. The training data came from UCI's predictive maintenance dataset (10,000 records), which I expanded with a synthetic data generator to simulate 70 machines over 7 months (~250,000 records).
Raw sensor data (temperature, RPM, torque, tool wear) isn't enough for reliable prediction. I engineered 44 features:
The default classification threshold of 0.5 gave us 99% accuracy but missed 15% of real failures. In a domain where missing one failure costs $260K/hour, that 15% is unacceptable.
I tuned the threshold to 0.3108, optimizing for recall:
| Metric | Default (0.5) | Tuned (0.3108) |
|---|---|---|
| Recall | ~85% | 95.45% |
| Precision | ~65% | 42.57% |
| ROC-AUC | 98.5% | 99.01% |
Yes, precision dropped, which means more false alarms. But in this domain, a false positive means an unnecessary inspection ($500 to $2,000). A false negative means catastrophic failure ($260,000/hour). The math is obvious.
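The threshold trade-off can be made concrete with a small sketch. Assuming a list of true labels and predicted failure probabilities from a validation set, the helper below picks the highest threshold that still reaches a target recall. The function names and the toy data are illustrative assumptions, not the project's code, and they do not reproduce the 0.3108 figure.

```python
def recall_precision(y_true, scores, threshold):
    """Return (recall, precision) when predicting failure for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def pick_threshold(y_true, scores, target_recall):
    """Highest candidate threshold whose recall meets the target, else None."""
    for threshold in sorted(set(scores), reverse=True):
        recall, _ = recall_precision(y_true, scores, threshold)
        if recall >= target_recall:
            return threshold
    return None


# Toy validation data (hypothetical): 1 = failure, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.92, 0.61, 0.34, 0.45, 0.10, 0.05, 0.33, 0.55]

threshold = pick_threshold(y_true, scores, target_recall=1.0)
recall, precision = recall_precision(y_true, scores, threshold)
default_recall, default_precision = recall_precision(y_true, scores, 0.5)
```

On this toy data the default 0.5 threshold misses one of the four failures, while the lower threshold catches all of them at the price of one false alarm, which mirrors the recall-for-precision trade described above.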
This was my first lesson in applied ML: the right metric depends on the domain, not the textbook.
The core innovation of Sentinel is its multi-agent AI chatbot. Instead of a single LLM answering questions, the system uses 10 specialized agents, each with a specific capability:
These agents are orchestrated by a LangGraph state machine with 8 nodes: Prompt Refiner → Goal Decomposer → Intent Classifier → Tool Selector → Tool Executor → Agent Collaborator → Result Validator → Answer Generator.
Each node is a pure function: state in, state out. The state object (AgentState) carries 30+ fields including conversation history, detected language, intent classification, selected tools, tool results, reasoning steps, confidence scores, and sub-task progress.
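The "state in, state out" idea can be sketched without any framework. The example below is a plain-Python illustration, not LangGraph code: the `AgentState` fields shown are a small assumed subset, and the keyword rule in `intent_classifier` is a hypothetical stand-in for an LLM call.

```python
from typing import Callable, List, TypedDict


class AgentState(TypedDict, total=False):
    # A tiny assumed subset of fields; the real AgentState carries 30+.
    query: str
    intent: str
    selected_tools: List[str]
    reasoning_steps: List[str]


Node = Callable[[AgentState], AgentState]


def intent_classifier(state: AgentState) -> AgentState:
    # Hypothetical keyword rule standing in for an LLM call.
    intent = "prediction" if "fail" in state["query"].lower() else "general"
    steps = state.get("reasoning_steps", []) + [f"intent={intent}"]
    return {**state, "intent": intent, "reasoning_steps": steps}


def tool_selector(state: AgentState) -> AgentState:
    tools = ["ml_prediction"] if state["intent"] == "prediction" else []
    steps = state.get("reasoning_steps", []) + [f"tools={tools}"]
    return {**state, "selected_tools": tools, "reasoning_steps": steps}


def run_pipeline(state: AgentState, nodes: List[Node]) -> AgentState:
    for node in nodes:
        state = node(state)  # each node returns a new dict; the input is not mutated
    return state


initial: AgentState = {"query": "Which machines will fail this week?"}
final = run_pipeline(initial, [intent_classifier, tool_selector])
```

Because every node returns a fresh dictionary, each one can be unit-tested by passing in a hand-built state and inspecting what comes back.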
Here's where the story gets real.
In the early weeks, I wrote all 10 agents in a single main.py file. It started at 2,000 lines. Then 5,000. Then 10,000. Alongside it, backend.py grew to 2,000+ lines handling the REST API, database queries, and authentication.
It worked, barely. But the problems accumulated:
FastAPI is async. The LangGraph agents used synchronous LLM calls internally. When both ran in the same process, they fought over Python's asyncio event loop. Sometimes the chatbot would freeze mid-response because the REST API handler had claimed the event loop.
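To make this failure mode concrete, here is a minimal standard-library sketch (no FastAPI or LangGraph involved). `blocking_llm_call` is a hypothetical stand-in for a synchronous LLM request, and `count_ticks` stands in for other requests that need the event loop. Calling the blocking function directly starves everything else, while `asyncio.to_thread` keeps the loop responsive.

```python
import asyncio
import time


def blocking_llm_call(seconds: float) -> str:
    # Stand-in for a synchronous LLM request (hypothetical).
    time.sleep(seconds)
    return "answer"


async def count_ticks(stop: asyncio.Event) -> int:
    # Simulates other work that needs the event loop to make progress.
    ticks = 0
    while not stop.is_set():
        await asyncio.sleep(0.01)
        ticks += 1
    return ticks


async def measure(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker a chance to start
    if offload:
        await asyncio.to_thread(blocking_llm_call, 0.3)  # runs in a worker thread
    else:
        blocking_llm_call(0.3)  # blocks the loop: nothing else can run
    stop.set()
    return await ticker


ticks_blocked = asyncio.run(measure(offload=False))
ticks_offloaded = asyncio.run(measure(offload=True))
print(f"blocked: {ticks_blocked} ticks, offloaded: {ticks_offloaded} ticks")
```

The blocked run records almost no ticks, which is the same freeze a chatbot response would experience while another handler holds the loop.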
With everything in one file, you couldn't test one agent without loading all 10. Changing the prompt in the Ticket Creator could break the Database Query agent because they shared global variables.
The Database Query agent used the LLM to generate SQL from natural language. But the database connection used the main application credentials-full read/write access. If someone crafted a prompt that tricked the LLM into generating a DROP TABLE statement, the agent would happily execute it.
We didn't notice this for weeks. When I finally realized the risk, it kept me up at night.
With three weeks left before the capstone deadline, I made a risky decision: refactor the entire backend into a clean modular architecture. The team was skeptical: why break what "works"?
I restructured everything into back-end-refactor/:
    app/
    ├── api/        # REST endpoints (each module in its own file)
    ├── services/   # Shared business logic
    ├── schemas/    # Pydantic V2 strict validation
    ├── db/         # SQLAlchemy ORM models
    ├── ml/         # predictor.py (15K) + retrainer.py (42K)
    ├── agents/     # graph.py (41K) + state.py + tools.py + helpers.py
    └── core/       # Configuration, constants, LLM client singleton
Three critical fixes came from the refactoring:
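1. Database Security. The Database Query agent now uses a dedicated read-only PostgreSQL account. Even if the LLM generates a destructive statement such as DROP TABLE, the database refuses to execute it. This single change closed the destructive-write path, although a read-only account can still read whatever it has been granted, so its SELECT permissions matter too.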
2. Event Loop Isolation. By separating the API routes from the agent execution, the async conflicts disappeared. The LangGraph graph runs in its own context, and the FastAPI endpoint streams results via SSE without competing for the event loop.
3. Centralized Tool Registration. Instead of each agent being hard-coded into the orchestration logic, all 10 tools are registered in a TOOL_MAP dictionary. The execute_tool() function handles lookup, execution, and error handling uniformly. Adding a new agent means adding one entry to the map, with no changes to the graph logic.
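A minimal sketch of the registry pattern from the third fix, assuming a simple result-dictionary convention. `TOOL_MAP` and `execute_tool` are named in the text, but the two tool functions, their arguments, and the `ok`/`error` result shape are hypothetical placeholders rather than the project's actual implementation.

```python
from typing import Any, Callable, Dict


def ml_prediction(machine_id: str) -> Dict[str, Any]:
    # Hypothetical tool body; a real one would call the trained model.
    return {"machine_id": machine_id, "failure_probability": 0.8}


def ticket_creator(machine_id: str) -> Dict[str, Any]:
    # Hypothetical tool body; a real one would write to the ticket system.
    return {"ticket_for": machine_id, "status": "open"}


# One entry per tool; the graph logic never needs to know the details.
TOOL_MAP: Dict[str, Callable[..., Dict[str, Any]]] = {
    "ml_prediction": ml_prediction,
    "ticket_creator": ticket_creator,
}


def execute_tool(name: str, **kwargs: Any) -> Dict[str, Any]:
    """Look up a tool, run it, and normalize errors into a result dict."""
    tool = TOOL_MAP.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except Exception as exc:  # uniform error handling for every tool
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


ok = execute_tool("ml_prediction", machine_id="M-07")
missing = execute_tool("does_not_exist")
bad_args = execute_tool("ticket_creator")  # missing required argument
```

With this shape, a bad tool name or a tool that raises never propagates an exception into the orchestration loop; the caller only ever inspects the `ok` flag.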
One of the scariest bugs in the multi-agent system was infinite loops.
The LangGraph pipeline has cycles by design: if the Result Validator determines the answer is incomplete, it routes back to the Tool Selector for another round. This is intentional for complex queries that require multiple tool calls.
But without limits, a query like "Prepare all machines for next week" could trigger an endless loop: decompose → select tools → execute → validate → "not complete yet" → select more tools → execute → validate → forever.
I added three guardrails:
- MAX_SUB_TASK_ITERATIONS - Maximum number of sub-task cycles per query
- MAX_CONFIRMATION_ATTEMPTS - Maximum times the system asks the user for confirmation (prevents infinite "Are you sure?" loops)
- LANGGRAPH_RECURSION_LIMIT - Hard limit on total graph traversals

These aren't elegant. They're blunt safety valves. But they prevent the system from consuming unlimited API tokens and response time. In production multi-agent systems, guardrails aren't optional; they're the first thing you should design.
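A plain-Python sketch of the first guardrail, assuming an execute-then-validate loop. The value of `MAX_SUB_TASK_ITERATIONS` is an assumption (the actual limit is not stated here), and `run_with_limit` with its callbacks is a hypothetical helper, not the project's graph code.

```python
MAX_SUB_TASK_ITERATIONS = 5  # assumed value; the actual limit is not stated


def run_with_limit(validate, execute_round, max_iterations=MAX_SUB_TASK_ITERATIONS):
    """Repeat execute -> validate until complete or the iteration cap is hit."""
    results = []
    for iteration in range(1, max_iterations + 1):
        results.append(execute_round(iteration))
        if validate(results):
            return {"complete": True, "iterations": iteration, "results": results}
    # Cap reached: stop and return a partial answer instead of looping forever.
    return {"complete": False, "iterations": max_iterations, "results": results}


# A validator that is never satisfied, like the runaway query described above.
never_done = run_with_limit(lambda r: False, lambda i: f"round {i}")

# A validator that is satisfied after two rounds.
done_early = run_with_limit(lambda r: len(r) >= 2, lambda i: f"round {i}")
```

Returning the partial results along with `complete: False` lets the caller tell the user what was finished rather than failing silently.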
A single agent's output isn't always trustworthy. The ML Prediction agent might say a machine has an 80% failure probability, but what if the model has been drifting and its recent accuracy is poor?
I implemented an Agent Collaboration Protocol where agents automatically consult each other:
Each consultation returns a response with a [CONFIDENCE] tag. The confidence scores from all consultations are aggregated into a composite confidence that determines whether the result is shown directly, needs user confirmation, or requires a retry.
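One way the aggregation step could look, as a sketch under stated assumptions: the tag format `[CONFIDENCE: 0.85]`, the use of a simple mean, and the 0.8 and 0.5 cut-offs are all hypothetical choices for illustration, since the actual tag syntax, weighting, and thresholds are not specified here.

```python
import re
from statistics import mean

# Assumed tag format, e.g. "[CONFIDENCE: 0.85]"; the real format may differ.
CONFIDENCE_TAG = re.compile(r"\[CONFIDENCE:\s*([01](?:\.\d+)?)\]")


def extract_confidence(response: str, default: float = 0.0) -> float:
    """Pull the score out of a consultation response; untagged responses get the default."""
    match = CONFIDENCE_TAG.search(response)
    return float(match.group(1)) if match else default


def decide(responses, show_at=0.8, confirm_at=0.5):
    """Average the per-agent scores and map the composite to an action."""
    composite = mean(extract_confidence(r) for r in responses)
    if composite >= show_at:
        return composite, "show"
    if composite >= confirm_at:
        return composite, "confirm"
    return composite, "retry"


responses = [
    "Failure probability is high. [CONFIDENCE: 0.9]",
    "Recent model accuracy has degraded. [CONFIDENCE: 0.5]",
]
composite, action = decide(responses)
```

Here a confident prediction paired with a doubtful model-health check lands in the middle band, so the user is asked to confirm instead of being shown the result outright.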
When I showed the first prototype to potential users (maintenance engineers at the Accenture workshop), their immediate reaction was skepticism:
"How do I know the AI isn't making this up?"
This feedback changed my approach. I added reasoning transparency: every node in the LangGraph pipeline logs what it's doing, why, and with what confidence. These reasoning steps are streamed to the frontend as the AI "thinks."
A typical response now shows:
Engineers could see exactly how the AI arrived at its answer. This wasn't technically difficult to implement; it was just a list of timestamped strings. But it was the single most important feature for user adoption.
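A sketch of what "a list of timestamped strings" can look like in practice. The `log_step` helper, the line format, and the example messages are assumptions for illustration; only the idea of each node recording what it did and with what confidence comes from the text.

```python
from datetime import datetime, timezone
from typing import List


def log_step(steps: List[str], node: str, message: str, confidence: float) -> List[str]:
    """Return a new list with one timestamped reasoning step appended."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
    return steps + [f"[{stamp}] {node}: {message} (confidence {confidence:.2f})"]


steps: List[str] = []
steps = log_step(steps, "Intent Classifier", "query is about failure risk", 0.93)
steps = log_step(steps, "Tool Selector", "chose the ML Prediction agent", 0.88)

for line in steps:
    print(line)  # a real system would stream these to the frontend instead
```

Keeping the steps as plain strings makes them trivial to stream over SSE and to store alongside the final answer for later auditing.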
This was my first experience as a team lead. Five people with different skill levels, working on a complex system with tight deadlines.
Things I learned about technical leadership:
Architecture decisions are team decisions. When I decided to refactor the monolith with three weeks left, I needed the team to understand why. I couldn't just declare it; I had to explain the event loop bug, show the SQL injection risk, and demonstrate how the modular structure would make their work easier.
Clear module boundaries prevent merge conflicts. After the refactoring, each team member owned a specific module. The frontend engineer worked on front-end/, the database team member handled app/db/ and app/api/, and I focused on app/agents/ and app/ml/. Merge conflicts dropped from daily to weekly.
Code reviews are teaching moments. Instead of just fixing bugs in PRs, I used reviews to explain why certain patterns work. This slowed down my velocity but increased the team's overall capability.
Documentation is not optional in a team project. I wrote a comprehensive README (57,000+ bytes, 1,279 lines) covering architecture, API endpoints, all 10 agents, the data pipeline, and deployment. When team members had questions at 2 AM, the documentation answered them instead of my phone.
Multi-Agent Systems Are Deceptively Hard - Getting 10 agents to work individually is straightforward. Getting them to collaborate without infinite loops, conflicting outputs, and security vulnerabilities is an engineering challenge that requires explicit guardrails, dependency rules, and safety limits.
Architecture Is a Team Productivity Multiplier - The monolith was a bottleneck not because the code was bad, but because it forced everyone to work in the same files. Clean architecture (separating API, services, agents, and ML into distinct modules) was the single biggest improvement to team velocity.
The Right Metric Changes Everything - Optimizing for accuracy (99%) would have been the textbook approach. Optimizing for recall (95.45%) was the right approach. Understanding why required understanding the domain, not just the math.
Trust Is a Feature, Not a Nice-to-Have - In safety-critical domains like industrial maintenance, engineers won't use a system they can't understand. Reasoning transparency, showing the AI's thinking process step by step, was technically trivial but strategically essential.
Impactful Software Is Not Defined by Complexity - We built 10 agents, 44 ML features, an 8-node pipeline, and a complete MLOps system. But none of that matters if the maintenance engineer can't ask a simple question and get a trustworthy answer. The measure of success isn't what you built; it's what the user can do with it.
Project Demo: YouTube Overview | AI Chatbot Demo