Introduction
"Just ask your database questions in plain English."
This has been the Holy Grail of business intelligence for two decades. Every few years, a new technology promises to finally make it real: natural language processing in the 2000s, voice assistants in the 2010s, and now LLMs.
With GPT-4 and Claude able to write complex code, surely text-to-SQL is finally solved, right?
I spent a weekend building a proof-of-concept to find out. Spoiler: It worked beautifully for exactly 10 queries. Then reality hit.
This is the story of what I built, where it breaks at scale, and what I discovered about how production systems actually solve this. The gap between demo and production taught me more than the build itself.
Part 1: What I Built
The Architecture
I built a text-to-SQL agent with a straightforward architecture:
Frontend: Streamlit chat interface
Orchestration: LangChain managing workflow
Brain: Claude Sonnet 4 for reasoning
Database: SQLite with synthetic SaaS data (1,000 users, 50K+ transactions)
The four-step workflow:
- User asks question in natural language
- Claude generates SQL query
- Execute query and create visualization
- Return results with explanation
The User Experience
User: "Show me monthly revenue trends for the last 2 years"
Agent:
- Generates SQL query
- Executes against database
- Creates line chart
- Returns: "Based on the analysis, revenue peaked in December 2024
at $25,943, with a concerning downward trend starting in 2025..."
The agent also showed the exact SQL query used, providing full transparency.
Initial Results: It Worked!
The system handled every query I threw at it:
- "What was total revenue last quarter?"
- "Show me top 10 customers by revenue"
- "How many active subscriptions do we have?"
It generated accurate SQL, handled JOINs correctly, and created visualizations automatically. Response time was 10-15 seconds per query.
I was thrilled. Then I started thinking about scale.
Part 2: Why This Would Break in Real Production
Problem 1: The Token Limit Wall
What happened: I'm passing the entire database schema with every query. That's fine for 3 tables (~500 tokens). But at 100 tables the schema alone runs ~50-100K tokens, crowding out everything else in the prompt, and at several hundred tables it can blow past the context window entirely.
The math:
- Claude Sonnet 4: 200K token context window
- Medium enterprise DB: 200-500 tables
- Average table schema: 500-1000 tokens
- Result: Can't fit schema in context
Real-world impact:
- Demo: 3 tables ✅
- Startup with 50 tables: Barely works ⚠️
- Enterprise with 300 tables: Completely broken ❌
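The math above is easy to sanity-check in a few lines. The per-table token counts are rough assumptions from the article (~500-1,000 tokens), not measurements:

```python
# Rough schema-size math for prompt budgeting.
CONTEXT_WINDOW = 200_000  # Claude Sonnet context window, in tokens

def schema_tokens(n_tables: int, tokens_per_table: int = 750) -> int:
    """Estimate tokens needed to inline a full schema in the prompt."""
    return n_tables * tokens_per_table

for n in (3, 50, 300):
    used = schema_tokens(n)
    print(f"{n:>3} tables: ~{used:,} tokens "
          f"({used / CONTEXT_WINDOW:.0%} of the context window)")
```

At 300 tables the schema alone is ~225K tokens, already over the window before the question, instructions, or conversation history are added.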
Problem 2: Cost Spirals Out of Control
Costs per query (my agent makes ~3 API calls per question):
- Input: 2K tokens (schema + question) ≈ $0.006 per call
- Output: 500 tokens (SQL + explanation) ≈ $0.0075 per call
- Total: ~$0.02-0.05 per query
At scale:
- 100 queries/day: $3-5/day = $90-150/month (manageable)
- 1,000 queries/day: $600-1,500/month (yikes)
- 10,000 queries/day: $6K-15K/month (unsustainable)
Compare that to a traditional SQL query, whose marginal compute cost is effectively zero: the LLM approach is roughly 1,000x more expensive per query.
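The scaling table is just multiplication, using the ~$0.02-0.05 per-query range from above:

```python
# Monthly LLM spend as a function of query volume.
# Uses the article's $0.02-0.05 per-query cost range.
def monthly_cost(queries_per_day: int, cost_per_query: float, days: int = 30) -> float:
    return queries_per_day * days * cost_per_query

for qpd in (100, 1_000, 10_000):
    low, high = monthly_cost(qpd, 0.02), monthly_cost(qpd, 0.05)
    print(f"{qpd:>6} queries/day: ${low:,.0f}-${high:,.0f}/month")
```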
Problem 3: Latency Kills Adoption
My query timeline:
- 3 Claude API calls: ~2-3 seconds each = 6-9 seconds
- Database query execution: 50-200ms
- Chart generation: 500ms
- Total: 7-10 seconds per query
User expectations:
- Google search: <1 second
- SQL query in BI tool: 1-3 seconds
- My LLM agent: 10-15 seconds ❌
Problem 4: The "Last Mile" Problem
Some additional gaps in the real world:
- Business logic: "Revenue" = completed transactions only, exclude refunds, internal orders
- Data quality rules: Filter out test users with email ending in @test.com
- Temporal context: "Last quarter" depends on fiscal calendar, not calendar year
- Relationships: Some JOINs are valid but nonsensical in business context
Example failure:
User: "Show me revenue last quarter"
My agent: SELECT SUM(amount) FROM transactions WHERE date > '2024-10-01'
Problem: Includes refunds, test transactions, wrong quarter definition
Correct query: Much more complex with 5+ WHERE clauses
The LLM can't infer this from schema alone.
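Here's a toy demonstration of that gap using stdlib sqlite3. The table and column names are invented for illustration; the point is how far the naive sum drifts from the business-aware one:

```python
# Naive SUM(amount) vs. revenue with business rules applied.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (amount REAL, status TEXT, type TEXT, user_email TEXT);
INSERT INTO transactions VALUES
  (100, 'completed', 'sale',   'alice@example.com'),
  (50,  'completed', 'refund', 'bob@example.com'),
  (75,  'pending',   'sale',   'carol@example.com'),
  (200, 'completed', 'sale',   'qa@test.com');
""")

naive = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]

correct = conn.execute("""
  SELECT SUM(amount) FROM transactions
  WHERE status = 'completed'        -- only completed sales count
    AND type != 'refund'            -- refunds are not revenue
    AND user_email NOT LIKE '%@test.com'  -- drop internal test accounts
""").fetchone()[0]

print(naive, correct)  # the naive query overstates revenue more than 4x here
```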
Part 3: What Production Systems Actually Do
So I researched how companies solve this at scale. It turns out the answer isn't "throw more LLM at it."
Solution 1: Enterprise Platforms (Snowflake Cortex, Databricks Genie)
Semantic layer: Pre-define metrics once
revenue:
  sql: SUM(amount) WHERE status = 'completed' AND type != 'refund'
  description: "Total revenue from completed sales"
Vector-based retrieval: Embed table descriptions, retrieve only the 3-5 relevant tables.
Query optimization: Rewrite LLM-generated SQL for performance.
Aggressive caching: 90%+ cache hit rate.
The results:
- Cost: $0.001-0.01 per query (100x cheaper)
- Latency: <2 seconds
- Accuracy: 90%+
- Scale: Handles 1000+ table databases
Pros: Enterprise security/compliance, massive scale, vendor support, deep platform integration.
Cons: Expensive ($50K-500K+/year), vendor lock-in, only works with their platform.
Best for: Fortune 500s, large enterprises already on Snowflake/Databricks
Solution 2: Specialized Startups (Seek AI, Patterns, Outerbase)
Fine-tuning: Train a model on your specific schemas and past queries.
Intent classification: A cheap model (~$0.0001/query) routes 80% of questions to templates, 20% to the LLM.
Query validation: Check for cartesian joins and missing WHERE clauses.
Human-in-the-loop: Uncertain queries go to Slack for approval.
User: "Show me revenue"
↓
Intent classifier: "Simple aggregation" (no LLM needed)
↓
Template: SELECT SUM(amount) FROM transactions WHERE...
↓
Result in 500ms
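The routing step above can be sketched in a few lines. Keyword matching here is a toy stand-in for a real intent classifier, and the template SQL is illustrative:

```python
# Tiered routing: cheap classification sends common questions to SQL
# templates; only unmatched questions escalate to the LLM.
TEMPLATES = {
    "revenue": "SELECT SUM(amount) FROM transactions WHERE status = 'completed'",
    "active subscriptions": "SELECT COUNT(*) FROM subscriptions WHERE status = 'active'",
}

def route(question: str) -> tuple[str, str]:
    q = question.lower()
    for keyword, sql in TEMPLATES.items():
        if keyword in q:
            return ("template", sql)  # fast path: no LLM call, no LLM cost
    return ("llm", "<call the LLM to generate SQL>")  # slow, expensive path

print(route("Show me revenue"))
print(route("Which cohorts churned fastest after the price change?"))
```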
The results: 90%+ accuracy (vs 60% with base models), $0.005-0.02 per query, 1-3 seconds latency, learns from corrections.
Best for: Series A-C startups, mid-size companies (100-1K employees)
Solution 3: Cloud-Native (BigQuery + Gemini, Athena + Bedrock)
Uses existing cloud provider's LLM, schema embeddings stored in vector DB, query logs as training data, pay-as-you-go pricing.
The tradeoff: Lower accuracy (60-70%) but "good enough". Fewer features but simpler. Cheaper but more DIY.
Best for: AWS/GCP-native companies, cost-conscious teams
Solution 4: Open Source (Transform, Cube.js, dbt, Metriql)
Build semantic layer with dbt or Cube.js, custom query interface with Claude/GPT API, self-hosted caching (Redis), mix-and-match components.
Frontend: Custom Streamlit app
Semantic layer: dbt metrics
Query generation: Claude API
Caching: Redis
Data warehouse: Existing (Snowflake/BigQuery)
Total cost: $5K-20K/year
Best for: Technical teams, early-stage startups, open-source advocates
Part 4: The Decision Framework
Key Questions to Ask:
1. What's your scale?
- <100 queries/day: Any solution works
- 100-1K/day: Startup or cloud-native
- 1K-10K/day: Enterprise or open source
- 10K+/day: Enterprise only
2. What's your budget?
- <$5K/year: Open source only
- $5K-20K/year: Open source or cloud-native
- $20K-100K/year: Startups or cloud-native
- $100K+/year: Enterprise platforms
3. How complex is your data?
- <20 tables: Any solution
- 20-100 tables: Need vector retrieval
- 100-500 tables: Enterprise or custom
- 500+ tables: Enterprise only
4. Do you have engineering resources?
- No → Enterprise or startup solution
- Yes, 1 engineer → Cloud-native or startup
- Yes, 2+ engineers → Open source feasible
Part 5: What I'd Do Differently (v2 Roadmap)
1. Semantic Layer First
Build business metric definitions before any LLM:
metrics:
  mrr:
    sql: SUM(monthly_price) WHERE status = 'active'
    description: "Monthly Recurring Revenue"
  churn_rate:
    sql: (canceled / total) * 100
    description: "Customer churn percentage"
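At query time, the semantic layer is just a lookup that runs before any LLM call. A minimal sketch, mirroring the YAML above as a plain dict (metric names and SQL fragments are illustrative):

```python
# Resolve a metric name to its canonical SQL before involving an LLM.
METRICS = {
    "mrr": {
        "sql": "SUM(monthly_price) WHERE status = 'active'",
        "description": "Monthly Recurring Revenue",
    },
    "churn_rate": {
        "sql": "(canceled / total) * 100",
        "description": "Customer churn percentage",
    },
}

def resolve_metric(name: str) -> str:
    """Return the canonical SQL fragment for a metric, or fail loudly."""
    try:
        return METRICS[name]["sql"]
    except KeyError:
        raise ValueError(f"Unknown metric: {name!r}; define it in the semantic layer")

print(resolve_metric("mrr"))
```

The key property: "MRR" means exactly one thing, defined once, no matter how the user phrases the question.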
2. Vector-Based Schema Retrieval
Embed all table/column descriptions. User question → semantic search → retrieve 3-5 relevant tables only. Pass only relevant schema to LLM.
Impact: Handles 500+ table databases
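A toy version of that retrieval step: score each table's description against the question and keep only the top-k. Word overlap stands in for real embeddings here, and the table docs are invented:

```python
# Schema retrieval sketch: pass the LLM only the tables relevant to the question.
TABLE_DOCS = {
    "transactions": "payments revenue amount refunds completed sales",
    "subscriptions": "plan monthly_price active canceled churn mrr",
    "users": "email signup country test accounts",
    "audit_log": "internal admin actions security events",
}

def retrieve_tables(question: str, k: int = 2) -> list[str]:
    q_words = set(question.lower().split())
    scores = {t: len(q_words & set(doc.split())) for t, doc in TABLE_DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve_tables("show me revenue from completed sales"))
```

With real embeddings the scoring is a vector similarity search, but the shape is the same: 500 tables in the catalog, 3-5 in the prompt.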
3. Aggressive Caching
@cache(ttl=3600)  # cache results for 1 hour
def execute_query(sql):
    return db.query(sql)
Impact: 80-90% cost reduction
4. Tiered Approach
Simple query (aggregation) → Template (0ms, $0)
Medium query → Small model (500ms, $0.001)
Complex query → Claude (3s, $0.02)
Impact: 85% of queries never hit expensive LLM
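The expected per-query cost under that mix is a weighted average. The tier prices come from the tiers above; the exact traffic split (85% never reaching Claude) is an assumption:

```python
# Expected per-query cost under tiered routing (shares are assumptions).
tiers = [
    ("template",    0.60, 0.0),    # free SQL templates
    ("small model", 0.25, 0.001),
    ("claude",      0.15, 0.02),
]
expected = sum(share * cost for _, share, cost in tiers)
print(f"${expected:.5f} per query on average")
```

Versus a flat $0.02 per query, that's roughly a 6x saving before caching is even applied.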
5. Query Validation
def count_joins(sql):
    return sql.upper().count(" JOIN ")

def validate_query(sql):
    if "SELECT * FROM" in sql and "LIMIT" not in sql:
        return "Error: Must include LIMIT"
    if count_joins(sql) > 5:
        return "Error: Too many JOINs"
    return "Valid"
Impact: Prevent expensive/dangerous queries
6. Fine-Tuning
Collect 500-1000 successful query pairs. Fine-tune Llama 3 or Mistral on company data.
Impact: 90% accuracy, $0.005/query (4x cheaper)
Part 6: Key Takeaways
1. Demos ≠ Production. My demo worked perfectly for 10 queries. Production needs 10 million.
2. Cost Compounds Quickly. $0.02 per query seems cheap until you multiply by 1000 queries/day x 365 days.
3. LLMs Are Part of the Solution, Not the Whole Solution. Production systems minimize LLM usage through caching, templates, and tiered models.
4. Business Logic > Schema. The schema tells you table structure. Business logic tells you what queries actually mean.
5. There's No One-Size-Fits-All. Startups need different solutions than enterprises. Technical teams have different options than non-technical ones.
The Pattern
Every successful production system:
- Encodes business logic once (semantic layer)
- Minimizes LLM calls (caching + templates)
- Uses vector search for scale
- Combines AI with traditional engineering
Version 2.0: The RAG Transformation
After publishing v1.0, I rebuilt the agent with Retrieval-Augmented Generation (RAG) to address the core accuracy issues.
What Changed
Knowledge Base (97 chunks, ~63KB):
- metrics.md: 15+ SaaS metrics with formulas (MRR, churn, LTV)
- schema_docs.md: complete table documentation & relationships
- business_logic.md: 50+ query rules & best practices
- examples.json: 30 example query pairs
- sqlite_syntax.md: database-specific syntax rules
RAG Pipeline:
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Semantic search retrieves the top-k relevant docs
- Retrieved context injected into LLM prompts
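The retrieval-and-inject step can be sketched without the real stack. Word overlap stands in for ChromaDB + sentence-transformer embeddings, and the chunk text is invented for illustration:

```python
# RAG sketch: pick the most relevant knowledge-base chunks, build the prompt.
CHUNKS = [
    "MRR = SUM(monthly_price) over active subscriptions, normalized by billing cycle",
    "Exclude refunds and internal test accounts from all revenue metrics",
    "SQLite has no DATE_TRUNC; use strftime('%Y-%m', date) for monthly grouping",
]

def top_k(question: str, k: int = 2) -> list[str]:
    q = set(question.lower().split())
    return sorted(CHUNKS,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def build_prompt(question: str) -> str:
    context = "\n".join(top_k(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nWrite the SQL:"

print(build_prompt("what is our mrr by billing cycle"))
```

The model never sees the whole 97-chunk knowledge base; it sees only the handful of chunks that match the question.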
The Impact
Accuracy improvement: 50% → 90%
Before RAG (v1.0):
Query: "What's our MRR growth rate?"
Generated: SELECT SUM(amount) FROM subscriptions ❌
Issue: Doesn't account for billing cycles, churn, or time periods
After RAG (v2.0):
Query: "What's our MRR growth rate?"
Retrieved context: MRR formula, billing cycle handling, growth calculation
Generated: ✅ Correct SQL with:
- Billing cycle normalization (annual/12, quarterly/3)
- Active subscription filtering
- Month-over-month comparison
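The billing-cycle normalization in that fix is easy to verify with stdlib sqlite3. The schema and prices here are invented; the CASE expression is the pattern the retrieved context describes:

```python
# Billing-cycle normalization for MRR: annual / 12, quarterly / 3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (price REAL, billing_cycle TEXT, status TEXT);
INSERT INTO subscriptions VALUES
  (1200, 'annual',    'active'),
  (300,  'quarterly', 'active'),
  (50,   'monthly',   'active'),
  (99,   'monthly',   'canceled');
""")

mrr = conn.execute("""
  SELECT SUM(CASE billing_cycle
               WHEN 'annual'    THEN price / 12.0
               WHEN 'quarterly' THEN price / 3.0
               ELSE price
             END)
  FROM subscriptions
  WHERE status = 'active'   -- canceled subscriptions contribute nothing
""").fetchone()[0]

print(mrr)  # 1200/12 + 300/3 + 50
```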
Key Learnings
- Context is everything: the LLM had the reasoning ability; it just lacked domain knowledge.
- Quality > quantity: 97 well-structured chunks outperformed throwing the entire schema at the model.
- Examples matter: the examples.json file had an outsized impact on complex queries.
- Database-specific syntax: adding SQLite rules eliminated an entire class of errors.
Conclusion
Self-serve analytics with LLMs isn't solved. It's just getting started. The gap between "working demo" and "production system" is enormous. But understanding that gap is the first step to bridging it.