Introduction

"Just ask your database questions in plain English."

This has been the Holy Grail of business intelligence for two decades. Every few years, a new technology promises to finally make it real: natural language processing in the 2000s, voice assistants in the 2010s, and now LLMs.

With GPT-4 and Claude able to write complex code, surely text-to-SQL is finally solved, right?

I spent a weekend building a proof-of-concept to find out. Spoiler: It worked beautifully for exactly 10 queries. Then reality hit.

This is the story of what I built, where it breaks at scale, and what I discovered about how production systems actually solve this. The gap between demo and production taught me more than the build itself.

Part 1: What I Built

The Architecture

I built a text-to-SQL agent with a straightforward architecture:

Frontend: Streamlit chat interface
Orchestration: LangChain managing workflow
Brain: Claude Sonnet 4 for reasoning
Database: SQLite with synthetic SaaS data (1,000 users, 50K+ transactions)

The four-step workflow:

  1. User asks question in natural language
  2. Claude generates SQL query
  3. Execute query and create visualization
  4. Return results with explanation
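A minimal sketch of that loop, with the LLM call stubbed out (in the real agent, step 2 goes through LangChain to Claude, and step 3 also renders a chart):

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Stub for the LLM call: the real agent sends the question plus the
    # database schema to Claude and gets SQL back.
    return "SELECT COUNT(*) AS n FROM users"

def answer(question: str, conn: sqlite3.Connection):
    sql = generate_sql(question)          # step 2: LLM writes the SQL
    rows = conn.execute(sql).fetchall()   # step 3: execute it
    return sql, rows                      # step 4: results + explanation

# Toy database standing in for the synthetic SaaS data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(3)])

sql, rows = answer("How many users do we have?", conn)
```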

The User Experience

User: "Show me monthly revenue trends for the last 2 years"

Agent:
- Generates SQL query
- Executes against database
- Creates line chart
- Returns: "Based on the analysis, revenue peaked in December 2024
  at $25,943, with a concerning downward trend starting in 2025..."

The agent also showed the exact SQL query used, providing full transparency.

Initial Results: It Worked!

The system handled every query I threw at it.

It generated accurate SQL, handled JOINs correctly, and created visualizations automatically. Response time was 10-15 seconds per query.

I was thrilled. Then I started thinking about scale.

Part 2: Why This Would Break in Real Production

Problem 1: The Token Limit Wall

What happened: I was passing the entire database schema with every query. That works fine for 3 tables (~500 tokens). It breaks at 100 tables, where the schema alone can run to ~50K tokens and swallows most of the practical context budget before the question, few-shot examples, or conversation history are added.

The math: schema text grows linearly with table count, so every table you add taxes every query anyone ever runs.

Real-world impact: prompts crowd out examples, cost scales with schema size rather than question difficulty, and the largest enterprise schemas simply don't fit at all.
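The math, as back-of-envelope arithmetic (the per-table token count is an assumption; wide production tables cost far more than my three small demo tables):

```python
# Why stuffing the whole schema into every prompt breaks at scale.
demo_schema_tokens = 500       # my 3 small demo tables, total
tokens_per_wide_table = 500    # assumption: a real table with many columns
n_tables = 100

schema_tokens = n_tables * tokens_per_wide_table
print(schema_tokens)  # 50,000 tokens of overhead, paid on every single query
```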

Problem 2: Cost Spirals Out of Control

Costs per query: with the full schema in every prompt, each query runs roughly $0.02 in tokens.

At scale: 1,000 queries/day works out to about $20/day, $600/month, $7,300/year, and the bill grows linearly with both users and schema size.

Compare that to traditional SQL, where the marginal cost of a query is essentially $0. The LLM approach is on the order of 1000x more expensive.

Problem 3: Latency Kills Adoption

My query timeline: 10-15 seconds end to end, covering prompt assembly with the full schema, LLM reasoning, SQL execution, and chart rendering.

User expectations: BI dashboards and search boxes have trained people to expect answers in a second or two. At 10+ seconds per question, exploratory analysis dies and users go back to asking the data team.

Problem 4: The "Last Mile" Problem

The real world also adds gaps the schema never mentions: refund handling, test transactions, fiscal-calendar quarter definitions, and metric semantics that live only in analysts' heads.

Example failure:

User: "Show me revenue last quarter"
My agent: SELECT SUM(amount) FROM transactions WHERE date > '2024-10-01'

Problem: Includes refunds, test transactions, wrong quarter definition
Correct query: Much more complex with 5+ WHERE clauses

The LLM can't infer this from schema alone.
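To make the gap concrete, here is the same failure reproduced on a toy table. The `status` and `type` filters echo the semantic-layer rules discussed later; the `is_test` column and the specific rows are illustrative, not my actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    amount REAL, date TEXT, status TEXT, type TEXT, is_test INTEGER)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?,?,?,?,?)",
    [
        (100.0, "2024-11-05", "completed", "sale",   0),
        (-20.0, "2024-11-06", "completed", "refund", 0),  # refund
        (999.0, "2024-11-07", "completed", "sale",   1),  # test transaction
        ( 50.0, "2024-09-30", "completed", "sale",   0),  # previous quarter
    ],
)

# The naive LLM query: includes refunds, test rows, and an open-ended range
naive = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE date > '2024-10-01'"
).fetchone()[0]

# What the business actually means by "revenue last quarter"
correct = conn.execute("""
    SELECT SUM(amount) FROM transactions
    WHERE date >= '2024-10-01' AND date < '2025-01-01'
      AND status = 'completed'
      AND type != 'refund'
      AND is_test = 0
""").fetchone()[0]
```

The two numbers disagree badly, and nothing in the schema alone tells the model which one the user wanted.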

Part 3: What Production Systems Actually Do

So I researched how companies solve this at scale. It turns out the answer isn't "throw more LLM at it."

Solution 1: Enterprise Platforms (Snowflake Cortex, Databricks Genie)

Semantic layer: Pre-define metrics once

revenue:
  SUM(amount) WHERE status='completed' AND type!='refund'
  description: "Total revenue from completed sales"
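A minimal sketch of how such a definition gets consumed, assuming a plain dict registry rather than any vendor's actual format:

```python
# Hypothetical semantic-layer registry: metrics defined once, reused everywhere
METRICS = {
    "revenue": {
        "sql": "SUM(amount)",
        "table": "transactions",
        "filters": ["status = 'completed'", "type != 'refund'"],
        "description": "Total revenue from completed sales",
    }
}

def compile_metric(name: str) -> str:
    # Expand a metric definition into executable SQL
    m = METRICS[name]
    where = " AND ".join(m["filters"])
    return f"SELECT {m['sql']} FROM {m['table']} WHERE {where}"

print(compile_metric("revenue"))
```

The LLM never has to rediscover the refund rule; it just asks for "revenue" and the layer supplies the vetted SQL.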

Vector-based retrieval: embed table descriptions, retrieve only the 3-5 relevant tables.
Query optimization: rewrite the LLM's SQL for performance.
Aggressive caching: 90%+ cache hit rate.

Pros: Enterprise security/compliance, massive scale, vendor support, deep platform integration.

Cons: Expensive ($50K-500K+/year), vendor lock-in, only works with their platform.

Best for: Fortune 500s, large enterprises already on Snowflake/Databricks

Solution 2: Specialized Startups (Seek AI, Patterns, Outerbase)

Fine-tuning: train the model on your specific schemas and past queries.
Intent classification: a cheap model (~$0.0001 per query) routes 80% of questions to templates, 20% to the LLM.
Query validation: check for cartesian joins and missing WHERE clauses.
Human-in-the-loop: uncertain queries go to Slack for approval.

User: "Show me revenue"
↓
Intent classifier: "Simple aggregation" (no LLM needed)
↓
Template: SELECT SUM(amount) FROM transactions WHERE...
↓
Result in 500ms
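A toy version of that routing step; the regex patterns and templates are illustrative stand-ins for a trained intent classifier:

```python
import re

# Pre-vetted SQL for common questions (illustrative)
TEMPLATES = {
    "total_revenue": "SELECT SUM(amount) FROM transactions WHERE status = 'completed'",
    "user_count":    "SELECT COUNT(*) FROM users",
}

SIMPLE_PATTERNS = [
    (re.compile(r"\b(revenue|sales)\b", re.I), "total_revenue"),
    (re.compile(r"\bhow many users\b", re.I),  "user_count"),
]

def route(question: str):
    # Simple aggregations hit a template (fast, free); everything else
    # falls through to the expensive LLM path.
    for pattern, name in SIMPLE_PATTERNS:
        if pattern.search(question):
            return ("template", TEMPLATES[name])
    return ("llm", None)

print(route("Show me revenue"))
```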

The results: 90%+ accuracy (vs 60% with base models), $0.005-0.02 per query, 1-3 seconds latency, learns from corrections.

Best for: Series A-C startups, mid-size companies (100-1K employees)

Solution 3: Cloud-Native (BigQuery + Gemini, Athena + Bedrock)

Uses existing cloud provider's LLM, schema embeddings stored in vector DB, query logs as training data, pay-as-you-go pricing.

The tradeoff: lower accuracy (60-70%) but "good enough"; fewer features but simpler; cheaper but more DIY.

Best for: AWS/GCP-native companies, cost-conscious teams

Solution 4: Open Source (Transform, Cube.js, dbt, Metriql)

Build semantic layer with dbt or Cube.js, custom query interface with Claude/GPT API, self-hosted caching (Redis), mix-and-match components.

Frontend: Custom Streamlit app
Semantic layer: dbt metrics
Query generation: Claude API
Caching: Redis
Data warehouse: Existing (Snowflake/BigQuery)

Total cost: $5K-20K/year

Best for: Technical teams, early-stage startups, open-source advocates

Part 4: The Decision Framework

Key Questions to Ask:

1. What's your scale?

2. What's your budget?

3. How complex is your data?

4. Do you have engineering resources?

Part 5: What I'd Do Differently (v2 Roadmap)

1. Semantic Layer First

Build business metric definitions before any LLM:

metrics:
  mrr:
    sql: SUM(monthly_price) WHERE status='active'
    description: "Monthly Recurring Revenue"

  churn_rate:
    sql: (canceled / total) * 100
    description: "Customer churn percentage"

2. Vector-Based Schema Retrieval

Embed all table/column descriptions. User question → semantic search → retrieve 3-5 relevant tables only. Pass only relevant schema to LLM.

Impact: Handles 500+ table databases
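A toy sketch of the retrieval step, with bag-of-words cosine similarity standing in for a real embedding model (the table descriptions are illustrative):

```python
import re
from collections import Counter
from math import sqrt

TABLES = {
    "transactions": "payment amounts dates refunds revenue",
    "users":        "customer accounts signup dates plans",
    "audit_log":    "internal admin actions changes",
}

def vec(text):
    # Toy "embedding": a word-count vector
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def relevant_tables(question, k=3):
    q = vec(question)
    ranked = sorted(TABLES, key=lambda t: cosine(q, vec(TABLES[t])), reverse=True)
    return ranked[:k]   # only these tables' schemas go into the prompt

print(relevant_tables("show me refunds and revenue"))
```

Swap the toy `vec` for a real embedding model and the same shape scales to hundreds of tables.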

3. Aggressive Caching

# functools.lru_cache has no TTL; cachetools (third-party) provides one
from cachetools import cached, TTLCache

@cached(TTLCache(maxsize=1024, ttl=3600))  # cache query results for 1 hour
def execute_query(sql):
    return db.query(sql)

Impact: 80-90% cost reduction

4. Tiered Approach

Simple query (aggregation) → Template (0ms, $0)
Medium query → Small model (500ms, $0.001)
Complex query → Claude (3s, $0.02)

Impact: 85% of queries never hit expensive LLM
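Plugging in the figures above (the 60/25/15 traffic split is an assumption consistent with "85% never hit the expensive LLM"):

```python
# (tier name, share of traffic, cost per query in USD)
tiers = [
    ("template",    0.60, 0.0),     # assumed share of simple queries
    ("small model", 0.25, 0.001),
    ("claude",      0.15, 0.02),
]

blended = sum(share * cost for _, share, cost in tiers)
print(f"blended cost: ${blended:.4f} per query, vs $0.02 if everything hit Claude")
```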

5. Query Validation

def count_joins(sql):
    return sql.upper().count(" JOIN ")

def validate_query(sql):
    upper = sql.upper()
    if "SELECT *" in upper and "LIMIT" not in upper:
        return "Error: Must include LIMIT"
    if count_joins(sql) > 5:
        return "Error: Too many JOINs"
    return "Valid"

Impact: Prevent expensive/dangerous queries

6. Fine-Tuning

Collect 500-1000 successful query pairs. Fine-tune Llama 3 or Mistral on company data.

Impact: 90% accuracy, $0.005/query (4x cheaper)
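Collecting those pairs is mostly logging. A sketch of turning logged successes into the chat-style JSONL that most fine-tuning APIs accept (the exact schema varies by provider, and the example pair is hypothetical):

```python
import json

def to_training_example(question: str, sql: str) -> str:
    # One JSONL line per (question, SQL) pair
    return json.dumps({
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": sql},
        ]
    })

pairs = [("How many users signed up last month?",
          "SELECT COUNT(*) FROM users WHERE signup_date >= '2025-05-01'")]
jsonl = "\n".join(to_training_example(q, s) for q, s in pairs)
```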

Part 6: Key Takeaways

1. Demos ≠ Production. My demo worked perfectly for 10 queries. Production needs 10 million.

2. Cost Compounds Quickly. $0.02 per query seems cheap until you multiply by 1000 queries/day x 365 days.

3. LLMs Are Part of the Solution, Not the Whole Solution. Production systems minimize LLM usage through caching, templates, and tiered models.

4. Business Logic > Schema. The schema tells you table structure. Business logic tells you what queries actually mean.

5. There's No One-Size-Fits-All. Startups need different solutions than enterprises. Technical teams have different options than non-technical ones.

The Pattern

Every successful production system does the same things: encode business logic in a semantic layer, retrieve only the relevant slice of schema, cache aggressively, route simple queries away from the LLM, and validate generated SQL before it runs.

Version 2.0: The RAG Transformation

After publishing v1.0, I rebuilt the agent with Retrieval-Augmented Generation (RAG) to address the core accuracy issues.

What Changed

Knowledge Base (97 chunks, ~63KB): schema documentation, business metric definitions (MRR, churn), an examples.json of question-to-SQL pairs, and SQLite-specific syntax rules.

RAG Pipeline: ChromaDB for vector storage, Sentence Transformers for embeddings, semantic search retrieves top-k relevant docs, context injected into LLM prompts.

The Impact

Accuracy improvement: 50% → 90%

Before RAG (v1.0):

Query: "What's our MRR growth rate?"
Generated: SELECT SUM(amount) FROM subscriptions  ❌
Issue: Doesn't account for billing cycles, churn, or time periods

After RAG (v2.0):

Query: "What's our MRR growth rate?"
Retrieved context: MRR formula, billing cycle handling, growth calculation
Generated: ✅ Correct SQL with:
- Billing cycle normalization (annual/12, quarterly/3)
- Active subscription filtering
- Month-over-month comparison
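The normalization itself is plain SQL once the context spells it out. A sketch against a toy subscriptions table (column names are assumptions modeled on my demo schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE subscriptions (
    amount REAL, billing_cycle TEXT, status TEXT)""")
conn.executemany("INSERT INTO subscriptions VALUES (?,?,?)", [
    (1200.0, "annual",    "active"),    # contributes 1200/12 = 100 to MRR
    ( 300.0, "quarterly", "active"),    # contributes 300/3  = 100
    ( 100.0, "monthly",   "active"),    # contributes 100
    (  99.0, "monthly",   "canceled"),  # filtered out
])

# MRR with billing-cycle normalization and active-subscription filtering
mrr = conn.execute("""
    SELECT SUM(CASE billing_cycle
                 WHEN 'annual'    THEN amount / 12.0
                 WHEN 'quarterly' THEN amount / 3.0
                 ELSE amount
               END)
    FROM subscriptions
    WHERE status = 'active'
""").fetchone()[0]
```

Without the retrieved context, the naive `SUM(amount)` over this table would report 1,699 instead of 300.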

Key Learnings

  1. Context is everything - The LLM had the reasoning ability; it just lacked domain knowledge
  2. Quality > Quantity - 97 well-structured chunks outperformed throwing the entire schema at the model
  3. Examples matter - The examples.json file had outsized impact on complex queries
  4. Database-specific syntax - Adding SQLite rules eliminated an entire class of errors

Conclusion

Self-serve analytics with LLMs isn't solved. It's just getting started. The gap between "working demo" and "production system" is enormous. But understanding that gap is the first step to bridging it.

LLM SQL Data Analysis AI Analytics Python RAG