Introduction

"Just ask your database questions in plain English."

This has been the Holy Grail of business intelligence for two decades. Every few years, a new technology promises to finally make it real: natural language processing in the 2000s, voice assistants in the 2010s, and now LLMs.

With GPT-4 and Claude able to write complex code, surely text-to-SQL is finally solved, right?

I spent a weekend building a proof-of-concept to find out. Spoiler: It worked beautifully for exactly 10 queries. Then reality hit.

This is the story of what I built, where it breaks at scale, and what I discovered about how production systems actually solve this. The gap between demo and production taught me more than the build itself.

Part 1: What I Built

The Architecture

I built a text-to-SQL agent with a straightforward architecture:

Frontend: Streamlit chat interface
Orchestration: LangChain managing workflow
Brain: Claude Sonnet 4 for reasoning
Database: SQLite with synthetic SaaS data (1,000 users, 50K+ transactions)

The four-step workflow:

  1. User asks question in natural language
  2. Claude generates SQL query
  3. Execute query and create visualization
  4. Return results with explanation
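A minimal sketch of that loop, with the LLM call stubbed out (in the real agent, step 2 goes through LangChain to Claude, and step 3 also renders a chart):

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Stub for the LLM call: the real agent sends the question plus the
    # database schema to Claude and gets SQL back.
    return "SELECT COUNT(*) AS n FROM users"

def answer(question: str, conn: sqlite3.Connection):
    sql = generate_sql(question)          # step 2: LLM writes the SQL
    rows = conn.execute(sql).fetchall()   # step 3: execute it
    return sql, rows                      # step 4: results + explanation

# Toy database standing in for the synthetic SaaS data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(3)])

sql, rows = answer("How many users do we have?", conn)
```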

The User Experience

User: "Show me monthly revenue trends for the last 2 years"

Agent:
- Generates SQL query
- Executes against database
- Creates line chart
- Returns: "Based on the analysis, revenue peaked in December 2024
  at $25,943, with a concerning downward trend starting in 2025..."

The agent also showed the exact SQL query used, providing full transparency.

Initial Results: It Worked!

The system handled every query I threw at it.

It generated accurate SQL, handled JOINs correctly, and created visualizations automatically. Response time was 10-15 seconds per query.

I was thrilled. Then I started thinking about scale.

Part 2: Why This Would Break in Real Production

Problem 1: The Token Limit Wall

What happened: I was passing the entire database schema with every query. That works fine for 3 tables (~500 tokens). It breaks at 100 tables, where the schema alone can run to ~50K tokens and swallows most of the practical context budget before the question, few-shot examples, or conversation history are added.

The math: schema text grows linearly with table count, so every table you add taxes every query anyone ever runs.

Real-world impact: prompts crowd out examples, cost scales with schema size rather than question difficulty, and the largest enterprise schemas simply don't fit at all.
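The math, as back-of-envelope arithmetic (the per-table token count is an assumption; wide production tables cost far more than my three small demo tables):

```python
# Why stuffing the whole schema into every prompt breaks at scale.
demo_schema_tokens = 500       # my 3 small demo tables, total
tokens_per_wide_table = 500    # assumption: a real table with many columns
n_tables = 100

schema_tokens = n_tables * tokens_per_wide_table
print(schema_tokens)  # 50,000 tokens of overhead, paid on every single query
```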

Problem 2: Cost Spirals Out of Control

Costs per query: with the full schema in every prompt, each query runs roughly $0.02 in tokens.

At scale: 1,000 queries/day works out to about $20/day, $600/month, $7,300/year, and the bill grows linearly with both users and schema size.

Compare that to traditional SQL, where the marginal cost of a query is essentially $0. The LLM approach is on the order of 1000x more expensive.

Problem 3: Latency Kills Adoption

My query timeline: 10-15 seconds end to end, covering prompt assembly with the full schema, LLM reasoning, SQL execution, and chart rendering.

User expectations: BI dashboards and search boxes have trained people to expect answers in a second or two. At 10+ seconds per question, exploratory analysis dies and users go back to asking the data team.

Problem 4: The "Last Mile" Problem

The real world also adds gaps the schema never mentions: refund handling, test transactions, fiscal-calendar quarter definitions, and metric semantics that live only in analysts' heads.

Example failure:

User: "Show me revenue last quarter"
My agent: SELECT SUM(amount) FROM transactions WHERE date > '2024-10-01'

Problem: Includes refunds, test transactions, wrong quarter definition
Correct query: Much more complex with 5+ WHERE clauses

The LLM can't infer this from schema alone.
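To make the gap concrete, here is the same failure reproduced on a toy table. The `status` and `type` filters echo the semantic-layer rules discussed later; the `is_test` column and the specific rows are illustrative, not my actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE transactions (
    amount REAL, date TEXT, status TEXT, type TEXT, is_test INTEGER)""")
conn.executemany(
    "INSERT INTO transactions VALUES (?,?,?,?,?)",
    [
        (100.0, "2024-11-05", "completed", "sale",   0),
        (-20.0, "2024-11-06", "completed", "refund", 0),  # refund
        (999.0, "2024-11-07", "completed", "sale",   1),  # test transaction
        ( 50.0, "2024-09-30", "completed", "sale",   0),  # previous quarter
    ],
)

# The naive LLM query: includes refunds, test rows, and an open-ended range
naive = conn.execute(
    "SELECT SUM(amount) FROM transactions WHERE date > '2024-10-01'"
).fetchone()[0]

# What the business actually means by "revenue last quarter"
correct = conn.execute("""
    SELECT SUM(amount) FROM transactions
    WHERE date >= '2024-10-01' AND date < '2025-01-01'
      AND status = 'completed'
      AND type != 'refund'
      AND is_test = 0
""").fetchone()[0]
```

The two numbers disagree badly, and nothing in the schema alone tells the model which one the user wanted.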

Part 3: What Production Systems Actually Do

So I researched how companies solve this at scale. It turns out the answer isn't "throw more LLM at it."

Solution 1: Enterprise Platforms (Snowflake Cortex, Databricks Genie)

Semantic layer: Pre-define metrics once

revenue:
  SUM(amount) WHERE status='completed' AND type!='refund'
  description: "Total revenue from completed sales"
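A minimal sketch of how such a definition gets consumed, assuming a plain dict registry rather than any vendor's actual format:

```python
# Hypothetical semantic-layer registry: metrics defined once, reused everywhere
METRICS = {
    "revenue": {
        "sql": "SUM(amount)",
        "table": "transactions",
        "filters": ["status = 'completed'", "type != 'refund'"],
        "description": "Total revenue from completed sales",
    }
}

def compile_metric(name: str) -> str:
    # Expand a metric definition into executable SQL
    m = METRICS[name]
    where = " AND ".join(m["filters"])
    return f"SELECT {m['sql']} FROM {m['table']} WHERE {where}"

print(compile_metric("revenue"))
```

The LLM never has to rediscover the refund rule; it just asks for "revenue" and the layer supplies the vetted SQL.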

Vector-based retrieval: embed table descriptions, retrieve only the 3-5 relevant tables.
Query optimization: rewrite the LLM's SQL for performance.
Aggressive caching: 90%+ cache hit rate.

Pros: Enterprise security/compliance, massive scale, vendor support, deep platform integration.

Cons: Expensive ($50K-500K+/year), vendor lock-in, only works with their platform.

Best for: Fortune 500s, large enterprises already on Snowflake/Databricks

Solution 2: Specialized Startups (Seek AI, Patterns, Outerbase)

Fine-tuning: train the model on your specific schemas and past queries.
Intent classification: a cheap model (~$0.0001 per query) routes 80% of questions to templates, 20% to the LLM.
Query validation: check for cartesian joins and missing WHERE clauses.
Human-in-the-loop: uncertain queries go to Slack for approval.

User: "Show me revenue"
↓
Intent classifier: "Simple aggregation" (no LLM needed)
↓
Template: SELECT SUM(amount) FROM transactions WHERE...
↓
Result in 500ms
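A toy version of that routing step; the regex patterns and templates are illustrative stand-ins for a trained intent classifier:

```python
import re

# Pre-vetted SQL for common questions (illustrative)
TEMPLATES = {
    "total_revenue": "SELECT SUM(amount) FROM transactions WHERE status = 'completed'",
    "user_count":    "SELECT COUNT(*) FROM users",
}

SIMPLE_PATTERNS = [
    (re.compile(r"\b(revenue|sales)\b", re.I), "total_revenue"),
    (re.compile(r"\bhow many users\b", re.I),  "user_count"),
]

def route(question: str):
    # Simple aggregations hit a template (fast, free); everything else
    # falls through to the expensive LLM path.
    for pattern, name in SIMPLE_PATTERNS:
        if pattern.search(question):
            return ("template", TEMPLATES[name])
    return ("llm", None)

print(route("Show me revenue"))
```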

The results: 90%+ accuracy (vs 60% with base models), $0.005-0.02 per query, 1-3 seconds latency, learns from corrections.

Best for: Series A-C startups, mid-size companies (100-1K employees)

Solution 3: Cloud-Native (BigQuery + Gemini, Athena + Bedrock)

Uses existing cloud provider's LLM, schema embeddings stored in vector DB, query logs as training data, pay-as-you-go pricing.

The tradeoff: lower accuracy (60-70%) but "good enough"; fewer features but simpler; cheaper but more DIY.

Best for: AWS/GCP-native companies, cost-conscious teams

Solution 4: Open Source (Transform, Cube.js, dbt, Metriql)

Build semantic layer with dbt or Cube.js, custom query interface with Claude/GPT API, self-hosted caching (Redis), mix-and-match components.

Frontend: Custom Streamlit app
Semantic layer: dbt metrics
Query generation: Claude API
Caching: Redis
Data warehouse: Existing (Snowflake/BigQuery)

Total cost: $5K-20K/year

Best for: Technical teams, early-stage startups, open-source advocates

Part 4: The Decision Framework

Key Questions to Ask:

1. What's your scale?

2. What's your budget?

3. How complex is your data?

4. Do you have engineering resources?

Part 5: What I'd Do Differently (v2 Roadmap)

1. Semantic Layer First

Build business metric definitions before any LLM:

metrics:
  mrr:
    sql: SUM(monthly_price) WHERE status='active'
    description: "Monthly Recurring Revenue"

  churn_rate:
    sql: (canceled / total) * 100
    description: "Customer churn percentage"

2. Vector-Based Schema Retrieval

Embed all table/column descriptions. User question → semantic search → retrieve 3-5 relevant tables only. Pass only relevant schema to LLM.

Impact: Handles 500+ table databases
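A toy sketch of the retrieval step, with bag-of-words cosine similarity standing in for a real embedding model (the table descriptions are illustrative):

```python
import re
from collections import Counter
from math import sqrt

TABLES = {
    "transactions": "payment amounts dates refunds revenue",
    "users":        "customer accounts signup dates plans",
    "audit_log":    "internal admin actions changes",
}

def vec(text):
    # Toy "embedding": a word-count vector
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def relevant_tables(question, k=3):
    q = vec(question)
    ranked = sorted(TABLES, key=lambda t: cosine(q, vec(TABLES[t])), reverse=True)
    return ranked[:k]   # only these tables' schemas go into the prompt

print(relevant_tables("show me refunds and revenue"))
```

Swap the toy `vec` for a real embedding model and the same shape scales to hundreds of tables.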

3. Aggressive Caching

# functools.lru_cache has no TTL; cachetools (third-party) provides one
from cachetools import cached, TTLCache

@cached(TTLCache(maxsize=1024, ttl=3600))  # cache query results for 1 hour
def execute_query(sql):
    return db.query(sql)

Impact: 80-90% cost reduction

4. Tiered Approach

Simple query (aggregation) → Template (0ms, $0)
Medium query → Small model (500ms, $0.001)
Complex query → Claude (3s, $0.02)

Impact: 85% of queries never hit expensive LLM
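Plugging in the figures above (the 60/25/15 traffic split is an assumption consistent with "85% never hit the expensive LLM"):

```python
# (tier name, share of traffic, cost per query in USD)
tiers = [
    ("template",    0.60, 0.0),     # assumed share of simple queries
    ("small model", 0.25, 0.001),
    ("claude",      0.15, 0.02),
]

blended = sum(share * cost for _, share, cost in tiers)
print(f"blended cost: ${blended:.4f} per query, vs $0.02 if everything hit Claude")
```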

5. Query Validation

def count_joins(sql):
    return sql.upper().count(" JOIN ")

def validate_query(sql):
    upper = sql.upper()
    if "SELECT *" in upper and "LIMIT" not in upper:
        return "Error: Must include LIMIT"
    if count_joins(sql) > 5:
        return "Error: Too many JOINs"
    return "Valid"

Impact: Prevent expensive/dangerous queries

6. Fine-Tuning

Collect 500-1000 successful query pairs. Fine-tune Llama 3 or Mistral on company data.

Impact: 90% accuracy, $0.005/query (4x cheaper)
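Collecting those pairs is mostly logging. A sketch of turning logged successes into the chat-style JSONL that most fine-tuning APIs accept (the exact schema varies by provider, and the example pair is hypothetical):

```python
import json

def to_training_example(question: str, sql: str) -> str:
    # One JSONL line per (question, SQL) pair
    return json.dumps({
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": sql},
        ]
    })

pairs = [("How many users signed up last month?",
          "SELECT COUNT(*) FROM users WHERE signup_date >= '2025-05-01'")]
jsonl = "\n".join(to_training_example(q, s) for q, s in pairs)
```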

Part 6: Key Takeaways

1. Demos ≠ Production. My demo worked perfectly for 10 queries. Production needs 10 million.

2. Cost Compounds Quickly. $0.02 per query seems cheap until you multiply by 1000 queries/day x 365 days.

3. LLMs Are Part of the Solution, Not the Whole Solution. Production systems minimize LLM usage through caching, templates, and tiered models.

4. Business Logic > Schema. The schema tells you table structure. Business logic tells you what queries actually mean.

5. There's No One-Size-Fits-All. Startups need different solutions than enterprises. Technical teams have different options than non-technical ones.

The Pattern

Every successful production system does the same things: encode business logic in a semantic layer, retrieve only the relevant slice of schema, cache aggressively, route simple queries away from the LLM, and validate generated SQL before it runs.

Version 2.0: The RAG Transformation

After publishing v1.0, I rebuilt the agent with Retrieval-Augmented Generation (RAG) to address the core accuracy issues.

What Changed

Knowledge Base (97 chunks, ~63KB): schema documentation, business metric definitions (MRR, churn), an examples.json of question-to-SQL pairs, and SQLite-specific syntax rules.

RAG Pipeline: ChromaDB for vector storage, Sentence Transformers for embeddings, semantic search retrieves top-k relevant docs, context injected into LLM prompts.

The Impact

Accuracy improvement: 50% → 90%

Before RAG (v1.0):

Query: "What's our MRR growth rate?"
Generated: SELECT SUM(amount) FROM subscriptions  ❌
Issue: Doesn't account for billing cycles, churn, or time periods

After RAG (v2.0):

Query: "What's our MRR growth rate?"
Retrieved context: MRR formula, billing cycle handling, growth calculation
Generated: ✅ Correct SQL with:
- Billing cycle normalization (annual/12, quarterly/3)
- Active subscription filtering
- Month-over-month comparison
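The normalization itself is plain SQL once the context spells it out. A sketch against a toy subscriptions table (column names are assumptions modeled on my demo schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE subscriptions (
    amount REAL, billing_cycle TEXT, status TEXT)""")
conn.executemany("INSERT INTO subscriptions VALUES (?,?,?)", [
    (1200.0, "annual",    "active"),    # contributes 1200/12 = 100 to MRR
    ( 300.0, "quarterly", "active"),    # contributes 300/3  = 100
    ( 100.0, "monthly",   "active"),    # contributes 100
    (  99.0, "monthly",   "canceled"),  # filtered out
])

# MRR with billing-cycle normalization and active-subscription filtering
mrr = conn.execute("""
    SELECT SUM(CASE billing_cycle
                 WHEN 'annual'    THEN amount / 12.0
                 WHEN 'quarterly' THEN amount / 3.0
                 ELSE amount
               END)
    FROM subscriptions
    WHERE status = 'active'
""").fetchone()[0]
```

Without the retrieved context, the naive `SUM(amount)` over this table would report 1,699 instead of 300.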

Key Learnings

  1. Context is everything - The LLM had the reasoning ability; it just lacked domain knowledge
  2. Quality > Quantity - 97 well-structured chunks outperformed throwing the entire schema at the model
  3. Examples matter - The examples.json file had outsized impact on complex queries
  4. Database-specific syntax - Adding SQLite rules eliminated an entire class of errors

Conclusion

Self-serve analytics with LLMs isn't solved. It's just getting started. The gap between "working demo" and "production system" is enormous. But understanding that gap is the first step to bridging it.

LLM SQL Data Analysis AI Analytics Python RAG