
Building an Open Data Infrastructure for the Agent Era: A Practical Guide

2026-05-14 09:52:10

Overview

AI agents are transforming how organizations interact with data. Unlike humans, agents can run tens or hundreds of times more queries—and they don't need answers in milliseconds. This shift exposes a critical flaw in closed data ecosystems: every query travels through the same expensive compute path, regardless of its cost. As Anjan Kundavaram, Chief Product Officer at Fivetran, puts it, "It's kind of like using a Lamborghini to mow the lawn all the time."

Source: thenewstack.io

In a recent podcast, Kundavaram argued that the only way to survive the agent era is to embrace open data infrastructure—a stack that consolidates data across sources, provides multiple compute engines, and enforces a disciplined semantic layer. Without this, organizations face a "triple whammy": poor AI answers, skyrocketing costs, and wasted context. This tutorial walks you through the steps to build such an infrastructure, drawing on principles from Fivetran’s Open Data Initiative and real-world best practices.

Prerequisites

Before you begin, ensure you have:

- Access to your organization's data sources (e.g., Salesforce, application logs) and an object store such as S3 or GCS
- A data movement tool (this guide assumes Fivetran) with permission to configure connectors and destinations
- Working knowledge of SQL and basic Python
- Admin access to at least one analytical engine (e.g., Snowflake or BigQuery)

Step-by-Step Instructions

Step 1: Audit Your Current Data Ecosystem

Map all data sources, compute engines, and access patterns. Note which queries are analytical (expensive) and which are operational (cheap). Identify silos—for example, customer data in Salesforce, product data in a data lake, and log data in a separate warehouse. Closed stacks often force all queries through one engine like a traditional data warehouse.

Example audit output:

Source: Salesforce (via Fivetran connector)
Target: Snowflake (single compute cluster)
Query types: 70% analytical, 30% lightweight
Issue: All queries pay for the same cluster
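It can help to capture the audit as structured data rather than notes. The sketch below is a minimal, hypothetical way to do that in Python (the record fields and example rows are illustrative, not a Fivetran API): it tallies how many analytical versus operational queries hit each engine, which is exactly the split you need before deciding what to route elsewhere.

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    source: str   # where the data originates (e.g., "salesforce")
    engine: str   # which compute engine served the query
    kind: str     # "analytical" or "operational"

def audit_summary(records):
    """Count analytical vs. operational queries per engine."""
    summary = {}
    for r in records:
        counts = summary.setdefault(r.engine, {"analytical": 0, "operational": 0})
        counts[r.kind] += 1
    return summary

# Illustrative audit rows matching the example output above
records = [
    QueryRecord("salesforce", "snowflake", "analytical"),
    QueryRecord("salesforce", "snowflake", "operational"),
    QueryRecord("datalake", "snowflake", "analytical"),
]
```

An engine whose traffic is mostly operational lookups is a strong candidate for rerouting to a cheaper engine in Step 4.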

Step 2: Consolidate Data into an Open Lakehouse

Move data into an open format (e.g., Apache Parquet) on object storage (e.g., S3, GCS). Use frameworks like Delta Lake or Apache Iceberg for ACID transactions. This decouples storage from compute, allowing multiple engines to query the same data.

Code snippet (an illustrative Fivetran-style connector configuration; the exact keys vary by connector and destination, so treat this as a sketch rather than a literal config file):

{
  "connector": "salesforce",
  "destination": "gcs://my-bucket/datalake/",
  "format": "iceberg",
  "table_config": {
    "partition_by": ["date"]
  }
}

Step 3: Implement a Semantic Layer

A semantic layer defines business meaning (e.g., "revenue = sum(sales.amount)") and governs access. Tools like dbt, LookML, or AtScale let you define metrics once and reuse them across engines. This ensures AI agents get consistent, high-quality context.

Example dbt model:

-- models/customer_metrics.sql
WITH orders AS (
  SELECT customer_id, amount FROM {{ ref('orders_raw') }}
)
SELECT
  customer_id,
  COUNT(*) AS total_orders,
  SUM(amount) AS revenue
FROM orders
GROUP BY customer_id
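The core idea of "define once, reuse everywhere" can also be sketched outside any specific tool. The hypothetical registry below (the `METRICS` dict and `build_query` helper are assumptions for illustration, not a dbt or AtScale API) expands a governed metric definition into SQL that any engine can run, so an agent never invents its own definition of revenue:

```python
# Hypothetical metric registry: each business metric is defined exactly once.
METRICS = {
    "revenue": "SUM(amount)",
    "total_orders": "COUNT(*)",
}

def build_query(metric: str, table: str, group_by: str) -> str:
    """Expand a governed metric definition into SQL for any engine."""
    expr = METRICS[metric]  # undefined metrics fail loudly instead of being guessed
    return f"SELECT {group_by}, {expr} AS {metric} FROM {table} GROUP BY {group_by}"
```

Because every engine receives SQL generated from the same definition, "revenue" means the same thing whether the query runs on DuckDB or BigQuery.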

Step 4: Deploy Multiple Compute Engines

Use a mix of engines: a scalable engine for heavy analytical queries (e.g., Google BigQuery, or Snowflake with multi-cluster scaling), and a lighter engine for simple lookups (e.g., DuckDB, Apache Druid). Route queries intelligently. For example, an agent asking "What was revenue yesterday?" goes to DuckDB; "Generate a regression model on 5 years of sales" goes to BigQuery.


Routing logic (a minimal Python sketch; `bigquery` and `duckdb` stand in for real client objects):

def route_query(query: str):
    # Heavy analytical work goes to the scalable engine;
    # everything else stays on the cheap, low-latency engine.
    heavy_keywords = ("model", "regression", "train")
    if any(kw in query.lower() for kw in heavy_keywords):
        return bigquery.execute(query)
    return duckdb.execute(query)

Step 5: Set Up Cost Monitoring and Quotas

Don't blindly lock down budgets—instead, monitor and educate. Set per-agent or per-query cost thresholds. Use open benchmarks like Fivetran's Open Data Infrastructure Data Access Benchmark to compare costs across engines.

Monitoring dashboard query (SQL):

SELECT
  agent_id,
  engine,
  query_count,
  total_cost
FROM cost_metrics
WHERE date = CURRENT_DATE
ORDER BY total_cost DESC
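Once the `cost_metrics` rows are flowing, enforcing a threshold is straightforward. The sketch below is a minimal, assumed implementation (the row shape and the `over_quota` helper are illustrative) that sums each agent's daily spend and flags the ones past a budget, which supports the "monitor and educate" posture rather than a hard lockdown:

```python
def over_quota(cost_rows, threshold):
    """Return agents whose total daily spend exceeds the cost threshold."""
    totals = {}
    for row in cost_rows:
        totals[row["agent_id"]] = totals.get(row["agent_id"], 0.0) + row["total_cost"]
    return sorted(agent for agent, cost in totals.items() if cost > threshold)

# Illustrative rows, shaped like the cost_metrics table above
rows = [
    {"agent_id": "sales-bot", "total_cost": 12.50},
    {"agent_id": "sales-bot", "total_cost": 9.00},
    {"agent_id": "support-bot", "total_cost": 3.25},
]
```

Flagged agents can then be nudged toward cheaper engines or cached views instead of being cut off outright.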

Step 6: Train Agents to Be Cost-Aware

Agents can learn to prefer cheaper engines. For each request, the agent evaluates cost vs. latency trade-offs. For example, an agent looking for a list of top customers can use a cached view (cheap) instead of a full scan (expensive).

Agent prompt example:

"User query: Show top 10 customers by revenue.
Routing options:
- Engine A (BigQuery): $0.10 per query, 2 sec latency
- Engine B (DuckDB): $0.001 per query, 0.5 sec latency
Decision: Use Engine B (likely data in cache)."

Common Mistakes

- Routing every query through a single expensive engine, so cheap operational lookups pay analytical prices
- Skipping the semantic layer, which leaves agents guessing at metric definitions and producing inconsistent answers
- Responding to rising costs by locking agents down instead of adding monitoring, quotas, and cheaper routing options
- Leaving data siloed in closed formats, so each new engine requires yet another copy of the data

Summary

Closed data stacks are ill-equipped for the agent era. By consolidating data into an open lakehouse, implementing a semantic layer, deploying multiple compute engines, and monitoring costs, you can unlock agent productivity without runaway expenses. The key is to innovate rather than lock down—embrace open infrastructure and budget-aware routing. As Kundavaram notes, the productivity unlock only materializes when you refuse the lockdown instinct.
