
Building an Open Data Infrastructure for the Agent Era: A Practical Guide

2026-05-14 09:52:10

Overview

AI agents are transforming how organizations interact with data. Unlike humans, agents can run tens or hundreds of times more queries—and they don't need answers in milliseconds. This shift exposes a critical flaw in closed data ecosystems: every query travels through the same expensive compute path, regardless of its cost. As Anjan Kundavaram, Chief Product Officer at Fivetran, puts it, "It's kind of like using a Lamborghini to mow the lawn all the time."

Source: thenewstack.io

In a recent podcast, Kundavaram argued that the only way to survive the agent era is to embrace open data infrastructure—a stack that consolidates data across sources, provides multiple compute engines, and enforces a disciplined semantic layer. Without this, organizations face a "triple whammy": poor AI answers, skyrocketing costs, and wasted context. This tutorial walks you through the steps to build such an infrastructure, drawing on principles from Fivetran’s Open Data Initiative and real-world best practices.

Prerequisites

Before you begin, ensure you have:

- Access to your organization's data sources (e.g., Salesforce, application logs) and an object store such as S3 or GCS
- A data movement tool (this guide assumes Fivetran) with permission to configure connectors and destinations
- Working knowledge of SQL and basic Python
- Admin access to at least one analytical engine (e.g., Snowflake or BigQuery)

Step-by-Step Instructions

Step 1: Audit Your Current Data Ecosystem

Map all data sources, compute engines, and access patterns. Note which queries are analytical (expensive) and which are operational (cheap). Identify silos—for example, customer data in Salesforce, product data in a data lake, and log data in a separate warehouse. Closed stacks often force all queries through one engine like a traditional data warehouse.

Example audit output:

Source: Salesforce (via Fivetran connector)
Target: Snowflake (single compute cluster)
Query types: 70% analytical, 30% lightweight
Issue: All queries pay for the same cluster
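It can help to capture the audit as structured data rather than notes. The sketch below is a minimal, hypothetical way to do that in Python (the record fields and example rows are illustrative, not a Fivetran API): it tallies how many analytical versus operational queries hit each engine, which is exactly the split you need before deciding what to route elsewhere.

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    source: str   # where the data originates (e.g., "salesforce")
    engine: str   # which compute engine served the query
    kind: str     # "analytical" or "operational"

def audit_summary(records):
    """Count analytical vs. operational queries per engine."""
    summary = {}
    for r in records:
        counts = summary.setdefault(r.engine, {"analytical": 0, "operational": 0})
        counts[r.kind] += 1
    return summary

# Illustrative audit rows matching the example output above
records = [
    QueryRecord("salesforce", "snowflake", "analytical"),
    QueryRecord("salesforce", "snowflake", "operational"),
    QueryRecord("datalake", "snowflake", "analytical"),
]
```

An engine whose traffic is mostly operational lookups is a strong candidate for rerouting to a cheaper engine in Step 4.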

Step 2: Consolidate Data into an Open Lakehouse

Move data into an open format (e.g., Apache Parquet) on object storage (e.g., S3, GCS). Use frameworks like Delta Lake or Apache Iceberg for ACID transactions. This decouples storage from compute, allowing multiple engines to query the same data.

Code snippet (an illustrative Fivetran-style connector configuration; the exact keys vary by connector and destination, so treat this as a sketch rather than a literal config file):

{
  "connector": "salesforce",
  "destination": "gcs://my-bucket/datalake/",
  "format": "iceberg",
  "table_config": {
    "partition_by": ["date"]
  }
}

Step 3: Implement a Semantic Layer

A semantic layer defines business meaning (e.g., "revenue = sum(sales.amount)") and governs access. Tools like dbt, LookML, or AtScale let you define metrics once and reuse them across engines. This ensures AI agents get consistent, high-quality context.

Example dbt model:

-- models/customer_metrics.sql
WITH orders AS (
  SELECT customer_id, amount FROM {{ ref('orders_raw') }}
)
SELECT
  customer_id,
  COUNT(*) AS total_orders,
  SUM(amount) AS revenue
FROM orders
GROUP BY customer_id
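The core idea of "define once, reuse everywhere" can also be sketched outside any specific tool. The hypothetical registry below (the `METRICS` dict and `build_query` helper are assumptions for illustration, not a dbt or AtScale API) expands a governed metric definition into SQL that any engine can run, so an agent never invents its own definition of revenue:

```python
# Hypothetical metric registry: each business metric is defined exactly once.
METRICS = {
    "revenue": "SUM(amount)",
    "total_orders": "COUNT(*)",
}

def build_query(metric: str, table: str, group_by: str) -> str:
    """Expand a governed metric definition into SQL for any engine."""
    expr = METRICS[metric]  # undefined metrics fail loudly instead of being guessed
    return f"SELECT {group_by}, {expr} AS {metric} FROM {table} GROUP BY {group_by}"
```

Because every engine receives SQL generated from the same definition, "revenue" means the same thing whether the query runs on DuckDB or BigQuery.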

Step 4: Deploy Multiple Compute Engines

Use a mix of engines: a scalable engine for heavy analytical queries (e.g., Google BigQuery, or Snowflake with multi-cluster scaling), and a lighter engine for simple lookups (e.g., DuckDB, Apache Druid). Route queries intelligently. For example, an agent asking "What was revenue yesterday?" goes to DuckDB; "Generate a regression model on 5 years of sales" goes to BigQuery.


Routing logic (a minimal Python sketch; `bigquery` and `duckdb` stand in for real client objects):

def route_query(query: str):
    # Heavy analytical work goes to the scalable engine;
    # everything else stays on the cheap, low-latency engine.
    heavy_keywords = ("model", "regression", "train")
    if any(kw in query.lower() for kw in heavy_keywords):
        return bigquery.execute(query)
    return duckdb.execute(query)

Step 5: Set Up Cost Monitoring and Quotas

Don't blindly lock down budgets—instead, monitor and educate. Set per-agent or per-query cost thresholds. Use open benchmarks like Fivetran's Open Data Infrastructure Data Access Benchmark to compare costs across engines.

Monitoring dashboard query (SQL):

SELECT
  agent_id,
  engine,
  query_count,
  total_cost
FROM cost_metrics
WHERE date = CURRENT_DATE
ORDER BY total_cost DESC
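Once the `cost_metrics` rows are flowing, enforcing a threshold is straightforward. The sketch below is a minimal, assumed implementation (the row shape and the `over_quota` helper are illustrative) that sums each agent's daily spend and flags the ones past a budget, which supports the "monitor and educate" posture rather than a hard lockdown:

```python
def over_quota(cost_rows, threshold):
    """Return agents whose total daily spend exceeds the cost threshold."""
    totals = {}
    for row in cost_rows:
        totals[row["agent_id"]] = totals.get(row["agent_id"], 0.0) + row["total_cost"]
    return sorted(agent for agent, cost in totals.items() if cost > threshold)

# Illustrative rows, shaped like the cost_metrics table above
rows = [
    {"agent_id": "sales-bot", "total_cost": 12.50},
    {"agent_id": "sales-bot", "total_cost": 9.00},
    {"agent_id": "support-bot", "total_cost": 3.25},
]
```

Flagged agents can then be nudged toward cheaper engines or cached views instead of being cut off outright.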

Step 6: Train Agents to Be Cost-Aware

Agents can learn to prefer cheaper engines. For each request, the agent evaluates cost vs. latency trade-offs. For example, an agent looking for a list of top customers can use a cached view (cheap) instead of a full scan (expensive).

Agent prompt example:

"User query: Show top 10 customers by revenue.
Routing options:
- Engine A (BigQuery): $0.10 per query, 2 sec latency
- Engine B (DuckDB): $0.001 per query, 0.5 sec latency
Decision: Use Engine B (likely data in cache)."

Common Mistakes

- Routing every query through a single expensive engine, so cheap operational lookups pay analytical prices
- Skipping the semantic layer, which leaves agents guessing at metric definitions and producing inconsistent answers
- Responding to rising costs by locking agents down instead of adding monitoring, quotas, and cheaper routing options
- Leaving data siloed in closed formats, so each new engine requires yet another copy of the data

Summary

Closed data stacks are ill-equipped for the agent era. By consolidating data into an open lakehouse, implementing a semantic layer, deploying multiple compute engines, and monitoring costs, you can unlock agent productivity without runaway expenses. The key is to innovate rather than lock down—embrace open infrastructure and budget-aware routing. As Kundavaram notes, the productivity unlock only materializes when you refuse the lockdown instinct.
