Data Science

Empowering Analysts: Building Data Pipelines with YAML, dlt, dbt, and Trino – A Step-by-Step Guide

2026-05-04 02:55:31

Overview

Data pipelines have traditionally been the domain of software engineers wielding PySpark or Python scripts. However, a new stack — dlt (data load tool), dbt (data build tool), and Trino — allows analysts to build and maintain pipelines using nothing more than YAML configuration files. This guide walks you through replacing complex PySpark pipelines with four YAML files, cutting delivery time from weeks to a single day. By the end, you’ll understand how to set up a pipeline that extracts, loads, transforms, and queries data without writing a single line of Python or Spark code.

Source: towardsdatascience.com

Prerequisites

Before diving in, ensure you have:

- A recent Python installation with pip (dlt and dbt are installed as Python packages)
- Access to a running Trino cluster (the examples assume one at localhost:8080)
- A data source to extract from, such as a REST API and its credentials
- Basic familiarity with SQL

This guide assumes you are comfortable running terminal commands and editing configuration files.

Step-by-Step Instructions

1. Setting Up the Tools

Install dlt, dbt (with its Trino adapter, dbt-trino), and the Trino client using pip (or conda):

pip install dlt dbt-core dbt-trino trino

Verify the installations (note that the trino command refers to the Trino CLI, which is distributed separately from the trino Python package):

dlt --version
dbt --version
trino --version

Create a project directory:

mkdir my_pipeline
cd my_pipeline

2. Configuring the Source – dlt YAML

dlt extracts data from sources and loads it into a destination. Create a file sources.yml:

# sources.yml
sources:
  my_api:
    type: rest_api
    config:
      base_url: "https://api.example.com/v1"
      endpoint: /data
      pagination: true
    # Add authentication if needed
    auth:
      api_key: "${API_KEY}"

This YAML tells dlt to fetch data from an API endpoint with pagination. Replace the URL and API key with your own. dlt supports many source types (databases, cloud storage, etc.).
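
Assuming the ${...} placeholders are resolved from environment variables (as the config above implies), export them in your shell before running anything. The variable names here are simply the ones used in these examples:

# Export credentials so the ${...} placeholders in the YAML resolve at run time
export API_KEY="your-api-key"
export TRINO_PASSWORD="your-trino-password"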

3. Loading Data – dlt Destination YAML

Create destinations.yml to specify where data goes:

# destinations.yml
destinations:
  my_trino:
    type: trino
    config:
      host: localhost
      port: 8080
      database: my_db
      user: analyst
      password: "${TRINO_PASSWORD}"

Now define a pipeline in pipeline.yml that links the source and destination:

# pipeline.yml
pipeline:
  name: my_first_pipeline
  source: my_api
  destination: my_trino
  tables:
    - name: raw_data
      primary_key: id
      incremental: true

Run the pipeline with a single command:

dlt pipeline run pipeline.yml

Data is now loaded into Trino under the raw_data table.
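
As a quick sanity check, you can count the loaded rows from any Trino client. The schema name below is a placeholder; use whichever schema dlt loaded raw_data into:

-- Sanity check: confirm rows landed in raw_data (replace your_schema with the actual schema)
SELECT count(*) AS row_count
FROM my_db.your_schema.raw_data;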

4. Transforming with dbt

dbt allows analysts to write SQL models. Initialize a dbt project inside your directory:

dbt init my_dbt_project

Edit profiles.yml to point to your Trino instance:

# profiles.yml
my_dbt_project:
  outputs:
    dev:
      type: trino
      method: none
      host: localhost
      port: 8080
      database: my_db
      schema: analytics
      user: analyst
      password: "${TRINO_PASSWORD}"
  target: dev
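
Before creating models, it is worth confirming that dbt can actually reach Trino using dbt's built-in connection check (run from inside the dbt project directory):

# Validate the profile and test the Trino connection
dbt debug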

Create a transformation model in models/ – for example, aggregated_data.sql:

-- models/aggregated_data.sql
SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(MONTH FROM event_date) AS month,
    category,
    SUM(revenue) AS total_revenue
FROM {{ source('raw_data', 'raw_data') }}
GROUP BY 1,2,3
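
The {{ source(...) }} reference only resolves if the raw table is declared as a dbt source. If your project does not declare it yet, a minimal declaration might look like the following; the schema value is an assumption, so point it at wherever dlt loaded raw_data:

# models/sources.yml -- declares the raw table so source('raw_data', 'raw_data') resolves
version: 2
sources:
  - name: raw_data
    database: my_db        # Trino catalog
    schema: your_schema    # adjust to the schema dlt loaded into
    tables:
      - name: raw_data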

Run dbt to apply transformations:

dbt run

This creates a table or view in Trino’s analytics schema.
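
By default dbt materializes models as views; if you want a physical table in Trino instead, the standard dbt config block at the top of the model does it (an optional tweak, not part of the model shown above):

-- Optional: add at the top of models/aggregated_data.sql to build a table instead of a view
{{ config(materialized='table') }}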

5. Querying with Trino

Now you can query the transformed data using any SQL client connected to Trino. For example:

-- Query from Trino CLI or your BI tool
SELECT * FROM my_db.analytics.aggregated_data
WHERE total_revenue > 100000
ORDER BY year, month;

That’s it – a complete pipeline defined in just four YAML files (sources.yml, destinations.yml, pipeline.yml, and dbt’s profiles.yml) plus one SQL model.

Common Mistakes

- Forgetting to export API_KEY and TRINO_PASSWORD before running the pipeline, so the ${...} placeholders resolve to empty values.
- Pointing the dbt source() reference at the wrong schema: the source declaration has to match where dlt actually loaded raw_data.
- Installing dbt-core without the Trino adapter (dbt-trino), which makes the type: trino profile fail.
- Running dbt run before the dlt pipeline has loaded any data, producing empty models or failed runs.

Summary

By replacing PySpark with a stack of dlt, dbt, and Trino, organizations empower analysts to build and maintain data pipelines using YAML and SQL alone. The process reduces delivery time from weeks to one day, eliminates the need for dedicated engineering support, and keeps pipelines version-controlled and auditable. This guide demonstrated a complete end-to-end pipeline with four configuration files, covering extraction, loading, transformation, and querying. Start with a single use case, and scale from there.
