
Data Flow governance for Data Integration: automated audit trails with CI/CD

· 15 min read
Guy Raz
Product Manager, Data Integration @Boomi
note
  • This blog is specifically about Data Integration — not Boomi Integration (iPaaS). The assets, concepts, and APIs here apply only to Data Integration Data Flows.
  • The Boomi Data Integration API uses river in its field names and endpoints, for example river_cross_id and river_status. Throughout this blog, Data Flow is the product term. API field names are shown exactly as they appear in API requests and responses.

Building a Data Flow in Boomi Data Integration is easy: drag, drop, connect, done. But here's the question nobody thinks about until it's urgent — who changed that PostgreSQL-to-Snowflake schedule at 2 AM, and why? Or when did a source mapping shift?

When your governance framework or compliance audit asks, "What changed and when?" you need a real answer. Without a change history, you're stuck piecing things together from UI session logs or asking around until someone remembers.

That's exactly what this version control system solves. It treats your Data Flow configurations as version-controlled artifacts: a CI/CD pipeline automatically snapshots your Data Flows, detects changes, and opens a pull request for every Data Flow that drifts. Your team gets a living audit trail with zero manual effort.

Typical audit lifecycle

Here is how a change-detection-to-review cycle looks end-to-end:

  1. A team member updates a Data Flow in the Boomi Data Integration UI.

  2. The bdi-audit pipeline runs on its next scheduled trigger — for example, every hour.

  3. Get all rivers.py fetches the current Data Flow list and writes new_changes.json with any Data Flows modified in the last 24 hours.

  4. BDI audit.py fetches the full configuration for each changed Data Flow, compares it to the stored snapshot, and writes the summary files.

  5. The pipeline creates a branch named bdi-audit/{river_cross_id}-{timestamp}, commits the updated snapshot, and opens a pull request.

  6. The pipeline attempts to auto-merge using the squash strategy:

    • If the merge succeeds, the snapshot is automatically updated, and the branch is closed.

    • If the merge is blocked — for example, by a branch protection rule requiring approval — the PR stays open for a reviewer to inspect.

  7. The next audit run starts fresh: new_changes.json contains only Data Flows modified in the 24 hours before that run.
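Step 5 names each branch bdi-audit/{river_cross_id}-{timestamp}. Here is a minimal sketch of how such a name could be built. The function name and the timestamp format (compact UTC) are assumptions for illustration; the post does not specify the format the reference pipeline uses.

```python
from datetime import datetime, timezone


def audit_branch_name(river_cross_id: str, now: datetime) -> str:
    """Build a branch name of the form bdi-audit/{river_cross_id}-{timestamp}."""
    # Assumed timestamp format: compact UTC digits, which are safe in Git ref names.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"bdi-audit/{river_cross_id}-{stamp}"


if __name__ == "__main__":
    print(audit_branch_name("698c9aef8a05d7507fae51ec", datetime.now(timezone.utc)))
```

Including the timestamp keeps branch names unique when the same Data Flow changes again before an earlier PR is merged.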

Data Flow Version Control Lifecycle

How it works

Under the hood, the version control system runs two CI/CD pipelines against your Boomi Data Integration environment — one you run once, and one that runs on repeat.

bdi-baseline is a one-time setup pipeline. It fetches all Data Flows in your environment, snapshots each one as a JSON file, and commits everything to the repository. Think of it as Day 0: you're establishing the known-good state.

bdi-audit is a scheduled pipeline. It runs every hour by default, re-fetches your Data Flow list, and diffs each recently changed Data Flow against its committed snapshot. For any Data Flow built with Interface = New (is_api_v2: true) that changed in the last 24 hours, the pipeline opens one pull request per changed Data Flow, complete with a full diff. If nothing changed, the pipeline exits cleanly. If auto-merge succeeds (squash strategy), the snapshot update is applied automatically. If auto-merge is blocked by a branch protection rule, the PR stays open for your team to review and approve manually.

The result is a pull request-driven review model for your Data Flows: one PR per Data Flow, so reviews stay focused and easy to approve or reject independently.

note

bdi-audit checks for Data Flows with is_api_v2: true and a last_updated_at within the last 24 hours. Any Data Flow that doesn't meet both conditions is skipped.
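The two conditions in the note can be sketched as a small predicate. This is an illustration, not the code from Get all rivers.py: the function names are hypothetical, and the only details taken from the post are the is_api_v2 and last_updated_at fields and the 24-hour window.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2026-05-18T02:45:12.000000Z."""
    # Swap a trailing Z for +00:00 so fromisoformat also works before Python 3.11.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_audit_candidate(item: dict, now: datetime, window_hours: float = 24) -> bool:
    """Return True only if both audit conditions hold for this Data Flow."""
    if item.get("is_api_v2") is not True:
        return False  # Built with the older interface: skipped.
    updated = parse_timestamp(item["last_updated_at"])
    return now - updated <= timedelta(hours=window_hours)
```

Passing now in as an argument, rather than reading the clock inside the function, makes the window easy to test against fixed timestamps.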

Prerequisites

Before you set up the version control system, make sure you have the following:

  • A Boomi Data Integration account with at least one active environment.

  • A private Git repository on your preferred hosting platform.

  • Admin-level access to the repository to configure CI/CD pipelines and secret variables.

Boomi Data Integration credentials

You need three values from your Boomi Data Integration environment. Store these as secret variables in your CI/CD settings. The exact location varies by platform. For example, in Bitbucket, that's Repository Settings > Repository variables; in GitHub, it's Settings > Secrets and variables > Actions.

| Variable | Where to find it |
| --- | --- |
| DATA_INTEGRATION_ACCOUNT_ID | Log in to the Data Integration console, open any Data Flow, and read the account ID from the browser URL: https://console.boomi.com/accounts/{account_id}/environments/... |
| DATA_INTEGRATION_ENVIRONMENT_ID | Same URL as above. The environment ID appears between /environments/ and /rivers/ |
| DATA_INTEGRATION_API_TOKEN | Go to Data Integration console > Settings > API Tokens > Add Token, enter a name, and click Create. Copy the token immediately; it's only shown once. |

warning

API tokens are environment-specific. If you have Dev, Staging, and Prod environments, create a separate DATA_INTEGRATION_API_TOKEN for each one.
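A script that depends on these three variables can fail fast when one is missing, instead of sending an unauthenticated request. The sketch below shows one way to do that. The helper name load_credentials is hypothetical; only the three variable names come from this post.

```python
import os

REQUIRED_VARS = (
    "DATA_INTEGRATION_ACCOUNT_ID",
    "DATA_INTEGRATION_ENVIRONMENT_ID",
    "DATA_INTEGRATION_API_TOKEN",
)


def load_credentials(env=None) -> dict:
    """Return the three Boomi values, or raise an error naming every missing variable."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing CI/CD variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling load_credentials() with no argument reads os.environ, which is where CI/CD platforms typically expose secret variables to a running step.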

Supported Data Flow types

Not all Data Flow types are audited equally. The pipeline only audits Data Flows built with Interface = New, which is the new source-to-target (s2t) experience in Boomi Data Integration. Internally, these Data Flows have is_api_v2: true in the API response. If a Data Flow was built with the older interface (is_api_v2: false), it's skipped automatically.

| Interface type | What you see in the UI | Support |
| --- | --- | --- |
| Interface = New | New source-to-target experience | ✅ Supported |
| Interface = Old | Legacy interface | ❌ Not supported |

Within supported Data Flows, the following step types apply:

| Step type | Support |
| --- | --- |
| Logic Flow (logic rivers) | ✅ Supported |
| SQL steps | ✅ Supported |
| Python steps | ❌ Not supported |

Repository layout

Here's what your repository looks like after the initial push. We've named it bdi-audit in the reference implementation, but feel free to rename it to match your team's naming conventions:

bdi-audit/
├── bitbucket-pipelines.yml          # Pipeline definitions (bdi-baseline and bdi-audit)
├── bdi-audit/
│   ├── Get all rivers.py            # Fetches all Data Flows; writes rivers/*.json files
│   ├── Get river details.py         # Fetches per-Data-Flow detail for baseline runs
│   ├── BDI audit.py                 # Change detection script for audit runs
│   └── workflows/
│       └── BDI audit.yml            # Equivalent GitHub Actions workflow (alternative)
├── rivers/
│   ├── all-rivers.json              # Full list of all Data Flows in the environment
│   ├── api_v2_rivers.json           # Filtered list: Interface = New Data Flows only (is_api_v2 = true)
│   └── new_changes.json             # Delta: Interface = New Data Flows modified in last 24 h
└── river-details/
    └── <river_cross_id> <river_name>.json   # One file per Data Flow

A couple of things worth noting: The rivers/ and river-details/ directories are fully pipeline-generated — don't create them manually. Once bdi-baseline runs for the first time, you'll have one snapshot file in river-details/ for every Interface = New Data Flow in your environment.

Setting up the CI/CD pipeline

With your repository in place, you're just a few Git commands away from a running pipeline.

Copy the repository files to your local machine, then initialize and push:

cd /path/to/your-repo-files
git init
git checkout -b main
git add .
git commit -m "Initial commit: version control for Data Integration"
git remote add origin <your-repository-clone-url>
git push -u origin main

Replace <your-repository-clone-url> with the HTTPS clone URL of your repository. You'll find it on your repository's main page. Once the push lands, you're ready to run the pipelines.

Running the pipelines

Run the baseline pipeline

The baseline pipeline establishes your initial Data Flow snapshots. You only need to run it once, unless you want to reseed the repository after a large batch of Data Flow changes.

Trigger the bdi-baseline pipeline manually from your CI/CD platform. It runs Get all rivers.py to fetch all Data Flows, then Get river details.py to snapshot each Data Flow individually, and commits everything to main. Depending on how many Data Flows are in your environment, this typically takes one to five minutes.

Once it's done, verify these files exist in your repository:

  • rivers/all-rivers.json
  • rivers/api_v2_rivers.json
  • rivers/new_changes.json
  • river-details/: one .json file per Interface = New Data Flow
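If you'd rather script this check than eyeball it, something like the sketch below could work. The function name is hypothetical, and the count comparison assumes that api_v2_rivers.json uses the same data.items shape as the new_changes.json sample shown later in this post.

```python
import json
from pathlib import Path


def check_baseline(repo_root: str) -> list:
    """Return a list of problems; an empty list means the baseline looks complete."""
    root = Path(repo_root)
    problems = []
    for name in ("all-rivers.json", "api_v2_rivers.json", "new_changes.json"):
        if not (root / "rivers" / name).is_file():
            problems.append(f"missing rivers/{name}")
    api_v2 = root / "rivers" / "api_v2_rivers.json"
    details = root / "river-details"
    if api_v2.is_file() and details.is_dir():
        # Assumption: the Data Flow list lives under data.items in this file.
        expected = len(json.loads(api_v2.read_text())["data"]["items"])
        actual = len(list(details.glob("*.json")))
        if expected != actual:
            problems.append(f"expected {expected} snapshots, found {actual}")
    elif not details.is_dir():
        problems.append("missing river-details/")
    return problems
```

Run it from the repository root with check_baseline(".") and treat any returned entry as a reason to re-run bdi-baseline.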

Smart first-run behavior

You don't have to run bdi-baseline manually at all if you're in a hurry. On its first run, bdi-audit checks whether rivers/all-rivers.json and the river-details/ directory exist. If either is missing, it automatically falls back to baseline mode, seeds the repository, and exits cleanly with the message: "Baseline complete. Re-run bdi-audit after the next change window."
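The fallback decision described above comes down to two existence checks. Here is a minimal sketch, assuming the paths are resolved relative to the repository root; the function name is hypothetical and the reference pipeline may implement the check differently, for example in shell.

```python
from pathlib import Path


def needs_baseline(repo_root: str) -> bool:
    """True if either baseline artifact is missing, so the run should seed the repository."""
    root = Path(repo_root)
    has_list = (root / "rivers" / "all-rivers.json").is_file()
    has_details = (root / "river-details").is_dir()
    return not (has_list and has_details)
```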

Schedule ongoing audits

Set up a recurring schedule so bdi-audit runs automatically. The default cron expression in the pipeline is 0 * * * * (every hour on the hour). Adjust it to match your team's review cadence using your CI/CD platform's built-in scheduler.

What the pipeline looks like

This section walks you through the pipeline configuration — what it does, what to customize, and what to expect when it runs. The reference repository includes pipeline configuration files for Bitbucket Pipelines and GitHub Actions.

  • What it does: The configuration defines two pipelines: bdi-baseline for the one-time initial snapshot, and bdi-audit for the recurring change-detection run. Both pipelines authenticate to your Boomi Data Integration environment using the three Boomi credentials you set up as CI/CD variables earlier. On each audit run, the pipeline calls the BDI Python scripts, diffs snapshots, and opens one pull request per changed Data Flow.


    Three things to customize before you run:

    • Cron expression: Update 0 * * * * in the schedule trigger to match your preferred audit cadence.
    • Auth variables: Replace the BITBUCKET_USERNAME and BITBUCKET_ACCESS_TOKEN variable references with the authentication variable names your platform uses.
    • Boomi variables: The three Boomi variables (DATA_INTEGRATION_ACCOUNT_ID, DATA_INTEGRATION_ENVIRONMENT_ID, and DATA_INTEGRATION_API_TOKEN) keep the same names across platforms. Just make sure they are set as secrets in your CI/CD settings.
  • What output to expect: When the pipeline runs, console output logs each step. You will see "No baseline changes to commit." if nothing is new, or "Created PR #N for {river_cross_id}" for each changed Data Flow. Artifacts are written to rivers/ and river-details/ and attached to each pipeline run.

Here is the Bitbucket Pipelines configuration from the bdi-audit repository:

image: python:3.11

pipelines:
  custom:
    # Baseline: Fetches every Data Flow and commits initial snapshots.
    # Run once after pushing the repo for the first time.
    bdi-baseline:
      - step:
          name: Fetch and commit baseline Data Flow data
          script:
            - git remote set-url origin "https://${BITBUCKET_USERNAME}:${BITBUCKET_ACCESS_TOKEN}@bitbucket.org/${BITBUCKET_WORKSPACE}/${BITBUCKET_REPO_SLUG}.git"
            - git config user.name "Bitbucket Pipelines"
            - git config user.email "pipelines@bitbucket.org"
            - git pull origin main --rebase
            - python "bdi-audit/Get all rivers.py" --output rivers/all-rivers.json
            - python "bdi-audit/Get river details.py" --input rivers/api_v2_rivers.json --output-dir river-details
            - git add rivers river-details
            - |
              if git diff --cached --quiet; then
                echo "No baseline changes to commit."
              else
                git commit -m "Baseline Data Flow data"
                git push origin HEAD
              fi
          artifacts:
            - rivers/**
            - river-details/**

    # Audit: Detects changed Data Flows and opens one PR per changed Data Flow.
    # On first run (no baseline present), falls back to baseline mode automatically.
    bdi-audit:
      - step:
          name: Data Flow change detection
          script:
            - git remote set-url origin "https://${BITBUCKET_USERNAME}:${BITBUCKET_ACCESS_TOKEN}@..."
            - git config user.name "Bitbucket Pipelines"
            - git pull origin main --rebase
            - python "bdi-audit/Get all rivers.py" --output rivers/all-rivers.json
            - python "bdi-audit/BDI audit.py" \

How the scripts work together

Each of the three Python scripts in bdi-audit/ owns one piece of the pipeline. Here's what each one does, what you might want to tweak, and what to expect as output.

Get all rivers.py

  • What it does: Calls the Data Integration REST API and writes three output files in one pass: all-rivers.json (all Data Flows, any version), api_v2_rivers.json (filtered to is_api_v2: true), and new_changes.json (filtered to is_api_v2: true and last_updated_at within the last 24 hours). The 24-hour window determines which Data Flows the audit script picks up each run.
  • What to modify: If you run audits more frequently, for example every 15 minutes, adjust the pipeline_last_updated_at_hours value in this script so the detection window matches your schedule. The three Boomi variables are read automatically from the environment.
  • What output to expect: Three JSON files in rivers/. An empty data.items array in new_changes.json just means no Data Flows were changed in the detection window. That's a clean result, not an error.

Get river details.py

  • What it does: Reads api_v2_rivers.json and fetches the full configuration for each Data Flow it finds. This script runs exclusively during baseline runs to build the initial snapshots. Each Data Flow gets its own file named {river_cross_id} {river_name}.json in river-details/.
  • What to modify: No customization required. The pipeline reads --input and writes to --output-dir. You run it indirectly through bdi-baseline.
  • What output to expect: One JSON file per Data Flow in river-details/. After a successful baseline, the file count in river-details/ should match the item count in api_v2_rivers.json.
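The snapshot file name follows the pattern {river_cross_id} {river_name}.json. In the sample output later in this post, the name "postgresql-snowflake 2/11/2026, 5:06:16 PM" appears in the file path with "/" and ":" replaced by "-". The sketch below reproduces that mapping. The replacement rule is inferred from that one sample and the function name is hypothetical, so the actual script may sanitize more characters.

```python
def snapshot_path(river_cross_id: str, river_name: str) -> str:
    """Build the river-details/ path for one Data Flow snapshot."""
    # Assumption inferred from the sample output: "/" and ":" become "-".
    safe_name = river_name.replace("/", "-").replace(":", "-")
    return f"river-details/{river_cross_id} {safe_name}.json"
```

Leading the file name with river_cross_id keeps each snapshot traceable to its Data Flow even when two Data Flows share a display name.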

BDI audit.py

  • What it does: Reads new_changes.json (the 24-hour delta), fetches the current configuration for each Data Flow in that list, diffs it against the stored snapshot in river-details/, and writes two summary files — bdi-audit-summary.txt (human-readable) and bdi-audit-summary.json (machine-readable, used by the pipeline to create PRs).
  • What to modify: No customization needed for standard use. If you want to filter specific Data Flows or add custom comparison logic, extend this script.
  • What output to expect: bdi-audit-summary.json is populated with one entry per changed Data Flow. An empty array ([]) means no changes were detected, and the pipeline exits cleanly without opening any PRs.
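To make the diff step concrete, here is one way to compare a stored snapshot with a freshly fetched configuration using only the standard library. This is a sketch of the general technique, not the comparison logic in BDI audit.py, which the post does not show. The function names are hypothetical.

```python
import difflib
import json


def canonical(config: dict) -> list:
    """Serialize with sorted keys so key order alone never shows up as a change."""
    return json.dumps(config, indent=2, sort_keys=True).splitlines()


def diff_snapshot(stored: dict, current: dict) -> list:
    """Return unified-diff lines; an empty list means the Data Flow is unchanged."""
    return list(
        difflib.unified_diff(
            canonical(stored),
            canonical(current),
            fromfile="stored",
            tofile="current",
            lineterm="",
        )
    )
```

Sorting keys before diffing matters because an API may return the same fields in a different order between calls, and a raw text comparison would report that as a change.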

Example: what the output files look like

Let's look at what the two key output files actually contain after a real run.

new_changes.json

After a run that detects a recently modified Data Flow, new_changes.json looks like this (these are real IDs from the environment):

{
  "source": "Boomi Data Integration API",
  "generated_at": "2026-05-18T03:00:04.000000+00:00",
  "account_id": "55bf7c4270fdca16cac18761",
  "environment_id": "6025a4b7f5682c739d83f41f",
  "filters": {
    "is_api_v2": true,
    "pipeline_last_updated_at_hours": 24
  },
  "data": {
    "items": [
      {
        "name": "postgresql-snowflake 2/11/2026, 5:06:16 PM",
        "river_status": "disabled",
        "river_cross_id": "698c9aef8a05d7507fae51ec",
        "last_updated_at": "2026-05-18T02:45:12.000000Z",
        "is_api_v2": true,
        "river_type": "source_to_target"
      }
    ]
  }
}

Field names such as river_status, river_cross_id, and river_type are returned directly by the Boomi Data Integration API, and they use the API's internal naming convention.

When new_changes.json contains items, the pipeline opens a separate PR for each one. An empty data.items array is the normal result when nothing has changed.

  • What to modify: To change the detection window, update pipeline_last_updated_at_hours in Get all rivers.py. The value appears in this file for traceability; you do not edit it here directly.
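If you want to consume new_changes.json from your own tooling, reading it takes only a few lines. The sketch below relies on the data.items structure shown in the sample above; the function name is hypothetical.

```python
import json


def changed_river_ids(new_changes_text: str) -> list:
    """Return the river_cross_id of every item; an empty list is a clean result."""
    document = json.loads(new_changes_text)
    items = document.get("data", {}).get("items", [])
    return [item["river_cross_id"] for item in items]
```

For example, changed_river_ids(open("rivers/new_changes.json").read()) gives you the IDs that the audit script will inspect on that run.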

bdi-audit-summary.json

After BDI audit.py runs, this file drives the PR creation loop:

[
  {
    "river_cross_id": "698c9aef8a05d7507fae51ec",
    "river_name": "postgresql-snowflake 2-11-2026, 5-06-16 PM",
    "status": "updated",
    "path": "river-details/698c9aef8a05d7507fae51ec postgresql-snowflake 2-11-2026, 5-06-16 PM.json"
  }
]

Each entry here maps directly to one branch, one commit, and one pull request. An empty array ([]) indicates that no changes were detected and that no PRs were opened.

  • What to modify: This file is generated output — don't edit it directly. To add custom fields to the PR metadata (for example, a team label or Jira ticket reference), extend BDI audit.py to write those additional fields here.
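As a starting point for that kind of extension, the sketch below merges extra fields into each summary entry. The helper name and the team_label and ticket fields are hypothetical examples, not part of the reference implementation.

```python
def with_pr_metadata(entries: list, extra: dict) -> list:
    """Return copies of the summary entries with extra fields merged in."""
    # Existing keys win, so generated fields such as river_cross_id are never overwritten.
    return [{**extra, **entry} for entry in entries]


if __name__ == "__main__":
    summary = [{"river_cross_id": "698c9aef8a05d7507fae51ec", "status": "updated"}]
    print(with_pr_metadata(summary, {"team_label": "data-platform", "ticket": "DATA-123"}))
```

Because the function returns new dictionaries, the original summary stays untouched and an empty summary still produces an empty list.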

Going further

Once the baseline is running and your team is comfortable with the PR review model, you can extend the system in a few directions:

  • Slack or Teams notifications: Post a message to a channel whenever a new audit PR opens, so your team doesn't have to watch the repository manually.

  • Multi-environment audits: Add separate pipeline steps for Dev, Staging, and Prod, each with their own DATA_INTEGRATION_ENVIRONMENT_ID and DATA_INTEGRATION_API_TOKEN. Store snapshots in environment-scoped subdirectories.

  • Tighter change windows: The default detection window is 24 hours. If you run the audit every 15 minutes, reduce pipeline_last_updated_at_hours in Get all rivers.py to match, so each run only picks up the most recent changes.

  • Historical analysis: Every merged PR is a squash commit in Git, which means you get a full history for free. Run git log --oneline -- river-details/ or git diff to answer questions like "how many times did this Data Flow change in the last 30 days?"
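For the historical analysis idea, a small wrapper around git log can answer the "how many times in the last 30 days" question for a single snapshot file. This is a sketch with hypothetical function names; it assumes git is installed and that it runs inside a clone of the audit repository.

```python
import subprocess


def history_command(snapshot_path: str, since: str = "30 days ago") -> list:
    """Build the git log command that lists commits touching one snapshot file."""
    return ["git", "log", "--oneline", f"--since={since}", "--", snapshot_path]


def count_changes(snapshot_path: str, repo_dir: str = ".", since: str = "30 days ago") -> int:
    """Run the command inside repo_dir and count the commits it prints."""
    result = subprocess.run(
        history_command(snapshot_path, since),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return len(result.stdout.splitlines())
```

Note that the count includes the baseline commit if it falls inside the window, so a Data Flow that was only ever baselined reports one commit.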

Data Flow governance doesn't have to be a manual chore. Once your Data Flow configurations live in Git, every change is reviewable, attributable, and auditable, whether the auto-merge handles it silently or a human approves it manually.

Ready to try it? Grab the reference implementation from the bdi-audit repository, run the baseline once, schedule the audit, and let your CI/CD pipeline do the watching. Got questions, improvements, or a different platform-specific setup that worked for you? Drop a comment — we'd love to hear how you're using it.

For the full REST API reference, including all Data Flow endpoints, refer to Authentication and API tokens and Data Integration Rivers API.