Data Flow governance for Data Integration: automated audit trails with CI/CD
- This blog is specifically about Data Integration — not Boomi Integration (iPaaS). The assets, concepts, and APIs here apply only to Data Integration Data Flows.
- The Boomi Data Integration API uses river in its field names and endpoints, for example river_cross_id and river_status. Throughout this blog, Data Flow is the product term. API field names are shown exactly as they appear in API requests and responses.
Building a Data Flow in Boomi Data Integration is easy: drag, drop, connect, done. But here's the question nobody thinks about until it's urgent — who changed that PostgreSQL-to-Snowflake schedule at 2 AM, and why? Or when did a source mapping shift?
When your governance framework or compliance audit asks, "What changed and when?" you need a real answer. Without a change history, you're stuck piecing things together from UI session logs or asking around until someone remembers.
That's exactly what this version control system solves. It treats your Data Flow configurations as version-controlled artifacts: a CI/CD pipeline automatically snapshots your Data Flows, detects changes, and opens a pull request for every Data Flow that drifts. Your team gets a living audit trail with zero manual effort.
Typical audit lifecycle
Here is how a change-detection-to-review cycle looks end-to-end:
- A team member updates a Data Flow in the Boomi Data Integration UI.
- The bdi-audit pipeline runs on its next scheduled trigger, for example, every hour.
- Get all rivers.py fetches the current Data Flow list and writes new_changes.json with any Data Flows modified in the last 24 hours.
- BDI audit.py fetches the full configuration for each changed Data Flow, compares it to the stored snapshot, and writes the summary files.
- The pipeline creates a branch named bdi-audit/{river_cross_id}-{timestamp}, commits the updated snapshot, and opens a pull request.
- The pipeline attempts to auto-merge using the squash strategy:
  - If the merge succeeds, the snapshot is automatically updated, and the branch is closed.
  - If the merge is blocked, for example by a branch protection rule requiring approval, the PR stays open for a reviewer to inspect.
- The next audit run starts fresh: new_changes.json contains only Data Flows modified in the 24 hours before that run.

How it works
Under the hood, the version control system runs two CI/CD pipelines against your Boomi Data Integration environment — one you run once, and one that runs on repeat.
bdi-baseline is a one-time setup pipeline. It fetches all Data Flows in your environment, snapshots each one as a JSON file, and commits everything to the repository. Think of it as Day 0: you're establishing the known-good state.
bdi-audit is a scheduled pipeline. It runs every hour by default, re-fetches the Data Flow list, and diffs each recently changed Data Flow against its committed snapshot. For any Data Flow built with Interface = New (is_api_v2: true) that changed in the last 24 hours, the pipeline opens one pull request per changed Data Flow, complete with a full diff. If nothing changed, the pipeline exits cleanly. If auto-merge succeeds (squash strategy), the snapshot update is applied automatically. If auto-merge is blocked by a branch protection rule, the PR stays open for your team to review and approve manually.
The result is a pull request-driven review model for your Data Flows: one PR per Data Flow, so reviews stay focused and easy to approve or reject independently.
bdi-audit checks for Data Flows with is_api_v2: true and a last_updated_at within the last 24 hours. Any Data Flow that doesn't meet both conditions is skipped.
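To make the two conditions concrete, here is a minimal sketch of that filter in Python. It is not the code from Get all rivers.py: the helper names are hypothetical, and the timestamp format is assumed from the sample API output shown later in this post.

```python
from datetime import datetime, timedelta, timezone


def parse_api_timestamp(value: str) -> datetime:
    """Parse a timestamp such as '2026-05-18T02:45:12.000000Z' (assumed format)."""
    # Older Python versions reject a trailing 'Z' in fromisoformat(), so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_audit_candidate(flow: dict, now: datetime, window_hours: float = 24) -> bool:
    """Return True only when both audit conditions hold."""
    if flow.get("is_api_v2") is not True:
        return False  # Built with the older interface: skipped.
    updated_at = parse_api_timestamp(flow["last_updated_at"])
    return now - updated_at <= timedelta(hours=window_hours)


# Made-up sample records; only the field names come from the API.
flows = [
    {"river_cross_id": "a1", "is_api_v2": True, "last_updated_at": "2026-05-18T02:45:12.000000Z"},
    {"river_cross_id": "b2", "is_api_v2": False, "last_updated_at": "2026-05-18T02:45:12.000000Z"},
    {"river_cross_id": "c3", "is_api_v2": True, "last_updated_at": "2026-05-10T02:45:12.000000Z"},
]
now = datetime(2026, 5, 18, 3, 0, 4, tzinfo=timezone.utc)
candidates = [f["river_cross_id"] for f in flows if is_audit_candidate(f, now)]
print(candidates)
```

Only the first record passes: the second fails the interface check, and the third falls outside the 24-hour window.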
Prerequisites
Before you set up the version control system, make sure you have the following:
- A Boomi Data Integration account with at least one active environment.
- A private Git repository on your preferred hosting platform.
- Admin-level access to the repository to configure CI/CD pipelines and secret variables.
Boomi Data Integration credentials
You need three values from your Boomi Data Integration environment. Store these as secret variables in your CI/CD settings. The exact location varies by platform. For example, in Bitbucket, that's Repository Settings > Repository variables; in GitHub, it's Settings > Secrets and variables > Actions.
| Variable | Where to find it |
|---|---|
| DATA_INTEGRATION_ACCOUNT_ID | Log in to the Data Integration console, open any Data Flow, and read the account ID from the browser URL: https://console.boomi.com/accounts/{account_id}/environments/... |
| DATA_INTEGRATION_ENVIRONMENT_ID | Same URL as above. The environment ID appears between /environments/ and /rivers/ |
| DATA_INTEGRATION_API_TOKEN | Go to Data Integration console > Settings > API Tokens > Add Token, enter a name, and click Create. Copy the token immediately, because it's only shown once. |
API tokens are environment-specific. If you have Dev, Staging, and Prod environments, create a separate DATA_INTEGRATION_API_TOKEN for each one.
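If you write or extend your own scripts, it helps to fail fast when one of these variables is missing. The following is a rough sketch, not the reference implementation: load_credentials is a hypothetical helper, and only the three variable names come from the table above.

```python
import os

REQUIRED_VARIABLES = (
    "DATA_INTEGRATION_ACCOUNT_ID",
    "DATA_INTEGRATION_ENVIRONMENT_ID",
    "DATA_INTEGRATION_API_TOKEN",
)


def load_credentials(env=None) -> dict:
    """Return the three Boomi values, or raise if any is missing or empty."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARIABLES if not source.get(name)]
    if missing:
        raise RuntimeError("Missing CI/CD variables: " + ", ".join(missing))
    return {name: source[name] for name in REQUIRED_VARIABLES}


# Placeholder values for illustration; in a pipeline, call load_credentials()
# with no argument so it reads the real secrets from the environment.
example = load_credentials({
    "DATA_INTEGRATION_ACCOUNT_ID": "example-account",
    "DATA_INTEGRATION_ENVIRONMENT_ID": "example-environment",
    "DATA_INTEGRATION_API_TOKEN": "example-token",
})
print(sorted(example))
```

Raising one error that names every missing variable saves a round trip through the CI/CD settings page when more than one secret is unset.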
Supported Data Flow types
Not all Data Flow types are audited equally. The pipeline only audits Data Flows built with Interface = New, which is the new source-to-target (s2t) experience in Boomi Data Integration. Internally, these Data Flows have is_api_v2: true in the API response. If a Data Flow was built with the older interface (is_api_v2: false), it's skipped automatically.
| Interface type | What you see in the UI | Support |
|---|---|---|
| Interface = New | New source-to-target experience | ✅ Supported |
| Interface = Old | Legacy interface | ❌ Not supported |
Within supported Data Flows, the following step types apply:
| Step type | Support |
|---|---|
| Logic Flow (logic rivers) | ✅ Supported |
| SQL steps | ✅ Supported |
| Python steps | ❌ Not supported |
Repository layout
Here's what your repository looks like after the initial push. We've named it bdi-audit in the reference implementation, but feel free to rename it to match your team's naming conventions:
bdi-audit/
├── bitbucket-pipelines.yml                  # Pipeline definitions (bdi-baseline and bdi-audit)
├── bdi-audit/
│   ├── Get all rivers.py                    # Fetches all Data Flows; writes rivers/*.json files
│   ├── Get river details.py                 # Fetches per-Data-Flow detail for baseline runs
│   ├── BDI audit.py                         # Change detection script for audit runs
│   └── workflows/
│       └── BDI audit.yml                    # Equivalent GitHub Actions workflow (alternative)
├── rivers/
│   ├── all-rivers.json                      # Full list of all Data Flows in the environment
│   ├── api_v2_rivers.json                   # Filtered list: Interface = New Data Flows only (is_api_v2 = true)
│   └── new_changes.json                     # Delta: Interface = New Data Flows modified in last 24 h
└── river-details/
    └── <river_cross_id> <river_name>.json   # One file per Data Flow
A couple of things worth noting: The rivers/ and river-details/ directories are fully pipeline-generated — don't create them manually. Once bdi-baseline runs for the first time, you'll have one snapshot file in river-details/ for every Interface = New Data Flow in your environment.
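The snapshot file name combines the river_cross_id and the Data Flow name. Judging by the sample output later in this post, characters such as / and : in the name appear to be replaced with hyphens. Here is a small sketch of that naming rule. The replacement logic is inferred from the samples, not taken from the scripts, and snapshot_path is a hypothetical helper.

```python
from pathlib import Path


def snapshot_path(river_cross_id: str, river_name: str, root: str = "river-details") -> Path:
    """Build '<root>/<river_cross_id> <river_name>.json' with a file-safe name."""
    # Assumption inferred from the sample output: '/' and ':' become '-'.
    safe_name = river_name.replace("/", "-").replace(":", "-")
    return Path(root) / f"{river_cross_id} {safe_name}.json"


path = snapshot_path(
    "698c9aef8a05d7507fae51ec",
    "postgresql-snowflake 2/11/2026, 5:06:16 PM",
)
print(path.as_posix())
# prints: river-details/698c9aef8a05d7507fae51ec postgresql-snowflake 2-11-2026, 5-06-16 PM.json
```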
Setting up the CI/CD pipeline
With your repository in place, you're just a few Git commands away from a running pipeline.
Copy the repository files to your local machine, then initialize and push:
cd /path/to/your-repo-files
git init
git checkout -b main
git add .
git commit -m "Initial commit: version control for Data Integration"
git remote add origin <your-repository-clone-url>
git push -u origin main
Replace <your-repository-clone-url> with the HTTPS clone URL of your repository. You'll find it on your repository's main page. Once the push lands, you're ready to run the pipelines.
Running the pipelines
Run the baseline pipeline
The baseline pipeline establishes your initial Data Flow snapshots. You only need to run it once, unless you want to reseed the repository after a large batch of Data Flow changes.
Trigger the bdi-baseline pipeline manually from your CI/CD platform. It runs Get all rivers.py to fetch all Data Flows, then Get river details.py to snapshot each Data Flow individually, and commits everything to main. Depending on how many Data Flows are in your environment, this typically takes one to five minutes.
Once it's done, verify these files exist in your repository:
- rivers/all-rivers.json
- rivers/api_v2_rivers.json
- rivers/new_changes.json
- river-details/: one .json file per Interface = New Data Flow
Smart first-run behavior
You don't have to run bdi-baseline manually at all if you're in a hurry. On its first run, bdi-audit checks whether rivers/all-rivers.json and the river-details/ directory exist. If either is missing, it automatically falls back to baseline mode, seeds the repository, and exits cleanly with the message: "Baseline complete. Re-run bdi-audit after the next change window."
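The first-run check boils down to two path tests. Here is a minimal sketch, assuming the check runs from the repository root. needs_baseline is a hypothetical name, and the real pipeline may implement this differently.

```python
import tempfile
from pathlib import Path


def needs_baseline(repo_root: Path) -> bool:
    """Return True when either baseline artifact is missing."""
    list_file = repo_root / "rivers" / "all-rivers.json"
    details_dir = repo_root / "river-details"
    return not (list_file.is_file() and details_dir.is_dir())


# Demonstrate both outcomes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = needs_baseline(root)  # Fresh repository: nothing seeded yet.

    (root / "rivers").mkdir()
    (root / "rivers" / "all-rivers.json").write_text("{}", encoding="utf-8")
    (root / "river-details").mkdir()
    after = needs_baseline(root)  # Both artifacts now exist.

print(before, after)
# prints: True False
```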
Schedule ongoing audits
Set up a recurring schedule so bdi-audit runs automatically. The default cron expression in the pipeline is 0 * * * * (every hour on the hour). Adjust it to match your team's review cadence using your CI/CD platform's built-in scheduler.
What the pipeline looks like
This section walks you through the pipeline configuration — what it does, what to customize, and what to expect when it runs. The reference repository includes pipeline configuration files for Bitbucket Pipelines and GitHub Actions.
- What it does: The configuration defines two pipelines: bdi-baseline for the one-time initial snapshot, and bdi-audit for the recurring change-detection run. Both pipelines authenticate to your Boomi Data Integration environment using the three Boomi credentials you set up as CI/CD variables earlier. On each audit run, the pipeline calls the BDI Python scripts, diffs snapshots, and opens one pull request per changed Data Flow.
- Three things to customize before you run:
  - Cron expression: Update 0 * * * * in the schedule trigger to match your preferred audit cadence.
  - Auth variables: Replace the BITBUCKET_USERNAME and BITBUCKET_ACCESS_TOKEN variable references with the authentication variable names your platform uses.
  - Boomi variables: The three Boomi variables (DATA_INTEGRATION_ACCOUNT_ID, DATA_INTEGRATION_ENVIRONMENT_ID, and DATA_INTEGRATION_API_TOKEN) should remain the same across platforms. Just ensure they are set as secrets in your CI/CD settings.
- What output to expect: When the pipeline runs, console output logs each step. You will see No baseline changes to commit if nothing is new, or Created PR #N for {river_cross_id} for each changed Data Flow. Artifacts are written to rivers/ and river-details/ and attached to each pipeline run.
Here is the Bitbucket Pipelines configuration from the bdi-audit repository:
image: python:3.11

pipelines:
  custom:
    # Baseline: Fetches every Data Flow and commits initial snapshots.
    # Run once after pushing the repo for the first time.
    bdi-baseline:
      - step:
          name: Fetch and commit baseline Data Flow data
          script:
            - git remote set-url origin "https://${BITBUCKET_USERNAME}:${BITBUCKET_ACCESS_TOKEN}@bitbucket.org/${BITBUCKET_WORKSPACE}/${BITBUCKET_REPO_SLUG}.git"
            - git config user.name "Bitbucket Pipelines"
            - git config user.email "pipelines@bitbucket.org"
            - git pull origin main --rebase
            - python "bdi-audit/Get all rivers.py" --output rivers/all-rivers.json
            - python "bdi-audit/Get river details.py" --input rivers/api_v2_rivers.json --output-dir river-details
            - git add rivers river-details
            - |
              if git diff --cached --quiet; then
                echo "No baseline changes to commit."
              else
                git commit -m "Baseline Data Flow data"
                git push origin HEAD
              fi
          artifacts:
            - rivers/**
            - river-details/**
    # Audit: Detects changed Data Flows and opens one PR per changed Data Flow.
    # On first run (no baseline present), falls back to baseline mode automatically.
    bdi-audit:
      - step:
          name: Data Flow change detection
          script:
            - git remote set-url origin "https://${BITBUCKET_USERNAME}:${BITBUCKET_ACCESS_TOKEN}@..."
            - git config user.name "Bitbucket Pipelines"
            - git pull origin main --rebase
            - python "bdi-audit/Get all rivers.py" --output rivers/all-rivers.json
            - python "bdi-audit/BDI audit.py" \
How the scripts work together
Each of the three Python scripts in bdi-audit/ owns one piece of the pipeline. Here's what each one does, what you might want to tweak, and what to expect as output.
Get all rivers.py
- What it does: Calls the Data Integration REST API and writes three output files in one pass: all-rivers.json (all Data Flows, any version), api_v2_rivers.json (filtered to is_api_v2: true), and new_changes.json (filtered to is_api_v2: true and last_updated_at within the last 24 hours). The 24-hour window determines which Data Flows the audit script picks up each run.
- What to modify: If you're running audits more frequently, for example, every 15 minutes, adjust the pipeline_last_updated_at_hours value in this script so the detection window matches your schedule. The three Boomi variables are read from the environment automatically.
- What output to expect: Three JSON files in rivers/. An empty data.items array in new_changes.json just means no Data Flows were changed in the detection window. That's a clean result, not an error.
Get river details.py
- What it does: Reads api_v2_rivers.json and fetches the full configuration for each Data Flow it finds. This script runs exclusively during baseline runs to build the initial snapshots. Each Data Flow gets its own file named {river_cross_id} {river_name}.json in river-details/.
- What to modify: No customization required. The pipeline passes --input and --output-dir for you, and you run the script indirectly through bdi-baseline.
- What output to expect: One JSON file per Data Flow in river-details/. After a successful baseline, the file count in river-details/ should match the item count in api_v2_rivers.json.
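That file-count check is easy to script. The sketch below is an illustration only: baseline_counts is a hypothetical helper, and it assumes api_v2_rivers.json uses the same data.items layout as the new_changes.json sample shown later in this post.

```python
import json
import tempfile
from pathlib import Path


def baseline_counts(repo_root: Path) -> tuple:
    """Return (items listed in api_v2_rivers.json, snapshot files in river-details/)."""
    listing_path = repo_root / "rivers" / "api_v2_rivers.json"
    listing = json.loads(listing_path.read_text(encoding="utf-8"))
    expected = len(listing["data"]["items"])  # Assumed layout: {"data": {"items": [...]}}
    actual = len(list((repo_root / "river-details").glob("*.json")))
    return expected, actual


# Build a tiny fake repository so the check can run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "rivers").mkdir()
    (root / "river-details").mkdir()
    items = [{"river_cross_id": "demo-1"}, {"river_cross_id": "demo-2"}]
    (root / "rivers" / "api_v2_rivers.json").write_text(
        json.dumps({"data": {"items": items}}), encoding="utf-8"
    )
    for item in items:
        snapshot = root / "river-details" / f"{item['river_cross_id']} demo.json"
        snapshot.write_text("{}", encoding="utf-8")
    expected, actual = baseline_counts(root)

print("OK" if expected == actual else f"Mismatch: {expected} listed, {actual} snapshots")
```

Against a real checkout, you would call baseline_counts(Path(".")) from the repository root after the baseline pipeline finishes.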
BDI audit.py
- What it does: Reads new_changes.json (the 24-hour delta), fetches the current configuration for each Data Flow in that list, diffs it against the stored snapshot in river-details/, and writes two summary files: bdi-audit-summary.txt (human-readable) and bdi-audit-summary.json (machine-readable, used by the pipeline to create PRs).
- What to modify: No customization needed for standard use. If you want to filter specific Data Flows or add custom comparison logic, extend this script.
- What output to expect: bdi-audit-summary.json is populated with one entry per changed Data Flow. An empty array ([]) means no changes were detected, and the pipeline exits cleanly without opening any PRs.
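If you do extend the comparison logic, one common approach is to serialize both configurations with sorted keys before diffing, so key order alone never registers as a change. The sketch below uses Python's standard difflib for that. It is an assumption about how such a comparison could work, not the logic inside BDI audit.py, and the sample values are made up.

```python
import difflib
import json


def canonical_lines(config: dict) -> list:
    """Serialize with sorted keys so key order never shows up as a change."""
    return json.dumps(config, indent=2, sort_keys=True).splitlines()


def diff_snapshot(stored: dict, current: dict, label: str) -> list:
    """Return unified-diff lines; an empty list means the snapshot is unchanged."""
    return list(
        difflib.unified_diff(
            canonical_lines(stored),
            canonical_lines(current),
            fromfile=f"stored/{label}",
            tofile=f"current/{label}",
            lineterm="",
        )
    )


# Made-up sample configurations; the keys are in a different order on purpose.
stored = {"river_status": "active", "name": "demo"}
current = {"name": "demo", "river_status": "disabled"}

changes = diff_snapshot(stored, current, "demo.json")
print("\n".join(changes) if changes else "no changes")
```

Here only the river_status line appears in the diff, even though the two dictionaries list their keys in a different order.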
Example: what the output files look like
Let's look at what the two key output files actually contain after a real run.
new_changes.json
After a run that detects a recently modified Data Flow, new_changes.json looks like this (these are real IDs from the environment):
{
  "source": "Boomi Data Integration API",
  "generated_at": "2026-05-18T03:00:04.000000+00:00",
  "account_id": "55bf7c4270fdca16cac18761",
  "environment_id": "6025a4b7f5682c739d83f41f",
  "filters": {
    "is_api_v2": true,
    "pipeline_last_updated_at_hours": 24
  },
  "data": {
    "items": [
      {
        "name": "postgresql-snowflake 2/11/2026, 5:06:16 PM",
        "river_status": "disabled",
        "river_cross_id": "698c9aef8a05d7507fae51ec",
        "last_updated_at": "2026-05-18T02:45:12.000000Z",
        "is_api_v2": true,
        "river_type": "source_to_target"
      }
    ]
  }
}
Field names such as river_status, river_cross_id, and river_type are returned directly by the Boomi Data Integration API, and they use the API's internal naming convention.
When new_changes.json contains items, the pipeline opens a separate PR for each one. An empty data.items array is the normal result when nothing has changed.
- What to modify: To change the detection window, update pipeline_last_updated_at_hours in Get all rivers.py. The value appears in this file for traceability; you do not edit it here directly.
bdi-audit-summary.json
After BDI audit.py runs, this file drives the PR creation loop:
[
  {
    "river_cross_id": "698c9aef8a05d7507fae51ec",
    "river_name": "postgresql-snowflake 2-11-2026, 5-06-16 PM",
    "status": "updated",
    "path": "river-details/698c9aef8a05d7507fae51ec postgresql-snowflake 2-11-2026, 5-06-16 PM.json"
  }
]
Each entry here maps directly to one branch, one commit, and one pull request. An empty array ([]) indicates that no changes were detected and that no PRs were opened.
- What to modify: This file is generated output, so don't edit it directly. To add custom fields to the PR metadata (for example, a team label or Jira ticket reference), extend BDI audit.py to write those additional fields here.
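To illustrate how a summary entry could map to a branch and a pull request, here is a rough sketch. The branch pattern bdi-audit/{river_cross_id}-{timestamp} comes from the lifecycle described earlier. The timestamp format, the PR title, and the helper names are assumptions made for this example.

```python
from datetime import datetime, timezone


def branch_name(river_cross_id: str, when: datetime) -> str:
    """Build 'bdi-audit/{river_cross_id}-{timestamp}'."""
    # Assumption: a compact UTC timestamp; the real pipeline may format it differently.
    return f"bdi-audit/{river_cross_id}-{when.strftime('%Y%m%d%H%M%S')}"


def plan_pull_requests(summary: list, when: datetime) -> list:
    """Turn each summary entry into one branch, one file to commit, and one PR title."""
    return [
        {
            "branch": branch_name(entry["river_cross_id"], when),
            "path": entry["path"],
            # Hypothetical title format, for illustration only.
            "title": f"Data Flow {entry['status']}: {entry['river_name']}",
        }
        for entry in summary
    ]


summary = [
    {
        "river_cross_id": "698c9aef8a05d7507fae51ec",
        "river_name": "postgresql-snowflake 2-11-2026, 5-06-16 PM",
        "status": "updated",
        "path": "river-details/698c9aef8a05d7507fae51ec postgresql-snowflake 2-11-2026, 5-06-16 PM.json",
    }
]
run_time = datetime(2026, 5, 18, 3, 0, 4, tzinfo=timezone.utc)
plan = plan_pull_requests(summary, run_time)
print(plan[0]["branch"])
# prints: bdi-audit/698c9aef8a05d7507fae51ec-20260518030004
```

An empty summary produces an empty plan, which matches the clean exit described above.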
Going further
Once the baseline is running and your team is comfortable with the PR review model, you can extend the system in a few directions:
- Slack or Teams notifications: Post a message to a channel whenever a new audit PR opens, so your team doesn't have to watch the repository manually.
- Multi-environment audits: Add separate pipeline steps for Dev, Staging, and Prod, each with their own DATA_INTEGRATION_ENVIRONMENT_ID and DATA_INTEGRATION_API_TOKEN. Store snapshots in environment-scoped subdirectories.
- Tighter change windows: The default detection window is 24 hours. If you run the audit every 15 minutes, reduce pipeline_last_updated_at_hours in Get all rivers.py to match, so each run only picks up the most recent changes.
- Historical analysis: Every merged PR is a squash commit in Git, which means you get a full history for free. Run git log --oneline -- river-details/ or git diff to answer questions like "how many times did this Data Flow change in the last 30 days?"
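For the historical-analysis idea, a short script can turn the 30-day question into a number. This sketch wraps git log with its --since option. The helper names are hypothetical, and it assumes each merged audit PR lands as one commit touching the snapshot file.

```python
import subprocess


def history_command(snapshot_path: str, days: int = 30) -> list:
    """Build the git command that lists commits touching one snapshot."""
    return ["git", "log", "--oneline", f"--since={days} days ago", "--", snapshot_path]


def count_changes(snapshot_path: str, days: int = 30, repo_dir: str = ".") -> int:
    """Count commits that touched the snapshot in the last `days` days."""
    result = subprocess.run(
        history_command(snapshot_path, days),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # Raise if git fails, for example outside a repository.
    )
    return len(result.stdout.splitlines())


# Printing the command is safe anywhere; call count_changes() inside the audit repository.
print(" ".join(history_command("river-details/")))
```

Pass the path of a single snapshot file to count changes for one Data Flow, or river-details/ to count across all of them.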
Data Flow governance doesn't have to be a manual chore. Once your Data Flow configurations live in Git, every change is reviewable, attributable, and auditable, whether the auto-merge handles it silently or a human approves it manually.
Ready to try it? Grab the reference implementation from the bdi-audit repository, run the baseline once, schedule the audit, and let your CI/CD pipeline do the watching. Got questions, improvements, or a different platform-specific setup that worked for you? Drop a comment — we'd love to hear how you're using it.
For the full REST API reference, including all Data Flow endpoints, refer to Authentication and API tokens and Data Integration Rivers API.
