Arcform

Local-first, asset-aware data pipelines. YAML steps, DuckDB engine, one Rust binary.

Arcform runs analytical data pipelines from a single arcform.yaml — steps that produce assets, wired by the data dependencies between them, executed against DuckDB by one static binary (arc).

Early Release

Arcform is at v0.1.0 and under active development. The manifest schema, CLI, and asset model may change between releases. There is no published binary yet — build from source, and pin to a commit if stability matters.

Why it matters

A real analytical workflow is a chain: fetch a couple of files, load them into DuckDB, model them with SQL, export the result. People wire this together with a Makefile or a pile of shell scripts — and nothing in that pile knows what depends on what, so a one-line change means re-running everything.

Arcform treats the data outputs as the graph. It reads each SQL step with sqlparser-rs to discover which tables the step reads and writes, so the dependency graph is built from lineage rather than hand-maintained ordering. Preconditions skip steps whose inputs are still fresh; a step's SQL is content-hashed so unchanged work isn't repeated. The manifest is the whole pipeline — the same one runs on your laptop and on a schedule.
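
To make the lineage idea concrete, here is a minimal Python sketch; it is not Arcform's implementation. Arcform parses SQL with sqlparser-rs in Rust, whereas this sketch swaps in two simple regular expressions. The SQL bodies and the table names raw_installs and trending are hypothetical (only ranking appears in the example pipeline). The sketch derives each step's reads and writes, then orders the steps so that every table is written before it is read.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical SQL bodies, deliberately listed out of order.
STEPS = {
    "rank": "CREATE OR REPLACE TABLE ranking AS "
            "SELECT * FROM trending JOIN raw_installs USING (name);",
    "trending": "CREATE OR REPLACE TABLE trending AS SELECT * FROM raw_installs;",
    "load": "CREATE OR REPLACE TABLE raw_installs AS "
            "SELECT * FROM read_json('data/cask/30d.json');",
}

# Regex stand-ins for a real SQL parser (an assumption made for brevity).
WRITES = re.compile(r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(\w+)", re.IGNORECASE)
READS = re.compile(r"(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)


def lineage(sql):
    """Return (reads, writes) as sets of lower-cased names for one SQL script."""
    writes = {name.lower() for name in WRITES.findall(sql)}
    reads = {name.lower() for name in READS.findall(sql)} - writes
    return reads, writes


def execution_order(steps):
    """Order steps so that every table is written before it is read."""
    info = {name: lineage(sql) for name, sql in steps.items()}
    # Map each table to the step that writes it.
    producer = {table: name for name, (_, writes) in info.items() for table in writes}
    # A step depends on the producers of the tables it reads; names that no
    # step produces (such as the read_json function) are ignored.
    graph = {
        name: {producer[table] for table in reads if table in producer}
        for name, (reads, _) in info.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(STEPS))  # ['load', 'trending', 'rank']
```

The order comes out of the SQL itself: shuffling the entries in STEPS does not change the result, which is the property that hand-maintained ordering in a Makefile lacks.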

Meridian's own dataset publishing is converging on arcform: the goal is that every published dataset carries its arcform.yaml as its provenance record — the pipeline you can read and rerun. (Today's publish scripts are moving over incrementally.)

Installation

No release binaries yet — build from source with a recent Rust toolchain:

git clone https://github.com/meridian-online/arcform
cd arcform
cargo install --path .

Arcform delegates SQL execution to the DuckDB CLI, so duckdb must be on your PATH at run time. The build also links the DuckDB library; on macOS, if the linker reports library 'duckdb' not found, install DuckDB (brew install duckdb) and point the linker at it:

LIBRARY_PATH=/opt/homebrew/lib cargo install --path .

Commands

arc init <name>            # scaffold a new project (arcform.yaml, models/, sources/)
arc run                    # run the pipeline in the current directory
arc run --force            # re-run every step, ignoring staleness
arc run --param KEY=VALUE  # override a runtime parameter (repeatable)

The arc registry subcommands (list, show, fetch, run) are scaffolding for a future curated pipeline catalogue; the transport isn't wired to a live index yet.
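
The repeatable --param KEY=VALUE flag listed above can be pictured as overrides layered on top of the defaults declared in the manifest. The Python sketch below is illustrative only: resolve_params is a hypothetical helper, and both keeping values as strings and rejecting undeclared keys are assumptions rather than documented Arcform behaviour.

```python
def resolve_params(declared, overrides):
    """Layer repeated KEY=VALUE overrides on top of manifest defaults."""
    resolved = {name: spec.get("default") for name, spec in declared.items()}
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        if key not in resolved:
            # Assumption: an override for an undeclared parameter is an error.
            raise KeyError(f"unknown parameter: {key}")
        resolved[key] = value  # a later override wins over an earlier one
    return resolved


# Mirrors a params: block that declares trend_threshold with default "10".
declared = {"trend_threshold": {"default": "10"}}

print(resolve_params(declared, []))                      # {'trend_threshold': '10'}
print(resolve_params(declared, ["trend_threshold=50"]))  # {'trend_threshold': '50'}
```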

A worked example: publishing a dataset

The repo's reference pipeline, examples/brewtrend, ranks trending Homebrew packages in four stages: fetch, load, model, and publish a Parquet file. Every step is either a command: (raw shell) or a sql: (a .sql file run against DuckDB). Abridged:

# examples/brewtrend/arcform.yaml
name: brewtrend
engine: duckdb
engine_version: ">=1.2"
db: trends.db

params:
  trend_threshold:
    default: "10"

steps:
  # Fetch: pull Homebrew analytics (cached 24h via a precondition)
  - name: fetch_cask_30d
    command: "mkdir -p data/cask && curl -fsSL https://formulae.brew.sh/api/analytics/cask-install/homebrew-cask/30d.json -o data/cask/30d.json"
    produces: [cask_30d]
    preconditions:
      - modified_after: { path: data/cask/30d.json, period: 24h }
  # … five more fetch_* steps (cask/formula catalogues + 90d analytics) …

  # Load: raw JSON → DuckDB tables
  - name: load
    sql: models/load_data.sql
    depends_on: [cask_json, formula_json, cask_30d, cask_90d, formula_30d, formula_90d]

  # Transform: compute the trend metric (reads trend_threshold)
  - name: trending
    sql: models/trending.sql

  # Publish: final ranking → data/ranking.parquet
  - name: rank
    sql: models/full_ranking.sql

The rank step is a plain sql: file that builds the ranking table and exports it:

-- models/full_ranking.sql (tail)
CREATE OR REPLACE TABLE ranking AS SELECT … ;

-- Export the ranking as the pipeline's portable output artifact.
COPY ranking TO 'data/ranking.parquet';

Arcform reads that SQL and learns, on its own, that rank writes the ranking table and depends on what the upstream steps produced — no manual DAG wiring. Run it:

cd examples/brewtrend
arc run

This fetches the source data into data/, loads it into trends.db, computes the ranking, and writes data/ranking.parquet — the pipeline's portable output. Re-running within 24h skips the cached fetches and just recomputes; arc run --force re-runs everything; arc run --param trend_threshold=50 raises the bar for what counts as trending.
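
The two skip rules at work in a re-run, a 24h freshness precondition on fetched files and a content hash on each step's SQL, can be sketched in Python as follows. This is a sketch under stated assumptions, not Arcform's code: the function names are hypothetical, the period units other than h are guesses (only 24h appears in the example manifest), SHA-256 is an arbitrary choice of hash, and how Arcform stores and compares hashes is not shown.

```python
import hashlib
import os
import re
import time

# Assumed unit suffixes; the example manifest only demonstrates "24h".
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_period(text):
    """Convert a period such as '24h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported period: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def modified_after(path, period, now=None):
    """True when path exists and was modified within the last `period`."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) <= parse_period(period)


def sql_fingerprint(sql_text):
    """Content hash of a step's SQL; an unchanged hash means unchanged SQL."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()


# A fetch step would be skipped while its output file is still fresh.
if modified_after("data/cask/30d.json", "24h"):
    print("fetch_cask_30d: output is fresh, skipping")
```

Under this reading, arc run --force corresponds to bypassing both checks.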

The example also ships a Frictionless datapackage.json describing that output — field names, types, and provenance. Per Meridian's convention, a Data Package describes the data; it is not the runnable artifact. arc run executes arcform.yaml — there is no arc run datapackage.json.
