MeridianMERIDIAN

Build a Typed DuckDB Pipeline

Go from raw CSV to a fully typed DuckDB database — profile, export a schema, validate, materialise, and query with proper types.

Goal: Turn a raw CSV into a fully typed DuckDB database — profile it, export a JSON Schema, then validate and materialise a typed table in one pass.

Prerequisites

ToolPurpose
FineTypeSemantic type detection and SQL generation
DuckDBAnalytical SQL engine

Sample data

Save this as orders.csv:

order_id,customer_email,order_date,ship_date,amount,currency,country
ORD-001,[email protected],2024-01-15,2024-01-18,149.99,USD,United States
ORD-002,[email protected],2024-02-20,2024-02-22,2500.00,GBP,United Kingdom
ORD-003,[email protected],03/08/2024,03/10/2024,89.50,USD,United States
ORD-004,[email protected],2024-04-12,2024-04-15,1200.00,EUR,Germany
ORD-005,[email protected],2024-05-01,2024-05-03,340.75,AUD,Australia

Notice the date formats are inconsistent — rows 1, 2, 4, and 5 use ISO format (2024-01-15) while row 3 uses US slash format (03/08/2024). FineType profiles the column as the majority format, then flags the odd row out at validation time.

Steps

1. Profile the CSV

Run profile to discover the semantic type of every column:

finetype profile -f orders.csv
FineType Column Profile — "orders.csv" (5 rows, 7 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  order_id                  representation.identifier.alphanumeric_id  VARCHAR  96.7%
  customer_email            identity.person.email                   VARCHAR 100.0%
  order_date                datetime.date.iso                          DATE 100.0%
  ship_date                 datetime.date.iso                          DATE 100.0%
  amount                    finance.currency.amount                 DECIMAL  99.8% [header_hint_cross_domain:amount]
  currency                  finance.currency.currency_code          VARCHAR 100.0%
  country                   geography.location.country              VARCHAR 100.0%

7/7 columns typed, 5 rows analyzed

FineType has classified all seven columns. It reads order_date and ship_date as ISO dates — the majority format — at full confidence. The lone US-slash row isn't visible here; it surfaces as a reject when you validate against the schema in step 3.

2. Export a JSON Schema

Export the profile as a JSON Schema — the contract validate casts against:

finetype profile -f orders.csv -o json-schema > orders.schema.json

Every property carries its semantic type (x-finetype-label), the validation pattern, and length constraints — a machine-readable contract you can commit and validate against.

3. Validate and materialise the typed table

Run validate with --db/--table to gate the data and cast the valid rows into a typed DuckDB table in a single pass:

finetype validate orders.csv orders.schema.json --db orders.duckdb --table orders
Validation Report
════════════════════════════════════════════════════════════
  Input:        orders.csv
  Schema:       orders.schema.json
  Output DB:    orders.duckdb
  Target table: orders
  Scan ID:      1

  Total rows:             5
  Valid rows:             4
  Invalid rows:           1
  Rejects:                2
    SEMANTIC_TYPE:        2
    TRANSFORM_FAILED:     0
  Grade:             C
════════════════════════════════════════════════════════════

The slash-format row (ORD-003) fails the ISO-date pattern on both order_date and ship_date — two rejects, one invalid row, Grade C, exit code 1. The database file orders.duckdb now contains a fully typed orders table holding the 4 valid rows, plus a finetype_reject_errors sidecar capturing everything that failed validation or the typed cast.

4. See the difference

Reading the CSV directly, DuckDB sniffs amount as a number but leaves the dates as text. Compare:

duckdb -c "SELECT column_name, column_type FROM (DESCRIBE SELECT * FROM read_csv('orders.csv'))"
┌────────────────┬─────────────┐
│  column_name   │ column_type │
├────────────────┼─────────────┤
│ order_id       │ VARCHAR     │
│ customer_email │ VARCHAR     │
│ order_date     │ VARCHAR     │
│ ship_date      │ VARCHAR     │
│ amount         │ DOUBLE      │
│ currency       │ VARCHAR     │
│ country        │ VARCHAR     │
└────────────────┴─────────────┘

Now check the typed table:

duckdb orders.duckdb -c "SELECT column_name, column_type FROM (DESCRIBE orders)"
┌────────────────┬───────────────┐
│  column_name   │  column_type  │
├────────────────┼───────────────┤
│ order_id       │ VARCHAR       │
│ customer_email │ VARCHAR       │
│ order_date     │ DATE          │
│ ship_date      │ DATE          │
│ amount         │ DECIMAL(18,2) │
│ currency       │ VARCHAR       │
│ country        │ VARCHAR       │
└────────────────┴───────────────┘

Dates are now DATE, and amount is a fixed-precision DECIMAL(18,2) rather than a lossy DOUBLE. Queries that subtract dates or sum money will work correctly.

5. Query with proper types

With proper types in place, date arithmetic and numeric aggregation just work:

-- Orders shipped within 2 days
SELECT order_id, customer_email, order_date, ship_date,
       ship_date - order_date AS shipping_days
FROM orders
WHERE ship_date - order_date <= 2;
┌──────────┬──────────────────┬────────────┬────────────┬───────────────┐
│ order_id │  customer_email  │ order_date │ ship_date  │ shipping_days │
├──────────┼──────────────────┼────────────┼────────────┼───────────────┤
│ ORD-002  │ [email protected]  │ 2024-02-20 │ 2024-02-22 │             2 │
│ ORD-005  │ [email protected] │ 2024-05-01 │ 2024-05-03 │             2 │
└──────────┴──────────────────┴────────────┴────────────┴───────────────┘
-- Total revenue by currency
SELECT currency, sum(amount) AS total, count(*) AS orders
FROM orders
GROUP BY currency
ORDER BY total DESC;
┌──────────┬─────────┬────────┐
│ currency │  total  │ orders │
├──────────┼─────────┼────────┤
│ GBP      │ 2500.00 │      1 │
│ EUR      │ 1200.00 │      1 │
│ AUD      │  340.75 │      1 │
│ USD      │  149.99 │      1 │
└──────────┴─────────┴────────┘

Only the four valid rows made it into the table — ORD-003 was rejected at validation, so it never reaches your queries. With typed data, sum() returns a decimal and subtracting two DATEs returns a whole number of days.

What you learned

  • finetype profile discovers semantic types in a CSV
  • finetype profile -o json-schema exports those types as a JSON Schema contract
  • finetype validate ... --db --table gates the data and materialises a typed DuckDB table in one pass
  • Typed columns enable date arithmetic, numeric aggregation, and correct sorting

See also

On this page