Build a Typed DuckDB Pipeline
Go from raw CSV to a fully typed DuckDB database — profile, export a schema, validate, materialise, and query with proper types.
Goal: Turn a raw CSV into a fully typed DuckDB database — profile it, export a JSON Schema, then validate and materialise a typed table in one pass.
Prerequisites
Sample data
Save this as orders.csv:
order_id,customer_email,order_date,ship_date,amount,currency,country
ORD-001,[email protected],2024-01-15,2024-01-18,149.99,USD,United States
ORD-002,[email protected],2024-02-20,2024-02-22,2500.00,GBP,United Kingdom
ORD-003,[email protected],03/08/2024,03/10/2024,89.50,USD,United States
ORD-004,[email protected],2024-04-12,2024-04-15,1200.00,EUR,Germany
ORD-005,[email protected],2024-05-01,2024-05-03,340.75,AUD,AustraliaNotice the date formats are inconsistent — rows 1, 2, 4, and 5 use ISO format (2024-01-15) while row 3 uses US slash format (03/08/2024). FineType profiles the column as the majority format, then flags the odd row out at validation time.
Steps
1. Profile the CSV
Run profile to discover the semantic type of every column:
finetype profile -f orders.csvFineType Column Profile — "orders.csv" (5 rows, 7 columns)
════════════════════════════════════════════════════════════════════════════════
COLUMN TYPE BROAD CONF
──────────────────────────────────────────────────────────────────────────────
order_id representation.identifier.alphanumeric_id VARCHAR 96.7%
customer_email identity.person.email VARCHAR 100.0%
order_date datetime.date.iso DATE 100.0%
ship_date datetime.date.iso DATE 100.0%
amount finance.currency.amount DECIMAL 99.8% [header_hint_cross_domain:amount]
currency finance.currency.currency_code VARCHAR 100.0%
country geography.location.country VARCHAR 100.0%
7/7 columns typed, 5 rows analyzedFineType has classified all seven columns. It reads order_date and ship_date as ISO dates — the majority format — at full confidence. The lone US-slash row isn't visible here; it surfaces as a reject when you validate against the schema in step 3.
2. Export a JSON Schema
Export the profile as a JSON Schema — the contract validate casts against:
finetype profile -f orders.csv -o json-schema > orders.schema.jsonEvery property carries its semantic type (x-finetype-label), the validation pattern, and length constraints — a machine-readable contract you can commit and validate against.
3. Validate and materialise the typed table
Run validate with --db/--table to gate the data and cast the valid rows into a typed DuckDB table in a single pass:
finetype validate orders.csv orders.schema.json --db orders.duckdb --table ordersValidation Report
════════════════════════════════════════════════════════════
Input: orders.csv
Schema: orders.schema.json
Output DB: orders.duckdb
Target table: orders
Scan ID: 1
Total rows: 5
Valid rows: 4
Invalid rows: 1
Rejects: 2
SEMANTIC_TYPE: 2
TRANSFORM_FAILED: 0
Grade: C
════════════════════════════════════════════════════════════The slash-format row (ORD-003) fails the ISO-date pattern on both order_date and ship_date — two rejects, one invalid row, Grade C, exit code 1. The database file orders.duckdb now contains a fully typed orders table holding the 4 valid rows, plus a finetype_reject_errors sidecar capturing everything that failed validation or the typed cast.
4. See the difference
Reading the CSV directly, DuckDB sniffs amount as a number but leaves the dates as text. Compare:
duckdb -c "SELECT column_name, column_type FROM (DESCRIBE SELECT * FROM read_csv('orders.csv'))"┌────────────────┬─────────────┐
│ column_name │ column_type │
├────────────────┼─────────────┤
│ order_id │ VARCHAR │
│ customer_email │ VARCHAR │
│ order_date │ VARCHAR │
│ ship_date │ VARCHAR │
│ amount │ DOUBLE │
│ currency │ VARCHAR │
│ country │ VARCHAR │
└────────────────┴─────────────┘Now check the typed table:
duckdb orders.duckdb -c "SELECT column_name, column_type FROM (DESCRIBE orders)"┌────────────────┬───────────────┐
│ column_name │ column_type │
├────────────────┼───────────────┤
│ order_id │ VARCHAR │
│ customer_email │ VARCHAR │
│ order_date │ DATE │
│ ship_date │ DATE │
│ amount │ DECIMAL(18,2) │
│ currency │ VARCHAR │
│ country │ VARCHAR │
└────────────────┴───────────────┘Dates are now DATE, and amount is a fixed-precision DECIMAL(18,2) rather than a lossy DOUBLE. Queries that subtract dates or sum money will work correctly.
5. Query with proper types
With proper types in place, date arithmetic and numeric aggregation just work:
-- Orders shipped within 2 days
SELECT order_id, customer_email, order_date, ship_date,
ship_date - order_date AS shipping_days
FROM orders
WHERE ship_date - order_date <= 2;┌──────────┬──────────────────┬────────────┬────────────┬───────────────┐
│ order_id │ customer_email │ order_date │ ship_date │ shipping_days │
├──────────┼──────────────────┼────────────┼────────────┼───────────────┤
│ ORD-002 │ [email protected] │ 2024-02-20 │ 2024-02-22 │ 2 │
│ ORD-005 │ [email protected] │ 2024-05-01 │ 2024-05-03 │ 2 │
└──────────┴──────────────────┴────────────┴────────────┴───────────────┘-- Total revenue by currency
SELECT currency, sum(amount) AS total, count(*) AS orders
FROM orders
GROUP BY currency
ORDER BY total DESC;┌──────────┬─────────┬────────┐
│ currency │ total │ orders │
├──────────┼─────────┼────────┤
│ GBP │ 2500.00 │ 1 │
│ EUR │ 1200.00 │ 1 │
│ AUD │ 340.75 │ 1 │
│ USD │ 149.99 │ 1 │
└──────────┴─────────┴────────┘Only the four valid rows made it into the table — ORD-003 was rejected at validation, so it never reaches your queries. With typed data, sum() returns a decimal and subtracting two DATEs returns a whole number of days.
What you learned
finetype profilediscovers semantic types in a CSVfinetype profile -o json-schemaexports those types as a JSON Schema contractfinetype validate ... --db --tablegates the data and materialises a typed DuckDB table in one pass- Typed columns enable date arithmetic, numeric aggregation, and correct sorting
See also
profilecommand reference — output formats and filtering optionsvalidatecommand reference — check-only and materialise modes, exit codes- Validate a Data Delivery — add a repeatable quality gate
- DuckDB Extension — run the same profile and validate steps in-SQL with
ft_profileandft_validate