Quick Start
Go from raw CSV to a typed, validated DuckDB table in a handful of commands.
Every analyst knows the feeling: you open a CSV and every column is a string. Is that column of numbers a price or a postal code? Are the dates ISO 8601 or DD/MM/YYYY? Is that blank cell a missing value or an empty string?
This guide walks you through the full FineType pipeline — from raw CSV to a typed, validated DuckDB table — in five steps. Along the way, you will catch real data quality issues that would otherwise surface as silent errors downstream.
Prerequisites
FineType and DuckDB run on macOS, Linux, and Windows.
Install DuckDB
brew install duckdbsudo apt install duckdbwinget install DuckDB.clicurl https://install.duckdb.org | shInstall FineType
The install script detects your platform and drops the finetype binary into ~/.local/bin. It works on macOS, Linux, and in CI environments:
curl -fsSL https://install.meridian.online/finetype | bashTo pin a specific version (recommended for CI pipelines):
curl -fsSL https://install.meridian.online/finetype | bash -s -- v0.6.23brew install meridian-online/tap/finetypecargo install finetype-cliFor Windows (PowerShell 5.1+):
irm https://install.meridian.online/finetype/win | iexVerify both are working:
duckdb --version
finetype --versionThe Sample Dataset
We will use a small contacts dataset with realistic quality issues — the kind you find in every export from a CRM, spreadsheet, or third-party API.
Download the file directly: contacts.csv
Or copy and save it locally:
id,name,email,created_at,ip_address,amount
1,Alice Chen,[email protected],2024-01-15T09:30:00Z,192.168.1.10,149.99
2,Bob Smith,[email protected],2024-02-20T14:15:00Z,10.0.0.42,2500.00
3,Carol Wu,[email protected],2024-03-08T11:00:00Z,172.16.0.1,89.50
4,Dan Reeves,[email protected],2024-04-12T16:45:00Z,192.168.0.5,1200.00
5,Eve Nakamura,[email protected],2024-05-01T08:00:00Z,10.10.10.1,340.75
6,Frank Osei,not-an-email,2024-06-15T10:20:00Z,203.0.113.7,75.00
7,Grace Patel,[email protected],2024-07-15T18:30:00Z,10.0.0.99,1850.25
8,Hiro Tanaka,[email protected],2024-08-03T13:45:00Z,,425.00
9,Isla Morgan,[email protected],2024-09-19T07:30:00Z,192.168.1.200,N/A
10,Jin Li,[email protected],2024-10-05T16:00:00Z,999.999.999.999,3200.00
11,Kara Novak,[email protected],2024-11-11T09:00:00Z,10.0.1.50,620.00
12,Leo Santos,[email protected],2024-12-01T12:00:00Z,172.16.5.20,This looks tidy at first glance, but there are problems hiding in plain sight:
- Row 6:
not-an-emailis not a valid email address - Row 8: missing IP address
- Row 9:
N/Ain the amount column — not a number - Row 10:
999.999.999.999is not a valid IP address - Row 12: missing amount
DuckDB will happily read this file and load everything as VARCHAR. The problems only surface later, when a CAST fails or an aggregate silently ignores nulls. FineType catches these issues upfront — and, as you will see, draws a clear line between a value that is missing (rows 8 and 12) and a value that is wrong (rows 6, 9, and 10).
Step 1: Classify Individual Values
Before profiling the whole file, build intuition for how FineType thinks. The infer command classifies text. The default model is column-aware, so pass --mode column to classify a single value:
$ finetype infer -i "[email protected]" --mode column --confidence
identity.person.email
confidence: 0.9638 (1 samples)
$ finetype infer -i "192.168.1.10" --mode column --confidence
technology.internet.ip_v4
confidence: 0.9951 (1 samples)Each value gets a three-part label: domain.category.type. The confidence score tells you how certain the model is. Clean, distinctive values like an email or an IP address are classified with high confidence.
Now try the timestamp on its own:
$ finetype infer -i "2024-01-15T09:30:00Z" --mode column --confidence
representation.identifier.alphanumeric_id
confidence: 0.8591 (1 samples)In isolation, one timestamp looks like it could be any structured identifier — there is no column around it to confirm the pattern. This is exactly why the next step matters: profiling the whole column gives FineType the distribution and the header it needs to resolve created_at to a timestamp.
A bad value never crashes — FineType classifies what it sees:
$ finetype infer -i "not-an-email" --mode column --confidence
representation.text.plain_text
confidence: 0.6116 (1 samples)not-an-email falls back to plain text, and the low confidence is the tell: when a value does not match any sharp pattern, the score drops. Hold that thought — it is the same signal that surfaces as a reject once we validate the whole column.
Step 2: Profile the Dataset
The profile command analyses every column and picks the best semantic type for each:
finetype profile -f contacts.csvFineType Column Profile — "contacts.csv" (12 rows, 6 columns)
════════════════════════════════════════════════════════════════════════════════
COLUMN TYPE BROAD CONF
──────────────────────────────────────────────────────────────────────────────
id representation.identifier.increment BIGINT 97.6% [numeric_sequential_detection]
name identity.person.full_name VARCHAR 98.2%
email identity.person.email VARCHAR 100.0%
created_at datetime.timestamp.iso_8601 TIMESTAMP 99.1%
ip_address technology.internet.ip_v4 VARCHAR 100.0% [ipv4_detection]
amount finance.currency.amount DECIMAL 99.9% [header_hint_cross_domain:amount]
6/6 columns typed, 12 rows analyzedA few things to notice:
- BROAD is the DuckDB storage type FineType recommends. VARCHAR columns stay as strings (emails and IPs are semantically typed but stored as text);
created_atbecomes TIMESTAMP;amountbecomes DECIMAL. Notice thatcreated_at— ambiguous as a single value in Step 1 — resolves to a timestamp once the whole column is in view. - CONF is the column-level confidence. Every column is typed with high confidence: column mode looks at the distribution, so a single rogue value (the bad email in row 6) does not change what the column fundamentally is. The profiler tells you the intended type; catching the rows that violate it is the job of validation in Step 4.
- Sense hints (in brackets) show which detection strategy fired.
numeric_sequential_detectionspotted the runningid,ipv4_detectionmatched the address pattern, andheader_hint_cross_domain:amountused the column header to settle on a currency amount.
FineType has now told you what each column should be. The next step is to capture that as a formal schema.
Step 3: Generate a Schema
Run profile with -o json-schema to turn the profile results into a JSON Schema that describes the expected shape and types of your data:
finetype profile -f contacts.csv -o json-schema{
"$id": "finetype://contacts.csv",
"$schema": "https://json-schema.org/draft/2020-12/schema",
"properties": {
"amount": {
"pattern": "^[\\$£¥₹₩₿\\p{Sc}]?\\s?-?\\s?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?$|^-?[\\$£¥₹₩₿\\p{Sc}]\\s?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?$|^\\([\\$£¥₹₩₿\\p{Sc}]?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?\\)$|^-?[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?$",
"type": "string",
"x-finetype-label": "finance.currency.amount",
"x-finetype-pii": false
},
"created_at": {
"maxLength": 27,
"minLength": 20,
"pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d{1,6})?Z$",
"type": "string",
"x-finetype-label": "datetime.timestamp.iso_8601",
"x-finetype-pii": false
},
"email": {
"maxLength": 254,
"minLength": 5,
"pattern": "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$",
"type": "string",
"x-finetype-label": "identity.person.email",
"x-finetype-pii": true
},
"id": {
"pattern": "^[0-9]+$",
"type": "string",
"x-finetype-label": "representation.identifier.increment",
"x-finetype-pii": false
},
"ip_address": {
"pattern": "^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$",
"type": "string",
"x-finetype-label": "technology.internet.ip_v4",
"x-finetype-pii": false
},
"name": {
"maxLength": 200,
"minLength": 2,
"pattern": "^[\\p{L}\\s'\\-.,]+$",
"type": "string",
"x-finetype-label": "identity.person.full_name",
"x-finetype-pii": true
}
},
"required": [
"created_at",
"email",
"id",
"name"
],
"title": "contacts",
"type": "object"
}Save this to a file — it becomes the contract between your raw data and your typed tables:
finetype profile -f contacts.csv -o json-schema > schema.jsonTip
Add --stats to attach observed-data constraints (length and range bounds, enum values, null rate, cardinality) to the exported schema.
Each property carries a pattern (the regex every value must match), an x-finetype-label (the semantic type), and an x-finetype-pii flag marking columns that hold personal data — here, name and email. Standard JSON Schema validators understand the pattern; the x-finetype-* extensions are ignored by other tools but drive FineType's own validation in the next step.
Note the required array: it lists only created_at, email, id, and name — the columns with no empty cells. FineType saw blanks in ip_address and amount (rows 8 and 12) and left them optional, so a missing value there is allowed rather than flagged.
Step 4: Validate and Materialise
Now that you have a schema, validate every row against it. By default validate runs check-only and reports how many rows pass:
finetype validate contacts.csv schema.jsonValidation Report
════════════════════════════════════════════════════════════
Input: contacts.csv
Schema: schema.json
Mode: check-only (no .db written)
Total rows: 12
Valid rows: 9
Invalid rows: 3
Rejects: 3
Grade: C
════════════════════════════════════════════════════════════Nine of the twelve rows pass; three are rejected, earning the dataset a Grade of C. The exit code tells a CI pipeline what happened: 0 = no rejects, 1 = rejects present, 2 = error.
To gate the data and cast the valid rows into a typed DuckDB table in the same pass, add --db and --table:
finetype validate contacts.csv schema.json --db contacts.db --table contactsThis writes a typed contacts table (valid rows, per-column transforms applied) plus a finetype_reject_errors sidecar table capturing everything that failed. Inspect the rejects with DuckDB:
duckdb contacts.db -c "SELECT line, column_name, error_type, expected_type FROM finetype_reject_errors ORDER BY line;"┌──────┬─────────────┬───────────────┬───────────────────────────┐
│ line │ column_name │ error_type │ expected_type │
├──────┼─────────────┼───────────────┼───────────────────────────┤
│ 6 │ email │ SEMANTIC_TYPE │ identity.person.email │
│ 9 │ amount │ SEMANTIC_TYPE │ finance.currency.amount │
│ 10 │ ip_address │ SEMANTIC_TYPE │ technology.internet.ip_v4 │
└──────┴─────────────┴───────────────┴───────────────────────────┘The three rejected rows are exactly the bad values we spotted earlier:
| Row | Column | Issue |
|---|---|---|
| 6 | not-an-email fails the email pattern | |
| 9 | amount | N/A is not a valid number |
| 10 | ip_address | 999.999.999.999 fails the IPv4 range check |
Rows 8 and 12 have missing values but are not rejected — ip_address and amount are optional, so a blank is a valid null. The distinction matters: a missing value is expected; N/A typed as a string where a number should be is a data quality error.
Step 5: Query the Typed Table
The --db/--table run in Step 4 already materialised a fully typed contacts table — no manual CAST statements, no guessing at date formats, no silent type coercion errors. Open it and query with proper types:
duckdb contacts.db-- Query with proper types
SELECT name, amount, created_at
FROM contacts
WHERE created_at > TIMESTAMP '2024-06-01'
ORDER BY amount DESC;┌─────────────┬─────────┬─────────────────────┐
│ name │ amount │ created_at │
├─────────────┼─────────┼─────────────────────┤
│ Grace Patel │ 1850.25 │ 2024-07-15 18:30:00 │
│ Kara Novak │ 620.00 │ 2024-11-11 09:00:00 │
│ Hiro Tanaka │ 425.00 │ 2024-08-03 13:45:00 │
│ Leo Santos │ NULL │ 2024-12-01 12:00:00 │
└─────────────┴─────────┴─────────────────────┘Dates are TIMESTAMP, amounts are DECIMAL — date arithmetic and numeric aggregation just work. The four rows are the post-June contacts; Leo Santos shows a NULL amount, the empty cell from row 12 carried through cleanly as a null rather than a parse error.
The full pipeline in four commands:
finetype infer -i "[email protected]" --mode column --confidence # Build intuition
finetype profile -f contacts.csv # Discover column types
finetype profile -f contacts.csv -o json-schema > schema.json # Capture the contract
finetype validate contacts.csv schema.json --db contacts.db --table contacts # Validate & materialiseBonus: DuckDB Extension
If you prefer to stay in SQL, the FineType DuckDB extension lets you profile, validate, and classify directly in queries:
INSTALL finetype FROM community;
LOAD finetype;
-- Classify a single value
SELECT ft_infer('[email protected]');
-- → identity.person.email
-- Or profile a whole table at once
FROM ft_profile('contacts');See the DuckDB Extension docs for the full function reference.
What's Next?
Keep exploring
Dive deeper into the tools and the type system.