MeridianMERIDIAN

Quick Start

Go from raw CSV to a typed, validated DuckDB table in a handful of commands.

Every analyst knows the feeling: you open a CSV and every column is a string. Is that column of numbers a price or a postal code? Are the dates ISO 8601 or DD/MM/YYYY? Is that blank cell a missing value or an empty string?

This guide walks you through the full FineType pipeline — from raw CSV to a typed, validated DuckDB table — in five steps. Along the way, you will catch real data quality issues that would otherwise surface as silent errors downstream.

Prerequisites

FineType and DuckDB run on macOS, Linux, and Windows.

ToolPurpose
DuckDBIn-memory analytical SQL engine
FineTypeSemantic type detection (240 types)

Install DuckDB

brew install duckdb
sudo apt install duckdb
winget install DuckDB.cli
curl https://install.duckdb.org | sh

Install FineType

The install script detects your platform and drops the finetype binary into ~/.local/bin. It works on macOS, Linux, and in CI environments:

curl -fsSL https://install.meridian.online/finetype | bash

To pin a specific version (recommended for CI pipelines):

curl -fsSL https://install.meridian.online/finetype | bash -s -- v0.6.23
brew install meridian-online/tap/finetype
cargo install finetype-cli

For Windows (PowerShell 5.1+):

irm https://install.meridian.online/finetype/win | iex

Verify both are working:

duckdb --version
finetype --version

The Sample Dataset

We will use a small contacts dataset with realistic quality issues — the kind you find in every export from a CRM, spreadsheet, or third-party API.

Download the file directly: contacts.csv

Or copy and save it locally:

id,name,email,created_at,ip_address,amount
1,Alice Chen,[email protected],2024-01-15T09:30:00Z,192.168.1.10,149.99
2,Bob Smith,[email protected],2024-02-20T14:15:00Z,10.0.0.42,2500.00
3,Carol Wu,[email protected],2024-03-08T11:00:00Z,172.16.0.1,89.50
4,Dan Reeves,[email protected],2024-04-12T16:45:00Z,192.168.0.5,1200.00
5,Eve Nakamura,[email protected],2024-05-01T08:00:00Z,10.10.10.1,340.75
6,Frank Osei,not-an-email,2024-06-15T10:20:00Z,203.0.113.7,75.00
7,Grace Patel,[email protected],2024-07-15T18:30:00Z,10.0.0.99,1850.25
8,Hiro Tanaka,[email protected],2024-08-03T13:45:00Z,,425.00
9,Isla Morgan,[email protected],2024-09-19T07:30:00Z,192.168.1.200,N/A
10,Jin Li,[email protected],2024-10-05T16:00:00Z,999.999.999.999,3200.00
11,Kara Novak,[email protected],2024-11-11T09:00:00Z,10.0.1.50,620.00
12,Leo Santos,[email protected],2024-12-01T12:00:00Z,172.16.5.20,

This looks tidy at first glance, but there are problems hiding in plain sight:

  • Row 6: not-an-email is not a valid email address
  • Row 8: missing IP address
  • Row 9: N/A in the amount column — not a number
  • Row 10: 999.999.999.999 is not a valid IP address
  • Row 12: missing amount

DuckDB will happily read this file and load everything as VARCHAR. The problems only surface later, when a CAST fails or an aggregate silently ignores nulls. FineType catches these issues upfront — and, as you will see, draws a clear line between a value that is missing (rows 8 and 12) and a value that is wrong (rows 6, 9, and 10).

Step 1: Classify Individual Values

Before profiling the whole file, build intuition for how FineType thinks. The infer command classifies text. The default model is column-aware, so pass --mode column to classify a single value:

$ finetype infer -i "[email protected]" --mode column --confidence
identity.person.email
  confidence: 0.9638 (1 samples)

$ finetype infer -i "192.168.1.10" --mode column --confidence
technology.internet.ip_v4
  confidence: 0.9951 (1 samples)

Each value gets a three-part label: domain.category.type. The confidence score tells you how certain the model is. Clean, distinctive values like an email or an IP address are classified with high confidence.

Now try the timestamp on its own:

$ finetype infer -i "2024-01-15T09:30:00Z" --mode column --confidence
representation.identifier.alphanumeric_id
  confidence: 0.8591 (1 samples)

In isolation, one timestamp looks like it could be any structured identifier — there is no column around it to confirm the pattern. This is exactly why the next step matters: profiling the whole column gives FineType the distribution and the header it needs to resolve created_at to a timestamp.

A bad value never crashes — FineType classifies what it sees:

$ finetype infer -i "not-an-email" --mode column --confidence
representation.text.plain_text
  confidence: 0.6116 (1 samples)

not-an-email falls back to plain text, and the low confidence is the tell: when a value does not match any sharp pattern, the score drops. Hold that thought — it is the same signal that surfaces as a reject once we validate the whole column.

Step 2: Profile the Dataset

The profile command analyses every column and picks the best semantic type for each:

finetype profile -f contacts.csv
FineType Column Profile — "contacts.csv" (12 rows, 6 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  id                        representation.identifier.increment      BIGINT  97.6% [numeric_sequential_detection]
  name                      identity.person.full_name               VARCHAR  98.2%
  email                     identity.person.email                   VARCHAR 100.0%
  created_at                datetime.timestamp.iso_8601            TIMESTAMP  99.1%
  ip_address                technology.internet.ip_v4               VARCHAR 100.0% [ipv4_detection]
  amount                    finance.currency.amount                 DECIMAL  99.9% [header_hint_cross_domain:amount]

6/6 columns typed, 12 rows analyzed

A few things to notice:

  • BROAD is the DuckDB storage type FineType recommends. VARCHAR columns stay as strings (emails and IPs are semantically typed but stored as text); created_at becomes TIMESTAMP; amount becomes DECIMAL. Notice that created_at — ambiguous as a single value in Step 1 — resolves to a timestamp once the whole column is in view.
  • CONF is the column-level confidence. Every column is typed with high confidence: column mode looks at the distribution, so a single rogue value (the bad email in row 6) does not change what the column fundamentally is. The profiler tells you the intended type; catching the rows that violate it is the job of validation in Step 4.
  • Sense hints (in brackets) show which detection strategy fired. numeric_sequential_detection spotted the running id, ipv4_detection matched the address pattern, and header_hint_cross_domain:amount used the column header to settle on a currency amount.

FineType has now told you what each column should be. The next step is to capture that as a formal schema.

Step 3: Generate a Schema

Run profile with -o json-schema to turn the profile results into a JSON Schema that describes the expected shape and types of your data:

finetype profile -f contacts.csv -o json-schema
{
  "$id": "finetype://contacts.csv",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "properties": {
    "amount": {
      "pattern": "^[\\$£¥₹₩₿\\p{Sc}]?\\s?-?\\s?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?$|^-?[\\$£¥₹₩₿\\p{Sc}]\\s?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?$|^\\([\\$£¥₹₩₿\\p{Sc}]?[0-9]{1,3}(,[0-9]{3})*(\\.[0-9]{1,2})?\\)$|^-?[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?$",
      "type": "string",
      "x-finetype-label": "finance.currency.amount",
      "x-finetype-pii": false
    },
    "created_at": {
      "maxLength": 27,
      "minLength": 20,
      "pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}(\\.\\d{1,6})?Z$",
      "type": "string",
      "x-finetype-label": "datetime.timestamp.iso_8601",
      "x-finetype-pii": false
    },
    "email": {
      "maxLength": 254,
      "minLength": 5,
      "pattern": "^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$",
      "type": "string",
      "x-finetype-label": "identity.person.email",
      "x-finetype-pii": true
    },
    "id": {
      "pattern": "^[0-9]+$",
      "type": "string",
      "x-finetype-label": "representation.identifier.increment",
      "x-finetype-pii": false
    },
    "ip_address": {
      "pattern": "^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$",
      "type": "string",
      "x-finetype-label": "technology.internet.ip_v4",
      "x-finetype-pii": false
    },
    "name": {
      "maxLength": 200,
      "minLength": 2,
      "pattern": "^[\\p{L}\\s'\\-.,]+$",
      "type": "string",
      "x-finetype-label": "identity.person.full_name",
      "x-finetype-pii": true
    }
  },
  "required": [
    "created_at",
    "email",
    "id",
    "name"
  ],
  "title": "contacts",
  "type": "object"
}

Save this to a file — it becomes the contract between your raw data and your typed tables:

finetype profile -f contacts.csv -o json-schema > schema.json

Tip

Add --stats to attach observed-data constraints (length and range bounds, enum values, null rate, cardinality) to the exported schema.

Each property carries a pattern (the regex every value must match), an x-finetype-label (the semantic type), and an x-finetype-pii flag marking columns that hold personal data — here, name and email. Standard JSON Schema validators understand the pattern; the x-finetype-* extensions are ignored by other tools but drive FineType's own validation in the next step.

Note the required array: it lists only created_at, email, id, and name — the columns with no empty cells. FineType saw blanks in ip_address and amount (rows 8 and 12) and left them optional, so a missing value there is allowed rather than flagged.

Step 4: Validate and Materialise

Now that you have a schema, validate every row against it. By default validate runs check-only and reports how many rows pass:

finetype validate contacts.csv schema.json
Validation Report
════════════════════════════════════════════════════════════
  Input:        contacts.csv
  Schema:       schema.json
  Mode:         check-only (no .db written)

  Total rows:            12
  Valid rows:             9
  Invalid rows:           3
  Rejects:                3
  Grade:             C
════════════════════════════════════════════════════════════

Nine of the twelve rows pass; three are rejected, earning the dataset a Grade of C. The exit code tells a CI pipeline what happened: 0 = no rejects, 1 = rejects present, 2 = error.

To gate the data and cast the valid rows into a typed DuckDB table in the same pass, add --db and --table:

finetype validate contacts.csv schema.json --db contacts.db --table contacts

This writes a typed contacts table (valid rows, per-column transforms applied) plus a finetype_reject_errors sidecar table capturing everything that failed. Inspect the rejects with DuckDB:

duckdb contacts.db -c "SELECT line, column_name, error_type, expected_type FROM finetype_reject_errors ORDER BY line;"
┌──────┬─────────────┬───────────────┬───────────────────────────┐
│ line │ column_name │  error_type   │       expected_type       │
├──────┼─────────────┼───────────────┼───────────────────────────┤
│ 6    │ email       │ SEMANTIC_TYPE │ identity.person.email     │
│ 9    │ amount      │ SEMANTIC_TYPE │ finance.currency.amount   │
│ 10   │ ip_address  │ SEMANTIC_TYPE │ technology.internet.ip_v4 │
└──────┴─────────────┴───────────────┴───────────────────────────┘

The three rejected rows are exactly the bad values we spotted earlier:

RowColumnIssue
6emailnot-an-email fails the email pattern
9amountN/A is not a valid number
10ip_address999.999.999.999 fails the IPv4 range check

Rows 8 and 12 have missing values but are not rejected — ip_address and amount are optional, so a blank is a valid null. The distinction matters: a missing value is expected; N/A typed as a string where a number should be is a data quality error.

Step 5: Query the Typed Table

The --db/--table run in Step 4 already materialised a fully typed contacts table — no manual CAST statements, no guessing at date formats, no silent type coercion errors. Open it and query with proper types:

duckdb contacts.db
-- Query with proper types
SELECT name, amount, created_at
FROM contacts
WHERE created_at > TIMESTAMP '2024-06-01'
ORDER BY amount DESC;
┌─────────────┬─────────┬─────────────────────┐
│    name     │ amount  │     created_at      │
├─────────────┼─────────┼─────────────────────┤
│ Grace Patel │ 1850.25 │ 2024-07-15 18:30:00 │
│ Kara Novak  │ 620.00  │ 2024-11-11 09:00:00 │
│ Hiro Tanaka │ 425.00  │ 2024-08-03 13:45:00 │
│ Leo Santos  │ NULL    │ 2024-12-01 12:00:00 │
└─────────────┴─────────┴─────────────────────┘

Dates are TIMESTAMP, amounts are DECIMAL — date arithmetic and numeric aggregation just work. The four rows are the post-June contacts; Leo Santos shows a NULL amount, the empty cell from row 12 carried through cleanly as a null rather than a parse error.

The full pipeline in four commands:

finetype infer -i "[email protected]" --mode column --confidence              # Build intuition
finetype profile -f contacts.csv                                              # Discover column types
finetype profile -f contacts.csv -o json-schema > schema.json                 # Capture the contract
finetype validate contacts.csv schema.json --db contacts.db --table contacts  # Validate & materialise

Bonus: DuckDB Extension

If you prefer to stay in SQL, the FineType DuckDB extension lets you profile, validate, and classify directly in queries:

INSTALL finetype FROM community;
LOAD finetype;

-- Classify a single value
SELECT ft_infer('[email protected]');
-- → identity.person.email

-- Or profile a whole table at once
FROM ft_profile('contacts');

See the DuckDB Extension docs for the full function reference.

What's Next?

Keep exploring

Dive deeper into the tools and the type system.

On this page