FineType

Semantic type detection for text data. 250 types, DuckDB integration, pure Rust.

FineType classifies text into 250 semantic types — dates, emails, IP addresses, coordinates, financial identifiers, and more.

Early Release

FineType is under active development. Expect breaking changes to taxonomy labels, CLI arguments, library APIs, and model formats between releases. Pin to a specific version if stability matters for your use case.

Why it matters

You download a dataset. DuckDB reads it instantly, but every text column is VARCHAR. Is that column of numbers a postal code, a year, or a price? Are those dates US or European format?

FineType answers these questions:

$ finetype profile -f orders.csv

Column        Type                           Confidence
────────────  ─────────────────────────────  ──────────
order_date    datetime.date.us_slash          0.97
amount        representation.numeric.decimal  0.98
customer      identity.person.full_name       0.93
country       geography.location.country      0.95
ip_address    technology.internet.ip_v4       0.99

Every type maps to a DuckDB SQL expression. FineType says order_date is datetime.date.us_slash — that means strptime(order_date, '%m/%d/%Y') will succeed on every matching value. Profile first, then cast with confidence.
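As a rough illustration of what "will succeed on every matching value" means, the sketch below checks a column against the '%m/%d/%Y' format in Python, whose datetime.strptime understands the same three directives. The helper name all_parse and the sample values are assumptions made for this example; they are not part of FineType or DuckDB.

```python
from datetime import datetime

US_SLASH = "%m/%d/%Y"  # format string taken from the example above


def all_parse(values, fmt):
    """Return True if every value parses with the given strptime format."""
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True


# Hypothetical sample of an order_date column (illustrative data only).
order_dates = ["01/31/2024", "12/05/2023", "07/04/2022"]

print(all_parse(order_dates, US_SLASH))     # True: every value is month/day/year
print(all_parse(["31/01/2024"], US_SLASH))  # False: 31 is not a valid month
```

A column that passes this kind of check can be cast in one step; a column that fails it needs a different format or a cleanup pass first.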

Installation

# Install script (bash)
curl -fsSL https://install.meridian.online/finetype | bash

# Homebrew
brew install meridian-online/tap/finetype

# Cargo
cargo install finetype-cli

# PowerShell (Windows)
irm https://install.meridian.online/finetype/win | iex

CLI Usage

# Classify a single value
finetype infer -i "bc89:60a9:23b8:c1e9:3924:56de:3eb1:3b90"

# Classify from file (one value per line), JSON output
finetype infer -f data.txt --output json

# Column-mode inference (distribution-based disambiguation)
finetype infer -f column_values.txt --mode column

# Profile a CSV file — detect column types
finetype profile -f data.csv

# Profile with data quality validation
finetype profile -f data.csv --validate

# Generate a DuckDB CREATE TABLE from a CSV
finetype schema-for -f data.csv

# Export a JSON Schema for a type
finetype schema datetime.timestamp.iso_8601

# Validate data quality against taxonomy schemas
finetype validate -f data.ndjson --strategy quarantine

# Generate synthetic training data
finetype generate --samples 1000 --output training.ndjson

# Validate generator ↔ taxonomy alignment
finetype check

# Show taxonomy (filter by domain, category, priority)
finetype taxonomy --domain datetime

Column-Mode Inference

Single-value classification can be ambiguous: is 01/02/2024 a US date (Jan 2) or an EU date (Feb 1)? Is 1995 a year, a postal code, or a plain number?

Column-mode analyses the distribution of values in a column and applies disambiguation rules:

  • Date format — US vs EU slash dates, short vs long dates
  • Year detection — 4-digit integers predominantly in the 1900–2100 range
  • Coordinate resolution — latitude vs longitude based on value ranges
  • Numeric types — ports, increments, postal codes, street numbers

# CLI column-mode
finetype infer -f column_values.txt --mode column

# CSV profiling (uses column-mode automatically)
finetype profile -f data.csv
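The disambiguation rules listed above can be sketched in a few lines of Python. This is a simplified illustration of the general idea (looking at the whole column rather than one value), not FineType's actual implementation. The function names, the return labels, and the 0.9 threshold standing in for "predominantly" are all assumptions made for this example.

```python
def slash_date_order(values):
    """Guess whether slash dates in a column are month-first (US) or day-first (EU).

    A field greater than 12 cannot be a month, so one such value can
    settle the order for the whole column. Assumes "a/b/yyyy" strings.
    """
    first_over_12 = second_over_12 = False
    for value in values:
        first, second, _year = value.split("/")
        first_over_12 |= int(first) > 12
        second_over_12 |= int(second) > 12
    if first_over_12 and not second_over_12:
        return "eu"
    if second_over_12 and not first_over_12:
        return "us"
    return "ambiguous"  # no decisive value, or conflicting evidence


def looks_like_years(values, threshold=0.9):
    """True if most values are 4-digit integers in the 1900-2100 range."""
    hits = sum(
        1 for v in values
        if len(v) == 4 and v.isascii() and v.isdigit() and 1900 <= int(v) <= 2100
    )
    return bool(values) and hits / len(values) >= threshold


def coordinate_kind(values):
    """Latitude is limited to [-90, 90]; anything beyond that must be longitude."""
    numbers = [float(v) for v in values]
    if any(abs(n) > 180 for n in numbers):
        return "not_a_coordinate"
    if any(abs(n) > 90 for n in numbers):
        return "longitude"
    return "latitude_or_longitude"  # the range alone cannot tell them apart


print(slash_date_order(["01/02/2024", "03/25/2024"]))  # us
print(slash_date_order(["01/02/2024", "25/03/2024"]))  # eu
print(slash_date_order(["01/02/2024"]))                # ambiguous
print(looks_like_years(["1995", "2003", "2024"]))      # True
print(coordinate_kind(["-33.87", "151.21"]))           # longitude
```

Note how 01/02/2024 is ambiguous on its own but becomes decidable once a neighbouring value such as 03/25/2024 appears in the same column.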

Schema Export

Once you know your column types, FineType can generate a DuckDB CREATE TABLE statement — every column typed, every cast guaranteed to succeed:

$ finetype schema-for -f orders.csv

CREATE TABLE orders (
    order_date    DATE,          -- strptime(order_date, '%m/%d/%Y')
    amount        DECIMAL(10,2), -- decimal
    customer      VARCHAR,       -- full_name
    country       VARCHAR,       -- country
    ip_address    VARCHAR        -- ip_v4 (INET)
);

Output formats include plain SQL, JSON (for programmatic use), and Arrow schema JSON.
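To show the idea behind schema generation, here is a minimal Python sketch that turns a profile (column name plus detected type) into a CREATE TABLE statement. The mapping table is an assumption copied from the two non-VARCHAR columns in the example output above; the function name create_table_sql is hypothetical, and the sketch does not quote identifiers or emit the trailing comments that finetype schema-for produces.

```python
# Assumed mapping from type labels to DuckDB column types, taken from the
# example output above. A real mapping would cover the full taxonomy.
DUCKDB_TYPES = {
    "datetime.date.us_slash": "DATE",
    "representation.numeric.decimal": "DECIMAL(10,2)",
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (column_name, type_label) pairs.

    Labels without a known mapping fall back to VARCHAR.
    """
    lines = []
    for name, label in columns:
        sql_type = DUCKDB_TYPES.get(label, "VARCHAR")
        lines.append(f"    {name} {sql_type}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


# Column names and labels from the orders.csv profile shown earlier.
profile = [
    ("order_date", "datetime.date.us_slash"),
    ("amount", "representation.numeric.decimal"),
    ("customer", "identity.person.full_name"),
    ("country", "geography.location.country"),
    ("ip_address", "technology.internet.ip_v4"),
]

sql = create_table_sql("orders", profile)
print(sql)
```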

Data Quality

Profile with --validate to get quality grades for each column. FineType checks every value against the type's schema and reports what doesn't match:

$ finetype profile -f data.csv --validate

Column        Type                   Confidence  Quality  Invalid
────────────  ─────────────────────  ──────────  ───────  ───────
email         identity.person.email  0.99        A        0/1000
order_date    datetime.date.us_slash 0.97        B        12/1000
amount        representation.numeric 0.94        A        2/1000

For deeper validation, finetype validate supports quarantine, null-replacement, forward-fill, and backward-fill strategies for handling invalid values.
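The four strategies are common data-cleaning techniques, and the Python sketch below shows how each one typically treats invalid values. It is an illustration of the general behaviour, not FineType's code: the strategy names other than quarantine, the function names, and the decimal validator are assumptions made for this example.

```python
import re


def _fill(values, is_valid):
    """Carry the most recent valid value forward; None until one is seen."""
    out, last = [], None
    for v in values:
        if is_valid(v):
            last = v
        out.append(last)
    return out


def apply_strategy(values, is_valid, strategy):
    """Return (cleaned, quarantined) for one column.

    quarantine: drop invalid values and collect them separately
    null:       replace invalid values with None
    ffill:      replace invalid values with the previous valid value
    bfill:      replace invalid values with the next valid value
    """
    if strategy == "quarantine":
        cleaned = [v for v in values if is_valid(v)]
        quarantined = [v for v in values if not is_valid(v)]
        return cleaned, quarantined
    if strategy == "null":
        return [v if is_valid(v) else None for v in values], []
    if strategy == "ffill":
        return _fill(values, is_valid), []
    if strategy == "bfill":
        # Backward fill is forward fill applied to the reversed column.
        return _fill(values[::-1], is_valid)[::-1], []
    raise ValueError(f"unknown strategy: {strategy}")


def is_decimal(value):
    """Hypothetical validator: optional sign, digits, optional fraction."""
    return re.fullmatch(r"-?\d+(\.\d+)?", value) is not None


amounts = ["10.50", "n/a", "7.25", "", "3.00"]  # illustrative data only

print(apply_strategy(amounts, is_decimal, "quarantine"))
# (['10.50', '7.25', '3.00'], ['n/a', ''])
print(apply_strategy(amounts, is_decimal, "null")[0])
# ['10.50', None, '7.25', None, '3.00']
print(apply_strategy(amounts, is_decimal, "ffill")[0])
# ['10.50', '10.50', '7.25', '7.25', '3.00']
print(apply_strategy(amounts, is_decimal, "bfill")[0])
# ['10.50', '7.25', '7.25', '3.00', '3.00']
```

Quarantine keeps the bad rows available for inspection, while the fill strategies preserve row counts at the cost of inventing values, so the right choice depends on what downstream consumers expect.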

Performance

Accuracy

Evaluated on 21 real-world datasets (116 annotated columns):

Metric           Result
───────────────  ─────────────────────────────────────────
Label accuracy   97.4%
Domain accuracy  98.3%
Actionability    99.7% (DuckDB casts succeed on real data)
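To make the difference between label accuracy and domain accuracy concrete, the Python sketch below scores a handful of predictions both ways. It assumes that the domain is the first dotted segment of a label (for example, datetime in datetime.date.us_slash) and that both metrics are computed per column; the function names and the four sample pairs are hypothetical.

```python
def domain_of(label):
    """First dotted segment of a label, e.g. 'datetime' (assumed to be the domain)."""
    return label.split(".", 1)[0]


def accuracy(pairs):
    """Return (label_accuracy, domain_accuracy) for (predicted, annotated) pairs."""
    total = len(pairs)
    label_hits = sum(1 for pred, gold in pairs if pred == gold)
    domain_hits = sum(1 for pred, gold in pairs if domain_of(pred) == domain_of(gold))
    return label_hits / total, domain_hits / total


# Illustrative predictions only; not taken from the evaluation datasets.
pairs = [
    ("datetime.date.us_slash", "datetime.date.us_slash"),           # exact match
    ("datetime.timestamp.iso_8601", "datetime.date.us_slash"),      # right domain only
    ("technology.internet.ip_v4", "technology.internet.ip_v4"),     # exact match
    ("representation.numeric.decimal", "geography.location.country"),  # wrong domain
]

print(accuracy(pairs))  # (0.5, 0.75)

# Under the per-column assumption, the reported figures would correspond to
# 113 and 114 correct columns out of 116.
print(round(113 / 116 * 100, 1), round(114 / 116 * 100, 1))  # 97.4 98.3
```

Domain accuracy can never be lower than label accuracy, because an exact label match always implies a domain match.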

The inference pipeline uses a two-stage architecture — Sense (broad classification) followed by Sharpen (fine-grained disambiguation) — with column-mode distribution analysis for ambiguous types like dates and coordinates.

Latency & Throughput

Metric            Value
────────────────  ─────────────────────────
Model load        66 ms cold, 25–30 ms warm
Single inference  p50 = 26 ms, p95 = 41 ms
Batch throughput  600–750 values/sec
Memory footprint  8.5 MB peak RSS
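For capacity planning, the figures above translate into wall-clock estimates with simple arithmetic. The Python sketch below assumes that batch throughput stays constant regardless of input size and that the single-inference latency excludes model load; neither assumption is stated in the table, and the one-million-value column is a hypothetical workload.

```python
def batch_seconds(n_values, values_per_sec):
    """Time to classify n_values at a steady batch throughput."""
    return n_values / values_per_sec


def sequential_rate(latency_ms):
    """Values per second if each value were classified in its own call."""
    return 1000 / latency_ms


n = 1_000_000  # hypothetical number of values to classify
fast = batch_seconds(n, 750)  # upper end of the reported throughput
slow = batch_seconds(n, 600)  # lower end of the reported throughput

print(f"{fast / 60:.0f}-{slow / 60:.0f} minutes")  # 22-28 minutes
print(f"{sequential_rate(26):.0f} values/sec")     # 38 values/sec at p50 latency
```

Under these assumptions, batching is more than an order of magnitude faster than classifying values one call at a time, which is a reason to prefer file input over repeated single-value calls for large columns.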

Acknowledgements

  • QSV — High-performance CSV toolkit that inspired FineType's approach to data profiling

See also

DuckDB Extension · Type Registry
