Profile a Parquet File

Profile data stored in Parquet format using FineType — via CSV export or the DuckDB extension.

Goal: Profile a Parquet file to discover the semantic types in your data, using either CSV export or the DuckDB extension.

Prerequisites

Tool              Purpose
FineType          Semantic type detection
DuckDB            Reading Parquet files and running SQL
A .parquet file   Any Parquet file: a data warehouse export, a public dataset, or your own data

Why Parquet needs a different path

FineType's CLI profiles CSV files. Parquet files store data in a columnar binary format that FineType can't read directly. You have two options:

  1. Export to CSV — use DuckDB to extract a sample, then profile with finetype profile
  2. Use the DuckDB extension — classify values directly inside SQL queries

Both approaches give you the same type labels. Choose whichever fits your workflow.

Option A: Export to CSV and profile

1. Sample and export

DuckDB reads Parquet natively. Extract a sample to CSV:

duckdb -c "COPY (SELECT * FROM 'data.parquet' LIMIT 1000) TO 'sample.csv' (HEADER)"
1000 rows exported.

The LIMIT 1000 keeps the export fast. FineType typically needs a few hundred rows to classify columns accurately — 1,000 is more than enough.
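
LIMIT 1000 takes the first rows DuckDB reads, which may not be representative if the file is sorted (for example, by date). One possible variation, sketched here as a small shell function, swaps LIMIT for DuckDB's USING SAMPLE clause to draw a random sample instead. The function name export_random_sample and its arguments are illustrative and are not part of FineType or DuckDB:

```shell
# Hypothetical helper: export a random sample of a Parquet file to CSV.
# Usage: export_random_sample <parquet_file> <row_count> <csv_file>
# Assumes file names contain no single quotes (they are embedded in SQL).
export_random_sample() {
    parquet_file=$1
    row_count=$2
    csv_file=$3
    # reservoir(N ROWS) asks DuckDB for a fixed-size random sample of N rows.
    duckdb -c "COPY (SELECT * FROM '$parquet_file' USING SAMPLE reservoir($row_count ROWS)) TO '$csv_file' (HEADER)"
}
```

Calling export_random_sample data.parquet 1000 sample.csv would then stand in for the COPY command above. Random sampling has to scan more of the file than LIMIT does, so it may be slower on large files.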

2. Profile the CSV

Run profile on the exported sample:

finetype profile -f sample.csv
FineType Column Profile — "sample.csv" (8 rows, 5 columns)
════════════════════════════════════════════════════════════════════════════════

  COLUMN                    TYPE                                      BROAD   CONF
  ──────────────────────────────────────────────────────────────────────────────
  user_id                   representation.identifier.increment      BIGINT  77.1% [numeric_sequential_detection]
  email                     identity.person.email                   VARCHAR 100.0%
  signup_date               datetime.date.iso                          DATE 100.0%
  country                   geography.location.country              VARCHAR 100.0%
  ip_address                technology.internet.ip_v4               VARCHAR 100.0% [ipv4_detection]

5/5 columns typed, 8 rows analyzed

The row count in this illustrative output (8) will differ from yours; expect it to match the number of rows in the exported sample.

You now know the semantic types in your Parquet file. From here you can export a schema (finetype profile -f sample.csv -o json-schema), validate and materialise a typed table (finetype validate sample.csv schema.json --db out.db --table sample), or simply use the profile as documentation.

3. Clean up

Remove the intermediate CSV when you're done:

rm sample.csv
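
The export, profile, and clean-up steps can be bundled so the intermediate CSV never lingers. The following is a minimal sketch, assuming duckdb and finetype are both on your PATH. The function name profile_parquet is hypothetical, and it relies only on the two commands already shown in this section:

```shell
# Hypothetical wrapper around the export, profile, and clean-up steps.
# Usage: profile_parquet <parquet_file> [row_limit]
# Assumes the file name contains no single quotes (it is embedded in SQL).
profile_parquet() {
    parquet_file=$1
    row_limit=${2:-1000}   # default to the 1,000-row sample used above

    # Work in a private temporary directory so no existing file is overwritten.
    tmp_dir=$(mktemp -d) || return 1
    sample_csv="$tmp_dir/sample.csv"

    # Export the sample, then profile it only if the export succeeded.
    duckdb -c "COPY (SELECT * FROM '$parquet_file' LIMIT $row_limit) TO '$sample_csv' (HEADER)" \
        && finetype profile -f "$sample_csv"
    status=$?

    # Remove the intermediate CSV whether or not profiling succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}
```

With this in your shell, profile_parquet data.parquet 500 would profile the first 500 rows and leave no files behind.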

Option B: Use the DuckDB extension

The FineType DuckDB extension profiles columns directly inside SQL — no CSV export needed.

1. Install and load the extension

INSTALL finetype FROM community;
LOAD finetype;

The signed community artifact loads on DuckDB 1.2 through 1.5+. See the DuckDB Extension docs for the full function reference.

2. Profile every column

Materialise the Parquet file as a table, then pass its name to the ft_profile table macro:

CREATE TABLE data AS SELECT * FROM read_parquet('data.parquet');

FROM ft_profile('data');
┌─────────────┬─────────────────────────────────────┬────────────────────┬─────────────┐
│ column_name │                type                 │     confidence     │ duckdb_type │
├─────────────┼─────────────────────────────────────┼────────────────────┼─────────────┤
│ amount      │ finance.currency.amount             │ 0.9960123300552368 │ VARCHAR     │
│ created_at  │ datetime.timestamp.iso_8601         │ 0.9695983529090881 │ TIMESTAMP   │
│ email       │ identity.person.email               │                1.0 │ VARCHAR     │
│ id          │ representation.identifier.increment │ 0.9105001091957092 │ BIGINT      │
│ ip_address  │ technology.internet.ip_v4           │                1.0 │ INET        │
│ name        │ identity.person.full_name           │ 0.9888034462928772 │ VARCHAR     │
└─────────────┴─────────────────────────────────────┴────────────────────┴─────────────┘

ft_profile returns one row per column, giving the detected type, the confidence, and the DuckDB type to cast to. It is the equivalent of finetype profile, running entirely inside DuckDB.

Because the result is an ordinary relation, you can filter it inline — for example, to surface only the columns that warrant a typed cast:

SELECT column_name, type, duckdb_type
FROM ft_profile('data')
WHERE duckdb_type <> 'VARCHAR';
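
If you would rather run Option B from a script than from an interactive session, the same statements can be piped to the DuckDB CLI on standard input. This is a sketch, assuming the community extension can be installed on the machine that runs it. The function name profile_parquet_sql is made up for illustration:

```shell
# Hypothetical helper: run the Option B statements in one DuckDB session.
# Usage: profile_parquet_sql <parquet_file>
# Assumes the file name contains no single quotes (it is embedded in SQL).
profile_parquet_sql() {
    parquet_file=$1
    # With no database argument, duckdb uses an in-memory database, so the
    # "data" table disappears when the session ends.
    duckdb <<SQL
INSTALL finetype FROM community;
LOAD finetype;
CREATE TABLE data AS SELECT * FROM read_parquet('$parquet_file');
FROM ft_profile('data');
SQL
}
```

Running profile_parquet_sql data.parquet should print the same profile table as the interactive session, without leaving a database file on disk.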

What you learned

  • FineType's CLI profiles CSV files; Parquet requires either a CSV export step or the DuckDB extension
  • DuckDB's COPY ... TO command extracts a sample from Parquet to CSV in one line
  • The DuckDB extension's ft_profile table macro profiles columns in-place — useful when you want to stay in SQL
  • Both paths produce the same FineType type labels
