Profile a Parquet File
Profile data stored in Parquet format using FineType — via CSV export or the DuckDB extension.
Goal: Profile a Parquet file to discover the semantic types in your data, using either CSV export or the DuckDB extension.
Prerequisites
| Tool | Purpose |
|---|---|
| FineType | Semantic type detection |
| DuckDB | Reading Parquet files and running SQL |
| A .parquet file | Any Parquet file: a data warehouse export, a public dataset, or your own data |
Why Parquet needs a different path
FineType's CLI profiles CSV files. Parquet files store data in a columnar binary format that FineType can't read directly. You have two options:
- Export to CSV: use DuckDB to extract a sample, then profile with finetype profile
- Use the DuckDB extension: classify values directly inside SQL queries
Both approaches give you the same type labels. Choose whichever fits your workflow.
Option A: Export to CSV and profile
1. Sample and export
DuckDB reads Parquet natively. Extract a sample to CSV:
duckdb -c "COPY (SELECT * FROM 'data.parquet' LIMIT 1000) TO 'sample.csv' (HEADER)"
1000 rows exported.
The LIMIT 1000 keeps the export fast. FineType typically needs a few hundred rows to classify columns accurately, so 1,000 is more than enough.
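LIMIT 1000 returns whichever rows DuckDB reads first, which may not be representative if the file is sorted or partitioned. The sketch below shows one possible alternative, assuming DuckDB's USING SAMPLE clause suits your data. The helper name export_sample and its arguments are hypothetical and not part of FineType or DuckDB.

```shell
# export_sample is a hypothetical helper, not part of FineType or DuckDB.
# Usage: export_sample data.parquet sample.csv [rows]
export_sample() {
    parquet_file=$1
    csv_file=$2
    rows=${3:-1000}   # default matches the LIMIT 1000 used above

    # USING SAMPLE draws rows at random rather than taking the first N,
    # which can help when the Parquet file is sorted or partitioned.
    duckdb -c "COPY (SELECT * FROM '${parquet_file}' USING SAMPLE ${rows} ROWS) TO '${csv_file}' (HEADER)"
}
```

Calling export_sample data.parquet sample.csv 500 would write a 500-row random sample. Random sampling may be slower than LIMIT on very large files, since more of the file has to be scanned.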
2. Profile the CSV
Run profile on the exported sample:
finetype profile -f sample.csv
FineType Column Profile — "sample.csv" (8 rows, 5 columns)
════════════════════════════════════════════════════════════════════════════════
COLUMN TYPE BROAD CONF
──────────────────────────────────────────────────────────────────────────────
user_id representation.identifier.increment BIGINT 77.1% [numeric_sequential_detection]
email identity.person.email VARCHAR 100.0%
signup_date datetime.date.iso DATE 100.0%
country geography.location.country VARCHAR 100.0%
ip_address technology.internet.ip_v4 VARCHAR 100.0% [ipv4_detection]
5/5 columns typed, 8 rows analyzed
You now know the semantic types in your Parquet file. From here you can export a schema (finetype profile -f sample.csv -o json-schema), validate and materialise a typed table (finetype validate sample.csv schema.json --db out.db --table sample), or simply use the profile as documentation.
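The two follow-on commands mentioned above can be chained into one step. This is a minimal sketch, assuming that -o json-schema prints the schema to standard output, so that it can be redirected into schema.json; this page shows the flag but not where the output goes. The helper name materialise_sample is hypothetical.

```shell
# materialise_sample is a hypothetical helper that chains the two commands above.
# Assumption: `finetype profile -o json-schema` prints the schema to stdout.
# Usage: materialise_sample [csv_file] [schema_file]
materialise_sample() {
    csv_file=${1:-sample.csv}
    schema_file=${2:-schema.json}

    # Stop if the schema export fails, so validate never sees a partial schema.
    finetype profile -f "$csv_file" -o json-schema > "$schema_file" || return 1

    # Validate the CSV against the schema and load it into a typed table.
    finetype validate "$csv_file" "$schema_file" --db out.db --table sample
}
```

Run with no arguments, it uses sample.csv and schema.json, matching the file names in the commands above.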
3. Clean up
Remove the intermediate CSV when you're done:
rm sample.csv
Option B: Use the DuckDB extension
The FineType DuckDB extension profiles columns directly inside SQL — no CSV export needed.
1. Install and load the extension
INSTALL finetype FROM community;
LOAD finetype;
The signed community artifact loads on DuckDB 1.2 through 1.5+. See the DuckDB Extension docs for the full function reference.
2. Profile every column
Materialise the Parquet file as a table, then pass its name to the ft_profile table macro:
CREATE TABLE data AS SELECT * FROM read_parquet('data.parquet');
FROM ft_profile('data');
┌─────────────┬─────────────────────────────────────┬────────────────────┬─────────────┐
│ column_name │ type │ confidence │ duckdb_type │
├─────────────┼─────────────────────────────────────┼────────────────────┼─────────────┤
│ amount │ finance.currency.amount │ 0.9960123300552368 │ VARCHAR │
│ created_at │ datetime.timestamp.iso_8601 │ 0.9695983529090881 │ TIMESTAMP │
│ email │ identity.person.email │ 1.0 │ VARCHAR │
│ id │ representation.identifier.increment │ 0.9105001091957092 │ BIGINT │
│ ip_address │ technology.internet.ip_v4 │ 1.0 │ INET │
│ name │ identity.person.full_name │ 0.9888034462928772 │ VARCHAR │
└─────────────┴─────────────────────────────────────┴────────────────────┴─────────────┘
ft_profile returns one row per column: the detected type, the confidence, and the DuckDB type to cast to. It is the equivalent of finetype profile running entirely inside DuckDB.
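To run the statements from steps 1 and 2 non-interactively, for example from a script, they can be passed to the DuckDB CLI in a single -c argument. This is a sketch rather than a documented FineType workflow; the helper name profile_parquet_sql is hypothetical.

```shell
# profile_parquet_sql is a hypothetical helper. It runs the statements from
# steps 1 and 2 in one in-memory DuckDB session, so the table it creates
# disappears when the command exits.
# Usage: profile_parquet_sql [parquet_file]
profile_parquet_sql() {
    parquet_file=${1:-data.parquet}

    duckdb -c "
        INSTALL finetype FROM community;
        LOAD finetype;
        CREATE TABLE data AS SELECT * FROM read_parquet('${parquet_file}');
        FROM ft_profile('data');
    "
}
```

Because no database file is given, nothing is written to disk and there is no intermediate table to clean up afterwards.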
Because the result is an ordinary relation, you can filter it inline — for example, to surface only the columns that warrant a typed cast:
SELECT column_name, type, duckdb_type
FROM ft_profile('data')
WHERE duckdb_type <> 'VARCHAR';
What you learned
- FineType's CLI profiles CSV files; Parquet requires either a CSV export step or the DuckDB extension
- DuckDB's COPY ... TO command extracts a sample from Parquet to CSV in one line
- The DuckDB extension's ft_profile table macro profiles columns in place, which is useful when you want to stay in SQL
- Both paths produce the same FineType type labels
See also
- profile command reference: all flags and output formats
- DuckDB Extension: full function reference for the SQL extension
- Build a Typed DuckDB Pipeline: take profiling results and create a fully typed table