Data & Provenance

Where each dataset comes from, the licenses, how releases are published, and what we promise about URLs.

A reference dataset is only useful if you can trust where it came from and count on it not to shift under you. This page is the provenance and trust contract for Meridian Datasets: sources, licenses, release mechanics, typed columns, the URL stability promise, and the limits we're honest about.

Where each dataset comes from

Every dataset is a curated projection of an authoritative public source. We keep a stable subset of columns and normalize types; we don't invent or edit records.

Dataset	Source	Columns kept
GLEIF	GLEIF golden copy — the concatenated public LEI file	`lei`, `legal_name`, `country`, `city`, `jurisdiction`, `legal_form`, `entity_status`, `registration_status`, `initial_registration`, `next_renewal`
SEC EDGAR	SEC EDGAR company tickers	`cik`, `company_name`, `ticker`, `exchange`
NAICS	U.S. Census Bureau, 2022 edition	`code`, `title`, `level`, `description`

Each table also carries a hidden corpus column — a normalized blend of the text columns that powers the explorer's search. It's an index artifact, not source data; see Use anywhere.

Open licenses only

We publish a dataset only when its license is proven to permit redistribution. The default is to refuse — a source is out until it's shown to be safe to share, not the other way around.

Dataset	License	Commercial use
GLEIF	CC0 1.0 (public domain dedication)	Yes, no attribution required
SEC EDGAR	U.S. public domain (federal government work)	Yes
NAICS	U.S. public domain (Census Bureau)	Yes

Share-alike licenses (such as ODbL) and proprietary identifier sets are deliberately excluded from the open commons, because redistributing them cleanly isn't possible. If a dataset here ever carried different terms, they'd be stated on its page and in this table — but today everything is CC0 or public domain.

How releases are published

Datasets are immutable and content-addressed, which is what makes reads reproducible.

Each time a dataset is built, the Parquet is written to a URL whose path contains a hash of its contents — for example gleif/07e1bb77d258ba51.parquet. Because the bytes at that URL can never change, it's cached forever and never overwritten. A separate release simply gets a new hash and a new URL; the old one stays valid for anyone still reading it.

A single small manifest.json is the pointer. It's served with no-cache, so it's always fresh, and it maps each dataset to its current content-addressed URL plus metadata:

{
  "datasets": {
    "gleif": {
      "url": "https://openlake.meridian.online/gleif/07e1bb77d258ba51.parquet",
      "rows": 3361809,
      "bytes": 127418763,
      "sha256": "07e1bb77d258ba51",
      "published": "2026-07-04T06:31:25Z"
    }
  }
}

Two consequences worth knowing:

As-of dates are explicit. The published timestamp on each dataset is its as-of. When you need to cite exactly what you queried, cite the sha256 and published from the manifest.
Reads don't tear. The browser engine caches byte ranges by URL. Because a URL's bytes are frozen, a republish can never mix old and new bytes into a corrupt read — the pointer flips atomically, and readers move to the new URL cleanly on their next query.

The stable <slug>.parquet URL (for example gleif.parquet) always follows the latest release. For a snapshot that will never change under you, read the content-addressed URL from the manifest instead.

Typed columns

Columns carry semantic types precomputed offline by Finetype — not inferred live in your browser. Only confident, in-domain types are attached; a column Finetype can't type confidently simply has no badge. Those types show as badges in the explorer's Profile tab and its column menus.

For GLEIF, for example:

Column	Type	Finetype label
`lei`	LEI	`finance.securities.lei`
`country`	Country Code	`geography.location.country_code`
`jurisdiction`	Country Code	`geography.location.country_code`
`city`	City Name	`geography.location.city`
`initial_registration`	ISO Date	`datetime.date.iso`
`next_renewal`	ISO Date	`datetime.date.iso`

The value is that the type is actionable: an ISO Date column parses cleanly as a date, an LEI column is a checksummed identifier you can join on with confidence.

URL stability promise

You can hard-code these and build on them:

Catalog — https://openlake.meridian.online/catalog/open.ducklake is stable.
Manifest — https://openlake.meridian.online/manifest.json is stable.
Per-dataset pointer — https://openlake.meridian.online/<slug>.parquet is stable and follows the latest release.
Versioned URLs — a content-addressed URL stays available for at least 30 days after it's superseded, so in-flight readers and pinned snapshots keep working while you migrate.

Known limits

We'd rather you know these up front:

Refresh cadence is manual today. There's no live-freshness SLA yet. Releases happen periodically; the manifest's published date is the honest as-of for each dataset. Treat these as slow-changing reference tables, not real-time feeds.
We don't edit source records. If a value is wrong at the source, it's wrong here until the source corrects it and we re-pull. We publish faithful projections, not our own corrections.
Column subsets, not the full source schema. We keep the columns most useful for lookups and joins (listed above), not every field the upstream file carries.
Search index tracks the release closely, not perfectly. The GLEIF search index is rebuilt alongside the data; the manifest records its own searchPublished timestamp, which can briefly lag a fresh data republish.

Report a data issue

Spotted something wrong, stale, or missing? Open an issue — include the dataset slug, the row or value, and what you expected:

github.com/meridian-online/open-datasets

That's also where to request a new dataset.