Snowflake Marketplace · Product Documentation

US Business Registry Data

8,031,121 active and historic US business entities sourced from 10 Secretary of State portals. NAICS-coded, PII-stripped. Daily refresh for 7 states (NY, TX, PA, CO, IA, OR, CT); the other 3 states (VA, DE, LA) ship as static snapshots from the original May 2026 backfill because their Secretary of State offices do not publish a free public bulk API.

Dataset at a glance

Rows: 8,031,121
States covered: 10
NAICS-populated: All rows
Distinct NAICS codes: 1,327

Data for the 7 daily-refresh states is ingested from each state's Secretary of State Socrata API every morning at 7am ET and reaches the Snowflake table by 7:30am ET. VA, DE, and LA are point-in-time snapshots from the 2026-05 backfill: their state offices either gate bulk data behind paid access or do not publish a public API.
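Since only the seven daily-refresh states pick up filings made after the backfill, a freshness-sensitive query may want to leave out VA, DE, and LA. The following is a minimal sketch, assuming the table PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS and its state and formation_date columns; the 30-day window is an arbitrary illustration, not part of the product definition.

```sql
-- Entities formed in the last 30 days, limited to states that refresh daily.
-- The 30-day window is illustrative; adjust it to the analysis at hand.
SELECT
  state,
  COUNT(*) AS formed_last_30_days
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
WHERE state IN ('NY', 'TX', 'PA', 'CO', 'IA', 'OR', 'CT')
  AND formation_date >= DATEADD(day, -30, CURRENT_DATE())
GROUP BY state
ORDER BY formed_last_30_days DESC;
```

Running this after 7:30am ET should reflect that morning's ingest, given the propagation schedule described above.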

Schema

Table PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS.

| Column | Type | Description |
| --- | --- | --- |
| entity_name | VARCHAR(500) | Legal name of the business entity as filed with the Secretary of State. |
| entity_type | VARCHAR(100) | Filing type: LLC, Corp, LP, Nonprofit, Sole Prop, etc. |
| status | VARCHAR(50) | Active, Dissolved, or Unknown. |
| formation_date | DATE | Date the entity was first registered with the state. |
| principal_address | VARCHAR(500) | Full address as filed (often street + city + state + zip). |
| street | VARCHAR(300) | Parsed street component of principal_address. |
| city | VARCHAR(100) | Parsed city component of principal_address. |
| state | VARCHAR(10) | Two-letter state code. Constrained to the 10 states listed below. |
| zip | VARCHAR(20) | Parsed postal code from principal_address. |
| filing_id | VARCHAR(100) | Unique identifier assigned by the Secretary of State. |
| source_url | VARCHAR(500) | Public URL on the Secretary of State portal where the filing record can be verified. |
| naics_code | VARCHAR(20) | Six-digit NAICS industry classification. Populated for all 8,031,121 rows; 1,327 distinct codes appear in the dataset. Codes may originate from the state filing or, where the state does not collect NAICS, be inferred from the entity name. |
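The NAICS coverage figures can be checked against the live table with a single aggregate. This is a sketch rather than an official validation query. It assumes that unpopulated values would be stored as NULL; an empty string would still be counted as populated. Counts will drift from the published figures as the daily-refresh states update.

```sql
-- Compare total rows with rows carrying a NAICS code, and count distinct codes.
SELECT
  COUNT(*)                   AS total_rows,
  COUNT(naics_code)          AS rows_with_naics,      -- COUNT(column) skips NULLs
  COUNT(DISTINCT naics_code) AS distinct_naics_codes
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS;
```

If total_rows and rows_with_naics match, every row carries a non-NULL code.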

Coverage and refresh

Row counts are exact, queried directly from the Snowflake table on 2026-05-21. Daily-refresh states ingest from their state Socrata API each morning; static states ship from a 2026-05 backfill (no public refresh source).

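| State | Rows | Refresh |
| --- | --- | --- |
| NY New York | 4,034,725 | Daily |
| TX Texas | 2,903,089 | Daily |
| IA Iowa | 304,558 | Daily |
| PA Pennsylvania | 269,159 | Daily |
| CT Connecticut | 240,866 | Daily |
| CO Colorado | 80,196 | Daily |
| OR Oregon | 27,023 | Daily |
| VA Virginia | 71,099 | Static |
| DE Delaware | 64,616 | Static |
| LA Louisiana | 35,790 | Static |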
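The per-state breakdown can be reproduced with a grouped count. In this sketch the Daily/Static label is derived from the state code with a CASE expression, because the schema has no refresh column; that mapping is an assumption taken from the list above. Counts for the daily-refresh states will differ from the 2026-05-21 figures as new filings arrive.

```sql
-- Row counts per state, labelled by refresh cadence.
-- The cadence label is hard-coded from the documentation, not read from the table.
SELECT
  state,
  CASE WHEN state IN ('VA', 'DE', 'LA') THEN 'Static' ELSE 'Daily' END AS refresh,
  COUNT(*) AS row_count
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
GROUP BY 1, 2
ORDER BY refresh, row_count DESC;
```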

Sample query

Counts active LLCs formed in New York in 2025, grouped by NAICS sector (the first two digits of naics_code). Runs in seconds on an XSMALL warehouse.

SELECT
  LEFT(naics_code, 2) AS naics_sector,
  COUNT(*) AS new_llcs_2025
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
WHERE state = 'NY'
  AND entity_type = 'LLC'
  AND status = 'Active'
  AND formation_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY naics_sector
ORDER BY new_llcs_2025 DESC;
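One caveat when grouping on the two-digit prefix: three NAICS sectors span more than one prefix (31-33 Manufacturing, 44-45 Retail Trade, 48-49 Transportation and Warehousing), so the query above reports them as separate rows. A possible variant, offered as a sketch, folds those prefixes into one label per sector:

```sql
-- Same filter as the sample query, with multi-prefix NAICS sectors merged.
SELECT
  CASE
    WHEN LEFT(naics_code, 2) IN ('31', '32', '33') THEN '31-33'  -- Manufacturing
    WHEN LEFT(naics_code, 2) IN ('44', '45')       THEN '44-45'  -- Retail Trade
    WHEN LEFT(naics_code, 2) IN ('48', '49')       THEN '48-49'  -- Transportation and Warehousing
    ELSE LEFT(naics_code, 2)
  END AS naics_sector,
  COUNT(*) AS new_llcs_2025
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
WHERE state = 'NY'
  AND entity_type = 'LLC'
  AND status = 'Active'
  AND formation_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY 1
ORDER BY new_llcs_2025 DESC;
```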

Data sourcing and compliance

  • Primary sources. Rows are sourced from Secretary of State filings or their Socrata data portals where available. The dataset does not include third-party broker scrapes.
  • PII stripped at ingestion. Registered-agent, officer, and principal name columns are dropped before the data is loaded into Snowflake. The dataset contains business entity metadata only, with no consumer PII.
  • Source URL retained per row. Each row carries a source_url pointing to a public state-portal record where the filing can be re-verified.
  • NAICS coverage. The naics_code column is populated for all 8,031,121 rows (1,327 distinct codes appear). Where a state filing does not collect NAICS, the code is inferred from the entity name; the dataset does not distinguish filed vs. inferred codes in a separate column.
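To spot-check the sourcing claims, one option is to pull a handful of random rows and open each source_url by hand. The sketch below uses Snowflake's SAMPLE clause; the sample size of 10 is arbitrary.

```sql
-- Pull 10 random rows for manual verification against the state portals.
SELECT
  filing_id,
  entity_name,
  state,
  naics_code,
  source_url
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS SAMPLE (10 ROWS);
```

Because the dataset does not flag filed versus inferred NAICS codes, comparing naics_code with the portal record is the only way to tell them apart for a given row, and only where the portal shows a NAICS code.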

Try a sample on Kaggle

Want to inspect the row shape before installing the Snowflake share? Three Palavir datasets are published as free 5,000-row CC BY 4.0 samples on Kaggle.

The business-registry dataset itself is too large for Kaggle (8M+ rows); the Snowflake share is the delivery mechanism for the full data product.

Support

Questions about schema, coverage, or refresh planning? Email support@palavir.co. Privacy practices are documented at palavir.co/privacy.