Snowflake Marketplace · Product Documentation
US Business Registry Data
8,031,121 active and historic US business entities sourced from 10 Secretary of State portals. NAICS-coded, PII-stripped. Daily refresh on 7 states (NY, TX, PA, CO, IA, OR, CT); 3 states (VA, DE, LA) ship as static snapshots from the original May 2026 backfill because their Secretary of State offices do not publish a free public bulk API.
Dataset at a glance
- Rows: 8,031,121
- States covered: 10
- NAICS-populated: All rows
- Distinct NAICS codes: 1,327
The 7 daily-refresh states are ingested from their Secretary of State Socrata APIs each morning at 7:00 a.m. ET, and changes propagate to the Snowflake table by 7:30 a.m. ET. VA, DE, and LA are point-in-time snapshots from the 2026-05 backfill: their state offices either gate bulk data behind paid access or do not publish a public API.
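The headline figures can be re-derived with a single aggregate. This is a minimal sketch, assuming the share is mounted under the database name shown in the Schema section and that unpopulated NAICS values would appear as NULL rather than as empty strings (the documentation does not say which).

```sql
-- Re-derive the "at a glance" figures. Results will drift from the
-- published numbers as the daily-refresh states update.
SELECT
    COUNT(*)                   AS total_rows,
    COUNT(DISTINCT state)      AS states_covered,
    COUNT(naics_code)          AS naics_populated_rows,  -- counts non-NULL values only
    COUNT(DISTINCT naics_code) AS distinct_naics_codes
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS;
```

If naics_populated_rows equals total_rows, the "All rows" claim holds under the NULL assumption above.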
Schema
Table PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS.
| Column | Type | Description |
|---|---|---|
| entity_name | VARCHAR(500) | Legal name of the business entity as filed with the Secretary of State. |
| entity_type | VARCHAR(100) | Filing type: LLC, Corp, LP, Nonprofit, Sole Prop, etc. |
| status | VARCHAR(50) | Active, Dissolved, or Unknown. |
| formation_date | DATE | Date the entity was first registered with the state. |
| principal_address | VARCHAR(500) | Full address as filed (often street + city + state + zip). |
| street | VARCHAR(300) | Parsed street component of principal_address. |
| city | VARCHAR(100) | Parsed city component of principal_address. |
| state | VARCHAR(10) | Two-letter state code. Constrained to the 10 states listed below. |
| zip | VARCHAR(20) | Parsed postal code from principal_address. |
| filing_id | VARCHAR(100) | Unique identifier assigned by the Secretary of State. |
| source_url | VARCHAR(500) | Public URL on the Secretary of State portal where the filing record can be verified. |
| naics_code | VARCHAR(20) | Six-digit NAICS industry classification. Populated for all 8,031,121 rows; 1,327 distinct codes appear in the dataset. Codes may originate from the state filing or, where the state does not collect NAICS, be inferred from the entity name. |
Coverage and refresh
Row counts are exact as of 2026-05-21, when they were queried directly from the Snowflake table. Daily-refresh states are ingested from their state Socrata API each morning; static states ship from the 2026-05 backfill and have no public refresh source.
| State | Rows | Refresh |
|---|---|---|
| NY — New York | 4,034,725 | Daily |
| TX — Texas | 2,903,089 | Daily |
| IA — Iowa | 304,558 | Daily |
| PA — Pennsylvania | 269,159 | Daily |
| CT — Connecticut | 240,866 | Daily |
| CO — Colorado | 80,196 | Daily |
| OR — Oregon | 27,023 | Daily |
| VA — Virginia | 71,099 | Static |
| DE — Delaware | 64,616 | Static |
| LA — Louisiana | 35,790 | Static |
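The per-state counts in the table can be reproduced with a grouped count. The sketch below also pulls the most recent formation_date per state as a rough freshness signal. That column is an illustrative addition, not a documented refresh indicator, and for the static states it should simply stop around the backfill date.

```sql
-- Row count per state, largest first, with the newest formation date seen.
SELECT
    state,
    COUNT(*)            AS row_count,
    MAX(formation_date) AS latest_formation_date  -- rough freshness hint only
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
GROUP BY state
ORDER BY row_count DESC;
```

Counts for the daily-refresh states are expected to differ from the 2026-05-21 figures; the three static states should match them.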
Sample query
All active LLCs formed in New York in 2025, grouped by NAICS sector. Runs in seconds on an XSMALL warehouse.
    SELECT
        LEFT(naics_code, 2) AS naics_sector,
        COUNT(*)            AS new_llcs_2025
    FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
    WHERE state = 'NY'
      AND entity_type = 'LLC'
      AND status = 'Active'
      AND formation_date BETWEEN '2025-01-01' AND '2025-12-31'
    GROUP BY naics_sector
    ORDER BY new_llcs_2025 DESC;
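One caveat when grouping on the first two digits: three NAICS sectors span more than one two-digit prefix (Manufacturing is 31-33, Retail Trade is 44-45, Transportation and Warehousing is 48-49), so a plain LEFT(naics_code, 2) splits each of them across several rows. A possible variant of the query that folds those prefixes together is sketched below; the sector labels in the CASE expression are illustrative, not values stored in the table.

```sql
-- Same filter as the sample query, with multi-prefix NAICS sectors merged.
SELECT
    CASE
        WHEN LEFT(naics_code, 2) IN ('31', '32', '33') THEN '31-33'  -- Manufacturing
        WHEN LEFT(naics_code, 2) IN ('44', '45')       THEN '44-45'  -- Retail Trade
        WHEN LEFT(naics_code, 2) IN ('48', '49')       THEN '48-49'  -- Transportation and Warehousing
        ELSE LEFT(naics_code, 2)
    END      AS naics_sector,
    COUNT(*) AS new_llcs_2025
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
WHERE state = 'NY'
  AND entity_type = 'LLC'
  AND status = 'Active'
  AND formation_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY naics_sector
ORDER BY new_llcs_2025 DESC;
```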
Data sourcing and compliance
- Primary sources. Rows are sourced from Secretary of State filings or their Socrata data portals where available. The dataset does not include third-party broker scrapes.
- PII stripped at ingestion. Registered-agent, officer, and principal name columns are dropped before the data is loaded into Snowflake. The dataset contains business entity metadata only, with no consumer PII.
- Source URL retained per row. Each row carries a source_url pointing to a public state-portal record where the filing can be re-verified.
- NAICS coverage. The naics_code column is populated for all 8,031,121 rows (1,327 distinct codes appear). Where a state filing does not collect NAICS, the code is inferred from the entity name; the dataset does not distinguish filed vs. inferred codes in a separate column.
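To spot-check the source_url and naics_code claims by hand, a small random sample is usually enough. The sketch below is one way to pull it; ORDER BY RANDOM() scans the whole table, which should be acceptable at roughly 8 million rows, though that is an assumption rather than a measured result.

```sql
-- Pull 10 random rows to verify against the state portals by hand.
SELECT
    state,
    filing_id,
    entity_name,
    naics_code,
    source_url
FROM PALAVIR_DATA.BUSINESS_REGISTRY.US_BUSINESS_FILINGS
ORDER BY RANDOM()
LIMIT 10;
```

Opening each source_url and comparing the entity name and filing ID against the portal record gives a quick check on provenance. Because filed and inferred NAICS codes are not distinguished, a code that is absent from the portal record may simply have been inferred.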
Try a sample on Kaggle
Want to inspect the row shape before installing the Snowflake share? Three Palavir datasets are published as free 5,000-row CC BY 4.0 samples on Kaggle:
- ATF Federal Firearms Licensees — 5K sample — companion to the full ATF FFL data product.
- OIG LEIE Healthcare Exclusions — 5K sample — companion to the Compliance API feed.
- OFAC SDN Sanctions — 5K sample — companion to the Compliance API feed.
The business-registry dataset itself is too large for Kaggle (8M+ rows); the Snowflake share is the delivery mechanism for the full data product.
Support
Questions about schema, coverage, or refresh planning? Email support@palavir.co. Privacy practices are documented at palavir.co/privacy.