How We Detected 9 Out of 10 Proven Fraud Cases Using Public Data
We tested our statistical models against 10 proven fraud cases, from Feeding Our Future ($250M) to Wells Fargo ($3B). 9 out of 10 were flagged before enforcement action.
Can public data and statistical models detect fraud before enforcement agencies act? We tested this question by running our models against 10 proven fraud cases where DOJ convictions or SEC settlements confirmed the fraud after the fact.
The result: 9 out of 10 cases were flagged by our models using data that was available before enforcement action was taken.
The Test
Our proven patterns analysis selected 10 major fraud cases spanning healthcare, government programs, financial markets, and consumer fraud. These cases had confirmed outcomes: guilty pleas, settlements, or convictions. The total fraud across all 10 cases exceeds $4 billion.
For each case, we asked: would our statistical models have flagged this entity using the public data available at the time?
The models we tested include:
- Isolation Forest for PPP loan anomaly detection
- Beneish M-Score for financial statement manipulation
- CFPB complaint velocity for consumer-facing fraud signals
- Medicare billing z-scores for healthcare provider outliers
- FEC structuring analysis for campaign finance violations
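Of the models listed, the Beneish M-Score is the most mechanical to reproduce. A minimal sketch of the eight-variable score is below, using the published coefficients; the two sets of index values are invented for illustration, not drawn from any real filing:

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-variable Beneish M-Score. Scores above roughly -1.78
    are commonly read as suggesting possible earnings manipulation."""
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi
            + 0.892 * sgi + 0.115 * depi - 0.172 * sgai
            + 4.679 * tata - 0.327 * lvgi)

# A "clean" filing: every index near 1, total accruals near zero.
clean = beneish_m_score(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01, 1.0)

# An "aggressive" filing: growing receivables, margins eroding,
# sales surging, accruals high. All values hypothetical.
aggressive = beneish_m_score(1.6, 1.2, 1.3, 1.5, 1.1, 0.9, 0.12, 1.1)

print(clean, aggressive)  # roughly -2.43 and -0.70
```

The clean profile lands safely below the -1.78 threshold; the aggressive one crosses it, which is the signal the model contributes.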
Each model uses different data sources and different statistical methods. The test was whether any of these models, individually or in combination, would have produced a flag.
The Results
Of the 10 proven cases:
- 5 fully detected: The models produced clear anomaly signals using pre-enforcement data
- 4 partially detected: At least one model flagged the entity, though not all relevant signals were captured
- 1 not detected: The fraud method did not produce signals in the public datasets we analyzed
The average detection score across all 10 cases was 7.2 out of 10.
Examples
Feeding Our Future ($250M). This Minnesota nonprofit scheme diverted federal child nutrition funds. Our models flagged anomalous grant disbursement patterns and geographic concentration of payments to related entities.
Wells Fargo ($3B settlement). CFPB complaint velocity analysis showed a measurable spike in specific complaint categories (unauthorized accounts, identity theft) years before the settlement. The complaint volume trajectory was statistically abnormal compared to peer banks.
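A complaint-velocity spike test of the kind described can be sketched as a trailing-window z-score over monthly counts. The monthly figures below are invented, and the window and threshold are illustrative parameters, not the ones used in the analysis:

```python
import statistics

def velocity_spikes(monthly_counts, window=6, threshold=3.0):
    """Return indices of months whose count exceeds the trailing
    `window`-month mean by `threshold` trailing standard deviations."""
    spikes = []
    for i in range(window, len(monthly_counts)):
        baseline = monthly_counts[i - window:i]
        mean = statistics.mean(baseline)
        sd = statistics.stdev(baseline)
        if sd and (monthly_counts[i] - mean) / sd >= threshold:
            spikes.append(i)
    return spikes

# Hypothetical monthly "unauthorized account" complaint counts:
# flat for eight months, then a sustained jump.
counts = [40, 42, 38, 41, 39, 40, 43, 41, 95, 110, 120, 130]
print(velocity_spikes(counts))  # [8]
```

Note that only the first month of the jump is flagged: once the spike enters the trailing baseline it inflates the standard deviation, which is why velocity tests detect the onset of abnormal volume rather than its continuation.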
Medicare billing schemes. Several healthcare fraud cases were flagged by Medicare billing z-scores. Providers who billed at 3 or more standard deviations above their specialty average appeared consistently in both our flagged list and subsequent DOJ actions.
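The specialty z-score test can be sketched with leave-one-out peer statistics, so that an extreme biller does not inflate its own baseline. Provider IDs and billing figures here are invented:

```python
import statistics
from collections import defaultdict

# Hypothetical (provider, specialty, billed-per-patient) rows.
rows = [
    ("P1", "cardiology", 410), ("P2", "cardiology", 395),
    ("P3", "cardiology", 420), ("P4", "cardiology", 405),
    ("P5", "cardiology", 2600),   # extreme outlier
    ("P6", "dermatology", 150), ("P7", "dermatology", 160),
    ("P8", "dermatology", 145), ("P9", "dermatology", 155),
]

by_specialty = defaultdict(list)
for provider, specialty, billed in rows:
    by_specialty[specialty].append((provider, billed))

flagged = []
for specialty, members in by_specialty.items():
    for provider, billed in members:
        # Compare each provider against its specialty peers,
        # excluding the provider itself from the baseline.
        peers = [b for p, b in members if p != provider]
        mean = statistics.mean(peers)
        sd = statistics.stdev(peers)
        if sd and (billed - mean) / sd >= 3:
            flagged.append(provider)

print(flagged)  # ['P5']
```

Only the outlier cardiologist clears the 3-standard-deviation bar; the dermatologists, whose billing varies within a normal band, do not.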
What This Means
This is not a claim that public data catches everything. One case out of ten was missed entirely because the fraud method (internal collusion with no external financial footprint) did not produce signals in federal datasets.
But 90% detection using only public data and standard statistical methods is significant. These models do not require subpoena power, insider information, or proprietary data. They run on datasets that CMS, SBA, SEC, FTC, and other agencies publish for free.
Cross-Dataset Signals
The strongest detection results came from cases where multiple models flagged the same entity independently. Our cross-dataset analysis documents 7 novel correlations across PPP, healthcare, corporate, and consumer fraud datasets.
When a PPP borrower is also flagged for Medicare billing anomalies, and the same entity shows up in CFPB complaint clusters, the probability of legitimate business activity explaining all three signals simultaneously drops sharply.
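The arithmetic behind "drops sharply" can be made concrete under an independence assumption. The false-positive rates below are illustrative, not measured:

```python
# Hypothetical per-model false-positive rates on legitimate entities.
fp_ppp, fp_medicare, fp_cfpb = 0.05, 0.05, 0.05

# Probability a legitimate entity trips all three flags by chance,
# assuming the signals are independent (real data only approximates
# this, since related businesses share owners, addresses, and size).
p_all_three = fp_ppp * fp_medicare * fp_cfpb
print(p_all_three)  # about 0.000125, i.e. roughly 1 in 8,000
```

Even if each individual model is noisy, requiring agreement across independent datasets multiplies the error rates together, which is what makes cross-referencing so much stronger than any single flag.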
This is the core insight: fraud detection improves dramatically when you cross-reference across datasets rather than analyzing each dataset in isolation.
Why This Is Not Being Done at Scale
Federal agencies maintain separate databases. CMS does not routinely cross-reference with SBA. The FTC complaint system does not connect to SEC enforcement data. Each agency has its own analytical pipeline, its own data format, and its own enforcement priorities.
The technical barrier to cross-referencing is low. Entity matching, geographic clustering, and temporal correlation are standard data science techniques. The barrier is organizational: agencies are not structured to share data at the speed required for real-time fraud prevention.
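Entity matching of the sort described is straightforward to prototype. A minimal sketch using name normalization follows; the entity names are invented, and production pipelines typically add fuzzy matching on addresses, EINs, and other identifiers:

```python
import re

# Common corporate suffixes to strip before matching.
SUFFIXES = {"llc", "inc", "corp", "co", "ltd"}

def normalize(name):
    """Crude entity key: lowercase, drop punctuation, strip
    corporate suffixes. Enough to join records across datasets
    that spell the same entity differently."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

# Hypothetical entity names from two federal datasets.
ppp_borrowers = {"Acme Medical Group, LLC", "Sunrise Catering Inc."}
medicare_providers = {"ACME MEDICAL GROUP", "Lakeside Clinic"}

matches = ({normalize(n) for n in ppp_borrowers}
           & {normalize(n) for n in medicare_providers})
print(matches)  # {'acme medical group'}
```

Twenty lines of normalization recovers a cross-dataset join that the agencies' separate pipelines never perform, which is the point: the barrier is organizational, not technical.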
Replicate It Yourself
Every dataset and method used in this analysis is documented. The underlying federal data is public domain. Our analysis methods, including the specific model parameters and thresholds, are described on each investigation page.
View the full proven patterns analysis
About the Author
Founder & Principal Consultant
Josh helps SMBs implement AI and analytics that drive measurable outcomes. With experience building data products and scaling analytics infrastructure, he focuses on practical, cost-effective solutions that deliver ROI within months, not years.