How We Detected 9 Out of 10 Proven Fraud Cases Using Public Data
We tested our statistical models against 10 proven fraud cases, from Feeding Our Future ($250M) to Wells Fargo ($3B). 9 out of 10 were flagged before enforcement action.
Can public data and statistical models detect fraud before enforcement agencies act? We tested this question by running our models against 10 proven fraud cases where DOJ convictions or SEC settlements confirmed the fraud after the fact.
The result: 9 out of 10 cases were flagged by our models using data that was available before enforcement action was taken.
The Test
Our proven patterns analysis selected 10 major fraud cases spanning healthcare, government programs, financial markets, and consumer fraud. These cases had confirmed outcomes: guilty pleas, settlements, or convictions. The total fraud across all 10 cases exceeds $4 billion.
For each case, we asked: would our statistical models have flagged this entity using the public data available at the time?
The models we tested include:
- Isolation Forest for PPP loan anomaly detection
- Beneish M-Score for financial statement manipulation
- CFPB complaint velocity for consumer-facing fraud signals
- Medicare billing z-scores for healthcare provider outliers
- FEC structuring analysis for campaign finance violations
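Of the models listed, the Beneish M-Score is the most mechanical to reproduce. A minimal sketch of the eight-variable score is below, using the published coefficients; the two sets of index values are invented for illustration, not drawn from any real filing:

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-variable Beneish M-Score. Scores above roughly -1.78
    are commonly read as suggesting possible earnings manipulation."""
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi
            + 0.892 * sgi + 0.115 * depi - 0.172 * sgai
            + 4.679 * tata - 0.327 * lvgi)

# A "clean" filing: every index near 1, total accruals near zero.
clean = beneish_m_score(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01, 1.0)

# An "aggressive" filing: growing receivables, margins eroding,
# sales surging, accruals high. All values hypothetical.
aggressive = beneish_m_score(1.6, 1.2, 1.3, 1.5, 1.1, 0.9, 0.12, 1.1)

print(clean, aggressive)  # roughly -2.43 and -0.70
```

The clean profile lands safely below the -1.78 threshold; the aggressive one crosses it, which is the signal the model contributes.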
Each model uses different data sources and different statistical methods. The test was whether any of these models, individually or in combination, would have produced a flag.
The Results
Of the 10 proven cases:
- 5 fully detected: The models produced clear anomaly signals using pre-enforcement data
- 4 partially detected: At least one model flagged the entity, though not all relevant signals were captured
- 1 not detected: The fraud method did not produce signals in the public datasets we analyzed
The average detection score across all 10 cases was 7.2 out of 10.
Examples
Feeding Our Future ($250M). This Minnesota nonprofit scheme diverted federal child nutrition funds. Our models flagged anomalous grant disbursement patterns and geographic concentration of payments to related entities.
Wells Fargo ($3B settlement). CFPB complaint velocity analysis showed a measurable spike in specific complaint categories (unauthorized accounts, identity theft) years before the settlement. The complaint volume trajectory was statistically abnormal compared to peer banks.
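A complaint-velocity spike test of the kind described can be sketched as a trailing-window z-score over monthly counts. The monthly figures below are invented, and the window and threshold are illustrative parameters, not the ones used in the analysis:

```python
import statistics

def velocity_spikes(monthly_counts, window=6, threshold=3.0):
    """Return indices of months whose count exceeds the trailing
    `window`-month mean by `threshold` trailing standard deviations."""
    spikes = []
    for i in range(window, len(monthly_counts)):
        baseline = monthly_counts[i - window:i]
        mean = statistics.mean(baseline)
        sd = statistics.stdev(baseline)
        if sd and (monthly_counts[i] - mean) / sd >= threshold:
            spikes.append(i)
    return spikes

# Hypothetical monthly "unauthorized account" complaint counts:
# flat for eight months, then a sustained jump.
counts = [40, 42, 38, 41, 39, 40, 43, 41, 95, 110, 120, 130]
print(velocity_spikes(counts))  # [8]
```

Note that only the first month of the jump is flagged: once the spike enters the trailing baseline it inflates the standard deviation, which is why velocity tests detect the onset of abnormal volume rather than its continuation.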
Medicare billing schemes. Several healthcare fraud cases were flagged by Medicare billing z-scores. Providers who billed at 3 or more standard deviations above their specialty average appeared consistently in both our flagged list and subsequent DOJ actions.
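The specialty z-score test can be sketched with leave-one-out peer statistics, so that an extreme biller does not inflate its own baseline. Provider IDs and billing figures here are invented:

```python
import statistics
from collections import defaultdict

# Hypothetical (provider, specialty, billed-per-patient) rows.
rows = [
    ("P1", "cardiology", 410), ("P2", "cardiology", 395),
    ("P3", "cardiology", 420), ("P4", "cardiology", 405),
    ("P5", "cardiology", 2600),   # extreme outlier
    ("P6", "dermatology", 150), ("P7", "dermatology", 160),
    ("P8", "dermatology", 145), ("P9", "dermatology", 155),
]

by_specialty = defaultdict(list)
for provider, specialty, billed in rows:
    by_specialty[specialty].append((provider, billed))

flagged = []
for specialty, members in by_specialty.items():
    for provider, billed in members:
        # Compare each provider against its specialty peers,
        # excluding the provider itself from the baseline.
        peers = [b for p, b in members if p != provider]
        mean = statistics.mean(peers)
        sd = statistics.stdev(peers)
        if sd and (billed - mean) / sd >= 3:
            flagged.append(provider)

print(flagged)  # ['P5']
```

Only the outlier cardiologist clears the 3-standard-deviation bar; the dermatologists, whose billing varies within a normal band, do not.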
What This Means
This is not a claim that public data catches everything. One case out of ten was missed entirely because the fraud method (internal collusion with no external financial footprint) did not produce signals in federal datasets.
But 90% detection using only public data and standard statistical methods is significant. These models do not require subpoena power, insider information, or proprietary data. They run on datasets that CMS, SBA, SEC, FTC, and other agencies publish for free.
Cross-Dataset Signals
The strongest detection results came from cases where multiple models flagged the same entity independently. Our cross-dataset analysis documents 7 novel correlations across PPP, healthcare, corporate, and consumer fraud datasets.
When a PPP borrower is also flagged for Medicare billing anomalies, and the same entity shows up in CFPB complaint clusters, the probability of legitimate business activity explaining all three signals simultaneously drops sharply.
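The arithmetic behind "drops sharply" can be made concrete under an independence assumption. The false-positive rates below are illustrative, not measured:

```python
# Hypothetical per-model false-positive rates on legitimate entities.
fp_ppp, fp_medicare, fp_cfpb = 0.05, 0.05, 0.05

# Probability a legitimate entity trips all three flags by chance,
# assuming the signals are independent (real data only approximates
# this, since related businesses share owners, addresses, and size).
p_all_three = fp_ppp * fp_medicare * fp_cfpb
print(p_all_three)  # about 0.000125, i.e. roughly 1 in 8,000
```

Even if each individual model is noisy, requiring agreement across independent datasets multiplies the error rates together, which is what makes cross-referencing so much stronger than any single flag.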
This is the core insight: fraud detection improves dramatically when you cross-reference across datasets rather than analyzing each dataset in isolation.
Why This Is Not Being Done at Scale
Federal agencies maintain separate databases. CMS does not routinely cross-reference with SBA. The FTC complaint system does not connect to SEC enforcement data. Each agency has its own analytical pipeline, its own data format, and its own enforcement priorities.
The technical barrier to cross-referencing is low. Entity matching, geographic clustering, and temporal correlation are standard data science techniques. The barrier is organizational: agencies are not structured to share data at the speed required for real-time fraud prevention.
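Entity matching of the sort described is straightforward to prototype. A minimal sketch using name normalization follows; the entity names are invented, and production pipelines typically add fuzzy matching on addresses, EINs, and other identifiers:

```python
import re

# Common corporate suffixes to strip before matching.
SUFFIXES = {"llc", "inc", "corp", "co", "ltd"}

def normalize(name):
    """Crude entity key: lowercase, drop punctuation, strip
    corporate suffixes. Enough to join records across datasets
    that spell the same entity differently."""
    tokens = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

# Hypothetical entity names from two federal datasets.
ppp_borrowers = {"Acme Medical Group, LLC", "Sunrise Catering Inc."}
medicare_providers = {"ACME MEDICAL GROUP", "Lakeside Clinic"}

matches = ({normalize(n) for n in ppp_borrowers}
           & {normalize(n) for n in medicare_providers})
print(matches)  # {'acme medical group'}
```

Twenty lines of normalization recovers a cross-dataset join that the agencies' separate pipelines never perform, which is the point: the barrier is organizational, not technical.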
Replicate It Yourself
Every dataset and method used in this analysis is documented. The underlying federal data is public domain. Our analysis methods, including the specific model parameters and thresholds, are described on each investigation page.
View the full proven patterns analysis
About the Author
Founder & Principal Consultant
Josh helps SMBs implement AI and analytics that drive measurable outcomes. With experience building data products and scaling analytics infrastructure, he focuses on practical, cost-effective solutions that deliver ROI within months, not years.