SOMALISCAN
The full archive of US government spending, opened to the public.
60 tables · ~696M rows · 37 GB Parquet · CC0 public domain
Why this exists, why it ends here
SomaliScan started as a working transparency platform: a database backend, a search interface, an AI query layer. It ran for several years and accumulated a corpus of US federal, state, and local spending records that is far more useful combined than its public sources are separately.
Maintaining a live service is expensive. Open-sourcing the corpus is not. This is the terminal snapshot: every row of public-record spending data SomaliScan ever ingested, exported to Apache Parquet, published under CC0, and mirrored to permanent infrastructure. The live site is closed; the data is free, forever.
Some of what’s in here
- $26.9T in state-level vendor payments across all 50 states, 2003–2026
- 241M federal campaign contributions linked to spending recipients
- $793B across 11.5M PPP loans, fully cross-linked to federal contractors
- $39.5B to Lockheed Martin, the top federal contractor of the snapshot period
- $91M paid by industry to a single Florida physician in 2024 (CMS Open Payments)
- $79.9B in assets at the Lilly Endowment, the largest US nonprofit
Query without downloading
Every table is Apache Parquet. With DuckDB installed, you can query straight from the Hugging Face mirror — no download, no account, no API key:
SELECT recipient_name,
       ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions
FROM 'hf://datasets/somaliscan/spending-archive/federal_contracts_v2/**/*.parquet'
WHERE recipient_name IS NOT NULL
GROUP BY recipient_name
ORDER BY billions DESC
LIMIT 5;
Cite
@dataset{somaliscan_spending_2026,
title = {SomaliScan: US Government Spending Archive 2003--2026},
author = {SomaliScan Project},
year = {2026},
publisher = {Hugging Face Datasets},
version = {1.0.0},
url = {https://huggingface.co/datasets/somaliscan/spending-archive},
license = {CC0-1.0}
}
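Returning to the query under "Query without downloading": the same SQL can be run from Python. The sketch below is a minimal illustration, not part of the archive's tooling. It assumes the `duckdb` Python package is installed and that the installed DuckDB version supports `hf://` paths. The helper names (`table_glob`, `top_recipients_sql`, `run_top_recipients`) are hypothetical. The mirror path, table name, and column names are copied from the SQL example.

```python
# Base path copied from the SQL example in this document.
HF_BASE = "hf://datasets/somaliscan/spending-archive"


def table_glob(table: str) -> str:
    """Return the Parquet glob for one table directory on the mirror."""
    return f"{HF_BASE}/{table}/**/*.parquet"


def top_recipients_sql(table: str = "federal_contracts_v2", limit: int = 5) -> str:
    """Build the top-recipients query from the SQL example as a string."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT recipient_name,\n"
        "       ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions\n"
        f"FROM '{table_glob(table)}'\n"
        "WHERE recipient_name IS NOT NULL\n"
        "GROUP BY recipient_name\n"
        "ORDER BY billions DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_top_recipients(limit: int = 5):
    """Run the query over the network and return a list of row tuples.

    Assumption: `pip install duckdb` has been done and the machine can
    reach the Hugging Face mirror.
    """
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(top_recipients_sql(limit=limit)).fetchall()
```

Calling `run_top_recipients()` should return `(recipient_name, billions)` tuples. Only `federal_contracts_v2` is named in this document, so passing any other table name to `top_recipients_sql` is a guess that would need checking against the mirror's actual layout and column names.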