SOMALISCAN

The full archive of US government spending, opened to the public.

60 tables · ~696M rows · 37 GB Parquet · CC0 public domain

Why this exists, why it ends here

SomaliScan started as a working transparency platform — a database backend, a search interface, an AI query layer. It ran for several years and accumulated a corpus of US federal, state, and local spending records that, in aggregate, is much more useful than the sum of its public sources.

Maintaining a live service is expensive. Open-sourcing the corpus is not. This is the terminal snapshot: every row of public-record spending data SomaliScan ever ingested, exported to Apache Parquet, published under CC0, and mirrored to permanent infrastructure. The live site is closed; the data is free, forever.

Some of what’s in here

  • $26.9T in state-level vendor payments across all 50 states, 2003–2026
  • 241M federal campaign contributions linked to spending recipients
  • $793B across 11.5M PPP loans, fully cross-linked to federal contractors
  • $39.5B to Lockheed Martin, the top federal contractor of the snapshot period
  • $91M paid by industry to a single Florida physician in 2024 (CMS Open Payments)
  • $79.9B in assets at the Lilly Endowment, the largest US nonprofit

Query without downloading

Every table is stored as Apache Parquet. With DuckDB installed, you can query straight from the Hugging Face mirror with no download step, no account, and no API key:

SELECT recipient_name,
       ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions
FROM 'hf://datasets/somaliscan/spending-archive/federal_contracts_v2/**/*.parquet'
WHERE recipient_name IS NOT NULL
GROUP BY recipient_name
ORDER BY billions DESC
LIMIT 5;
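
The same query can also be run from a script. What follows is a minimal Python sketch, not an official client: it assumes the `duckdb` package is installed and that your DuckDB version can read `hf://` paths. The helper names (`table_glob`, `top_recipients_sql`, `run`) are hypothetical, and the idea that other tables follow the same `<table>/**/*.parquet` layout as `federal_contracts_v2` is an assumption, not something the archive documents here.

```python
# Sketch: run the archive's example query from Python via DuckDB.
# Assumes `pip install duckdb`; helper names are illustrative only.

# Repository path taken from the example query above.
HF_ROOT = "hf://datasets/somaliscan/spending-archive"


def table_glob(table: str) -> str:
    """Glob for one table's Parquet files.

    Assumption: every table uses the same directory layout as
    federal_contracts_v2 in the example query.
    """
    return f"{HF_ROOT}/{table}/**/*.parquet"


def top_recipients_sql(limit: int = 5) -> str:
    """Build the top-recipients query; `limit` is coerced to int before interpolation."""
    return f"""
        SELECT recipient_name,
               ROUND(SUM(federal_action_obligation)::DOUBLE / 1e9, 1) AS billions
        FROM '{table_glob("federal_contracts_v2")}'
        WHERE recipient_name IS NOT NULL
        GROUP BY recipient_name
        ORDER BY billions DESC
        LIMIT {int(limit)}
    """


def run(limit: int = 5):
    """Execute the query against the remote mirror and return a list of row tuples."""
    import duckdb  # imported lazily so the helpers above work without DuckDB installed

    con = duckdb.connect()  # in-memory connection; no local database file is created
    return con.sql(top_recipients_sql(limit)).fetchall()


if __name__ == "__main__":
    for name, billions in run():
        print(f"{name}: ${billions}B")
```

Building the SQL in a separate function keeps the query text inspectable before anything touches the network, which is handy when adapting it to another table or column.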

Cite

@dataset{somaliscan_spending_2026,
  title  = {SomaliScan: US Government Spending Archive 2003--2026},
  author = {SomaliScan Project},
  year   = {2026},
  publisher = {Hugging Face Datasets},
  version = {1.0.0},
  url    = {https://huggingface.co/datasets/somaliscan/spending-archive},
  license = {CC0-1.0}
}