Here’s a fun science mystery, with surprising ways to catch bad guys and a dash of metadata flavor, which makes it hard to know where to begin this blog post…
Bad guys are juicy–suppose that you’re a bad guy. Suppose you want to fake bookkeeping data or election results. Well, bwa-ha-ha, bad guy, you’re going to leave a mathematical “bad-guys-R-us” slimy trail–because fake random numbers like yours don’t obey Benford’s Law. Real ones do.
Benford’s Law describes (oddly, nobody fully understands why) many if not most huge collections of numbers.* Baseball statistics, lengths of rivers, areas of counties. Half-lives of radioactive isotopes. And vote counts, when those vote counts aren’t tampered with.
Big numbers have nine choices for their first digit–1, 2, 3, 4, 5, 6, 7, 8, or 9. Right? So you’d expect that one-ninth of all big numbers would start with each of those digits.
Bzzzt, wrong! About 30% of Benford-law-following numbers start with 1, just for example, while fewer than 5% start with 9. Naive human fraudsters, on the other hand, create fake numbers that mostly start with 5 or 6, poking their inventions into what they imagine is anonymity’s forgiving middle.
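For the curious, here is where the share of leading 1s comes from. The standard statement of Benford’s Law says a leading digit d turns up with probability log10(1 + 1/d). That formula is not spelled out in the sources quoted in this post, so treat it as an addition. A minimal Python sketch, with benford_probability as a made-up helper name:

```python
import math

def benford_probability(d):
    """Expected share of numbers whose first digit is d (1 through 9)."""
    return math.log10(1 + 1 / d)

# Print the expected share for each possible first digit.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine shares add up to exactly 1, and they shrink steadily from digit 1 down to digit 9.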
Now for the dirty math books (I knew you were waiting). Benford’s Law was found independently, first in 1881 (Simon Newcomb) and again in 1938 (Frank Benford). Lisa Zyga at PhysOrg.com says:
Benford and Newcomb stumbled upon the law in the same way: while flipping through pages of a book of logarithmic tables, they noticed that the pages in the beginning of the book were dirtier than the pages at the end. This meant that their colleagues who shared the library preferred quantities beginning with the number one in their various disciplines…
Yes, dirty library-book pages! Important pre-Google metadata about what people before you found interesting.
Bad guys who created fake election data imagined that they were just creating new data–but Benford’s Law meant they left metadata behind. Forensic teams who want to find election fraud can use Benford’s Law to find out which sets of data have bad guys behind them.
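What might such a check look like? One common approach, offered here as a rough sketch and not as any particular forensic team’s method, is to tally first digits and compare them with Benford’s expected shares using a chi-square statistic. The function names and both sample data sets below are hypothetical, and the code assumes positive whole numbers such as vote counts.

```python
import math
from collections import Counter

def first_digit_counts(values):
    """Count how many positive integers start with each digit 1..9."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    return [counts.get(d, 0) for d in range(1, 10)]

def benford_chi_square(values):
    """Chi-square distance between observed first digits and Benford's Law."""
    observed = first_digit_counts(values)
    n = sum(observed)
    if n == 0:
        raise ValueError("need at least one positive number")
    stat = 0.0
    for d, obs in zip(range(1, 10), observed):
        expected = n * math.log10(1 + 1 / d)  # Benford's expected count
        stat += (obs - expected) ** 2 / expected
    return stat

# Illustrative data only: powers of 2 are a classic Benford-like sequence,
# while the "invented" list starts every number with 5 or 6.
honest_looking = [2 ** k for k in range(1, 201)]
invented = [500 + k for k in range(200)]

print(benford_chi_square(honest_looking))
print(benford_chi_square(invented))
```

With 8 degrees of freedom, a statistic above about 15.5 would be unusual at the 5% level for data that truly follows Benford’s Law. A big number is a reason to look closer, not proof of fraud: as the p.s. notes, plenty of honest data sets do not follow the law.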
Now, as for you good guys, for deeper insight into metadata, I recommend David Weinberger’s new book, Everything Is Miscellaneous. Meanwhile, for you bad guys, one message of both David Weinberger and Frank Benford is paraphrased clearly in Matthew 10:26:
…there is nothing covered, that shall not be revealed; and hid, that shall not be known.
Uh-oh. And I consider my own self a good guy.
* p.s. Not all data sets follow Benford’s Law. Both Wikipedia and Zyga give counter-examples, such as (to quote Zyga) “data sets that are arbitrary and contain restrictions... For example, lottery numbers, telephone numbers, gas prices, dates, and the weights or heights of a group of people.”