Sunday, August 23, 2015

Thresholdout - a stunning paper in Science

Fig 1 of Dwork et al. Their algorithm stops bogus "learning" from being "validated" on a holdout set (see text).
A stunning paper in Science: "The reusable holdout: Preserving validity in adaptive data analysis" by Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. It addresses over-fitting to data, a serious problem that bedevils much published research. If you test enough correlations between two datasets of any complexity, you will eventually find one with an apparently convincing p-value. This is sometimes called "p-hacking" and is a significant reason why so much published research is false. It is an even more serious problem for machine learning from "big data."

A sensible remedy is to divide the data into a training set and a "holdout set", which is only looked at once you have developed an apparently plausible hypothesis on the training set. The trouble is that this only works once. If the holdout set doesn't validate the hypothesis and the researcher then tweaks the hypothesis and tests it again, the holdout set is no longer independent of the hypothesis, and the whole validity of the approach collapses.
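To make the collapse concrete, here is a minimal sketch in Python with NumPy. It is my own illustration, not the authors' code: the sizes `n` and `d`, the seed, and the selection rule are all assumptions chosen for the demonstration. The labels are pure noise, yet because the feature selection consults the holdout set, the holdout accuracy should come out well above chance while accuracy on genuinely fresh data should stay near 50%.

```python
import numpy as np

# Illustrative sizes and seed (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n, d = 1000, 2000


def make_data():
    """Random features with random +/-1 labels: there is nothing to learn."""
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)
    return X, y


X_train, y_train = make_data()
X_hold, y_hold = make_data()
X_fresh, y_fresh = make_data()  # never consulted during selection

# Empirical correlation of each feature with the label on each set.
corr_train = X_train.T @ y_train / n
corr_hold = X_hold.T @ y_hold / n

# The adaptive step that misuses the holdout set: keep only the features
# whose correlation has the same sign on the training AND holdout sets.
selected = np.flatnonzero(np.sign(corr_train) == np.sign(corr_hold))
weights = np.sign(corr_train[selected])


def accuracy(X, y):
    """Accuracy of a simple sign-vote classifier over the selected features."""
    return float(np.mean(np.sign(X[:, selected] @ weights) == y))


holdout_acc = accuracy(X_hold, y_hold)   # inflated by the reuse
fresh_acc = accuracy(X_fresh, y_fresh)   # honest estimate, near chance
print(f"holdout: {holdout_acc:.3f}  fresh: {fresh_acc:.3f}")
```

The only difference between the two numbers is that the holdout set was looked at while choosing the features, which is exactly the kind of reuse described above.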

Dwork and her colleagues have developed an algorithm they call Thresholdout, which deals with this problem by injecting Laplace noise into the answers to queries on the holdout set, so that the analyst learns too little about the holdout set to over-fit to it. Fig 1 shows what happens with a 10,000-point training set and holdout set with a binary Y value that is in fact completely random. With the normal approach the "verified" accuracy of the classifier on the (repeatedly used) holdout set rises to over 60% (green line), whereas with Thresholdout it remains at about 50%.
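The core of the idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm exactly as published: the paper's version also adds noise to the threshold itself and tracks a budget on how many times the holdout answer may be released, both of which are omitted here. The function name, the default `threshold` and `sigma` values, and the noise scales are illustrative choices.

```python
import numpy as np


def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query about the holdout set, Thresholdout-style.

    train_vals / holdout_vals hold the per-example values of the query
    (for instance 1.0 where a classifier is correct, 0.0 where it is not).
    Simplified sketch: no noisy threshold and no budget, unlike the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    train_mean = float(np.mean(train_vals))
    holdout_mean = float(np.mean(holdout_vals))

    # Compare the gap between the two estimates to a noisy threshold.
    eta = rng.laplace(scale=4 * sigma)
    if abs(holdout_mean - train_mean) > threshold + eta:
        # The training estimate looks over-fitted: release the holdout
        # estimate, but only with Laplace noise added.
        return holdout_mean + rng.laplace(scale=sigma)

    # The two agree closely enough: return the training estimate, which
    # reveals nothing new about the holdout set.
    return train_mean


# Usage: a query that looks perfect on training data but is useless on holdout.
rng = np.random.default_rng(0)
train_correct = np.ones(10_000)
holdout_correct = rng.integers(0, 2, size=10_000).astype(float)
print(thresholdout(train_correct, holdout_correct, rng=rng))
```

When the training and holdout estimates agree, the analyst gets back the training value and learns nothing about the holdout set; when they disagree, the answer is blurred by noise. Both branches limit how much each query leaks, which is what lets the holdout set be reused.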

1 comment:

SJ said...

Yes, indeed, Aaron Roth is the illustrious son of Alvin Roth. And a wonderful academician in his own right. It's amusing this is tagged 'Al Roth' - talk about living in the shadow of a famous father :P