Fig 1 of Dwork et al. Their algorithm stops bogus "learning" being "validated" on a holdout set - see text.
A sensible remedy for this is to divide the data into a training set and a "holdout set" which is only looked at once you have developed an apparently plausible hypothesis on the training set. The trouble is that this only works once: if the holdout set fails to validate the hypothesis and the researcher then tweaks the hypothesis, the whole validity of the approach collapses.
Dwork and her colleagues have developed an algorithm they call Thresholdout which deals with this problem by injecting Laplace noise into each interrogation of the holdout set, so that the over-fitting problem is avoided. Fig 1 shows what happens with 10,000-point training and holdout sets and a binary Y value that is in fact completely random. With the normal approach the "verified" accuracy of the classifier on the (repeatedly used) holdout set rises to over 60% (green line), whereas with Thresholdout it remains at around 50% - as it should, since there is nothing real to learn.
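The core idea can be sketched in a few lines. This is a simplified illustration of the mechanism rather than the paper's exact procedure (the threshold and noise-scale values here are illustrative assumptions, not the ones Dwork et al. use): each time the analyst asks the holdout set a question, the answer is only released, with Laplace noise added, if it differs from the training-set answer by more than a noisy threshold; otherwise the training-set answer is simply echoed back, so the holdout leaks almost no information.

```python
import numpy as np

def thresholdout(train_val, holdout_val, threshold=0.04, sigma=0.01, rng=None):
    """One Thresholdout-style query (simplified sketch).

    train_val   : statistic (e.g. accuracy) computed on the training set
    holdout_val : the same statistic computed on the holdout set
    threshold   : how large a train/holdout gap counts as overfitting
                  (illustrative value, not from the paper)
    sigma       : scale of the Laplace noise (illustrative value)

    If the two values agree to within a noisy threshold, the training
    value is returned and the holdout reveals nothing new. Only when
    they disagree - a sign of overfitting - is a noised holdout value
    released.
    """
    rng = rng if rng is not None else np.random.default_rng()
    if abs(train_val - holdout_val) > threshold + rng.laplace(scale=sigma):
        # Overfitting detected: answer from the holdout, plus noise.
        return holdout_val + rng.laplace(scale=sigma)
    # No evidence of overfitting: echo the training-set answer.
    return train_val
```

For example, if a classifier scores 90% on the training data but only 50% on the holdout (as happens when the labels are random), the mechanism returns roughly 50%, so repeated querying cannot nudge the "verified" accuracy above chance the way naive holdout reuse does.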
I realised when writing this that Aaron Roth must be the son of Alvin Roth, whom I know slightly and like a lot. I finished his wonderful book Who Gets What and Why last month. On Amazon.com I said: "This is a simply brilliant book. Not only does it tell some fascinating stories which really make you think, and teach you a lot about the subtleties of how real markets - as opposed to the idealised 'markets' of classical economics - actually work, but it is just beautifully written."