Saturday, December 6, 2014

Blind Analysis of a Data Set

The article at http://www.slac.stanford.edu/econf/C030908/papers/TUIT001.pdf discusses this in the context of particle physics.

Blind analysis is a method used by particle physicists, who nowadays usually have very large data sets to analyze (typically terabytes and then some). They take a small fraction of the data set (say 1-3%) and develop their analysis methods (cuts in the data, ways of testing whether there might be a signal, etc.) using that small set to work it all out. What they are trying to do is figure out how to see a small effect in a sea of noise and irrelevance (interactions other than the ones they are concerned with, the interesting ones being very rare). When they are satisfied that they have done the best they can, then and only then do they run all the data through that "best" analysis method. With such a small test data set, they are unlikely to see anything very interesting during development; only after they put all the data through the frozen method might there be any significant effects.
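To make the workflow concrete, here is a minimal Python sketch on a toy data set. The signal location, window widths, and significance estimate are all assumptions made up for illustration, not any experiment's actual analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: an exponential "background" plus a tiny Gaussian "signal"
# peak near mass = 3.1; all numbers here are made up for illustration.
events = np.concatenate([
    rng.exponential(scale=2.0, size=1_000_000),
    rng.normal(loc=3.1, scale=0.05, size=800),
])
rng.shuffle(events)

def in_window(data, lo, hi):
    """Count events falling in [lo, hi]."""
    return np.count_nonzero((data >= lo) & (data <= hi))

def significance(data, center, half_width):
    """Crude excess / sqrt(background), with the background estimated
    from sidebands adjacent to the window."""
    lo, hi = center - half_width, center + half_width
    n_window = in_window(data, lo, hi)
    # Sidebands of the same total width as the window approximate
    # the background expected inside it.
    n_side = in_window(data, lo - half_width, lo) + in_window(data, hi, hi + half_width)
    return (n_window - n_side) / max(np.sqrt(n_side), 1.0)

# Step 1: set aside a small development sample (2% here); the rest
# stays blind until the analysis is frozen.
n_dev = len(events) // 50
dev = events[:n_dev]

# Step 2: tune the cut on the development sample only, e.g. choose
# the mass-window width that looks best on the small sample.
best_hw = max((0.05, 0.10, 0.20), key=lambda hw: significance(dev, 3.1, hw))

# Step 3: freeze that cut, then unblind -- run the full data set
# through the chosen window exactly once.
print(f"frozen window: [{3.1 - best_hw:.2f}, {3.1 + best_hw:.2f}]")
print(f"significance on full data: {significance(events, 3.1, best_hw):.1f}")
```

With these toy numbers, the excess is barely visible on the 2% development sample, just as described above; it only becomes significant once the frozen cut is applied to the full data set.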

The basic idea is to keep you from reworking your analysis method until it produces the result you expect. I guess this would be called data dredging (loosely, "data mining," though that is not the right term) or fudging.

I don't know whether people then do further analysis, revising their method, once the blindfold has been removed.

Double-blind trials are the gold standard in medicine and biostatistics. I do not know whether blind analysis is standard in social-science statistical work. My impression is that people still try various regressors and various schemes to see if they can get good results. As I just said, I may well be wrong.

MK
