Saturday, March 29, 2014

What I have learned about modeling, statistics, and quantitative methods

1. Have a theory, a mechanism that connects the various aspects of what you are studying.
2. From fieldwork, other studies, whatever, be sure the right aspects are included. Thinking won't tell you much, compared to going out and talking to people, getting a sense of what might matter. Your intuitions are likely to reflect your own experience, not the experience of those you are studying.
3. Sensitivity analyses are crucial--what if a parameter were somewhat different, what if you have a polluted data set, what if your a priori assumptions or estimates of certain factors were off by 50% or maybe a factor of ten?
4. If you can understand your model without invoking fancy physics theory or whatever, and if what you say is in the end independent of that theory, don't bring it up. Hence if you talk about complex adaptive systems, but never actually model one using real data, perhaps you are just using some useful metaphors.
5. On the other hand, if fancy theory really helps, go for it. For example, Cuernavaca, Mexico bus drivers' schedules are in fact understandable using what is called random matrix theory. If you not only find the distributions that such theory predicts--the Tracy-Widom distribution and the Wigner semicircle law--but can actually model the process, you are really in great shape. [See: The statistical properties of the city transport in Cuernavaca (Mexico) and random matrix ensembles, J. Phys. A: Math. Gen. 33 (2000): L229.]
6. If you do statistical analysis, it would really help to have some mechanism behind your regression or whatever statistical reduction process you employ. You might well do some such analysis to get an idea of what matters, but having something like a theory (#1 above) will serve you well.
7. Keep in mind that numbers that come out of your programs and packages are unlikely to be reliable to more than two significant figures, most likely one. Of course, you can decide if some factor or coefficient is close to zero or to one, or is important or not very important. But the actual number is a 1.5 significant figure number. You can locate a mean to high precision with lots of data points (since the standard error of the mean is SD/SQRT(N)), but it's hard to believe that you have a conventional gaussian, or whatever distribution, to that precision. You need robust and resilient methods, and you also need a sense of how polluted your data set is. High N does not save you from pollution. You might do lots of tests about the quality of the discovered distribution, at least.
8. How much would you bet on the quality of your reported figures? For example, in some of the natural sciences, you can measure something to anywhere from 3-12 significant figures. And you have theoretical connections of what you measure to other measurements. No economic or social science studies have this level of quality or connection--although for the most part it does not matter. When you make a claim about economic growth or unemployment rates, what matters is to get it roughly right and in the correct direction. In general you can measure such things to 1% or so. Some numbers can be known quite well, and you might say things changed 0.1%, since the denominator is very large. But the uncertainty in that change is still likely to reflect the uncertainty in the numerator.
9. Recurrently, there is the call for measurable consequences of policies, and the claim that if you can't measure it, it does not exist. What is needed is a much better sense of what you should be measuring, and that means you have to go out into the field and find out more about how your policy is working. Ahead of time, you want to lower unemployment, but you had better go out and find out just what you are measuring when you do such surveys. Of course, this is well known.
10. But in less well trodden circumstances than national statistics, you have to find out what is going on, the ground truth that corresponds to what you measure, or perhaps what to measure that corresponds to the ground truth.
11. If there are very substantial consequences to your claims, you have to be sure that your modeling, statistics, and quantitative methods are good enough. You have to reveal their uncertainties when you make your claims. Of course, there are interested actors who will take what you say and selectively present it. But when the consequences are serious, you owe it to society to allow others to have a sense of how reliable your claims are.
12. Some of the time, there are policies or programs that are justified by what they do on the input side. Perhaps you cannot show any connection between income inequality, however understood, and economic growth, or whatever. But it might make sense to lower inequality for the purposes of social integrity and political legitimacy. Maybe it won't change much, but at least you don't look so unequal compared to other places. Of course, others will argue that income is a matter of desert. But one of the problems in saying this is that we have a system of taxes and incentives--tax expenditures--that allow some people to have much larger incomes than they might otherwise have. We might decide those are OK, but they don't look so good when you compare them to food stamps.
13. Keep in mind that physics envy is likely to shift to biology envy. That is, biology, genetics, computational biology, etc. will be providing some of the models. More to the point, physics, biology, chemistry,... all of them are crutches to allow you to think about what is going on. It's very hard to identify actual processes and measurable parameters that correspond to these models. Econometrics may well allow you to take a messy data set and make it speak to you. But be sure that it is speaking in some understandable language about how the world works.
14. Simple models are often very illuminating, suggesting mechanisms that might be behind what you see. If the models get very complicated, it's hard to discern the mechanism in this black box.
15. Before you do your analysis, write down what you think the numbers will be. You are allowed to analyze 1/10 of the data, and then to write down your expectations again, having seen the results from that 1/10. And you may well need to fix things at this point, since the results are surprising. But then what you need to do is to open up the black box of your total data analysis and accept its outcome, or at least be quite reluctant to fiddle with things.
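
A small sketch of the point in items 3 and 7, that a high N shrinks the standard error but does not protect you from a polluted data set. This is an illustration, not something from any study cited above: the data are simulated, and the 2% pollution rate, the bad value of 100, and the helper names `standard_error` and `contaminate` are all assumptions chosen for the example.

```python
import math
import random
import statistics


def standard_error(sample):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def contaminate(sample, fraction, bad_value):
    """Return a copy in which the first `fraction` of points is replaced by `bad_value`.

    This is a deliberately crude, hypothetical model of a polluted data set.
    """
    n_bad = int(len(sample) * fraction)
    return [bad_value] * n_bad + list(sample[n_bad:])


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated "clean" data: a gaussian with mean 10 and SD 2 (assumed values).
clean = [rng.gauss(10.0, 2.0) for _ in range(10_000)]

# Sensitivity check: what if 2% of the records are junk?
polluted = contaminate(clean, fraction=0.02, bad_value=100.0)

for label, data in (("clean", clean), ("polluted", polluted)):
    print(
        label,
        "mean=%.2f" % statistics.mean(data),
        "median=%.2f" % statistics.median(data),
        "SE=%.3f" % standard_error(data),
    )
```

With these assumed numbers, the mean of the polluted set moves by many times its own reported standard error, while the median barely moves. The standard error only tells you how well you have located the mean of whatever distribution you actually have, pollution included. Rerunning with a different `fraction` or `bad_value` is the kind of sensitivity analysis item 3 asks for.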
