Sunday, January 23, 2011

Some problems with the use of data mining in social sciences

Old science?

Anderson (2008) argues that huge databases have changed the methodology of science:


"The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. ....
There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.
If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.
This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation."
I have issues with this approach. The first is that it is naive to believe that pure induction is possible. An old, probably apocryphal story is told of an amateur scientist in the early days of the Royal Society who became enamoured of the Baconian inductive method of science. For thirty years he observed everything, wrote up his diaries and presented them, a mass of unorganised data, to the Royal Society, where they still lie in the archives, unanalysed. But the 'scientist' was fooling himself. Before you collect data you must decide what to measure; before you observe you must select what to observe. Thus your beliefs (which are essentially unformed, unacknowledged theories) influence and pre-judge the facts from which you then construct your theories. There is likely to be some confirmation bias in this loop, particularly when the data concern people, as in the social sciences.

My second concern is the danger of false positives. Rajaraman (2008) suggests that "adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set". Perhaps the key word here is 'independent', because if you look for patterns you are going to find them. Statisticians use significance limits, often 1%, to decide whether a pattern could have arisen by pure chance. But if you have a large data set of, say, 100 variables and you correlate each one with every other, you will trawl through 100 × 99 / 2 = 4,950 pairs. At a 1% level you would expect around 50 of those correlations to pass the test by chance alone, as the simulation sketched below illustrates. How will you tell which are true correlations and which are false positives?
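To make the multiple-comparisons effect concrete, here is a rough Python sketch. It is an illustration only: the 200-observation sample size, the variable count and the use of numpy and scipy are my own assumptions for the example, not anything taken from the sources cited above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_vars = 200, 100                  # 200 observations of 100 unrelated variables
data = rng.normal(size=(n_obs, n_vars))   # pure noise: no real relationships exist

n_pairs = 0
false_positives = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        n_pairs += 1
        if p < 0.01:                      # the 1% significance level discussed above
            false_positives += 1

print(f"{n_pairs} pairs tested, {false_positives} 'significant' at the 1% level by chance")

On a typical run roughly 50 of the 4,950 pairs come out 'significant' even though, by construction, none of the relationships are real.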

Thirdly, there is a concern about the objectivity of the data. Genome sequences are relatively easy to observe, although there is always the possibility of contamination. But in the social sciences it is far more difficult for the observer NOT to 'contaminate' the observation. For example, if subjects are aware that they are being observed they may behave differently, often to conform to what they believe the observer expects of them; social desirability bias is one example. This can be exacerbated in action research in education, where the observing experimenter may also be the teacher seeking to achieve better grades for the pupils who are at the same time the subjects of the experiment. Ethical considerations mean that you cannot simply stand back and not get involved: "On one hand, institutions might be vulnerable to charges of “profiling” students when they draw conclusions from student data; on the other, they could be seen as irresponsible if they don’t take action when data suggest a student is having difficulty" (Educause 2010). But Goodhart's Law, quoted by Snowdon (2011) as "the minute a measure becomes a target it ceases to be a measure" (more accurately, 'once a measure becomes a target it loses its value as a measure'), suggests that taking such action significantly undermines the research.

This feedback effect has huge implications. Traditionally, the social sciences have used statistics based on the Gaussian bell curve. There is significant research (Taleb 2007; Ball 2004; Buchanan 2000) suggesting that Mandelbrotian power-law statistics may often be more appropriate, precisely because there is feedback between observer and observed. Feedback, which is also found in earthquake modelling and in avalanches of grains in sand piles, changes the maths: it produces heavy-tailed distributions in which extreme events are far more common than a bell curve would predict.
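As a toy illustration of how much the choice of distribution matters, the following Python sketch compares the probability of an extreme observation under a standard Gaussian and under a Pareto (power-law) distribution. The tail exponent of 2 is an arbitrary assumption made for the example, not an empirical estimate of any social process.

from scipy import stats

gaussian = stats.norm(loc=0, scale=1)   # the bell curve
power_law = stats.pareto(b=2)           # Pareto tail with exponent 2 (assumed for illustration)

for threshold in (3, 5, 10):
    p_gauss = gaussian.sf(threshold)    # survival function: P(X > threshold)
    p_power = power_law.sf(threshold)
    print(f"P(X > {threshold}): Gaussian {p_gauss:.1e} vs power law {p_power:.1e}")

The Gaussian tail collapses towards zero almost immediately, while the power-law tail shrinks only polynomially, so events the bell curve treats as effectively impossible remain quite plausible under the power law.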

I still agree that data mining has massive potential, but its application to the social sciences raises significant concerns which need to be addressed.

References


Anderson C 2008 The end of theory: the data deluge makes the scientific method obsolete Wired Magazine 23rd June 2008 available at http://www.wired.com/science/discoveries/magazine/16-07/pb_theory accessed 17th January 2011


Ball P 2004 Critical Mass: how one thing leads to another Heinemann, London


Buchanan M 2000 Ubiquity Weidenfeld & Nicolson, London


Educause 2010 7 things you should know about analytics available at http://net.educause.edu/ir/library/pdf/ELI7059.pdf accessed 17th January 2011


Rajaraman A 2008 More data usually beats better algorithms Blog post 24th March 2008 in Datawocky available at http://anand.typepad.com/datawocky/2008/03/more-data-usual.html accessed 17th January 2011


Snowdon D 2011 A grain of sand: innovation diffusion blog posted on 11th January 2011 in Cognitive Edge available at http://www.cognitive-edge.com/blogs/dave/2011/01/a_grain_of_sand_innovation_dif.php accessed 17th January 2011


Taleb NN 2007 The Black Swan: the impact of the highly improbable Random House, New York
