Thursday, January 13, 2011

Data Mining: a very simple start for a beginner (me)

Data mining is "the process of extracting patterns from data" (Wikipedia). A data miner turns data into information. Educational data mining is the process of extracting data about learners and using that data to teach better.

What sort of things can you do?

You can use data to develop categories, clusters and classifications. 

In k-Nearest Neighbour (k-NN) classification a data point is classified by a majority vote of its k nearest neighbours. Picture a green circle that needs classifying, surrounded by red triangles and blue squares. If k=1 the green circle will be classed with the red triangles, because its nearest neighbour is a red triangle. If k=3 it will again be classed as a red triangle, because the majority of the 3 nearest neighbours are triangles. If k=5 it will be classified as a blue square, because 3 of the 5 nearest neighbours are blue squares. Clearly the choice of k is critical. An alternative method is to weight each neighbour's vote by its distance, so that closer neighbours count for more.
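Here is a rough Python sketch of that vote. The coordinates are made up by me (they are not taken from any real data set); I chose them only so that the k=1, k=3 and k=5 results match the description above. The function name knn_classify and the weighted flag are likewise just my own illustration.

```python
import math
from collections import Counter

# Hypothetical points: the green circle sits at the origin.
QUERY = (0.0, 0.0)
POINTS = [
    ((0.5, 0.0), "red triangle"),   # distance 0.5
    ((0.0, 0.8), "blue square"),    # distance 0.8
    ((-1.0, 0.0), "red triangle"),  # distance 1.0
    ((0.0, -1.5), "blue square"),   # distance 1.5
    ((1.6, 0.0), "blue square"),    # distance 1.6
    ((3.0, 0.0), "red triangle"),   # distance 3.0
    ((0.0, 3.0), "red triangle"),   # distance 3.0
]

def knn_classify(query, labelled_points, k, weighted=False):
    """Classify query by a vote among its k nearest labelled points."""
    by_distance = sorted(
        (math.dist(query, point), label) for point, label in labelled_points
    )
    votes = Counter()
    for distance, label in by_distance[:k]:
        # Plain k-NN: one vote each. Weighted: closer neighbours count more.
        # The tiny constant avoids dividing by zero for an exact match.
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1
    return votes.most_common(1)[0][0]

for k in (1, 3, 5):
    print(k, knn_classify(QUERY, POINTS, k))
```

With these invented points, k=1 and k=3 give "red triangle" and k=5 gives "blue square". Calling knn_classify(QUERY, POINTS, 5, weighted=True) goes back to "red triangle", because the two triangles are closer than the three squares.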


You can try to discover behaviours which occur together. For example, in the sentence: "This is the life!", ignoring case, there are 2xe, 1xf, 2xh etc. If we only keep letters with 2 or more occurrences, the "frequent 1 sequences" are: 2xe, 2xh, 3xi, 2xs, 2xt. If we seek the 2-sequences that begin with one of these letters (writing _ for a space) we have: e_, e!, he, hi, if, is, is, s_, s_, th, th. Using a frequency threshold of 2 again, we are left with is, s_, and th as our frequent 2 sequences. Moving to 3 sequences (again only extending those we have identified as frequent 2 sequences) and another threshold of 2 we have only 1 frequent 3 sequence: is_. Moving to 4 sequences we find is_i and is_t. Neither of these passes the threshold so the algorithm stops.

What have we learnt? The 1 sequences could tell us something about the commonest letters in English, and the 2 sequences tell us that "is" and "th" are frequent combinations and that "s" often occurs at the end of words. The 3 sequences tell us that "is" often occurs at the end of words.
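The counting above can be sketched in a few lines of Python. This is only my reading of the procedure, under some assumptions: the text is lower-cased, spaces are written as _, only letters are counted at the first level, and a longer sequence is counted only if it starts with a sequence that was frequent at the previous level. The name frequent_sequences is my own.

```python
from collections import Counter

def frequent_sequences(text, threshold=2):
    """Return one dict per level: {sequence: count} for frequent sequences."""
    s = text.lower().replace(" ", "_")
    # Level 1: count single letters only (no spaces or punctuation).
    counts = Counter(ch for ch in s if ch.isalpha())
    frequent = {seq: n for seq, n in counts.items() if n >= threshold}
    levels = []
    length = 1
    while frequent:
        levels.append(frequent)
        length += 1
        # Count substrings of the new length, but only those that
        # extend a sequence that was frequent at the previous level.
        counts = Counter(
            s[i:i + length]
            for i in range(len(s) - length + 1)
            if s[i:i + length - 1] in frequent
        )
        frequent = {seq: n for seq, n in counts.items() if n >= threshold}
    return levels

for level in frequent_sequences("This is the life!"):
    print(level)
```

For "This is the life!" this gives three levels: the letters t, h, i, s, e; then th, is, s_; then is_ on its own. The loop stops at length 4 because is_i and is_t each occur only once.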

So what?

We now have a predictive framework: if you get an i expect an s (and then a space), if you get an s expect a space, if you get a t expect an h.
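As a toy illustration of that predictive idea, here is a small Python sketch that records which character follows which, and then guesses the most common follower. It assumes the same conventions as before (lower case, _ for a space), and the function names build_followers and predict_next are hypothetical names of my own. One sentence is far too little data for real predictions, of course.

```python
from collections import Counter, defaultdict

def build_followers(text):
    """Map each character to a Counter of the characters that follow it."""
    s = text.lower().replace(" ", "_")
    followers = defaultdict(Counter)
    for current, following in zip(s, s[1:]):
        followers[current][following] += 1
    return followers

def predict_next(followers, ch):
    """Return the most common follower of ch, or None if ch was never followed."""
    if ch not in followers:
        return None
    return followers[ch].most_common(1)[0][0]

followers = build_followers("This is the life!")
for ch in "ist":
    print(ch, "->", predict_next(followers, ch))
```

On this one sentence, i predicts s, s predicts _ (a space), and t predicts h, which matches the rules above.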

We could use this process to create 'recommendations' à la Amazon: if you enjoyed doing those sums you might like to try these. Or we could create diagnoses, enabling us to identify the appropriate intervention for the measured behaviour.



References

Baker, R.S.J.d. & Yacef, K. (2009) The State of Educational Data Mining in 2009: A Review and Future Visions available at http://www.educationaldatamining.org/JEDM/images/articles/vol1/issue1/JEDMVol1Issue1_BakerYacef.pdf accessed 10th January 2011

International Working Group on Educational Data Mining available at http://educationaldatamining.org/ accessed 10th January 2011

Wikipedia Data Mining available at http://en.wikipedia.org/wiki/Data_mining accessed 10th January 2011
