KDD Cup 2010: Student performance evaluation

Task Description

At the start of the competition, we will provide 5 data sets: 3 development data sets and 2 challenge data sets. Each of the data sets will be divided into a training portion and a test portion, as specified on the Data page. Student performance labels will be withheld for the test portion of each data set. The competition task will be to develop a learning model based on the challenge and/or development data sets, use this model to learn from the training portion of the challenge data sets, and then accurately predict student performance in the test portions. At the end of the competition, the winner will be determined based on their model’s performance on an unseen portion of the challenge test sets. We will evaluate only each team’s last submission for the challenge sets.


You will be allowed to train on the training portion of each data set, and will then be evaluated on your performance at providing Correct First Attempt values for the test portion. We will provide feedback for formatting errors in prediction files, but we will not reveal accuracy on test data until the end of the competition. Note that for each test file you submit, an unidentified portion will be used to validate your data and provide scores for the leaderboard, while the remaining portion will be used for determining the winner of the competition. For a valid submission, the evaluation program will compare the predictions you provided against the undisclosed true values and report the difference as Root Mean Squared Error (RMSE). If a data set file is missing from a submission, the evaluation program will report the RMSE as 1 for that file. The total score for a submission will then be the average of the RMSE values. All data sets will receive equal weight in the final average, independent of their size. At the end of the competition, the winner will be the team with the lowest total score.
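The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (RMSE per data set, a score of 1 for a missing file, and an unweighted average across data sets), not the organizers' evaluation program; the function names and the data set keys `set_a` and `set_b` are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predictions and true values must be non-empty and aligned")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(actual))


def total_score(predictions_by_set, truth_by_set):
    """Average the per-data-set RMSE values with equal weight.

    A data set that is missing from the submission is scored as RMSE = 1,
    following the rule stated in the task description.
    """
    scores = []
    for name, actual in truth_by_set.items():
        predicted = predictions_by_set.get(name)
        scores.append(1.0 if predicted is None else rmse(predicted, actual))
    return sum(scores) / len(scores)


# Hypothetical example: predictions for set_a only; set_b is missing.
truth = {"set_a": [1, 0], "set_b": [1, 1, 0]}
submission = {"set_a": [0.5, 0.5]}
score = total_score(submission, truth)  # (0.5 + 1.0) / 2
```

Because every data set receives equal weight, a missing file costs the same regardless of how many rows that data set contains.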


Technical Challenges

In terms of technical challenges, we mention just a few:

  • The data matrix is sparse: not all students are given every problem, and some problems have been completed by only 1 or 2 students. So, the contestants need to exploit relationships among problems to bring enough data to bear to hope to learn.
  • There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.
  • Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias.
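As a rough illustration of the sparsity point in the first bullet, the sketch below counts how many students attempted each problem and what fraction of the student-by-problem matrix is observed. The record layout `(student, problem, correct)` and the toy log are hypothetical and are not the competition's file format.

```python
from collections import defaultdict


def students_per_problem(log):
    """Map each problem to the set of students who attempted it."""
    seen = defaultdict(set)
    for student, problem, _correct in log:
        seen[problem].add(student)
    return seen


def fill_fraction(log):
    """Fraction of (student, problem) cells that have at least one record."""
    if not log:
        return 0.0
    students = {s for s, _, _ in log}
    problems = {p for _, p, _ in log}
    observed = {(s, p) for s, p, _ in log}
    return len(observed) / (len(students) * len(problems))


# Hypothetical toy log: 3 students, 3 problems, 5 observed cells.
toy_log = [
    ("s1", "p1", 1), ("s2", "p1", 0), ("s3", "p1", 1),
    ("s1", "p2", 1),
    ("s2", "p3", 0),
]

counts = students_per_problem(toy_log)
# Problems attempted by at most 2 students carry little direct evidence.
rare_problems = sorted(p for p, who in counts.items() if len(who) <= 2)
```

In this toy log, `p2` and `p3` each have a single student, which is the situation where information has to be borrowed from related problems.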

Scientific and Practical Importance

From a practical perspective, improved models could save millions of hours of students’ time (and effort) in learning algebra. These models should both increase achievement levels and reduce the time needed. Focusing on just the latter, for the 0.5 million students who spend about 50 hours per year with Cognitive Tutors for mathematics, let’s say these optimizations can reduce time to mastery by at least 10%. One experiment showed the time reduction was about 15% (Cen et al., 2007). A 10% reduction is 5 hours per student, or 2.5 million student hours per year saved. And this 0.5 million is less than 5% of all algebra-studying students in the US. If we include all algebra students (20x) and grades 6-11, for which there are Carnegie Learning and Assistment applications (5x), that brings our rough estimate to 250 million student hours per year saved! In that time, students can be moving on in math and science or doing other things they enjoy.
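For readers who want to check the estimate, the arithmetic in the paragraph above can be written out as follows. It uses only the figures given in the text (0.5 million students, 50 hours per year, a 10% reduction, and the 20x and 5x scale-up factors), which are themselves rough assumptions.

```latex
\begin{align*}
\text{hours saved per student} &= 50 \times 0.10 = 5 \\
\text{hours saved, current users} &= 0.5 \times 10^{6} \times 5 = 2.5 \times 10^{6} \\
\text{hours saved, extended estimate} &= 2.5 \times 10^{6} \times 20 \times 5 = 2.5 \times 10^{8}
\end{align*}
```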

From a scientific viewpoint, the ability to achieve low prediction error on unseen data is evidence that the learner has accurately discovered the underlying factors which make items easier or harder for students. Knowing these factors is essential for the design of high-quality curricula and lesson plans (both for human instructors and for automated tutoring software). So you, the contestants, have the potential to influence lesson design, improving retention, increasing student engagement, reducing wasted time, and increasing transfer to future lessons.

Currently K-12 education is extremely focused on assessment. The No Child Left Behind Act has put incredible pressure on schools to “teach to the test”, meaning that a significant amount of time is spent preparing for and taking standardized tests. Much of the time spent drilling for and taking these tests is wasted from the point of view of deep learning (long-term retention, transfer, and desire for future learning); so any advances which allow us to reduce the role of standardized tests hold the promise of increasing deep learning.

To this end, a model which accurately predicts long-term future performance as a byproduct of day-to-day tutoring could augment or replace some of the current standardized tests: this idea is called “assistment”, from the goal of assessing performance while simultaneously assisting learning. Previous work has suggested that assistment is indeed possible: e.g., an appropriate analysis of 8th-grade tutoring logs can predict 10th-grade standardized test performance as well as 8th-grade standardized test results can predict 10th-grade standardized test performance (Feng, Heffernan, & Koedinger, 2009). But it is far from clear what the best prediction methods are; so, the contestants’ algorithms may provide insights that allow important improvements in assistment.

Fundamental Questions

If a student is correct on one problem (e.g., “Starting with a number, if I multiply it by 6 and then add 66, I get 81.90. What’s the number?”) at one time, how likely are they to be correct on another problem (e.g., “Solve for x: 6x+66=81.90”) at a later time? Questions like this are of both scientific interest and practical importance. Scientifically, the relevant deep questions include what the nature of human knowledge representations is and how generally humans transfer their learning from one situation to another. Human learners do not always represent and solve mathematical tasks as we might expect. You may be surprised if you assumed that a student working on the second problem above, the equation 6x+66=81.90, is likely to be correct given that they were correct on the first problem, the story problem. It turns out that most students are able to solve simple story problems like this one more successfully than the matched equation (Koedinger & Nathan, 2004; Koedinger, Alibali, & Nathan, 2008). In other words, there are interesting surprises to be found in student performance data.

Cognitive Tutors for mathematics are now in use in more than 2,500 schools across the US for some 500,000 students per year. While these systems have been quite successful, surprises like the one above suggest that the models behind these systems can be much improved. More generally, a number of studies have demonstrated how detailed cognitive task analysis can result in dramatically better instruction (Clark, Feldon, van Merrienboer, Yates, & Early, 2007; Lee, 2003). However, such analysis is painstaking and requires a high level of psychological expertise. We believe it possible that machine learning on large data sets can reap many of the benefits of cognitive task analysis, but without the great effort and expertise currently required.


References

  • Cen, H., Koedinger, K. R., & Junker, B. (2006). Learning Factors Analysis: A general method for cognitive model evaluation and improvement. In M. Ikeda, K. D. Ashley, & T.-W. Chan (Eds.), Proceedings of the 8th International Conference on Intelligent Tutoring Systems, 164-175. Berlin: Springer-Verlag.
  • Clark, R. E., Feldon, D., van Merrienboer, J., Yates, K., & Early, S. (2007). Cognitive task analysis. In J. M. Spector, M. D. Merrill, J. J. G. van Merrienboer, & M. P. Driscoll (Eds.), Handbook of research on educational communications and technology (3rd ed., pp. 577-593). Mahwah, NJ: Lawrence Erlbaum Associates.
  • Feng, M., Heffernan, N.T., & Koedinger, K.R. (2009). Addressing the assessment challenge in an online system that tutors as it assesses. User Modeling and User-Adapted Interaction: The Journal of Personalization Research (UMUAI). 19(3), pp. 243-266.
  • Koedinger, K. R., & Aleven, V. (2007). Exploring the assistance dilemma in experiments with Cognitive Tutors. Educational Psychology Review, 19(3), 239-264.
  • Lee, R. L. (2003). Cognitive task analysis: A meta-analysis of comparative studies. Unpublished doctoral dissertation, University of Southern California, Los Angeles, California.
  • Pavlik, P. I., Cen, H., Wu, L., & Koedinger, K. R. (2008). Using item-type performance covariance to improve the skill model of an existing tutor. In Proceedings of the First International Conference on Educational Data Mining, 77-86.