Frequently Asked Questions
KDD Cup 2009: Customer relationship prediction
Participation and Registration
What is the goal of the challenge?
The challenge consists of several classification problems. The goal is to make the best possible predictions of a binary target variable from a number of predictive variables.
Can I enter under multiple names?
No, we limit each participant to one final entry, which may contain results on the large dataset only in the fast track and on either or both the small and the large dataset in the slow track. Registering under multiple names would be considered cheating and would disqualify you. Your real identity must be known to the organizers. You may hide your identity from the public only, by checking the "Make my profile anonymous" box in the registration form.
Can I participate in multiple teams?
No. Each individual is allowed to make only a single final entry into the challenge to compete towards the prizes. During the development period, each team must have a different registered team leader. To be ranked in the challenge and qualify for prizes, each registered participant (individual or team leader) will have to disclose the names of any team members before the final results of the challenge are released. Hence, at the end of the challenge, you will have to choose which team you want to belong to (only one!) before the results are publicly released. After the results are released, no change in team composition will be allowed.
I understand that one person can join only one team. However, is it OK to have many teams in the same organization?
Yes, it is OK. Each team leader must be a different person and must register, and the teams cannot intersect. Before the end of the challenge, the team leaders will have to declare the composition of their team. This will have to correspond to the list of co-authors in the proceedings, if they decide to publish their results. Hence a professor cannot have his/her name on all his/her students' papers (but can be thanked in the acknowledgements).
How do I register a team?
Only register the team leader and choose a nickname for your team. We'll let you know later how to disclose the members of your team.
Can the organizers enter the challenge?
No. The organizers may make entries under the common name "Reference" to stimulate the competition, but they do not compete towards the prizes.
Data: Download, format, etc.
I have problems with the ZIP files which appear to be corrupted. Can I get a DVD?
Try to download one archive at a time. If the problem persists, contact the organizers so they can send you a DVD.
Are the data available in other formats: Matlab, SAS, etc.?
There are several Matlab versions posted on the Forum. There is also a numerical version of the categorical variables in text format for the large dataset. Please post your own version of the data to share it with others.
Is there sample code available?
Yes. We have made sample Matlab code available to help you format your results. There are also examples showing how to call CLOP models from that code. Note that, at this stage, there is not yet Matlab support for handling the large dataset.
Are the true targets distributed similarly to the toy target?
No. The toy target is generated by an artificial stochastic process. The proportion of examples in either class is different in the real targets. The real targets have less than 10% of examples in the positive class.
I have observed that the last columns (after variable 14740) are not numerical, are the data corrupted?
The last variables are categorical variables. The strings correspond to category codes. This could be for instance a city name. But for reasons of privacy, the real names were replaced by strings that are meaningless.
I have observed that some columns are empty or constant, are the data corrupted?
No. This is correct and is part of the challenge, which deals with automatic data preparation and modeling on real industrial data. Filtering constant data is the easy part of the challenge.
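As an illustration of that "easy part", here is a minimal sketch of how empty or constant columns could be detected. It assumes the rows have already been split into string fields and that a missing value is stored as an empty string; the function name constant_columns and the sample rows are made up for this example and are not part of the challenge material.

```python
def constant_columns(rows):
    """Return the indices of columns that hold a single distinct raw value.

    A column that is entirely empty counts as constant, because the empty
    string is treated here as just another value (an assumption of this sketch).
    """
    if not rows:
        return []
    distinct = [set() for _ in rows[0]]
    for row in rows:
        for j, value in enumerate(row):
            distinct[j].add(value)
    return [j for j, seen in enumerate(distinct) if len(seen) <= 1]


# Made-up sample: column 1 is empty everywhere, column 3 is constant.
rows = [
    ["1", "", "a", "7"],
    ["2", "", "a", "7"],
    ["3", "", "b", "7"],
]
print(constant_columns(rows))  # [1, 3]
```

Treating "missing" as a value of its own keeps a column that is constant except for some missing entries, since the missingness pattern itself may be informative.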
I have observed that the first chunk of the large dataset contains only 9999 lines, is this correct?
Yes. Chunk 1 contains 9999 data lines plus the header. All other chunks have no header. The last chunk has 10001 lines. So the total is 50000 lines of data.
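The layout described above (a header in the first chunk only) can be handled with a short loader. This is a sketch under stated assumptions: the function name load_chunks is made up, the chunk file names are not assumed (the demo writes tiny stand-in files), and the files are assumed to be UTF-8 text. Note that the stated totals are consistent with five chunks whose three middle chunks hold 10000 lines each (9999 + 3 x 10000 + 10001 = 50000); that count is an inference from the numbers, not something stated here.

```python
import os
import tempfile


def load_chunks(chunk_paths):
    """Read chunks in order; only the first one starts with a header line."""
    header = None
    data_lines = []
    for index, path in enumerate(chunk_paths):
        with open(path, encoding="utf-8") as handle:
            # Strip only the line terminator: trailing tabs are meaningful
            # because they mark missing values.
            lines = [line.rstrip("\r\n") for line in handle]
        if index == 0:
            header, lines = lines[0], lines[1:]
        data_lines.extend(lines)
    return header, data_lines


# Demo with two tiny stand-in chunks (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    contents = ["A\tB\n1\t\n2\tx\n", "3\ty\n\tz\n"]
    paths = []
    for k, text in enumerate(contents, start=1):
        path = os.path.join(tmp, f"chunk{k}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    header, data_lines = load_chunks(paths)

print(len(data_lines))  # 4
```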
In the categorical variables, do the values need to be handled as meaningful sequences or are they just codes?
The original categorical values were symbols that did not indicate any category ordering. The category symbols have been replaced by random anonymized values (strings) with no semantic meaning, in one-to-one correspondence with the original values so as to keep the structure of the data.
Do the targets correspond to single or multiple products?
The targets correspond to single products (but not necessarily the same one). For instance, churn concerns mobile phone customers switching providers, and up-selling concerns upgrading a plan to include television.
Is there a meaning in the variable ordering?
No. The variables are randomly ordered.
Are the variables in the small dataset a subset of those in the large dataset?
Yes. However, they are disguised to make them non-trivial to identify and to discourage people from doing so. The examples are also ordered differently to make such a mapping even harder. We wish participants to work on each dataset separately, although they may work on both.
Are the training and test data drawn from the same distribution?
Yes.
Is the set of categorical variable values the same in the training and test data?
Not necessarily. Some values might show up only in training data or only in test data.
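One practical consequence is that an encoding learned on the training data must cope with values it has never seen. The sketch below shows one possible way to do this, by reserving an integer code for unseen values; the function names and the sample strings are made up for this example, and this is not a method prescribed by the organizers.

```python
def fit_category_codes(train_values):
    """Map each distinct training value to an integer code starting at 1."""
    codes = {}
    for value in train_values:
        if value not in codes:
            codes[value] = len(codes) + 1  # code 0 is reserved for unseen values
    return codes


def encode(values, codes, unseen_code=0):
    """Encode values, sending anything absent from the training set to unseen_code."""
    return [codes.get(value, unseen_code) for value in values]


# Made-up anonymized strings; "" stands for a missing value.
train = ["aX3", "q9", "aX3", ""]
test = ["q9", "zz7", ""]

codes = fit_category_codes(train)
print(encode(test, codes))  # [2, 0, 3]
```

Values that occur only in the training data simply receive a code that never appears at test time, so no special handling is needed for that direction.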
Are there the same number of values in each line?
Every line has the same number of fields, but some fields may be empty because there can be missing values. The values are separated by tabs. Two consecutive tabs indicate a missing value.
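A minimal sketch of parsing one such line follows. The function name parse_line and the optional n_columns check are made up for this example. It also assumes that a tab at the very start or end of a line marks a missing first or last value, which follows from the tab-separated layout but is not spelled out above.

```python
def parse_line(line, n_columns=None):
    """Split a tab-separated line; empty fields become None (missing)."""
    # Remove only the line terminator so that trailing tabs are preserved.
    fields = line.rstrip("\r\n").split("\t")
    values = [field if field != "" else None for field in fields]
    if n_columns is not None and len(values) != n_columns:
        raise ValueError(f"expected {n_columns} fields, got {len(values)}")
    return values


# Two consecutive tabs give a missing second value; the trailing tab gives
# a missing fourth value.
print(parse_line("1.5\t\tabc\t\n"))  # ['1.5', None, 'abc', None]
```

Using str.split("\t") rather than a whitespace split matters here: splitting on arbitrary whitespace would merge consecutive tabs and silently shift the columns.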
Is it allowed to unscramble the small dataset?
Scrambling was done to encourage the participants to work separately on the small dataset and the big dataset. If we wanted the participants to be able to use the features of the small dataset in addition to those they might select from the big one, we would not have scrambled the data. We realize however that, if we forbid the participants from unscrambling and consider it cheating, we would have difficulties enforcing that rule. Hence, participants who unscramble the small dataset will not be disqualified from the competition. All participants will be requested to report at the end of the challenge whether they made use of unscrambling and whether they derived some advantage from it.
Evaluation: Tracks, submission format, etc.
Why do we need to submit results on training data?
In this way we can assess the robustness of the models. If you make great predictions on training data and perform poorly on test data, your method likely is overfitting.
What is the purpose of giving performances on 10% of the test data?
We want to give feedback to the participants to motivate them. In this way, they can see roughly how their performance compares to that of others. But, by giving feedback on only 10% of the data, we prevent them from fine-tuning their system using the test data (i.e. de facto "learning" from the test data). There will be a slight bias in performance because of the 10% on which feedback is provided, but it is the same bias for all contestants.
Is it correct that even if I submit the result on the large dataset in the fast track, I can submit the result on the large dataset in the slow track together with that on the small dataset?
Yes. In fact, you may submit as many times as you want. However, only the last complete entry (churn, appetency, and up-selling results on both training and test data, i.e. 6 files) will count in each track, depending on the submission date. In the fast track, you may enter only large dataset results, so you get 1 chance. In the slow track, you may enter on both the small and large datasets, so you get 2 chances (the best of your 2 results will be taken into account). In total, you get 3 chances of winning.
If I submit results on both the small and large datasets in the slow track, how will results be evaluated?
The best of your 2 results will be taken into account.
Both small and large entries compete for the slow prize, but they seem to correspond to two distinct problems? Shouldn't there be two slow track prizes?
The small dataset is a downsized version of the large one: same examples, a subset of the features. To distinguish the two, the examples were ordered differently and the features were coded differently, in a way that should not affect performance but makes it non-obvious how to descramble them. Because of the (unlikely) possibility that someone would spend time descrambling, we decided to give a single prize in the slow challenge, so as not to encourage people to cheat.
If I submit results before the fast track deadline, will those results also enter the slow track if I submit nothing afterwards?
Yes. For each deadline, your last valid complete entry will be entered in the ranking. So if you submit only to the fast track, your results will automatically be entered in the slow track.
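The "last complete entry per deadline" rule can be sketched as follows. Everything concrete here is an assumption made for illustration: the deadline dates, the way an entry is represented (a submission time plus a set of (task, split) pairs), and the function names. The organizers' actual submission system is not described in this FAQ.

```python
from datetime import datetime

TASKS = ("churn", "appetency", "upselling")
SPLITS = ("train", "test")
REQUIRED = {(task, split) for task in TASKS for split in SPLITS}  # 6 files


def is_complete(files):
    """An entry is complete when all 6 (task, split) results are present."""
    return REQUIRED <= set(files)


def last_complete_entry(entries, deadline):
    """Return the latest complete entry submitted by the deadline, or None."""
    eligible = [e for e in entries if e[0] <= deadline and is_complete(e[1])]
    return max(eligible, key=lambda e: e[0]) if eligible else None


# Hypothetical deadlines and submissions.
fast_deadline = datetime(2000, 1, 10)
slow_deadline = datetime(2000, 2, 10)
entries = [
    (datetime(2000, 1, 5), REQUIRED),                # complete
    (datetime(2000, 1, 8), {("churn", "test")}),     # incomplete, ignored
]

fast = last_complete_entry(entries, fast_deadline)
slow = last_complete_entry(entries, slow_deadline)
print(fast is slow)  # True
```

With nothing submitted after the fast deadline, the same entry is selected for both deadlines, which matches the answer above.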
If I win in both tracks, will I receive both prizes?
No. You will get the largest of the two prizes. The remaining money will be used to give travel grants to other deserving participants.
On the result page, there is a "Score" column in the table, what does it mean?
As explained on the Tasks page, the score is the arithmetic mean of the AUC for the three tasks (churn, appetency, and up-selling).
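To make the score concrete, here is a small sketch that computes an AUC per task and averages the three. It uses the pairwise definition of the AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as one half). The label coding (1 for the positive class, anything else negative), the tie handling, and the sample numbers are assumptions of this sketch, not a description of the organizers' scoring code.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a correct pair. This pairwise loop is quadratic and is
    meant for illustration, not for scoring 50000 examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y != 1]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def challenge_score(auc_churn, auc_appetency, auc_upselling):
    """Arithmetic mean of the three task AUCs."""
    return (auc_churn + auc_appetency + auc_upselling) / 3.0


# Made-up labels and predictions for four examples.
labels = [1, -1, 1, -1]
a_churn = auc(labels, [0.9, 0.1, 0.8, 0.3])      # 1.0
a_appet = auc(labels, [0.2, 0.6, 0.7, 0.1])      # 0.75
a_upsel = auc(labels, [0.5, 0.5, 0.5, 0.5])      # 0.5
print(challenge_score(a_churn, a_appet, a_upsel))  # 0.75
```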
I see a bunch of xxxx instead of my score, is there a problem?
No. Until the data labels of the tasks of the challenge are released, if people submit something on those tasks, they cannot see results to prevent them from gaining information by guessing. Only results on the toy problem are shown. You may still practice submitting some random values to test the system, but you will not see the results.
DISCLAIMER
Can a participant give an arbitrary hard time to the organizers?
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". ORANGE AND/OR OTHER ORGANIZERS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL ORANGE AND/OR OTHER ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.