Frequently Asked Questions

KDD Cup 2006: Pulmonary embolisms detection from image data


The spec says that there are 46 positive cases and 20 negative cases, but I only count 38 positive and 8 negative in the training data. What gives?

Don't panic! As the README included with the data indicates, this release is a preliminary data set and may not match the spec in every respect. The slightly longer story is that Siemens is preparing a more extensive version of the training data, but producing it took longer than they had expected, and we judged that it was more important to go forward with the competition than to wait for the final data. So we released the data that is currently available. The spec is based on Siemens's projections of what the final data will look like, so it is out of sync with the preliminary data that we released. The "final" data may or may not become available in time to be of use during the competition; if it does, I will release it as soon as I can and correct the spec to reflect the final state of the training data.

How do you define candidates which are associated with PE (see pdf-specifications file)?

The 'label' field gives the index of the PE with which the candidate is associated. That is, label=0 means 'this candidate is not associated with a PE', while label=X (for X!=0) means 'this candidate is associated with PE number X'.
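As a minimal sketch of how this label convention might be handled, the snippet below collapses the PE index into a binary target and groups candidate rows by PE. The function names and the toy values are illustrative assumptions, not part of the contest materials. Grouping is keyed on (patient ID, label) because the FAQ does not say whether PE numbers are unique across patients.

```python
from collections import defaultdict


def binarize_labels(labels):
    """Map PE indices to 0/1: any nonzero label means 'associated with a PE'."""
    return [1 if label != 0 else 0 for label in labels]


def group_candidates_by_pe(patient_ids, labels):
    """Collect candidate row indices per (patient ID, PE index), skipping label 0."""
    groups = defaultdict(list)
    for row, (pid, label) in enumerate(zip(patient_ids, labels)):
        if label != 0:
            groups[(pid, label)].append(row)
    return dict(groups)


# Toy values for illustration only; not taken from the real data set.
patient_ids = [3001, 3001, 3001, 3002, 3002]
labels = [0, 1, 1, 2, 0]

binary = binarize_labels(labels)
pe_groups = group_candidates_by_pe(patient_ids, labels)
```

Keeping the per-PE grouping around can be useful if a PE counts as found when any one of its candidates is flagged, though that depends on the scoring rules in the spec.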

Why does the 'KDDPEfeatureNames.txt' file have fewer feature names (116, including Patient ID) than the 'KDDPETrain.txt' file fields (117)?

The last feature name, "Tissue Feature", is missing from the original feature names file.
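One way to patch this locally is sketched below. It works on an already-loaded list of names, since the exact layout of the names file is not described here; the helper name and the toy names are hypothetical.

```python
MISSING_NAME = "Tissue Feature"  # the name the FAQ says is missing


def complete_feature_names(names, n_columns):
    """Return a copy of names, appending the missing final name if one short."""
    names = list(names)
    if len(names) == n_columns - 1:
        names.append(MISSING_NAME)
    if len(names) != n_columns:
        raise ValueError(
            f"expected {n_columns} names, got {len(names)} after patching"
        )
    return names


# Toy example with 3 names for a 4-column file (hypothetical names).
toy_names = ["Patient ID", "Feature A", "Feature B"]
patched = complete_feature_names(toy_names, n_columns=4)
```

The length check is there so that a names file that is already complete, or off by more than one, is not silently altered.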

What's with normalization? The contest documentation says "all features are normalized from 0 to 1," but there are negative numbers all over the place.

Yes, that's a mistake in the documentation. The features are normalized to a unit range with roughly zero mean, so negative values are expected. But feel free to re-normalize the data in any way that you like.
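If a strict 0-to-1 range is wanted, a per-column min-max rescaling is one option. The sketch below assumes the feature columns have already been parsed into rows of floats (with the patient ID and label columns removed); the function name and toy values are illustrative.

```python
def min_max_rescale(rows):
    """Rescale each column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in rows
    ]


# Toy feature rows: the first column spans -1..1, the second is constant.
toy_rows = [[-1.0, 5.0], [0.0, 5.0], [1.0, 5.0]]
rescaled = min_max_rescale(toy_rows)
```

In practice the minimum and span would be computed on the training data and then reused for the test data, so test values may still fall slightly outside 0 to 1.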

How many submissions (i.e. sets of predictions) are allowed for each subtask? What should be the format for the submissions?

I'm working on sorting out the submission process now. I'll let you all know more when I do.

Once the evaluation is done, will the labels of the test data be released to the participants?

Yes. Once the competition is complete, the intention is to release the entire data set.

Can you give us more info on the meaning of each of the features? Background info, interpretation, and so on?

Siemens is unwilling to provide this information. They are hoping for "abstract" approaches that aren't engineered to specific feature sets. (E.g., so that they're easily extensible to different medical problems.)

How far apart should candidates be before I can reasonably assume they are independent (there is no way I can tell this from the processed data)? I.e. how far apart should they be to represent a "different" part of the lung and not be looking at essentially the same structure?

That's a good question. It's part of the challenge for you to determine that. No prior domain knowledge is available on this point.

There appear to be strong correlations within a patient. Can you explain that? Will we be given patient ID in the test data?

There are indeed strong correlations within a patient and even within groups of patients from a single hospital. Part of the challenge is to find useful ways to exploit that information. You will be given patient ID in the final test data, but you will not be given a hospital ID and you may not assume that all of the patients are drawn from the same hospital. The test data patient set will be disjoint from the training data patient set.
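Because the test patients are disjoint from the training patients, local validation probably gives a more honest estimate if it also holds out whole patients rather than individual candidates. The following is a minimal sketch under that assumption; the function name, the hold-out fraction, and the toy patient IDs are illustrative choices.

```python
import random


def split_by_patient(patient_ids, holdout_fraction=0.25, seed=0):
    """Split row indices so that no patient appears on both sides."""
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(patients)
    n_holdout = max(1, round(len(patients) * holdout_fraction))
    holdout_patients = set(patients[:n_holdout])
    train_rows = [i for i, p in enumerate(patient_ids)
                  if p not in holdout_patients]
    holdout_rows = [i for i, p in enumerate(patient_ids)
                    if p in holdout_patients]
    return train_rows, holdout_rows


# Toy patient IDs, two candidates per patient.
toy_patients = [1, 1, 2, 2, 3, 3, 4, 4]
train_rows, holdout_rows = split_by_patient(toy_patients)
```

A split made at the candidate level would let candidates from the same patient land on both sides, and the within-patient correlations would then inflate the validation score.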

Can I return "unknown" labels for some of the data points?

I'll say more about the answer format in the future, but the short answer is: no. You must commit to an answer for each candidate. In practice, if your algorithm were embedded in a piece of medical imaging equipment, you would be required to return an absolute yes/no answer to the physician; there is no room for "I don't know".

Is there label noise in the training or test data?

Answer from Siemens: "We have high confidence that the labels given in the training and test data are correct. They have been rigorously examined individually by domain experts. But there is always an element of human error - there may be small errors in either direction, but we believe that such errors are minimal, at best. There exists the possibility that there are PEs in the original data to which no candidates were assigned at all, but this is beyond the scope of this competition. (Such PEs will have no candidates in the test data, so you won't be scored on them one way or the other.)"
