Foundations (Course 1)
INTRODUCTION
This unit introduces the basic concepts of data mining: its motivation and definition; its relationships with database systems, statistics, and machine learning; the different kinds of data repositories on which data mining can be performed; the different kinds of patterns and knowledge to be mined; the concept of interestingness; and current trends and developments in data mining. A few case studies are a natural way to introduce the material.
- Concepts of data mining: motivation, definition, the relationships of data mining with database systems, statistics, machine learning, and information retrieval.
- Knowledge discovery process: an overview of the knowledge discovery in databases (KDD) process, with emphasis on its iterative and interactive nature.
- Mining on different kinds of data: relational, transactional, object-relational, heterogeneous, spatiotemporal, text, multimedia, Web, stream, mobile, and so on.
- Mining for different kinds of knowledge: classification, regression, clustering, frequent patterns, discriminant, outliers, and so on.
- Evaluation of knowledge: interestingness or quality of knowledge, including accuracy, utility (such as support), and relevance (such as correlation).
- Applications of data mining: market analysis, scientific and engineering process analysis, bioinformatics, homeland security, and so on.
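The evaluation measures named in the list above (support as a utility measure, correlation as a relevance measure) can be made concrete with a small sketch. It assumes a market-basket setting and uses a toy transaction list invented purely for illustration; the function names `support`, `confidence`, and `lift` are hypothetical helpers, and lift is used here as one common correlation-style measure, not necessarily the one an instructor would choose.

```python
from typing import FrozenSet, Iterable, List

# Toy transactions, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]


def support(itemset: Iterable[str],
            transactions: List[FrozenSet[str]] = TRANSACTIONS) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]] = TRANSACTIONS) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[FrozenSet[str]] = TRANSACTIONS) -> float:
    """Confidence divided by the consequent's support.

    Values above 1 suggest positive correlation, values below 1 negative
    correlation, and values near 1 independence.
    """
    consequent = frozenset(consequent)
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))
```

On these toy transactions, the rule bread -> milk has support 0.6, confidence of about 0.75, and lift of about 0.94. The lift slightly below 1 illustrates why a rule with high support and confidence is not automatically interesting.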
DATA PREPROCESSING
This unit will cover the following topics: (1) why preprocess the data? (2) basic data cleaning techniques, (3) data integration and transformation, and (4) data reduction methods. In particular, the following topics will be covered.
- Descriptive data summarization: This topic covers basic techniques for summarizing and describing data. It will cover: (1) computing measures of central tendency, such as the mean, median, and mode, (2) computing measures of data dispersion, such as quantiles, boxplots, variance, standard deviation, and outliers, and (3) graphic displays of basic statistical descriptions, such as histograms, scatter plots, boxplots, quantile-quantile plots, and local regression curves.
- Data cleaning methods: Basic techniques for handling missing values, noisy data, and inconsistent data, including typical binning, clustering, and regression methods for data cleaning.
- Data integration and transformation methods: These include data smoothing, data aggregation, data generalization, normalization, and attribute (or feature) construction.
- Basic data reduction methods: It introduces binning (histograms), sampling, and data cube aggregation.
- Discretization and concept hierarchy generation: It covers discretization and concept hierarchy generation for numeric data (including binning, clustering, histogram analysis), and for categorical data (automatic generation of concept hierarchies).
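Several of the techniques listed above (summary statistics, smoothing by binning, normalization, and discretization) are simple enough to sketch in a few lines. The following is a minimal illustration using only the Python standard library. The `PRICES` list is made-up sample data, and the function names are hypothetical helpers, not part of any prescribed course material.

```python
import statistics
from typing import List, Sequence

# Made-up sample data for illustration.
PRICES: List[float] = [4, 8, 15, 21, 21, 24, 25, 28, 34]


def min_max_normalize(values: Sequence[float],
                      new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: nothing to rescale
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]


def z_score_normalize(values: Sequence[float]) -> List[float]:
    """Center on the mean and scale by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def smooth_by_bin_means(values: Sequence[float], depth: int) -> List[float]:
    """Sort values, split them into equal-frequency bins of size `depth`,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed: List[float] = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        smoothed.extend([statistics.fmean(bin_values)] * len(bin_values))
    return smoothed


def equal_width_bins(values: Sequence[float], k: int) -> List[int]:
    """Discretize values into k equal-width intervals; return bin indices."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

For the sample data, `statistics.fmean(PRICES)`, `statistics.median(PRICES)`, and `statistics.mode(PRICES)` give the three measures of central tendency, and `smooth_by_bin_means(PRICES, 3)` replaces each group of three sorted values with its bin mean. Equal-width binning is shown here because it is the simplest discretization to state; equal-frequency or clustering-based discretization would follow the same pattern with a different rule for choosing bin boundaries.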