Data Mining Curriculum: A Proposal

Advanced Topics (Course II)

This unit will cover advanced data reduction methods.
  • Advanced data reduction methods: (1) dimensionality reduction (feature or attribute subset reduction), (2) numerosity reduction (regression, histogram, clustering, sampling, singular value decomposition (SVD), and discretization), and (3) data compression (lossless versus lossy compression, Fourier and wavelet transformation, and principal component analysis).

This unit covers advanced material in data warehousing, OLAP, and data generalization.
  • The multidimensional data model
  • Implementation of data warehouses: data integration, indexing OLAP data (bitmap index), efficient processing of OLAP queries, metadata repository, data warehouse back-end tools and utilities.
  • Efficient computation of data cubes: categorization of measures (distributive, algebraic, and holistic), cube computation methods, iceberg cubes, top-down and bottom-up computation, computing closed and approximate data cubes.
  • Other data generalization approaches: Attribute-oriented induction, mining class comparisons; discriminating between different classes.
  • Exploration of data warehouse and data mining: Discovery-driven exploration of data cubes, complex aggregation at multiple granularity, cube gradient analysis, from online analytical processing to online analytical mining.
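
The distinction between distributive, algebraic, and holistic measures listed above can be illustrated with a minimal sketch. The partition contents and variable names below are hypothetical and chosen only for illustration; the point is that sums combine directly across partitions, an average can be derived from a fixed number of distributive aggregates, and a median generally cannot be recovered from per-partition medians.

```python
from statistics import median

# Hypothetical partitions of one measure (e.g., sales) stored in three chunks.
partitions = [[4, 8, 15], [16, 23], [42]]

# Distributive: the global sum is the sum of the per-partition sums.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: the average is derived from a bounded number of
# distributive aggregates (here, sum and count).
partial_counts = [len(p) for p in partitions]
average = total / sum(partial_counts)

# Holistic: the exact median needs the underlying values; combining
# per-partition medians does not, in general, give the same answer.
all_values = [v for p in partitions for v in p]
exact_median = median(all_values)
median_of_medians = median(median(p) for p in partitions)

print(total, average, exact_median, median_of_medians)
```

In this toy data the median of the partition medians differs from the exact median, which is one reason holistic measures are harder to precompute in a data cube.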

  • Advanced frequent pattern mining methods: (1) vertical format mining, (2) pattern-growth algorithm, (3) mining closed patterns and max-patterns
  • Constraint-based association mining: (1) rule- and query-guided association mining, (2) anti-monotonicity, monotonicity, succinctness in constraint mining, (3) convertible constraints
  • Extensions and applications of frequent pattern mining: (1) iceberg cube computation, (2) fascicles and semantic data compression, (3) frequent pattern-based classification and cluster analysis
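
As a small illustration of vertical format mining and the anti-monotonicity of support mentioned in the topics above, the following sketch stores, for each item, the set of transaction IDs that contain it, and computes the support of an itemset by intersecting those sets. The transactions and the helper name `support` are hypothetical, and this is only a toy example, not a full mining algorithm.

```python
# Hypothetical transaction database in horizontal format: tid -> items.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "c"},
    3: {"a", "d"},
    4: {"b", "c"},
}

# Convert to vertical format: item -> set of tids containing that item.
tidlists = {}
for tid, items in transactions.items():
    for item in items:
        tidlists.setdefault(item, set()).add(tid)


def support(itemset):
    """Support count of a non-empty itemset via tid-list intersection."""
    return len(set.intersection(*(tidlists[item] for item in itemset)))


print(support(["a", "c"]))       # tids shared by "a" and "c"
print(support(["a", "b", "c"]))  # adding an item can only shrink the tid set
```

Because each added item can only remove transaction IDs from the intersection, the support of an itemset never exceeds that of any of its subsets; this is the anti-monotone property that lets infrequent itemsets be pruned early.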

  • Bayesian belief networks: (advanced) methods for choosing BBN structure and training Bayesian belief networks
  • Advanced decision tree construction: (1) enhancements to basic classification tree induction, (2) scalable algorithms for classification tree induction, (3) integrating data warehousing techniques and classification tree induction, (4) classification with partially labeled data
  • Neural network approach for classification: (1) a multi-layer feed-forward neural network, (2) defining a network topology, (3) back-propagation, (4) interpretability of classification results
  • Kernel methods: (1) kernel logistic regression, (2) kernel discriminant analysis, (3) advanced SVM kernel methods
  • Introduction to learning theory: PAC-learnability, empirical, true and structural risk, VC-theory
  • Ensemble construction: Weighted voting, bagging, weak learner, boosting, AdaBoost
  • Other classification methods: (1) case-based reasoning, (2) genetic algorithms, (3) rough set approach, (4) fuzzy set approach

  • Grid-based clustering: A statistical information grid approach, clustering by wavelet analysis, clustering high-dimensional space.
  • Clustering high-dimensional data: Subspace clustering, frequent pattern-based clustering, clustering by wavelet analysis.
  • Advanced outlier analysis: Statistical-based outlier detection, distance-based outlier detection, deviation-based outlier detection, analysis of local outliers.
  • Collaborative filtering

Advanced Time-Series and Sequential Data Mining
This unit covers the advanced techniques for mining sequential data, including the following topics.
  • Similarity search in time-series analysis
  • Hidden Markov models
  • Periodicity analysis: Transformation-based approach, mining partial periodicity.
  • Sequence segmentation: Hidden Markov model and Variable Markov model for sequence segmentation.
  • Sequence classification and clustering: (1) $q$-gram based methods, keyword-based methods; (2) (high order) Markov chain, hidden Markov model; (3) suffix tree, probabilistic suffix tree, and probabilistic automata.

Mining Data Streams
This unit covers the techniques for mining stream data, including the following topics.
  • What is stream data?
  • Basic tools: Chernoff bounds, reservoir sampling
  • Stream sample counting and frequent pattern analysis
  • Classification of data streams
  • Clustering data streams
  • Online sensor data analysis
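
Reservoir sampling, one of the basic tools listed above, can be sketched in a few lines. The version below follows the classic single-pass scheme in which the i-th item (counting from zero) replaces a random reservoir slot with probability k/(i+1); the function name and the optional `rng` parameter are assumptions made for this illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable
    of unknown length, using O(k) memory and a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; keep the new item only if it lands
            # inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, random.Random(0))
print(sample)
```

The stream is never stored or revisited, which is what makes the technique suitable for data that arrives continuously.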

Mining Spatial, Spatiotemporal, and Multimedia data
This unit covers the techniques for mining spatial, spatiotemporal, and multimedia data, including the following topics.
  • Mining spatial and spatiotemporal databases: (1) Spatial data cube construction and spatial OLAP, (2) spatial association and co-location analysis, (3) spatial clustering methods, (4) spatial classification and spatial trend analysis, (5) spatiotemporal data mining, (6) mining moving objects and trajectories.
  • Mining multimedia databases: (1) multidimensional analysis of multimedia data, (2) similarity search in multimedia data, (3) classification and regression analysis of multimedia data, (4) mining association and correlation in multimedia data, (5) clustering multimedia data
  • Mining object databases: (1) multidimensional analysis of complex objects, (2) generalization on complex structured and semi-structured data, (3) methodology for mining complex object databases: aggregation, approximation, and progressive refinement.

Mining Biological Data
This unit covers the techniques for mining biological data, including the following topics.
  • Mining DNA, RNA, and proteins: (1) Mining motif patterns, (2) searching homology in large databases, (3) phylogenetic and functional prediction.
  • Mining gene expression data: (1) clustering gene expression, e.g., for gene regulatory networks, (2) classifying gene expression, e.g., for disease-sensitive gene discovery.
  • Mining mass spectrometry data
  • Mining and integrating knowledge from biomedical literature
  • Mining inter-domain associations

Text mining
This module will cover work that applies known mining techniques to text, emphasizing the new issues that arise.
  • Text representation: Set-of-words, bag-of-words, vector-space model; the issue of large raw dimensionality
  • Dimensionality reduction: PCA, SVD, latent semantic indexing
  • Text clustering: agglomerative, $k$-means, EM; effect of a large number of noise dimensions, partial supervision
  • Feature selection in high dimensions
  • Naive Bayes classification: Poor density estimates, small-degree Bayesian belief network induction
  • Discriminative learning: maximum entropy, logistic regression, and support vector learning
  • Shallow linguistics: Phrase detection, part-of-speech tagging, named entity extraction, word sense disambiguation

Hypertext and Web mining
This module will cover work that is specific to analyzing hypermedia, i.e., involving hierarchical tagging languages and hyperlinks in conjunction with text.
  • Web modeling: The Web as an evolving, collaborative, populist social network: aggregate graph-structure of the Web, preferential attachment linking models and experimental validation
  • Link mining and social network analysis: Links as endorsement: PageRank and HITS algorithms to identify authoritative Web pages; connections with bibliometry
  • The PageRank algorithm: Integrating page content and page layout with link structure; topic-sensitive PageRanks; Google
  • Mining by exploiting text and links: Exploiting text and links for better clustering and classification; unified probabilistic models for text and links
  • Structured data extraction: Information extraction, exploiting markup structure to extract structured data from pages meant for human consumption
  • Multidimensional Web databases: Automatic construction of multilayered Web information base; discovering entities and relations on the Web (WebKB)
  • Exploration and resource discovery on the Web: reinforcement learning, other approaches
  • Web usage mining and adaptive Web sites: Reorganizing Web sites by mining log data
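
The idea of links as endorsement behind PageRank can be illustrated with a simplified power-iteration sketch. The tiny link graph, the damping factor of 0.85, and the fixed iteration count are assumptions for this example; it is a textbook-style simplification and not a description of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every
    link target is assumed to appear as a key as well.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph, page "c" is linked to by three other pages and ends up with the highest score, while "d", which nobody links to, keeps only the baseline share.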

Data Mining Languages, Standards, and System Architectures
This unit covers the issues related to data mining languages, standards, and system architectures, including the following topics.
  • Data mining primitives: what defines a data mining task? task-relevant data, the kind of knowledge to be mined, background knowledge: concept hierarchies, user-specified constraints, interestingness measures, presentation of discovered patterns
  • Data mining languages, user interfaces, and standardization efforts
  • Architectures of data mining systems

Data Mining Applications
This unit covers the issues related to domain-specific data mining applications, including the following topics. (Note: Some of these themes, if concrete and good materials are available, should go into the Foundations part as case studies.)
  • Data mining for financial data analysis
  • Data mining for the retail industry
  • Data mining for the telecommunication industry
  • Data mining for intrusion detection
  • Data mining in scientific and statistical applications
  • Data mining in software engineering and computer system analysis

Data Mining and Society
This unit covers the issues related to social impacts of data mining, including the following topics.
  • Social impacts of data mining
  • Data mining vs. data security and privacy
  • Privacy-preserving data mining

Trends in Data Mining
This unit covers the major trends in data mining, including the following topics.
  • Setting solid theoretical foundations for data mining
  • Mining deep in specific applications
  • Ubiquitous and invisible data mining
  • Integrated data and information systems

Copyrights © 2023 All Rights Reserved - SIGKDD