2017 SIGKDD Dissertation Award Winners
2017 SIGKDD Dissertation Award AwardSIGKDD Dissertation Awards (1 winner, 2 runner-ups, and 1 honorable mention)
ACM SIGKDD dissertation awards recognize outstanding work done by graduate students in the areas of data science, machine learning and data mining.
Review Criteria:
- Relevance of the Dissertation to KDD
- Originality of the Main Ideas in the Dissertation
- Significance of Scientific Contributions
- Technical Depth and Soundness of Dissertation (including experimental methodologies, theoretical results, etc.)
- Overall Presentation and Readability of Dissertation (including organization, writing style and exposition, etc.)
WINNER
Local Modeling of Attributed Graphs: Algorithms and Applications
Bryan Perozzi (student) and Steven Skiena (advisor) at Stony Brook University
Abstract: It is increasingly common to encounter real-world graphs which have attributes associated with the nodes, in addition to their raw connectivity information. For example, social networks contain both the friendship relations as well as user attributes such as interests and demographics. A protein-protein interaction network may not only have the interaction relations but the expression levels associated with the proteins. Such information can be described by a graph in which nodes represent the objects, edges represent the relations between them, and feature vectors associated with the nodes represent the attributes. This graph data is often referred to as an attributed graph. This thesis focuses on developing scalable algorithms and models for attributed graphs. This data can be viewed as either discrete (set of edges), or continuous (distances between embedded nodes), and I examine the issue from both sides. Specifically, I present an online learning algorithm which utilizes recent advances in deep learning to create rich graph embeddings. The multiple scales of social relationships encoded by this novel approach are useful for multi-label classification and regression tasks in networks. I also present local algorithms for anomalous community scoring in discrete graphs. These algorithms discover subsets of the graph’s attributes which cause communities to form (e.g. shared interests on a social network). The scalability of all the methods in this thesis is ensured by building from a restricted set of graph primitives, such as ego-networks and truncated random walks, which exploit the local information around each vertex. In addition, limiting the scope of graph dependencies we consider enables my approaches to be trivially parallelized using commodity tools for big data processing, like MapReduce or Spark. The applications of this work are broad and far reaching across the fields of data mining and information retrieval, including user profiling/demographic inference, online advertising, and fraud detection.
Runner-up: User Behavior Modeling with Large-Scale Graph Analysis
Alex Beutel (student) and Alex Smola and Christos Faloutsos (co-advisors) at Carnegie Mellon University
Abstract: Can we model how fraudsters work to distinguish them from normal users? Can we predict not just which movie a person will like, but also why? How can we find when a student will become confused or where patients in a hospital system are getting infected? How can we effectively model large attributed graphs of complex interactions?
In this dissertation we understand user behavior through modeling graphs. Online, users interact not just with each other in social networks, but also with the world around them—supporting politicians, watching movies, buying clothing, searching for restaurants and finding doctors. These interactions often include insightful contextual information as attributes, such as the time of the interaction and ratings or reviews about the interaction. The breadth of interactions and contextual information being stored presents a new frontier for graph modeling.
To improve our modeling of user behavior, we focus on three broad challenges: (1) modeling abnormal behavior, (2) modeling normal behavior and (3) scaling machine learning. To more effectively model and detect abnormal behavior, we model how fraudsters work, catching previously undetected fraud on Facebook, Twitter, and Tencent Weibo and improving classification accuracy by up to 68%. By designing flexible and interpretable models of normal behavior, we can predict why you will like a particular movie. Last, we scale modeling of large hypergraphs by designing machine learning systems that scale to hundreds of gigabytes of data, billions of parameters, and are 26 times faster than previous methods. This dissertation provides a foundation for making graph modeling useful for many other applications as well as offers new directions for designing more powerful and flexible models.
Runner-up: Mining Large Multi-Aspect Data: Algorithms and Applications
Evangelos E. Papalexakis (Student) and Christos Faloutsos (Advisor) at Carnegie Mellon University
Abstract: What does a person’s brain activity look like when they read the word apple? How does it differ from the activity of the same (or even a different person) when reading about an airplane? How can we identify parts of the human brain that are active for different semantic concepts? On a seemingly unrelated setting, how can we model and mine the knowledge on web (e.g., subject-verb-object triplets), in order to find hidden emerging patterns? Our proposed answer to both problems (and many more) is through bridging signal processing and large-scale multi-aspect data mining. Specifically, language in the brain, along with many other real-word processes and phenomena, have different aspects, such as the various semantic stimuli of the brain activity (apple or airplane), the particular person whose activity we analyze, and the measurement technique. In the above example, the brain regions with high activation for “apple” will likely differ from the ones for “airplane”. Nevertheless, each aspect of the activity is a signal of the same underlying physical phenomenon: language understanding in the human brain. Taking into account all aspects of brain activity results in more accurate models that can drive scientific discovery (e.g, identifying semantically coherent brain regions).
In addition to the above Neurosemantics application, multi-aspect data appear in numerous scenarios such as mining knowledge on the web, where different aspects in the data include entities in a knowledge base and the links between them or search engine results for those entities, and multi-aspect graph mining, with the example of multi-view social networks, where we observe social interactions of people under different means of communication, and we use all aspects of the communication to extract communities more accurately.
The main thesis of our work is that many real-world problems, such as the aforementioned, benefit from jointly modeling and analyzing the multi-aspect data associated with the underlying phenomenon we seek to uncover. In this thesis we develop scalable and interpretable algorithms for mining big multiaspect data, with emphasis on tensor decomposition. We present algorithmic advances on scaling up and parallelizing tensor decomposition and assessing the quality of its results, that have enabled the analysis of multi-aspect data that the state-of-the-art could not support. Indicatively, our proposed methods speed up the state-of-the-art by up to two orders of magnitude, and are able to assess the quality for 100 times larger tensors. Furthermore, we present results on multi-aspect data applications focusing on Neurosemantics and Social Networks and the Web, demonstrating the effectiveness of multiaspect modeling and mining. We conclude with our future vision on bridging Signal Processing and Data Science for real-world applications.
Honorable mention: Computational Lens on Big Social and Information Networks
Yuxiao Dong (Student) and Nitesh V. Chawla (Advisor) at University of Notre Dame
Abstract: What does a person’s brain activity look like when they read the word apple? How does it differ from the activity of the same (or even a different person) when reading about an airplane? How can we identify parts of the human brain that are active for different semantic concepts? On a seemingly unrelated setting, how can we model and mine the knowledge on web (e.g., subject-verb-object triplets), in order to find hidden emerging patterns? Our proposed answer to both problems (and many more) is through bridging signal processing and large-scale multi-aspect data mining. Specifically, language in the brain, along with many other real-word processes and phenomena, have different aspects, such as the various semantic stimuli of the brain activity (apple or airplane), the particular person whose activity we analyze, and the measurement technique. In the above example, the brain regions with high activation for “apple” will likely differ from the ones for “airplane”. Nevertheless, each aspect of the activity is a signal of the same underlying physical phenomenon: language understanding in the human brain. Taking into account all aspects of brain activity results in more accurate models that can drive scientific discovery (e.g, identifying semantically coherent brain regions).
Abstract: The connections between individuals form the structural backbone of human societies, which manifest as networks. In a network sense, individuals matter in the ways in which their unique demographic attributes and diverse interactions activate the emergence of new phenomena at larger, societal levels. Accordingly, this thesis develops computational models to investigating the ways that individuals are embedded in and interact within a wide range of over one hundred big networks---the biggest with over 60 million nodes and 1.8 billion edges---with an emphasis on two fundamental and interconnected directions: user demographics and network diversity.
Work in this thesis in the direction of demographics unveils the social strategies that are used to satisfy human social needs evolve across the lifespan, examines how males and females build and maintain similar or dissimilar social circles, and reveals how classical social theories---such as weak/strong ties, social balance, and small worlds---are influenced in the context of digitally recorded big networks coupled with socio-demographics. Our work on demographics also develops scalable graphical models that are capable of incorporating structured discoveries (features), facilitating conventional data mining tasks in networks. Work in this part demonstrates the predictability of user demographic attributes from networked systems, enabling the potential for precision marketing and business intelligence in social networking services. Work in this thesis in the direction of diversity examines how the diverse structures of common neighborhood influence link formation locally and network organization globally, how this influence varies across different types of social and information networks, and how it concords or conflicts with the principle of homophily. Work in this direction reveals how topic diversity---in contrast to authority and popularity---drives the growth of impact in academic collaboration and citation networks as well. Finally, our work on diversity presents neural network based representation learning models for embedding heterogeneous networks in which there exist diverse types of nodes and edges, giving rise to important implications for traditional mining and learning tasks in heterogeneous network data, including similarity search, clustering, and classification.
Congratulations to all the outstanding students who were nominated and to the winners of this year.