SIGKDD Dissertation Awards (1 winner, 1 runner-up and 3 honorable mentions)
ACM SIGKDD dissertation awards recognize outstanding work done by graduate students in the areas of data science, machine learning and data mining.
Selection Procedure: We received 19 nominations this year, a new record in the history of this award. After receiving the nominations, we invited leading experts from all over the world to serve on the award selection committee. Each dissertation was reviewed by at least 3 experts, who helped group the dissertations into two competing groups. During the second phase, all members were invited to rank the top 5 nominations. Dissertations were evaluated on the following criteria:
* Relevance of the Dissertation to KDD
* Originality of the Main Ideas in the Dissertation
* Significance of Scientific Contributions
* Technical Depth and Soundness of Dissertation (including experimental methodologies, theoretical results, etc.)
* Overall Presentation and Readability of Dissertation (including organization, writing style and exposition, etc.)
Winner: Reconstruction and Applications of Collective Storylines from Web Photo Collections
Gunhee Kim (Student) and Eric Xing (Advisor)
Abstract: Widespread access to photo-taking devices and high-speed Internet has combined with rampant social networking to produce an explosion in picture sharing on Web platforms. In this environment, new challenges in image acquisition, processing, and sharing have emerged, creating exciting opportunities for research in computer vision and multimedia data mining. In this dissertation, we explore one of these interesting problems: the reconstruction of collective storylines as an efficient but comprehensive structural summary of ever-growing big image data shared online.
More specifically, the goal of this dissertation can be summarized as follows. Given large-scale online image collections and associated meta-data, we aim to create the collective storylines by jointly inferring the temporal trends and the overlapping contents of image collections. We also explore novel computer vision and data mining applications taking advantage of the reconstructed photo storylines.
In order to achieve the proposed research objective, we develop the required technologies along three research directions: (1) understanding of temporal trends of image collections, (2) discovery of overlapping contents across image collections, and (3) reconstruction and applications of collective photo storylines. The first direction addresses the problem of understanding which topics are popular, when, and among whom in the image collections, while the second studies approaches for detecting salient and recurring contents across the image collections in the form of bounding boxes or pixel-wise segmentations. Finally, based upon the results of the work in the first two directions, we propose algorithms for reconstructing branching storyline graphs, and explore their promising applications at the intersection of computer vision and multimedia data mining.
Runner-up: Human-Powered Data Management
Aditya Parameswaran (Student) and Hector Garcia-Molina (Advisor)
Abstract: Fully automated algorithms are inadequate for a number of data analysis tasks, especially those involving images, video, or text. Thus, there is often a need to combine "human computation" (or crowdsourcing) with traditional computation in order to improve the process of understanding and analyzing data. However, most data management applications currently employ crowdsourcing in an ad-hoc fashion; these applications are not optimized for low monetary cost, low latency, or high accuracy. In this thesis, we develop a formalism for reasoning about human-powered data management, and use this formalism to design: (a) a toolbox of basic data processing algorithms, optimized for cost, latency, and accuracy, and (b) practical data management systems and applications that use these algorithms. We demonstrate that our techniques lead to algorithms and systems that expend very few resources (e.g., time waiting, human effort, or money spent), while providing results of equally high quality compared to approaches currently used in practice.
Honorable mention: Exploratory Mining of Collaborative Social Content
Mahashweta Das (Student) and Gautam Das (Advisor)
Abstract: The widespread use and growing popularity of online collaborative content sites (e.g., Yelp, Amazon, IMDB) have created rich resources for consumers to consult in order to make purchasing decisions on various items such as restaurants, movies, e-commerce products, etc. They have also created new opportunities for producers of such items to improve business by designing better products, composing interesting advertisement snippets, building more effective personalized recommendation systems, etc. This motivates us to develop a framework for exploratory mining of user feedback on items in collaborative social content sites. Typically, the amount of user feedback (e.g., ratings, reviews, tags) associated with an item (or a set of items) can easily reach hundreds or thousands, resulting in an overwhelming amount of information (information explosion), which users may find difficult to cope with (information overload). For example, popular restaurants listed on the review site Yelp routinely receive several thousand ratings and reviews, thereby making decision making cumbersome. Moreover, most online activities involve interactions between multiple items and different users, and interpreting such complex user-item interactions becomes intractable too. Our research concerns developing novel data mining and exploration algorithms to formally analyze how user and item attributes influence user-item interactions. In this dissertation, we choose to focus on short user feedback (i.e., ratings and tags) and reveal how it, in conjunction with structural attributes associated with items and users, opens up exciting opportunities for performing aggregated analytics.
The aggregate analysis goal is two-fold: (i) exploratory mining to help content consumers make more informed judgments (e.g., whether a user will enjoy eating at a particular restaurant), as well as (ii) exploratory mining to help content producers conduct better business (e.g., a redesigned menu to attract more people of a certain demographic group). We identify a family of mining tasks and propose a suite of algorithms (exact, approximation with theoretical properties, and efficient heuristics) for solving the problems. Performance evaluation over synthetic data and real data crawled from the web validates the utility of our framework and the effectiveness of our algorithms.
Honorable mention: Uncovering Structure in High-Dimensions: Networks and Multi-Task Learning Problems
Mladen Kolar (Student) and Eric Xing (Advisor)
Abstract: Extracting knowledge and providing insights into complex mechanisms underlying noisy high-dimensional data sets is of utmost importance in many scientific domains. Statistical modeling has become ubiquitous in the analysis of high-dimensional functional data in search of a better understanding of cognition mechanisms, in the exploration of large-scale gene regulatory networks in hope of developing drugs for lethal diseases, and in the prediction of volatility in the stock market in hope of beating the market. Statistical analysis of these high-dimensional data sets is possible only if an estimation procedure exploits hidden structures underlying the data. This thesis develops flexible estimation procedures with provable theoretical guarantees for uncovering unknown hidden structures underlying the data-generating process. Of particular interest are procedures that can be used on high-dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Learning in high dimensions is difficult due to the curse of dimensionality; however, the special problem structure makes inference possible. Due to its importance for scientific discovery, we put emphasis on consistent structure recovery throughout the thesis. Particular focus is given to two important problems: semi-parametric estimation of networks and feature selection in multi-task learning.
Honorable mention: Community Structure of Large Networks
Jaewon Yang (Student) and Jure Leskovec (Advisor)
Abstract: One of the main organizing principles in real-world networks is that of network communities, which are sets of nodes that share common properties, functions, or roles. Communities in networks often overlap as nodes can belong to multiple communities at once. Identifying such overlapping network communities is crucial for an understanding of social, technological, and biological networks. In this thesis, we develop a family of accurate and scalable community detection methods and apply them to large networks. We begin by challenging the conventional view that defines network communities as densely connected clusters of nodes. We show that the conventional view leads to an unrealistic structure of community overlaps. We present a new conceptual model of network communities, which reliably captures the overall structure of a network as well as accurately models community overlaps. Based on our model, we develop accurate algorithms for detecting overlapping communities that scale to networks an order of magnitude larger than what was possible before. Our approach leads to novel insights that unify two fundamental organizing principles of networks: modular communities and the commonly observed core-periphery structure. In particular, our results show that dense network cores stem from the overlaps between many communities. As the final part of the thesis, we present several extensions of our models such that we can detect communities with a bipartite connectivity structure and we combine the node attributes and the network structure for community detection.
Congratulations to all the outstanding students who were nominated and to the winners of this year.