KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining

Full Citation in the ACM Digital Library

SESSION: Keynote Talks

Automated Mechanism Design for Strategic Classification: Abstract for KDD'21 Keynote Talk

AI is increasingly making decisions, not only for us, but also about us -- from whether we are invited for an interview, to whether we are proposed as a match for someone looking for a date, to whether we are released on bail. Often, we have some control over the information that is available to the algorithm; we can self-report some information, and other information we can choose to withhold. This creates a potential circularity: the classifier used, mapping submitted information to outcomes, depends on the (training) data that people provide, but the (test) data depend on the classifier, because people will reveal their information strategically to obtain a more favorable outcome. This setting is not adversarial, but it is also not fully cooperative. Mechanism design provides a framework for making good decisions based on strategically reported information, and it is commonly applied to the design of auctions and matching mechanisms. However, the setting above is unlike these common applications, because in it, preferences tend to be similar across agents, but agents are restricted in what they can report. This creates both new challenges and new opportunities, as we demonstrate in our theoretical work and our initial experiments. This is joint work with Hanrui Zhang, Andrew Kephart, Yu Cheng, Anilesh Krishnaswamy, Haoming Li, and David Rein.

Data Science for Assembly Engineering

The discovery and design of new materials able to self-assemble from nanoscale building blocks are increasingly enabled by large-scale molecular simulation. Aided by fast simulation codes leveraging powerful computer architectures, an unprecedented amount of data can be generated in the blink of an eye, shifting the effort and focus of the computational scientist from the simulation to the data. How do we manage so much data, and what do we do with it when we have it? In this talk, we discuss the applications of data science and data-driven thinking to molecular and materials simulation. Although we do so in the context of assembly engineering of soft matter, the tools and techniques discussed are general and applicable to a wide range of problems. We present applications of machine learning to automated structure identification of complex colloidal crystals, high-throughput mapping of phase diagrams, the study of kinetic pathways between fluid and solid phases, and the discovery of previously elusive design rules and structure-property relationships.

Biography: Sharon C. Glotzer is the John W. Cahn Distinguished University Professor at the University of Michigan, Ann Arbor, the Stuart W. Churchill Collegiate Professor of Chemical Engineering, and the Anthony C. Lembke Department Chair of Chemical Engineering. She is also Professor of Materials Science and Engineering, Physics, Applied Physics, and Macromolecular Science and Engineering. Her research on computational assembly science and engineering aims toward predictive materials design of colloidal and soft matter: using computation, geometrical concepts, and statistical mechanics, her research group seeks to understand complex behavior emerging from simple rules and forces, and use that knowledge to design new classes of materials. Glotzer's group also develops and disseminates powerful open-source software including the particle simulation toolkit, HOOMD-blue, which allows for fast molecular simulation of materials on graphics processors, the signac framework for data and workflow management, and several analysis and visualization tools. Glotzer received her B.S. in Physics from UCLA and her PhD in Physics from Boston University. She is a member of the National Academy of Sciences, the National Academy of Engineering and the American Academy of Arts and Sciences.

Safe Learning in Robotics

In many applications of autonomy in robotics, guarantees that constraints are satisfied throughout the learning process are paramount. We present a controller synthesis technique based on the computation of reachable sets, using optimal control and game theory. Then, we present methods for combining reachability with learning-based methods, to enable performance improvement while maintaining safety and to move towards safe robot control with learned models of the dynamics and the environment. We will illustrate these "safe learning" methods on robotic platforms at Berkeley, including demonstrations of motion planning around people, and navigating in a priori unknown environments.

Biography: Claire Tomlin is a Professor of Electrical Engineering and Computer Sciences at the University of California at Berkeley, where she holds the Charles A. Desoer Chair in Engineering. Claire received her B.A.Sc. in EE from the University of Waterloo in 1992, her M.Sc. in EE from Imperial College, London, in 1993, and her PhD in EECS from Berkeley in 1998. She held the positions of Assistant, Associate, and Full Professor at Stanford from 1998-2007, and in 2005 joined Berkeley. Claire works in hybrid systems and control, and integrates machine learning methods with control theoretic methods in the field of safe learning. She works in the applications of air traffic and unmanned air vehicle systems. Claire is a MacArthur Foundation Fellow, an IEEE Fellow, and an AIMBE Fellow. She was awarded the Donald P. Eckman Award of the American Automatic Control Council in 2003, an Honorary Doctorate from KTH in 2016, and in 2017 she won the IEEE Transportation Technologies Award. In 2019, she was elected to the National Academy of Engineering and the American Academy of Arts and Sciences.

On the Nature of Data Science

One can hear "Data Science" defined as a synonym for machine learning or as a branch of Statistics. I shall argue that it is far more than that; it is the natural evolution of the technology of very large-scale data management to solve problems in scientific and commercial fields. To support my argument, I shall give a brief introduction to two algorithms that are important in data science but that are neither machine learning nor statistics: locality-sensitive hashing and counting distinct elements.
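
To make the second example concrete, here is a minimal Python sketch of the Flajolet-Martin idea behind counting distinct elements in one pass with small memory; the hash construction, constants, and simple averaging below are illustrative choices, not taken from the talk.

```python
# Minimal sketch of Flajolet-Martin distinct-element counting (illustrative only).
import hashlib

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x (defined as 32 for x == 0)."""
    if x == 0:
        return 32
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def estimate_distinct(stream, num_hashes: int = 32) -> float:
    """Average several independent Flajolet-Martin estimators for stability."""
    max_zeros = [0] * num_hashes
    for item in stream:
        for seed in range(num_hashes):
            h = hashlib.sha1(f"{seed}:{item}".encode()).digest()
            value = int.from_bytes(h[:4], "big")
            max_zeros[seed] = max(max_zeros[seed], trailing_zeros(value))
    # 0.77351 is the standard Flajolet-Martin bias-correction constant.
    estimates = [2 ** r / 0.77351 for r in max_zeros]
    return sum(estimates) / num_hashes

if __name__ == "__main__":
    data = [i % 500 for i in range(20_000)]   # 500 distinct values
    print(estimate_distinct(data))            # roughly 500
```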

Biography: Jeff Ullman is the Stanford W. Ascherman Professor of Engineering (Emeritus) in the Department of Computer Science at Stanford and CEO of Gradiance Corp. He received the B.S. degree from Columbia University in 1963 and the PhD from Princeton in 1966. Prior to his appointment at Stanford in 1979, he was a member of the technical staff of Bell Laboratories from 1966-1969, and on the faculty of Princeton University between 1969 and 1979. From 1990-1994, he was chair of the Stanford Computer Science Department. Ullman was elected to the National Academy of Engineering in 1989, the American Academy of Arts and Sciences in 2012, the National Academy of Sciences in 2020, and has held Guggenheim and Einstein Fellowships. He has received the Sigmod Contributions Award (1996), the ACM Karl V. Karlstrom Outstanding Educator Award (1998), the Knuth Prize (2000), the Sigmod E. F. Codd Innovations award (2006), the IEEE von Neumann medal (2010), the NEC C&C Foundation Prize (2017), and the ACM A.M. Turing Award (2020). He is the author of 16 books, including books on database systems, data mining, compilers, automata theory, and algorithms.

SESSION: Research Track Papers

LawyerPAN: A Proficiency Assessment Network for Trial Lawyers

Assessing the proficiency of trial lawyers in different legal fields is of significant importance, since a qualified lawyer or lawyer team can strive for their clients' best rights while ensuring the fairness of litigation. However, proficiency assessment for lawyers is very challenging due to technical and domain issues such as the lack of unified evaluation standards and the complex interactions between lawyers and cases in real legal systems. To this end, we propose a novel proficiency assessment network for trial lawyers (LawyerPAN) to quantify lawyer proficiency through online litigation records. Specifically, we first leverage theories from psychological measurement to map the proficiency of lawyers in each field into a unified real number space. Meanwhile, the characteristics of cases (i.e., case difficulty and discrimination) are explicitly modeled to ensure fairness when assessing lawyers across different cases and fields. Then, we model the interactions between lawyers and cases from two perspectives: the anticipatory perspective aims to measure the personal proficiency of anticipated strategy, and the adversarial perspective seeks to depict the gap in lawyers' proficiency between the two sides (i.e., plaintiffs and defendants). Finally, we conduct extensive experiments on real-world data, and the results show the effectiveness and interpretability of our approach in assessing the proficiency of trial lawyers.

Fine-Grained System Identification of Nonlinear Neural Circuits

We study the problem of sparse nonlinear model recovery of high dimensional compositional functions. Our study is motivated by emerging opportunities in neuroscience to recover fine-grained models of biological neural circuits using collected measurement data. Guided by available domain knowledge in neuroscience, we explore conditions under which one can recover the underlying biological circuit that generated the training data. Our results suggest insights of both theoretical and practical interests. Most notably, we find that a sign constraint on the weights is a necessary condition for system recovery, which we establish both theoretically with an identifiability guarantee and empirically on simulated biological circuits. We conclude with a case study on retinal ganglion cell circuits using data collected from mouse retina, showcasing the practical potential of this approach.

Why Attentions May Not Be Interpretable?

Attention-based methods have played important roles in model interpretation, where the calculated attention weights are expected to highlight the critical parts of inputs (e.g., keywords in sentences). However, recent research has found that attention-as-importance interpretations often do not work as expected. For example, learned attention weights sometimes highlight less meaningful tokens like "[SEP]", ",", and ".", and are frequently uncorrelated with other feature importance indicators like gradient-based measures. A recent debate over whether attention is an explanation or not has drawn considerable interest. In this paper, we demonstrate that one root cause of this phenomenon is combinatorial shortcuts: in addition to the highlighted parts, the attention weights themselves may carry extra information that can be exploited by downstream models after the attention layers. As a result, the attention weights are no longer pure importance indicators. We theoretically analyze combinatorial shortcuts, design an intuitive experiment to show their existence, and propose two methods to mitigate this issue. We conduct empirical studies on attention-based interpretation models. The results show that the proposed methods can effectively improve the interpretability of attention mechanisms.

Multi-facet Contextual Bandits: A Neural Network Perspective

Contextual multi-armed bandits have been shown to be an effective tool in recommender systems. In this paper, we study a novel problem of multi-facet bandits involving a group of bandits, each characterizing the users' needs from one unique aspect. In each round, for the given user, we need to select one arm from each bandit such that the combination of all arms maximizes the final reward. This problem finds immediate applications in E-commerce, healthcare, etc. To address this problem, we propose a novel algorithm, named MuFasa, which utilizes an assembled neural network to jointly learn the underlying reward functions of multiple bandits. It estimates an Upper Confidence Bound (UCB) linked with the expected reward to balance exploitation and exploration. Under mild assumptions, we provide the regret analysis of MuFasa. It achieves a near-optimal Õ((K + 1)√T) regret bound, where K is the number of bandits and T is the number of played rounds. Furthermore, we conduct extensive experiments to show that MuFasa outperforms strong baselines on real-world data sets.
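
As a toy illustration of the UCB principle the paper builds on (not MuFasa's assembled neural network or its joint reward model), the sketch below keeps per-arm statistics for each facet and, in each round, picks from every facet the arm with the largest mean-plus-exploration-bonus score; all class and variable names are hypothetical.

```python
# Toy per-facet UCB selection; one bandit per facet, one arm chosen from each.
import math
import random

class FacetUCB:
    def __init__(self, num_arms: int, c: float = 1.0):
        self.counts = [0] * num_arms
        self.means = [0.0] * num_arms
        self.c = c
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:                      # play every arm once first
                return arm
        ucb = [m + self.c * math.sqrt(math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=lambda a: ucb[a])

    def update(self, arm: int, reward: float):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

facets = [FacetUCB(num_arms=3) for _ in range(2)]
for _ in range(1000):
    chosen = [b.select() for b in facets]
    reward = sum(random.gauss(0.1 * a, 0.05) for a in chosen)  # toy joint reward
    for bandit, arm in zip(facets, chosen):
        bandit.update(arm, reward)
```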

Partial Label Dimensionality Reduction via Confidence-Based Dependence Maximization

Partial label learning deals with training examples each associated with a set of candidate labels, among which only one is valid. Most existing works focus on manipulating the label space by estimating the labeling confidences of candidate labels, while the task of manipulating the feature space by dimensionality reduction has rarely been investigated. In this paper, a novel partial label dimensionality reduction approach named CENDA is proposed via confidence-based dependence maximization. Specifically, CENDA adapts the Hilbert-Schmidt Independence Criterion (HSIC) to help identify the projection matrix, where the dependence between projected feature information and confidence-based labeling information is maximized iteratively. In each iteration, the projection matrix admits a closed-form solution obtained by solving a tailored generalized eigenvalue problem, while the labeling confidences of candidate labels are updated by conducting kNN aggregation in the projected feature space. Extensive experiments over a broad range of benchmark data sets show that the predictive performance of well-established partial label learning algorithms can be significantly improved by coupling with the proposed dimensionality reduction approach.
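
A minimal numpy sketch of the HSIC-style projection step such an approach relies on, assuming linear kernels and a plain (rather than generalized) eigenvalue problem; the paper's exact formulation, constraints, and iterative confidence updates differ, and the function name is hypothetical.

```python
# Sketch of a linear-kernel HSIC-maximizing projection between features and
# labeling confidences (illustrative; not CENDA's exact optimization).
import numpy as np

def hsic_projection(X: np.ndarray, F: np.ndarray, dim: int) -> np.ndarray:
    """X: (n, d) features, F: (n, q) labeling-confidence matrix.
    Returns a (d, dim) projection maximizing a linear-kernel HSIC surrogate."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    L = F @ F.T                                # linear kernel on confidences
    M = X.T @ H @ L @ H @ X                    # HSIC objective matrix
    eigvals, eigvecs = np.linalg.eigh(M)       # symmetric, so eigh
    top = np.argsort(eigvals)[::-1][:dim]      # directions of largest dependence
    return eigvecs[:, top]

# toy usage
X = np.random.randn(200, 20)
F = np.random.rand(200, 5); F /= F.sum(axis=1, keepdims=True)
P = hsic_projection(X, F, dim=3)
Z = X @ P                                      # projected features for a kNN step
```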

Uplift Modeling with Generalization Guarantees

In this paper, we consider the task of ranking individuals based on the potential benefit of being "treated" (e.g., by a drug or exposure to recommendations or ads), referred to as Uplift Modeling in the literature. This task has gained a surge of interest in recent years and is found in many applications such as personalized medicine, recommender systems, and targeted advertising. In real-life scenarios, the capacity of models to rank individuals by potential benefit is measured by the Area Under the Uplift Curve (AUUC), a ranking metric related to the well-known Area Under the ROC Curve. When the objective function used to learn model parameters differs from AUUC, the capacity of the resulting system to generalize on AUUC is limited. To tackle this issue, we propose to learn a model that directly optimizes an upper bound on AUUC. To find such a model, we first develop a generalization bound on AUUC and then derive from it a learning objective called AUUC-max, usable with linear and deep models. We empirically study the tightness of this generalization bound, its effectiveness for hyperparameter tuning, and show the efficiency of the proposed learning objective compared to a wide range of competitive baselines on two classical uplift modeling benchmarks using real-world datasets.

Fast One-class Classification using Class Boundary-preserving Random Projections

Several applications, like malicious URL detection and web spam detection, require classification on very high-dimensional data. In such cases anomalous data is hard to find but normal data is easily available, so it is increasingly common to use a one-class classifier (OCC). Unfortunately, most OCC algorithms cannot scale to datasets with extremely high dimensions. In this paper, we present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient, scalable and easily parallelizable method for one-class classification with provable theoretical guarantees. Our method is based on the simple idea of transforming the training data by projecting it onto a set of random unit vectors that are chosen uniformly and independently from the unit sphere, and bounding the regions based on separation of the data. FROCC can be naturally extended with kernels. We provide a new theoretical framework to prove that FROCC generalizes well in the sense that it is stable and has low bias for some parameter settings. We then develop a fast scalable approximation of FROCC using vectorization, exploiting data sparsity and parallelism to develop a new implementation called ParDFROCC. ParDFROCC achieves up to 2 percentage points better ROC than the next best baseline, with up to 12× speedup in training and test times over a range of state-of-the-art benchmarks for the OCC task.
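
A minimal sketch of the core projection-and-bounding idea, assuming a single interval per random direction; the actual method uses ε-separated sub-intervals, kernel extensions, and the vectorized, sparsity-exploiting implementation described above, and the class name here is hypothetical.

```python
# Toy one-class classifier: keep the occupied interval of the normal training
# data along each random unit direction, then score test points by how many
# directions they fall inside (illustrative; not FROCC/ParDFROCC itself).
import numpy as np

class SimpleFROCC:
    def __init__(self, num_projections: int = 100, seed: int = 0):
        self.m = num_projections
        self.rng = np.random.default_rng(seed)

    def fit(self, X: np.ndarray):
        d = X.shape[1]
        W = self.rng.normal(size=(d, self.m))
        self.W = W / np.linalg.norm(W, axis=0)      # random unit vectors
        Z = X @ self.W                              # (n, m) projections
        self.lo, self.hi = Z.min(axis=0), Z.max(axis=0)
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        """Fraction of directions in which a point falls inside the training interval."""
        Z = X @ self.W
        inside = (Z >= self.lo) & (Z <= self.hi)
        return inside.mean(axis=1)                  # 1.0 = looks normal

X_train = np.random.randn(1000, 50)
clf = SimpleFROCC().fit(X_train)
scores = clf.score(np.random.randn(10, 50) * 3)     # far-away points score lower
```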

Causal Models for Real Time Bidding with Repeated User Interactions

A large portion of online advertising displays are sold through an auction mechanism called Real Time Bidding (RTB). Each auction corresponds to a display opportunity, for which the competing advertisers need to precisely estimate the economic value in order to bid accordingly. This estimate is typically taken as the advertiser's payoff for the target event -- such as a purchase on the merchant website attributed to this display -- times the estimated probability of this event. However, this greedy approach is too naive when several displays are shown to the same user. The purpose of the present paper is to discuss how such an estimation should be made when a user has already been shown one or more displays. Intuitively, while a user is more likely to make a purchase as the number of displays increases, the marginal effect of each display is expected to be decreasing. In this work, we first frame this bidding problem with repeated user interactions by using causal models to value each display individually. Then, based on this approach, we introduce a simple rule to improve the value estimate. This change shows both interesting qualitative properties that follow our previous intuition as well as quantitative improvements on a public data set and online in a production environment.

Aggregating Complex Annotations via Merging and Matching

Human annotations are critical for training and evaluating supervised learning models, yet annotators often disagree with one another, especially as annotation tasks increase in complexity. A common strategy to improve label quality is to ask multiple annotators to label the same item and then aggregate their labels. While many aggregation models have been proposed for simple annotation tasks, how can we reason about and resolve annotator disagreement for more complex annotation tasks (e.g., continuous, structured, or high-dimensional), without needing to devise a new aggregation model for every different complex annotation task? We address two distinct challenges in this work. Firstly, how can a general aggregation model support merging of complex labels across diverse annotation tasks? Secondly, for multi-object annotation tasks that require annotators to provide multiple labels for each item being annotated (e.g., labeling named-entities in a text or visual entities in an image), how do we match which annotator label refers to which entity, such that only matching labels are aggregated across annotators? Using general constructs for merging and matching, our model not only supports diverse tasks, but delivers equal or better results than prior aggregation models: general and task-specific.

How Interpretable and Trustworthy are GAMs?

Generalized additive models (GAMs) have become a leading model class for interpretable machine learning. However, there are many algorithms for training GAMs, and these can learn different or even contradictory models, while being equally accurate. Which GAM should we trust? In this paper, we quantitatively and qualitatively investigate a variety of GAM algorithms on real and simulated datasets. We find that GAMs with high feature sparsity (only using a few variables to make predictions) can miss patterns in the data and be unfair to rare subpopulations. Our results suggest that inductive bias plays a crucial role in what interpretable models learn and that tree-based GAMs represent the best balance of sparsity, fidelity and accuracy and thus appear to be the most trustworthy GAM models.

Graph Deep Factors for Forecasting with Applications to Cloud Resource Allocation

Deep probabilistic forecasting techniques have recently been proposed for modeling large collections of time-series. However, these techniques explicitly assume either complete independence (local model) or complete dependence (global model) between time-series in the collection. This corresponds to the two extreme cases where every time-series is disconnected from every other time-series in the collection or likewise, that every time-series is related to every other time-series resulting in a completely connected graph. In this work, we propose a deep hybrid probabilistic graph-based forecasting framework called Graph Deep Factors (GraphDF) that goes beyond these two extremes by allowing nodes and their time-series to be connected to others in an arbitrary fashion. GraphDF is a hybrid forecasting framework that consists of a relational global and relational local model. In particular, we propose a relational global model that learns complex non-linear time-series patterns globally using the structure of the graph to improve both forecasting accuracy and computational efficiency. Similarly, instead of modeling every time-series independently, we learn a relational local model that not only considers its individual time-series but also the time-series of nodes that are connected in the graph. The experiments demonstrate the effectiveness of the proposed deep hybrid graph-based forecasting model compared to the state-of-the-art methods in terms of its forecasting accuracy, runtime, and scalability. Our case study reveals that GraphDF can successfully generate cloud usage forecasts and opportunistically schedule workloads to increase cloud cluster utilization by 47.5% on average.

On Breaking Truss-Based Communities

A k-truss is a graph such that each edge is contained in at least k-2 triangles. This notion has attracted much attention, because it models meaningful cohesive subgraphs of a graph. We introduce the problem of identifying a smallest edge subset of a given graph whose removal makes the graph k-truss-free. We also introduce a problem variant where the identified subset contains only edges incident to a given set of nodes and ensures that these nodes are not contained in any k-truss. These problems are directly applicable in communication networks: the identified edges correspond to vital network connections; or in social networks: the identified edges can be hidden by users or sanitized from the output graph. We show that these problems are NP-hard. We thus develop exact exponential-time algorithms to solve them. To process large networks, we also develop heuristics sped up by an efficient data structure for updating the truss decomposition under edge deletions. We complement our heuristics with a lower bound on the size of an optimal solution to rigorously evaluate their effectiveness. Extensive experiments on 10 real-world graphs show that our heuristics are effective (close to the optimal or to the lower bound) and also efficient (up to two orders of magnitude faster than a natural baseline).
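
For reference, a short sketch of the standard peeling procedure that computes the k-truss of an undirected graph by repeatedly deleting edges contained in fewer than k-2 triangles; the paper studies the inverse problem of which few edges to delete so that no k-truss remains, so this is background rather than the paper's algorithm.

```python
# Compute the k-truss by iteratively peeling low-support edges (background sketch).
def k_truss(edges, k):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj.get(u, ())):
                if u < v:
                    support = len(adj[u] & adj[v])   # triangles through (u, v)
                    if support < k - 2:
                        adj[u].discard(v)
                        adj[v].discard(u)
                        changed = True
    return {(u, v) for u in adj for v in adj[u] if u < v}

# usage: a 4-clique forms a 4-truss; the pendant edge (3, 4) is peeled away
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(k_truss(edges, k=4))
```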

PAR-GAN: Improving the Generalization of Generative Adversarial Networks Against Membership Inference Attacks

Recent works have shown that Generative Adversarial Networks (GANs) may generalize poorly and thus are vulnerable to privacy attacks. In this paper, we seek to improve the generalization of GANs from a perspective of privacy protection, specifically in terms of defending against the membership inference attack (MIA) which aims to infer whether a particular sample was used for model training. We design a GAN framework, partition GAN (PAR-GAN), which consists of one generator and multiple discriminators trained over disjoint partitions of the training data. The key idea of PAR-GAN is to reduce the generalization gap by approximating a mixture distribution of all partitions of the training data. Our theoretical analysis shows that PAR-GAN can achieve global optimality just like the original GAN. Our experimental results on simulated data and multiple popular datasets demonstrate that PAR-GAN can improve the generalization of GANs while mitigating information leakage induced by MIA.

Learning Elastic Embeddings for Customizing On-Device Recommenders

Deploying data-driven services such as recommendation on edge devices instead of cloud servers has become increasingly attractive due to privacy and network latency concerns. A common practice in building compact on-device recommender systems is to compress their embeddings, which are normally the cause of excessive parameterization. However, despite the vast variety of devices and their associated memory constraints, existing memory-efficient recommender systems are only specialized for a fixed memory budget in every design and training life cycle, so a new model has to be retrained to obtain the optimal performance when adapting to a smaller or larger memory budget. In this paper, we present a novel lightweight recommendation paradigm that allows a well-trained recommender to be customized for arbitrary device-specific memory constraints without retraining. The core idea is to compose elastic embeddings for each item, where an elastic embedding is the concatenation of a set of embedding blocks that are carefully chosen by an automated search function. Correspondingly, we propose an innovative approach, namely recommendation with universally learned elastic embeddings (RULE). To ensure the expressiveness of all candidate embedding blocks, RULE enforces a diversity-driven regularization when learning different embedding blocks. Then, a performance estimator-based evolutionary search function is designed, allowing for efficient specialization of elastic embeddings under any memory constraint for on-device recommendation. Extensive experiments on real-world datasets reveal the superior performance of RULE under tight memory budgets.
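
A toy numpy sketch of the elastic-embedding idea, assuming the blocks have already been trained and the block subset has been chosen by some search procedure; array shapes and names are illustrative placeholders, not RULE's actual implementation.

```python
# Compose an item embedding by concatenating a chosen subset of shared blocks,
# so one trained model can meet different memory budgets (illustrative sketch).
import numpy as np

num_items, num_blocks, block_dim = 1000, 8, 4
blocks = np.random.randn(num_items, num_blocks, block_dim)   # stand-in for trained blocks

def compose_embedding(item_id: int, chosen_blocks):
    """Concatenate the chosen blocks (e.g., selected by the search) for one item."""
    return np.concatenate([blocks[item_id, b] for b in chosen_blocks])

# A tight memory budget might keep 2 blocks per item, a generous one all 8.
small = compose_embedding(42, chosen_blocks=[0, 3])        # 8-dim embedding
large = compose_embedding(42, chosen_blocks=range(8))      # 32-dim embedding
```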

Causal Understanding of Fake News Dissemination on Social Media

Recent years have witnessed remarkable progress towards computational fake news detection. To mitigate its negative impact, we argue that it is critical to understand what user attributes potentially cause users to share fake news. The key to this causal-inference problem is to identify confounders -- variables that cause spurious associations between treatments (e.g., user attributes) and outcome (e.g., user susceptibility). In fake news dissemination, confounders can be characterized by fake news sharing behavior that inherently relates to user attributes and online activities. Learning such user behavior is typically subject to selection bias in users who are susceptible to share news on social media. Drawing on causal inference theories, we first propose a principled approach to alleviating selection bias in fake news dissemination. We then consider the learned unbiased fake news sharing behavior as the surrogate confounder that can fully capture the causal links between user attributes and user susceptibility. We theoretically and empirically characterize the effectiveness of the proposed approach and find that it could be useful in protecting society from the perils of fake news.

Interpreting Internal Activation Patterns in Deep Temporal Neural Networks by Finding Prototypes

Deep neural networks have demonstrated competitive performance in classification tasks for sequential data. However, it remains difficult to understand which temporal patterns the internal channels of deep neural networks capture for decision-making in sequential data. To address this issue, we propose a new framework with which to visualize temporal representations learned in deep neural networks without hand-crafted segmentation labels. Given input data, our framework extracts highly activated temporal regions that contribute to activating internal nodes and characterizes such regions by a prototype selection method based on Maximum Mean Discrepancy. Representative temporal patterns, referred to here as Prototypes of Temporally Activated Patterns (PTAP), provide core examples of subsequences in the sequential data for interpretability. We also analyze the role of each channel with Value-LRP plots using representative prototypes and the distribution of the input attribution. Input attribution plots give visual information for recognizing the shapes that each channel focuses on for decision-making.

Improve Learning from Crowds via Generative Augmentation

Crowdsourcing provides an efficient label collection schema for supervised machine learning. However, to control annotation cost, each instance in the crowdsourced data is typically annotated by a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to directly learn a classifier by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, which is measured by a discriminator; 2) the generated annotations should have high mutual information with the ground-truth labels, which is measured by an auxiliary network. Extensive experiments and comparisons against an array of state-of-the-art learning-from-crowds methods on three real-world datasets demonstrate the effectiveness of our data augmentation framework and show the potential of our algorithm for low-budget crowdsourcing in general.

Graph Infomax Adversarial Learning for Treatment Effect Estimation with Networked Observational Data

Treatment effect estimation from observational data is a critical research topic across many domains. The foremost challenge in treatment effect estimation is how to capture hidden confounders. Recently, the growing availability of networked observational data offers a new opportunity to deal with the issue of hidden confounders. Unlike networked data in traditional graph learning tasks, such as node classification and link detection, the networked data under the causal inference problem has its particularity, i.e., imbalanced network structure. In this paper, we propose a Graph Infomax Adversarial Learning (GIAL) model for treatment effect estimation, which makes full use of the network structure to capture more information by recognizing the imbalance in network structure. We evaluate the performance of our GIAL model on two benchmark datasets, and the results demonstrate superiority over the state-of-the-art methods.

Graph Similarity Description: How Are These Graphs Similar?

How do social networks differ across platforms? How do information networks change over time? Answering questions like these requires us to compare two or more graphs. This task is commonly treated as a measurement problem, but numerical answers give limited insight. Here, we argue that if the goal is to gain understanding, we should treat graph similarity assessment as a description problem instead. We formalize this problem as a model selection task using the Minimum Description Length principle, capturing the similarity of the input graphs in a common model and the differences between them in transformations to individual models. To discover good models, we propose Momo, which breaks the problem into two parts and introduces efficient algorithms for each. Through an extensive set of experiments on a wide range of synthetic and real-world graphs, we confirm that Momo works well in practice.

Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages

We present Bavarian, a collection of sampling-based algorithms for approximating the Betweenness Centrality (BC) of all vertices in a graph. Our algorithms use Monte-Carlo Empirical Rademacher Averages (MCERAs), a concept from statistical learning theory, to efficiently compute tight bounds on the maximum deviation of the estimates from the exact values. The MCERAs provide a sample-dependent approximation guarantee much stronger than the state of the art, thanks to their use of variance-aware probabilistic tail bounds. The flexibility of the MCERAs allows us to introduce a unifying framework that can be instantiated with existing sampling-based estimators of BC, thus allowing a fair comparison between them, decoupled from the sample-complexity results with which they were originally introduced. Additionally, we prove novel sample-complexity results showing that, for all estimators, the sample size sufficient to achieve a desired approximation guarantee depends on the vertex-diameter of the graph, an easy-to-bound characteristic quantity. We also show progressive-sampling algorithms and extensions to other centrality measures, such as percolation centrality. Our extensive experimental evaluation of Bavarian shows the improvement over the state of the art made possible by the MCERAs, and it allows us to assess the different trade-offs between sample size and accuracy guarantee offered by the different estimators.

Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility

Bipartite ranking, which aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data, is widely adopted in various applications where sample prioritization is needed. Recently, there have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups defined by sensitive attributes. While there could be a trade-off between fairness and performance, in this paper we propose a model-agnostic post-processing framework for balancing them in the bipartite ranking scenario. Specifically, we maximize a weighted sum of the utility and fairness by directly adjusting the relative ordering of samples across groups. By formulating this problem as the identification of an optimal warping path across different protected groups, we propose a non-parametric method to search for such an optimal path through a dynamic programming process. Our method is compatible with various classification models and applicable to a variety of ranking fairness metrics. Comprehensive experiments on a suite of benchmark data sets and two real-world patient electronic health record repositories show that our method can achieve a good balance between algorithm utility and ranking fairness. Furthermore, we experimentally verify the robustness of our method when faced with fewer training samples and with differences between the training and testing ranking score distributions.

Labeled Data Generation with Inexact Supervision

Recent advances in deep learning have shown promising results in various domains such as computer vision and natural language processing. The success of deep neural networks in supervised learning heavily relies on a large amount of labeled data. However, obtaining data labeled with the target labels is often challenging due to factors such as labeling cost and privacy issues, which limits existing deep models. Nevertheless, it is relatively easy to obtain data with inexact supervision, i.e., with labels/tags related to the target task. For example, social media platforms are overwhelmed with billions of posts and images with self-customized tags, which are not the exact labels for target classification tasks but are usually related to the target labels. It is promising to leverage these tags (inexact supervision) and their relations with target classes to generate labeled data and facilitate downstream classification tasks. However, work on this problem is rather limited. Therefore, we study the novel problem of labeled data generation with inexact supervision. We propose a novel generative framework, named ADDES, which can synthesize high-quality labeled data for target classification tasks by learning from data with inexact supervision and the relations between inexact supervision and target classes. Experimental results on image and text datasets demonstrate the effectiveness of the proposed ADDES for generating realistic labeled data from inexact supervision to facilitate the target classification task.

NRGNN: Learning a Label Noise Resistant Graph Neural Network on Sparsely and Noisily Labeled Graphs

Graph Neural Networks (GNNs) have achieved promising results for semi-supervised learning tasks on graphs such as node classification. Despite the great success of GNNs, many real-world graphs are often sparsely and noisily labeled, which could significantly degrade the performance of GNNs, as the noisy information could propagate to unlabeled nodes via the graph structure. Thus, it is important to develop a label noise-resistant GNN for semi-supervised node classification. Though extensive studies have been conducted on learning neural networks with noisy labels, they mostly focus on independent and identically distributed data and assume a large number of noisy labels are available, making them not directly applicable to GNNs. Thus, we investigate a novel problem of learning a robust GNN with noisy and limited labels. To alleviate the negative effects of label noise, we propose to link the unlabeled nodes with labeled nodes of high feature similarity to bring in more clean label information. Furthermore, accurate pseudo labels can be obtained by this strategy to provide more supervision and further reduce the effects of label noise. Our theoretical and empirical analyses verify the effectiveness of these two strategies under mild conditions. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed method in learning a robust GNN with noisy and limited labels.

PID-GAN: A GAN Framework based on a Physics-informed Discriminator for Uncertainty Quantification with Physics

As applications of deep learning (DL) continue to seep into critical scientific use-cases, the importance of performing uncertainty quantification (UQ) with DL has become more pressing than ever before. In scientific applications, it is also important to inform the learning of DL models with knowledge of the physics of the problem to produce physically consistent and generalized solutions. This is referred to as the emerging field of physics-informed deep learning (PIDL). We consider the problem of developing PIDL formulations that can also perform UQ. To this end, we propose a novel physics-informed GAN architecture, termed PID-GAN, where the knowledge of physics is used to inform the learning of both the generator and discriminator models, making ample use of unlabeled data instances. We show that our proposed PID-GAN framework does not suffer from the imbalance of generator gradients from multiple loss terms that affects the state of the art. We also empirically demonstrate the efficacy of our proposed framework on a variety of case studies involving benchmark physics-based PDEs as well as imperfect physics. All the code and datasets used in this study have been made available at: https://github.com/arkadaw9/PID-GAN.

MiniRocket: A Very Fast (Almost) Deterministic Transform for Time Series Classification

Rocket achieves state-of-the-art accuracy for time series classification with a fraction of the computational expense of most existing methods by transforming input time series using random convolutional kernels, and using the transformed features to train a linear classifier. We reformulate Rocket into a new method, MiniRocket. MiniRocket is up to 75 times faster than Rocket on larger datasets, and almost deterministic (and optionally, fully deterministic), while maintaining essentially the same accuracy. Using this method, it is possible to train and test a classifier on all 109 datasets from the UCR archive to state-of-the-art accuracy in under 10 minutes. MiniRocket is significantly faster than any other method of comparable accuracy (including Rocket), and significantly more accurate than any other method of remotely similar computational expense.
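
As background, a toy numpy sketch of the Rocket-style transform that MiniRocket reformulates: convolve each series with random kernels and use the proportion of positive values (PPV) as features for a linear classifier. MiniRocket's (almost) deterministic kernels and its optimizations are not reproduced here, and the function names are illustrative.

```python
# Rocket-flavoured random-kernel transform producing PPV features (background sketch).
import numpy as np

def random_kernels(num_kernels=100, length=9, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(num_kernels, length))
    weights -= weights.mean(axis=1, keepdims=True)     # zero-mean kernels
    biases = rng.uniform(-1, 1, size=num_kernels)
    return weights, biases

def transform(X, weights, biases):
    """X: (n_series, series_length) -> (n_series, num_kernels) PPV features."""
    features = np.empty((len(X), len(weights)))
    for i, series in enumerate(X):
        for j, (w, b) in enumerate(zip(weights, biases)):
            conv = np.convolve(series, w, mode="valid") + b
            features[i, j] = (conv > 0).mean()          # proportion of positive values
    return features

X = np.random.randn(20, 150)
W, B = random_kernels()
features = transform(X, W, B)    # feed into e.g. a ridge-regression classifier
```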

Mutual Information Preserving Back-propagation: Learn to Invert for Faithful Attribution

Back-propagation based visualizations have been proposed to interpret deep neural networks (DNNs), some of which produce interpretations with good visual quality. However, there exist doubts about whether these intuitive visualizations are related to network decisions. Recent studies have confirmed this suspicion by verifying that almost all these modified back-propagation visualizations are not faithful to the model's decision-making process. Besides, these visualizations produce vague "relative importance scores", among which low values are not guaranteed to be independent of the final prediction. Hence, it is highly desirable to develop a novel back-propagation method that guarantees theoretical faithfulness and produces a quantitative attribution score with a clear interpretation. To achieve this goal, we resort to mutual information theory to generate the interpretations, studying how much information of the output is encoded in each input neuron. The basic idea is to learn a source signal by back-propagation such that the mutual information between input and output is preserved as much as possible in the mutual information between input and the source signal. In addition, we propose a Mutual Information Preserving Inverse Network, termed MIP-IN, in which the parameters of each layer are recursively trained to learn how to invert. During the inversion, a forward ReLU operation is adopted to adapt the general interpretations to the specific input. We then empirically demonstrate that the inverted source signal satisfies the completeness and minimality properties, which are crucial for a faithful interpretation. Furthermore, the empirical study validates the effectiveness of interpretations generated by MIP-IN.

ST-Norm: Spatial and Temporal Normalization for Multi-variate Time Series Forecasting

Multi-variate time series (MTS) data is a ubiquitous class of data abstraction in the real world. Any instance of MTS is generated from a hybrid dynamical system whose specific dynamics are usually unknown. The hybrid nature of such a dynamical system is a result of complex external impacts, which can be summarized as high-frequency and low-frequency from the temporal view, or global and local from the spatial view. These impacts also determine the forthcoming development of MTS, making them paramount to capture in a time series forecasting task. However, conventional methods face intrinsic difficulties in disentangling the components yielded by each kind of impact from the raw data. To this end, we propose two kinds of normalization modules -- temporal and spatial normalization -- which separately refine the high-frequency component and the local component underlying the raw data. Moreover, both modules can be readily integrated into canonical deep learning architectures such as Wavenet and Transformer. Extensive experiments on three datasets illustrate that, with the additional normalization modules, the performance of the canonical architectures is enhanced by a large margin on MTS forecasting, achieving state-of-the-art results compared with existing MTS models.
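
A rough numpy sketch of the two normalization directions described above, applied to a (series × time) array; the paper's modules are learnable layers integrated into Wavenet/Transformer backbones, so this only illustrates which axis each module normalizes over, and the function names are hypothetical.

```python
# Normalize each series over time (temporal view) vs. each time step across
# series (spatial view) -- illustrative only, not ST-Norm's learnable modules.
import numpy as np

def temporal_norm(x, eps=1e-5):
    """x: (num_series, num_steps). Normalize each series over time."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)      # highlights the high-frequency part

def spatial_norm(x, eps=1e-5):
    """Normalize each time step across series."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)      # highlights the local (per-series) part

x = np.random.randn(207, 288)            # e.g., 207 sensors, 288 five-minute steps
x_tn, x_sn = temporal_norm(x), spatial_norm(x)
```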

DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks

In this paper, we propose a novel framework to automatically utilize the task-dependent semantic information encoded in heterogeneous information networks (HINs). Specifically, we search for a meta graph, which can capture more complex semantic relations than a meta path, to determine how graph neural networks (GNNs) propagate messages along different types of edges. We formalize the problem within the framework of neural architecture search (NAS) and then perform the search in a differentiable manner. We design an expressive search space in the form of a directed acyclic graph (DAG) to represent candidate meta graphs for a HIN, and we propose a task-dependent type constraint to filter out those edge types along which message passing has no effect on the representations of nodes related to the downstream task. The size of the search space we define is huge, so we further propose a novel and efficient search algorithm to make the total search cost on a par with training a single GNN once. Compared with existing popular NAS algorithms, our proposed search algorithm improves the search efficiency. We conduct extensive experiments on different HINs and downstream tasks to evaluate our method, and the experimental results show that our method can outperform state-of-the-art heterogeneous GNNs and also improve efficiency compared with methods that implicitly learn meta paths.

Global Neighbor Sampling for Mixed CPU-GPU Training on Giant Graphs

Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in various applications such as social network recommendation, fraud detection, and graph search. The graphs in these applications are typically large, usually containing hundreds of millions of nodes. Training GNN models on such large graphs efficiently remains a big challenge. Although a number of sampling-based methods have been proposed to enable mini-batch training on large graphs, these methods have not been proven to work on truly industry-scale graphs, which require GPUs or mixed CPU-GPU training. The state-of-the-art sampling-based methods are usually not optimized for these real-world hardware setups, in which data movement between CPUs and GPUs is a bottleneck. To address this issue, we propose Global Neighborhood Sampling, which aims at training GNNs on giant graphs specifically for mixed CPU-GPU training. The algorithm periodically samples a global cache of nodes for all mini-batches and stores them in GPUs. This global cache allows in-GPU importance sampling of mini-batches, which drastically reduces the number of nodes in a mini-batch, especially in the input layer, reducing data copy between CPU and GPU and mini-batch computation without compromising the training convergence rate or model accuracy. We provide a highly efficient implementation of this method and show that our implementation outperforms an efficient node-wise neighbor sampling baseline by a factor of 2× ~ 4× on giant graphs. It outperforms an efficient implementation of LADIES with small layers by a factor of 2× ~ 14× while achieving much higher accuracy than LADIES. We also theoretically analyze the proposed algorithm and show that with cached node data of a proper size, it enjoys a convergence rate comparable to the underlying node-wise sampling method.

Individual Fairness for Graph Neural Networks: A Ranking based Approach

Recent years have witnessed the pivotal role of Graph Neural Networks (GNNs) in various high-stake decision-making scenarios due to their superior learning capability. Close on the heels of the successful adoption of GNNs in different application domains has been the increasing societal concern that conventional GNNs often do not have fairness considerations. Although some research progress has been made to improve the fairness of GNNs, these works mainly focus on the notion of group fairness regarding different subgroups defined by a protected attribute such as gender, age, or race. Beyond that, it is also essential to study GNN fairness at a much finer granularity (i.e., at the node level) to ensure that GNNs render similar prediction results for similar individuals, achieving the notion of individual fairness. Toward this goal, in this paper, we make an initial investigation to enhance the individual fairness of GNNs and propose a novel ranking based framework---REDRESS. Specifically, we refine the notion of individual fairness from a ranking perspective, and formulate the ranking based individual fairness promotion problem. This naturally addresses the issues of Lipschitz constant specification and distance calibration resulting from the Lipschitz condition in the conventional individual fairness definition. Our proposed framework REDRESS encapsulates the GNN model utility maximization and the ranking-based individual fairness promotion in a joint framework to enable end-to-end training. It is worth mentioning that REDRESS is a plug-and-play framework and can be easily generalized to any prevalent GNN architecture. Extensive experiments on multiple real-world graphs demonstrate the superiority of REDRESS in achieving a good balance between model utility maximization and individual fairness promotion. Our open source code can be found here: https://github.com/yushundong/REDRESS.

Sylvester Tensor Equation for Multi-Way Association

How can we identify the same or similar users from a collection of social network platforms (e.g., Facebook, Twitter, LinkedIn, etc.)? Which restaurant shall we recommend to a given user at the right time at the right location? Given a disease, which genes and drugs are most relevant? Multi-way association, which identifies strongly correlated node sets from multiple input networks, is the key to answering these questions. Despite its importance, very few multi-way association methods exist due to its high complexity. In this paper, we formulate multi-way association as a convex optimization problem, whose optimal solution can be obtained by a Sylvester tensor equation. Furthermore, we propose two fast algorithms to solve the Sylvester tensor equation, with a linear time and space complexity. We further provide theoretic analysis in terms of the sensitivity of the Sylvester tensor equation solution. Empirical evaluations demonstrate the efficacy of the proposed method.
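
As background for the two-network special case (not the paper's multi-way algorithm), cross-network association can be posed as a matrix Sylvester equation AX + XB = C, which SciPy solves directly; the Sylvester tensor equation in the paper generalizes this to more than two input networks. The matrices below are random placeholders.

```python
# Solve the two-network analogue A X + X B = C with SciPy (background sketch).
import numpy as np
from scipy.linalg import solve_sylvester

n1, n2 = 50, 60
A = np.random.randn(n1, n1); A = (A + A.T) / 2      # stand-in "network" matrices
B = np.random.randn(n2, n2); B = (B + B.T) / 2
C = np.random.randn(n1, n2)                          # prior cross-network scores

X = solve_sylvester(A, B, C)                         # cross-network association scores
print(np.linalg.norm(A @ X + X @ B - C))             # residual should be ~0
```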

TabularNet: A Neural Network Architecture for Understanding Semantic Structures of Tabular Data

Tabular data are ubiquitous owing to the widespread use of tables, and hence have attracted researchers seeking to extract the underlying information. One of the critical problems in mining tabular data is how to understand their inherent semantic structures automatically. Existing studies typically adopt Convolutional Neural Networks (CNNs) to model the spatial information of tabular structures yet ignore more diverse relational information between cells, such as hierarchical and paratactic relationships. To simultaneously extract spatial and relational information from tables, we propose a novel neural network architecture, TabularNet. The spatial encoder of TabularNet utilizes row/column-level pooling and a Bidirectional Gated Recurrent Unit (Bi-GRU) to capture statistical information and local positional correlation, respectively. For relational information, we design a new graph construction method based on the WordNet tree and adopt a Graph Convolutional Network (GCN) based encoder that focuses on the hierarchical and paratactic relationships between cells. Our neural network architecture can serve as a unified neural backbone for different understanding tasks and can be utilized in a multitask scenario. We conduct extensive experiments on three classification tasks with two real-world spreadsheet data sets, and the results demonstrate the effectiveness of our proposed TabularNet over state-of-the-art baselines.

When Comparing to Ground Truth is Wrong: On Evaluating GNN Explanation Methods

We study the evaluation of graph explanation methods. The state of the art to evaluate explanation methods is to first train a GNN, then generate explanations, and finally compare those explanations with the ground truth. We show five pitfalls that sabotage this pipeline because the GNN does not use the ground-truth edges. Thus, the explanation method cannot detect the ground truth. We propose three novel benchmarks: (i) pattern detection, (ii) community detection, and (iii) handling negative evidence and gradient saturation. In a re-evaluation of state-of-the-art explanation methods, we show paths for improving existing methods and highlight further paths for GNN explanation research.

Large-Scale Subspace Clustering via k-Factorization

Subspace clustering (SC) aims to cluster data lying in a union of low-dimensional subspaces. Usually, SC learns an affinity matrix and then performs spectral clustering. Both steps suffer from high time and space complexity, which leads to difficulty in clustering large datasets. This paper presents a method called k-Factorization Subspace Clustering (k-FSC) for large-scale subspace clustering. K-FSC directly factorizes the data into k groups via pursuing structured sparsity in the matrix factorization model. Thus, k-FSC avoids learning affinity matrix and performing eigenvalue decomposition, and has low (linear) time and space complexity on large datasets. This paper proves the effectiveness of the k-FSC model theoretically. An efficient algorithm with convergence guarantee is proposed to solve the optimization of k-FSC. In addition, k-FSC is able to handle sparse noise, outliers, and missing data, which are pervasive in real applications. This paper also provides online extension and out-of-sample extension for k-FSC to handle streaming data and cluster arbitrarily large datasets. Extensive experiments on large-scale real datasets show that k-FSC and its extensions outperform state-of-the-art methods of subspace clustering.

Gaussian Process with Graph Convolutional Kernel for Relational Learning

Gaussian Processes (GPs) offer a principled non-parametric framework for learning stochastic functions. The generalization capability of GPs depends heavily on the kernel function, which implicitly imposes the smoothness assumptions of the data. However, common feature-based kernel functions are inefficient at modeling relational data, where the smoothness assumptions implied by the kernels are violated. To model complex and non-differentiable functions over relational data, we propose a novel Graph Convolutional Kernel, which makes it possible to incorporate relational structures into feature-based kernels to capture the statistical structure of the data. To validate the effectiveness of the proposed kernel function in modeling relational data, we introduce GP models with the Graph Convolutional Kernel in two relational learning settings, i.e., the unsupervised setting of link prediction and the semi-supervised setting of object classification. The parameters of our GP models are optimized through the scalable variational inducing point method. However, the highly structured likelihood objective requires dense sampling from variational distributions, which is costly and makes its optimization challenging in the unsupervised setting. To tackle this challenge, we propose a Local Neighbor Sampling technique with provably more efficient computational complexity. Experimental results on real-world datasets demonstrate that our model achieves state-of-the-art performance in two relational learning tasks.

Spatial-Temporal Graph ODE Networks for Traffic Flow Forecasting

Spatial-temporal forecasting has attracted tremendous attention in a wide range of applications, and traffic flow prediction is a canonical and typical example. The complex and long-range spatial-temporal correlations of traffic flow make it a highly intractable challenge. Existing works typically utilize shallow graph convolution networks (GNNs) and temporal extracting modules to model spatial and temporal dependencies respectively. However, the representation ability of such models is limited because: (1) shallow GNNs are incapable of capturing long-range spatial correlations, and (2) only spatial connections are considered while a mass of semantic connections are ignored, which are of great importance for a comprehensive understanding of traffic networks. To this end, we propose Spatial-Temporal Graph Ordinary Differential Equation Networks (STGODE). Specifically, we capture spatial-temporal dynamics through a tensor-based ordinary differential equation (ODE), as a result of which deeper networks can be constructed and spatial-temporal features are utilized synchronously. To understand the network more comprehensively, a semantic adjacency matrix is considered in our model, and a well-designed temporal dilated convolution structure is used to capture long-term temporal dependencies. We evaluate our model on multiple real-world traffic datasets, and superior performance is achieved over state-of-the-art baselines.

Multiple-Instance Learning from Similar and Dissimilar Bags

Multiple-instance learning (MIL) is an important weakly supervised binary classification problem, where training instances are arranged in bags, and each bag is assigned a positive or negative label. Most of the previous studies on MIL assume that training bags are fully labeled. However, in some real-world scenarios, it could be difficult to collect fully labeled bags, due to the time and labor cost of the labeling task. Fortunately, it could be much easier for us to collect similar and dissimilar bags (indicating whether two bags share the same label or not), because in this case we do not need to figure out the underlying label of each bag. Therefore, in this paper, we investigate for the first time MIL from only similar and dissimilar bags. To solve this new MIL problem, we propose a convex formulation to train a bag-level classifier based on empirical risk minimization and theoretically derive a generalization error bound. In addition, we also propose a strong baseline for this new MIL problem, which aims to train an instance-level classifier by minimizing the instance-level empirical risk. Extensive experimental results clearly demonstrate that our proposed baseline works well, while our proposed convex formulation is even better.

Differentiable Pattern Set Mining

Pattern set mining has been successful in discovering small sets of highly informative and useful patterns from data. To find good models, existing methods heuristically explore the twice-exponential search space over all possible pattern sets in a combinatorial way, which limits them to data with at most hundreds of features and makes them likely to get stuck in local minima. Here, we propose a gradient-based optimization approach that allows us to efficiently discover high-quality pattern sets from data with millions of rows and hundreds of thousands of features.

In particular, we propose a novel type of neural autoencoder called BinaPs, which uses binary activations and binarizes its weights in each forward pass, making them directly interpretable as conjunctive patterns. For training, continuous versions of the weights are learned in small, noisy steps by optimizing a data-sparsity-aware reconstruction loss. This formulation provides a link between the discrete search space and continuous optimization, thus allowing for a gradient-based strategy to discover sets of high-quality and noise-robust patterns. Through extensive experiments on both synthetic and real-world data, we show that BinaPs discovers high-quality and noise-robust patterns and, unique among all competitors, easily scales to data such as supermarket transactions or biological variant calls.

ProgRPGAN: Progressive GAN for Route Planning

Learning to route has received significant research momentum as a new approach for the route planning problem in intelligent transportation systems. By exploring global knowledge of geographical areas and topological structures of road networks to facilitate route planning, in this work, we propose a novel Generative Adversarial Network (GAN) framework, namely Progressive Route Planning GAN (ProgRPGAN), for route planning in road networks. The novelty of ProgRPGAN lies in the following aspects: 1) we propose to plan a route with levels of increasing map resolution, starting on a low-resolution grid map, gradually refining it on higher-resolution grid maps, and eventually on the road network, in order to progressively generate various realistic paths; 2) we propose to transfer parameters of the previous-level generator and discriminator to the subsequent generator and discriminator for parameter initialization in order to improve the efficiency and stability of model learning; and 3) we propose to pre-train embeddings of grid cells in grid maps and intersections in the road network by capturing the network topology and external factors to facilitate effective model learning. Empirical results show that ProgRPGAN soundly outperforms the state-of-the-art learning-to-route methods, especially for long routes, by 9.46% to 13.02% in F1-measure on multiple large-scale real-world datasets. ProgRPGAN, moreover, effectively generates various realistic routes for the same query.

Probabilistic and Dynamic Molecule-Disease Interaction Modeling for Drug Discovery

Drug discovery aims at finding promising drug molecules for treating target diseases. Existing computational drug discovery methods mainly depend on molecule databases, ignoring valuable data collected from clinical trials. In this work, we propose PRIME to leverage high-quality drug molecules and drug-disease relations in historical clinical trials to narrow down the molecular search space in drug discovery. PRIME also introduces time dependency constraints to model evolving drug-disease relations using a probabilistic deep learning model that can quantify model uncertainty. We evaluated PRIME against leading models on both de novo design and drug repurposing tasks. Results show that compared with the best baselines, PRIME achieves 25.9% relative improvement (i.e., reduction) in average hit-ranking on drug repurposing and 47.6% relative improvement in success rate on de novo design.

Efficient Data-specific Model Search for Collaborative Filtering

Collaborative filtering (CF), as a fundamental approach for recommender systems, is usually built on the latent factor model with learnable parameters to predict users' preferences towards items. However, designing a proper CF model for a given dataset is not easy, since the properties of datasets are highly diverse. In this paper, motivated by the recent advances in automated machine learning (AutoML), we propose to design a data-specific CF model with AutoML techniques. The key here is a new framework that unifies state-of-the-art (SOTA) CF methods and splits them into disjoint stages of input encoding, embedding function, interaction function, and prediction function. We further develop an easy-to-use, robust, and efficient search strategy, which utilizes random search and a performance predictor for efficient searching within the above framework. In this way, we can combinatorially generalize data-specific CF models, which have not been explored in the literature, from SOTA ones. Extensive experiments on five real-world datasets demonstrate that our method can consistently outperform SOTA ones for various CF tasks. Further experiments verify the rationality of the proposed framework and the efficiency of the search strategy. The searched CF models can also provide insights for exploring more effective methods in the future.

Unsupervised Graph Alignment with Wasserstein Distance Discriminator

Graph alignment aims to identify node correspondence across multiple graphs, with significant implications in various domains. As supervision information is often not available, unsupervised methods have attracted a surge of research interest recently. Most existing unsupervised methods assume that corresponding nodes should have similar local structure, which, however, often does not hold. Meanwhile, rich node attributes are often available and have been shown to be effective in alleviating the above local topology inconsistency issue. Motivated by the success of graph convolutional networks (GCNs) in fusing network structure and node attributes for various learning tasks, we aim to tackle the graph alignment problem on the basis of GCNs. However, directly grafting GCNs onto graph alignment is often infeasible due to multi-faceted challenges. To bridge the gap, we propose a novel unsupervised graph alignment framework, WAlign. We first develop a lightweight GCN architecture to capture both local and global graph patterns and their inherent correlations with node attributes. Then we prove that in the embedding space, obtaining optimal alignment results is equivalent to minimizing the Wasserstein distance between embeddings of nodes from different graphs. Towards this, we propose a novel Wasserstein distance discriminator to identify candidate node correspondence pairs for updating node embeddings. The whole process acts like a two-player game, and in the end, we obtain discriminative embeddings that are suitable for the alignment task. Extensive experiments on both synthetic and real-world datasets validate the effectiveness and efficiency of the proposed framework WAlign.

Maxmin-Fair Ranking: Individual Fairness under Group-Fairness Constraints

We study a novel problem of fairness in ranking aimed at minimizing the amount of individual unfairness introduced when enforcing group-fairness constraints. Our proposal is rooted in the distributional maxmin fairness theory, which uses randomization to maximize the expected satisfaction of the worst-off individuals. We devise an exact polynomial-time algorithm to find maxmin-fair distributions of general search problems (including, but not limited to, ranking), and show that our algorithm can produce rankings which, while satisfying the given group-fairness constraints, ensure the maximum possible expected value to the worst-off individuals.

Boosted Second Price Auctions: Revenue Optimization for Heterogeneous Bidders

The second price auction has been the prevalent auction format used by advertising exchanges because of its simplicity and desirable incentive properties. However, even with an optimized choice of reserve prices, this auction is not revenue optimal when the bidders are heterogeneous and their valuation distributions differ significantly. In order to optimize the revenue of advertising exchanges, we propose an auction format called the boosted second price auction, which assigns a boost value to each bidder. The auction favors bidders with higher boost values and allocates the item to the bidder with the highest boosted bid. We propose a data-driven approach to optimize boost values using the previous bids of the bidders. Our analysis of auction data from Google's online advertising exchange shows that the boosted second price auction with data-optimized boost values outperforms the second price auction and empirical Myerson auction by up to 6% and 3%, respectively.
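
To make the mechanism concrete, here is a minimal sketch of the allocation rule together with one standard payment rule (the winner pays the smallest bid with which it would still have won); the boost values below are placeholders, not the data-optimized ones described in the abstract:

```python
def boosted_second_price(bids, boosts, reserve=0.0):
    """Allocate to the highest boosted bid; charge the minimum bid the
    winner could have submitted and still won (a common payment rule)."""
    boosted = [b * w for b, w in zip(bids, boosts)]
    winner = max(range(len(bids)), key=lambda i: boosted[i])
    if bids[winner] < reserve:
        return None, 0.0                       # no sale below the reserve
    runner_up = max([v for i, v in enumerate(boosted) if i != winner], default=0.0)
    payment = max(runner_up / boosts[winner], reserve)
    return winner, min(payment, bids[winner])  # never charge above the bid

# two heterogeneous bidders: bidder 1 bids less but has a larger boost
print(boosted_second_price(bids=[10.0, 8.0], boosts=[1.0, 1.5]))  # (1, 6.666...)
```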

Meaning Error Rate: ASR domain-specific metric framework

Speech recognition has become a popular task during the last decade. Automatic speech recognition (ASR) systems are used in many fields: virtual assistants, call-center automation, device speech interfaces, etc. Each application defines its own measure of quality, and an improvement in one domain could lead to a loss of recognition quality in another domain. For ASR services open to the public, it is essential to provide reasonable quality for all customers in their scenarios. State-of-the-art metrics currently do not fit this purpose well, as they do not adapt to domain specifics. In our work, we build a speech recognition quality evaluation framework that unifies feedback coming from different types of customers into a single metric. For this purpose, we collect feedback from customers, train a new dedicated metric for each customer based on their feedback, and finally aggregate these metrics into a single criterion of quality. The resulting metrics have two significant properties: they compare recognition quality across different domains, and their results are easy to interpret.

Towards Computing a Near-Maximum Weighted Independent Set on Massive Graphs

The vertices in many graphs are weighted unequally in real scenarios, but previous studies on the maximum independent set (MIS) ignore the weights of vertices; therefore, the weight of an MIS may not necessarily be the largest. In this paper, we study the maximum weighted independent set (MWIS) problem, which asks for the independent set of vertices with the largest total weight. Since it is intractable to deliver the exact solution for large graphs, we design a reducing and tie-breaking framework to compute a near-maximum weighted independent set. The reduction rules are critical to reduce the search space for both exact and greedy algorithms, as they determine the vertices that are definitely (or definitely not) in the MWIS while preserving the correctness of solutions. We devise a set of novel reductions, including low-degree reductions and high-degree reductions, for general weighted graphs. Extensive experimental studies over real graphs confirm that our proposed method significantly outperforms the state of the art in terms of both effectiveness and efficiency.
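
For contrast with the paper's reduction-based framework, a generic greedy baseline (not the proposed method) repeatedly selects the remaining vertex with the best weight-to-degree ratio and deletes its neighbors:

```python
def greedy_mwis(adj, weight):
    """Greedy near-maximum weighted independent set.
    adj: dict node -> set of neighbors, weight: dict node -> float."""
    adj = {u: set(vs) for u, vs in adj.items()}
    alive = set(adj)
    solution = []
    while alive:
        # pick the vertex maximizing weight / (remaining degree + 1)
        u = max(alive, key=lambda v: weight[v] / (len(adj[v] & alive) + 1))
        solution.append(u)
        alive -= {u} | (adj[u] & alive)        # remove u and its neighbors
    return solution, sum(weight[u] for u in solution)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
weight = {1: 2.0, 2: 3.0, 3: 10.0, 4: 1.0}
print(greedy_mwis(adj, weight))   # ([3], 10.0)
```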

UCPhrase: Unsupervised Context-aware Quality Phrase Tagging

Identifying and understanding quality phrases from context is a fundamental task in text mining. The most challenging part of this task arguably lies in uncommon, emerging, and domain-specific phrases. The infrequent nature of these phrases significantly hurts the performance of phrase mining methods that rely on sufficient phrase occurrences in the input corpus. Context-aware tagging models, though not restricted by frequency, heavily rely on domain experts for either massive sentence-level gold labels or handcrafted gazetteers. In this work, we propose UCPhrase, a novel unsupervised context-aware quality phrase tagger. Specifically, we induce high-quality phrase spans as silver labels from consistently co-occurring word sequences within each document. Compared with typical context-agnostic distant supervision based on existing knowledge bases (KBs), our silver labels root deeply in the input domain and context, thus having unique advantages in preserving contextual completeness and capturing emerging, out-of-KB phrases. Training a conventional neural tagger based on silver labels usually faces the risk of overfitting phrase surface names. Alternatively, we observe that the contextualized attention maps generated from a transformer-based neural language model effectively reveal the connections between words in a surface-agnostic way. Therefore, we pair such attention maps with the silver labels to train a lightweight span prediction model, which can be applied to new input to recognize (unseen) quality phrases regardless of their surface names or frequency. Thorough experiments on various tasks and datasets, including corpus-level phrase ranking, document-level keyphrase extraction, and sentence-level phrase tagging, demonstrate the superiority of our design over state-of-the-art pre-trained, unsupervised, and distantly supervised methods.

Learning from Imbalanced and Incomplete Supervision with Its Application to Ride-Sharing Liability Judgment

In multi-label tasks, sufficient and class-balanced labels are usually hard to obtain, which makes it challenging to train a good classifier. In this paper, we consider the problem of learning from imbalanced and incomplete supervision, where only a small subset of labeled data is available and the label distribution is highly imbalanced. This setting is important and commonly appears in a variety of real applications. For instance, in the ride-sharing liability judgment task, liability disputes usually arise for a variety of reasons; however, it is expensive to manually annotate the reasons, and the distribution of reasons is often seriously imbalanced. In this paper, we present a systematic framework, Limi, consisting of three sub-steps: Label Separating, Correlation Mining, and Label Completion. Specifically, we propose an effective two-classifier strategy to separately tackle head and tail labels so as to alleviate the performance degradation on tail labels while maintaining high performance on head labels. Then, a novel label correlation network is adopted to explore label relation knowledge with flexible aggregators. Moreover, the Limi framework completes the labels of unlabeled instances in a semi-supervised fashion. The framework is general, flexible, and effective. Extensive experiments on diverse applications, such as the ride-sharing liability judgment task from Didi and various benchmark tasks, demonstrate that our solution is clearly better than many competitive methods.

Dual Graph enhanced Embedding Neural Network for CTR Prediction

CTR prediction, which aims to estimate the probability that a user will click an item, plays a crucial role in online advertising and recommender systems. Methods based on feature interaction modeling and on user interest mining are the two most popular kinds of techniques; they have been extensively explored for many years and have made great progress for CTR prediction. However, (1) feature-interaction-based methods, which rely heavily on the co-occurrence of different features, may suffer from the feature sparsity problem (i.e., many features appear few times); (2) user-interest-mining-based methods, which need rich user behaviors to obtain a user's diverse interests, easily encounter the behavior sparsity problem (i.e., many users have very short behavior sequences). To solve these problems, we propose a novel module named Dual Graph enhanced Embedding, which is compatible with various CTR prediction models to alleviate these two problems. We further propose a Dual Graph enhanced Embedding Neural Network (DG-ENN) for CTR prediction. Dual Graph enhanced Embedding exploits the strengths of graph representation with two carefully designed learning strategies (divide-and-conquer and curriculum-learning-inspired organized learning) to refine the embedding. We conduct comprehensive experiments on three real-world industrial datasets. The experimental results show that our proposed DG-ENN significantly outperforms state-of-the-art CTR prediction models. Moreover, when applied to state-of-the-art CTR prediction models, Dual Graph enhanced Embedding always obtains better performance. Further case studies prove that our proposed dual graph enhanced embedding can alleviate the feature sparsity and behavior sparsity problems. Our framework will be open-sourced based on MindSpore in the near future.

Deep Generative Models for Spatial Networks

Spatial networks are crucial data structures where the nodes and edges are embedded in a geometric space. Nowadays, spatial network data is becoming increasingly popular and important, ranging from the microscale (e.g., protein structures) to the middle scale (e.g., biological neural networks) to the macro scale (e.g., mobility networks). Although modeling and understanding the generative process of spatial networks is very important, it remains largely under-explored due to the significant challenges in automatically modeling and distinguishing the independence and correlation among various spatial and network factors. To address these challenges, we first propose a novel objective for joint spatial-network disentanglement from the perspective of information bottleneck, as well as a novel optimization algorithm to optimize the intractable objective. Based on this, a spatial-network variational autoencoder (SND-VAE) with a new spatial-network message passing neural network (S-MPNN) is proposed to discover the independent and dependent latent factors of space and network. Qualitative and quantitative experiments on both synthetic and real-world datasets demonstrate the superiority of the proposed model over the state of the art by up to 66.9% for graph generation and 37.3% for interpretability.

Subset Node Representation Learning over Large Dynamic Graphs

Dynamic graph representation learning is the task of learning node embeddings over dynamic networks, and has many important applications, ranging from knowledge graphs and citation networks to social networks. Graphs of this type are usually large-scale, but only a small subset of vertices is relevant to downstream tasks. Current methods are too expensive for this setting, as their complexity is at best linearly dependent on both the number of nodes and the number of edges.

In this paper, we propose a new method, namely Dynamic Personalized PageRank Embedding (DynamicPPE), for learning a target subset of node representations over large-scale dynamic networks. Based on recent advances in local node embedding and a novel computation of the dynamic personalized PageRank vector (PPV), DynamicPPE has two key ingredients: 1) the per-PPV complexity is O(md/ε), where m, d, and ε are the number of edges received, the average degree, and the global precision error, respectively; thus, the per-edge-event update of a single node depends only on d on average; and 2) by using these high-quality PPVs and hash kernels, the learned embeddings have properties of both locality and global consistency. Together, these make it possible to capture the evolution of the graph structure effectively.

Experimental results demonstrate both the effectiveness and efficiency of the proposed method over large-scale dynamic networks. We apply DynamicPPE to capture the embedding changes of Chinese cities in the Wikipedia graph during the ongoing COVID-19 pandemic (https://en.wikipedia.org/wiki/COVID-19_pandemic). Our results show that these representations successfully encode the dynamics of the Wikipedia graph.
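
For intuition about why per-node personalized PageRank vectors can be maintained cheaply, here is a standard static forward-push approximation of a single PPV (the classic local-push scheme on a fixed graph; the paper's per-edge-event dynamic updates and hash-kernel embedding step are not shown):

```python
from collections import defaultdict, deque

def forward_push_ppr(adj, source, alpha=0.15, eps=1e-4):
    """Approximate the personalized PageRank vector of `source` by local push.
    adj: dict node -> list of neighbors; returns a sparse dict of PPR mass."""
    p = defaultdict(float)                 # approximate PPR vector
    r = defaultdict(float, {source: 1.0})  # residual mass still to be pushed
    queue = deque([source])
    while queue:
        u = queue.popleft()
        deg = max(len(adj[u]), 1)
        if r[u] < eps * deg:               # nothing significant left to push
            continue
        push = r[u]
        p[u] += alpha * push               # keep the teleport share
        r[u] = 0.0
        share = (1.0 - alpha) * push / deg
        for v in adj[u]:                   # spread the rest to neighbors
            r[v] += share
            if r[v] >= eps * max(len(adj[v]), 1):
                queue.append(v)
    return dict(p)

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(forward_push_ppr(adj, source=0))
```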

Generalized Zero-Shot Extreme Multi-label Learning

Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from millions of label choices. A hitherto unaddressed challenge in XML is that of predicting unseen labels with no training points. These form a significant fraction of total labels and contain fresh and personalized information desired by end users. Most existing extreme classifiers are not equipped for zero-shot label prediction and hence fail to leverage unseen labels. As a remedy, this paper proposes a novel approach called ZestXML for the task of Generalized Zero-shot XML (GZXML), where relevant labels have to be chosen from all available seen and unseen labels. ZestXML learns to project a data point's features close to the features of its relevant labels through a highly sparsified linear transform. This L0-constrained linear map between the two high-dimensional feature vectors is tractably recovered through a novel optimizer based on Hard Thresholding. By effectively leveraging the sparsity in features, labels, and the learnt model, ZestXML achieves higher accuracy and smaller model size than existing XML approaches while also enabling efficient training and prediction, real-time label updates, and explainable predictions.

Experiments on large-scale GZXML datasets demonstrate that ZestXML can be up to 14% and 10% more accurate than state-of-the-art extreme classifiers and leading BERT-based dense retrievers respectively, while having a 10x smaller model size. ZestXML trains on the largest dataset, with 31M labels, in just 30 hours on a single core of a commodity desktop. When added to a large ensemble of existing models in Bing Sponsored Search Advertising, ZestXML significantly improved the click yield of the IR-based system by 17% and unseen query coverage by 3.4%, respectively. ZestXML's source code and benchmark datasets for GZXML will be publicly released for research purposes here.
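
To illustrate the optimizer family the abstract points to, here is a generic iterative-hard-thresholding sketch on a least-squares proxy objective (an assumption for illustration; ZestXML's actual objective and optimizer details differ):

```python
import numpy as np

def hard_threshold(W, k):
    """Keep the k largest-magnitude entries of W, zero out the rest."""
    if W.size <= k:
        return W
    cutoff = np.partition(np.abs(W).ravel(), -k)[-k]
    return W * (np.abs(W) >= cutoff)

def iht_sparse_map(X, Y, k, lr=0.1, iters=200):
    """Learn an L0-constrained linear map W (||W||_0 <= k) with X @ W ~ Y
    by alternating a gradient step with hard thresholding."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / len(X)    # least-squares gradient
        W = hard_threshold(W - lr * grad, k)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W_true = hard_threshold(rng.normal(size=(20, 5)), 10)
Y = X @ W_true
W_hat = iht_sparse_map(X, Y, k=10)
print(np.count_nonzero(W_hat), round(float(np.linalg.norm(X @ W_hat - Y)), 4))
```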

Graph Summarization with Controlled Utility Loss

We present new algorithms for graph summarization where the loss in utility is fully controllable by the user. Specifically, we make three key contributions. First, we present a utility-driven graph summarization method, G-SCIS, based on a clique and independent set decomposition, that produces optimal compression with zero loss of utility. The compression provided is significantly better than the state of the art in lossless graph summarization, while the runtime is two orders of magnitude lower. Second, we propose a highly scalable, utility-driven algorithm, T-BUDS, for fully controlled lossy summarization. It achieves high scalability by combining memory reduction using a maximum spanning tree with a novel binary search procedure. T-BUDS drastically outperforms the state of the art in terms of the quality of summarization and is about two orders of magnitude better in terms of speed. In contrast to the competition, we are able to handle web-scale graphs on a single machine without performance impediment as the utility threshold (and size of summary) decreases. Third, we show that our graph summaries can be used as-is to answer several important classes of queries, such as triangle enumeration, PageRank, and shortest paths.

Dynamic and Multi-faceted Spatio-temporal Deep Learning for Traffic Speed Forecasting

Dynamic Graph Neural Networks (DGNNs) have become one of the most promising methods for traffic speed forecasting. However, when adapting DGNNs for traffic speed forecasting, existing approaches are usually built on a static adjacency matrix (whether predefined or self-learned) to learn spatial relationships among different road segments, even though the impact between two road segments can change dynamically during a day. Moreover, future traffic speed is not only related to the current traffic speed but also affected by other factors such as traffic volume. To this end, in this paper, we aim to explore these dynamic and multi-faceted spatio-temporal characteristics inherent in traffic data to further unleash the power of DGNNs for better traffic speed forecasting. Specifically, we design a dynamic graph construction method to learn the time-specific spatial dependencies of road segments. Then, a dynamic graph convolution module is proposed to aggregate hidden states of neighbor nodes to focal nodes by message passing on the dynamic adjacency matrices. Moreover, a multi-faceted fusion module is provided to incorporate the auxiliary hidden states learned from traffic volumes with the primary hidden states learned from traffic speeds. Finally, experimental results on real-world data demonstrate that our method can not only achieve state-of-the-art prediction performance but also obtain explicit and interpretable dynamic spatial relationships of road segments.

A Graph-based Approach for Trajectory Similarity Computation in Spatial Networks

Trajectory similarity computation is an essential operation in many applications of spatial data analysis. In this paper, we study the problem of trajectory similarity computation over spatial networks, where the real distances between objects are reflected by the network distance. Unlike previous studies, which learn the representation of trajectories in Euclidean space, this requires capturing not only the sequence information of the trajectory but also the structure of the spatial network. To this end, we propose GTS, a new framework that can jointly learn both factors so as to accurately compute the similarity. It first learns the representation of each point-of-interest (POI) in the road network along with the trajectory information. This is realized by incorporating the distances between POIs and trajectories into the random walk over the spatial network as well as into the loss function. Then the trajectory representation is learned by a Graph Neural Network model that identifies neighboring POIs within the same trajectory, together with an LSTM model that captures the sequence information in the trajectory. We conduct comprehensive evaluations on several real-world datasets. The experimental results demonstrate that our model substantially outperforms all existing approaches.

Adaptive Transfer Learning on Graph Neural Networks

Graph neural networks (GNNs) are widely used to learn powerful representations of graph-structured data. Recent work demonstrates that transferring knowledge from self-supervised tasks to downstream tasks can further improve graph representations. However, there is an inherent gap between self-supervised tasks and downstream tasks in terms of optimization objective and training data. Conventional pre-training methods may not be effective enough at knowledge transfer since they make no adaptation for downstream tasks. To solve such problems, we propose a new transfer learning paradigm on GNNs which can effectively leverage self-supervised tasks as auxiliary tasks to help the target task. Our method adaptively selects and combines different auxiliary tasks with the target task in the fine-tuning stage. We design an adaptive auxiliary loss weighting model to learn the weights of auxiliary tasks by quantifying the consistency between auxiliary tasks and the target task. In addition, we learn the weighting model through meta-learning. Our method can be applied to various transfer learning approaches; it performs well not only in multi-task learning but also in pre-training and fine-tuning. Comprehensive experiments on multiple downstream tasks demonstrate that the proposed method can effectively combine auxiliary tasks with the target task and significantly improve performance compared to state-of-the-art methods.

PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models

What should a malicious user write next to fool a detection model? Identifying malicious users is critical to ensure the safety and integrity of internet platforms. Several deep learning based detection models have been created. However, malicious users can evade deep detection models by manipulating their behavior, rendering these models of little use. The vulnerability of such deep detection models against adversarial attacks is unknown. Here we create a novel adversarial attack model against deep user sequence embedding-based classification models, which use the sequence of user posts to generate user embeddings and detect malicious users. In the attack, the adversary generates a new post to fool the classifier. We propose a novel end-to-end Personalized Text Generation Attack model, called PETGEN, that simultaneously reduces the efficacy of the detection model and generates posts that have several key desirable properties. Specifically, PETGEN generates posts that are personalized to the user's writing style, have knowledge about a given target context, are aware of the user's historical posts on the target context, and encapsulate the user's recent topical interests. We conduct extensive experiments on two real-world datasets (Yelp and Wikipedia, both with ground-truth of malicious users) to show that PETGEN significantly reduces the performance of popular deep user sequence embedding-based classification models. PETGEN outperforms five attack baselines in terms of text quality and attack efficacy in both white-box and black-box classifier settings. Overall, this work paves the path towards the next generation of adversary-aware sequence classification models.

Pruning-Aware Merging for Efficient Multitask Inference

Many mobile applications demand selective execution of multiple correlated deep learning inference tasks on resource-constrained platforms. Given a set of deep neural networks, each pre-trained for a single task, it is desired that executing arbitrary combinations of tasks yields minimal computation cost. Pruning each network separately yields suboptimal computation cost due to task relatedness. A promising remedy is to merge the networks into a multitask network to eliminate redundancy across tasks before network pruning. However, pruning a multitask network combined by existing network merging schemes cannot minimise the computation cost of every task combination because they do not consider such a future pruning. To this end, we theoretically identify the conditions such that pruning a multitask network minimises the computation of all task combinations. On this basis, we propose Pruning-Aware Merging (PAM), a heuristic network merging scheme to construct a multitask network that approximates these conditions. The merged network is then ready to be further pruned by existing network pruning methods. Evaluations with different pruning schemes, datasets, and network architectures show that PAM achieves up to 4.87x less computation against the baseline without network merging, and up to 2.01x less computation against the baseline with a state-of-the-art network merging scheme.

DARING: Differentiable Causal Discovery with Residual Independence

Discovering causal structure among a set of variables is a crucial task in various scientific and industrial scenarios. Given finite i.i.d. samples from a joint distribution, causal discovery is a challenging combinatorial problem in nature. Recent developments in functional causal models (FCMs), especially NOTEARS, provide a differentiable optimization framework for causal discovery. These methods formulate the structure learning problem as a task of maximum likelihood estimation over observational data (i.e., variable reconstruction) with specified structural constraints such as acyclicity and sparsity. Despite their success in terms of scalability, we find that optimizing the objectives of these differentiable methods is not always consistent with the correctness of the learned causal graph, especially when the variables carry heterogeneous noise (i.e., different noise types and noise variances) in real data from wild environments. In this paper, we provide the justification that their proneness to erroneous structures is mainly caused by the over-reconstruction problem, i.e., the noise of each variable is absorbed into the variable reconstruction process, leading to dependency among the reconstruction residuals and thus raising structure identifiability problems according to FCM theory. To remedy this, we propose a novel differentiable method, DARING, which imposes an explicit residual independence constraint in an adversarial way. Extensive experimental results on both simulated and real data show that our proposed method is insensitive to the heterogeneity of external noise and thus can significantly improve causal discovery performance.

PcDGAN: A Continuous Conditional Diverse Generative Adversarial Network For Inverse Design

Engineering design tasks often require synthesizing new designs that meet desired performance requirements. The conventional design process, which requires iterative optimization and performance evaluation, is slow and dependent on initial designs. Past work has used conditional generative adversarial networks (cGANs) to enable direct design synthesis for given target performances. However, most existing cGANs are restricted to categorical conditions. Recent work on Continuous conditional GAN (CcGAN) tries to address this problem, but still faces two challenges: 1) it performs poorly on non-uniform performance distributions, and 2) the generated designs may not cover the entire design space. We propose a new model, named Performance Conditioned Diverse Generative Adversarial Network (PcDGAN), which introduces a singular vicinal loss combined with a Determinantal Point Processes (DPP) based loss function to enhance diversity. PcDGAN uses a new self-reinforcing score called the Lambert Log Exponential Transition Score (LLETS) for improved conditioning. Experiments on synthetic problems and a real-world airfoil design problem demonstrate that PcDGAN outperforms state-of-the-art GAN models and improves the conditioning likelihood by 69% in an airfoil generation task and up to 78% in synthetic conditional generation tasks and achieves greater design space coverage. The proposed method enables efficient design synthesis and design space exploration with applications ranging from CAD model generation to metamaterial selection.

Federated Adversarial Debiasing for Fair and Transferable Representations

Federated learning is a distributed learning framework that is communication efficient and provides protection over participating users' raw training data. One outstanding challenge of federated learning comes from the users' heterogeneity, and learning from such data may yield biased and unfair models for minority groups. While adversarial learning is commonly used in centralized learning for mitigating bias, there are significant barriers when extending it to the federated framework. In this work, we study these barriers and address them by proposing a novel approach, Federated Adversarial DEbiasing (FADE). FADE does not require users' sensitive group information for debiasing and offers users the freedom to opt out from the adversarial component when privacy or computational costs become a concern. We show that, ideally, FADE can attain the same global optimality as the centralized algorithm. We then analyze when its convergence may fail in practice and propose a simple yet effective method to address the problem. Finally, we demonstrate the effectiveness of the proposed framework through extensive empirical studies, including the problem settings of unsupervised domain adaptation and fair learning. Our code and pretrained models are available at: https://github.com/illidanlab/FADE.

Uncertainty-Aware Reliable Text Classification

Deep neural networks have significantly contributed to the success in predictive accuracy for classification tasks. However, they tend to make over-confident predictions in real-world settings, where domain shift and out-of-distribution (OOD) examples exist. Most research on uncertainty estimation focuses on computer vision because it provides visual validation of uncertainty quality. However, few approaches have been presented in the natural language processing domain. Unlike Bayesian methods that indirectly infer uncertainty through weight uncertainties, current evidential uncertainty-based methods explicitly model the uncertainty of class probabilities through subjective opinions. They further consider inherent uncertainty in data with different root causes: vacuity (i.e., uncertainty due to a lack of evidence) and dissonance (i.e., uncertainty due to conflicting evidence). In this paper, we first apply evidential uncertainty to OOD detection for text classification tasks. We propose an inexpensive framework that adopts both auxiliary outliers and pseudo off-manifold samples to train the model with prior knowledge of a certain class, which has high vacuity for OOD samples. Extensive empirical experiments demonstrate that our model based on evidential uncertainty outperforms other counterparts for detecting OOD examples. Our approach can be easily deployed to traditional recurrent neural networks and fine-tuned pre-trained transformers.

HMRL: Hyper-Meta Learning for Sparse Reward Reinforcement Learning Problem

In spite of the success of existing meta reinforcement learning methods, they still have difficulty in learning a meta policy effectively for RL problems with sparse rewards. In this respect, we develop a novel meta reinforcement learning framework called Hyper-Meta RL (HMRL) for sparse-reward RL problems. It consists of three modules, including the cross-environment meta state embedding module, which constructs a common meta state space to adapt to different environments, and the meta-state-based, environment-specific meta reward shaping module, which effectively extends the original sparse-reward trajectory through cross-environment knowledge complementarity; as a consequence, the meta policy achieves better generalization and efficiency with the shaped meta reward. Experiments with sparse-reward environments show the superiority of HMRL in both transferability and policy learning efficiency.

Representation Learning on Knowledge Graphs for Node Importance Estimation

In knowledge graphs, there are usually different types of nodes, multiple heterogeneous relations, and numerous attributes of nodes and edges, which poses challenges for the task of Node Importance Estimation (NIE). Indeed, existing NIE approaches, such as PageRank (PR) and Node-Degree (ND), are not designed for handling knowledge graphs with the rich information related to these multifarious nodes and edges. To this end, in this paper, we propose a representation learning framework to leverage the rich information inherent in these multifarious nodes and edges to improve node importance estimation in knowledge graphs. Specifically, we provide a Relational Graph Transformer Network (RGTN), where a relational graph transformer is first proposed to propagate node information with consideration of semantic predicate representations. Here, the assumption is that different predicates may have distinct effects on the transmission of node importance. Then, two separate encoders are designed to capture the structural and semantic information of nodes, respectively, and a co-attention module is developed to fuse the two separate representations of nodes. Next, an attention-based aggregation module is adopted to map the representations of nodes to their importance values. In addition, a learning-to-rank loss is designed to ensure that the learned representations are aware of the relative ranking information among nodes. Finally, extensive experiments have been conducted on real-world knowledge graphs, and the results illustrate that our model consistently outperforms the existing methods on all the evaluation metrics. The code and the data are available at https://github.com/GRAPH-0/RGTN-NIE.

Metric Learning via Penalized Optimization

Metric learning aims to project original data into a new space, where data points can be classified more accurately using kNN or similar types of classification algorithms. To avoid trivial learning results such as indistinguishably projecting the data onto a line, many existing approaches formulate metric learning as a constrained optimization problem, e.g., finding a metric that minimizes the distance between data points from the same class under a constraint ensuring a certain separation between data points from different classes, and then approximate the optimal solution to the constrained optimization iteratively. In order to improve the classification accuracy as much as possible, we try to find a metric that is able to minimize the intra-class distance and maximize the inter-class distance simultaneously. Towards this, we formulate metric learning as a penalized optimization problem, and provide design guidelines, paradigms with a general formula, and two representative instantiations of the penalty term. In addition, we provide an analytical solution for the penalized optimization, with which costly computation can be avoided and, more importantly, there is no need to worry about convergence rates or approximation ratios anymore. Extensive experiments on real-world data sets are conducted, and the results verify the effectiveness and efficiency of our approach.
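
One plausible instantiation of such a penalized objective (my own illustrative choice of penalty, not necessarily the paper's) minimizes within-class scatter minus a weighted between-class scatter under an orthonormality constraint, which admits a closed-form solution via an eigen-decomposition:

```python
import numpy as np

def penalized_metric(X, y, lam=1.0, out_dim=2):
    """Closed-form linear projection W minimizing
    tr(W^T S_w W) - lam * tr(W^T S_b W)  s.t.  W^T W = I,
    i.e., the eigenvectors of (S_w - lam * S_b) with smallest eigenvalues."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)                 # within-class scatter
        S_b += len(Xc) * np.outer(mc - mu, mc - mu)    # between-class scatter
    vals, vecs = np.linalg.eigh(S_w - lam * S_b)       # symmetric, so eigh
    return vecs[:, :out_dim]                           # smallest eigenvalues; project with X @ W

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = penalized_metric(X, y, lam=2.0)
print(W.shape)   # (5, 2)
```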

MixGCF: An Improved Training Method for Graph Neural Network-based Recommender Systems

Graph neural networks (GNNs) have recently emerged as the state-of-the-art collaborative filtering (CF) solution. A fundamental challenge of CF is to distill negative signals from the implicit feedback, but negative sampling in GNN-based CF has been largely unexplored. In this work, we propose to study negative sampling by leveraging both the user-item graph structure and GNNs' aggregation process. We present the MixGCF method---a general negative sampling plugin that can be directly used to train GNN-based recommender systems. In MixGCF, rather than sampling raw negatives from data, we design the hop mixing technique to synthesize hard negatives. Specifically, the idea of hop mixing is to generate the synthetic negative by aggregating embeddings from different layers of the raw negatives' neighborhoods. The layer and neighborhood selection processes are optimized by a theoretically backed hard selection strategy. Extensive experiments demonstrate that by using MixGCF, state-of-the-art GNN-based recommendation models can be consistently and significantly improved, e.g., by 26% for NGCF and 22% for LightGCN in terms of NDCG@20.
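
A minimal numpy sketch of the hop-mixing idea, with shapes and an inner-product "hard selection" criterion assumed for illustration (not the exact MixGCF implementation): per layer, mix the positive item's embedding into each candidate negative's layer-wise embedding, hard-select the candidate that scores highest against the user, and pool across layers:

```python
import numpy as np

def hop_mixing_negative(user_emb, pos_layers, neg_layers, alphas):
    """Synthesize one hard negative from layer-wise embeddings of M raw negatives.
    pos_layers: (L, d) positive item embedding per GNN layer.
    neg_layers: (M, L, d) candidate negatives' embeddings per layer.
    alphas:     (L,) mixing coefficients in [0, 1]."""
    M, L, d = neg_layers.shape
    synthetic = np.zeros(d)
    for l in range(L):
        mixed = alphas[l] * pos_layers[l] + (1 - alphas[l]) * neg_layers[:, l, :]  # (M, d)
        scores = mixed @ user_emb                      # hard selection criterion
        synthetic += mixed[np.argmax(scores)]          # pool (sum) over layers
    return synthetic

rng = np.random.default_rng(0)
L, d, M = 3, 8, 16
user = rng.normal(size=d)
pos = rng.normal(size=(L, d))
negs = rng.normal(size=(M, L, d))
alphas = rng.uniform(size=L)
print(hop_mixing_negative(user, pos, negs, alphas).shape)   # (8,)
```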

Scaling Up Graph Neural Networks Via Graph Coarsening

Scalability of graph neural networks remains one of the major challenges in graph machine learning. Since the representation of a node is computed by recursively aggregating and transforming representation vectors of its neighboring nodes from previous layers, the receptive fields grow exponentially, which makes standard stochastic optimization techniques ineffective. Various approaches have been proposed to alleviate this issue, e.g., sampling-based methods and techniques based on pre-computation of graph filters.

In this paper, we take a different approach and propose to use graph coarsening for scalable training of GNNs, which is generic, extremely simple, and has sublinear memory and time costs during training. We present extensive theoretical analysis of the effect of using coarsening operations and provide useful guidance on the choice of coarsening methods. Interestingly, our theoretical analysis shows that coarsening can also be considered a type of regularization and may improve generalization. Finally, empirical results on real-world datasets show that, by simply applying off-the-shelf coarsening methods, we can reduce the number of nodes by up to a factor of ten without causing a noticeable downgrade in classification accuracy.
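
The core coarsening operation is just a pair of matrix products with a cluster-assignment matrix; the sketch below uses an arbitrary hard partition for illustration (which off-the-shelf coarsening method supplies the partition is exactly what the paper studies):

```python
import numpy as np

def coarsen(A, X, assignment):
    """Coarsen a graph given a node->cluster assignment.
    A: (n, n) adjacency, X: (n, d) features, assignment: length-n int array."""
    n = len(assignment)
    n_c = assignment.max() + 1
    P = np.zeros((n, n_c))
    P[np.arange(n), assignment] = 1.0        # hard partition matrix
    A_c = P.T @ A @ P                        # coarse (weighted) adjacency
    sizes = P.sum(axis=0, keepdims=True).T   # cluster sizes
    X_c = (P.T @ X) / sizes                  # mean-pool features per cluster
    return A_c, X_c

A = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]], float)
X = np.arange(8, dtype=float).reshape(4, 2)
A_c, X_c = coarsen(A, X, assignment=np.array([0, 0, 1, 1]))
print(A_c)   # 2x2 coarse adjacency (self-weights count intra-cluster edges)
print(X_c)   # 2x2 cluster-averaged features
```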

A Broader Picture of Random-walk Based Graph Embedding

Graph embedding based on random-walks supports effective solutions for many graph-related downstream tasks. However, the abundance of embedding literature has made it increasingly difficult to compare existing methods and to identify opportunities to advance the state-of-the-art. Meanwhile, existing work has left several fundamental questions---such as how embeddings capture different structural scales and how they should be applied for effective link prediction---unanswered. This paper addresses these challenges with an analytical framework for random-walk based graph embedding that consists of three components: a random-walk process, a similarity function, and an embedding algorithm. Our framework not only categorizes many existing approaches but naturally motivates new ones. With it, we illustrate novel ways to incorporate embeddings at multiple scales to improve downstream task performance. We also show that embeddings based on autocovariance similarity, when paired with dot product ranking for link prediction, outperform state-of-the-art methods based on Pointwise Mutual Information similarity by up to 100%.
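
For the autocovariance similarity highlighted in the closing result, one concrete form (my reading of the standard autocovariance of a simple random walk; the paper's exact scaling may differ) is R(τ) = diag(π) P^τ − ππᵀ, factorized into embeddings via its top eigenpairs:

```python
import numpy as np

def autocovariance_embedding(A, tau=3, dim=4):
    """Embed nodes by factorizing the random-walk autocovariance matrix
    R(tau) = diag(pi) @ P^tau - pi pi^T, with P the transition matrix and
    pi its stationary distribution (degree-proportional for undirected graphs)."""
    deg = A.sum(axis=1)
    P = A / deg[:, None]                       # row-stochastic transition matrix
    pi = deg / deg.sum()
    R = np.diag(pi) @ np.linalg.matrix_power(P, tau) - np.outer(pi, pi)
    R = (R + R.T) / 2                          # symmetrize for numerical safety
    vals, vecs = np.linalg.eigh(R)
    idx = np.argsort(vals)[::-1][:dim]         # keep the top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# toy graph: a ring (so every node has neighbors) plus a few random chords
rng = np.random.default_rng(0)
n = 10
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
chords = np.triu((rng.random((n, n)) < 0.2).astype(float), 1)
A = np.clip(A + chords + chords.T, 0.0, 1.0)
print(autocovariance_embedding(A, tau=3).shape)   # (10, 4)
```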

DisenQNet: Disentangled Representation Learning for Educational Questions

Learning informative representations for educational questions is a fundamental problem in online learning systems, which can promote many applications, e.g., difficulty estimation. Most solutions integrate all information of one question together in a supervised manner, where the representation results are sometimes unsatisfactory due to the following issues. First, they cannot ensure the representation ability due to the scarcity of labeled data. Second, the label-dependent representation results are hard to transfer. Moreover, aggregating all information into a unified representation may introduce noise in applications since it cannot distinguish the diverse characteristics of questions. In this paper, we aim to learn disentangled representations of questions. We propose a novel unsupervised model, namely DisenQNet, to divide one question into two parts, i.e., a concept representation that captures its explicit concept meaning and an individual representation that preserves its personal characteristics. We achieve this goal via mutual information estimation by proposing three self-supervised estimators on a large unlabeled question corpus. Then, we propose an enhanced model, DisenQNet+, that transfers the representation knowledge from unlabeled questions to labeled questions in specific applications by maximizing the mutual information between the two. Extensive experiments on real-world datasets demonstrate that DisenQNet can generate effective and meaningful disentangled representations for questions, and furthermore, DisenQNet+ can improve the performance of different applications.

Coupled Graph ODE for Learning Interacting System Dynamics

Many real-world systems, such as social networks and moving planets, are dynamic in nature, where a set of coupled objects are connected via an interaction graph and exhibit complex behavior over time. For example, the COVID-19 pandemic can be considered a dynamical system, where objects represent geographical locations (e.g., states) whose daily confirmed cases of infection evolve over time. Outbreak at one location may influence another location as people travel between these locations, forming a graph. Thus, how to model and predict the complex dynamics of these systems becomes a critical research problem. Existing work on modeling graph-structured data mostly assumes a static setting. How to handle dynamic graphs remains to be further explored. On one hand, features of objects change over time, influenced by the linked objects in the interaction graph. On the other hand, the graph itself can also evolve, where new interactions (links) may form and existing links may drop, which may in turn be affected by the dynamic features of objects. In this paper, we propose coupled graph ODE: a novel latent ordinary differential equation (ODE) generative model that learns the coupled dynamics of nodes and edges with a graph neural network (GNN) based ODE in a continuous manner. Our model consists of two coupled ODE functions for modeling the dynamics of edges and nodes based on their latent representations, respectively. It employs a novel encoder parameterized by a GNN for inferring the initial states from historical data, which serve as the starting points of the predicted latent trajectories. Experimental results on the COVID-19 dataset and a simulated social network dataset demonstrate the effectiveness of our proposed method.

TrajNet: A Trajectory-Based Deep Learning Model for Traffic Prediction

Ridesharing companies such as Uber and DiDi provide ride-hailing services where passengers and drivers are matched via mobile apps. As a result, large amounts of vehicle trajectories and vehicle speed data are collected that can be used for traffic prediction. The recent popularity of graph convolutional networks (GCNs) has opened up new possibilities for real-time traffic prediction, and many GCN-based models have been proposed to capture the spatial correlation on the urban road network. However, the graph-based approaches fail to capture the intricate dependencies of consecutive road segments that are well captured by trajectories.

Instead of proposing yet another GCN-based model for traffic prediction, we propose a novel deep learning model that treats vehicle trajectories as first-class citizens. Our model, called TrajNet, captures the spatial dependency of traffic flow by propagating information along real trajectories. To improve training efficiency, we organize the trajectories in a training batch with a trie structure to reuse shared computation. TrajNet uses a spatial attention mechanism to adaptively capture the dynamic correlations between different road segments, and dilated causal convolution to capture long-range temporal dependency. We also resolve the inconsistency between the fine-grained road-segment coverage provided by trajectories and the coarse-grained ground-truth traffic data via a trajectory-based refinement framework. Extensive experiments on real traffic datasets validate the performance superiority of TrajNet over the state-of-the-art GCN-based models.

Fast and Memory-Efficient Tucker Decomposition for Answering Diverse Time Range Queries

Given a temporal dense tensor and an arbitrary time range, how can we efficiently obtain latent factors in the range? Tucker decomposition is a fundamental tool for analyzing dense tensors to discover hidden factors, and has been exploited in many data mining applications. However, existing decomposition methods do not provide the functionality to analyze a specific range of a temporal tensor. The existing methods are one-off, with the main focus on performing Tucker decomposition once for a whole input tensor. Although a few existing methods with a preprocessing phase can deal with a time range query, they are still time-consuming and suffer from low accuracy. In this paper, we propose Zoom-Tucker, a fast and memory-efficient Tucker decomposition method for finding hidden factors of temporal tensor data in an arbitrary time range. Zoom-Tucker fully exploits block structure to compress a given tensor, supporting an efficient query and capturing local information. Zoom-Tucker answers diverse time range queries quickly and memory-efficiently, by elaborately decoupling the preprocessed results included in the range and carefully determining the order of computations. We demonstrate that Zoom-Tucker is up to 171.9x faster and requires up to 230x less space than existing methods while providing comparable accuracy.
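
For readers unfamiliar with the underlying operation, here is a plain one-off HOSVD-style Tucker decomposition in numpy (the textbook baseline the abstract contrasts with; Zoom-Tucker's block preprocessing and query-time stitching are not shown):

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along `mode`."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Higher-order SVD: factor matrices from mode-wise truncated SVDs,
    core tensor from projecting T onto them."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # mode-product with U^T: contract T's `mode` axis against U's rows
        core = np.moveaxis(np.tensordot(core, U, axes=(mode, 0)), -1, mode)
    return core, factors

rng = np.random.default_rng(0)
T = rng.normal(size=(6, 5, 4))
core, factors = tucker_hosvd(T, ranks=(3, 3, 2))
print(core.shape, [U.shape for U in factors])   # (3, 3, 2) [(6, 3), (5, 3), (4, 2)]
```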

ACE-NODE: Attentive Co-Evolving Neural Ordinary Differential Equations

Neural ordinary differential equations (NODEs) presented a new paradigm to construct (continuous-time) neural networks. While showing several good characteristics in terms of the number of parameters and the flexibility in constructing neural networks, they also have a couple of well-known limitations: i) theoretically NODEs learn homeomorphic mapping functions only, and ii) sometimes NODEs show numerical instability in solving integral problems. To handle this, many enhancements have been proposed. To our knowledge, however, integrating attention into NODEs has been overlooked for a while. To this end, we present a novel method of attentive dual co-evolving NODE (ACE-NODE): one main NODE for a downstream machine learning task and the other for providing attention to the main NODE. Our ACE-NODE supports both pairwise and elementwise attention. In our experiments, our method outperforms existing NODE-based and non-NODE-based baselines in almost all cases by non-trivial margins.

Cross-Network Learning with Partially Aligned Graph Convolutional Networks

Graph neural networks have been widely used for learning representations of nodes for many downstream tasks on graph data. Existing models were designed for nodes on a single graph and thus cannot utilize information across multiple graphs. The real world does have multiple graphs where the nodes are often partially aligned. For example, knowledge graphs share a number of named entities though they may have different relation schemas; collaboration networks on publications and awarded projects share some researcher nodes, who are authors and investigators, respectively; people use multiple web services for shopping, tweeting, and rating movies, and some may register the same email account across the platforms. In this paper, I propose partially aligned graph convolutional networks to learn node representations across the models. I investigate multiple methods (including model sharing, regularization, and alignment reconstruction) as well as theoretical analysis to positively transfer knowledge across the (small) set of partially aligned nodes. Extensive experiments on real-world knowledge graphs and collaboration networks show the superior performance of our proposed methods on relation classification and link prediction.

Pre-training on Large-Scale Heterogeneous Graph

Graph neural networks (GNNs) have emerged as the state-of-the-art representation learning methods on graphs and often rely on a large amount of labeled data to achieve satisfactory performance. Recently, in order to relieve the label scarcity issue, some works propose to pre-train GNNs in a self-supervised manner by distilling transferable knowledge from unlabeled graph structures. Unfortunately, these pre-training frameworks mainly target homogeneous graphs, while real interaction systems usually constitute large-scale heterogeneous graphs, containing different types of nodes and edges, which leads to new challenges of structural heterogeneity and scalability for graph pre-training. In this paper, we first study the problem of pre-training on large-scale heterogeneous graphs and propose a novel pre-training GNN framework, named PT-HGNN. The proposed PT-HGNN designs both node- and schema-level pre-training tasks to contrastively preserve heterogeneous semantic and structural properties as a form of transferable knowledge for various downstream tasks. In addition, a relation-based personalized PageRank is proposed to sparsify the large-scale heterogeneous graph for efficient pre-training. Extensive experiments on one of the largest public heterogeneous graphs (OAG) demonstrate that our PT-HGNN significantly outperforms various state-of-the-art baselines.

Weakly Supervised Spatial Deep Learning based on Imperfect Vector Labels with Registration Errors

This paper studies weakly supervised learning on spatial raster data based on imperfect vector training labels. Given raster feature imagery and imperfect (weak) vector labels with location registration errors, our goal is to learn a deep learning model for pixel classification and refine the vector labels simultaneously. The problem is important in many geoscience applications, such as streamline delineation and road mapping from earth imagery, where annotating imperfect coarse vector labels is far more efficient than drawing precise labels. But the problem is challenging due to the misalignment of vector labels with raster feature pixels and the need to infer true vector label locations while learning neural network parameters. Existing works on weakly supervised learning often focus on noise and errors in label semantics, assuming label locations to be either correct or irrelevant (e.g., independent and identically distributed). A few works exist on label registration errors, but these methods often focus on label misalignment on object segment boundaries at the pixel level without guaranteeing vector continuity. To fill the gap, this paper proposes a spatial learning framework based on Expectation-Maximization that iteratively updates deep neural network parameters while inferring true vector label locations. Specifically, inference of true vector locations is based on both the current pixel class predictions and the geometric properties of vectors. Evaluations on real-world high-resolution remote sensing datasets for National Hydrography Dataset (NHD) refinement show that the proposed framework outperforms baseline methods in classification accuracy and refined vector quality.

Towards a Better Understanding of Linear Models for Recommendation

Recently, linear regression models have been shown to often produce rather competitive results against more sophisticated deep learning models. Meanwhile, (weighted) matrix factorization approaches have been popular choices for recommendation in the past and are widely adopted in industry. In this work, we aim to theoretically understand the relationship between these two approaches, which are the cornerstones of model-based recommendation. Through the derivation and analysis of the closed-form solutions for two basic regression and matrix factorization approaches, we find that these two approaches are indeed inherently related but also diverge in how they "scale down" the singular values of the original user-item interaction matrix. We further introduce a new learning algorithm for searching the (hyper)parameters of the closed-form solution and utilize it to discover the nearby models of the existing solutions. The experimental results demonstrate that the basic models and their closed-form solutions are indeed quite competitive against the state-of-the-art models, thus confirming the validity of studying the basic models. The effectiveness of exploring the nearby models is also experimentally validated.
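
As a rough illustration of the "scale-down" view (not the paper's exact derivation), the following sketch contrasts how an L2-regularized item-item regression and a rank-k truncated SVD act on the singular values of a toy user-item matrix; the matrix, regularization weight, and rank are arbitrary assumptions.

```python
# Hedged sketch: compare how an L2-regularized item-item regression and a rank-k
# factorization act on the singular values of a toy user-item matrix X.
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((100, 50)) < 0.1).astype(float)   # toy implicit-feedback matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam, k = 10.0, 10

# Ridge-style item-item regression B = (X^T X + lam I)^{-1} X^T X
# scales each singular direction by s^2 / (s^2 + lam).
ridge_scaling = s**2 / (s**2 + lam)

# Rank-k truncated SVD keeps the top-k directions untouched and zeroes the rest.
svd_scaling = np.where(np.arange(len(s)) < k, 1.0, 0.0)

print(np.round(ridge_scaling[:5], 3), svd_scaling[:5])
```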

Learning to Walk across Time for Interpretable Temporal Knowledge Graph Completion

Static knowledge graphs (KGs), despite their wide usage in relational reasoning and downstream tasks, fall short of realistic modeling of knowledge and facts that are only temporarily valid. Compared to static knowledge graphs, temporal knowledge graphs (TKGs) inherently reflect the transient nature of real-world knowledge. Naturally, automatic TKG completion has drawn much research interest for a more realistic modeling of relational reasoning. However, most of the existing models for TKG completion extend static KG embeddings that do not fully exploit TKG structure, thus lacking in 1) accounting for temporally relevant events already residing in the local neighborhood of a query, and 2) path-based inference that facilitates multi-hop reasoning and better interpretability. In this paper, we propose T-GAP, a novel model for TKG completion that maximally utilizes both temporal information and graph structure in its encoder and decoder. T-GAP encodes query-specific substructure of the TKG by focusing on the temporal displacement between each event and the query timestamp, and performs path-based inference by propagating attention through the graph. Our empirical experiments demonstrate that T-GAP not only achieves superior performance against state-of-the-art baselines, but also competently generalizes to queries with unseen timestamps. Through extensive qualitative analyses, we also show that T-GAP enjoys transparent interpretability, and follows human intuition in its reasoning process.

A Hyper-surface Arrangement Model of Ranking Distributions

A distribution on the permutations over a fixed finite set is called a ranking distribution. Modelling ranking distributions is one of the major topics in preference learning as such distributions appear as the ranking data produced by many judges. In this paper, we propose a geometric model for ranking distributions. Our idea is to use hyper-surface arrangements in a metric space as the representation space, where each component cut out by hyper-surfaces corresponds to a total ordering, and its volume is proportional to the probability. In this setting, the union of components corresponds to a partial ordering and its probability is also estimated by the volume. Similarly, the probability of a partial ordering conditioned by another partial ordering is estimated by the ratio of volumes. We provide a simple iterative algorithm to fit our model to a given dataset. We show our model can represent the distribution of a real-world dataset faithfully and can be used for prediction and visualisation purposes.

Preference Amplification in Recommender Systems

Recommender systems have become increasingly accurate in suggesting content to users, resulting in users primarily consuming content through recommendations. This can cause the user's interest to narrow toward the recommended content, something we refer to as preference amplification. While this can contribute to increased engagement, it can also lead to negative experiences such as lack of diversity and echo chambers. We propose a theoretical framework for studying such amplification in a matrix factorization based recommender system. We model the dynamics of the system, where users interact with the recommender system and gradually "drift" toward the recommended content, with the recommender system adapting, based on user feedback, to the updated preferences. We study the conditions under which preference amplification manifests, and validate our results with simulations. Finally, we evaluate mitigation strategies that prevent the adverse effects of preference amplification and present experimental results using a real-world large-scale video recommender system showing that by reducing exposure to potentially objectionable content we can increase user engagement by up to 2%.

SESSION: Research Track Papers

Topology Distillation for Recommender System

Recommender Systems (RS) have employed knowledge distillation, a model compression technique that trains a compact student model with knowledge transferred from a pre-trained large teacher model. Recent work has shown that transferring knowledge from the teacher's intermediate layer significantly improves the recommendation quality of the student. However, these methods transfer the knowledge of individual representations point-wise and are thus limited, because the primary information of RS lies in the relations in the representation space. This paper proposes a new topology distillation approach that guides the student by transferring the topological structure built upon the relations in the teacher space. We first observe that simply making the student learn the whole topological structure is not always effective and can even degrade the student's performance: because the capacity of the student is highly limited compared to that of the teacher, learning the whole topological structure is daunting for the student. To address this issue, we propose a novel method named Hierarchical Topology Distillation (HTD), which distills the topology hierarchically to cope with the large capacity gap. Our extensive experiments on real-world datasets show that the proposed method significantly outperforms state-of-the-art competitors. We also provide in-depth analyses to ascertain the benefit of distilling the topology for RS.
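
The following minimal PyTorch sketch illustrates only the naive "full topology" matching that the abstract warns about, i.e., aligning the student's pairwise-similarity matrix with the teacher's; the hierarchical grouping of HTD is not shown, and the embedding sizes are illustrative assumptions.

```python
# Minimal sketch of (full) topology distillation: align the pairwise-similarity
# structure of the student's embedding space with the teacher's.
import torch
import torch.nn.functional as F

def topology_distillation_loss(teacher_emb, student_emb):
    # Cosine-similarity matrices encode the relational "topology" of each space.
    t = F.normalize(teacher_emb, dim=1)
    s = F.normalize(student_emb, dim=1)
    return F.mse_loss(s @ s.T, t @ t.T)

teacher = torch.randn(256, 64)                      # pre-trained teacher embeddings
student = torch.randn(256, 16, requires_grad=True)  # compact student embeddings
loss = topology_distillation_loss(teacher, student)
loss.backward()
```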

Learning to Embed Categorical Features without Embedding Tables for Recommendation

Embedding learning of categorical features (e.g., user/item IDs) is at the core of various recommendation models. The standard approach creates an embedding table in which each row is a dedicated embedding vector for every unique feature value. However, this method fails to efficiently handle high-cardinality features and unseen feature values (e.g., a new video ID) that are prevalent in real-world recommendation systems. In this paper, we propose an alternative embedding framework, Deep Hash Embedding (DHE), which replaces embedding tables with a deep embedding network that computes embeddings on the fly. DHE first encodes the feature value to a unique identifier vector with multiple hashing functions and transformations, and then applies a DNN to convert the identifier vector to an embedding. The encoding module is deterministic, non-learnable, and free of storage, while the embedding network is updated during training to learn embedding generation. Empirical results show that DHE achieves AUC comparable to the standard one-hot full embedding, with smaller model sizes. Our work sheds light on the design of DNN-based alternative embedding schemes for categorical features that do not rely on embedding table lookup.
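
A hedged sketch of the encode-then-decode idea described above: a categorical ID is mapped by several hash functions to a dense identifier vector (no stored table), which a small MLP converts into the embedding. The hash construction, the number of hash functions, and the layer sizes are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of a DHE-style embedding: deterministic multi-hash encoding + learned MLP.
import torch
import torch.nn as nn

K, M = 1024, 1_000_003          # number of hash functions, hash bucket count (assumed)

def dhe_encode(feature_id: int) -> torch.Tensor:
    # Deterministic, non-learnable, storage-free encoding of the ID.
    h = [(feature_id * (2 * i + 1) + i) % M for i in range(K)]
    v = torch.tensor(h, dtype=torch.float32)
    return (v / M) * 2 - 1       # normalize roughly to [-1, 1]

embedding_net = nn.Sequential(   # learned "decoder" replacing the embedding table
    nn.Linear(K, 256), nn.ReLU(),
    nn.Linear(256, 32),
)

new_video_id = 987_654_321       # unseen IDs are handled the same way as seen ones
emb = embedding_net(dhe_encode(new_video_id))
```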

Joint Graph Embedding and Alignment with Spectral Pivot

Graphs are powerful abstractions that naturally capture the wealth of relationships in our interconnected world. This paper proposes a new approach for graph alignment, a core problem in graph mining. Classical (e.g., spectral) methods use fixed embeddings for both graphs to perform the alignment. In contrast, the proposed approach fixes the embedding of the 'target' graph and jointly optimizes the embedding transformation and the alignment of the 'query' graph. An alternating optimization algorithm is proposed for computing high-quality approximate solutions and is compared against prevailing state-of-the-art graph alignment frameworks on benchmark real-world graphs. The results indicate that the proposed formulation can offer significant gains in matching accuracy and robustness to noise relative to existing solutions for this hard but important problem.

Auditing for Diversity Using Representative Examples

Assessing the diversity of a dataset of information associated with people is crucial before using such data for downstream applications. For a given dataset, this often involves computing the imbalance or disparity in the empirical marginal distribution of a protected attribute (e.g., gender, dialect, etc.). However, real-world datasets, such as images from Google Search or collections of Twitter posts, often do not have protected attributes labeled. Consequently, to derive disparity measures for such datasets, the elements need to be hand-labeled or crowd-annotated, which is expensive.

We propose a cost-effective approach to approximate the disparity of a given unlabeled dataset, with respect to a protected attribute, using a control set of labeled representative examples. Our proposed algorithm uses the pairwise similarity between elements in the dataset and elements in the control set to effectively bootstrap an approximation to the disparity of the dataset. Importantly, we show that using a control set whose size is much smaller than the size of the dataset is sufficient to achieve a small approximation error. Further, based on our theoretical framework, we also provide an algorithm to construct adaptive control sets that achieve smaller approximation errors than randomly chosen control sets. Simulations on two image datasets and one Twitter dataset demonstrate the efficacy of our approach (using random and adaptive control sets) in auditing the diversity of a wide variety of datasets.
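
A minimal sketch of the general idea, under assumptions of our own: each unlabeled item receives a soft protected-attribute membership from its similarity to the labeled control set, and the resulting soft counts yield an approximate disparity. The similarity kernel, temperature, and disparity definition below are illustrative and may differ from the paper's estimator.

```python
# Minimal sketch of auditing disparity with a small labeled control set.
import numpy as np

def approx_disparity(unlabeled, control_x, control_attr, temp=0.1):
    # Normalize embeddings so similarities are cosine values in [-1, 1].
    unlabeled = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    control_x = control_x / np.linalg.norm(control_x, axis=1, keepdims=True)
    logits = (unlabeled @ control_x.T) / temp      # pairwise similarity votes
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    p_group1 = (w * (control_attr == 1)).sum(axis=1)   # soft membership per item
    share = p_group1.mean()
    return abs(share - (1 - share))                # imbalance of the marginal

rng = np.random.default_rng(1)
unlabeled = rng.normal(size=(10_000, 128))         # e.g. unlabeled image embeddings
control_x = rng.normal(size=(200, 128))            # small labeled control set
control_attr = rng.integers(0, 2, size=200)
print(approx_disparity(unlabeled, control_x, control_attr))
```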

Q-Learning Lagrange Policies for Multi-Action Restless Bandits

Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which N independent processes are managed. However, previous work only studies the offline setting where problem dynamics are known. We address this restrictive assumption, designing the first algorithms for learning good policies for multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as t → ∞. However, MAIQL relies on learning Q-functions and indexes on two timescales, which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound of LPQL's return as t → ∞. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.

A Color-blind 3-Approximation for Chromatic Correlation Clustering and Improved Heuristics

Chromatic Correlation Clustering (CCC) models clustering of objects with categorical pairwise relationships. The model can be viewed as clustering the vertices of a graph with edge labels (colors). Bonchi et al. [KDD 2012] introduced it as a natural generalization of the well-studied Correlation Clustering (CC) problem, motivated by real-world applications from data mining, social networks, and bioinformatics. We make theoretical as well as practical contributions to the study of CCC. Our main theoretical contribution is an alternative analysis of the famous Pivot algorithm for CC. We show that, when simply run color-blind, Pivot is also a linear-time 3-approximation for CCC. The previous best theoretical results for CCC were a 4-approximation with a high-degree polynomial runtime and a linear-time 11-approximation, both by Anava et al. [WWW 2015]. While this theoretical result justifies Pivot as a baseline comparison for other heuristics, its blunt color-blindness performs poorly in practice. We develop a color-sensitive, practical heuristic we call Greedy Expansion that empirically outperforms all heuristics proposed for CCC so far, on both real-world and synthetic instances. Further, we propose a novel generalization of CCC allowing for multi-labelled edges. We argue that it is more suitable for many of the real-world applications and extend our results to this model.
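
For concreteness, the following sketch runs Pivot color-blind on a toy edge-labelled graph: a random unclustered vertex becomes the pivot and is clustered with all of its remaining neighbors, regardless of edge color. The graph representation is illustrative only.

```python
# Sketch of running Pivot "color-blind" on a chromatic (edge-labelled) instance.
import random

def colorblind_pivot(vertices, edges):
    # edges: dict mapping frozenset({u, v}) -> color; colors are ignored below.
    neighbors = {v: set() for v in vertices}
    for e in edges:
        u, v = tuple(e)
        neighbors[u].add(v)
        neighbors[v].add(u)

    unclustered, clusters = set(vertices), []
    while unclustered:
        pivot = random.choice(sorted(unclustered))            # random pivot vertex
        cluster = {pivot} | (neighbors[pivot] & unclustered)  # pivot + remaining neighbors
        clusters.append(cluster)
        unclustered -= cluster
    return clusters

V = range(6)
E = {frozenset({0, 1}): "red", frozenset({1, 2}): "blue", frozenset({3, 4}): "red"}
print(colorblind_pivot(V, E))
```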

Fast Rotation Kernel Density Estimation over Data Streams

Kernel density estimation is a powerful tool widely used in many important real-world applications such as anomaly detection and statistical learning. Unfortunately, current kernel methods suffer from high computational or space costs when dealing with large-scale, high-dimensional datasets, especially when the datasets of interest arrive in a streaming fashion. Although there are sketch methods designed for kernel density estimation over data streams, they still suffer from high computational costs. To address this problem, in this paper we propose a novel Rotation Kernel. The Rotation Kernel is based on a Rotation Hash method and is much faster to compute. To achieve memory-efficient kernel density estimation over data streams, we design a method, RKD-Sketch, which compresses high-dimensional data streams into a small array of integer counters. We conduct extensive experiments on both synthetic and real-world datasets, and the results demonstrate that RKD-Sketch saves up to 216 times the computational resources and up to 104 times the space of state-of-the-art methods. Furthermore, we apply our Rotation Kernel to active learning. Results show that our method achieves up to a 256-fold speedup and uses up to 13 times less space to reach the same accuracy as the baseline methods.

Dip-based Deep Embedded Clustering with k-Estimation

The combination of clustering with Deep Learning has gained much attention in recent years. Unsupervised neural networks like autoencoders can autonomously learn the essential structures in a data set. This idea can be combined with clustering objectives to learn relevant features automatically. Unfortunately, such approaches are often based on a k-means framework, from which they inherit various assumptions, such as spherically shaped clusters. Another assumption, also found in approaches outside the k-means family, is knowing the number of clusters a priori. In this paper, we present the novel clustering algorithm DipDECK, which estimates the number of clusters while simultaneously improving a deep-learning-based clustering objective. Additionally, we can cluster complex data sets without assuming only spherically shaped clusters. Our algorithm works by heavily overestimating the number of clusters in the embedded space of an autoencoder and, based on Hartigan's Dip-test - a statistical test for unimodality - analyses the resulting micro-clusters to determine which to merge. We show in extensive experiments the various benefits of our method: (1) we achieve competitive results while learning the clustering-friendly representation and number of clusters simultaneously; (2) our method is robust regarding parameters, stable in performance, and allows for more flexibility in the cluster shape; (3) we outperform relevant competitors in the estimation of the number of clusters.

Large-Scale Data-Driven Airline Market Influence Maximization

We present a prediction-driven optimization framework to maximize market influence in the US domestic air passenger transportation market by adjusting flight frequencies. At the lower level, our neural networks consider a wide variety of features, such as classical air carrier performance features and transportation network features, to predict market influence. On top of the prediction models, we define a budget-constrained flight frequency optimization problem to maximize market influence over 2,262 routes. This problem falls into the category of non-linear optimization problems, which cannot be solved exactly by conventional methods. To this end, we present a novel adaptive gradient ascent (AGA) method. Our prediction models show two to eleven times better accuracy in terms of median root-mean-square error (RMSE) over baselines. In addition, our AGA optimization method runs 690 times faster with a better optimization result (in one of our largest-scale experiments) than a greedy algorithm.

Physical Equation Discovery Using Physics-Consistent Neural Network (PCNN) Under Incomplete Observability

Deep neural networks (DNNs) have been extensively applied to various fields, including physical-system monitoring and control. However, the high confidence level required in physical systems makes it hard for system operators to trust black-box DNNs. For example, a DNN can perform well on both training and testing data, yet fail when the physical system moves to operating points in a completely different range that never appeared in the historical records. To open the black box as much as possible, we propose a Physics-Consistent Neural Network (PCNN) for physical systems with the following properties: (1) PCNN can be shrunk to physical equations for sub-areas with full observability; (2) PCNN reduces unobservable areas into virtual nodes, leading to a reduced network for which PCNN can still represent the underlying physical equation via a specifically designed deep-shallow hierarchy; and (3) we theoretically prove that the shallow NN in the PCNN is convex with respect to the physical variables, leading to a set of convex optimizations that search for a physics-consistent initial guess for the PCNN. We also develop a physical rule-based approach for initial guesses, significantly shortening the search time for large systems. Comprehensive experiments on diverse systems are conducted to illustrate the outstanding performance of our PCNN.

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

Centralized Training with Decentralized Execution (CTDE) has been a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) and is widely used in many real applications. One of the major challenges in training is credit assignment, which aims to deduce the contribution of each agent from the global rewards. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches lack a thorough consideration of the complicated interactions among multiple agents, leading to unsuitable credit assignment and subsequently mediocre MARL results. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment that accounts for coalitions of agents. Specifically, the Shapley Value and its desired properties are leveraged in deep MARL to credit any combination of agents, which allows us to estimate the individual credit of each agent. Despite this capability, the main technical difficulty lies in the computational complexity of the Shapley Value, which grows factorially with the number of agents. We instead utilize an approximation method based on Monte Carlo sampling, which reduces the sample complexity while maintaining effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method significantly outperforms existing cooperative MARL algorithms and achieves the state of the art, with especially large margins on the harder tasks.
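
The permutation-sampling estimator below is a generic Monte Carlo approximation of Shapley values, with the coalition value function left as a toy stand-in; in the paper this role would be played by the counterfactual value of agent coalitions computed from the learned critic.

```python
# Generic Monte Carlo (permutation-sampling) estimator of Shapley values.
import random

def shapley_monte_carlo(agents, coalition_value, n_samples=1000):
    credits = {a: 0.0 for a in agents}
    for _ in range(n_samples):
        perm = random.sample(agents, len(agents))     # random ordering of agents
        prev = coalition_value(frozenset())
        coalition = set()
        for a in perm:
            coalition.add(a)
            cur = coalition_value(frozenset(coalition))
            credits[a] += cur - prev                  # marginal contribution of agent a
            prev = cur
    return {a: c / n_samples for a, c in credits.items()}

# Toy value function: value grows with coalition size; agent 0 contributes extra.
toy_value = lambda s: len(s) + (2.0 if 0 in s else 0.0)
print(shapley_monte_carlo(list(range(4)), toy_value, n_samples=2000))
```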

A Difficulty-Aware Framework for Churn Prediction and Intervention in Games

Users leaving the system without returning, known as user churn, is a severe negative signal in online games. Therefore, churn prediction and intervention are of great value for improving players' experiences and system performance. However, the problem has not been well studied in the game scenario; in particular, crucial factors such as game difficulty have not been considered in large-scale churn analysis. In this paper, a novel Difficulty-Aware Framework (DAF) for churn prediction and intervention is proposed. First, a Difficulty Flow is proposed for each user, which is utilized to derive the user's Personalized Perceived Difficulty during the game process. Then, a survival analysis model, D-Cox-Time, is designed to model the Dynamic Influence of Perceived Difficulty on player churn intention. Finally, the Personalized Perceived Difficulty (PPD) and Dynamic Difficulty Influence (DDI) are incorporated into churn prediction and intervention. The proposed DAF framework has been instantiated in a real-world puzzle game as an example of churn prediction and intervention. Extensive offline experiments show significant improvements in churn prediction by introducing difficulty-related features. In addition, we deploy an online intervention system that adjusts difficulty dynamically in the online game. A/B test results verify that the proposed intervention system significantly enhances user retention and engagement. To the best of our knowledge, this is the first framework in games that provides an in-depth understanding of, and leverages, dynamic and personalized perceived difficulty during game play, and it can easily be integrated with various churn prediction and intervention models.

Dimensionwise Separable 2-D Graph Convolution for Unsupervised and Semi-Supervised Learning on Graphs

Graph convolutional neural networks (GCNs) have been the model of choice for graph representation learning, mainly due to the effective design of graph convolution, which computes the representation of a node by aggregating those of its neighbors. However, existing GCN variants commonly use 1-D graph convolution that operates solely on the object link graph without exploring informative relational information among object attributes. This significantly limits their modeling capability and may lead to inferior performance on noisy and sparse real-world networks. In this paper, we explore 2-D graph convolution to jointly model object links and attribute relations for graph representation learning. Specifically, we propose a computationally efficient dimensionwise separable 2-D graph convolution (DSGC) for filtering node features. Theoretically, we show that DSGC can reduce the intra-class variance of node features along both the object dimension and the attribute dimension to learn more effective representations. Empirically, we demonstrate that by modeling attribute relations, DSGC achieves significant performance gains over state-of-the-art methods for node classification and clustering on a variety of real-world networks. The source code for reproducing the experimental results is available at https://github.com/liqimai/DSGC.
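
A minimal NumPy sketch of the 2-D idea: the feature matrix is filtered along the object dimension by the (normalized) object-link graph and along the attribute dimension by an attribute-affinity graph. The normalization and the crude correlation-based affinity below are assumptions for illustration, not the exact DSGC operator.

```python
# Minimal sketch of a 2-D graph convolution over objects (rows) and attributes (columns).
import numpy as np

def normalize_adj(A):
    A = A + np.eye(A.shape[0])                     # add self-loops
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt             # symmetric normalization

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))                     # nodes x attributes
A_obj = (rng.random((100, 100)) < 0.05).astype(float)
A_obj = np.maximum(A_obj, A_obj.T)                 # symmetric object-link graph
A_attr = (np.corrcoef(X.T) > 0.3).astype(float)    # crude attribute-affinity graph

# Filter along both dimensions: objects on the left, attributes on the right.
H = normalize_adj(A_obj) @ X @ normalize_adj(A_attr)
```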

An Efficient and Scalable Algorithm for Estimating Kemeny's Constant of a Markov Chain on Large Graphs

The mean hitting time of a Markov chain on a graph from an arbitrary node to a target node chosen randomly according to its stationary distribution is called Kemeny's constant, an important metric for network analysis with a wide range of applications. It is, however, still computationally expensive to evaluate Kemeny's constant, especially for large graphs, since it requires computing the spectrum of the corresponding transition matrix or its normalized Laplacian matrix. In this paper, we propose a simple yet computationally efficient Monte Carlo algorithm to approximate Kemeny's constant, which is equipped with an (ε, δ)-approximation estimator. Thanks to its inherent algorithmic parallelism, we are able to develop a parallel implementation on a GPU to speed up the computation. We provide extensive experimental results on 13 real-world graphs to demonstrate the computational efficiency and scalability of our algorithm, which achieves up to 500x speed-up over the state-of-the-art algorithm. We further present practical enhancements to make our algorithm ready for use in real-world settings.
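
The quantity being estimated can be illustrated with a simple (non-parallel) Monte Carlo sketch: average the hitting time of a random walk from an arbitrary start node to a target drawn from the stationary distribution. The chain below is a toy example; the paper's (ε, δ) estimator and GPU implementation are not reproduced here.

```python
# Sketch of a naive Monte Carlo estimate of Kemeny's constant.
import numpy as np

def kemeny_mc(P, pi, n_walks=2_000, rng=np.random.default_rng(0)):
    n = P.shape[0]
    total = 0
    for _ in range(n_walks):
        v = rng.integers(n)                    # arbitrary start node
        target = rng.choice(n, p=pi)           # target ~ stationary distribution
        steps = 0
        while v != target:
            v = rng.choice(n, p=P[v])          # one random-walk step
            steps += 1
        total += steps
    return total / n_walks                     # average hitting time

# Tiny 3-state chain whose stationary distribution is (0.25, 0.5, 0.25).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])
print(kemeny_mc(P, pi))
```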

Structure-aware Interactive Graph Neural Networks for the Prediction of Protein-Ligand Binding Affinity

Drug discovery often relies on the successful prediction of protein-ligand binding affinity. Recent advances have shown great promise in applying graph neural networks (GNNs) for better affinity prediction by learning the representations of protein-ligand complexes. However, existing solutions usually treat protein-ligand complexes as topological graph data, thus the biomolecular structural information is not fully utilized. The essential long-range interactions among atoms are also neglected in GNN models. To this end, we propose a structure-aware interactive graph neural network (SIGN) which consists of two components: polar-inspired graph attention layers (PGAL) and pairwise interactive pooling (PiPool). Specifically, PGAL iteratively performs the node-edge aggregation process to update embeddings of nodes and edges while preserving the distance and angle information among atoms. Then, PiPool is adopted to gather interactive edges with a subsequent reconstruction loss to reflect the global interactions. Exhaustive experimental study on two benchmarks verifies the superiority of SIGN.

Mitigating Performance Saturation in Neural Marked Point Processes: Architectures and Loss Functions

Attributed event sequences are commonly encountered in practice. A recent line of research focuses on combining neural networks with marked point processes, the conventional statistical tool for attributed event sequences. Neural marked point processes possess the interpretability of probabilistic models as well as the representational power of neural networks. However, we find that the performance of neural marked point processes does not always increase as the network architecture becomes larger and more complicated, a phenomenon we call performance saturation. This is because the generalization error of neural marked point processes is determined by both the representational ability of the network and the model specification. We therefore draw two major conclusions: first, simple network structures can perform no worse than complicated ones in some cases; second, using a proper probabilistic assumption is equally, if not more, important than increasing the complexity of the network. Based on this observation, we propose a simple graph-based network structure called GCHP, which uses only graph convolutional layers and can therefore be easily parallelized. We directly consider the distribution of interarrival times instead of imposing a specific assumption on the conditional intensity function, and propose a likelihood ratio loss with a moment matching mechanism for optimization and model selection. Experimental results show that GCHP significantly reduces training time and that the likelihood ratio loss with interarrival time probability assumptions can greatly improve model performance.

FedRS: Federated Learning with Restricted Softmax for Label Distribution Non-IID Data

Federated Learning (FL) aims to generate a globally shared model via collaboration among decentralized clients under privacy constraints. Unlike standard distributed optimization, FL takes multiple optimization steps on local clients and then aggregates the model updates via a parameter server. Although this significantly reduces communication costs, the non-iid data across heterogeneous devices can make the local updates diverge considerably, posing a fundamental challenge to aggregation. In this paper, we focus on a special kind of non-iid scenario, label distribution skew, where each client can only access a partial subset of the whole class set. Considering that the top layers of neural networks are more task-specific, we advocate that the last classification layer is more vulnerable to shifts in label distribution. Hence, we study the classifier layer in depth and point out that the standard softmax encounters several problems caused by missing classes. As an alternative, we propose a "Restricted Softmax" that limits the updates of the missing classes' weights during the local procedure. Our proposed FedRS is very easy to implement with only a few lines of code. We investigate our methods on both public datasets and a real-world service awareness application. Abundant experimental results verify the superiority of our methods.
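
A hedged sketch of a restricted-softmax-style local objective: logits of classes absent from the client are scaled by a factor α < 1 before the cross-entropy, so the corresponding classifier weights receive damped updates during local training. The exact scaling scheme and the value of α here are illustrative assumptions.

```python
# Sketch of a "restricted softmax" local loss for label-distribution-skewed clients.
import torch
import torch.nn.functional as F

def restricted_ce_loss(logits, labels, local_classes, alpha=0.5):
    # alpha < 1 damps the logits (and hence weight updates) of classes this
    # client never observes; locally present classes keep a scale of 1.
    num_classes = logits.size(1)
    scale = torch.full((num_classes,), alpha, device=logits.device)
    scale[list(local_classes)] = 1.0
    return F.cross_entropy(logits * scale, labels)

logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 3, (8,))               # this client only holds classes {0, 1, 2}
loss = restricted_ce_loss(logits, labels, local_classes={0, 1, 2})
loss.backward()
```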

Efficient Collaborative Filtering via Data Augmentation and Step-size Optimization

As a popular approach to collaborative filtering, matrix factorization (MF) models the underlying rating matrix as a product of two factor matrices, one for users and one for items. The MF model can be learned by Alternating Least Squares (ALS), which updates the two factor matrices alternately, keeping one fixed while updating the other. Although ALS improves the learning objective aggressively in each iteration, it suffers from high computational cost due to the necessity of inverting a separate matrix for every user and item. The softImpute-ALS method reduces the per-iteration computation significantly using a strategy that requires only two matrix inversions; however, this computational saving comes with a smaller objective improvement per iteration. In this paper, we introduce a new algorithm, termed Data Augmentation with Optimal Step-size (DAOS), which alleviates the drawback of softImpute-ALS while still maintaining its low per-iteration computational cost. DAOS is presented in a setting where the factor matrices may include fixed columns or rows, allowing bias terms and/or linear models to be incorporated into the MF model. Experimental results on synthetic data and the MovieLens 1M dataset demonstrate the benefits of DAOS over ALS and softImpute-ALS in terms of generalization performance and computational time.

Learning Multiple Stock Trading Patterns with Temporal Routing Adaptor and Optimal Transport

Successful quantitative investment usually relies on precise predictions of the future movement of the stock price. Recently, machine learning based solutions have shown their capacity to give more accurate stock predictions and have become indispensable components of modern quantitative investment systems. However, the i.i.d. assumption behind existing methods is inconsistent with the existence of diverse trading patterns in the stock market, which inevitably limits their ability to achieve better stock prediction performance. In this paper, we propose a novel architecture, Temporal Routing Adaptor (TRA), to empower existing stock prediction models with the ability to model multiple stock trading patterns. Essentially, TRA is a lightweight module that consists of a set of independent predictors for learning multiple patterns as well as a router that dispatches samples to different predictors. Nevertheless, the lack of explicit pattern identifiers makes it quite challenging to train an effective TRA-based model. To tackle this challenge, we further design a learning algorithm based on Optimal Transport (OT) to obtain the optimal sample-to-predictor assignment and effectively optimize the router with such assignment through an auxiliary loss term. Experiments on the real-world stock ranking task show that, compared to state-of-the-art baselines such as Attention LSTM and Transformer, the proposed method improves the information coefficient (IC) from 0.053 to 0.059 and from 0.051 to 0.056, respectively. Our dataset and code used in this work are publicly available: https://github.com/microsoft/qlib.

What Do You See?: Evaluation of Explainable Artificial Intelligence (XAI) Interpretability through Neural Backdoors

EXplainable AI (XAI) methods have been proposed to interpret how a deep neural network predicts inputs through model saliency explanations that highlight the input parts deemed important to arrive at a decision for a specific target. However, it remains challenging to quantify the correctness of their interpretability as current evaluation approaches either require subjective input from humans or incur high computation cost with automated evaluation. In this paper, we propose backdoor trigger patterns--hidden malicious functionalities that cause misclassification--to automate the evaluation of saliency explanations. Our key observation is that triggers provide ground truth for inputs to evaluate whether the regions identified by an XAI method are truly relevant to its output. Since backdoor triggers are the most important features that cause deliberate misclassification, a robust XAI method should reveal their presence at inference time. We introduce three complementary metrics for the systematic evaluation of explanations that an XAI method generates. We evaluate seven state-of-the-art model-free and model-specific post-hoc methods through 36 models trojaned with specifically crafted triggers using color, shape, texture, location, and size. We found six methods that use local explanation and feature relevance fail to completely highlight trigger regions, and only a model-free approach can uncover the entire trigger region. We made our code available at https://github.com/yslin013/evalxai.

Multi-view Correlation based Black-box Adversarial Attack for 3D Object Detection

Deep neural networks have made tremendous progress in 3D object detection, an important task especially in autonomous driving scenarios. Benefiting from breakthroughs in deep learning and sensor technologies, 3D object detection methods based on different sensors, such as cameras and LiDAR, have developed rapidly. Meanwhile, more and more research has noticed that the abundant information contained in multi-view data can be used to obtain a more accurate understanding of the 3D surrounding environment, and many sensor-fusion 3D object detection methods have been proposed. As safety is critical in autonomous driving and deep neural networks are known to be vulnerable to adversarial examples with visually imperceptible perturbations, it is important to investigate adversarial attacks on 3D object detection. Recent works have shown that both image-based and LiDAR-based networks can be attacked by adversarial examples, while attacks on sensor-fusion models, which tend to be more robust, have not been studied. To this end, we propose a simple multi-view correlation based adversarial attack method for camera-LiDAR fusion 3D object detection models and focus on the black-box attack setting, which is more practical in real-world systems. Specifically, we first design a generative network to generate image adversarial examples based on an auxiliary image semantic segmentation network. Then, we develop a cross-view perturbation projection method that exploits the camera-LiDAR correlations to map each image adversarial example to the point cloud space, forming the corresponding point cloud adversarial example in the LiDAR view. Extensive experiments on the KITTI dataset demonstrate the effectiveness of the proposed method.

ControlBurn: Feature Selection by Sparse Forests

Tree ensembles distribute feature importance evenly amongst groups of correlated features. The average feature ranking of the correlated group is suppressed, which reduces interpretability and complicates feature selection. In this paper we present ControlBurn, a feature selection algorithm that uses a weighted LASSO-based feature selection method to prune unnecessary features from tree ensembles, just as low-intensity fire reduces overgrown vegetation. Like the linear LASSO, ControlBurn assigns all the feature importance of a correlated group of features to a single feature. Moreover, the algorithm is efficient and only requires a single training iteration to run, unlike iterative wrapper-based feature selection methods. We show that ControlBurn performs substantially better than feature selection methods with comparable computational costs on datasets with correlated features.
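
The sketch below illustrates the LASSO-over-trees idea in scikit-learn terms: each tree's predictions form one column, an L1-penalized linear model selects a sparse sub-forest, and the features split on by the selected trees are kept. The per-tree complexity weights used by the actual algorithm are omitted, and the penalty strength is an arbitrary assumption.

```python
# Sketch of selecting features via an L1 penalty over the trees of an ensemble.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Columns = per-tree predictions; the L1 penalty picks a small "sub-forest".
tree_preds = np.column_stack([t.predict(X) for t in forest.estimators_])
lasso = Lasso(alpha=1.0).fit(tree_preds, y)
kept_trees = [t for t, w in zip(forest.estimators_, lasso.coef_) if abs(w) > 1e-8]

# Selected features = union of features split on by the kept trees (-2 marks leaves).
selected = sorted({f for t in kept_trees for f in t.tree_.feature if f >= 0})
print(len(kept_trees), selected)
```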

Reinforced Anchor Knowledge Graph Generation for News Recommendation Reasoning

News recommendation systems play a key role in online news reading services. Knowledge graphs (KG), which contain comprehensive structural knowledge, are well known for their potential to enhance both accuracy and explainability. While existing works intensively study using KG to improve news recommendation accuracy, using KG for news recommendation reasoning has not been fully explored. A few works such as KPRN [18], [22] and ADAC [25] have discussed knowledge reasoning in other recommendation domains such as music and movies, but their methods are not practical for news. How to make reasoning scalable to generic KGs, easy to deploy for real-time serving, and meanwhile elastic for both the recall and ranking stages remains an open question.

In this paper, we fill the research gap by proposing a novel recommendation reasoning paradigm, AnchorKG. For each article, AnchorKG generates a compact anchor knowledge graph, which corresponds to a subset of entities and their k-hop neighbors in the KG, capturing the most important knowledge of the article. On one hand, the anchor graph can be used to enhance the latent representation of the article. On the other hand, the interaction between two anchor graphs can be used for reasoning. We develop a reinforcement learning-based framework to train the anchor graph generator, which has three major components: joint learning of recommendation and reasoning, sophisticated reward signals, and a warm-up learning stage. We conduct experiments on one public dataset and one private dataset. Results demonstrate that the AnchorKG framework not only improves recommendation accuracy, but also provides high-quality knowledge-aware reasoning. We release the source code at https://github.com/danyang-liu/AnchorKG.

Signed Graph Neural Network with Latent Groups

Signed graph representation learning is an effective approach to analyzing the complex patterns in real-world signed graphs, where positive and negative links co-exist. Most previous signed graph representation learning methods resort to balance theory, a classic social theory originating from psychology, as the core assumption. However, since balance theory is equivalent to the simple assumption that nodes can be divided into two conflicting groups, it fails to model the structure of real signed graphs. To solve this problem, we propose the Group Signed Graph Neural Network (GS-GNN) model for signed graph representation learning beyond the balance-theory assumption. GS-GNN has a dual GNN architecture consisting of a global and a local module. In the global module, we adopt a more general assumption that nodes can be divided into multiple latent groups with arbitrary relations among them, and we propose a novel prototype-based GNN to learn node representations under this assumption. In the local module, to give the model enough flexibility in modeling other factors, we make no prior assumptions, treat positive links and negative links as two independent relations, and adopt a relational GNN to learn node representations. The two modules complement each other, and their concatenated output is fed into downstream tasks. Extensive experimental results demonstrate the effectiveness of GS-GNN on both synthetic and real-world signed graphs, greatly and consistently outperforming all baselines and achieving new state-of-the-art results. Our implementation is available in PyTorch.

NewsEmbed: Modeling News through Pre-trained Document Representations

Effectively modeling text-rich fresh content such as news articles at the document level is a challenging problem. To ensure a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset whose size goes well beyond the scale of human labeling while achieving the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, and their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternately trains a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.

Neural-Answering Logical Queries on Knowledge Graphs

Logical queries constitute an important subset of questions posed in knowledge graph question answering systems. Yet, effectively answering logical queries on large knowledge graphs remains a highly challenging problem. Traditional subgraph matching based methods might suffer from the noise and incompleteness of the underlying knowledge graph, often with prolonged online response times. Recently, an alternative type of method has emerged whose key idea is to embed knowledge graph entities and the query in an embedding space so that the embedding of answer entities is close to that of the query. Compared with subgraph matching based methods, it can better handle noisy or missing information in the knowledge graph, with a faster online response. Promising as it might be, several fundamental limitations still exist, including the linear transformation assumption for modeling relations and the inability to answer complex queries with multiple variable nodes. In this paper, we propose an embedding based method (NewLook) to address these limitations. Our proposed method offers three major advantages. First (Applicability), it supports four types of logical operations and can answer queries with multiple variable nodes. Second (Effectiveness), the proposed NewLook goes beyond the linear transformation assumption, and thus consistently outperforms the existing methods. Third (Efficiency), compared with subgraph matching based methods, NewLook is at least 3 times faster in answering the queries; compared with existing embedding-based methods, NewLook has a comparable or even faster online response and offline training time.

Online Additive Quantization

Approximate nearest neighbor (ANN) search plays an important role in many applications, ranging from information retrieval and recommender systems to machine translation. Several ANN indexes, such as hashing and quantization, have been designed to be updated as the database evolves, but a remarkable performance gap remains between them and indexes retrained on the entire database. To close the gap, we propose an online additive quantization algorithm (online AQ) that dynamically updates quantization codebooks with incoming streaming data. We then derive a regret bound to theoretically guarantee the performance of the online AQ algorithm. Moreover, to improve learning efficiency, we develop a randomized block beam search algorithm for assigning each data point to the codewords of the codebook. Finally, we extensively evaluate the proposed online AQ algorithm on four real-world datasets, showing that it remarkably outperforms state-of-the-art baselines.

Tail-GNN: Tail-Node Graph Neural Networks

The prevalence of graph structures in real-world scenarios enables important tasks such as node classification and link prediction. Graphs in many domains follow a long-tailed distribution in their node degrees, i.e., a significant fraction of nodes are tail nodes with a small degree. Although recent graph neural networks (GNNs) can learn powerful node representations, they treat all nodes uniformly and are not tailored to the large group of tail nodes. In particular, there is limited structural information (i.e., links) on tail nodes, resulting in inferior performance. Toward robust tail node embedding, in this paper we propose a novel graph neural network called Tail-GNN. It hinges on the novel concept of transferable neighborhood translation, to model the variable ties between a target node and its neighbors. On one hand, Tail-GNN learns a neighborhood translation from the structurally rich head nodes (i.e., high-degree nodes), which can be further transferred to the structurally limited tail nodes to enhance their representations. On the other hand, the ties with the neighbors are variable across different parts of the graph, and a global neighborhood translation is inflexible. Thus, we devise a node-wise adaptation to localize the global translation w.r.t. each node. Extensive experiments on five benchmark datasets demonstrate that our proposed Tail-GNN significantly outperforms the state-of-the-art baselines.

Dialogue Based Disease Screening Through Domain Customized Reinforcement Learning

In this paper, we study the problem of leveraging dialogue agents learned from reinforcement learning (RL) that can interact with patients for automatic disease screening. This application requires efficient and effective inquiry of appropriate symptoms to make accurate diagnosis recommendations. Existing studies have tried to use RL to perform both symptom inquiry and diagnosis simultaneously, which needs to deal with a large, heterogeneous action space that affects the learning efficiency and effectiveness. To address the challenge, we propose to leverage the models learned from the dialogue data to customize the settings of the reinforcement learning for more efficient action space exploration. In particular, a supervised diagnosis model is built and involved in the definition of state and reward. We also develop the clustering method to form a hierarchy in the action space. These customizations can make the learning task focus on checking the most relevant symptoms, which effectively boost the confidence of diagnosis. Besides, a novel hierarchical reinforcement learning framework with the pretraining strategy is used to reduce the dimension of action space and help the model to converge. For empirical evaluations, we conduct extensive experiments on both synthetic and real-world datasets. The results have demonstrated the superiority of our approach in diagnostic accuracy and interaction efficiency compared with other baseline methods.

HGK-GNN: Heterogeneous Graph Kernel based Graph Neural Networks

While Graph Neural Networks (GNNs) have achieved remarkable results in a variety of applications, recent studies exposed important shortcomings in their ability to capture heterogeneous structures and attributes of an underlying graph. Furthermore, though many Heterogeneous GNN (HGNN) variants have been proposed and have achieved state-of-the-art results, there is limited theoretical understanding of their properties. To this end, we introduce graph kernels to HGNNs and develop a Heterogeneous Graph Kernel-based Graph Neural Network (HGK-GNN). Specifically, we incorporate the Mahalanobis distance (MD) to build a Heterogeneous Graph Kernel (HGK) and incorporate it into deep neural architectures, thus obtaining a heterogeneous GNN with a heterogeneous aggregation scheme. We also mathematically bridge HGK-GNN to metapath-based HGNNs, which are the most popular and effective variants of HGNNs. We theoretically analyze HGK-GNN with respect to the indispensable Encoder and Aggregator components in metapath-based HGNNs, through which we provide a theoretical perspective for understanding the most popular HGNNs. To the best of our knowledge, we are the first to introduce HGK into the field of HGNNs, marking a first step toward theoretically understanding and analyzing HGNNs. Correspondingly, both graph and node classification experiments are used to evaluate HGK-GNN, which outperforms a wide range of baselines on six real-world datasets, supporting the analysis.

Leveraging Latent Features for Local Explanations

As the application of deep neural networks proliferates in numerous areas such as medical imaging, video surveillance, and self-driving cars, the need for explaining the decisions of these models has become a hot research topic, both at the global and local level. Locally, most explanation methods have focused on identifying the relevance of features, limiting the types of explanations possible. In this paper, we investigate a new direction by leveraging latent features to generate contrastive explanations; predictions are explained not only by highlighting aspects that are in themselves sufficient to justify the classification, but also by new aspects which, if added, will change the classification. The key contribution of this paper lies in how we add features to rich data in a formal yet humanly interpretable way that leads to meaningful results. Our new definition of "addition" uses latent features to move beyond the limitations of previous explanations and resolve an open question laid out in Dhurandhar et al. (2018), which creates local contrastive explanations but is limited to simple datasets such as grayscale images. The strength of our approach in creating intuitive explanations that are also quantitatively superior to other methods is demonstrated on three diverse image datasets (skin lesions, faces, and fashion apparel). A user study with 200 participants further exemplifies the benefits of contrastive information, which can be viewed as complementary to other state-of-the-art interpretability methods.

Are we really making much progress?: Revisiting, benchmarking and refining heterogeneous graph neural networks

Heterogeneous graph neural networks (HGNNs) have been blossoming in recent years, but the unique data processing and evaluation setups used by each work obstruct a full understanding of their advancements. In this work, we present a systematic reproduction of 12 recent HGNNs using their official codes, datasets, settings, and hyperparameters, revealing surprising findings about the progress of HGNNs. We find that simple homogeneous GNNs, e.g., GCN and GAT, are largely underestimated due to improper settings; GAT with proper inputs can generally match or outperform all existing HGNNs across various scenarios. To facilitate robust and reproducible HGNN research, we construct the Heterogeneous Graph Benchmark (HGB), consisting of 11 diverse datasets with three tasks. HGB standardizes the process of heterogeneous graph data splits, feature processing, and performance evaluation. Finally, we introduce a simple but very strong baseline, Simple-HGN, which significantly outperforms all previous models on HGB, to accelerate the advancement of HGNNs in the future.

Graph Adversarial Attack via Rewiring

Graph Neural Networks (GNNs) have demonstrated their powerful capability in learning representations for graph-structured data. Consequently, they have enhanced the performance of many graph-related tasks such as node classification and graph classification. However, it is evident from recent studies that GNNs are vulnerable to adversarial attacks: their performance can be largely impaired by deliberately adding carefully crafted, unnoticeable perturbations to the graph. Existing attack methods often produce perturbations by adding or deleting a few edges, which might be noticeable even when the number of modified edges is small. In this paper, we propose a graph rewiring operation to perform the attack. It affects the graph in a less noticeable way compared to existing operations such as adding or deleting edges. We then utilize deep reinforcement learning to learn a strategy for effectively performing the rewiring operations. Experiments on real-world graphs demonstrate the effectiveness of the proposed framework. To understand the proposed framework, we further analyze how its generated perturbations impact the target model and the advantages of the rewiring operations. The implementation of the proposed framework is available at https://github.com/alge24/ReWatt.

BLOCKSET (Block-Aligned Serialized Trees): Reducing Inference Latency for Tree ensemble Deployment

We present methods to serialize and deserialize gradient-boosted trees and random forests that optimize inference latency when models are not loaded into memory. This arises when models are larger than memory, but also systematically when models are deployed on low-resource devices in the Internet of Things or run as cloud microservices where resources are allocated on demand. Block-Aligned Serialized Trees (BLOCKSET) introduce the concept of selective access for random forests and gradient-boosted trees, in which only the parts of the model needed for inference are deserialized and loaded into memory. BLOCKSET combines concepts from external-memory algorithms and data-parallel layouts of random forests that maximize I/O density for in-memory models. Using principles from external-memory algorithms, we block-align the serialization format in order to minimize the number of I/Os. For gradient-boosted trees, this results in a more than five-fold reduction in inference latency over layouts that do not perform selective access, and a two-fold latency reduction over techniques that are selective but do not encode I/O block boundaries in the layout.

Needle in a Haystack: Label-Efficient Evaluation under Extreme Class Imbalance

Important tasks like record linkage and extreme classification exhibit extreme class imbalance, with one minority instance for every million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve a statistically significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and a model for p(y|x), the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate an average MSE superior to the state of the art on fixed label budgets.
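
A minimal, non-adaptive importance-sampling sketch of the underlying mechanism: items are sent for labeling according to a proposal biased toward high model scores, and recall is estimated with weights that correct for the bias. The proposal q proportional to score squared, the thresholds, and the synthetic data are assumptions; the paper's adaptive, Dirichlet-tree-based procedure is not shown.

```python
# Sketch of importance-weighted, label-efficient evaluation under class imbalance.
import numpy as np

rng = np.random.default_rng(0)
N, p_pos = 200_000, 1e-3
y = (rng.random(N) < p_pos).astype(int)                 # hidden true labels
score = np.where(y == 1, rng.uniform(0.2, 1.0, N), rng.uniform(0.0, 0.2, N))
pred = (score >= 0.5).astype(int)                       # classifier under audit

q = score**2 / np.sum(score**2)                         # biased labeling proposal
idx = rng.choice(N, size=2_000, replace=True, p=q)      # items sent for labeling
w = 1.0 / (N * q[idx])                                  # importance weights vs. uniform

tp = np.mean(w * ((y[idx] == 1) & (pred[idx] == 1)))    # estimates P(y=1, pred=1)
pos = np.mean(w * (y[idx] == 1))                        # estimates P(y=1)
print("estimated recall:", tp / pos, "true recall:", pred[y == 1].mean())
```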

Temporal Graph Signal Decomposition

Temporal graph signals are multivariate time series with individual components associated with nodes of a fixed graph structure. Data of this kind arises in many domains including activity of social network users, sensor network readings over time, and time course gene expression within the interaction network of a model organism. Traditional matrix decomposition methods applied to such data fall short of exploiting structural regularities encoded in the underlying graph and also in the temporal patterns of the signal. How can we take into account such structure to obtain a succinct and interpretable representation of temporal graph signals?

We propose a general, dictionary-based framework for temporal graph signal decomposition (TGSD). The key idea is to learn a low-rank, joint encoding of the data via a combination of graph and time dictionaries. We propose a highly scalable decomposition algorithm for both complete and incomplete data, and demonstrate its advantage for matrix decomposition, imputation of missing values, temporal interpolation, clustering, period estimation, and rank estimation in synthetic and real-world data ranging from traffic patterns to social media activity. Our framework achieves 28% reduction in RMSE compared to baselines for temporal interpolation when as many as 75% of the observations are missing. It scales best among baselines taking under 20 seconds on 3.5 million data points and produces the most parsimonious models. To the best of our knowledge, TGSD is the first framework to jointly model graph signals by temporal and graph dictionaries.

Cross-Node Federated Graph Neural Network for Spatio-Temporal Data Modeling

The vast amount of data generated by networks of sensors, wearables, and Internet of Things (IoT) devices underscores the need for advanced modeling techniques that leverage the spatio-temporal structure of decentralized data, driven by the need for edge computation and by licensing (data access) issues. While federated learning (FL) has emerged as a framework for model training without requiring direct data sharing and exchange, effectively modeling the complex spatio-temporal dependencies to improve forecasting capabilities still remains an open problem. On the other hand, state-of-the-art spatio-temporal forecasting models assume unfettered access to the data, neglecting constraints on data sharing. To bridge this gap, we propose a federated spatio-temporal model -- Cross-Node Federated Graph Neural Network (CNFGNN) -- which explicitly encodes the underlying graph structure using a graph neural network (GNN)-based architecture under the constraint of cross-node federated learning, which requires that data in a network of nodes is generated locally on each node and remains decentralized. CNFGNN operates by disentangling the temporal dynamics modeling on devices and the spatial dynamics on the server, utilizing alternating optimization to reduce the communication cost and facilitate computations on the edge devices. Experiments on the traffic flow forecasting task show that CNFGNN achieves the best forecasting performance in both transductive and inductive learning settings with no extra computation cost on edge devices, while incurring modest communication cost.

MULTIVERSE: Mining Collective Data Science Knowledge from Code on the Web to Suggest Alternative Analysis Approaches

Data analyses are based on a series of "decision points" including data filtering, feature operationalization and selection, model specification, and parametric assumptions. "Multiverse Analysis" research has shown that a lack of exploration of these decisions can lead to non-robust conclusions based on highly sensitive decision points. Importantly, even if myopic analyses are technically correct, analysts' focus on one set of decision points precludes them from exploring alternate formulations that may produce very different results. Prior work has also shown that analysts' exploration is often limited based on their training, domain, and personal experience. However, supporting analysts in exploring alternative approaches is challenging and typically requires expert feedback that is costly and hard to scale.

Here, we formulate the tasks of identifying decision points and suggesting alternative analysis approaches as a classification task and a sequence-to-sequence prediction task, respectively. We leverage public collective data analysis knowledge in the form of code submissions to the popular data science platform Kaggle to build the first predictive model which supports Multiverse Analysis. Specifically, we mine this code repository for 70k small differences between 40k submissions, and demonstrate that these differences often highlight key decision points and alternative approaches in their respective analyses. We leverage information on relationships within libraries through neural graph representation learning in a multitask learning framework. We demonstrate that our model, MULTIVERSE, is able to correctly predict decision points with up to 0.81 ROC AUC and alternative code snippets with up to 50.3% GLEU, and that it performs favorably compared to a suite of baselines and ablations. We show that when our model has perfect information about the location of decision points, say provided by the analyst, its performance increases significantly from 50.3% to 73.4% GLEU. Finally, we show through a human evaluation that real data analysts find alternatives provided by MULTIVERSE to be more reasonable, acceptable, and syntactically correct than alternatives from comparable baselines, including other transformer-based seq2seq models.

DeGNN: Improving Graph Neural Networks with Graph Decomposition

Mining from graph-structured data is an integral component of graph data management. A recent trending technique, the graph convolutional network (GCN), has gained momentum in the graph mining field and plays an essential part in numerous graph-related tasks. Although the emerging GCN optimization techniques bring improvements to specific scenarios, they perform diversely in different applications and introduce many trial-and-error costs for practitioners. Moreover, existing GCN models often suffer from the oversmoothing problem. Besides, the entanglement of various graph patterns could lead to non-robustness and harm the final performance of GCNs. In this work, we propose a simple yet efficient graph decomposition approach to improve the performance of general graph neural networks. We first empirically study existing graph decomposition methods and propose an automatic connectivity-aware graph decomposition algorithm, DeGNN. To provide a theoretical explanation, we then characterize GCNs from an information-theoretic perspective and show that under certain conditions, the mutual information between the output after l layers and the input of a GCN converges to 0 exponentially in l. On the other hand, we show that graph decomposition can potentially weaken the conditions for such a convergence rate, alleviating the information loss when the GCN becomes deeper. Extensive experiments on various academic benchmarks and real-world production datasets demonstrate that graph decomposition generally boosts the performance of GNN models. Moreover, our proposed solution DeGNN achieves state-of-the-art performance on almost all these tasks.

Semi-Supervised Deep Learning for Multiplex Networks

Multiplex networks are complex graph structures in which a set of entities are connected to each other via multiple types of relations, each relation representing a distinct layer. Such graphs are used to investigate many complex biological, social, and technological systems. In this work, we present a novel semi-supervised approach for structure-aware representation learning on multiplex networks. Our approach relies on maximizing the mutual information between local node-wise patch representations and label correlated structure-aware global graph representations to model the nodes and cluster structures jointly. Specifically, it leverages a novel cluster-aware, node-contextualized global graph summary generation strategy for effective joint-modeling of node and cluster representations across the layers of a multiplex network. Empirically, we demonstrate that the proposed architecture outperforms state-of-the-art methods in a range of tasks: classification, clustering, visualization, and similarity search on seven real-world multiplex networks for various experiment settings.

Scalable Hierarchical Agglomerative Clustering

The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to the non-parametric DP-Means objective. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state-of-the-art.

An Efficient Framework for Balancing Submodularity and Cost

In the classical selection problem, the input consists of a collection of elements and the goal is to pick a subset of elements from the collection such that some objective function ƒ is maximized. This problem has been studied extensively in the data-mining community and it has multiple applications including influence maximization in social networks, team formation and recommender systems. A particularly popular formulation that captures the needs of many such applications is one where the objective function ƒ is a monotone and non-negative submodular function. In these cases, the corresponding computational problem can be solved using a simple greedy (1-1/e)-approximation algorithm.

In this paper, we consider a generalization of the above formulation where the goal is to maximize the submodular function ƒ minus a linear cost function. This formulation appears as a more natural one, particularly when one needs to strike a balance between the value of the objective function and the cost being paid in order to pick the selected elements. We address variants of this problem both in an offline setting, where the collection is known a priori, as well as in online settings, where the elements of the collection arrive in an online fashion. We demonstrate that by using simple variants of the standard greedy algorithm (used for submodular optimization) we can design algorithms that have provable approximation guarantees, are extremely efficient, and work very well in practice.
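
One simple greedy variant in this spirit can be sketched as follows (an illustrative cost-benefit greedy on a toy weighted-coverage instance, not necessarily the exact algorithm analyzed in the paper): add the element whose marginal submodular gain minus cost is largest, and stop when no element yields a positive gain.

    # Cost-benefit greedy sketch for maximizing f(S) - cost(S) on a toy coverage instance.
    def greedy_submodular_minus_cost(universe, f, cost):
        """Repeatedly add the element with the largest positive gain f(S+e) - f(S) - cost(e)."""
        S = set()
        while True:
            best, best_gain = None, 0.0
            for e in universe - S:
                gain = f(S | {e}) - f(S) - cost[e]
                if gain > best_gain:
                    best, best_gain = e, gain
            if best is None:
                return S
            S.add(best)

    sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4, 5}}
    cost = {"a": 1.0, "b": 0.5, "c": 2.0, "d": 3.5}
    f = lambda S: len(set().union(*(sets[e] for e in S))) if S else 0   # coverage: monotone submodular
    print(greedy_submodular_minus_cost(set(sets), f, cost))             # -> {'a', 'b'}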

Filtration Curves for Graph Representation

The two predominant approaches to graph comparison in recent years are based on (i) enumerating matching subgraphs or (ii) comparing neighborhoods of nodes. In this work, we complement these two perspectives with a third way of representing graphs: using filtration curves from topological data analysis that capture both edge weight information and global graph structure. Filtration curves are highly efficient to compute and lead to expressive representations of graphs, which we demonstrate on graph classification benchmark datasets. Our work opens the door to a new form of graph representation in data mining.
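
One concrete filtration-curve descriptor can be computed in a few lines (the descriptor tracked here, the number of connected components as edges are added in order of weight, and the toy graph are assumptions; the paper considers a family of such curves):

    # One possible filtration curve: number of connected components as edges are
    # added in increasing order of weight (union-find on a toy weighted graph).
    def filtration_curve(n_nodes, weighted_edges):
        parent = list(range(n_nodes))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]        # path halving
                x = parent[x]
            return x

        curve, components = [], n_nodes
        for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
            curve.append((w, components))
        return curve

    edges = [(0, 1, 0.2), (1, 2, 0.5), (3, 4, 0.7), (2, 3, 0.9)]
    print(filtration_curve(5, edges))                # [(0.2, 4), (0.5, 3), (0.7, 2), (0.9, 1)]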

Dynamic Hawkes Processes for Discovering Time-evolving Communities' States behind Diffusion Processes

Sequences of events including infectious disease outbreaks, social network activities, and crimes are ubiquitous, and the data on such events carry essential information about the underlying diffusion processes between communities (e.g., regions, online user groups). Modeling diffusion processes and predicting future events are crucial in many applications including epidemic control, viral marketing, and predictive policing. Hawkes processes offer a central tool for modeling the diffusion processes, in which the influence from past events is described by the triggering kernel. However, the triggering kernel parameters, which govern how each community is influenced by past events, are assumed to be static over time. In the real world, diffusion processes depend not only on the influences from the past, but also on the current (time-evolving) states of the communities, e.g., people's awareness of the disease and people's current interests. In this paper, we propose a novel Hawkes process model that is able to capture the underlying dynamics of community states behind the diffusion processes and predict the occurrences of events based on these dynamics. Specifically, we model the latent dynamic function that encodes these hidden dynamics by a mixture of neural networks. Then we design the triggering kernel using the latent dynamic function and its integral. The proposed method, termed DHP (Dynamic Hawkes Processes), offers a flexible way to learn complex representations of the time-evolving communities' states, while at the same time allowing exact likelihood computation, which makes parameter learning tractable. Extensive experiments on four real-world event datasets show that DHP outperforms five widely adopted methods for event prediction.
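
For reference, the classic Hawkes building block that DHP generalizes, an intensity with a static exponential triggering kernel and its exact log-likelihood, can be sketched as follows (the parameters and event times are toy assumptions; DHP replaces the static kernel with one driven by a learned latent dynamic function):

    # Classic univariate Hawkes process with a static exponential triggering kernel
    # alpha * exp(-beta * dt); parameters and event times are toy assumptions.
    import numpy as np

    def hawkes_exp_loglik(times, T, mu, alpha, beta):
        loglik, decay, prev = 0.0, 0.0, 0.0
        for t in times:
            decay = decay * np.exp(-beta * (t - prev))   # running sum of kernel contributions
            loglik += np.log(mu + alpha * decay)         # log-intensity at the event time
            decay += 1.0
            prev = t
        # Compensator: integral of the intensity over [0, T].
        compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - np.asarray(times))))
        return loglik - compensator

    events = [0.5, 1.2, 1.3, 2.8, 4.0]
    print(hawkes_exp_loglik(events, T=5.0, mu=0.5, alpha=0.4, beta=1.0))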

Explaining Algorithmic Fairness Through Fairness-Aware Causal Path Decomposition

Algorithmic fairness has attracted considerable interest in the data mining and machine learning communities recently. So far, the existing research has mostly focused on the development of quantitative metrics to measure algorithm disparities across different protected groups, and on approaches for adjusting the algorithm output to reduce such disparities. In this paper, we propose to study the problem of identifying the source of model disparities. Unlike existing interpretation methods which typically learn feature importance, we consider the causal relationships among feature variables and propose a novel framework to decompose the disparity into the sum of contributions from fairness-aware causal paths, which are paths linking the sensitive attribute and the final predictions, on the graph. We also consider the scenario when the directions on certain edges within those paths cannot be determined. Our framework is also model agnostic and applicable to a variety of quantitative disparity measures. Empirical evaluations on both synthetic and real-world data sets are provided to show that our method can provide precise and comprehensive explanations to the model disparities.

Toward Deep Supervised Anomaly Detection: Reinforcement Learning from Partially Labeled Anomaly Data

We consider the problem of anomaly detection with a small set of partially labeled anomaly examples and a large-scale unlabeled dataset. This is a common scenario in many important applications. Existing related methods either exclusively fit the limited anomaly examples that typically do not span the entire set of anomalies, or proceed with unsupervised learning from the unlabeled data. We propose here instead a deep reinforcement learning-based approach that enables an end-to-end optimization of the detection of both labeled and unlabeled anomalies. This approach learns the known abnormality by automatically interacting with an anomaly-biased simulation environment, while continuously extending the learned abnormality to novel classes of anomaly (i.e., unknown anomalies) by actively exploring possible anomalies in the unlabeled data. This is achieved by jointly optimizing the exploitation of the small labeled anomaly data and the exploration of the rare unlabeled anomalies. Extensive experiments on 48 real-world datasets show that our model significantly outperforms five state-of-the-art competing methods.

Fast and Accurate Partial Fourier Transform for Time Series Data

Given a time-series vector, how can we efficiently detect anomalies? A widely used method is to use the Fast Fourier Transform (FFT) to compute Fourier coefficients, take the first few coefficients while discarding the remaining small coefficients, and reconstruct the original time series to find points with large errors. Despite its pervasive use, the method requires computing all of the Fourier coefficients, which can be cumbersome if the input length is large or when many FFT operations need to be performed.
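
The baseline pipeline described above can be sketched in a few lines (the synthetic signal, number of retained coefficients, and threshold are assumptions):

    # FFT-based anomaly-detection baseline: keep the first few coefficients, reconstruct,
    # flag large residuals.
    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(1024)
    signal = np.sin(2 * np.pi * t / 256) + 0.05 * rng.standard_normal(t.size)
    signal[500] += 3.0                               # inject an anomaly

    coeffs = np.fft.rfft(signal)                     # full FFT (the step PFT aims to shortcut)
    coeffs[8:] = 0.0                                 # keep only the first few coefficients
    reconstruction = np.fft.irfft(coeffs, n=signal.size)
    errors = np.abs(signal - reconstruction)
    print("flagged indices:", np.where(errors > 5 * errors.std())[0])   # expected: [500]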

In this paper, we propose Partial Fourier Transform (PFT), an efficient and accurate algorithm for computing only a part of the Fourier coefficients. PFT approximates a part of the twiddle factors (trigonometric constants) using polynomials, thereby reducing the computational complexity due to the mixture of many twiddle factors. We derive the asymptotic time complexity of PFT with respect to input and output sizes, and tolerance. We also show that PFT provides an option to set an arbitrary approximation error bound, which is useful especially when fast evaluation is of utmost importance. Experimental results show that PFT outperforms the current state-of-the-art algorithms, with an order of magnitude speedup for sufficiently small output sizes without sacrificing accuracy. In addition, we demonstrate the accuracy and efficacy of PFT on real-world anomaly detection, with interpretations of anomalies in stock price data.

Faster and Generalized Temporal Triangle Counting, via Degeneracy Ordering

Triangle counting is a fundamental technique in network analysis that has received much attention in various input models. The vast majority of triangle counting algorithms are targeted to static graphs. Yet, many real-world graphs are directed and temporal, where edges come with timestamps. Temporal triangles yield much more information, since they account for both the graph topology and the timestamps.

Temporal triangle counting has seen a few recent results, but there are varying definitions of temporal triangles. In all cases, temporal triangle patterns enforce constraints on the time interval between edges (in the triangle). We define a general notion (δ1,3, δ1,2, δ2,3)-temporal triangles that allows for separate time constraints for all pairs of edges.

Our main result is a new algorithm, DOTTT (Degeneracy Oriented Temporal Triangle Totaler), that exactly counts all directed variants of (δ1,3, δ1,2, δ2,3)-temporal triangles. Using the classic idea of degeneracy ordering with careful combinatorial arguments, we can prove that DOTTT runs in O(mκ log m) time, where m is the number of (temporal) edges and κ is the graph degeneracy (max core number). Up to log factors, this matches the running time of the best static triangle counters. Moreover, this running time is better than that of existing temporal triangle counting algorithms.

DOTTT has excellent practical behavior and runs twice as fast as existing state-of-the-art temporal triangle counters (and is also more general). For example, DOTTT computes all types of temporal queries on a Bitcoin temporal network with half a billion edges in less than an hour on a commodity machine.

Local Algorithms for Estimating Effective Resistance

Effective resistance is an important metric that measures the similarity of two vertices in a graph. It has found applications in graph clustering, recommendation systems and network reliability, among others. Despite the importance of effective resistance, we still lack efficient algorithms to exactly compute or approximate it on massive graphs.

In this work, we design several local algorithms for estimating effective resistances, which are algorithms that only read a small portion of the input while still having provable performance guarantees. To illustrate, our main algorithm approximates the effective resistance between any vertex pair s,t with an arbitrarily small additive error ε in time O(poly (log n/ε)), whenever the underlying graph has bounded mixing time. We perform an extensive empirical study on several benchmark datasets, validating the performance of our algorithms.
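
As a simple point of reference (not the paper's local algorithm), effective resistance can be estimated by Monte Carlo through the identity that the expected commute time between s and t equals 2m times their effective resistance; the toy path graph below has a known answer of 2.0:

    # Monte Carlo reference: estimate R(s,t) from the identity C(s,t) = 2m * R(s,t),
    # where C is the expected commute time of a simple random walk.
    import random

    def commute_time_resistance(adj, s, t, trials=20_000, seed=0):
        rng = random.Random(seed)
        m = sum(len(v) for v in adj.values()) // 2   # number of edges
        total_steps = 0
        for _ in range(trials):
            for start, goal in ((s, t), (t, s)):     # walk s -> t, then t -> s
                node = start
                while node != goal:
                    node = rng.choice(adj[node])
                    total_steps += 1
        return total_steps / trials / (2 * m)

    adj = {0: [1], 1: [0, 2], 2: [1]}                # path 0 - 1 - 2; exact R(0,2) = 2.0
    print(commute_time_resistance(adj, 0, 2))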

Simple Yet Efficient Algorithms for Maximum Inner Product Search via Extreme Order Statistics

We present a novel dimensionality reduction method for the approximate maximum inner product search (MIPS), named CEOs, based on the theory of concomitants of extreme order statistics. Utilizing the asymptotic behavior of these concomitants, we show that a few projections associated with the extreme values of the query signature are enough to estimate inner products. This yields a sublinear approximate MIPS algorithm with search recall guarantee under a mild condition. The indexing space is exponential but optimal for the approximate MIPS on a unit sphere.

To deal with the exponential space complexity, we present practical variants, including CEOs-TA and coCEOs, that use near-linear indexing space and time. CEOs-TA exploits the threshold algorithm (TA) and provides superior search recalls to LSH-based MIPS solvers. coCEOs is a new data and dimension co-reduction technique that outperforms CEOs-TA and other competitive methods. Empirically, they are simple to implement and achieve at least a 100x speedup compared to brute-force search while returning top-10 MIPS results with a recall of at least 90% on many large-scale data sets.
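
A rough sketch of the underlying idea, as we read it, is given below (the projection count, number of extreme directions, and scoring rule are simplified assumptions rather than the paper's exact estimator): project data and query onto shared random Gaussian directions, look only at the directions where the query's projection is largest, and score points by their projection values (the concomitants) there.

    # Simplified sketch of the concomitants-of-extremes idea (not the exact CEOs estimator).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, D, s0, top_k = 5000, 64, 1024, 20, 10      # points, dim, projections, extremes, results

    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm data
    q = rng.standard_normal(d)
    q /= np.linalg.norm(q)

    R = rng.standard_normal((d, D))                  # shared random Gaussian projections
    data_proj = X @ R                                # index built once, offline
    q_proj = q @ R

    extremes = np.argsort(-q_proj)[:s0]              # directions where the query projects largest
    scores = data_proj[:, extremes].sum(axis=1)      # concomitant-based inner-product proxy
    candidates = np.argsort(-scores)[:top_k]

    exact = np.argsort(-(X @ q))[:top_k]
    print("recall@10:", len(set(candidates) & set(exact)) / top_k)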

MaNIACS: Approximate Mining of Frequent Subgraph Patterns through Sampling

We present MaNIACS, a sampling-based randomized algorithm for computing high-quality approximations of the collection of the subgraph patterns that are frequent in a single, large, vertex-labeled graph, according to the Minimum Node Image-based (MNI) frequency measure. The output of MaNIACS comes with strong probabilistic guarantees, obtained by using the empirical Vapnik-Chervonenkis (VC) dimension, a key concept from statistical learning theory, together with strong probabilistic tail bounds on the difference between the frequency of a pattern in the sample and its exact frequency. MaNIACS leverages properties of the MNI-frequency to aggressively prune the pattern search space, and thus to reduce the time spent in exploring subspaces containing no frequent patterns. In turn, this pruning leads to better bounds to the maximum frequency estimation error, which leads to increased pruning, resulting in a beneficial feedback effect. The results of our experimental evaluation of MaNIACS on real graphs show that it returns high-quality collections of frequent patterns in large graphs up to two orders of magnitude faster than the exact algorithm.

Learning to Recommend Visualizations from Data

Visualization recommendation is important for exploratory analysis and making sense of the data quickly by automatically recommending relevant visualizations to the user. In this work, we propose the first end-to-end ML-based visualization recommendation system that leverages a large corpus of datasets and their relevant visualizations to learn a visualization recommendation model automatically. Then, given a new unseen dataset from an arbitrary user, the model automatically generates visualizations for that new dataset, derives scores for the visualizations, and outputs a list of recommended visualizations to the user ordered by effectiveness. We also describe an evaluation framework to quantitatively evaluate visualization recommendation models learned from a large corpus of visualizations and datasets. Through quantitative experiments, a user study, and qualitative analysis, we show that our end-to-end ML-based system recommends more effective and useful visualizations compared to existing state-of-the-art rule-based systems.

Network-Wide Traffic States Imputation Using Self-interested Coalitional Learning

Accurate network-wide traffic state estimation is vital to many transportation operations and urban applications. However, existing methods often suffer from scalability issues when performing real-time inference at the city level, or are not robust enough under limited data. Currently, GPS trajectory data from probe vehicles has become a popular data source for many transportation applications. GPS trajectory data has a large coverage area, which is ideal for network-wide applications, but it also has the disadvantage of being sparse and highly heterogeneous across times and locations. In this study, we focus on developing a robust and interpretable network-wide traffic state imputation framework using partially observed traffic information. We introduce a new learning strategy, called self-interested coalitional learning (SCL), which forges cooperation between a main self-interested semi-supervised learning task and a discriminator acting as a critic to facilitate main task training while providing interpretability of the results. In our detailed model, we use a temporal graph convolutional variational autoencoder (TG-VAE) as the reconstructor, which models the complex spatio-temporal pattern in data and solves the main traffic state imputation task. A discriminator is introduced to output interpretable imputation confidence on the estimated results and also help to enhance the performance of the reconstructor. The framework is evaluated using a large GPS trajectory dataset from taxis in Jinan, China. Extensive experiments against state-of-the-art baselines demonstrate the effectiveness and robustness of the proposed method for network-wide traffic state estimation.

Retrieval & Interaction Machine for Tabular Data Prediction

Prediction over tabular data is an essential task in many data science applications such as recommender systems, online advertising, medical treatment, etc. Tabular data is structured into rows and columns, with each row as a data sample and each column as a feature attribute. Both the columns and rows of tabular data carry useful patterns that could improve the model prediction performance. However, most existing models focus on the cross-column patterns yet overlook the cross-row patterns as they deal with single samples independently. In this work, we propose a general learning framework named Retrieval & Interaction Machine (RIM) that fully exploits both cross-row and cross-column patterns among tabular data. Specifically, RIM first leverages search engine techniques to efficiently retrieve useful rows of the table to assist the label prediction of the target row, then uses feature interaction networks to capture the cross-column patterns among the target row and the retrieved rows so as to make the final label prediction. We conduct extensive experiments on 11 datasets of three important tasks, i.e., CTR prediction (classification), top-n recommendation (ranking) and rating prediction (regression). Experimental results show that RIM achieves significant improvements over the state-of-the-art and various baselines, demonstrating the superiority and efficacy of RIM.

ImGAGN: Imbalanced Network Embedding via Generative Adversarial Graph Networks

Imbalanced classification on graphs is ubiquitous yet challenging in many real-world applications, such as fraudulent node detection. Recently, graph neural networks (GNNs) have shown promising performance on many network analysis tasks. However, most existing GNNs have almost exclusively focused on balanced networks, and would perform poorly on imbalanced networks. To bridge this gap, in this paper, we present a generative adversarial graph network model, called ImGAGN, to address the imbalanced classification problem on graphs. It introduces a novel generator for graph structure data, named GraphGenerator, which can simulate both the minority class nodes' attribute distribution and the network topological structure distribution by generating a set of synthetic minority nodes such that the number of nodes in different classes can be balanced. Then a graph convolutional network (GCN) discriminator is trained to discriminate between real nodes and fake (i.e., generated) nodes, and also between minority nodes and majority nodes, on the synthetic balanced network. To validate the effectiveness of the proposed method, extensive experiments are conducted on four real-world imbalanced network datasets. Experimental results demonstrate that the proposed method ImGAGN outperforms state-of-the-art algorithms for the semi-supervised imbalanced node classification task.

Individual Treatment Prescription Effect Estimation in a Low Compliance Setting

Individual Treatment Effect (ITE) estimation is an extensively researched problem, with applications in various domains. We model the case where there exists heterogeneous non-compliance to a randomly assigned treatment, a typical situation in health (because of non-compliance to prescription) or digital advertising (because of competition and ad blockers, for instance). The lower the compliance, the more the signal of the treatment prescription effect - or individual prescription effect (IPE) - fades away and the harder it becomes to estimate. We propose a new approach for the estimation of the IPE that takes advantage of observed compliance information to prevent signal fading. Using the Structural Causal Model framework and do-calculus, we define a general mediated causal effect setting and propose a corresponding estimator which consistently recovers the IPE with asymptotic variance guarantees. Finally, we conduct experiments on both synthetic and real-world datasets that highlight the benefit of the approach, which consistently improves on the state-of-the-art in low compliance settings.

MTrajRec: Map-Constrained Trajectory Recovery via Seq2Seq Multi-task Learning

With the increasing adoption of GPS modules, there is a wide range of urban applications based on trajectory data analysis, such as vehicle navigation, travel time estimation, and driver behavior analysis. The effectiveness of urban applications relies greatly on high-sampling-rate trajectories precisely matched to the map. However, a large number of trajectories are collected at a low sampling rate in practice, due to communication loss and energy constraints. To enhance the trajectory data and support urban applications more effectively, many trajectory recovery methods have been proposed to infer the trajectories in free space. In addition, the recovered trajectory still needs to be mapped to the road network before it can be used in the applications. However, this two-stage pipeline, which first infers high-sampling-rate trajectories and then performs map matching, is inaccurate and inefficient. In this paper, we propose a Map-constrained Trajectory Recovery framework, MTrajRec, to recover the fine-grained points in trajectories and map match them on the road network in an end-to-end manner. MTrajRec implements a multi-task sequence-to-sequence learning architecture to predict road segment and moving ratio simultaneously. A constraint mask, an attention mechanism, and an attribute module are proposed to overcome the limits of the coarse grid representation and improve the performance. Extensive experiments based on large-scale real-world trajectory data confirm the effectiveness and efficiency of our approach.

ProtoPShare: Prototypical Parts Sharing for Similarity Discovery in Interpretable Image Classification

In this work, we introduce an extension to ProtoPNet called ProtoPShare which shares prototypical parts between classes. To obtain prototype sharing we prune prototypical parts using a novel data-dependent similarity. Our approach substantially reduces the number of prototypes needed to preserve baseline accuracy and finds prototypical similarities between classes. We show the effectiveness of ProtoPShare on the CUB-200-2011 and the Stanford Cars datasets and confirm the semantic consistency of its prototypical parts in a user study.

Spectral Clustering of Attributed Multi-relational Graphs

Graph clustering aims at discovering a natural grouping of the nodes such that similar nodes are assigned to a common cluster. Many different algorithms have been proposed in the literature: for simple graphs, for graphs with attributes associated to nodes, and for graphs where edges represent different types of relations among nodes. However, complex data in many domains can be represented as both attributed and multi-relational networks.

In this paper, we propose SpectralMix, a joint dimensionality reduction technique for multi-relational graphs with categorical node attributes. SpectralMix integrates all information available from the attributes, the different types of relations, and the graph structure to enable a sound interpretation of the clustering results. Moreover, it generalizes existing techniques: it reduces to spectral embedding and clustering when only applied to a single graph and to homogeneity analysis when applied to categorical data.

Experiments conducted on several real-world datasets enable us to detect dependencies between graph structure and categorical attributes; moreover, they show the superiority of SpectralMix over existing methods.

Identifying Coordinated Accounts on Social Media through Hidden Influence and Group Behaviours

Disinformation campaigns on social media, involving coordinated activities from malicious accounts towards manipulating public opinion, have become increasingly prevalent. Existing approaches to detect coordinated accounts either make very strict assumptions about coordinated behaviours, or require part of the malicious accounts in the coordinated group to be revealed in order to detect the rest. To address these drawbacks, we propose a generative model, AMDN-HAGE (Attentive Mixture Density Network with Hidden Account Group Estimation), which jointly models account activities and hidden group behaviours based on Temporal Point Processes (TPP) and a Gaussian Mixture Model (GMM), to capture the inherent characteristics of coordination, namely that accounts which coordinate must strongly influence each other's activities and collectively appear anomalous compared to normal accounts. To address the challenges of optimizing the proposed model, we provide a bilevel optimization algorithm with a theoretical guarantee of convergence. We verify the effectiveness of the proposed method and training algorithm on real-world social network data collected from Twitter related to coordinated campaigns from Russia's Internet Research Agency targeting the 2016 U.S. Presidential Elections, and use it to identify coordinated campaigns related to the COVID-19 pandemic. Leveraging the learned model, we find that the average influence between coordinated account pairs is the highest. On COVID-19, we found coordinated groups spreading anti-vaccination and anti-mask conspiracies that suggest the pandemic is a hoax and a political scam.

Learning Process-consistent Knowledge Tracing

Knowledge tracing (KT), which aims to trace students' changing knowledge state during their learning process, has improved students' learning efficiency in online learning systems. Recently, KT has attracted much research attention due to its critical significance in education. However, most of the existing KT methods pursue high accuracy of student performance prediction but neglect the consistency of students' changing knowledge state with their learning process. In this paper, we explore a new paradigm for the KT task and propose a novel model named Learning Process-consistent Knowledge Tracing (LPKT), which monitors students' knowledge state through directly modeling their learning process. Specifically, we first formalize the basic learning cell as the tuple (exercise, answer time, answer). Then, we deeply measure the learning gain as well as its diversity from the difference of the present and previous learning cells, their interval time, and students' related knowledge state. We also design a learning gate to distinguish students' absorptive capacity of knowledge. Besides, we design a forgetting gate to model the decline of students' knowledge over time, which is based on their previous knowledge state, present learning gains, and the interval time. Extensive experimental results on three public datasets demonstrate that LPKT obtains more reasonable knowledge states in line with the learning process. Moreover, LPKT also outperforms state-of-the-art KT methods on student performance prediction. Our work indicates a potential future research direction for KT, one that offers both high interpretability and accuracy.

Simple and Efficient Hard Label Black-box Adversarial Attacks in Low Query Budget Regimes

We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models based only on the output label (hard label) for a queried data input. We propose a simple and efficient Bayesian Optimization (BO) based approach for developing black-box adversarial attacks. Issues with BO's performance in high dimensions are avoided by searching for adversarial examples in a structured low-dimensional subspace. We demonstrate the efficacy of our proposed attack method by evaluating both ℓ∞ and ℓ2 norm constrained untargeted and targeted hard label black-box attacks on three standard datasets - MNIST, CIFAR-10, and ImageNet. Our proposed approach consistently achieves 2x to 10x higher attack success rates while requiring 10x to 20x fewer queries compared to the current state-of-the-art black-box adversarial attacks.

Fruit-fly Inspired Neighborhood Encoding for Classification

Inspired by the fruit-fly olfactory circuit, the Fly Bloom Filter is able to efficiently summarize the data with a single pass and has been used for novelty detection. We propose a new classifier that effectively encodes the different local neighborhoods for each class with a per-class Fly Bloom Filter. The inference on test data requires an efficient Flyhash [6] operation followed by a high-dimensional, but very sparse, dot product with the per-class Bloom Filters. On the theoretical side, we establish conditions under which the predictions of our proposed classifier agree with the predictions of the nearest neighbor classifier. We extensively evaluate our proposed scheme with 71 data sets of varied data dimensionality to demonstrate that the predictive performance of our proposed neuroscience-inspired classifier is competitive with nearest-neighbor classifiers and other single-pass classifiers.

Deep Clustering based Fair Outlier Detection

In this paper, we focus on the fairness issues regarding unsupervised outlier detection. Traditional algorithms, without a specific design for algorithmic fairness, could implicitly encode and propagate statistical bias in data and raise societal concerns. To correct such unfairness and deliver a fair set of potential outlier candidates, we propose Deep Clustering based Fair Outlier Detection (DCFOD), which learns a good representation for utility maximization while enforcing the learnable representation to be subgroup-invariant on the sensitive attribute. Considering the coupled and reciprocal nature between clustering and outlier detection, we leverage deep clustering to discover the intrinsic cluster structure and out-of-structure instances. Meanwhile, adversarial training erases the sensitive pattern from instances for fairness adaptation. Technically, we propose an instance-level weighted representation learning strategy to enhance the joint deep clustering and outlier detection, where the dynamic weight module re-emphasizes contributions of likely-inliers while mitigating the negative impact from outliers. Demonstrated by experiments on eight datasets compared to 17 outlier detection algorithms, our DCFOD method consistently achieves superior performance on both outlier detection validity and two types of fairness notions in outlier detection.

Robust Learning by Self-Transition for Handling Noisy Labels

Real-world data inevitably contains noisy labels, which induce poor generalization in deep neural networks. It is known that the network typically begins to rapidly memorize false-labeled samples after a certain point of training. Thus, to counter the label noise challenge, we propose a novel self-transitional learning method called MORPH, which automatically switches its learning phase at the transition point from seeding to evolution. In the seeding phase, the network is updated using all the samples to collect a seed of clean samples. Then, in the evolution phase, the network is updated using only the set of arguably clean samples, which precisely keeps expanding by the updated network. Thus, MORPH effectively avoids overfitting to false-labeled samples throughout the entire training period. Extensive experiments using five real-world or synthetic benchmark datasets demonstrate substantial improvements over state-of-the-art methods in terms of robustness and efficiency.

Triangle-aware Spectral Sparsifiers and Community Detection

Triangle-aware graph partitioning has proven to be a successful approach to finding communities in real-world data [8, 40, 51, 54]. But how can we explain its empirical success? Triangle-aware graph partitioning methods rely on the count of triangles an edge is contained in, in contrast to the well-established measure of effective resistance [12] that requires global information about the graph.

In this work, we advance the understanding of triangle-based graph partitioning in two ways. First, we introduce a novel triangle-aware sparsification scheme. Our scheme provably produces a spectral sparsifier with high probability [46, 47] on graphs that exhibit strong triadic closure, a hallmark property of real-world networks. Importantly, our sampling scheme is amenable to distributed computing, since it relies simply on computing node degrees and edge triangle counts. Finally, we compare our methods to the Spielman-Srivastava sparsification algorithm [46] on a wide variety of real-world graphs, and we verify the applicability of our proposed sparsification scheme on real-world networks.

Second, we develop a data-driven approach towards understanding properties of real-world communities with respect to effective resistances and triangle counts. Our empirical approach is mainly based on the notion of ground-truth communities in datasets made available originally by Yang and Leskovec [53]. We perform a study of triangle-aware measures and effective resistances on edges within and across communities, and we report several interesting empirical findings. For example, we observe that the Jaccard similarity of an edge used by Satuluri [40], and the closely related Tectonic similarity measure introduced by Tsourakakis et al. [51], provide consistently good signals of whether an edge is contained within a community or not.
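
An importance-sampling sparsifier in this spirit can be sketched as follows (the keeping probabilities and reweighting below are illustrative assumptions, not the scheme analyzed in the paper): count the triangles supporting each edge, keep an edge with a probability that grows with its triangle count, and reweight kept edges by the inverse probability so that edge weights are preserved in expectation.

    # Illustrative triangle-aware edge sampling on a toy graph (probabilities are assumptions).
    import random
    from itertools import combinations

    def triangle_counts(adj):
        """Number of triangles each edge participates in."""
        counts = {}
        for u in adj:
            for v, w in combinations(sorted(adj[u]), 2):
                if w in adj[v]:                          # u, v, w form a triangle
                    for a, b in ((u, v), (u, w), (v, w)):
                        e = (min(a, b), max(a, b))
                        counts[e] = counts.get(e, 0) + 1
        return {e: c // 3 for e, c in counts.items()}    # each triangle is found from all 3 vertices

    def sparsify(adj, base_p=0.3, seed=0):
        rng = random.Random(seed)
        tri = triangle_counts(adj)
        edges = {(min(u, v), max(u, v)) for u in adj for v in adj[u]}
        kept = {}
        for e in edges:
            p = min(1.0, base_p * (1 + tri.get(e, 0)))   # keep probability grows with triangle count
            if rng.random() < p:
                kept[e] = 1.0 / p                        # reweight so edge weights are unbiased
        return kept

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(sparsify(adj))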

Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression

Gradient Boosting Machines (GBM) are hugely popular for solving tabular data problems. However, practitioners are not only interested in point predictions, but also in probabilistic predictions in order to quantify the uncertainty of the predictions. Creating such probabilistic predictions is difficult with existing GBM-based solutions: they either require training multiple models or they become too computationally expensive to be useful for large-scale settings. We propose Probabilistic Gradient Boosting Machines (PGBM), a method to create probabilistic predictions with a single ensemble of decision trees in a computationally efficient manner. PGBM approximates the leaf weights in a decision tree as a random variable, and approximates the mean and variance of each sample in a dataset via stochastic tree ensemble update equations. These learned moments allow us to subsequently sample from a specified distribution after training. We empirically demonstrate the advantages of PGBM compared to existing state-of-the-art methods: (i) PGBM enables probabilistic estimates without compromising on point performance in a single model, (ii) PGBM learns probabilistic estimates via a single model only (and without requiring multi-parameter boosting), and thereby offers a speedup of up to several orders of magnitude over existing state-of-the-art methods on large datasets, and (iii) PGBM achieves accurate probabilistic estimates in tasks with complex differentiable loss functions, such as hierarchical time series problems, where we observed up to 10% improvement in point forecasting performance and up to 300% improvement in probabilistic forecasting performance.
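
A minimal sketch of the downstream use is shown below, assuming per-tree leaf means and variances have already been learned and treating tree contributions as independent (both assumptions; PGBM's actual update equations and choice of output distribution are richer): accumulate per-sample moments across trees, then sample from the resulting distribution.

    # Downstream sketch: per-sample predictive mean/variance accumulated over trees.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_trees, n_leaves = 5, 3, 4
    leaf_index = rng.integers(0, n_leaves, size=(n_trees, n_samples))   # which leaf each sample hits
    leaf_mean = rng.normal(0.0, 1.0, size=(n_trees, n_leaves))          # learned leaf means (stand-ins)
    leaf_var = rng.uniform(0.01, 0.1, size=(n_trees, n_leaves))         # learned leaf variances (stand-ins)

    pred_mean = np.zeros(n_samples)
    pred_var = np.zeros(n_samples)
    for tree in range(n_trees):
        pred_mean += leaf_mean[tree, leaf_index[tree]]
        pred_var += leaf_var[tree, leaf_index[tree]]

    # Sample from a normal with the accumulated moments to obtain probabilistic predictions.
    draws = rng.normal(pred_mean, np.sqrt(pred_var), size=(1000, n_samples))
    print("90% intervals:\n", np.percentile(draws, [5, 95], axis=0).T)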

Redescription Model Mining

This paper introduces Redescription Model Mining, a novel approach to identify interpretable patterns across two datasets that share only a subset of attributes and have no common instances. In particular, Redescription Model Mining aims to find pairs of describable data subsets -- one for each dataset -- that induce similar exceptional models with respect to a prespecified model class. To achieve this, we combine two previously separate research areas: Exceptional Model Mining and Redescription Mining. For this new problem setting, we develop interestingness measures to select promising patterns, propose efficient algorithms, and demonstrate their potential on synthetic and real-world data. Uncovered patterns can hint at common underlying phenomena that manifest themselves across datasets, enabling the discovery of possible associations between (combinations of) attributes that do not appear in the same dataset.

A Stagewise Hyperparameter Scheduler to Improve Generalization

Stochastic gradient descent (SGD) augmented with various momentum variants (e.g. heavy ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many learning tasks. Tuning the optimizer's hyperparameters is arguably the most time-consuming part of model training. Many new momentum variants, despite their empirical advantage over classical SHB/NAG, introduce even more hyperparameters to tune. Automating the tedious and error-prone tuning is essential for AutoML. This paper focuses on how to efficiently tune a large class of multistage momentum variants to improve generalization. We use the general formulation of quasi-hyperbolic momentum (QHM) and extend "constant and drop", the widespread learning rate α schedule where α is set large initially and then dropped every few epochs, to other hyperparameters (e.g. batch size b, momentum parameter β, instant discount factor ν). Multistage QHM is a unified framework which covers a large family of momentum variants as its special cases (e.g. vanilla SGD/SHB/NAG). Existing works mainly focus on scheduling α's decay, while multistage QHM allows additional varying hyperparameters such as b, β, and ν, and demonstrates better generalization ability than only tuning α. Our tuning strategies have rigorous justifications rather than relying on blind trial and error. We theoretically prove why our tuning strategies can improve generalization. We also show the convergence of multistage QHM for general nonconvex objective functions. Our strategies simplify the tuning process and beat competitive optimizers in test accuracy empirically.
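
The standard quasi-hyperbolic momentum update, together with a stagewise drop schedule over several hyperparameters, can be sketched as follows (the toy quadratic objective, milestone steps, and drop factors are assumptions):

    # Quasi-hyperbolic momentum with a stagewise hyperparameter schedule on a toy quadratic.
    import numpy as np

    def grad(theta):
        return theta                                     # gradient of 0.5 * ||theta||^2

    stages = {0: dict(alpha=0.5, beta=0.9, nu=0.7),      # stage 1
              100: dict(alpha=0.05, beta=0.95, nu=0.9),  # stage 2: drop alpha, adjust beta and nu
              200: dict(alpha=0.005, beta=0.99, nu=1.0)} # stage 3

    theta = np.array([5.0, -3.0])
    g = np.zeros_like(theta)
    hp = {}
    for step in range(300):
        hp = stages.get(step, hp)                        # switch hyperparameters at milestones
        d = grad(theta)
        g = hp["beta"] * g + (1 - hp["beta"]) * d        # momentum buffer
        theta = theta - hp["alpha"] * ((1 - hp["nu"]) * d + hp["nu"] * g)   # QHM step
    print(theta)                                         # close to the optimum at the origin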

Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns

Graph neural networks (GNNs) have achieved tremendous success on multiple graph-based learning tasks by fusing network structure and node features. Modern GNN models are built upon iterative aggregation of neighbors'/proximity features by message passing. Their prediction performance has been shown to be strongly bounded by assortative mixing in the graph, a key property wherein nodes with similar attributes mix/connect with each other. We observe that real-world networks exhibit heterogeneous or diverse mixing patterns, and the conventional global measurement of assortativity, such as the global assortativity coefficient, may not be a representative statistic for quantifying this mixing. We adopt a generalized concept, node-level assortativity, to better represent the diverse patterns and accurately quantify the learnability of GNNs. We find that the prediction performance of a wide range of GNN models is highly correlated with node-level assortativity. To break this limit, in this work, we focus on transforming the input graph into a computation graph which contains both proximity and structural information as distinct types of edges. The resulting multi-relational graph has an enhanced level of assortativity and, more importantly, preserves rich information from the original graph. We then propose to run GNNs on this computation graph and show that adaptively choosing between structure and proximity leads to improved performance under diverse mixing. Empirically, we show the benefits of adopting our transformation framework for the semi-supervised node classification task on a variety of real-world graph learning benchmarks.
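
One very simple node-level proxy, local label homophily (the fraction of a node's neighbors sharing its class), can be computed as below; the paper's notion of node-level assortativity is more general, so this is purely an illustration on an assumed toy graph:

    # Crude node-level proxy: local label homophily (fraction of same-label neighbors).
    import numpy as np

    def local_homophily(adj, labels):
        return np.array([
            np.mean([labels[v] == labels[u] for v in adj[u]]) if adj[u] else 0.0
            for u in range(len(adj))
        ])

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
    labels = [0, 0, 0, 1, 1]
    print(local_homophily(adj, labels))                  # low values mark boundary (disassortative) nodes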

Norm Adjusted Proximity Graph for Fast Inner Product Retrieval

Efficient inner product search on embedding vectors is often the vital stage for online ranking services, such as recommendation and information retrieval. Recommendation algorithms, e.g., matrix factorization, typically produce latent vectors to represent users or items. The recommendation services are conducted by retrieving the most relevant item vectors given the user vector, where the relevance is often defined by inner product. Therefore, developing efficient recommender systems often requires solving the so-called maximum inner product search (MIPS) problem. In the past decade, there have been many studies on efficient MIPS algorithms. This task is challenging in part because the inner product does not follow the triangle inequality of metric space.

Compared with hash-based or quantization-based MIPS solutions, in recent years graph-based MIPS algorithms have demonstrated their strong empirical advantages in many real-world MIPS tasks. In this paper, we propose a new index graph construction method named norm adjusted proximity graph (NAPG), for efficient MIPS. With adjusting factors estimated on sampled data, NAPG is able to select more meaningful data points to connect with when constructing graph-based index for inner product search. Our extensive experiments on a variety of datasets verify that the improved graph-based index strategy provides another strong addition to the pool of efficient MIPS algorithms.

Analysis and Applications of Class-wise Robustness in Adversarial Training

Adversarial training is one of the most effective approaches to improve model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role of each class involved in adversarial training is still missing. In this paper, we propose to analyze the class-wise robustness in adversarial training. First, we provide a detailed diagnosis of adversarial training on six benchmark datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find that there are remarkable robustness discrepancies among classes, leading to unbalanced/unfair class-wise robustness in the robust models. Furthermore, we investigate the relations between classes and find that the unbalanced class-wise robustness is fairly consistent across different attack and defense methods. Moreover, we observe that the stronger attack methods in adversarial learning achieve performance improvement mainly from a more successful attack on the vulnerable classes (i.e., classes with less robustness). Inspired by these interesting findings, we design a simple but effective attack method based on the traditional PGD attack, named the Temperature-PGD attack, which enlarges the robustness disparity among classes with a temperature factor on the confidence distribution of each image. Experiments demonstrate that our method can achieve a higher attack rate than the PGD attack. Furthermore, from the defense perspective, we also make some modifications in the training and inference phases to improve the robustness of the most vulnerable class, so as to mitigate the large difference in class-wise robustness. We believe our work can contribute to a more comprehensive understanding of adversarial training as well as a rethinking of the class-wise properties in robust models.

Choice Set Confounding in Discrete Choice

Standard methods in preference learning involve estimating the parameters of discrete choice models from data of selections (choices) made by individuals from a discrete set of alternatives (the choice set). While there are many models for individual preferences, existing learning methods overlook how choice set assignment affects the data. Often, the choice set itself is influenced by an individual's preferences; for instance, a consumer choosing a product from an online retailer is often presented with options from a recommender system that depend on information about the consumer's preferences. Ignoring these assignment mechanisms can mislead choice models into making biased estimates of preferences, a phenomenon that we call choice set confounding. We demonstrate the presence of such confounding in widely-used choice datasets.

To address this issue, we adapt methods from causal inference to the discrete choice setting. We use covariates of the chooser for inverse probability weighting and/or regression controls, accurately recovering individual preferences in the presence of choice set confounding under certain assumptions. When such covariates are unavailable or inadequate, we develop methods that take advantage of structured choice set assignment to improve prediction. We demonstrate the effectiveness of our methods on real-world choice data, showing, for example, that accounting for choice set confounding makes choices observed in hotel booking and commute transportation more consistent with rational utility maximization.
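
The mechanics of the inverse-probability-weighting correction can be sketched with a weighted conditional logit (the synthetic data, the stand-in set-assignment probabilities, and the optimizer choice are assumptions; in practice the set probabilities would be estimated from chooser covariates):

    # Inverse-probability-weighted conditional logit on synthetic choice data.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    rng = np.random.default_rng(0)
    n_obs, set_size, d = 500, 4, 3
    theta_true = np.array([1.0, -2.0, 0.5])

    obs = []
    for _ in range(n_obs):
        items = rng.standard_normal((set_size, d))       # feature vectors of the offered items
        utils = items @ theta_true
        choice = rng.choice(set_size, p=np.exp(utils) / np.exp(utils).sum())
        set_prob = rng.uniform(0.2, 1.0)                 # stand-in for P(choice set | chooser covariates)
        obs.append((items, choice, set_prob))

    def weighted_nll(theta):
        total = 0.0
        for items, choice, set_prob in obs:
            utils = items @ theta
            total -= (utils[choice] - logsumexp(utils)) / set_prob   # weight = 1 / set probability
        return total

    theta_hat = minimize(weighted_nll, np.zeros(d), method="L-BFGS-B").x
    print(theta_hat)                                     # should land near theta_true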

Learning Interpretable Feature Context Effects in Discrete Choice

Individuals are constantly making choices---purchasing products, consuming Web content, making social connections---so understanding what contributes to these decisions is crucial in many settings. A major interest is understanding context effects, which occur when the set of available options itself affects an individual's relative preferences. These violate traditional rationality assumptions but are commonly observed in human behavior. At the same time, identifying context effects from choice data remains a challenge; existing models posit a specific context effect a priori and then measure its effect from (often effect-targeting) data. Here, we develop discrete choice models that capture a broad range of context effects, which are learned from choice data rather than baked into the model. Our models yield intuitive, interpretable, and statistically testable context effects, all while being simple to train. We evaluate our model on several empirical choice datasets, discovering, e.g., that people are more willing to book higher-priced hotels when presented with options that are on sale. We also provide the first analysis of context effects in online social network growth, finding that users forming connections place relatively more emphasis on shared neighbors when popular users are an option.

Statistical Models Coupling Allows for Complex Local Multivariate Time Series Analysis

The increased availability of multivariate time series calls for the development of suitable methods able to analyse them holistically. To this aim, we propose a novel flexible method for data mining, forecasting and causal pattern detection that leverages the coupling of Hidden Markov Models and Gaussian Graphical Models. Given a multivariate non-stationary time series, the proposed method simultaneously clusters time points while understanding probabilistic relationships among variables. The clustering divides the time points into stationary sub-groups whose underlying distribution can be inferred through a graphical model. Such coupling can be further exploited to build a time-varying regression model which allows us to both make predictions and obtain insights on the presence of causal patterns. We extensively validate the proposed approach on synthetic data, showing that it has better performance than the state of the art on clustering, graphical model inference and prediction. Finally, to demonstrate the applicability of our approach in real-world scenarios, we exploit its characteristics to build a profitable investment portfolio. Results show that we are able to improve on the state of the art, going from a -20% profit to a noticeable 80% profit.

The Generalized Mean Densest Subgraph Problem

Finding dense subgraphs of a large graph is a standard problem in graph mining that has been studied extensively both for its theoretical richness and its many practical applications. In this paper we introduce a new family of dense subgraph objectives, parameterized by a single parameter p, based on computing generalized means of degree sequences of a subgraph. Our objective captures both the standard densest subgraph problem and the maximum k-core as special cases, and provides a way to interpolate between and extrapolate beyond these two objectives when searching for other notions of dense subgraphs. In terms of algorithmic contributions, we first show that our objective can be minimized in polynomial time for all p ≥ 1 using repeated submodular minimization. A major contribution of our work is analyzing the performance of different types of peeling algorithms for dense subgraphs both in theory and practice. We prove that the standard peeling algorithm can perform arbitrarily poorly on our generalized objective, but we then design a more sophisticated peeling method which for p ≥ 1 has an approximation guarantee that is always at least 1/2 and converges to 1 as p → ∞. In practice, we show that this algorithm obtains extremely good approximations to the optimal solution, scales to large graphs, and highlights a range of different meaningful notions of density on graphs coming from numerous domains. Furthermore, it is typically able to approximate the densest subgraph problem better than the standard peeling algorithm, by better accounting for how the removal of one node affects other nodes in its neighborhood.
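
For reference, the standard peeling baseline evaluated under a generalized-mean objective can be sketched as follows (the exact normalization of the objective and the toy graph are assumptions; the paper's improved peeling method is more sophisticated):

    # Standard peeling evaluated under a p-mean degree objective (baseline only).
    def p_mean_density(deg, p):
        if not deg:
            return 0.0
        return (sum(d ** p for d in deg.values()) / len(deg)) ** (1.0 / p)

    def standard_peeling(adj, p):
        adj = {u: set(vs) for u, vs in adj.items()}      # work on a copy
        deg = {u: len(vs) for u, vs in adj.items()}
        best_score, best_set = p_mean_density(deg, p), set(deg)
        while len(adj) > 1:
            u = min(deg, key=deg.get)                    # peel a minimum-degree node
            for v in adj.pop(u):
                adj[v].discard(u)
                deg[v] -= 1
            del deg[u]
            score = p_mean_density(deg, p)
            if score > best_score:
                best_score, best_set = score, set(deg)
        return best_set, best_score

    # Toy graph: a 4-clique {0, 1, 2, 3} with a pendant node 4 attached to node 3.
    adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
    print(standard_peeling(adj, p=2))                    # the 4-clique scores highest here

On this toy instance the baseline happens to return the clique, but, as noted above, the standard peeling algorithm can perform arbitrarily poorly on the generalized objective in general.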

Environment Agnostic Invariant Risk Minimization for Classification of Sequential Datasets

The generalization of predictive models that follow the standard risk minimization paradigm of machine learning can be hindered by the presence of spurious correlations in the data. Identifying invariant predictors while training on data from multiple environments can influence models to focus on features that have an invariant causal relationship with the target, while reducing the effect of spurious features. Such invariant risk minimization approaches heavily rely on clearly defined environments and data being perfectly segmented into these environments for training. However, in real-world settings, perfect segmentation is challenging to achieve and these environment-aware approaches prove to be sensitive to segmentation errors. In this work, we present an environment-agnostic approach to develop generalizable models for classification tasks in sequential datasets without needing prior knowledge of environments. We show that our approach results in models that can generalize to out-of-distribution data and are not influenced by spurious correlations. We evaluate our approach on real-world sequential datasets from various domains.

Alphacore: Data Depth based Core Decomposition

Core decomposition in networks has proven useful for evaluating the importance of nodes and communities in a variety of application domains, ranging from biology to social networks and finance. However, existing core decomposition algorithms have limitations in simultaneously handling multiple node and edge attributes.

We propose a novel unsupervised core decomposition method that can be easily applied to directed and weighted networks. Our algorithm, AlphaCore, allows us to combine multiple node properties in a systematic and mathematically rigorous way by using the notion of data depth. In addition, it can be used as a mixture of a centrality measure and a core decomposition. Compared to existing approaches, AlphaCore avoids the need to specify numerous thresholds or coefficients and yields meaningful quantitative and qualitative insights into the network's structural organization.

We evaluate AlphaCore's performance with a focus on financial, blockchain-based token networks, the social network Reddit and a transportation network of international flight routes. We compare our results with existing core decomposition and centrality algorithms. Using ground truth about node importance, we show that AlphaCore yields the best precision and recall results among core decomposition methods using the same input features. An implementation is available at https://github.com/friedhelmvictor/alphacore.
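As background on the key ingredient (a generic sketch, not AlphaCore's exact formulation), one widely used notion of data depth is Mahalanobis depth, which scores how central each node's multivariate feature vector is relative to the joint distribution of all nodes:

```python
import numpy as np

def mahalanobis_depth(X):
    """Mahalanobis data depth: D(x) = 1 / (1 + (x - mu)^T S^{-1} (x - mu)).
    Rows of X are node feature vectors (e.g. in-strength, out-strength, ...)."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    md2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)  # squared Mahalanobis distance
    return 1.0 / (1.0 + md2)

rng = np.random.default_rng(1)
features = rng.normal(size=(100, 3))     # 100 nodes, 3 attributes each
depths = mahalanobis_depth(features)
print(depths.min(), depths.max())        # central nodes get depth close to 1
```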

Multi-Objective Model-based Reinforcement Learning for Infectious Disease Control

Severe infectious diseases such as the novel coronavirus (COVID-19) pose a huge threat to public health. Stringent control measures, such as school closures and stay-at-home orders, while having significant effects, also bring huge economic losses. In the face of an emerging infectious disease, a crucial question for policymakers is how to make the trade-off and implement appropriate interventions in a timely manner given the huge uncertainty. In this work, we propose a Multi-Objective Model-based Reinforcement Learning framework to facilitate data-driven decision-making and minimize the overall long-term cost. Specifically, at each decision point, a Bayesian epidemiological model is first learned as the environment model, and then the proposed model-based multi-objective planning algorithm is applied to find a set of Pareto-optimal policies. This framework, combined with prediction bands for each policy, provides a real-time decision support tool for policymakers. The application is demonstrated with the spread of COVID-19 in China.

Certified Robustness of Graph Neural Networks against Adversarial Structural Perturbation

Graph neural networks (GNNs) have recently gained much attention for node and graph classification tasks on graph-structured data. However, multiple recent works showed that an attacker can easily make GNNs predict incorrectly via perturbing the graph structure, i.e., adding or deleting edges in the graph. We aim to defend against such attacks via developing certifiably robust GNNs. Specifically, we prove the first certified robustness guarantee of any GNN for both node and graph classifications against structural perturbation. Moreover, we show that our certified robustness guarantee is tight. Our results are based on a recently proposed technique called randomized smoothing, which we extend to graph data. We also empirically evaluate our method for both node and graph classifications on multiple GNNs and multiple benchmark datasets. For instance, on the Cora dataset, Graph Convolutional Network with our randomized smoothing can achieve a certified accuracy of 0.49 when the attacker can arbitrarily add/delete at most 15 edges in the graph.
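A minimal sketch of the randomized smoothing idea the abstract builds on, transplanted to graph structure: perturb the adjacency matrix at random many times, query an arbitrary base classifier, and return the majority vote. The base classifier below is a toy placeholder, and the derivation of the certified radius from the vote counts is omitted:

```python
import numpy as np

def smoothed_predict(adj, base_classifier, flip_prob=0.05, n_samples=200, seed=0):
    """Majority-vote prediction of a smoothed graph classifier.
    `adj` is a dense 0/1 adjacency matrix; each undirected edge slot is
    flipped independently with probability `flip_prob` in every sample."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    votes = {}
    for _ in range(n_samples):
        noise = rng.random((n, n)) < flip_prob
        noise = np.triu(noise, 1)
        noise = noise | noise.T                  # keep the perturbation symmetric
        perturbed = np.where(noise, 1 - adj, adj)
        label = base_classifier(perturbed)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get), votes

# Placeholder base classifier: class 1 iff the graph has more than 3 edges.
toy_adj = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]])
base = lambda a: int(a.sum() // 2 > 3)
print(smoothed_predict(toy_adj, base))
```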

SESSION: Research Track Papers

Privacy-Preserving Representation Learning on Graphs: A Mutual Information Perspective

Learning with graphs has attracted significant attention recently. Existing representation learning methods on graphs have achieved state-of-the-art performance on various graph-related tasks such as node classification, link prediction, etc. However, we observe that these methods could leak serious private information. For instance, one can accurately infer the links (or node identity) in a graph from a node classifier (or link predictor) trained on the learnt node representations by existing methods. To address the issue, we propose a privacy-preserving representation learning framework on graphs from the mutual information perspective. Specifically, our framework includes a primary learning task and a privacy protection task, and we consider node classification and link prediction as the two tasks of interest. Our goal is to learn node representations such that they can be used to achieve high performance for the primary learning task, while obtaining performance for the privacy protection task close to random guessing. We formally formulate our goal via mutual information objectives. However, it is intractable to compute mutual information in practice. Then, we derive tractable variational bounds for the mutual information terms, where each bound can be parameterized via a neural network. Next, we train these parameterized neural networks to approximate the true mutual information and learn privacy-preserving node representations. We finally evaluate our framework on various graph datasets.

JOHAN: A Joint Online Hurricane Trajectory and Intensity Forecasting Framework

Hurricanes are among the most catastrophic natural forces, with the potential to inflict severe damage to property and loss of human life from high winds and inland flooding. Accurate long-term forecasting of the trajectory and intensity of advancing hurricanes is therefore crucial to provide timely warnings for civilians and emergency responders to mitigate costly damages and their life-threatening impact. In this paper, we present a novel online learning framework called JOHAN that simultaneously predicts the trajectory and intensity of a hurricane based on outputs produced by an ensemble of dynamic (physical) hurricane models. In addition, JOHAN is designed to generate accurate forecasts of the ordinal-valued hurricane intensity categories to ensure that their severity level can be reliably communicated to the public. The framework also employs exponentially-weighted quantile loss functions to bias the algorithm towards improving its prediction accuracy for high-category hurricanes approaching landfall. Experimental results using real-world hurricane data demonstrate the superiority of JOHAN compared to several state-of-the-art learning approaches.

Approximate Graph Propagation

Efficient computation of node proximity queries such as transition probabilities, Personalized PageRank, and Katz is of fundamental importance in various graph mining and learning tasks. In particular, several recent works leverage fast node proximity computation to improve the scalability of Graph Neural Networks (GNN). However, prior studies on proximity computation and GNN feature propagation proceed on a case-by-case basis, with each paper focusing on a particular proximity measure.

In this paper, we propose Approximate Graph Propagation (AGP), a unified randomized algorithm that computes various proximity queries and GNN feature propagations, including transition probabilities, Personalized PageRank, heat kernel PageRank, Katz, SGC, GDC, and APPNP. Our algorithm provides a theoretical bounded-error guarantee and runs in almost optimal time complexity. We conduct an extensive experimental study to demonstrate AGP's effectiveness in two concrete applications: local clustering with heat kernel PageRank and node classification with GNNs. Most notably, we present an empirical study on the billion-edge graph Papers100M, the largest publicly available GNN dataset so far. The results show that AGP can significantly improve various existing GNN models' scalability without sacrificing prediction accuracy.
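For reference only (this is plain power iteration, not the AGP algorithm, which uses a randomized push-style scheme with error guarantees), Personalized PageRank, one of the proximity measures listed above, can be written as the fixed point π = α·e_s + (1 − α)·Pᵀπ with transition matrix P = D⁻¹A:

```python
import numpy as np

def personalized_pagerank(A, source, alpha=0.15, n_iter=100):
    """Power iteration for Personalized PageRank with restart probability alpha.
    A is a dense adjacency matrix; rows/columns index nodes."""
    deg = A.sum(axis=1)
    P = A / np.maximum(deg, 1)[:, None]          # row-stochastic transition matrix
    pi = np.zeros(A.shape[0])
    e = np.zeros(A.shape[0]); e[source] = 1.0
    for _ in range(n_iter):
        pi = alpha * e + (1 - alpha) * P.T @ pi
    return pi

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(personalized_pagerank(A, source=0).round(3))   # mass concentrates near node 0
```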

Relational Message Passing for Knowledge Graph Completion

Knowledge graph completion aims to predict missing relations between entities in a knowledge graph. In this work, we propose a relational message passing method for knowledge graph completion. Different from existing embedding-based methods, relational message passing only considers edge features (i.e., relation types) without entity IDs in the knowledge graph, and passes relational messages among edges iteratively to aggregate neighborhood information. Specifically, two kinds of neighborhood topology are modeled for a given entity pair under the relational message passing framework: (1) Relational context, which captures the relation types of edges adjacent to the given entity pair; (2) Relational paths, which characterize the relative position between the given two entities in the knowledge graph. The two message passing modules are combined for relation prediction. Experimental results on knowledge graph benchmarks as well as our newly proposed dataset show that our method, PathCon, outperforms state-of-the-art knowledge graph completion methods by a large margin. PathCon is also shown to be applicable to inductive settings where entities are not seen during training, and it is able to provide interpretable explanations for the predicted results. The code and all datasets are available at https://github.com/hwwang55/PathCon.

Deep Learning Embeddings for Data Series Similarity Search

A key operation in the analysis of (increasingly large) data series collections is similarity search. According to recent studies, SAX-based indexes offer state-of-the-art performance for similarity search tasks. However, their performance degrades on datasets with high-frequency, weakly correlated, excessively noisy, or other dataset-specific properties. In this work, we propose Deep Embedding Approximation (DEA), a novel family of data series summarization techniques based on deep neural networks. Moreover, we describe SEAnet, a novel architecture especially designed for learning DEA, that introduces the Sum of Squares preservation property into the deep network design. Finally, we propose a new sampling strategy, SEASam, that allows SEAnet to effectively train on massive datasets. Comprehensive experiments on 7 diverse synthetic and real datasets verify the advantages of DEA learned using SEAnet, when compared to other state-of-the-art traditional and DEA solutions, in providing high-quality data series summarizations and similarity search results.

Deconfounded Recommendation for Alleviating Bias Amplification

Recommender systems usually amplify the biases in the data. The model learned from historical interactions with imbalanced item distribution will amplify the imbalance by over-recommending items from the majority groups. Addressing this issue is essential for a healthy ecosystem of recommendation in the long run. Existing work applies bias control to the ranking targets (e.g., calibration, fairness, and diversity), but ignores the true reason for bias amplification and trades off the recommendation accuracy.

In this work, we scrutinize the cause-effect factors for bias amplification and identify that the main reason lies in the confounding effect of the imbalanced item distribution on user representation and prediction score. The existence of such a confounder pushes us to go beyond merely modeling the conditional probability and embrace causal modeling for recommendation. Towards this end, we propose a Deconfounded Recommender System (DecRS), which models the causal effect of user representation on the prediction score. The key to eliminating the impact of the confounder lies in backdoor adjustment, which is, however, difficult to perform due to the infinite sample space of the confounder. To address this challenge, we contribute an approximation operator for backdoor adjustment which can be easily plugged into most recommender models. Lastly, we devise an inference strategy to dynamically regulate backdoor adjustment according to user status. We instantiate DecRS on two representative models, FM [32] and NFM [16], and conduct extensive experiments over two benchmarks to validate the superiority of our proposed DecRS.
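For readers unfamiliar with the term, backdoor adjustment (written here in generic causal notation rather than the paper's own symbols) replaces ordinary conditioning with an interventional quantity that sums the confounder out, which is why an infinite confounder sample space makes it hard to compute exactly:

```latex
% Backdoor adjustment: estimate the causal effect of X on Y by
% summing out the confounder Z instead of conditioning through it.
P\big(Y \mid \mathrm{do}(X = x)\big) \;=\; \sum_{z} P(Y \mid X = x, Z = z)\, P(z)
```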

Self-supervised Heterogeneous Graph Neural Network with Co-contrastive Learning

Heterogeneous graph neural networks (HGNNs) as an emerging technique have shown a superior capacity for dealing with heterogeneous information networks (HINs). However, most HGNNs follow a semi-supervised learning manner, which notably limits their wide use in reality since labels are usually scarce in real applications. Recently, contrastive learning, a self-supervised method, has become one of the most exciting learning paradigms and shows great potential when there are no labels. In this paper, we study the problem of self-supervised HGNNs and propose a novel co-contrastive learning mechanism for HGNNs, named HeCo. Different from traditional contrastive learning, which only focuses on contrasting positive and negative samples, HeCo employs a cross-view contrastive mechanism. Specifically, two views of a HIN (network schema and meta-path views) are proposed to learn node embeddings, so as to capture both local and high-order structure simultaneously. Then cross-view contrastive learning, as well as a view mask mechanism, is proposed, which is able to extract the positive and negative embeddings from the two views. This enables the two views to collaboratively supervise each other and finally learn high-level node embeddings. Moreover, two extensions of HeCo are designed to generate harder negative samples with high quality, which further boosts the performance of HeCo. Extensive experiments conducted on a variety of real-world networks show the superior performance of the proposed methods over the state of the art.
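To give a flavor of cross-view contrastive learning (a generic InfoNCE-style sketch; HeCo's view mask mechanism and positive-sample selection are not reproduced), each node's embedding from one view is pulled toward its own embedding from the other view and pushed away from other nodes' embeddings:

```python
import numpy as np

def info_nce(z1, z2, tau=0.5):
    """InfoNCE-style loss between two views. z1[i] and z2[i] are embeddings of
    the same node under the two views; all other rows of z2 act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                        # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives are on the diagonal

rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))                    # e.g. network-schema view embeddings
view_b = view_a + 0.1 * rng.normal(size=(8, 16))     # e.g. meta-path view embeddings
print(info_nce(view_a, view_b))                      # small loss when the views agree
```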

Meta Self-training for Few-shot Neural Sequence Labeling

Neural sequence labeling is widely adopted for many Natural Language Processing (NLP) tasks, such as Named Entity Recognition (NER) and slot tagging for dialog systems and semantic parsing. Recent advances with large-scale pre-trained language models have shown remarkable success in these tasks when fine-tuned on large amounts of task-specific labeled data. However, obtaining such large-scale labeled training data is not only costly, but also may not be feasible in many sensitive user applications due to data access and privacy constraints. This is exacerbated for sequence labeling tasks requiring such annotations at the token level. In this work, we develop techniques to address the label scarcity challenge for neural sequence labeling models. Specifically, we propose a meta self-training framework which leverages very few manually annotated labels for training neural sequence models. While self-training serves as an effective mechanism to learn from large amounts of unlabeled data via iterative knowledge exchange, meta-learning helps with adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Extensive experiments on six benchmark datasets, including two for massive multilingual NER and four slot tagging datasets for task-oriented dialog systems, demonstrate the effectiveness of our method. With only 10 labeled examples for each class in each task, the proposed method achieves a 10% improvement over state-of-the-art methods, demonstrating its effectiveness in the limited-label regime.
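A bare-bones sketch of the self-training loop that such methods build on (the meta-learned sample re-weighting is replaced here by a simple confidence threshold, and the model is a placeholder scikit-learn classifier rather than a neural sequence labeler):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=3, conf_threshold=0.9):
    """Iteratively pseudo-label unlabeled data and retrain, keeping only
    confident pseudo-labels (a stand-in for learned sample re-weighting)."""
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(rounds):
        proba = model.predict_proba(X_unlab)
        conf, pseudo = proba.max(axis=1), proba.argmax(axis=1)
        keep = conf >= conf_threshold
        X = np.vstack([X_lab, X_unlab[keep]])
        y = np.concatenate([y_lab, pseudo[keep]])
        model = LogisticRegression(max_iter=1000).fit(X, y)
    return model

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 5)); y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(500, 5))
clf = self_train(X_lab, y_lab, X_unlab)
```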

Understanding and Improving Fairness-Accuracy Trade-offs in Multi-Task Learning

As multi-task models gain popularity in a wider range of machine learning applications, it is becoming increasingly important for practitioners to understand the fairness implications associated with those models. Most existing fairness literature focuses on learning a single task more fairly, while how ML fairness interacts with multiple tasks in the joint learning setting is largely under-explored. In this paper, we are concerned with how group fairness (e.g., equal opportunity, equalized odds) as an ML fairness concept plays out in the multi-task scenario. In multi-task learning, several tasks are learned jointly to exploit task correlations for a more efficient inductive transfer. This presents a multi-dimensional Pareto frontier on (1) the trade-off between group fairness and accuracy with respect to each task, as well as (2) the trade-offs across multiple tasks. We aim to provide a deeper understanding on how group fairness interacts with accuracy in multi-task learning, and we show that traditional approaches that mainly focus on optimizing the Pareto frontier of multi-task accuracy might not perform well on fairness goals. We propose a new set of metrics to better capture the multi-dimensional Pareto frontier of fairness-accuracy trade-offs uniquely presented in a multi-task learning setting. We further propose a Multi-Task-Aware Fairness (MTA-F) approach to improve fairness in multi-task learning. Experiments on several real-world datasets demonstrate the effectiveness of our proposed approach.

Error-Bounded Online Trajectory Simplification with Multi-Agent Reinforcement Learning

Trajectory data has been widely used in various applications, including taxi services, traffic management, mobility analysis, etc. It is usually collected on the sensor side in real time and corresponds to a sequence of sampled points. Constrained by the storage and/or network bandwidth of a sensor, it is common to simplify raw trajectory data when it is collected by dropping some sampled points. Many algorithms have been proposed for the error-bounded online trajectory simplification (EB-OTS) problem, which is to drop as many points as possible subject to the constraint that the error is bounded by a given tolerance. Nevertheless, these existing algorithms rely on pre-defined rules for decision making during the trajectory simplification process and there is no theoretical ground supporting their effectiveness. In this paper, we propose a multi-agent reinforcement learning method called MARL4TS for EB-OTS. MARL4TS involves two agents for different decision making problems during the trajectory simplification process. Besides, MARL4TS has an objective equivalent to that of the EB-OTS problem, which provides some theoretical grounding for its effectiveness. We conduct extensive experiments on real-world trajectory datasets, which verify that MARL4TS outperforms all existing algorithms in effectiveness and provides competitive efficiency.

Zero-shot Node Classification with Decomposed Graph Prototype Network

Node classification is a central task in graph data analysis. Scarce or even no labeled data of emerging classes is a big challenge for existing methods. A natural question arises: can we classify the nodes from those classes that have never been seen?

In this paper, we study this zero-shot node classification (ZNC) problem, which has a two-stage nature: (1) acquiring high-quality class semantic descriptions (CSDs) for knowledge transfer, and (2) designing a well-generalized graph-based learning model. For the first stage, we give a novel quantitative CSD evaluation strategy based on estimating the real class relationships, to get the "best" CSDs in a completely automatic way. For the second stage, we propose a novel Decomposed Graph Prototype Network (DGPN) method, following the principles of locality and compositionality for zero-shot model generalization. Finally, we conduct extensive experiments to demonstrate the effectiveness of our solutions.

TUTA: Tree-based Transformers for Generally Structured Table Pre-training

We propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noticing that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Upon this, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art on five widely-studied datasets.

Model-Agnostic Counterfactual Reasoning for Eliminating Popularity Bias in Recommender System

The general aim of a recommender system is to provide personalized suggestions to users, as opposed to simply suggesting popular items. However, the normal training paradigm, i.e., fitting a recommender model to recover the user behavior data with pointwise or pairwise loss, makes the model biased towards popular items. This results in the Matthew effect, where popular items are recommended more frequently and become even more popular. Existing work addresses this issue with Inverse Propensity Weighting (IPW), which decreases the impact of popular items on training and increases the impact of long-tail items. Although theoretically sound, IPW methods are highly sensitive to the weighting strategy, which is notoriously difficult to tune.

In this work, we explore the popularity bias issue from a novel and fundamental perspective --- cause-effect. We identify that popularity bias lies in the direct effect from the item node to the ranking score, such that an item's intrinsic property is the cause of mistakenly assigning it a higher ranking score. To eliminate popularity bias, it is essential to answer the counterfactual question of what the ranking score would be if the model only used the item property. To this end, we formulate a causal graph to describe the important cause-effect relations in the recommendation process. During training, we perform multi-task learning to estimate the contribution of each cause; during testing, we perform counterfactual inference to remove the effect of item popularity. Remarkably, our solution amends the learning process of recommendation and is agnostic to a wide range of models --- it can be easily implemented in existing methods. We demonstrate it on Matrix Factorization (MF) and LightGCN [20], which are representative of the conventional and SOTA models for collaborative filtering. Experiments on five real-world datasets demonstrate the effectiveness of our method.

Probabilistic Label Tree for Streaming Multi-Label Learning

Multi-label learning aims to predict a subset of relevant labels for each instance, which has many real-world applications. Most extant multi-label learning studies focus on a fixed size of label space. However, in many cases, the environment is open and changes gradually and new labels emerge, which is coined streaming multi-label learning (SMLL). SMLL poses great challenges in two respects: (1) the target output space expands dynamically; (2) new labels emerge frequently and can reach a significantly large number. Previous attempts on SMLL leverage label correlations between past and emerging labels to improve performance, but they are inefficient when dealing with large-scale problems. To cope with this challenge, in this paper, we present a new learning framework, the probabilistic streaming label tree (Pslt). In particular, each non-leaf node of the tree corresponds to a subset of labels, and a binary classifier is learned at each leaf node. Initially, Pslt is learned on partially observed labels; both the tree structure and the node classifiers are updated as new labels emerge. Using a carefully designed updating mechanism, Pslt can seamlessly incorporate new labels by first passing them down from the root to the leaf nodes and then updating node classifiers accordingly. We provide theoretical bounds for the iteration complexity of the tree update procedure and the estimation error on newly arrived labels. Experiments show that the proposed approach improves performance in comparison with eleven baselines in terms of multiple evaluation metrics. The source code is available at https://gitee.com/pslt-kdd2021/pslt.

Towards Robust Prediction on Tail Labels

Extreme multi-label learning (XML) works to annotate objects with relevant labels from an extremely large label set. Many previous methods treat labels uniformly, such that the learned model tends to perform better on head labels, while the performance is severely deteriorated for tail labels. However, it is often desirable to predict more tail labels in many real-world applications. To alleviate this problem, in this work, we show theoretical and experimental evidence for the inferior performance of representative XML methods on tail labels. Our finding is that the norm of label classifier weights typically follows a long-tailed distribution similar to the label frequency, which results in the over-suppression of tail labels. Based on this new finding, we present two new modules: (1) ReRank re-ranks the predicted scores, which significantly improves the performance on tail labels by eliminating the effect of label priors; (2) Taug augments tail labels via a decoupled learning scheme, which can yield a more balanced classification boundary. We conduct experiments on commonly used XML benchmarks with hundreds of thousands of labels, showing that the proposed methods improve the performance of many state-of-the-art XML models by a considerable margin (6% performance gain with respect to PSP@1 on average). Anonymous source code is available at https://github.com/ReRANK-XML/rerank-XML.

Enhancing SVMs with Problem Context Aware Pipeline

In recent years, many data mining practitioners have treated deep neural networks (DNNs) as a standard recipe for creating state-of-the-art solutions. As a result, models like Support Vector Machines (SVMs) have been overlooked. While the results from DNNs are encouraging, DNNs also come with a huge number of parameters and the overhead of long training/inference times. SVMs have excellent properties such as convexity, good generality and efficiency. In this paper, we propose techniques to enhance SVMs with an automatic pipeline which exploits the context of the learning problem. The pipeline consists of several components including data-aware subproblem construction, feature customization, data balancing among subproblems with augmentation, and a kernel hyper-parameter tuner. Comprehensive experiments show that our proposed solution is more efficient, while producing better results than the other SVM-based approaches. Additionally, we conduct a case study of our proposed solution on a popular sentiment analysis problem---the aspect term sentiment analysis (ATSA) task. The study shows that our SVM-based solution can achieve predictive accuracy competitive with DNN-based (and even a majority of BERT-based) approaches. Furthermore, our solution is about 40 times faster in inference and has 100 times fewer parameters than the models using BERT. Our findings can encourage more research work on conventional machine learning techniques, which may be a good alternative when smaller model size and faster training/inference are desired.

Triple Adversarial Learning for Influence based Poisoning Attack in Recommender Systems

As an important means to solve information overload, recommender systems have been widely applied in many fields, such as e-commerce and advertising. However, recent studies have shown that recommender systems are vulnerable to poisoning attacks; that is, injecting a group of carefully designed user profiles into the recommender system can severely affect recommendation quality. Despite the development from shilling attacks to optimization-based attacks, the imperceptibility and harmfulness of the generated data in most attacks are difficult to balance. To this end, we propose triple adversarial learning for influence-based poisoning attack (TrialAttack), a flexible end-to-end poisoning framework to generate non-notable and harmful user profiles. Specifically, given the input noise, TrialAttack directly generates malicious users through triple adversarial learning of the generator, discriminator, and influence module. Besides, to provide reliable influence for TrialAttack training, we explore a new approximation approach for estimating each fake user's influence. Through theoretical analysis, we prove that the distribution characterized by TrialAttack approximates the rating distribution of real users under the premise of performing an efficient attack. This property allows the injected users to attack in an unremarkable way. Experiments on three real-world datasets show that TrialAttack's attack performance outperforms state-of-the-art attacks, and the generated fake profiles are more difficult to detect compared to baselines.

Quantifying Uncertainty in Deep Spatiotemporal Forecasting

Deep learning is gaining increasing popularity for spatiotemporal forecasting. However, prior works have mostly focused on point estimates without quantifying the uncertainty of the predictions. In high-stakes domains, being able to generate probabilistic forecasts with confidence intervals is critical to risk assessment and decision making. Yet a systematic study of uncertainty quantification (UQ) methods for spatiotemporal forecasting is missing in the community. In this paper, we describe two types of spatiotemporal forecasting problems: regular grid-based and graph-based. Then we analyze UQ methods from both the Bayesian and the frequentist point of view, casting them in a unified framework via statistical decision theory. Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical and computational trade-offs for different UQ methods: Bayesian methods are typically more robust in mean prediction, while confidence levels obtained from frequentist methods provide more extensive coverage over data variations. Computationally, quantile regression type methods are cheaper for a single confidence interval but require re-training for different intervals. Sampling-based methods generate samples that can form multiple confidence intervals, albeit at a higher computational cost.
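As an illustration of the quantile-regression family mentioned above (a generic sketch, not the paper's experimental setup), the pinball loss trains a model to output the q-th conditional quantile, and one model per quantile yields one confidence interval:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss: minimizing it makes y_pred the q-th quantile.
    Asymmetric: under-prediction is penalized by q, over-prediction by 1 - q."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y_true = np.array([10.0, 12.0, 9.0, 15.0])
lower = np.array([8.0, 10.0, 8.5, 12.0])    # candidate 5th-percentile forecasts
upper = np.array([13.0, 14.0, 10.0, 18.0])  # candidate 95th-percentile forecasts
print(pinball_loss(y_true, lower, q=0.05), pinball_loss(y_true, upper, q=0.95))
# Training one model per quantile yields one confidence interval; a new
# interval requires re-training, as the abstract notes.
```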

Indirect Invisible Poisoning Attacks on Domain Adaptation

Unsupervised domain adaptation has been successfully applied across multiple high-impact applications, since it improves the generalization performance of a learning algorithm when the source and target domains are related. However, the adversarial vulnerability of domain adaptation models has largely been neglected. Most existing unsupervised domain adaptation algorithms might be easily fooled by an adversary, resulting in deteriorated prediction performance on the target domain, when transferring the knowledge from a maliciously manipulated source domain.

To demonstrate the adversarial vulnerability of existing domain adaptation techniques, in this paper, we propose a generic data poisoning attack framework named I2Attack for domain adaptation with the following properties: (1) perceptibly unnoticeable: all the poisoned inputs are natural-looking; (2) adversarially indirect: only source examples are maliciously manipulated; (3) algorithmically invisible: both the source classification error and the marginal domain discrepancy between source and target domains will not increase. Specifically, it aims to degrade the overall prediction performance on the target domain by maximizing the label-informed domain discrepancy over both the input feature space and the class-label space between source and target domains. Within this framework, a family of practical poisoning attacks are presented to fool the existing domain adaptation algorithms associated with different discrepancy measures. Extensive experiments on various domain adaptation benchmarks confirm the effectiveness and computational efficiency of our proposed I2Attack framework.

MapEmbed: Perfect Hashing with High Load Factor and Fast Update

Perfect hashing is a hash function that maps a set of distinct keys to a set of continuous integers without collision. However, most existing perfect hash schemes are static, which means that they cannot support incremental updates, while most datasets in practice are dynamic. To address this issue, we propose a novel hashing scheme, namely MapEmbed Hashing. Inspired by divide-and-conquer and map-and-reduce, our key idea is named map-and-embed and includes two phases: 1) map all keys into many small virtual tables; 2) embed all small tables into a large table by circular move. Our experimental results show that under the same experimental setting, the state-of-the-art perfect hashing scheme (dynamic perfect hashing) achieves a load factor of around 15% and an update speed of around 0.3 Mops, while our MapEmbed achieves a load factor of around 90%~95% and an update speed of around 8.0 Mops per thread. All of our code and that of the other algorithms is open-sourced on GitHub.

Geometric Graph Representation Learning on Protein Structure Prediction

Determining a protein's 3D structure from its sequence is one of the most challenging problems in biology. Recently, geometric deep learning has achieved great success on non-Euclidean domains including social networks, chemistry, and computer graphics. Although it is natural to represent protein structures as 3D graphs, existing research has rarely studied protein structures as graphs directly. The present research explores geometric deep learning on three-dimensional graphs of protein structures and proposes a graph neural network architecture to address these challenges. The proposed Protein Geometric Graph Neural Network (PG-GNN) models both distance geometric graph representations and dihedral geometric graph representations by geometric graph convolutions. This research sheds new light on protein 3D structure studies. We investigated the effectiveness of graph neural networks over five real datasets. Our results demonstrate the potential of GNNs for 3D structure prediction.

Forecasting Interaction Order on Temporal Graphs

Link prediction is a fundamental task for graph analysis and the topic has been studied extensively for static or dynamic graphs. Essentially, link prediction is formulated as a binary classification problem about two nodes. However, for temporal graphs, links (or interactions) among node sets appear in sequential orders, and these orders may enable interesting applications that a binary link prediction formulation fails to handle. In this paper, we focus on such an interaction order prediction problem among a given node set on temporal graphs. On the technical side, we develop a graph neural network model named Temporal ATtention network (TAT), which utilizes the fine-grained time information on temporal graphs by encoding continuous real-valued timestamps as vectors. For each transformation layer of the model, we devise an attention mechanism to aggregate neighborhoods' information based on their representations and the time encodings attached to their specific edges. We also propose a novel training scheme to address the permutation-sensitive property of the problem. Experiments on several real-world temporal graphs reveal that TAT outperforms some state-of-the-art graph neural networks by 55% on average under the AUC metric.
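A small sketch of one common way to encode continuous real-valued timestamps as vectors (fixed sinusoidal frequencies are used here for simplicity; in TAT the encoding is used inside attention layers and its exact functional form may differ):

```python
import numpy as np

def time_encode(timestamps, dim=8, base=10.0):
    """Map each real-valued timestamp t to a vector [cos(w_1 t), ..., cos(w_d t)]
    with geometrically spaced frequencies, so relative time differences are
    reflected in vector similarity."""
    freqs = 1.0 / base ** np.linspace(0, 1, dim)      # d frequencies
    return np.cos(np.outer(timestamps, freqs))        # shape (len(t), dim)

edge_times = np.array([0.0, 1.5, 3.2, 10.7])          # interaction timestamps
print(time_encode(edge_times).shape)                  # (4, 8)
```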

Learning How to Propagate Messages in Graph Neural Networks

This paper studies the problem of learning message propagation strategies for graph neural networks (GNNs). One of the challenges for graph neural networks is that of defining the propagation strategy. For instance, the choices of propagation steps are often specialized to a single graph and are not personalized to different nodes. To compensate for this, in this paper, we present learning to propagate, a general learning framework that not only learns the GNN parameters for prediction but, more importantly, can explicitly learn interpretable and personalized propagation strategies for different nodes and various types of graphs. We introduce the optimal propagation steps as latent variables to help find the maximum-likelihood estimation of the GNN parameters in a variational Expectation-Maximization (VEM) framework. Extensive experiments on various types of graph benchmarks demonstrate that our proposed framework achieves significantly better performance compared with the state-of-the-art methods, and can effectively learn personalized and interpretable propagation strategies for messages in GNNs.

Partial Multi-Label Learning with Meta Disambiguation

In partial multi-label learning (PML) problems, each instance is partially annotated with a candidate label set, which consists of multiple relevant labels and some noisy labels. To solve PML problems, existing methods typically try to recover the ground-truth information from partial annotations based on extra assumptions on the data structures. Since these assumptions hardly hold in real-world applications, the trained models may not generalize well to varied PML tasks. In this paper, we propose a novel approach for partial multi-label learning with meta disambiguation (PML-MD). Instead of relying on extra assumptions, we try to disambiguate between ground-truth and noisy labels in a meta-learning fashion. On one hand, the multi-label classifier is trained by minimizing a confidence-weighted ranking loss, which distinctively utilizes the supervised information according to the label quality; on the other hand, the confidence for each candidate label is adaptively estimated with its performance on a small validation set. To speed up the optimization, these two procedures are performed alternately with an online approximation strategy. Comprehensive experiments on multiple datasets and varied evaluation metrics validate the effectiveness of the proposed method.

Contrastive Multi-View Multiplex Network Embedding with Applications to Robust Network Alignment

Despite its success in learning network node representations, network embedding is still relatively new for multiplex networks (MNs) with multiple types of edges. In such networks, the inter-layer anchor links, which represent the alignment relations between nodes on different layers and are a crucial prerequisite for many cross-network applications like network alignment, are usually missing. For mining such anchor links between layers of MNs, multiplex network embedding (MNE) has become one of the most promising techniques. In this paper, we consider two problems for MNs: 1) edges can be missing to different extents, and data augmentation may mitigate this issue; 2) the known alignment anchor links between layers can be misleading since the behaviors of nodes on different layers are not always consistent, so the most informative ones should be emphasized over the misleading ones. However, most existing works neglect these two problems and simply 1) adopt one structural view for all the layers (e.g. random walk with the same window size) and 2) extract information equally from all the anchor links. We propose an end-to-end contrastive framework called cM2NE for MNE, which utilizes multiple structural views for each layer and learns with several plug-in components for different scenarios. Through end-to-end optimization on three levels, the intra-view, inter-view, and inter-layer level, our framework learns to select the fitted views for different layers and to maximize the inter-layer mutual information by emphasizing the most informative anchor links. Extensive experimental results on real-world datasets for node classification and multi-network alignment show that our approach consistently outperforms peer methods.

Removing Disparate Impact on Model Accuracy in Differentially Private Stochastic Gradient Descent

In differentially private stochastic gradient descent (DPSGD), gradient clipping and random noise addition disproportionately affect underrepresented and complex classes and subgroups. As a consequence, DPSGD has disparate impact: the accuracy of a model trained using DPSGD tends to decrease more on these classes and subgroups vs. the original, non-private model. If the original model is unfair in the sense that its accuracy is not the same across all subgroups, DPSGD exacerbates this unfairness. In this work, we study the inequality in utility loss due to differential privacy, which compares the changes in prediction accuracy w.r.t. each group between the private model and the non-private model. We analyze the cost of privacy w.r.t. each group and explain how the group sample size along with other factors is related to the privacy impact on group accuracy. Furthermore, we propose a modified DPSGD algorithm, called DPSGD-F, to achieve differential privacy, equal costs of differential privacy, and good utility. DPSGD-F adaptively adjusts the contribution of samples in a group depending on the group clipping bias such that differential privacy has no disparate impact on group accuracy. Our experimental evaluation shows the effectiveness of our removal algorithm on achieving equal costs of differential privacy with satisfactory utility.
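For context, a single DPSGD step with per-sample gradient clipping and Gaussian noise looks roughly as follows (a generic numpy sketch of standard DPSGD, not the proposed DPSGD-F; the per-sample gradients are placeholders):

```python
import numpy as np

def dpsgd_step(params, per_sample_grads, clip_norm=1.0, noise_mult=1.1, lr=0.1,
               rng=np.random.default_rng(0)):
    """One DPSGD update: clip each per-sample gradient to L2 norm `clip_norm`,
    average, then add Gaussian noise with std noise_mult * clip_norm / batch."""
    clipped = []
    for g in per_sample_grads:                       # shape (batch, dim)
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0, noise_mult * clip_norm / len(per_sample_grads),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)

rng = np.random.default_rng(1)
params = np.zeros(4)
grads = rng.normal(size=(32, 4))                     # placeholder per-sample gradients
print(dpsgd_step(params, grads))
```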

NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search

While pre-trained language models (e.g., BERT) have achieved impressive results on different natural language processing tasks, they have large numbers of parameters and suffer from big computational and memory costs, which make them difficult for real-world deployment. Therefore, model compression is necessary to reduce the computation and memory cost of pre-trained models. In this work, we aim to compress BERT and address the following two challenging practical issues: (1) The compression algorithm should be able to output multiple compressed models with different sizes and latencies, in order to support devices with different memory and latency limitations; (2) The algorithm should be downstream task agnostic, so that the compressed models are generally applicable for different downstream tasks. We leverage techniques in neural architecture search (NAS) and propose NAS-BERT, an efficient method for BERT compression. NAS-BERT trains a big supernet on a carefully designed search space containing a variety of architectures and outputs multiple compressed models with adaptive sizes and latency. Furthermore, the training of NAS-BERT is conducted on standard self-supervised pre-training tasks (e.g., masked language model) and does not depend on specific downstream tasks. Thus, the compressed models can be used across various downstream tasks. The technical challenge of NAS-BERT is that training a big supernet on the pre-training task is extremely costly. We employ several techniques including block-wise search, search space pruning, and performance approximation to improve search efficiency and accuracy. Extensive experiments on GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches, and can be directly applied to different downstream tasks with adaptive model sizes for different requirements of memory or latency.

Exploring Self-Supervised Representation Ensembles for COVID-19 Cough Classification

Using smartphone-collected respiratory sounds with deep learning models to detect and classify COVID-19 has recently become popular. It removes the need for in-person testing procedures, especially in rural regions where related medical supplies, experienced workers, and equipment are limited. However, existing sound-based diagnostic approaches are trained in a fully-supervised manner, which requires large-scale, well-labelled data. It is critical to discover new methods to leverage unlabelled respiratory data, which can be obtained more easily. In this paper, we propose a novel self-supervised learning enabled framework for COVID-19 cough classification. A contrastive pre-training phase is introduced to train a Transformer-based feature encoder with unlabelled data. Specifically, we design a random masking mechanism to learn robust representations of respiratory sounds. The pre-trained feature encoder is then fine-tuned in the downstream phase to perform cough classification. In addition, different ensembles with varied random masking rates are also explored in the downstream phase. Through extensive evaluations, we demonstrate that the proposed contrastive pre-training, the random masking mechanism, and the ensemble architecture contribute to improving cough classification performance.

MTC: Multiresolution Tensor Completion from Partial and Coarse Observations

Existing tensor completion formulations mostly rely on partial observations from a single tensor. However, tensors extracted from real-world data are often more complex due to: (i) Partial observation: only a small subset of tensor elements are available. (ii) Coarse observation: some tensor modes only present coarse and aggregated patterns (e.g., monthly summaries instead of daily reports). In this paper, we are given a subset of the tensor and some aggregated/coarse observations (along one or more modes) and seek to recover the original fine-granular tensor with low-rank factorization. We formulate a coupled tensor completion problem and propose an efficient Multi-resolution Tensor Completion model (MTC) to solve it. Our MTC model explores tensor mode properties and leverages the hierarchy of resolutions to recursively initialize an optimization setup, and optimizes the coupled system using alternating least squares. MTC ensures low computational and space complexity. We evaluate our model on two COVID-19 related spatio-temporal tensors. The experiments show that MTC achieves 65.20% and 75.79% percentage of fitness (PoF) in tensor completion with only 5% fine-granular observations, which is a 27.96% relative improvement over the best baseline. To evaluate the learned low-rank factors, we also design a tensor prediction task for daily and cumulative disease case prediction, where MTC achieves 50% PoF and a 30% relative improvement over the best baseline.

Model-Based Counterfactual Synthesizer for Interpretation

Counterfactuals, serving as one of the emerging types of model interpretation, have recently received attention from both researchers and practitioners. Counterfactual explanations formalize the exploration of "what-if'' scenarios, and are an instance of example-based reasoning using a set of hypothetical data samples. Counterfactuals essentially show how the model decision alters with input perturbations. Existing methods for generating counterfactuals are mainly algorithm-based, which makes them time-inefficient and assumes the same counterfactual universe for different queries. To address these limitations, we propose a Model-based Counterfactual Synthesizer (MCS) framework for interpreting machine learning models. We first analyze the model-based counterfactual process and construct a base synthesizer using a conditional generative adversarial net (CGAN). To better approximate the counterfactual universe for rare queries, we employ the umbrella sampling technique in training the MCS framework. Besides, we also enhance the MCS framework by incorporating the causal dependence among attributes with model inductive bias, and validate its design correctness from the causality identification perspective. Experimental results on several datasets demonstrate the effectiveness as well as efficiency of our proposed MCS framework, and verify its advantages compared with other alternatives.

Discrete-time Temporal Network Embedding via Implicit Hierarchical Learning in Hyperbolic Space

Representation learning over temporal networks has drawn considerable attention in recent years. Efforts have mainly focused on modeling structural dependencies and temporal evolving regularities in Euclidean space, which, however, underestimates the inherent complex and hierarchical properties of many real-world temporal networks, leading to sub-optimal embeddings. To explore these properties of a complex temporal network, we propose a hyperbolic temporal graph network (HTGN) that fully takes advantage of the exponential capacity and hierarchical awareness of hyperbolic geometry. More specifically, HTGN maps the temporal graph into hyperbolic space, and incorporates a hyperbolic graph neural network and a hyperbolic gated recurrent neural network to capture the evolving behaviors and implicitly preserve hierarchical information simultaneously. Furthermore, in the hyperbolic space, we propose two important modules that enable HTGN to successfully model temporal networks: (1) a hyperbolic temporal contextual self-attention (HTA) module to attend to historical states and (2) a hyperbolic temporal consistency (HTC) module to ensure stability and generalization. Experimental results on multiple real-world datasets demonstrate the superiority of HTGN for temporal graph embedding, as it consistently outperforms competing methods by significant margins in various temporal link prediction tasks. Specifically, HTGN achieves AUC improvements of up to 9.98% for link prediction and 11.4% for new link prediction. Moreover, the ablation study further validates the representational ability of hyperbolic geometry and the effectiveness of the proposed HTA and HTC modules.
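For reference, the distance in the Poincaré ball model of hyperbolic space, whose exponential growth in volume with radius underlies the "exponential capacity" mentioned above (the paper's exact model and curvature may differ), is:

```latex
% Poincaré-ball distance between points x, y with ||x||, ||y|| < 1:
d(x, y) = \operatorname{arcosh}\!\left( 1 + \frac{2\,\lVert x - y \rVert^2}
          {\bigl(1 - \lVert x \rVert^2\bigr)\bigl(1 - \lVert y \rVert^2\bigr)} \right)
```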

Numerical Formula Recognition from Tables

Claims over the numerical relationships among some measures are commonly expressed in tabular form and widely exist in documents published on the Web. This paper introduces the problem of numerical formula recognition from tables, namely recognizing all numerical formulas inside a given table. It can support many interesting downstream applications, such as numerical error correction and formula recommendation in tables. Here, we emphasize that a table is a kind of language that adopts a different linguistic paradigm from natural language. It uses visual grammar, like visual layout and visual settings (e.g., indentation, font style), to express the grammatical relationships among the table cells. Understanding tables and recognizing formulas require decoding the visual grammar while simultaneously understanding the textual information. Another challenge is that formulas are complicated in terms of diverse math functions and variable-length argument lists. To address these challenges, we cast this task into a uniform framework that extracts relations between pairs of table cells. A two-channel neural network model, TaFor, is proposed to embed both the textual and visual features of a table cell. Our framework achieves a formula-level F1-score of 0.90 on a real-world dataset of 190,179 tables, while a retrieval-based method achieves an F1-score of 0.72. We also perform extensive experiments to demonstrate the effectiveness of each component in our model, and conduct a case study to discuss the limits of the proposed model. With our published data, this study also aims to attract the community's interest in deep semantic understanding over tables.

TopNet: Learning from Neural Topic Model to Generate Long Stories

Long story generation (LSG) is one of the coveted goals in natural language processing. Different from most text generation tasks, LSG requires outputting a long story of rich content based on a much shorter text input, and it often suffers from information sparsity. In this paper, we propose TopNet to alleviate this problem, by leveraging the recent advances in neural topic modeling to obtain high-quality skeleton words to complement the short input. In particular, instead of directly generating a story, we first learn to map the short text input to a low-dimensional topic distribution (which is pre-assigned by a topic model). Based on this latent topic distribution, we can use the reconstruction decoder of the topic model to sample a sequence of inter-related words as a skeleton for the story. Experiments on two benchmark datasets show that our proposed framework is highly effective in skeleton word selection and significantly outperforms the state-of-the-art models in both automatic evaluation and human evaluation.

Context-aware Outstanding Fact Mining from Knowledge Graphs

An Outstanding Fact (OF) is an attribute that makes a target entity stand out from its peers. The mining of OFs has important applications, especially in Computational Journalism, such as news promotion, fact-checking, and news story finding. However, existing approaches to OF mining: (i) disregard the context in which the target entity appears, hence may report facts irrelevant to that context; and (ii) require relational data, which are often unavailable or incomplete in many application domains. In this paper, we introduce the novel problem of mining Context-aware Outstanding Facts (COFs) for a target entity under a given context specified by a context entity. We propose FMiner, a context-aware mining framework that leverages knowledge graphs (KGs) for COF mining. FMiner generates COFs in two steps. First, it discovers top-k relevant relationships between the target and the context entity from a KG. We propose novel optimizations and pruning techniques to expedite this operation, as this process is very expensive on large KGs due to its exponential complexity. Second, for each derived relationship, we find the attributes of the target entity that distinguish it from peer entities that have the same relationship with the context entity, yielding the top-l COFs. As such, the mining process is modeled as a top-(k,l) search problem. Context-awareness is ensured by relying on the relevant relationships with the context entity to derive peer entities for COF extraction. Consequently, FMiner can effectively navigate the search to obtain context-aware OFs by incorporating a context entity. We conduct extensive experiments, including a user study, to validate the efficiency and the effectiveness of FMiner.

Energy-Efficient Models for High-Dimensional Spike Train Classification using Sparse Spiking Neural Networks

Spike train classification is an important problem in many areas such as healthcare and mobile sensing, where each spike train is a high-dimensional time series of binary values. Conventional research on spike train classification mainly focuses on developing Spiking Neural Networks (SNNs) under resource-sufficient settings (e.g., on GPU servers). The neurons of such SNNs are usually densely connected in each layer. However, in many real-world applications, we often need to deploy SNN models on resource-constrained platforms (e.g., mobile devices) to analyze high-dimensional spike train data. The high resource requirement of densely-connected SNNs can make them hard to deploy on mobile devices. In this paper, we study the problem of energy-efficient SNNs with sparsely-connected neurons. We propose an SNN model with sparse spatio-temporal coding. Our solution is based on the re-parameterization of weights in an SNN and the application of sparsity regularization during optimization. We compare our work with state-of-the-art SNNs and demonstrate that our sparse SNNs achieve significantly better computational efficiency on both neuromorphic and standard datasets with comparable classification accuracy. Furthermore, compared with densely-connected SNNs, we show through extensive experiments that our method has a better generalization capability on small datasets.

Defending Privacy Against More Knowledgeable Membership Inference Attackers

Membership Inference Attack (MIA) in deep learning is a common form of privacy attack which aims to infer whether a data sample is in a target classifier's training dataset or not. Previous studies of MIA typically tackle either a black-box or a white-box adversary model, assuming the attacker does not know (or knows) the structure and parameters of the target classifier while having access to the confidence vector of the query output. With the popularity of privacy protection methods such as differential privacy, it is increasingly easy for an attacker to learn the defense method adopted by the target classifier, which poses an extra challenge to privacy protection. In this paper, we name such an attacker a crystal-box adversary. We present definitions for the utility and privacy of the target classifier, and formulate the design goal of the defense method as an optimization problem. We also conduct theoretical analysis of the respective forms of the optimization for three adversary models, namely black-box, white-box, and crystal-box, and prove that the optimization problem is NP-hard. We therefore solve a surrogate problem and propose three defense methods, which, used together, can trade off utility and privacy. A notable advantage of our approach is that it can resist attacks from the three adversary models, black-box, white-box, and crystal-box, simultaneously. Evaluation results show the effectiveness of our proposed approach in defending privacy against MIA and its better performance compared to previous defense methods.

Accurate Multivariate Stock Movement Prediction via Data-Axis Transformer with Multi-Level Contexts

How can we efficiently correlate multiple stocks for accurate stock movement prediction? Stock movement prediction has received growing interest in the data mining and machine learning communities due to its substantial impact on financial markets. One way to improve prediction accuracy is to utilize the correlations between multiple stocks, obtaining reliable evidence that is robust to the random noise of individual prices. However, it has been challenging to acquire accurate correlations between stocks because of their asymmetric and dynamic nature, which is also influenced by the global movement of a market. In this work, we propose DTML (Data-axis Transformer with Multi-Level contexts), a novel approach for stock movement prediction that learns the correlations between stocks in an end-to-end way. DTML captures asymmetric and dynamic correlations by a) learning temporal correlations within each stock, b) generating multi-level contexts based on a global market context, and c) utilizing a transformer encoder for learning inter-stock correlations. DTML achieves state-of-the-art accuracy on six datasets collected from stock markets in the US, China, Japan, and the UK, yielding up to 13.8%p higher profits than the best competitors and an annualized return of 44.4% in investment simulation.

Performance-Adaptive Sampling Strategy Towards Fast and Accurate Graph Neural Networks

The main challenge of adapting Graph convolutional networks (GCNs) to large-scale graphs is the scalability issue due to the uncontrollable neighborhood expansion in the aggregation stage. Several sampling algorithms have been proposed to limit the neighborhood expansion. However, these algorithms focus on minimizing the variance in sampling to approximate the original aggregation. This leads to two critical problems: 1) low accuracy because the sampling policy is agnostic to the performance of the target task, and 2) vulnerability to noise or adversarial attacks on the graph.

In this paper, we propose PASS, a performance-adaptive sampling strategy that samples neighbors informative for a target task. PASS optimizes directly towards task performance, as opposed to variance reduction. PASS trains a sampling policy by propagating gradients of the task performance loss through GCNs and the non-differentiable sampling operation. We dissect the back-propagation process and analyze how PASS learns from the gradients which neighbors are informative, assigning them high sampling probabilities. In our extensive experiments, PASS outperforms state-of-the-art sampling methods by up to 10% accuracy on public benchmarks and up to 53% accuracy in the presence of adversarial attacks.
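
A minimal sketch of one way to push a task-loss gradient through a discrete neighbor-sampling step, using a score-function (REINFORCE-style) estimator over learned per-neighbor logits; this is an illustration under stated assumptions rather than the paper's exact estimator, and the names below are hypothetical:

import torch

def sample_neighbors(logits, k):
    """Sample k neighbor indices from a categorical policy and return them
    together with the log-probability term needed for the policy gradient."""
    probs = torch.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, k, replacement=True)
    log_p = torch.log(probs[idx] + 1e-12).sum()
    return idx, log_p

logits = torch.zeros(20, requires_grad=True)   # sampling policy over 20 neighbors
idx, log_p = sample_neighbors(logits, k=5)
task_loss = torch.rand(())                     # stands in for the GCN's task loss
(task_loss.detach() * log_p).backward()        # score-function gradient w.r.t. logits
print(logits.grad.shape)                       # torch.Size([20])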

Extremely Compact Non-local Representation Learning

In contrast to regular convolutions with local receptive fields, non-local operations have been widely proven to be an effective method for modeling long-range dependencies. Although many prior works have been proposed, prohibitive computation and GPU memory occupation remain major concerns. Rather than carrying out non-local operations pixel-wise or channel-wise in a computation-intensive way, we argue that effective non-local operation can be achieved using a more compact high-order statistic, which can be computed more efficiently and may convey high-level information. In this paper, we propose an extremely compact non-local learning module (CoNL) with high-order reasoning based on a graph convolution as the core. In our CoNL, global Hadamard pooling (GHP) is used as a non-local operation to extract a compact second-order feature vector from the input tensor. With the help of a light-weight graph convolution network (GCN), this high-order compact vector is further refined with high-level reasoning. After the GCN refinement, the compact high-order vector intuitively indicates some global semantic characteristics and is eventually applied to enhance the input tensor through a channel scaling operation. The CoNL module is designed to be easily pluggable into existing networks. Extensive experiments on a wide range of tasks demonstrate the effectiveness and efficiency of our work. The proposed CoNL achieves comparable or superior performance over previous state-of-the-art baselines on video recognition, semantic segmentation, object detection, and instance segmentation tasks. For a 96 x 96 x 2048 input, our block consumes 13.6x less computational cost than the non-local block and occupies 7.6x less GPU memory.

Fed2: Feature-Aligned Federated Learning

Federated learning learns from scattered data by fusing collaborative models from local nodes. However, conventional coordinate-based model averaging by FedAvg ignores the random information encoded per parameter and may suffer from structural feature misalignment. In this work, we propose Fed2, a feature-aligned federated learning framework that resolves this issue by establishing a firm structure-feature alignment across the collaborative models. Fed2 is composed of two major designs: First, we design a feature-oriented model structure adaptation method to ensure explicit feature allocation in different neural network structures. By applying the structure adaptation to collaborative models, matchable structures with similar feature information can be initialized at the very early training stage. During the federated learning process, we then propose a feature paired averaging scheme to guarantee aligned feature distributions and avoid feature fusion conflicts under both IID and non-IID scenarios. As a result, Fed2 effectively enhances federated learning convergence performance under extensive homogeneous and heterogeneous settings, providing excellent convergence speed, accuracy, and computation/communication efficiency.

A Novel Multi-View Clustering Method for Unknown Mapping Relationships Between Cross-View Samples

Existing multi-view clustering algorithms require that a sample in one view is completely or partially mapped onto one or more samples in a corresponding view. However, this requirement cannot be satisfied in many practical applications. Fortunately, there is a common cognition that the graph structure formed from each view should be as consistent as possible. Thus, this paper proposes a novel multi-view clustering method for unknown mapping relationships between cross-view samples, based on the framework of non-negative matrix factorization, as an attempt to solve this problem. The objective function is built from reconstruction error terms, local structural constraint terms, and cross-view mapping loss terms obtained by exploring cross-view relationships. The experimental results show that the proposed method not only performs well in revealing the real mapping relationships between cross-view samples but also outperforms the comparison algorithms in terms of clustering results.

Socially-Aware Self-Supervised Tri-Training for Recommendation

Self-supervised learning (SSL), which can automatically generate ground-truth samples from raw data, holds vast potential to improve recommender systems. Most existing SSL-based methods perturb the raw data graph with uniform node/edge dropout to generate new data views and then conduct self-discrimination-based contrastive learning over the different views to learn generalizable representations. Under this scheme, only a bijective mapping is built between nodes in two different views, which means that the self-supervision signals from other nodes are neglected. Due to the widely observed homophily in recommender systems, we argue that the supervisory signals from other nodes are also highly likely to benefit representation learning for recommendation. To capture these signals, a general socially-aware SSL framework that integrates tri-training is proposed in this paper. Technically, our framework first augments the user data views with the user social information. Then, under the regime of tri-training for multi-view encoding, the framework builds three graph encoders (one for recommendation) upon the augmented views and iteratively improves each encoder with self-supervision signals from other users, generated by the other two encoders. Since the tri-training operates on the augmented views of the same data sources for self-supervision signals, we name it self-supervised tri-training. Extensive experiments on multiple real-world datasets consistently validate the effectiveness of the self-supervised tri-training framework for improving recommendation. The code is released at https://github.com/Coder-Yu/QRec.

Efficient Optimization Methods for Extreme Similarity Learning with Nonlinear Embeddings

We study the problem of learning similarity by using nonlinear embedding models (e.g., neural networks) from all possible pairs. This problem is well known for the difficulty of training with the extreme number of pairs. For the special case of linear embeddings, many studies have addressed this issue of handling all pairs by considering certain loss functions and developing efficient optimization algorithms. This paper aims to extend those results to general nonlinear embeddings. First, we give detailed derivations and clean formulations for efficiently calculating the building blocks of optimization algorithms, such as the objective function, its gradient, and Hessian-vector products. The result enables the use of many optimization methods for extreme similarity learning with nonlinear embeddings. Second, we study some optimization methods in detail. Due to the use of nonlinear embeddings, we address implementation issues that differ from the linear case. In the end, some methods are shown to be highly efficient for extreme similarity learning with nonlinear embeddings.
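
As a concrete illustration of one such building block (a minimal sketch under the common squared-loss formulation from the linear-embedding literature; the paper's derivations cover more general losses and the nonlinear case), the loss summed over all m x n pairs can be evaluated without enumerating them:

\[
\sum_{u=1}^{m}\sum_{v=1}^{n}\langle p_u, q_v\rangle^2 \;=\; \operatorname{tr}\!\big((P^{\top}P)(Q^{\top}Q)\big),
\]

where P in R^{m x d} and Q in R^{n x d} stack the two sides' embeddings (here produced by the nonlinear encoders), so the cost drops from O(mnd) to O((m+n)d^2 + d^3).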

Enhancing Taxonomy Completion with Concept Generation via Fusing Relational Representations

Automatic construction of a taxonomy supports many applications in e-commerce, web search, and question answering. Existing taxonomy expansion or completion methods assume that new concepts have been accurately extracted and their embedding vectors learned from the text corpus. However, one critical and fundamental challenge in fixing the incompleteness of taxonomies is the incompleteness of the extracted concepts, especially for those whose names have multiple words and consequently low frequency in the corpus. To resolve the limitations of extraction-based methods, we propose GenTaxo to enhance taxonomy completion by identifying positions in existing taxonomies that need new concepts and then generating appropriate concept names. Instead of relying on the corpus for concept embeddings, GenTaxo learns the contextual embeddings from their surrounding graph-based and language-based relational information, and leverages the corpus for pre-training a concept name generator. Experimental results demonstrate that GenTaxo improves the completeness of taxonomies over existing methods.

A Transformer-based Framework for Multivariate Time Series Representation Learning

We present a novel framework for multivariate time series representation learning based on the transformer encoder architecture. The framework includes an unsupervised pre-training scheme, which can offer substantial performance benefits over fully supervised learning on downstream tasks, both with and even without leveraging additional unlabeled data, i.e., by reusing the existing data samples. Evaluating our framework on several public multivariate time series datasets from various domains and with diverse characteristics, we demonstrate that it performs significantly better than the best currently available methods for regression and classification, even for datasets consisting of only a few hundred training samples. Given the pronounced interest in unsupervised learning for nearly all domains in the sciences and in industry, these findings represent an important landmark, presenting the first unsupervised method shown to push the limits of state-of-the-art performance for multivariate time series regression and classification.

Off-Policy Evaluation via Adaptive Weighting with Data from Contextual Bandits

It has become increasingly common for data to be collected adaptively, for example using contextual bandits. Historical data of this type can be used to evaluate other treatment assignment policies to guide future innovation or experiments. However, policy evaluation is challenging if the target policy differs from the one used to collect data, and popular estimators, including doubly robust (DR) estimators, can be plagued by bias, excessive variance, or both. In particular, when the pattern of treatment assignment in the collected data looks little like the pattern generated by the policy to be evaluated, the importance weights used in DR estimators explode, leading to excessive variance.

In this paper, we improve the DR estimator by adaptively weighting observations to control its variance. We show that a t-statistic based on our improved estimator is asymptotically normal under certain conditions, allowing us to form confidence intervals and test hypotheses. Using synthetic data and public benchmarks, we provide empirical evidence for our estimator's improved accuracy and inferential properties relative to existing alternatives.
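
For reference, the classical doubly robust value estimate that this line of work builds on has the form

\[
\hat{V}(\pi) \;=\; \frac{1}{T}\sum_{t=1}^{T}\Big[\sum_{a}\pi(a\mid x_t)\,\hat{\mu}(x_t,a) \;+\; \frac{\pi(a_t\mid x_t)}{e_t(a_t\mid x_t)}\big(y_t-\hat{\mu}(x_t,a_t)\big)\Big],
\]

where \hat{\mu} is a fitted outcome model and e_t is the (possibly adaptive) logging propensity. The importance ratio \pi/e_t is the term that explodes when the logging and target policies diverge, and the paper's adaptive weights are applied to stabilize exactly this term (the specific weighting scheme is omitted here).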

Efficient Incremental Computation of Aggregations over Sliding Windows

Computing aggregation over sliding windows, i.e., finite subsets of an unbounded stream, is a core operation in streaming analytics. We propose PBA (Parallel Boundary Aggregator), a novel parallel algorithm that groups continuous slices of streaming values into chunks and exploits two buffers, cumulative slice aggregations and left cumulative slice aggregations, to compute sliding window aggregations efficiently. PBA runs in O(1) time, performing at most 3 merging operations per slide while consuming O(n) space for windows with n partial aggregations. Our empirical experiments demonstrate that PBA can improve throughput up to 4X while reducing latency, compared to state-of-the-art algorithms.
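
A minimal sketch of the cumulative-buffer idea for a single aggregation (max) executed sequentially, assuming per-slice aggregates arrive one at a time; the names are illustrative, and PBA itself handles general associative operators, parallel merging, and the exact three-merge bound:

class SlidingMax:
    """Sliding max over the latest n slice aggregates in amortized O(1) per
    slide: a left-cumulative stack holds running maxima of the older slices,
    and a single running maximum covers the newer slices."""
    NEG = float("-inf")

    def __init__(self, n):
        self.n = n
        self.front = []          # (value, cumulative max from here toward newer front entries)
        self.back = []           # newer slice values in arrival order
        self.back_max = self.NEG

    def slide(self, value):
        if len(self.front) + len(self.back) == self.n:   # evict the oldest slice
            if not self.front:                           # refill from the back buffer
                for v in reversed(self.back):
                    cum = v if not self.front else max(v, self.front[-1][1])
                    self.front.append((v, cum))
                self.back, self.back_max = [], self.NEG
            self.front.pop()
        self.back.append(value)
        self.back_max = max(self.back_max, value)

    def query(self):
        front_max = self.front[-1][1] if self.front else self.NEG
        return max(front_max, self.back_max)

w = SlidingMax(3)
for x in [5, 1, 4, 2, 8, 3]:
    w.slide(x)
    print(w.query())   # max of the last 3 values seen so far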

Domain-oriented Language Modeling with Adaptive Hybrid Masking and Optimal Transport Alignment

Motivated by the success of pre-trained language models such as BERT in a broad range of natural language processing (NLP) tasks, recent research efforts have been made to adapt these models to different application domains. Along this line, existing domain-oriented models have primarily followed the vanilla BERT architecture and make straightforward use of the domain corpus. However, domain-oriented tasks usually require an accurate understanding of domain phrases, and such fine-grained phrase-level knowledge is hard to capture with existing pre-training schemes. Also, the word co-occurrence-guided semantic learning of pre-trained models can be greatly augmented by entity-level association knowledge, but this carries a risk of introducing noise due to the lack of ground-truth word-level alignment. To address these issues, we provide a generalized domain-oriented approach, which leverages auxiliary domain knowledge to improve the existing pre-training framework from two aspects. First, to preserve phrase knowledge effectively, we build a domain phrase pool as auxiliary knowledge and introduce an Adaptive Hybrid Masked Model to incorporate it; the model integrates two learning modes, word learning and phrase learning, and allows switching between them. Second, we introduce Cross Entity Alignment to leverage entity association as weak supervision to augment the semantic learning of pre-trained models. To alleviate the potential noise in this process, we introduce an interpretable Optimal Transport-based approach to guide alignment learning. Experiments on four domain-oriented tasks demonstrate the superiority of our framework.

Data Poisoning Attack against Recommender System Using Incomplete and Perturbed Data

Recent studies reveal that recommender systems are vulnerable to data poisoning attacks due to their open nature. In a data poisoning attack, the attacker typically recruits a group of controlled users to inject well-crafted user-item interaction data into the recommendation model's training set to modify the model parameters as desired. Thus, existing attack approaches usually require full access to the training data to infer items' characteristics and craft the fake interactions for the controlled users. However, such attack approaches may not be feasible in practice due to the attacker's limited data collection capability and the restricted access to the training data, which are sometimes even perturbed by the privacy-preserving mechanism of the service providers. Such a design-reality gap may cause attacks to fail. In this paper, we fill the gap by proposing two novel adversarial attack approaches to handle incompleteness and perturbations in user-item interaction data. First, we propose a bi-level optimization framework that incorporates a probabilistic generative model to find the users and items whose interaction data is sufficient and has not been significantly perturbed, and leverages these users' and items' data to craft fake user-item interactions. Moreover, we reverse the learning process of recommendation models and develop a simple yet effective approach that can incorporate context-specific heuristic rules to handle data incompleteness and perturbations. Extensive experiments on two datasets against three representative recommendation models show that the proposed approaches achieve better attack performance than existing approaches.

Data Poisoning Attacks Against Outcome Interpretations of Predictive Models

The past decades have witnessed significant progress towards improving the accuracy of predictions powered by complex machine learning models. Despite much success, the lack of model interpretability prevents the usage of these techniques in life-critical systems such as medical diagnosis and self-driving systems. Recently, the interpretability issue has received much attention, and one critical task is to explain why a predictive model makes a specific decision. We refer to this task as outcome interpretation. Many outcome interpretation methods have been developed to produce human-understandable interpretations by utilizing intermediate results of the machine learning models, such as gradients and model parameters.

Although the effectiveness of outcome interpretation approaches has been shown in a benign environment, their robustness against data poisoning attacks (i.e., attacks at the training phase) has not been studied. As the first work towards this direction, we aim to answer an important question: Can training-phase adversarial samples manipulate the outcome interpretation of target samples? To answer this question, we propose a data poisoning attack framework named IMF (Interpretation Manipulation Framework), which can manipulate the interpretations of target samples produced by representative outcome interpretation methods. Extensive evaluations verify the effectiveness and efficiency of the proposed attack strategies on two real-world datasets.

ELITE: Robust Deep Anomaly Detection with Meta Gradient

Deep learning techniques have been widely used for detecting anomalies in complex data. Most of these techniques are either unsupervised or semi-supervised because of the lack of a large number of labeled anomalies. However, they typically rely on clean training data not polluted by anomalies to learn the distribution of the normal data; otherwise, the learned distribution tends to be distorted and hence ineffective in distinguishing between normal and abnormal data. To solve this problem, we propose a novel approach called ELITE that uses a small number of labeled examples to infer the anomalies hidden in the training samples. It then turns these anomalies into useful signals that help to better detect anomalies in user data. Unlike the classical semi-supervised classification strategy, which uses labeled examples as training data, ELITE uses them as a validation set. It leverages the gradient of the validation loss to predict whether a training sample is abnormal. The intuition is that correctly identifying the hidden anomalies could produce a better deep anomaly model with reduced validation loss. Our experiments on public benchmark datasets show that ELITE achieves up to 30% improvement in ROC AUC compared to the state-of-the-art, while remaining robust to polluted training data.
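
A minimal sketch of the validation-gradient intuition, using a toy classifier as a stand-in for the deep anomaly model (the names are illustrative and the paper's meta-gradient formulation is more elaborate): a training sample whose loss gradient points against the gradient of the clean validation loss is flagged as a likely hidden anomaly.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

# small labeled validation set (assumed clean) and unlabeled-quality training samples
x_val, y_val = torch.randn(16, 10), torch.randint(0, 2, (16,))
x_trn, y_trn = torch.randn(64, 10), torch.randint(0, 2, (64,))

g_val = flat_grad(loss_fn(model(x_val), y_val))

scores = []
for i in range(len(x_trn)):
    g_i = flat_grad(loss_fn(model(x_trn[i:i+1]), y_trn[i:i+1]))
    scores.append(-torch.dot(g_i, g_val).item())   # high score = likely anomaly

suspects = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print(suspects)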

Knowledge-Enhanced Domain Adaptation in Few-Shot Relation Classification

Relation classification (RC) is an important task in knowledge extraction from texts, but data-driven approaches, although achieving high performance, rely heavily on a large amount of annotated training data. Recently, many few-shot RC models have been proposed and have yielded promising results on general-domain datasets, but when adapting to a specific domain, such as medicine, the performance drops dramatically. In this paper, we propose a Knowledge-Enhanced Few-shot RC model for the Domain Adaptation task (KEFDA), which incorporates general and domain-specific knowledge graphs (KGs) into the RC model to improve its domain adaptability. With the help of concept-level KGs, the model can better understand the semantics of texts and easily summarize the global semantics of relation types from only a few instances. More importantly, as a kind of meta-information, the manner of utilizing KGs can be transferred from existing tasks to new tasks, even across domains. Specifically, we design a knowledge-enhanced prototypical network to conduct instance matching and a relation-meta learning network for implicit relation matching. The two scoring functions are combined to infer the relation type of a new instance. Experimental results on the Domain Adaptation Challenge in the FewRel 2.0 benchmark demonstrate that our approach significantly outperforms the state-of-the-art models (by 6.63% on average).

Attentive Heterogeneous Graph Embedding for Job Mobility Prediction

Job mobility prediction is an emerging research topic that can benefit both organizations and talents in various ways, such as job recommendation, talent recruitment, and career planning. Nevertheless, most existing studies only focus on modeling the individual-level career trajectories of talents, while the impact of macro-level job transition relationships (e.g., talent flow among companies and job positions) has been largely neglected. To this end, in this paper we propose an enhanced approach to job mobility prediction based on a heterogeneous company-position network constructed from massive career trajectory data. Specifically, we design an Attentive heterogeneous graph embedding for sequential prediction (Ahead) framework to predict the next career move of talents, which contains two components, namely an attentive heterogeneous graph embedding (AHGN) model and a Dual-GRU model for career path mining. In particular, the AHGN model is used to learn comprehensive representations for companies and positions on the heterogeneous network, in which two kinds of aggregators are employed to aggregate the information from the external and internal neighbors of a node. Afterwards, a novel type-attention mechanism is designed to automatically fuse the information of the two aggregators for updating node representations. Moreover, the Dual-GRU model is devised to model the parallel sequences that appear in pairs, which can be used to capture the sequential interactive information between companies and positions. Finally, we conduct extensive experiments on a real-world dataset to evaluate our Ahead framework. The experimental results clearly validate the effectiveness of our approach compared with the state-of-the-art baselines in terms of job mobility prediction.

Scalable Heterogeneous Graph Neural Networks for Predicting High-potential Early-stage Startups

It is critical and important for venture investors to find high-potential startups at their early stages. Indeed, many efforts have been made to study the key factors for the success of startups through topological analysis of the heterogeneous information network of people, startups, and venture firms, or through representation learning of latent startup profile features. However, the existing topological analysis lacks an in-depth understanding of heterogeneous information, and the representation learning approach relies heavily on domain-specific knowledge for feature selection. Instead, in this paper, we propose a Scalable Heterogeneous Graph Markov Neural Network (SHGMNN) for identifying high-potential startups. The general idea is to use graph neural networks (GNNs) to learn effective startup representations through efficient end-to-end training and to model the label dependency among startups through Maximum A Posteriori (MAP) inference. Specifically, we first define different metapaths to capture various semantics over the heterogeneous information network (HIN) and aggregate all semantic information into a summated graph structure. To predict high-potential early-stage startups, we introduce a GNN to diffuse information over the summated graph. We then adopt MAP inference over Hinge-Loss Markov Random Fields to enforce label dependency. Here, a pseudolikelihood variational expectation-maximization (EM) framework is incorporated to optimize both the MAP inference and the GNN iteratively: the E-step performs the inference, and the M-step updates the GNN. For efficiency, we develop a GNN with a lightweight linear diffusion architecture to perform graph propagation over web-scale heterogeneous information networks. Finally, extensive experiments and case studies on real-world datasets demonstrate the superiority of SHGMNN.

Balancing Consistency and Disparity in Network Alignment

Network alignment plays an important role in a variety of applications. Many traditional methods explicitly or implicitly assume the alignment consistency which might suffer from over-smoothness, whereas some recent embedding based methods could somewhat embrace the alignment disparity by sampling negative alignment pairs. However, under different or even competing designs of negative sampling distributions, some methods advocate positive correlation which could result in false negative samples incorrectly violating the alignment consistency, whereas others champion negative correlation or uniform distribution to sample nodes which may contribute little to learning meaningful embeddings. In this paper, we demystify the intrinsic relationships behind various network alignment methods and between these competing design principles of sampling. Specifically, in terms of model design, we theoretically reveal the close connections between a special graph convolutional network model and the traditional consistency based alignment method. For model training, we quantify the risk of embedding learning for network alignment with respect to the sampling distributions. Based on these, we propose NeXtAlign which strikes a balance between alignment consistency and disparity. We conduct extensive experiments that demonstrate the proposed method achieves significant improvements over the state-of-the-arts.

Where are we in embedding spaces?

Hyperbolic space and hyperbolic embeddings are becoming a popular research topic for recommender systems. However, it is not clear under what circumstances hyperbolic space should be considered. To fill this gap, this paper provides theoretical analysis and empirical results on when and where to use hyperbolic space and hyperbolic embeddings in recommender systems. Specifically, we answer which types of models and datasets are better suited to hyperbolic space, as well as which latent size to choose. We evaluate our answers by comparing the performance of Euclidean space and hyperbolic space on different latent space models in both the general item recommendation domain and the social recommendation domain, with 6 widely used datasets and different latent sizes. Additionally, we propose a new metric learning based recommendation method called SCML and its hyperbolic version HSCML. We evaluate our conclusions regarding hyperbolic space on SCML and show the state-of-the-art performance of hyperbolic space by comparing HSCML with other baseline methods.

ROD: Reception-aware Online Distillation for Sparse Graphs

Graph neural networks (GNNs) have been widely used in many graph-based tasks such as node classification, link prediction, and node clustering. However, GNNs gain their performance benefits mainly from performing feature propagation and smoothing across the edges of the graph, thus requiring sufficient connectivity and label information for effective propagation. Unfortunately, many real-world networks are sparse in terms of both edges and labels, leading to sub-optimal performance of GNNs. Recent interest in this sparsity problem has focused on the self-training approach, which expands supervised signals with pseudo labels. Nevertheless, the self-training approach inherently cannot realize the full potential of refining the learning performance on sparse graphs due to the unsatisfactory quality and quantity of the pseudo labels.

In this paper, we propose ROD, a novel reception-aware online knowledge distillation approach for sparse graph learning. We design three supervision signals for ROD: multi-scale reception-aware graph knowledge, task-based supervision, and rich distilled knowledge, allowing online knowledge transfer in a peer-teaching manner. To extract the knowledge concealed in the multi-scale reception fields, ROD explicitly requires individual student models to preserve different levels of locality information. For a given task, each student predicts based on its reception-scale knowledge, while a strong teacher is simultaneously established on the fly by combining multi-scale knowledge. Our approach has been extensively evaluated on 9 datasets and a variety of graph-based tasks, including node classification, link prediction, and node clustering. The results demonstrate that ROD achieves state-of-the-art performance and is more robust to graph sparsity.

Learning Based Proximity Matrix Factorization for Node Embedding

Node embedding learns a low-dimensional representation for each node in the graph. Recent progress on node embedding shows that proximity matrix factorization methods gain superb performance and scale to large graphs with millions of nodes. Existing approaches first define a proximity matrix and then learn the embeddings that fit the proximity by matrix factorization. Most existing matrix factorization methods adopt the same proximity for different tasks, while it is observed that different tasks and datasets may require different proximity, limiting their representation power.

Motivated by this, we propose Lemane, a framework with trainable proximity measures, which can be learned to best suit the datasets and tasks at hand automatically. Our method is end-to-end, which incorporates differentiable SVD in the pipeline so that the parameters can be trained via backpropagation. However, this learning process is still expensive on large graphs. To improve the scalability, we train proximity measures only on carefully subsampled graphs, and then apply standard proximity matrix factorization on the original graph using the learned proximity. Note that, computing the learned proximities for each pair is still expensive for large graphs, and existing techniques for computing proximities are not applicable to the learned proximities. Thus, we present generalized push techniques to make our solution scalable to large graphs with millions of nodes. Extensive experiments show that our proposed solution outperforms existing solutions on both link prediction and node classification tasks on almost all datasets.

Multi-Task Learning via Generalized Tensor Trace Norm

The trace norm is widely used in multi-task learning as it can discover low-rank structures among tasks in terms of model parameters. Nowadays, with the emergence of large, complex datasets and the popularity of deep learning techniques, tensor trace norms have been used for deep multi-task models. However, existing tensor trace norms cannot discover all the low-rank structures, and they require users to manually determine the importance of their components. To address these two issues, in this paper we propose the Generalized Tensor Trace Norm (GTTN). The GTTN is defined as a convex combination of the matrix trace norms of all possible tensor flattenings, and hence it can discover all the possible low-rank structures. Based on the objective function induced by the GTTN, the combination coefficients can be learned with several strategies. Experiments on real-world datasets demonstrate the effectiveness of the proposed GTTN.
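
A minimal numpy sketch of the quantity involved (illustrative only; the paper learns the combination coefficients rather than fixing them as below): the nuclear norm of every flattening of a parameter tensor, combined convexly.

import itertools
import numpy as np

def flattening_trace_norms(W):
    """Nuclear norm of every flattening of tensor W, obtained by mapping a
    non-empty proper subset of modes to rows and the remaining modes to columns."""
    modes = range(W.ndim)
    norms = {}
    for r in range(1, W.ndim):
        for rows in itertools.combinations(modes, r):
            cols = tuple(m for m in modes if m not in rows)
            mat = np.transpose(W, rows + cols).reshape(
                int(np.prod([W.shape[m] for m in rows])), -1)
            norms[rows] = np.linalg.norm(mat, ord="nuc")
    return norms

W = np.random.randn(4, 5, 6)                      # e.g. a stacked multi-task weight tensor
norms = flattening_trace_norms(W)
alphas = {k: 1 / len(norms) for k in norms}       # fixed convex weights, for illustration
gttn = sum(alphas[k] * norms[k] for k in norms)
print(gttn)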

Initialization Matters: Regularizing Manifold-informed Initialization for Neural Recommendation Systems

Proper initialization is crucial to the optimization and the generalization of neural networks. However, most existing neural recommendation systems initialize the user and item embeddings randomly. In this work, we propose a new initialization scheme for user and item embeddings called Laplacian Eigenmaps with Popularity-based Regularization for Isolated Data (LEPORID). LEPORID endows the embeddings with information regarding multi-scale neighborhood structures on the data manifold and performs adaptive regularization to compensate for high embedding variance on the tail of the data distribution. Exploiting matrix sparsity, LEPORID embeddings can be computed efficiently. We evaluate LEPORID in a wide range of neural recommendation models. In contrast to the recent surprising finding that the simple K-nearest-neighbor (KNN) method often outperforms neural recommendation systems, we show that existing neural systems initialized with LEPORID often perform on par or better than KNN. To maximize the effects of the initialization, we propose the Dual-Loss Residual Recommendation (DLR^2) network, which, when initialized with LEPORID, substantially outperforms both traditional and state-of-the-art neural recommender systems.
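
A minimal sketch of the Laplacian-eigenmaps part of such an initialization, using standard scipy routines (the popularity-based regularization that LEPORID adds for tail and isolated nodes is omitted, and the names below are illustrative):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_eigenmap_init(adj, dim):
    """Embed nodes with eigenvectors of the normalized Laplacian having the
    smallest non-trivial eigenvalues (adj: sparse symmetric interaction graph)."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = sp.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    vals, vecs = eigsh(lap, k=dim + 1, which="SA")   # smallest eigenpairs
    return vecs[:, 1:]                               # drop the trivial eigenvector

adj = sp.random(200, 200, density=0.05, format="csr")
adj = adj + adj.T                     # symmetrize so the Laplacian is well defined
emb = laplacian_eigenmap_init(adj, dim=16)
print(emb.shape)                      # (200, 16)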

H2MN: Graph Similarity Learning with Hierarchical Hypergraph Matching Networks

Graph similarity learning, which measures the similarities between a pair of graph-structured objects, lies at the core of various machine learning tasks such as graph classification, similarity search, etc. In this paper, we devise a novel graph neural network based framework to address this challenging problem, motivated by its great success in graph representation learning. As the vast majority of existing graph neural network models mainly concentrate on learning effective node or graph level representations of a single graph, little effort has been made to jointly reason over a pair of graph-structured inputs for graph similarity learning. To this end, we propose Hierarchical Hypergraph Matching Networks (H2MN) to calculate the similarities between graph pairs with arbitrary structure. Specifically, our proposed H2MN learns graph representation from the perspective of hypergraph, and takes each hyperedge as a subgraph to perform subgraph matching, which could capture the rich substructure similarities across the graph. To enable hierarchical graph representation and fast similarity computation, we further propose a hyperedge pooling operator to transform each graph into a coarse graph of reduced size. Then, a multi-perspective cross-graph matching layer is employed on the coarsened graph pairs to extract the inter-graph similarity. Comprehensive experiments on five public datasets empirically demonstrate that our proposed model can outperform state-of-the-art baselines with different gains for graph-graph classification and regression tasks.

DHS: Adaptive Memory Layout Organization of Sketch Slots for Fast and Accurate Data Stream Processing

Data stream processing is a crucial computation task in data mining applications. The rigid and fixed data structures in existing solutions limit their accuracy, throughput, and generality in measurement tasks. We propose Dynamic Hierarchical Sketch (DHS), a sketch-based hybrid solution targeting these properties. During the online stream processing, DHS hashes items to buckets and organizes cells in each bucket dynamically; the size of all cells in a bucket is adjusted adaptively to the actual size and distribution of flows. Thus, memory is efficiently used to precisely record elephant flows and cover more mice flows. Implementation and evaluation show that DHS achieves high accuracy, high throughput, and high generality on five measurement tasks: flow size estimation, flow size distribution estimation, heavy hitter detection, heavy changer detection, and entropy estimation.

Fairness-Aware Online Meta-learning

In contrast to offline learning, two research paradigms have been devised for online learning: (1) Online Meta-Learning (OML) [6, 20, 26] learns good priors over model parameters (or learning to learn) in a sequential setting where tasks are revealed one after another. Although it provides a sub-linear regret bound, such techniques completely ignore the importance of learning with fairness, which is a significant hallmark of human intelligence. (2) Online Fairness-Aware Learning [1, 8, 21] captures many classification problems for which fairness is a concern, but it aims to attain zero-shot generalization without any task-specific adaptation, which limits the capability of a model to adapt to newly arrived data. To overcome these issues and bridge the gap, in this paper we propose, for the first time, a novel online meta-learning algorithm, FFML, under the setting of unfairness prevention. The key part of FFML is to learn good priors of an online fair classification model's primal and dual parameters, which are associated with the model's accuracy and fairness, respectively. The problem is formulated as a bi-level convex-concave optimization. The theoretical analysis provides sub-linear upper bounds of O(log T) for the loss regret and O(√(log T)) for the violation of cumulative fairness constraints. Our experiments demonstrate the versatility of FFML by applying it to classification on three real-world datasets and show substantial improvements over the best prior work on the trade-off between fairness and classification accuracy.

Temporal Biased Streaming Submodular Optimization

Submodular optimization lies at the core of many data mining and machine learning applications such as data summarization and subset selection. For data streams where elements arrive one at a time, streaming submodular optimization (SSO) algorithms are desired. Existing SSO solutions are mainly designed for insertion-only streams, where all elements in the stream participate in the analysis, or sliding-window streams, where only the most recent data participates in the analysis. SSO for insertion-only streams does not sufficiently emphasize recent data, while SSO for sliding-window streams abruptly forgets all past data. In this work, we propose a new SSO problem, temporal biased streaming submodular optimization (TBSSO), which embraces the special settings of all previous studies. TBSSO leverages a temporal bias function to force each element in the stream to participate in the analysis with a probability decreasing over time, so elements in the stream are forgotten gradually. We design novel streaming algorithms to solve the TBSSO problem with provable approximation guarantees. Experiments show that our algorithm finds high-quality solutions and runs about one order of magnitude faster than the baseline method.

Cluster-Reduce: Compressing Sketches for Distributed Data Streams

Sketches, a class of probabilistic algorithms, have been widely accepted as approximate summaries of data streams. In distributed data streams, compressing sketches is the best choice for reducing communication overhead. An ideal compression algorithm should meet the following three requirements: high efficiency of the compression procedure, support for direct queries without decompression, and high accuracy of the compressed sketches. However, no prior work meets these requirements at the same time; in particular, accuracy is poor after compression with existing methods. In this paper, we propose Cluster-Reduce, a framework for compressing sketches that can meet all three requirements. Our key technique, nearness clustering, rearranges adjacent counters with similar values in the sketch to significantly improve the accuracy. We use Cluster-Reduce to compress four kinds of sketches in two use cases: distributed data streams and distributed machine learning. Extensive experimental results show that Cluster-Reduce can achieve up to 60 times smaller error than prior works. The source code of Cluster-Reduce is available anonymously on GitHub [1].

Multi-graph Multi-label Learning with Dual-granularity Labeling

Graphs are a powerful and versatile data structure that easily captures real-life relationships. Multi-graph Multi-label learning (MGML) is a supervised learning task that aims to learn a multi-label classifier to label a set of objects of interest (e.g., images or text) with a bag-of-graphs representation. However, prior MGML techniques are developed by transforming graphs into instances, which does not fully utilize the structural information during learning, and they focus on learning unseen labels only at the bag level. No existing work studies how to label the graphs within a bag, which is important in many applications such as image or text annotation. To bridge this gap, in this paper we present a novel coarse- and fine-grained Multi-graph Multi-label (cfMGML) learning framework which directly builds the learning model over the graphs and enables label prediction at both the coarse (i.e., bag) level and the fine-grained (i.e., graph-in-each-bag) level. In particular, given a set of labeled multi-graph bags, we design scoring functions at both the graph and bag levels to model the relevance between the label and the data using specific graph kernels. Meanwhile, we propose a thresholding rank-loss objective function to rank the labels for the graphs and bags and minimize the Hamming loss simultaneously in one step, which addresses the error accumulation issue of traditional rank-loss algorithms. To tackle the non-convex optimization problem, we further develop an effective sub-gradient descent algorithm to handle the high-dimensional computation required in cfMGML. Experiments over various real-world datasets demonstrate that cfMGML outperforms the state-of-the-art algorithms.

Multi-view Denoising Graph Auto-Encoders on Heterogeneous Information Networks for Cold-start Recommendation

Cold-start recommendation is a challenging problem due to the lack of user-item interactions. Recently, heterogeneous information network (HIN)-based recommendation methods have used rich auxiliary information to enhance the connections between users and items, helping to alleviate the cold-start problem. Despite this progress, most existing methods model HINs under traditional supervised learning settings, ignoring the gaps between training and inference procedures in cold-start scenarios. In this paper, we regard cold-start recommendation as a missing data problem in which some user-item interaction data are missing. Inspired by denoising auto-encoders, which train a model to reconstruct the input from its corrupted version, we propose a novel model called Multi-view Denoising Graph Auto-Encoders (MvDGAE) on HINs. Specifically, we first extract multifaceted meaningful semantics on HINs as multi-views for both users and items, effectively enhancing user/item relationships from different aspects. We then conduct the training procedure by randomly dropping out some user-item interactions in the encoder while forcing the decoder to use these limited views to recover the full views, including the missing ones. In this way, the complementary representations for both users and items are more informative and robust in cold-start scenarios. Moreover, the decoder's reconstruction goals are multi-view user-user and item-item relationship graphs rather than the original input graphs, which brings the features of similar users (or items) on the meta-paths closer together. Finally, we adopt a Bayesian task weight learner to balance the multi-view graph reconstruction objectives automatically. Extensive experiments on both public benchmark datasets and a large-scale industry dataset, WeChat Channel, demonstrate that MvDGAE significantly outperforms state-of-the-art recommendation models in various cold-start scenarios. Case studies also illustrate that MvDGAE has potentially good interpretability.

Accelerating Set Intersections over Graphs by Reducing-Merging

Given two sets of vertices Sa and Sb of a graph, computing their common vertices, namely set intersection, is a primitive operation in many graph algorithms such as triangle counting, maximal clique enumeration, and subgraph matching. Thus, accelerating set intersections benefits these algorithms. In this paper, we propose a novel reducing-merging framework for set intersections over graphs rather than intersecting the two sets directly. In the reducing phase, the vertices that cannot fall into the intersection are screened out by applying range reduction. Based on the truncated subsets, the intersection can then be easily obtained using the classic merging algorithm. To optimize the range codes that sketch the vertices, we formulate the problem of range code optimization and prove its NP-hardness. We develop efficient yet effective algorithms for two typical scenarios: global intersection and local intersection. Moreover, we present a novel two-level merging algorithm to further enhance performance. The results of extensive experiments over real graphs show that our approach achieves significant speedups compared to the merge-based algorithm.
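
A minimal sketch of the reduce-then-merge idea (the paper's range codes and two-level merging are considerably more refined; here each sorted vertex list is simply truncated to the overlap of the two value ranges before the classic merge):

from bisect import bisect_left, bisect_right

def intersect(sa, sb):
    """Intersect two sorted vertex lists: first reduce each list to the shared
    value range, then merge the truncated lists."""
    lo, hi = max(sa[0], sb[0]), min(sa[-1], sb[-1])
    if lo > hi:
        return []
    a = sa[bisect_left(sa, lo):bisect_right(sa, hi)]   # reducing phase
    b = sb[bisect_left(sb, lo):bisect_right(sb, hi)]
    out, i, j = [], 0, 0                               # merging phase
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([1, 4, 7, 9, 15, 30], [6, 7, 9, 12, 25]))   # [7, 9]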

Knowledge is Power: Hierarchical-Knowledge Embedded Meta-Learning for Visual Reasoning in Artistic Domains

This paper deals with the challenging problem of building visual reasoning models for answering questions related to artworks in artistic domains. The nature of the abstract styles and cultural contexts within an artistic image makes the corresponding learning tasks extremely difficult. We propose a novel framework termed Hierarchical-Knowledge Embedded Meta-Learning to address the critical issues of visual reasoning in artistic domains. In particular, we first present a deep relational model to capture and memorize the relations among different samples. Then, we provide a hierarchical-knowledge embedding that mines the implicit relationships between question-answer pairs for knowledge representation, which guides our meta-learner. This is a case of "knowledge is power" in the sense that the hierarchical knowledge representation is incorporated into our meta-learning based model. The final classification is derived from our model by learning to compare the features of samples. Experimental results show that our approach achieves significantly higher performance compared with other state-of-the-art methods.

Quantifying Assimilate-Contrast Effects in Online Rating Systems: Modeling, Analysis and Application

Online rating systems serve as an indispensable building block for many web applications such as Amazon, TripAdvisor, and Yelp. They enable product quality estimation via aggregate ratings (a.k.a. the wisdom of the crowd) as well as product recommendation via inferring user preferences from ratings. Previous studies showed that, due to assimilate-contrast effects, historical ratings can significantly distort a user's ratings, leading to low accuracy of product quality estimation and recommendation. To understand assimilate-contrast effects, an "accurate" model is still missing, as previous models do not capture important factors such as rating recency and selection bias. Furthermore, an analytical framework to characterize product estimation accuracy under assimilate-contrast effects is also missing. This paper aims to fill this gap. We propose a mathematical model to quantify the aforementioned important factors in assimilate-contrast effects. Our model attains a good balance between model complexity and model accuracy, such that it is neat enough for us to develop an analytical framework to study assimilate-contrast effects. Based on our model, we derive sufficient conditions under which the product estimate and collective opinion converge to the "ground truth". These conditions reveal important insights into how the aforementioned factors influence the convergence and guide the online rating system operator in designing appropriate rating aggregation rules and rating display strategies. To demonstrate the versatility of our model, we apply it to rating prediction and product recommendation tasks. Experimental results on four public datasets show that our model can significantly improve rating prediction and recommendation accuracy over previous models.

Triplet Attention: Rethinking the Similarity in Transformers

The Transformer model has benefited various real-world applications, where the self-attention mechanism with dot-products shows a superior ability to model long-range dependencies. However, the pair-wise nature of self-attention limits further performance improvement on challenging tasks. To the best of our knowledge, this is the first work to define Triplet Attention (A3) for the Transformer, which introduces triplet connections as complementary dependencies. Specifically, we define the triplet attention based on the scalar triplet product, which can be used interchangeably with the canonical attention within multi-head attention. It allows the self-attention mechanism to attend to diverse triplets and capture complex dependencies. Then, we utilize a permuted formulation and kernel tricks to establish a linear approximation to A3. The proposed architecture can be smoothly integrated into pre-training by modifying head configurations. Extensive experiments show that our methods achieve significant performance improvements on various tasks and two benchmarks.

Table2Charts: Recommending Charts by Learning Shared Table Representations

It is common for people to create different types of charts to explore a multi-dimensional dataset (table). However, to recommend commonly composed charts in the real world, one must take the challenges of efficiency, imbalanced data, and table context into consideration. In this paper, we propose the Table2Charts framework, which learns common patterns from a large corpus of (table, charts) pairs. Based on deep Q-learning with a copying mechanism and heuristic searching, Table2Charts performs table-to-sequence generation, where each sequence follows a chart template. On a large spreadsheet corpus with 165k tables and 266k charts, we show that Table2Charts can learn a shared representation of table fields so that recommendation tasks on different chart types mutually enhance each other. Table2Charts outperforms other chart recommendation systems in both the multi-type task (with doubled recall numbers, R@3 = 0.61 and R@1 = 0.43) and human evaluations.

Maximizing Influence of Leaders in Social Networks

The operation of adding edges has been frequently used in the study of opinion dynamics in social networks for various purposes. In this paper, we consider the edge addition problem for the DeGroot model of opinion dynamics in a social network with n nodes and m edges, in the presence of a small number s << n of competing leaders with binary opposing opinions 0 or 1. Concretely, we pose and investigate the problem of maximizing the equilibrium overall opinion by creating k new edges in a candidate edge set, where each edge is incident to a 1-valued leader and a follower node. We show that the objective function is monotone and submodular. We then propose a simple greedy algorithm with an approximation factor of (1 - 1/e) that approximately solves the problem in O(n^3) time. Moreover, we provide a fast algorithm with a (1 - 1/e - ε) approximation ratio and Õ(mkε^(-2)) time complexity for any ε > 0, where the Õ(⋅) notation suppresses poly(log n) factors. Extensive experiments demonstrate that our second approximate algorithm is efficient and effective, and scales to large networks with more than a million nodes.
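
A minimal sketch of the generic greedy scheme behind the (1 - 1/e) guarantee for monotone submodular maximization (the paper's contribution lies in evaluating the marginal gains of the opinion-dynamics objective efficiently; here f is an arbitrary monotone submodular set function, and the toy objective is illustrative):

def greedy_maximize(f, candidates, k):
    """Pick k elements from candidates, each time adding the one with the
    largest marginal gain of the monotone submodular objective f."""
    selected = set()
    for _ in range(k):
        best, best_gain = None, float("-inf")
        base = f(selected)
        for e in candidates - selected:
            gain = f(selected | {e}) - base
            if gain > best_gain:
                best, best_gain = e, gain
        selected.add(best)
    return selected

# toy coverage objective: value of a set of edges = number of nodes they touch
edges = {(1, 2), (2, 3), (3, 4), (1, 4), (5, 6)}
f = lambda s: len({v for e in s for v in e})
print(greedy_maximize(f, edges, k=2))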

PURE: Positive-Unlabeled Recommendation with Generative Adversarial Network

Recommender systems are powerful tools for information filtering with the ever-growing amount of online data. Despite their success and wide adoption in various web applications and personalized products, many existing recommender systems still suffer from multiple drawbacks such as a large amount of unobserved feedback, poor model convergence, etc. These drawbacks of existing work are mainly due to the following two reasons: first, the widely used negative sampling strategy, which treats the unlabeled entries as negative samples, is invalid in real-world settings; second, all training samples are retrieved from the discrete observations, and the underlying true distribution of the users and items is not learned.

In this paper, we address these issues by developing a novel framework named PURE, which trains an unbiased positive-unlabeled discriminator to distinguish the true relevant user-item pairs against the ones that are non-relevant, and a generator that learns the underlying user-item continuous distribution. For a comprehensive comparison, we considered 14 popular baselines from 5 different categories of recommendation approaches. Extensive experiments on two public real-world data sets demonstrate that PURE achieves the best performance in terms of 8 ranking based evaluation metrics.

Modeling Context-aware Features for Cognitive Diagnosis in Student Learning

Contexts and cultures have a direct impact on student learning by affecting students' implicit cognitive states, such as their preference for and proficiency in specific knowledge. Motivated by the success of context-aware modeling in various fields, such as recommender systems, in this paper we propose to study how to model context-aware features and adapt them for more precisely diagnosing students' knowledge proficiency. Specifically, by analyzing the characteristics of educational contexts, we design a two-stage framework, ECD (Educational context-aware Cognitive Diagnosis), in which a hierarchical attentive network is first proposed to represent the impact of context on students, and adaptive optimization is then used to enhance diagnosis by aggregating the cognitive states reflected in both educational contexts and students' historical learning records. Moreover, we give three implementations of the general ECD framework following typical cognitive diagnosis solutions. Finally, we conduct extensive experiments on nearly 52 million records of students sampled by PISA (Programme for International Student Assessment) from 73 countries and regions. The experimental results not only prove that ECD is more effective in student performance prediction, since it captures well the impact of educational contexts on students' cognitive states, but also yield some interesting findings regarding the differences among educational contexts in different countries and regions.

S-LIME: Stabilized-LIME for Model Explanation

An increasing number of machine learning models have been deployed in high-stakes domains such as finance and healthcare. Despite their superior performance, many models are black boxes that are hard to explain. There are growing efforts to develop methods for interpreting these black-box models. Post hoc explanations based on perturbations, such as LIME [39], are widely used approaches for interpreting a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on the central limit theorem to determine the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real-world data sets are provided to demonstrate the effectiveness of our method.
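As a rough illustration of how a CLT-based stopping rule can decide when enough perturbations have been drawn, the sketch below keeps sampling until the normal-approximation confidence interval on the gap between two competing feature importances excludes zero. This is only a schematic of the idea, not the S-LIME procedure itself; `draw_gap` is a hypothetical sampler standing in for LIME's perturbation-and-fit step.

```python
import numpy as np
from scipy import stats

def stable_sample_size(draw_gap, alpha=0.05, batch=200, max_n=20000):
    """Keep drawing perturbation batches until a CLT-based confidence interval
    on the mean importance gap between two competing features excludes zero,
    so the selected feature ordering is unlikely to flip on re-runs."""
    gaps = np.array([])
    while gaps.size < max_n:
        gaps = np.concatenate([gaps, draw_gap(batch)])
        mean, se = gaps.mean(), gaps.std(ddof=1) / np.sqrt(gaps.size)
        z = stats.norm.ppf(1 - alpha / 2)
        if abs(mean) > z * se:   # interval excludes zero -> ranking is stable
            return gaps.size
    return gaps.size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical gap distribution: a small true advantage buried in noise.
    draw = lambda n: rng.normal(loc=0.02, scale=0.3, size=n)
    print("perturbations needed:", stable_sample_size(draw))
```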

Popularity Bias in Dynamic Recommendation

Popularity bias is a long-standing challenge in recommender systems: popular items are over-recommended, while less popular items that users may be interested in are under-recommended. Such a bias has a detrimental impact on both users and item providers, and many efforts have been dedicated to studying and mitigating it. However, most existing works situate the popularity bias in a static setting, where the bias is analyzed only for a single round of recommendation with logged data. These works fail to take into account the dynamic nature of the real-world recommendation process, leaving several important research questions unanswered: how does the popularity bias evolve in a dynamic scenario? What are the impacts of unique factors in a dynamic recommendation process on the bias? And how can we debias in this long-term dynamic process? In this work, we investigate popularity bias in dynamic recommendation and aim to tackle these research gaps. Concretely, we conduct an empirical study with simulation experiments to analyze popularity bias in the dynamic scenario, and we propose a dynamic debiasing strategy and a novel False Positive Correction method that utilizes false positive signals to debias, both of which show effective performance in extensive experiments.

Controllable Generation from Pre-trained Language Models via Inverse Prompting

Large-scale pre-trained language models have demonstrated strong capabilities of generating realistic text. However, it remains challenging to control the generation results. Previous approaches such as prompting are far from sufficient, and the lack of controllability limits the usage of language models. To tackle this challenge, we propose an innovative method, inverse prompting, to better control text generation. The core idea of inverse prompting is to use generated text to inversely predict the prompt during beam search, which enhances the relevance between the prompt and the generated text and thus improves controllability. Empirically, we pre-train a large-scale Chinese language model to perform a systematic study using human evaluation on the tasks of open-domain poem generation and open-domain long-form question answering. Results demonstrate that our proposed method substantially outperforms the baselines and that our generation quality is close to human performance on some of the tasks.
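The core re-ranking idea can be sketched as follows: score each beam candidate not only by its forward likelihood given the prompt but also by how well the candidate predicts the prompt back. The `lm.log_likelihood(context, target)` interface below is hypothetical and merely stands in for whichever scoring call a concrete language model exposes; this is a schematic, not the paper's implementation.

```python
def inverse_prompting_score(lm, prompt, candidate, lam=1.0):
    """Combine the usual forward likelihood with an 'inverse' term that asks
    how well the generated candidate predicts the original prompt back.
    `lm.log_likelihood(context, target)` is a hypothetical scoring interface."""
    forward = lm.log_likelihood(context=prompt, target=candidate)
    inverse = lm.log_likelihood(context=candidate, target=prompt)
    return forward + lam * inverse

def rerank_beam(lm, prompt, beam_candidates, lam=1.0):
    """Re-rank beam-search candidates by the combined score, best first."""
    return sorted(beam_candidates,
                  key=lambda c: inverse_prompting_score(lm, prompt, c, lam),
                  reverse=True)
```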

TDGIA: Effective Injection Attacks on Graph Neural Networks

Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. However, recent studies have shown that GNNs are vulnerable to adversarial attacks. In this paper, we study a recently introduced, realistic attack scenario on graphs---graph injection attack (GIA). In the GIA scenario, the adversary is not able to modify the existing link structure or node attributes of the input graph; instead, the attack is performed by injecting adversarial nodes into it. We present an analysis of the topological vulnerability of GNNs under the GIA setting, based on which we propose the Topological Defective Graph Injection Attack (TDGIA) for effective injection attacks. TDGIA first introduces a topological defective edge selection strategy to choose the original nodes to connect with the injected ones. It then designs a smooth feature optimization objective to generate the features of the injected nodes. Extensive experiments on large-scale datasets show that TDGIA consistently and significantly outperforms various attack baselines in attacking dozens of defense GNN models. Notably, the performance drop on target GNNs resulting from TDGIA is more than double the damage brought by the best attack solution among hundreds of submissions to KDD-CUP 2020.

SESSION: ADS Track Papers

Practical Approach to Asynchronous Multivariate Time Series Anomaly Detection and Localization

Engineers at eBay use robust methods to monitor IT system signals for anomalies. However, the growing scale of signals, in both volume and dimensionality, overwhelms traditional statistical state-space or supervised learning tools. Thus, state-of-the-art methods based on unsupervised deep learning have been sought in recent research. However, we encountered flaws when implementing those methods, such as requiring partial supervision and weaknesses on high-dimensional datasets, among other issues discussed in this paper. We propose a practical approach for inferring anomalies from large multivariate sets. We observe that many time series in real-world applications, such as IT, weather, utility, and transportation, exhibit asynchronous yet consistent repetitive variations. Our solution is designed to leverage this behavior. It applies spectral analysis to the latent representation of a pre-trained autoencoder to extract dominant frequencies across the signals, which are then used in a subsequent network that learns the phase shifts across the signals and produces a synchronized representation of the raw multivariate series. Random subsets of the synchronous multivariate series are then fed into an array of autoencoders trained to minimize quantile reconstruction losses, which are then used to infer and localize anomalies based on a majority vote. We benchmark this method against state-of-the-art approaches on public datasets and eBay's data using their referenced evaluation methods. Furthermore, we address the limitations of the referenced evaluation methods and propose a more realistic evaluation method.
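To make the frequency-and-phase idea concrete, the sketch below uses a plain FFT to estimate a signal's dominant frequency and phase and to roll one series into alignment with another. It is an illustrative stand-in, assuming detrended inputs and a single shared dominant frequency; the deployed system performs this on the latent representation of a pre-trained autoencoder and learns the phase shifts with a network.

```python
import numpy as np

def dominant_frequency_and_phase(signal, sample_rate=1.0):
    """Estimate the dominant frequency of a (roughly detrended) signal via
    the FFT and return its phase, which can be used to align series."""
    x = signal - signal.mean()
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate)
    k = np.argmax(np.abs(spectrum[1:])) + 1      # skip the DC bin
    return freqs[k], np.angle(spectrum[k])

def align_by_phase(reference, other, sample_rate=1.0):
    """Shift `other` so its dominant-frequency phase matches `reference`,
    assuming both series share the same dominant frequency."""
    f_ref, p_ref = dominant_frequency_and_phase(reference, sample_rate)
    _, p_oth = dominant_frequency_and_phase(other, sample_rate)
    # Convert the phase difference into a shift in samples at that frequency.
    shift = int(round(((p_oth - p_ref) / (2 * np.pi * f_ref)) * sample_rate))
    return np.roll(other, -shift)
```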

Counterfactual Graphs for Explainable Classification of Brain Networks

Training graph classifiers able to distinguish between healthy brains and dysfunctional ones can help identify substructures associated with specific cognitive phenotypes. However, the mere predictive power of the graph classifier is of limited interest to neuroscientists, who have plenty of tools for diagnosing specific mental disorders. What matters is the interpretation of the model, as it can provide novel insights and new hypotheses. In this paper we propose counterfactual graphs as a way to produce local post-hoc explanations of any black-box graph classifier. Given a graph and a black-box, a counterfactual is a graph which, while having high structural similarity with the original graph, is classified by the black-box in a different class. We propose and empirically compare several strategies for counterfactual graph search. Our experiments against a white-box classifier with known optimal counterfactuals show that our methods, although heuristic, can produce counterfactuals very close to the optimal ones. Finally, we show how to use counterfactual graphs to build global explanations that correctly capture the behaviour of different black-box classifiers and provide interesting insights for neuroscientists.

All Models Are Useful: Bayesian Ensembling for Robust High Resolution COVID-19 Forecasting

Timely, high-resolution forecasts of infectious disease incidence are useful for policy makers in deciding intervention measures and estimating healthcare resource burden. In this paper, we consider the task of forecasting COVID-19 confirmed cases at the county level for the United States. Although multiple methods have been explored for this task, their performance has varied across space and time due to noisy data and the inherently dynamic nature of the pandemic. We present a forecasting pipeline which incorporates probabilistic forecasts from multiple statistical, machine learning and mechanistic methods through a Bayesian ensembling scheme, and which has been operational for nearly 6 months serving local, state and federal policymakers in the United States. While showing that the Bayesian ensemble is at least as good as the individual methods, we also show that each individual method contributes significantly for different spatial regions and time points. We compare our model's performance with other similar models being integrated into the CDC-initiated COVID-19 Forecast Hub, and show better performance at longer forecast horizons. Finally, we describe how such forecasts are used to increase lead time for training mechanistic scenario projections. Our work demonstrates that such a real-time, high-resolution forecasting pipeline can be developed by integrating multiple methods within a performance-based ensemble to support pandemic response.
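A minimal sketch of performance-weighted forecast combination is shown below: methods with lower recent error receive higher weight via a softmax. This is only an illustrative stand-in for a full Bayesian ensembling scheme, with hypothetical error inputs, not the deployed pipeline.

```python
import numpy as np

def performance_weighted_ensemble(forecasts, past_errors, temperature=1.0):
    """Combine per-method forecasts with weights derived from recent errors
    (lower error -> higher weight).

    forecasts:   array of shape (n_methods, horizon)
    past_errors: array of shape (n_methods,), e.g. recent MAE per method
    Returns the weights and the combined forecast."""
    forecasts = np.asarray(forecasts, dtype=float)
    scores = -np.asarray(past_errors, dtype=float) / temperature
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights, weights @ forecasts

if __name__ == "__main__":
    methods = np.array([[100, 110, 120], [90, 105, 125], [130, 140, 150]])
    w, combined = performance_weighted_ensemble(methods, past_errors=[5.0, 3.0, 12.0])
    print(w, combined)
```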

Dynamic Language Models for Continuously Evolving Content

The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of existing conversation topics gradually shift. In recent years, pre-trained language models like BERT have greatly improved the state of the art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 - 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets heavily deteriorates when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we explore two different vocabulary composition methods and propose three sampling methods that enable efficient incremental training of BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction as well as on the OffensEval 2019 task.

Quantifying and Addressing Ranking Disparity in Human-Powered Data Acquisition

Algorithmic bias has been identified as a key challenge in many AI applications. One major source of bias is the data used to build these applications. For instance, many AI applications rely on human users to generate training data. The generated data might be biased if the data acquisition process is skewed towards certain groups of people based on, say, gender, ethnicity or location. This typically happens as a result of a hidden association between people's qualifications for data acquisition and their protected attributes. In this paper, we study how to unveil and address disparity in data acquisition. We focus on the case where the data acquisition process involves ranking people, and we define disparity as the unbalanced targeting of people by the data acquisition process. To quantify disparity, we formulate an optimization problem that partitions people by their protected attributes, computes the qualifications of people in each partition, and finds the partitioning that exhibits the highest disparity in qualifications. Due to the combinatorial nature of our problem, we devise heuristics to navigate the space of partitions. We also discuss how to address disparity between partitions. We conduct a series of experiments on real and simulated datasets that demonstrate that our proposed approach is successful in quantifying and addressing ranking disparity in human-powered data acquisition.

On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition

Many recent developments on generative models for natural images have relied on heuristically-motivated metrics that can be easily gamed by memorizing a small sample from the true distribution or training a model directly to improve the metric. In this work, we critically evaluate the gameability of these metrics by designing and deploying a generative modeling competition. Our competition received over 11000 submitted models. The competitiveness between participants allowed us to investigate both intentional and unintentional memorization in generative modeling. To detect intentional memorization, we propose the "Memorization-Informed Frechet Inception Distance" (MiFID) as a new memorization-aware metric and design benchmark procedures to ensure that winning submissions made genuine improvements in perceptual quality. Furthermore, we manually inspect the code for the 1000 top-performing models to understand and label different forms of memorization. Our analysis reveals that unintentional memorization is a serious and common issue in popular generative models. The generated images and our memorization labels of those models as well as code to compute MiFID are released to facilitate future studies on benchmarking generative models.
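The sketch below illustrates one way a memorization-aware penalty can be attached to a distance-based metric: penalize generated samples whose nearest training neighbor is suspiciously close in feature space. The threshold `tau` and the specific penalty form are assumptions for illustration only and do not reproduce the exact MiFID definition; feature extraction (e.g., an Inception embedding) is assumed to happen upstream.

```python
import numpy as np

def memorization_penalty(generated_feats, train_feats, tau=0.1):
    """Illustrative memorization term: compute each generated sample's cosine
    distance to its nearest training sample, and return a multiplier that
    penalizes suspiciously small average minimum distances."""
    g = generated_feats / np.linalg.norm(generated_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    cos_dist = 1.0 - g @ t.T                 # shape (n_generated, n_train)
    nearest = cos_dist.min(axis=1)           # distance to closest train sample
    avg_min = nearest.mean()
    # Smaller average minimum distance -> larger penalty on the base metric.
    return 1.0 / avg_min if avg_min < tau else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(size=(100, 8))
    copied = train[:10] + 1e-4 * rng.normal(size=(10, 8))   # near-duplicates
    print(memorization_penalty(copied, train))               # large penalty
```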

Auto-Split: A General Framework of Collaborative Edge-Cloud AI

In many industry-scale applications, large and resource-consuming machine learning models reside in powerful cloud servers. At the same time, large amounts of input data are collected at the edge of the cloud. The inference results are also communicated to users or passed to downstream tasks at the edge. The edge often consists of a large number of low-power devices. It is a big challenge to design industry products that support sophisticated deep model deployment and conduct model inference efficiently, so that model accuracy remains high and end-to-end latency is kept low. This paper describes the techniques and engineering practice behind Auto-Split, an edge-cloud collaborative prototype of Huawei Cloud. This patented technology has already been validated on selected applications, is on its way to broader systematic edge-cloud application integration, and is being made available for public use as an automated pipeline service for end-to-end cloud-edge collaborative intelligence deployment. To the best of our knowledge, there is no existing industry product that provides the capability of Deep Neural Network (DNN) splitting.

Unpaired Generative Molecule-to-Molecule Translation for Lead Optimization

Molecular lead optimization is an important task of drug discovery focusing on generating novel molecules similar to a drug candidate but with enhanced properties. Prior works focused on supervised models requiring datasets of pairs of a molecule and an enhanced molecule. These approaches require large amounts of data and are limited by the bias of the specific examples of enhanced molecules. In this work, we present an unsupervised generative approach with a molecule-embedding component that maps a discrete representation of a molecule to a continuous space. The components are then coupled with a unique training architecture leveraging molecule fingerprints and applying double cycle constraints to enable both chemical resemblance to the original molecular lead while generating novel molecules with enhanced properties. We evaluate our method on multiple common molecular optimization tasks, including dopamine receptor (DRD2) and drug likeness (QED), and show our method outperforms previous state-of-the-art baselines. Moreover, we conduct thorough ablation experiments to show the effect and necessity of important components in our model. Furthermore, we demonstrate our method's ability to generate FDA-approved drugs it has never encountered before, such as Perazine and Clozapine, which are used to treat psychotic disorders, like Schizophrenia. The system is currently being deployed for use in the Targeted Drug Delivery and Personalized Medicine laboratories generating treatments using nanoparticle-based technology.

TimeSHAP: Explaining Recurrent Models through Sequence Perturbations

Although recurrent neural networks (RNNs) are state-of-the-art in numerous sequential decision-making tasks, there has been little research on explaining their predictions. In this work, we present TimeSHAP, a model-agnostic recurrent explainer that builds upon KernelSHAP and extends it to the sequential domain. TimeSHAP computes feature-, timestep-, and cell-level attributions. As sequences may be arbitrarily long, we further propose a pruning method that is shown to dramatically decrease both its computational cost and the variance of its attributions. We use TimeSHAP to explain the predictions of a real-world bank account takeover fraud detection RNN model, and draw key insights from its explanations: i) the model identifies important features and events aligned with what fraud analysts consider cues for account takeover; ii) positively predicted sequences can be pruned to only 10% of their original length, as older events have residual attribution values; iii) the most recent input event of positive predictions contributes, on average, only 41% of the model's score; iv) the client's age receives notably high attributions, consistent with the higher false positive rates observed for older clients.

A Framework for Modeling Cyber Attack Techniques from Security Vulnerability Descriptions

Attack graphs are one of the main techniques used to automate the cybersecurity risk assessment process. In order to derive a relevant attack graph, up-to-date information on known cyber attack techniques should be represented as interaction rules. However, designing and creating new interaction rules is a time-consuming task performed manually by security experts. We present a novel, end-to-end, automated framework for modeling new attack techniques from the textual descriptions of security vulnerabilities. Given a description of a security vulnerability, the proposed framework first extracts the relevant attack entities required to model the attack, completes missing information on the vulnerability, and derives a new interaction rule that models the attack; this new rule is then integrated within the MulVal attack graph tool. The proposed framework implements a novel data science pipeline that includes a dedicated cybersecurity linguistic model trained on the NVD repository, a recurrent neural network model used for attack entity extraction, a logistic regression model used for completing the missing information, and a transition probability matrix for automatically generating new interaction rules. We evaluated the performance of each of the individual algorithms, as well as the complete framework, and demonstrated its effectiveness.

VisRel: Media Search at Scale

In this paper, we present VisRel, a deployed large-scale media search system that leverages text understanding, media understanding, and multimodal technologies to deliver a modern multimedia search experience. We share our insight on developing image and video understanding models for content retrieval, training efficient and effective media-to-query relevance models, and refining online and offline metrics to measure the success of one of the largest media search databases in the industry. We summarize our learnings gathered from hundreds of A/B test experiments and describe the most effective technical approaches. The techniques presented in this work have contributed 34% (abs.) improvement to media-to-query relevance and 10% improvement to user engagement. We believe that this work can provide practical solutions and insights for engineers who are interested in applying media understanding technologies to empower multimedia search systems that operate at Facebook scale.

GEM: Translation-Free Zero-Shot Global Entity Matcher for Global Catalogs

We propose a modular BiLSTM / CNN / Transformer deep-learning encoder architecture, together with a data synthesis and training approach, to solve the problem of matching catalog products across different languages, different local catalogs, and different catalog data contributors. The end-to-end model relies solely on raw natural-language textual data in the catalog entries and on images of the products, without any feature engineering, and is entirely translation-free: it does not require translating the catalog natural-language data into a common base language for inference. We report experimental results on a 4-language model (English, French, German, Spanish) matching entities from 4 local catalogs (UK, France, Germany, Spain) of a retail website. We demonstrate that the model achieves performance comparable to state-of-the-art existing entity matchers that operate within a single language, and that the model achieves high-performance zero-shot inference on language pairs not seen in training.

A Semi-Personalized System for User Cold Start Recommendation on Music Streaming Apps

Music streaming services heavily rely on recommender systems to improve their users' experience, by helping them navigate through a large musical catalog and discover new songs, albums or artists. However, recommending relevant and personalized content to new users, with few to no interactions with the catalog, is challenging. This is commonly referred to as the user cold start problem. In this applied paper, we present the system recently deployed on the music streaming service Deezer to address this problem. The solution leverages a semi-personalized recommendation strategy, based on a deep neural network architecture and on a clustering of users from heterogeneous sources of information. We extensively show the practical impact of this system and its effectiveness at predicting the future musical preferences of cold start users on Deezer, through both offline and online large-scale experiments. Besides, we publicly release our code as well as anonymized usage data from our experiments. We hope that this release of industrial resources will benefit future research on user cold start recommendation.

Generating Mobility Trajectories with Retained Data Utility

This paper presents TrajGen, an approach to generate artificial datasets of mobility trajectories based on an original trajectory dataset while retaining the utility of the original data in supporting various mobility applications. The generated mobility data are disentangled from the original data and can be shared without compromising data privacy. TrajGen leverages Generative Adversarial Nets combined with a Seq2Seq model to generate the spatial-temporal trajectory data. TrajGen is implemented and evaluated with real-world taxi trajectory data in Singapore. The extensive experimental results demonstrate that TrajGen is able to generate artificial trajectory data that retain key statistical characteristics of the original data. Two case studies, i.e., road map updating and Origin-Destination demand estimation, are performed with the generated artificial data, and the results show that the artificial trajectories generated by TrajGen retain the utility of the original data in supporting the two applications.

Interactive Audience Expansion On Large Scale Online Visitor Data

Online marketing platforms often store millions of website visitors' behavior as a large sparse matrix with rows as visitors and columns as behaviors. These platforms allow marketers to conduct Audience Expansion, a technique to identify new audiences with behavior similar to that of the original target audiences. In this paper, we propose a method to achieve interactive Audience Expansion over millions of visitor records efficiently. Unlike other methods that require significant computation for each input, our approach provides interactive responses when a marketer inputs the target audiences and similarity measures. The idea is to apply a data summarization technique to the large visitor matrix to obtain a small set of summaries representing the similarities in the matrix. We propose efficient algorithms to compute the data summaries in a distributed computing environment (i.e., Spark) and conduct the expansion using the summaries. Our experiments show that our approach (1) provides 10 times more accurate and 27 times faster Audience Expansion results on real datasets and (2) achieves a 98% speed-up compared to straightforward data summarization implementations. We also present an interface that applies the algorithm to real-world scenarios.

Supporting COVID-19 Policy Response with Large-scale Mobility-based Modeling

Mobility restrictions have been a primary intervention for controlling the spread of COVID-19, but they also place a significant economic burden on individuals and businesses. To balance these competing demands, policymakers need analytical tools to assess the costs and benefits of different mobility reduction measures. In this paper, we present our work motivated by our interactions with the Virginia Department of Health on a decision-support tool that utilizes large-scale data and epidemiological modeling to quantify the impact of changes in mobility on infection rates. Our model captures the spread of COVID-19 by using a fine-grained, dynamic mobility network that encodes the hourly movements of people from neighborhoods to individual places, with over 3 billion hourly edges. By perturbing the mobility network, we can simulate a wide variety of reopening plans and forecast their impact in terms of new infections and the loss in visits per sector. To deploy this model in practice, we built a robust computational infrastructure to support running millions of model realizations, and we worked with policymakers to develop an interactive dashboard that communicates our model's predictions for thousands of potential policies.

Extreme Multi-label Learning for Semantic Matching in Product Search

We consider the problem of semantic matching in product search: given a customer query, retrieve all semantically related products from a huge catalog of 100 million items or more. Because of large catalog spaces and real-time latency constraints, semantic matching algorithms must achieve not only high recall but also low latency. Conventional lexical matching approaches (e.g., Okapi BM25) exploit inverted indices to achieve fast inference time, but fail to capture behavioral signals between queries and products. In contrast, embedding-based models learn semantic representations from customer behavior data, but their performance is often limited by shallow neural encoders due to latency constraints. Semantic product search can be viewed as an eXtreme Multi-label Classification (XMC) problem, where customer queries are input instances and products are output labels. In this paper, we aim to improve semantic product search by using tree-based XMC models whose inference time complexity is logarithmic in the number of products. We consider hierarchical linear models with n-gram features for fast real-time inference. Quantitatively, our method maintains a low latency of 1.25 milliseconds per query and achieves a 65% improvement of Recall@100 (60.9% vs. 36.8%) over a competing embedding-based DSSM model. Our model is robust to weight pruning with varying thresholds, which can flexibly meet different system requirements for online deployment. Qualitatively, our method can retrieve products that are complementary to the existing product search system and add diversity to the match set.
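Inference in a tree-based XMC model is essentially beam search over a label tree with per-node linear scorers, which is what keeps latency logarithmic in the number of labels. The sketch below assumes a hypothetical `tree` object exposing root, children, is_leaf, and per-node weight vectors; it is a schematic of the search, not the production ranker.

```python
def beam_search_label_tree(query_vec, tree, beam_size=10):
    """Score child nodes with per-node linear rankers, keep the top
    `beam_size` nodes at each level, and return scored leaf labels.
    Inference touches only O(beam_size * depth * branching) nodes
    instead of all labels.

    `tree` is a hypothetical structure with .root, .children(node),
    .is_leaf(node), and .weight(node) (a weight vector per node)."""
    beam = [(0.0, tree.root)]
    leaves = []
    while beam:
        expanded = []
        for score, node in beam:
            for child in tree.children(node):
                s = score + float(query_vec @ tree.weight(child))
                (leaves if tree.is_leaf(child) else expanded).append((s, child))
        beam = sorted(expanded, key=lambda t: t[0], reverse=True)[:beam_size]
    return sorted(leaves, key=lambda t: t[0], reverse=True)
```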

When Homomorphic Encryption Marries Secret Sharing: Secure Large-Scale Sparse Logistic Regression and Applications in Risk Control

Logistic Regression (LR) is the most widely used machine learning model in industry, owing to its efficiency, robustness, and interpretability. Due to the problem of data isolation and the requirement of high model performance, many applications in industry call for building a secure and efficient LR model for multiple parties. Most existing work uses either Homomorphic Encryption (HE) or Secret Sharing (SS) to build secure LR. HE-based methods can deal with high-dimensional sparse features, but they incur potential security risks. SS-based methods have provable security, but they have efficiency issues with high-dimensional sparse features. In this paper, we first present CAESAR, which combines HE and SS to build a secure large-scale sparse logistic regression model that achieves both efficiency and security. We then present a distributed implementation of CAESAR to meet the scalability requirement. We have deployed CAESAR in a risk control task and conducted comprehensive experiments. Our experimental results show that CAESAR improves on the state-of-the-art model by around 130 times.
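To give a flavor of the secret-sharing half of such a hybrid design, the toy sketch below additively shares each term of a dot product between two parties and recombines the shares at the end. It deliberately computes the per-term products in the clear, which a real protocol such as CAESAR would instead handle with HE or multiplication triples; the sketch only illustrates the share-and-recombine flow, not the paper's protocol.

```python
import random

P = 2**61 - 1  # a large prime modulus for additive secret sharing

def share(x):
    """Split an integer into two additive shares modulo P."""
    r = random.randrange(P)
    return r, (x - r) % P

def secure_dot(xs, ws):
    """Toy two-party additive-sharing dot product: each party accumulates one
    share, and the plaintext result appears only after recombination.
    NOTE: the per-term product below is computed in the clear purely for
    illustration; a real protocol would compute it securely."""
    acc0 = acc1 = 0
    for x, w in zip(xs, ws):
        s0, s1 = share(x * w % P)
        acc0 = (acc0 + s0) % P
        acc1 = (acc1 + s1) % P
    return (acc0 + acc1) % P

if __name__ == "__main__":
    print(secure_dot([3, 1, 4], [2, 7, 1]))  # 3*2 + 1*7 + 4*1 = 17
```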

Task-wise Split Gradient Boosting Trees for Multi-center Diabetes Prediction

Diabetes prediction is an important data science application in the social healthcare domain. There are two main challenges in the diabetes prediction task: data heterogeneity, since demographic and metabolic data are of different types, and data insufficiency, since the number of diabetes cases in a single medical center is usually limited. To tackle these challenges, we employ gradient boosting decision trees (GBDT) to handle data heterogeneity and introduce multi-task learning (MTL) to address data insufficiency. To this end, Task-wise Split Gradient Boosting Trees (TSGB) is proposed for the multi-center diabetes prediction task. Specifically, we first introduce task gain to evaluate each task separately during tree construction, with a theoretical analysis of GBDT's learning objective. Secondly, we reveal a problem that arises when directly applying GBDT in MTL, i.e., the negative task gain problem. Finally, we propose a novel split method for GBDT in MTL based on the task gain statistics, named task-wise split, as an alternative to the standard feature-wise split, to overcome the negative task gain problem. Extensive experiments on a large-scale real-world diabetes dataset and a commonly used benchmark dataset demonstrate that TSGB achieves superior performance against several state-of-the-art methods. Detailed case studies further support our analysis of the negative task gain problem and provide insightful findings. The proposed TSGB method has been deployed as an online diabetes risk assessment software for early diagnosis.

Web-Scale Generic Object Detection at Microsoft Bing

In this paper, we present Generic Object Detection (GenOD), one of the largest object detection systems deployed to a web-scale general visual search engine that can detect over 900 categories for all Microsoft Bing Visual Search queries in near real-time. It acts as a fundamental visual query understanding service that provides object-centric information and shows gains in multiple production scenarios, improving upon domain-specific models. We discuss the challenges of collecting data, training, deploying and updating such a large-scale object detection model with multiple dependencies. We discuss a data collection pipeline that reduces per-bounding box labeling cost by 81.5% and latency by 61.2% while improving on annotation quality. We show that GenOD can improve weighted average precision by over 20% compared to multiple domain-specific models. We also improve the model update agility by nearly 2 times with the proposed disjoint detector training compared to joint fine-tuning. Finally we demonstrate how GenOD benefits visual search applications by significantly improving object-level search relevance by 54.9% and user engagement by 59.9%.

PD-Net: Quantitative Motor Function Evaluation for Parkinson's Disease via Automated Hand Gesture Analysis

Parkinson's Disease (PD) is a commonly diagnosed movement disorder with more than 10 million patients worldwide. Its clinical evaluation relies on a rating system called MDS-UPDRS, which includes subjective and error-prone motor examinations. This paper proposes an objective and interpretable visual system (PD-Net ) to quantitatively evaluate motor function of PD patients using video footage. The PD-Net consists of three modules: 1) a pose detector to infer 21 hand keypoints directly from RGB videos, 2) a movement analysis module to study temporal patterns of hand keypoints and discover motor symptoms, and 3) a scoring module to predict MDS-UPDRS ratings with retrieved symptoms. Trained with an in-house clinical dataset, PD-Net can effectively handle the unique challenges of PD examination videos, such as clinically-defined gestures, distinct self-occlusion/foreshortening effect and contextual background. And it detects hand keypoints of PD patients with an average accuracy of 84.1%, a 32.9% improvement over OpenPose. When compared to the ratings of experienced clinicians, PD-Net achieves an overall MDS-UPDRS rating score accuracy of 87.6% and Cohen's kappa of 0.82 on a testing dataset of 509 examination videos at a level exceeding human raters. This study demonstrates a clinically applicable automated video analysis system for PD clinical evaluation, which can facilitate early detection, routine monitoring, and treatment assessment.

Curriculum Meta-Learning for Next POI Recommendation

Next point-of-interest (POI) recommendation is a hot research field where a recent emerging scenario, next POI to search recommendation, has been deployed in many online map services such as Baidu Maps. One of the key issues in this scenario is providing satisfactory recommendation services for cold-start cities with a limited number of user-POI interactions, which requires transferring the knowledge hidden in rich data from many other cities to these cold-start cities. Existing literature either does not consider the city-transfer issue or cannot simultaneously tackle the data sparsity and pattern diversity issues among various users in multiple cities. To address these issues, we explore city-transfer next POI to search recommendation that transfers the knowledge from multiple cities with rich data to cold-start cities with scarce data. We propose a novel Curriculum Hardness Aware Meta-Learning (CHAML) framework, which incorporates hard sample mining and curriculum learning into a meta-learning paradigm. Concretely, the CHAML framework considers both city-level and user-level hardness to enhance the conditional sampling during meta training, and uses an easy-to-hard curriculum for the city-sampling pool to help the meta-learner converge to a better state. Extensive experiments on two real-world map search datasets from Baidu Maps demonstrate the superiority of CHAML framework.

Robust Object Detection Fusion Against Deception

Deep neural network (DNN) based object detection has become an integral part of numerous cyber-physical systems, perceiving physical environments and responding proactively to real-time events. Recent studies reveal that well-trained multi-task learners like DNN-based object detectors perform poorly in the presence of deception. This paper presents FUSE, a deception-resilient detection fusion approach with three novel contributions. First, we develop diversity-enhanced fusion teaming mechanisms, including diversity-enhanced joint training algorithms, for producing high diversity fusion detectors. Second, we introduce a three-tier detection fusion framework and a graph partitioning algorithm to construct fusion-verified detection outputs through three mutually reinforcing components: objectness fusion, bounding box fusion, and classification fusion. Third but not least, we provide a formal analysis of robustness enhancement by FUSE-protected systems. Extensive experiments are conducted on eleven detectors from three families of detection algorithms on two benchmark datasets. We show that FUSE guarantees strong robustness in mitigating the state-of-the-art deception attacks, including adversarial patches - a form of physical attacks using confined visual distortion.

FASER: Seismic Phase Identifier for Automated Monitoring

Seismic phase identification classifies the type of seismic wave received at a station based on the waveform (i.e., time series) recorded by a seismometer. Automated phase identification is an integrated component of large scale seismic monitoring applications, including earthquake warning systems and underground explosion monitoring. Accurate, fast, and fine-grained phase identification is instrumental for earthquake location estimation, understanding Earth's crustal and mantle structure for predictive modeling, etc. However, existing operational systems utilize multiple nearby stations for precise identification, which delays response time with added complexity and manual interventions. Moreover, single-station systems mostly perform coarse phase identification. In this paper, we revisit the seismic phase classification as an integrated part of a seismic processing pipeline. We develop a machine-learned model FASER, that takes input from a signal detector and produces phase types as output for a signal associator. The model is a combination of convolutional and long short-term memory networks. Our method identifies finer wave types, including crustal and mantle phases. We conduct comprehensive experiments on real datasets to show that FASER outperforms existing baselines. We evaluate FASER holding out sources and stations across the world to demonstrate consistent performance for novel sources and stations.

Theory meets Practice at the Median: A Worst Case Comparison of Relative Error Quantile Algorithms

Estimating the distribution and quantiles of data is a foundational task in data mining and data science. We study algorithms which provide accurate results for extreme quantile queries using a small amount of space, thus helping to understand the tails of the input distribution. Namely, we focus on two recent state-of-the-art solutions: t-digest and ReqSketch. While t-digest is a popular compact summary which works well in a variety of settings, ReqSketch comes with formal accuracy guarantees at the cost of its size growing as new observations are inserted. In this work, we provide insight into which conditions make one preferable to the other. Namely, we show how to construct inputs for t-digest that induce an almost arbitrarily large error and demonstrate that it fails to provide accurate results even on i.i.d. samples from a highly non-uniform distribution. We propose practical improvements to ReqSketch, making it faster than t-digest, while its error stays bounded on any instance. Still, our results confirm that t-digest remains more accurate on the "non-adversarial" data encountered in practice.
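Comparisons of this kind are typically phrased in terms of relative rank error, i.e., how far an estimated quantile's rank lies from the true rank, normalized by the distance to the nearer tail so that errors on extreme quantiles count more. The sketch below computes that criterion against the full data; it is an evaluation helper for illustration only, not an implementation of either sketch.

```python
import numpy as np

def relative_rank_error(data, q, estimate):
    """Relative-error criterion for judging a quantile estimate: the gap
    between the estimate's rank and the true rank q*n, normalized by
    min(q, 1-q)*n so that tail errors are weighted more heavily."""
    data = np.sort(np.asarray(data))
    n = data.size
    true_rank = q * n
    est_rank = np.searchsorted(data, estimate, side="right")
    return abs(est_rank - true_rank) / (min(q, 1.0 - q) * n)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.lognormal(size=100_000)            # highly non-uniform data
    est = np.quantile(x, 0.999) * 1.05         # a deliberately biased estimate
    print(relative_rank_error(x, 0.999, est))
```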

Would Your Tweet Invoke Hate on the Fly? Forecasting Hate Intensity of Reply Threads on Twitter

Curbing hate speech is undoubtedly a major challenge for online microblogging platforms like Twitter. While there have been studies around hate speech detection, it is not clear how hate speech finds its way into an online discussion. It is important for a content moderator to not only identify which tweet is hateful but also to predict which tweet will be responsible for accumulating hate speech. This would help in prioritizing tweets that need constant monitoring. Our analysis reveals that for hate speech to manifest in an ongoing discussion, the source tweet may not necessarily be hateful; rather, there are plenty of such non-hateful tweets which gradually invoke hateful replies, resulting in the entire reply threads becoming provocative.

In this paper, we define a novel problem: given a source tweet and a few of its initial replies, the task is to forecast the hate intensity of upcoming replies. To this end, we curate a novel dataset consisting of approx. 4.5k contemporary tweets and their entire reply threads. Our preliminary analysis confirms that the temporal evolution of hate intensity across reply threads is highly diverse, and that there is no significant correlation between the hate intensity of the source tweets and that of their reply threads. We employ seven state-of-the-art dynamic models (either statistical signal processing or deep learning-based) and show that they fail badly to forecast the hate intensity. We then propose DESSERT, a novel deep state-space model that combines the function approximation capability of deep neural networks with the uncertainty quantification capacity of statistical signal processing models. Exhaustive experiments and ablation studies show that DESSERT outperforms all the baselines substantially. Further, its deployment in an advanced AI platform designed to monitor real-world problematic hateful content has improved the aggregated insights extracted for countering the spread of online harms.

On Post-selection Inference in A/B Testing

When interpreting A/B tests, we typically focus only on the statistically significant results and take them at face value. This practice, termed post-selection inference in the statistical literature, may negatively affect both point estimation and uncertainty quantification, and therefore hinder trustworthy decision making in A/B testing. To address this issue, in this paper we explore two seemingly unrelated paths, one based on supervised machine learning and the other on empirical Bayes, and propose post-selection inferential approaches that combine the strengths of both. Through large-scale simulated and empirical examples, we demonstrate that our proposed methodologies stand out among existing ones in both reducing post-selection biases and improving confidence interval coverage rates, and we discuss how they can be conveniently adapted to real-life scenarios.
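One of the ingredients mentioned above, empirical Bayes, can be illustrated with a simple shrinkage estimator: observed lifts are pulled toward the grand mean in proportion to how noisy they are, which counteracts the winner's-curse bias of reporting only significant results. The sketch below is a generic method-of-moments version with made-up numbers, not the paper's proposed methodology.

```python
import numpy as np

def empirical_bayes_shrinkage(estimates, std_errors):
    """Shrink noisy per-experiment lift estimates toward their grand mean;
    experiments with larger standard errors are shrunk more."""
    estimates = np.asarray(estimates, dtype=float)
    se2 = np.asarray(std_errors, dtype=float) ** 2
    grand_mean = estimates.mean()
    # Method-of-moments estimate of the between-experiment variance tau^2.
    tau2 = max(estimates.var(ddof=1) - se2.mean(), 0.0)
    shrink = tau2 / (tau2 + se2)          # per-experiment shrinkage factor
    return grand_mean + shrink * (estimates - grand_mean)

if __name__ == "__main__":
    lifts = [0.08, 0.01, -0.02, 0.05]     # observed relative lifts (toy values)
    ses = [0.03, 0.03, 0.03, 0.03]
    print(empirical_bayes_shrinkage(lifts, ses))
```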

Globally Optimized Matchmaking in Online Games

As one of the core components of online games, matchmaking is the process of arranging multiple players into matches, where the quality of the matchmaking system directly determines player satisfaction and further affects the life cycle of game products. As the number of candidate players increases, the number of possible match combinations grows exponentially, so current multiplayer matchmaking implementations can only obtain locally optimal arrangements in an inefficient fashion. In this paper, we focus on the globally optimized matchmaking problem, in which the objective is to decide an optimal matching sequence for the queuing players. To tackle this challenging problem, we propose a novel data-driven matchmaking framework, called GloMatch, based on machine learning principles. By transforming the matchmaking problem into a sequential decision problem, we solve it with the help of an effective policy-based deep reinforcement learning algorithm. Quantitative experiments on simulated and online game environments demonstrate the effectiveness of the presented framework.

Causal and Interpretable Rules for Time Series Analysis

The number of complex infrastructures in industrial settings is growing, and they are not immune to unexplained recurring events such as breakdowns or failures that can have economic and environmental impacts. To understand these phenomena, sensors have been placed on the different infrastructures to track, monitor, and control the dynamics of the systems. The causal study of these data allows predictive and prescriptive maintenance to be carried out. It helps to understand how a problem arises and to find counterfactual outcomes so as to better operate and defuse the event. In this paper, we introduce a novel approach combining the case-crossover design, which is used to investigate acute triggers of diseases in epidemiology, and the Apriori algorithm, a data mining technique for finding relevant rules in a dataset. The resulting time series causal algorithm extracts interesting rules in our application case, a non-linear time series dataset. In addition, a predictive rule-based algorithm demonstrates the potential of the proposed method.

Deep Learning based Crop Row Detection with Online Domain Adaptation

Detecting crop rows from video frames in real time is a fundamental challenge in the field of precision agriculture. The deep learning based semantic segmentation method U-Net, although successful in many tasks related to precision agriculture, performs poorly on this task. The reasons include the paucity of large-scale labeled datasets in this domain, the diversity of crops, and the diversity in appearance of the same crop at various stages of its growth. In this work, we discuss the development of a practical, real-life crop row detection system in collaboration with an agricultural sprayer company. Our proposed method takes the output of semantic segmentation from U-Net and then applies a clustering-based probabilistic temporal calibration that can adapt to different fields and crops without retraining the network. Experimental results validate that our method can be used both to refine the results of the U-Net, reducing errors, and for frame interpolation of the input video stream.

Improving Protein Function Annotation via Unsupervised Pre-training: Robustness, Efficiency, and Insights

Recent work demonstrated a large ensemble of convolutional neural networks (CNNs) outperforms industry-standard approaches at annotating protein sequences that are far from the training data. These results highlight the potential of deep learning to significantly advance protein sequence annotation, but this particular system is not a practical tool for many biologists because of the computational burden of making predictions using a large ensemble. In this work, we fine-tune a transformer model that is pre-trained on millions of unlabeled natural protein sequences in order to reduce the system's compute burden at prediction time and improve accuracy. By switching from a CNN to the pre-trained transformer, we lift performance from 73.6% to 90.5% using a single model on a challenging clustering-based train-test split, where the ensemble of 59 CNNs achieved 89.0%. Through extensive stratified analysis of model performance, we provide evidence that the new model's predictions are trustworthy, even in cases known to be challenging for prior methods. Finally, we provide a case study of the biological insight enabled by this approach.

Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning

Modern online advertising systems inevitably rely on personalization methods, such as click-through rate (CTR) prediction. Recent progress in CTR prediction enjoys the rich representation capabilities of deep learning and achieves great success in large-scale industrial applications. However, these methods can suffer from a lack of exploration. Another line of prior work addresses the exploration-exploitation trade-off with contextual bandit methods, which have recently been less studied in industry due to the difficulty of extending their flexibility with deep models. In this paper, we propose a novel Deep Uncertainty-Aware Learning (DUAL) method to learn CTR models based on Gaussian processes, which can provide predictive uncertainty estimates while maintaining the flexibility of deep neural networks. DUAL can be easily implemented on existing models and deployed in real-time systems with minimal extra computational overhead. By linking the predictive uncertainty estimation ability of DUAL to well-known bandit algorithms, we further present DUAL-based Ad-ranking strategies to boost long-term utilities such as social welfare in advertising systems. Experimental results on several public datasets demonstrate the effectiveness of our methods. Remarkably, an online A/B test deployed on the Alibaba display advertising platform shows an 8.2% social welfare improvement and an 8.0% revenue lift.
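Linking predictive uncertainty to a bandit rule can be as simple as ranking by an upper-confidence score, i.e., the predicted mean plus a multiple of the predictive standard deviation. The sketch below shows that ranking step with placeholder arrays; in practice the CTR means and uncertainties would come from an uncertainty-aware model such as the one described above, and the scoring formula here is an assumption for illustration.

```python
import numpy as np

def ucb_rank_ads(pred_ctr_mean, pred_ctr_std, bid, alpha=1.0):
    """Rank ad candidates by expected value plus an exploration bonus
    proportional to the model's predictive uncertainty (UCB-style)."""
    mean = np.asarray(pred_ctr_mean, dtype=float)
    std = np.asarray(pred_ctr_std, dtype=float)
    score = np.asarray(bid, dtype=float) * (mean + alpha * std)
    return np.argsort(-score)             # indices of ads, best first

if __name__ == "__main__":
    order = ucb_rank_ads([0.02, 0.05, 0.03], [0.010, 0.002, 0.020],
                         bid=[1.0, 1.0, 1.0])
    print(order)  # uncertain but promising ads receive an exploration boost
```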

Clustering for Private Interest-based Advertising

We study the problem of designing privacy-enhanced solutions for interest-based advertising (IBA). IBA is a key component of the online ads ecosystem and provides a better ad experience to users. Indeed, IBA enables advertisers to show users impressions that are relevant to them. Nevertheless, the current way ad tech companies achieve this is by building detailed interest profiles for individual users. In this work we ask whether such fine-grained personalization is required, and we present mechanisms that achieve competitive performance while giving privacy guarantees to the end users. More precisely, we present the first detailed exploration of how to implement Chrome's Federated Learning of Cohorts (FLoC) API. We define the privacy properties required for the API and evaluate multiple hashing and clustering algorithms, discussing the trade-offs between utility, privacy, and ease of implementation.

Automated Testing of Graphics Units by Deep-Learning Detection of Visual Anomalies

We present a novel system for performing real-time detection of diverse visual corruptions in videos, used for validating the quality of graphics units in our company. The system is used for several types of content, including movies and 3D graphics, with strict constraints on low false alert rates and real-time processing of millions of video frames per day. These constraints required novel solutions involving both hardware and software, including new supervised and weakly-supervised methods we developed. Our deployed system has enabled an approximately 20X reduction in human effort and has discovered new corruptions missed by humans and existing approaches.

Meta-Learned Spatial-Temporal POI Auto-Completion for the Search Engine at Baidu Maps

Point Of Interest Auto-Completion (abbr. as POI-AC) is one of the featured functions of the search engine at Baidu Maps. It can dynamically suggest a list of POI candidates within milliseconds as a user enters each character (e.g., English, Chinese, or Pinyin character) into the search box. Ideally, a user may need to provide only one character and immediately obtain the desired POI at the top of the list suggested by POI-AC. In this way, the user's keystrokes can be dramatically saved, which significantly reduces the time and effort of typing, especially on mobile devices that have limited space for display and user interfaces. Despite using a user's profile and input prefixes for personalized POI suggestions, however, the state-of-the-art approach, i.e., P3AC, still has a long way to go to generate not only personalized but, more importantly, time- and geography-aware suggestions. In this paper, we find that 17.9% of users tend to look for diverse POIs at different times or locations using the same prefix. This insight drives us to establish an end-to-end spatial-temporal POI-AC (abbr. as ST-PAC) module to replace P3AC at Baidu Maps. To alleviate the problem of the long-tail distribution of time- and location-specific data on POI-AC, we further propose a meta-learned ST-PAC (abbr. as MST-PAC) updated by an efficient MapReduce algorithm. MST-PAC can significantly overcome the long-tail issue and rapidly adapt to cold-start POI-AC tasks with fewer examples. We sample several benchmark datasets from the large-scale search logs at Baidu Maps to assess the offline performance of MST-PAC using multiple metrics, including Mean Reciprocal Rank (MRR), Success Rate (SR) and normalized Discounted Cumulative Gain (nDCG). The consistent improvements on these metrics gave us the confidence to launch this meta-learned POI-AC module online. As a result, the critical online indicator of user satisfaction, i.e., the average number of keystrokes in a POI-AC session, has significantly decreased as well. MST-PAC has now been deployed in production at Baidu Maps, handling billions of POI-AC requests every day. This confirms that MST-PAC is a practical and robust industrial solution for large-scale POI search.

Heterogeneous Temporal Graph Transformer: An Intelligent System for Evolving Android Malware Detection

The explosive growth and increasing sophistication of Android malware call for new defensive techniques to protect mobile users against novel threats. To address this challenge, in this paper we propose and develop an intelligent system named Dr.Droid, which is, to the best of our knowledge, the first attempt to jointly model malware propagation and evolution for malware detection. In Dr.Droid, we first exploit higher-level semantic and social relations within the ecosystem (e.g., app-market, app-developer, and market-developer relations) to characterize app propagation patterns, and then present a structured heterogeneous graph to model the complex relations among different types of entities. To capture malware evolution, we further consider the temporal dependence and introduce a heterogeneous temporal graph to jointly model malware propagation and evolution, considering heterogeneous spatial dependencies along temporal dimensions. Afterwards, we propose a novel heterogeneous temporal graph transformer framework (denoted as HTGT) to integrate both spatial and temporal dependencies while preserving heterogeneity, in order to learn node representations for malware detection. Specifically, in our proposed HTGT, to preserve heterogeneity, we devise a heterogeneous spatial transformer that derives heterogeneous attentions over each node and edge to learn dedicated representations for different types of entities and relations; to model temporal dependencies, we design a temporal transformer within the HTGT to attentively aggregate the historical sequences of a given node (e.g., an app); the two transformers work in an iterative manner for representation learning. Promising experimental results based on large-scale sample collections from the anti-malware industry demonstrate the performance of Dr.Droid in comparison with state-of-the-art baselines and popular mobile security products.

SSML: Self-Supervised Meta-Learner for En Route Travel Time Estimation at Baidu Maps

Travel time estimation (TTE) is one of the most critical modules at Baidu Maps, which plays a vital role in intelligent transportation services such as route planning and navigation. During the driving en route, the navigation system of Baidu Maps can provide real-time estimations on when a user will arrive at the destination. It automatically recalculates and updates the remaining travel time from the driver's current position to the destination (hereafter referred to as remaining route) every few minutes. The previously deployed TTE model at Baidu Maps, i.e., ConSTGAT, takes the remaining route as well as the current time as input and provides the corresponding estimated time of arrival. However, it ignores the route that has been already traveled from the origin to the driver's current position (hereafter referred to as traveled route), which could contribute to improving the accuracy of time estimation. In this work, we believe that the traveled route conveys valuable evidence that could facilitate the modeling of driving preference and take that into consideration for the task of en route travel time estimation (ER-TTE). This task is non-trivial because it requires adapting fast to a user's driving preference using a few observed behaviors in the traveled route. To this end, we frame ER-TTE as a few-shot learning problem and consider the observed behaviors in the traveled route as training examples while the future behaviors in the remaining route as test examples. To tackle the few-shot learning problem, we propose a novel model-based meta-learning approach, called SSML, to learn the meta-knowledge so as to fast adapt to a user's driving preference and improve the time estimation of the remaining route. SSML leverages the technique of self-supervised learning, which is equivalent to generating a significant number of synthetic learning tasks, to further improve the performance. Extensive offline tests conducted on large-scale real-world datasets collected from Baidu Maps demonstrate the superiority of SSML. The online tests before deploying in production were successfully performed, which confirms the practical applicability of SSML.

MoCha: Large-Scale Driving Pattern Characterization for Usage-based Insurance

Given widely adopted vehicle tracking technologies, usage-based insurance has been a rising market over the past few years. In exchange for potential discounts, customers voluntarily install sensing devices in their vehicles, which insurance companies use to analyze historical driving patterns and derive the risks of future driving. However, it is challenging to characterize and predict driving patterns, especially for new users with limited data. To address this issue, we propose and evaluate a system called MoCha to accurately characterize driving patterns for usage-based insurance. The key question we aim to explore with MoCha is whether we can fully explore the long-term driving patterns of new users, given only limited historical data about them, by leveraging abundant data from other users and contextual information. To answer this question, we design (i) a multi-level driving pattern modeling component to capture the spatial-temporal dependency at both the individual and group levels, and (ii) a multi-task learning method to utilize the underlying relations of driving metrics and predict multiple driving metrics simultaneously. We implement and evaluate MoCha with real-world on-board diagnostics data from a large insurance company with more than 340,000 vehicles. Further, we validate the usefulness of MoCha by predicting driving risks based on real-world claim data in a Chinese city, Shenzhen.

Time Series Anomaly Detection for Cyber-physical Systems via Neural System Identification and Bayesian Filtering

Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures in cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging, as the model has to accurately detect anomalies in the presence of highly complicated system dynamics and an unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF), in which a specially crafted neural network architecture is used for system identification, i.e., capturing the dynamics of the CPS in a dynamical state-space model; a Bayesian filtering algorithm is then naturally applied on top of the "identified" state-space model for robust anomaly detection, tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic dataset and three real-world CPS datasets, showing that NSIBF compares favorably to state-of-the-art methods, with considerable improvements in anomaly detection for CPS.
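
The abstract does not spell out the filtering recursion itself. The sketch below illustrates the general recipe of running a Kalman-style predict/update loop on top of a state-space model and using the measurement residual as an anomaly score; in NSIBF the transition and measurement functions would be the learned neural networks, whereas here linear maps and all parameter values are placeholders.

```python
# Illustrative sketch (not the authors' code): a Kalman-style predict/update loop
# on top of a state-space model, scoring anomalies by the Mahalanobis distance of
# the measurement residual. Linear maps stand in for NSIBF's learned networks.
import numpy as np

F = np.array([[0.95, 0.1], [0.0, 0.9]])   # placeholder state transition
H = np.array([[1.0, 0.0]])                 # placeholder measurement map
Q = 0.01 * np.eye(2)                       # process noise covariance
R = np.array([[0.1]])                      # measurement noise covariance

def filter_step(x, P, z):
    """One predict/update cycle; returns new state, covariance, and an anomaly score."""
    # Predict: propagate the hidden state and its uncertainty.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: compare the observed measurement with the prediction.
    y = z - H @ x_pred                         # innovation (residual)
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(2) - K @ H) @ P_pred
    score = float(y.T @ np.linalg.inv(S) @ y)  # Mahalanobis distance as anomaly score
    return x_new, P_new, score

x, P = np.zeros(2), np.eye(2)
for z in np.random.randn(100, 1) * 0.3:        # toy sensor stream
    x, P, score = filter_step(x, P, z)
    if score > 9.0:                            # threshold would be chosen on validation data
        print("anomaly, score =", round(score, 2))
```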

Adversarial Attacks on Deep Models for Financial Transaction Records

Machine learning models that use transaction records as inputs are popular among financial institutions. The most efficient models use deep-learning architectures similar to those in the NLP community, which poses a challenge due to their tremendous number of parameters and limited robustness. In particular, deep-learning models are vulnerable to adversarial attacks: a small change in the input can corrupt the model's output. In this work, we examine adversarial attacks on transaction records data and defenses against these attacks. Transaction records have a different structure from canonical NLP or time-series data, as neighboring records are less connected than words in sentences, and each record consists of both a discrete merchant code and a continuous transaction amount. We consider a black-box attack scenario, where the attacker does not know the true decision model, and pay special attention to adding transaction tokens to the end of a sequence. These limitations provide a more realistic scenario, previously unexplored in the NLP world. The proposed adversarial attacks and the respective defenses demonstrate remarkable performance on relevant datasets from the financial industry. Our results show that a couple of generated transactions are sufficient to fool a deep-learning model. Further, we improve model robustness via adversarial training or separate adversarial-example detection. This work shows that embedding protection from adversarial attacks improves model robustness, allowing a wider adoption of deep models for transaction records in banking and finance.
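
To make the black-box, append-only constraint concrete, here is a minimal greedy sketch: candidate merchant-code tokens are appended to the end of the sequence one at a time, and the attacker keeps whichever token most reduces the model's score, querying only its output. This is an illustration of the attack setting described above, not the paper's algorithm; `score_fn` and all names are hypothetical.

```python
# Illustrative sketch (not the paper's algorithm): a greedy black-box attack that
# appends transaction tokens to the end of a sequence, querying only the model's
# output score. `score_fn` is a hypothetical oracle returning P(target class).
from typing import Callable, List, Sequence

def greedy_append_attack(sequence: List[int],
                         vocab: Sequence[int],
                         score_fn: Callable[[List[int]], float],
                         max_added: int = 3) -> List[int]:
    attacked = list(sequence)
    for _ in range(max_added):
        base = score_fn(attacked)
        # Try every candidate merchant-code token and keep the one that lowers
        # the model's confidence the most.
        best_tok, best_score = None, base
        for tok in vocab:
            s = score_fn(attacked + [tok])
            if s < best_score:
                best_tok, best_score = tok, s
        if best_tok is None:          # no single appended token helps any further
            break
        attacked.append(best_tok)
    return attacked

# Toy usage: a fake "model" that lowers its score when token 7 appears near the end.
toy_score = lambda seq: 0.9 - 0.2 * seq[-3:].count(7)
print(greedy_append_attack([1, 2, 3], vocab=range(10), score_fn=toy_score))
```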

A Deep Learning Method for Route and Time Prediction in Food Delivery Service

Online food ordering and delivery service has widely served people's daily demands worldwide, e.g., it has reached a number of 34.9 million online orders per day in Q3 of 2020 in Meituan food delivery platform. For the food delivery service, accurate estimation of the driver's delivery route and time, defined as the FD-RTP task, is very significant to customer satisfaction and driver experience. In the paper, we apply deep learning to the FD-RTP task for the first time, and propose a deep network named FDNET. Different from traditional heuristic search algorithms, we predict the probability of each feasible location the driver will visit next, through mining a large amount of food delivery data. Guided by the probabilities, FDNET greatly reduces the search space in delivery route generation, and the calculation times of time prediction. As a result, various kinds of information can be fully utilized in FDNET within the limited computation time. Careful consideration of the factors having effect on the driver's behaviors and introduction of more abundant spatiotemporal information both contribute to the improvements. Offline experiments over the large-scale real-world dataset, and online A/B test demonstrate the effectiveness of our proposed FDNET.

Real Negatives Matter: Continuous Training with Real Negatives for Delayed Feedback Modeling

One of the difficulties of conversion rate (CVR) prediction is that conversions can be delayed, taking place long after the clicks. This delayed feedback poses a challenge: fresh data are beneficial to continuous training but may not have complete label information at the time they are ingested into the training pipeline. To balance model freshness and label certainty, previous methods set a short waiting window or even do not wait for the conversion signal at all. If a conversion happens outside the waiting window, the sample is duplicated and ingested into the training pipeline with a positive label. However, these methods have some issues. First, they assume the observed feature distribution remains the same as the actual distribution, but this assumption does not hold because of the ingestion of duplicated samples. Second, the certainty of the conversion action comes only from the positives, but positives are scarce since conversions are sparse in commercial systems. These issues induce bias when modeling delayed feedback. In this paper, we propose the DElayed FEedback modeling with Real negatives (DEFER) method to address these issues. The proposed method ingests real negative samples into the training pipeline. The ingestion of real negatives ensures that the observed feature distribution is equivalent to the actual distribution, thus reducing the bias. The ingestion of real negatives also brings more certain information about conversions. To correct the distribution shift, DEFER employs importance sampling to weight the loss function. Experimental results on industrial datasets validate the superiority of DEFER. DEFER has been deployed in the display advertising system of Alibaba, obtaining over a 6.0% improvement in CVR in several scenarios. The code and data in this paper are now open-sourced (https://github.com/gusuperstar/defer.git).
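
As a concrete illustration of weighting the loss by importance sampling, the sketch below shows a binary cross-entropy in which every sample carries an importance weight correcting for the gap between the observed and actual distributions (e.g., down-weighting duplicated positives while real negatives keep weight 1). This is a minimal sketch under those assumptions; the exact weighting scheme used by DEFER may differ.

```python
# Minimal sketch of an importance-sampling-weighted cross-entropy loss for delayed
# feedback, assuming each sample carries a weight w = p_actual(x) / p_observed(x).
# The exact weights used by DEFER are derived in the paper and may differ.
import numpy as np

def weighted_log_loss(p_pred, labels, weights):
    """Importance-weighted binary cross-entropy.

    p_pred  : predicted conversion probabilities in (0, 1)
    labels  : 1 for (possibly delayed) conversions, 0 for real negatives
    weights : importance weights correcting the observed/actual distribution shift
    """
    p = np.clip(p_pred, 1e-7, 1 - 1e-7)
    ll = labels * np.log(p) + (1 - labels) * np.log(1 - p)
    return -np.mean(weights * ll)

# Toy example: duplicated positives are down-weighted, real negatives kept at 1.0.
p_pred  = np.array([0.8, 0.3, 0.6, 0.1])
labels  = np.array([1.0, 0.0, 1.0, 0.0])
weights = np.array([0.5, 1.0, 0.5, 1.0])
print(weighted_log_loss(p_pred, labels, weights))
```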

Multi-Agent Cooperative Bidding Games for Multi-Objective Optimization in e-Commercial Sponsored Search

Bid optimization for online advertising from a single advertiser's perspective has been thoroughly investigated in both academic research and industrial practice. However, existing work typically assumes competitors do not change their bids, i.e., the winning price is fixed, leading to poor performance of the derived solution. Although a few studies use multi-agent reinforcement learning to set up a cooperative game, they still suffer from the following drawbacks: (1) They fail to avoid collusion solutions where all the advertisers involved in an auction collude to bid an extremely low price on purpose. (2) Previous works cannot handle the underlying complex bidding environment well, leading to poor model convergence. This problem is amplified when handling multiple objectives of advertisers, which are practical demands but were not considered by previous work. In this paper, we propose a novel multi-objective cooperative bid optimization formulation called Multi-Agent Cooperative bidding Games (MACG). MACG sets up a carefully designed multi-objective optimization framework in which different objectives of advertisers are incorporated. A global objective to maximize the overall profit of all advertisements is added in order to encourage better cooperation and also to protect self-bidding advertisers. To avoid collusion, we also introduce an extra platform revenue constraint. We analyze the optimal functional form of the bidding formula theoretically and design a policy network accordingly to generate auction-level bids. Then we design an efficient multi-agent evolutionary strategy for model optimization. The evolutionary strategy does not need to model the underlying environment explicitly and is thus well suited to bid optimization. Offline experiments and online A/B tests conducted on the Taobao platform indicate that both individual advertisers' objectives and the global profit are significantly improved compared to state-of-the-art methods.

An Embedding Learning Framework for Numerical Features in CTR Prediction

Click-Through Rate (CTR) prediction is critical for industrial recommender systems, where most deep CTR models follow an Embedding & Feature Interaction paradigm. However, the majority of methods focus on designing network architectures to better capture feature interactions, while the feature embedding, especially for numerical features, has been overlooked. Existing approaches for numerical features struggle to capture informative knowledge because of their low capacity or because of hard discretization based on offline expert feature engineering. In this paper, we propose a novel embedding learning framework for numerical features in CTR prediction (AutoDis) with high model capacity, end-to-end training, and unique representation properties preserved. AutoDis consists of three core components: meta-embeddings, automatic discretization, and aggregation. Specifically, we propose meta-embeddings for each numerical field to learn global knowledge at the field level with a manageable number of parameters. Then the differentiable automatic discretization performs soft discretization and captures the correlations between the numerical features and meta-embeddings. Finally, distinctive and informative embeddings are learned via an aggregation function. Comprehensive experiments on two public datasets and one industrial dataset are conducted to validate the effectiveness of AutoDis. Moreover, AutoDis has been deployed onto a mainstream advertising platform, where an online A/B test demonstrates improvements over the base model of 2.1% and 2.7% in terms of CTR and eCPM, respectively. In addition, the code of our framework is publicly available in MindSpore.
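
The three components named above can be illustrated with a small sketch: a table of per-field meta-embeddings, a differentiable soft assignment of each numerical value over those embeddings, and a weighted-sum aggregation. The scoring function, bucket count, and dimensions below are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch of the three AutoDis components described in the abstract:
# per-field meta-embeddings, a differentiable soft discretization, and aggregation.
# The scoring function and hyper-parameters here are placeholders, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
NUM_BUCKETS, EMB_DIM = 8, 4

meta_embeddings = rng.normal(size=(NUM_BUCKETS, EMB_DIM))   # shared within one numerical field
w, b = rng.normal(size=NUM_BUCKETS), rng.normal(size=NUM_BUCKETS)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def autodis_embed(value: float) -> np.ndarray:
    """Embed a single numerical value for one field."""
    scores = softmax(w * value + b)      # soft assignment over the meta-embedding buckets
    return scores @ meta_embeddings      # aggregation: weighted sum of meta-embeddings

print(autodis_embed(0.37))               # a distinct, end-to-end-learnable embedding
```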

We Know What You Want: An Advertising Strategy Recommender System for Online Advertising

Advertising expenditures have become the major source of revenue for e-commerce platforms. Providing good advertising experiences for advertisers by reducing their costs of trial and error in discovering the optimal advertising strategies is crucial for the long-term prosperity of online advertising. To achieve this goal, the advertising platform needs to identify the advertiser's optimization objectives, and then recommend the corresponding strategies to fulfill the objectives. In this work, we first deploy a prototype of a strategy recommender system on the Taobao display advertising platform, which indeed increases the advertisers' performance and the platform's revenue, indicating the effectiveness of strategy recommendation for online advertising. We further augment this prototype system by explicitly learning the advertisers' preferences over various advertising performance indicators, and hence their optimization objectives, from their adoption of different recommended advertising strategies. We use contextual bandit algorithms to efficiently learn the advertisers' preferences and maximize recommendation adoption simultaneously. Simulation experiments based on Taobao online bidding data show that the designed algorithms can effectively optimize the strategy adoption rate of advertisers.

Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism

In this paper, we consider hybrid parallelism---a paradigm that employs both Data Parallelism (DP) and Model Parallelism (MP)---to scale distributed training of large recommendation models. We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training. DCT filters the entities to be communicated across the network through a simple hard-thresholding function, allowing only the most relevant information to pass through. For communication-efficient DP, DCT compresses the parameter gradients sent to the parameter server during model synchronization. The threshold is updated only once every few thousand iterations to reduce the computational overhead of compression. For communication-efficient MP, DCT incorporates a novel technique to compress the activations and gradients sent across the network during the forward and backward propagation, respectively. This is done by identifying and updating only the most relevant neurons of the neural network for each training sample in the data. We evaluate DCT on publicly available natural language processing and recommender models and datasets, as well as recommendation systems used in production at Facebook. DCT reduces communication by at least 100x and 20x during DP and MP, respectively. The algorithm has been deployed in production, and it improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
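
The core mechanism above, hard-thresholding with an infrequently refreshed threshold, can be sketched in a few lines: keep only the entries whose magnitude exceeds a threshold, and recompute that threshold only every few thousand steps. The keep-ratio and refresh period below are illustrative assumptions, not the values used by DCT.

```python
# Illustrative sketch of hard-thresholding compression in the spirit of DCT: only
# entries whose magnitude exceeds a threshold are communicated, and the threshold
# is refreshed only every `refresh_every` iterations to keep compression overhead
# low. The 1% keep-ratio and refresh period are assumptions for illustration.
import numpy as np

class HardThresholdCompressor:
    def __init__(self, keep_ratio=0.01, refresh_every=1000):
        self.keep_ratio = keep_ratio
        self.refresh_every = refresh_every
        self.threshold = 0.0
        self.step = 0

    def compress(self, grad: np.ndarray):
        if self.step % self.refresh_every == 0:
            # Recompute the threshold as the k-th largest magnitude.
            k = max(1, int(self.keep_ratio * grad.size))
            self.threshold = np.partition(np.abs(grad).ravel(), -k)[-k]
        self.step += 1
        mask = np.abs(grad) >= self.threshold
        idx = np.nonzero(mask.ravel())[0]
        return idx, grad.ravel()[idx]          # sparse payload sent over the network

compressor = HardThresholdCompressor()
idx, vals = compressor.compress(np.random.randn(10_000))
print(f"sent {idx.size} of 10000 entries")
```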

Budget Allocation as a Multi-Agent System of Contextual & Continuous Bandits

Budget allocation for online advertising suffers from multiple complications, including the significant delay between the initial ad impression and the call to action, as well as cold-start prediction problems for ad campaigns with limited or no historical performance data. To address these issues, we introduce the Contextual Budgeting System (CBS), a budget allocation framework using a multi-agent system of contextual and continuous Multi-Armed Bandits. Our proposed solution decomposes the problem into a convex optimization problem whose objective is drawn using Thompson Sampling. In order to efficiently deal with context and cold-start, we propose a transfer learning mechanism using supervised learning methods that augment simple parametric models.

We apply an implementation of this algorithm to all spending for new driver acquisition at Lyft and measure a (22 ± 10)% improvement in the mean Cost Per user Acquisition (CPA) over a previous non-contextual model, based on Markov Chain Monte-Carlo, generating tens of millions of dollars annually in efficiency improvements while also increasing total user acquisition.

MEDTO: Medical Data to Ontology Matching Using Hybrid Graph Neural Networks

Medical ontologies are widely used to describe and organize medical terminologies and to support many critical applications on healthcare databases. These ontologies are often manually curated (e.g., UMLS, SNOMED CT, and MeSH) by medical experts. Medical databases, on the other hand, are often created by database administrators, using different terminology and structures. The discrepancies between medical ontologies and databases compromise interoperability between them. Data to ontology matching is the process of finding semantic correspondences between tables in databases and standard ontologies. Existing solutions such as ontology matching have mostly focused on engineering features from terminological, structural, and semantic model information extracted from the ontologies. However, this is often labor-intensive and the accuracy varies greatly across different ontologies. Worse yet, the ontology capturing a medical database is often not given in practice. In this paper, we propose MEDTO, a novel end-to-end framework that consists of three innovative techniques: (1) a lightweight yet effective method that bootstraps a semantically rich ontology from a given medical database, (2) a hyperbolic graph convolution layer that encodes hierarchical concepts in the hyperbolic space, and (3) a heterogeneous graph layer that encodes both local and global context information of a concept. Experiments on two real-world medical datasets matching against SNOMED CT show significant improvements compared to the state-of-the-art methods. MEDTO also consistently achieves competitive results on a benchmark from the Ontology Alignment Evaluation Initiative.

Hierarchical Reinforcement Learning for Scarce Medical Resource Allocation with Imperfect Information

In the face of the COVID-19 outbreak, shortages of medical resources have become increasingly severe. Therefore, efficient strategies for medical resource allocation are urgently called for. Reinforcement learning (RL) is powerful for decision making, but three key challenges exist in solving this problem via RL: (1) the complex situation and countless choices for decision making in the real world; (2) only imperfect information is available due to the latency of pandemic spreading; (3) limitations on conducting experiments in the real world, since we cannot trigger pandemic outbreaks arbitrarily. In this paper, we propose a hierarchical reinforcement learning method with a corresponding training algorithm. We design a decomposed action space to deal with the countless choices and to ensure efficient and real-time strategies. We also design a recurrent neural network based framework to utilize the imperfect information obtained from the environment. We build a pandemic spreading simulator based on real-world data, serving as the experimental platform. We conduct extensive experiments and the results show that our method outperforms all the baselines, reducing infections and deaths by 14.25% on average.

Adversarial Feature Translation for Multi-domain Recommendation

Real-world super platforms such as Google and WeChat usually have different recommendation scenarios to provide heterogeneous items for users' diverse demands. Multi-domain recommendation (MDR) is proposed to improve all recommendation domains simultaneously, where the key point is to capture informative domain-specific features from all domains. To address this problem, we propose a novel Adversarial feature translation (AFT) model for MDR, which learns the feature translations between different domains under a generative adversarial network framework. Precisely, in the multi-domain generator, we propose a domain-specific masked encoder to highlight inter-domain feature interactions, and then aggregate these features via a transformer and domain-specific attention. In the multi-domain discriminator, we explicitly model the relationships between item, domain, and users' general/domain-specific representations with a two-step feature translation inspired by knowledge representation learning. In experiments, we evaluate AFT on a public and an industrial MDR dataset and achieve significant improvements. We also conduct an online evaluation on a real-world MDR system. We further give detailed ablation tests and model analyses to verify the effectiveness of different components. Currently, we have deployed AFT on WeChat Top Stories. The source code is available at https://github.com/xiaobocser/AFT.

Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud

Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and explaining predictions. It is deeply integrated into Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy ML models at any scale. Clarify supports bias detection and feature importance computation across the ML lifecycle, during data preparation, model evaluation, and post-deployment monitoring. We outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. Further, we describe the technical challenges encountered and the tradeoffs we had to make. For illustration, we discuss two customer use cases. We present our deployment results including qualitative customer feedback and a quantitative evaluation. Finally, we summarize lessons learned, and discuss best practices for the successful adoption of fairness and explanation tools in practice.

Neural Instant Search for Music and Podcast

Over recent years, podcasts have emerged as a novel medium for sharing and broadcasting information over the Internet. Audio streaming platforms originally designed for music content, such as Amazon Music, Pandora, and Spotify, have reported a rapid growth, with millions of users consuming podcasts every day. With podcasts emerging as a new medium for consuming information, the need to develop information access systems that enable efficient and effective discovery from a heterogeneous collection of music and podcasts is more important than ever. However, information access in such domains still remains understudied. In this work, we conduct a large-scale log analysis to study and compare podcast and music search behavior on Spotify, a major audio streaming platform. Our findings suggest that there exist fundamental differences in user behavior while searching for podcasts compared to music. Specifically, we identify the need to improve podcast search performance. We propose a simple yet effective transformer-based neural instant search model that retrieves items from a heterogeneous collection of music and podcast content. Our model takes advantage of multi-task learning to optimize for a ranking objective in addition to a query intent type identification objective. Our experiments on large-scale search logs show that the proposed model significantly outperforms strong baselines for both podcast and music queries.

A Unified Solution to Constrained Bidding in Online Display Advertising

In online display advertising, advertisers usually participate in real-time bidding to acquire ad impression opportunities. In most advertising platforms, a typical impression-acquiring demand of advertisers is to maximize the total value of winning impressions under budget and key performance indicator constraints (e.g., maximizing clicks subject to a budget and a cost-per-click upper bound). The demand can vary in value type (e.g., ad exposure/click), constraint type (e.g., cost per unit value), and the number of constraints. Existing works usually focus on a specific demand or hardly achieve the optimum. In this paper, we formulate the demand as a constrained bidding problem and deduce a unified optimal bidding function on behalf of an advertiser. The optimal bidding function enables an advertiser to calculate bids for all impressions with only m parameters, where m is the number of constraints. However, in real applications, it is non-trivial to determine the parameters due to the non-stationary auction environment. We further propose a reinforcement learning (RL) method to dynamically adjust the parameters to achieve the optimum, whose convergence efficiency is significantly boosted by the recursive optimization property in our formulation. We name the formulation and the RL method, together, the Unified Solution to Constrained Bidding (USCB). USCB is verified to be effective on industrial datasets and is deployed in the Alibaba display advertising platform.

Purify and Generate: Learning Faithful Item-to-Item Graph from Noisy User-Item Interaction Behaviors

Matching is typically the first and most fundamental step in recommender systems: quickly selecting hundreds or thousands of related entities from the whole commodity pool. Among all the matching methods, item-to-item (I2I) graph-based matching is a handy and highly effective approach that is widely used in most applications, owing to the essential relationships between entities captured in a powerful I2I graph. Yet, the I2I graph is not a ready-made product in a data source. To obtain it from users' behaviors, a common practice in industry is to construct the graph directly based on the similarity of item embeddings or on co-occurrence frequency. However, these methods tend to lose the complicated (high-order or nonlinear) correlations inside decision-making actions and cannot achieve a globally optimal solution. Moreover, the correlations between items are usually contained in users' short-term actions, which are full of noise (e.g., spurious associations, missing connections). It is therefore vitally important to filter out noise while generating the graph. In this paper, we propose a novel framework called Purified Graph Generation (PGG) dedicated to learning a faithful I2I graph from sparse and noisy behavior data. We capture the 'confidence value' between a user and an item to discard exceptional actions during decision making, and leverage it to re-sample purified sets that are fed into an unsupervised I2I graph structure learning framework called GPBG. Extensive experimental results on both simulated and real data demonstrate that our method significantly improves the quality of the I2I graph compared to typical baselines.

Analysis of Faces in a Decade of US Cable TV News

Cable (TV) news reaches millions of US households each day. News stakeholders such as communications researchers, journalists, and media monitoring organizations are interested in the visual content of cable news, especially who is on-screen. Manual analysis, however, is labor intensive and limits the size of prior studies. We conduct a large-scale, quantitative analysis of the faces in a decade of cable news video from the top three US cable news networks (CNN, FOX, and MSNBC), totaling 244,038 hours between January 2010 and July 2019. Our work uses technologies such as automatic face and gender recognition to measure the "screen time" of faces and to enable visual analysis and exploration at scale. Our analysis method gives insight into a broad set of socially relevant topics. For instance, male-presenting faces receive much more screen time than female-presenting faces (2.4x in 2010, 1.9x in 2019). To make our dataset and annotations accessible, we release a public interface at https://tvnews.stanford.edu that allows the general public to write queries and to perform their own analyses.

Markdowns in E-Commerce Fresh Retail: A Counterfactual Prediction and Multi-Period Optimization Approach

While markdowns in retail have been studied for decades in traditional business, e-commerce fresh retail nowadays brings many more challenges. Due to the limited shelf life of perishable products and the limited opportunities for price changes, it is difficult to predict the sales of a product at a counterfactual price, and therefore it is hard to determine the optimal discount price to control inventory and maximize future revenue. Traditional machine learning-based methods have high predictability but cannot properly reveal the relationship between sales and price. Traditional economic models have high interpretability but low prediction accuracy. In this paper, by leveraging abundant observational transaction data, we propose a novel data-driven and interpretable pricing approach for markdowns, consisting of counterfactual prediction and multi-period price optimization. First, we build a semi-parametric structural model to learn individual price elasticity and predict counterfactual demand. This semi-parametric model takes advantage of both the predictability of nonparametric machine learning models and the interpretability of economic models. Second, we propose a multi-period dynamic pricing algorithm to maximize the overall profit of a perishable product over its finite selling horizon. Different from traditional approaches that use deterministic demand, we model the uncertainty of counterfactual demand, since the prediction process inevitably involves randomness. Based on the stochastic model, we derive a sequential pricing strategy via a Markov decision process, and design a two-stage algorithm to solve it. The proposed algorithm is very efficient: it reduces the time complexity from exponential to polynomial. Experimental results show the advantages of our pricing algorithm, and the proposed framework has been successfully deployed to the well-known e-commerce fresh retail scenario Freshippo.
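
To make the elasticity-based counterfactual prediction concrete, a minimal constant-elasticity sketch is shown below: a flexible model supplies baseline demand at a reference price, and a learned elasticity extrapolates demand to unobserved prices. This is a simplified stand-in for the semi-parametric structural model described above; the functional form, numbers, and names are assumptions for illustration.

```python
# Minimal constant-elasticity sketch of counterfactual demand prediction: demand
# observed at a reference price is extrapolated to a new price via a learned
# elasticity. The real model is semi-parametric; this form is an assumption.
import numpy as np

def counterfactual_demand(base_demand: float,
                          ref_price: float,
                          new_price: float,
                          elasticity: float) -> float:
    """Demand at `new_price`, given `base_demand` observed at `ref_price`."""
    return base_demand * (new_price / ref_price) ** elasticity

# A perishable item selling 120 units/day at price 10.0, with elasticity -1.8
# (demand rises as the price drops). Evaluate a few candidate markdowns.
for discount in (1.0, 0.9, 0.8, 0.7):
    p = 10.0 * discount
    print(p, round(counterfactual_demand(120, 10.0, p, elasticity=-1.8), 1))
```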

HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps

The increasing interest in international travel has raised the demand for retrieving points of interest (POIs) in multiple languages. This is especially true when looking for local venues such as restaurants and scenic spots in an unfamiliar language while traveling abroad. Multilingual POI retrieval, enabling users to find desired POIs in a demanded language using queries in numerous languages, has become an indispensable feature of today's global map applications such as Baidu Maps. This task is non-trivial because of two key challenges: (1) visiting sparsity and (2) multilingual query-POI matching. To this end, we propose a Heterogeneous Graph Attention Matching Network (HGAMN) to concurrently address both challenges. Specifically, we construct a heterogeneous graph that contains two types of nodes, POI nodes and query nodes, using the search logs of Baidu Maps. First, to alleviate challenge #1, we construct edges between different POI nodes to link the low-frequency POIs with the high-frequency ones, which enables the transfer of knowledge from the latter to the former. Second, to mitigate challenge #2, we construct edges between POI and query nodes based on the co-occurrences between queries and POIs, where queries in different languages and formulations can be aggregated for individual POIs. Moreover, we develop an attention-based network to jointly learn node representations of the heterogeneous graph and further design a cross-attention module to fuse the representations of both types of nodes for query-POI relevance scoring. In this way, the relevance ranking between multilingual queries and POIs with different popularity can be better handled. Extensive experiments conducted on large-scale real-world datasets from Baidu Maps demonstrate the superiority and effectiveness of HGAMN. In addition, HGAMN has already been deployed in production at Baidu Maps, where it serves hundreds of millions of requests every day. Compared with the previously deployed model, HGAMN achieves significant performance improvement, which confirms that HGAMN is a practical and robust solution for large-scale real-world multilingual POI retrieval service.

Sliding Spectrum Decomposition for Diversified Recommendation

Content feed, a type of product that recommends a sequence of items for users to browse and engage with, has gained tremendous popularity among social media platforms. In this paper, we propose to study the diversity problem in such a scenario from an item sequence perspective using time series analysis techniques. We derive a method called sliding spectrum decomposition (SSD) that captures users' perception of diversity in browsing a long item sequence. We also share our experiences in designing and implementing a suitable item embedding method for accurate similarity measurement under the long-tail effect. Together, they are now fully implemented and deployed in Xiaohongshu App's production recommender system, which serves the main Explore Feed product for tens of millions of users every day. We demonstrate the effectiveness and efficiency of the method through theoretical analysis, offline experiments, and online A/B tests.

Hierarchical Training: Scaling Deep Recommendation Models on Large CPU Clusters

Neural network based recommendation models are widely used to power many internet-scale applications, including product recommendation and feed ranking. As the models become more complex and require more training data, improving the training scalability of these recommendation models becomes an urgent need. However, improving the scalability without sacrificing model quality is challenging. In this paper, we conduct an in-depth analysis of the scalability bottlenecks in the existing training architecture on large-scale CPU clusters. Based on these observations, we propose a new training architecture called Hierarchical Training, which exploits both data parallelism and model parallelism for the neural network part of the model within a group. We implement hierarchical training with a two-layer design: a tagging system that decides the operator placement and a net transformation system that materializes the training plans, and integrate hierarchical training into the existing training stack. We propose several optimizations to improve the scalability of hierarchical training, including model architecture optimization, communication compression, and various system-level improvements. Extensive experiments at massive scale demonstrate that hierarchical training can speed up distributed recommendation model training by 1.9x without any drop in model quality.

Deep Inclusion Relation-aware Network for User Response Prediction at Fliggy

User response prediction plays a crucial role in many applications (e.g., search ranking and personalized recommendation) at online travel platforms. Although existing methods have achieved great success by focusing on feature interactions or user behaviors, they cannot jointly exploit item inclusion relations, which describe whether one item includes or is included by another and are important relationships among travel items. To this end, in this paper, we propose a novel Deep Inclusion Relation-aware Network (DIRN) for user response prediction that jointly exploits inclusion relations among travel items. Specifically, on the item graph constructed from inclusion relations, we first leverage a node embedding approach to learn the item graph-based embedding. Then, we design a Representation-based Interest Layer and a Relation Path Interest Layer to extract users' latent interest from user behaviors in two ways. The Representation-based Interest Layer models item-to-item similarity based on item representations that contain the graph-based embedding, using an attention mechanism, and obtains the user's temporal interest by summing the representations of interacted items weighted by their similarities. The Relation Path Interest Layer measures realistic item-to-item associations to extract user interest from inclusion relation paths. Offline experiments on real-world data from Fliggy clearly validate the effectiveness of DIRN. Furthermore, DIRN has been successfully deployed online in search ranking at Fliggy and achieves significant improvement.

MPCSL - A Modular Pipeline for Causal Structure Learning

The examination of causal structures is crucial for data scientists in a variety of machine learning application scenarios. In recent years, the corresponding interest in methods of causal structure learning has led to a wide spectrum of independent implementations, each having specific accuracy characteristics and introducing implementation-specific overhead in the runtime. Hence, considering a selection of algorithms or different implementations in different programming languages utilizing different hardware setups becomes a tedious manual task with high setup costs. Consequently, a tool that makes it possible to plug existing methods from different libraries into a single system to compare and evaluate the results provides substantial support for data scientists in their research efforts.

In this work, we propose an architectural blueprint of a pipeline for causal structure learning and outline our reference implementation MPCSL, which addresses the requirements of platform independence and modularity while ensuring the comparability and reproducibility of experiments. Moreover, we demonstrate the capabilities of MPCSL within a case study, where we evaluate existing implementations of the well-known PC-Algorithm with respect to their runtime performance characteristics.

Knowledge-Guided Efficient Representation Learning for Biomedical Domain

Pre-trained concept representations are essential to many biomedical text mining and natural language processing tasks. As such, various representation learning approaches have been proposed in the literature. More recently, contextualized embedding approaches (i.e., BERT-based models) that capture the implicit semantics of concepts at a granular level have significantly outperformed the conventional word embedding approaches (i.e., Word2Vec/GloVe-based models). Despite the significant accuracy gains achieved, these approaches are often computationally expensive and memory inefficient. To address this issue, we propose a new representation learning approach that efficiently adapts the concept representations to newly available data. Specifically, the proposed approach develops a knowledge-guided continual learning strategy wherein the accurate and stable context information present in human-curated knowledge bases is exploited to continually identify and retrain the representations of those concepts whose corpus-based context has evolved coherently over time. Different from previous studies that mainly leverage curated knowledge to improve the accuracy of embedding models, the proposed research explores the usefulness of semantic knowledge from the perspective of accelerating the training efficiency of embedding models. Comprehensive experiments under various efficiency constraints demonstrate that the proposed approach significantly improves the computational performance of biomedical word embedding models.

Bootstrapping for Batch Active Sampling

The goal of active learning is to select, from an unlabeled pool of data, the best examples to label in order to improve a model trained with the addition of these labeled examples. We discuss a real-world use case for batch active sampling that works at larger scales. The standard margin algorithm has repeatedly been shown to be difficult to beat in practice for the classic active sampling set-up, but for larger batches and candidate pools, we show that margin sampling may not provide enough diversity. We present a simple variant of margin sampling for the batch setting that scores candidate samples by their minimum margin over a set of bootstrapped models, and we explain how this proposal increases diversity in a supervised and efficient way, and why it differs from the usual ensemble methods for active sampling. Experiments on benchmark datasets show that the proposed min-margin sampling consistently works better than margin sampling as the batch size grows, and better than the five other diversity-encouraging active sampling methods we tested. Two real-world case studies illustrate the practical value and help highlight the challenges of applying and deploying batch active sampling.
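
The min-margin idea can be illustrated with a short sketch: fit several models on bootstrap resamples of the labeled data, score every unlabeled candidate by its minimum margin across those models, and label the lowest-scoring batch. The model class, batch size, and number of bootstraps below are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch of min-margin batch sampling: candidates are scored by their
# minimum margin over models fit on bootstrap resamples; the most ambiguous batch
# is selected for labeling. Model choice and hyper-parameters are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def min_margin_batch(X_lab, y_lab, X_pool, batch_size=10, n_boot=5, seed=0):
    rng = np.random.default_rng(seed)
    margins = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))      # bootstrap resample
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        proba = clf.predict_proba(X_pool)
        top2 = np.sort(proba, axis=1)[:, -2:]
        margins.append(top2[:, 1] - top2[:, 0])                 # classic margin score
    min_margin = np.min(np.stack(margins), axis=0)              # worst case over bootstraps
    return np.argsort(min_margin)[:batch_size]                  # most ambiguous candidates

X_lab, y_lab = np.random.randn(100, 5), np.random.randint(0, 2, 100)
X_pool = np.random.randn(1000, 5)
print(min_margin_batch(X_lab, y_lab, X_pool))
```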

FleetRec: Large-Scale Recommendation Inference on Hybrid GPU-FPGA Clusters

We present FleetRec, a high-performance and scalable recommendation inference system within tight latency constraints. FleetRec takes advantage of heterogeneous hardware including GPUs and the latest FPGAs equipped with high-bandwidth memory. By disaggregating computation and memory to different types of hardware and bridging their connections by high-speed network, FleetRec gains the best of both worlds, and can naturally scale out by adding nodes to the cluster. Experiments on three production models up to 114 GB show that FleetRec outperforms optimized CPU baseline by more than one order of magnitude in terms of throughput while achieving significantly lower latency.

Network Experimentation at Scale

We describe our network experimentation framework, deployed at Facebook, which accounts for interference between experimental units. We document this system, including the design and estimation procedures, and detail insights we have gained from the many experiments that have used this system at scale. In our estimation procedure, we introduce a cluster-based regression adjustment that substantially improves precision for estimating global treatment effects, as well as a procedure to test for interference. With our regression adjustment, we find that imbalanced clusters can better account for interference than balanced clusters without sacrificing accuracy. In addition, we show that logging exposure to a treatment can result in additional variance reduction. Interference is a widely acknowledged issue in online field experiments, yet there is relatively little evidence from real-world experiments demonstrating interference in online settings. We fill this gap by describing two case studies that capture significant network effects and highlight the value of this experimentation framework.

Addressing Non-Representative Surveys using Multiple Instance Learning

In recent years, non-representative survey sampling and non-response bias have constituted major obstacles to obtaining reliable population quantity estimates from finite survey samples. As such, researchers have been focusing on identifying methods to resolve these biases. In this paper, we look at this well-known problem from a fresh perspective and formulate it as a learning problem. To meet this challenge, we suggest solving the learning problem using a multiple instance learning (MIL) paradigm. We devise two different MIL-based neural network topologies, each based on a different implementation of an attention pooling layer. These models are trained to accurately infer the population quantity of interest even when facing a biased sample. To the best of our knowledge, this is the first time MIL has been suggested as a solution to this problem. In contrast to commonly used statistical methods, this approach can be accomplished without having to collect sensitive personal data of the respondents and without having to access population-level statistics of the same sensitive data. To validate the effectiveness of our approaches, we test them on a real-world movie rating dataset which is used to mimic a biased survey by experimentally contaminating it with different kinds of survey bias. We show that our suggested topologies outperform other MIL architectures and are able to partly counter the adverse effect of biased sampling on estimation quality. We also demonstrate how these methods can be easily adapted to perform well even when part of the survey is based on a small number of respondents.
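
For readers unfamiliar with attention pooling in MIL, the sketch below shows the generic pattern: each respondent is an instance, a learned scoring function assigns an attention weight to every instance, and the weighted average of instance embeddings feeds a regressor that outputs the bag-level (population) estimate. This is a generic illustration, not the authors' exact topology; dimensions and parameters are placeholders.

```python
# Illustrative sketch of an attention pooling layer for MIL: instances (respondents)
# are weighted by a learned attention score and averaged into a bag representation
# that yields the population-level estimate. Not the paper's exact architecture.
import numpy as np

rng = np.random.default_rng(1)
D, H = 16, 8
V = rng.normal(size=(D, H))          # attention projection
w = rng.normal(size=H)               # attention scoring vector
out = rng.normal(size=D)             # bag-level regressor weights

def attention_pool(instances: np.ndarray) -> float:
    """instances: (n_respondents, D) -> scalar population estimate."""
    scores = np.tanh(instances @ V) @ w          # one score per instance
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                         # attention weights sum to 1
    bag = alpha @ instances                      # weighted bag representation
    return float(bag @ out)

print(attention_pool(rng.normal(size=(50, D))))  # estimate from a sample of 50 respondents
```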

Micro-climate Prediction - Multi Scale Encoder-decoder based Deep Learning Framework

This paper presents a deep learning approach for a versatile micro-climate prediction framework (DeepMC). Micro-climate predictions are of critical importance across various applications, such as agriculture, forestry, energy, and search & rescue. To the best of our knowledge, there is no other single framework which can accurately predict various micro-climate entities using Internet of Things (IoT) data. We present a generic framework (DeepMC) which predicts various climatic parameters such as soil moisture, humidity, wind speed, radiation, and temperature, depending on the requirement, over a period of 12 to 120 hours with a resolution varying from 1 to 6 hours. This framework proposes the following new ideas: 1) localization of weather forecasts to IoT sensors by fusing weather station forecasts with the decomposition of IoT data at multiple scales, and 2) a multi-scale encoder and two levels of attention mechanisms which learn a latent representation of the interaction between the various resolutions of the IoT sensor data and the weather station forecasts. We present multiple real-world agricultural and energy scenarios and report results with uncertainty estimates from the live deployment of DeepMC, demonstrating that DeepMC outperforms various baseline methods and achieves 90%+ accuracy with tight error bounds.

Architecture and Operation Adaptive Network for Online Recommendations

Learning feature interactions is crucial for model performance in online recommendations. Extensive studies have been devoted to designing effective structures for learning interactive information in an explicit way, and tangible progress has been made. However, the core interaction calculations of these models are manually specified, such as the inner product, outer product, and self-attention, which results in high dependence on domain knowledge. Hence, model performance is bounded both by the limits of human experience and by the finiteness of candidate operations. In this paper, we propose a generalized interaction paradigm to lift this limitation, where the operations adopted by existing models can be regarded as special cases of it. Based on this paradigm, we design a novel model that adaptively explores and optimizes the operation itself according to the data, named the generalized interaction network (GIN). We prove that GIN is a generalized form of a wide range of state-of-the-art models, which means GIN can automatically search for the best operation among these models as well as over a broader underlying architecture space. Finally, an architecture adaptation method is introduced to further boost the performance of GIN by discriminating important interactions. Thereby, the architecture and operation adaptive network (AOANet) is presented. Experimental results on two large-scale datasets show the superiority of our model. AOANet has been deployed to industrial production. In a 7-day A/B test, the click-through rate increased by 10.94%, which represents considerable business benefits.

Diet Planning with Machine Learning: Teacher-forced REINFORCE for Composition Compliance with Nutrition Enhancement

Diet planning is a basic and regular human activity. Previous studies have treated diet planning as a combinatorial optimization problem to generate solutions that satisfy a diet's nutritional requirements. However, this approach does not consider the composition of diets, which is critical for diet recipients to accept and enjoy menus with high nutritional quality. Without this consideration, feasible solutions for diet planning cannot be provided in practice. This suggests the necessity of diet planning with machine learning, which extracts implicit composition patterns from real diet data and applies these patterns when generating diets. This work is original research that defines diet planning as a machine learning problem; we describe diets as sequence data and solve a controllable sequence generation problem. Specifically, we develop the Teacher-forced REINFORCE algorithm to connect neural machine translation and reinforcement learning for composition compliance with nutrition enhancement in diet generation. Through a real-world application to diet planning for children, we validated the superiority of our work over traditional combinatorial optimization and modern machine learning approaches, as well as over human (i.e., professional dietitian) performance. In addition, we construct and release databases of menus and diets to motivate and promote further research and development of diet planning with machine learning. We believe this work with data science will contribute to solving economic and social problems associated with diet planning.

SEMI: A Sequential Multi-Modal Information Transfer Network for E-Commerce Micro-Video Recommendations

The micro-video recommendation system has become an essential part of the e-commerce platform, which helps disseminate micro-videos to potentially interested users. Existing micro-video recommendation methods only focus on users' browsing behaviors on micro-videos but ignore their purchasing intentions in the e-commerce environment. Thus, they usually achieve unsatisfactory e-commerce micro-video recommendation performance. To address this problem, we design a sequential multi-modal information transfer network (SEMI), which utilizes product-domain user behaviors to assist micro-video recommendations. SEMI effectively selects relevant items (i.e., micro-videos and products) with multi-modal features in the micro-video domain and the product domain to characterize users' preferences. Moreover, we also propose a cross-domain contrastive learning (CCL) algorithm to pre-train sequence encoders for modeling users' sequential behaviors in these two domains. The objective of CCL is to maximize a lower bound of the mutual information between the different domains. We have performed extensive experiments on a large-scale dataset collected from Taobao, a world-leading e-commerce platform. Experimental results show that the proposed method achieves significant improvements over state-of-the-art recommendation methods. Moreover, the proposed method has also been deployed on Taobao, and the online A/B testing results further demonstrate its practical value.
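
A standard way to maximize a lower bound on mutual information between two domains is an InfoNCE-style contrastive objective, sketched below under the assumption that row i of the two embedding matrices encodes the same user in the micro-video and product domains; the paper's exact CCL loss may differ, and all names are illustrative.

```python
# Minimal sketch of an InfoNCE-style contrastive objective between two domains, a
# standard lower bound on mutual information. Row i of `z_video` and `z_product`
# are assumed to encode the same user; other rows in the batch act as negatives.
import numpy as np

def info_nce(z_video: np.ndarray, z_product: np.ndarray, temperature: float = 0.1) -> float:
    # Normalize so the dot product is a cosine similarity.
    a = z_video / np.linalg.norm(z_video, axis=1, keepdims=True)
    b = z_product / np.linalg.norm(z_product, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # matched pairs are the positives

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(32, 64)), rng.normal(size=(32, 64))))
```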

Dual Attentive Sequential Learning for Cross-Domain Click-Through Rate Prediction

Cross-domain recommender systems constitute a powerful method to tackle the cold-start and sparsity problems by aggregating and transferring user preferences across multiple category domains. Therefore, they have great potential to improve click-through-rate prediction performance in online commerce platforms that have many product domains. While several cross-domain sequential recommendation models have been proposed to leverage information from a source domain to improve CTR predictions in a target domain, they did not take into account the bidirectional latent relations of user preferences across source-target domain pairs. As such, they cannot provide enhanced cross-domain CTR predictions for both domains simultaneously. In this paper, we propose a novel approach to cross-domain sequential recommendations based on a dual learning mechanism that simultaneously transfers information between two related domains in an iterative manner until the learning process stabilizes. In particular, the proposed Dual Attentive Sequential Learning (DASL) model consists of two novel components, Dual Embedding and Dual Attention, which jointly establish a two-stage learning process: we first construct dual latent embeddings that extract user preferences in both domains simultaneously, and subsequently provide cross-domain recommendations by matching the extracted latent embeddings with candidate items through a dual-attention learning mechanism. We conduct extensive offline experiments on three real-world datasets to demonstrate the superiority of our proposed model, which significantly and consistently outperforms several state-of-the-art baselines across all experimental settings. We also conduct an online A/B test at Alibaba-Youku, a major video streaming platform, where our proposed model significantly improves business performance over the latest production system in the company.

Embedding-based Product Retrieval in Taobao Search

Nowadays, the product search service of e-commerce platforms has become a vital shopping channel in people's lives. The product retrieval phase determines the search system's quality and has gradually attracted researchers' attention. Retrieving the most relevant products from a large-scale corpus while preserving personalized user characteristics remains an open question. Recent approaches in this domain have mainly focused on embedding-based retrieval (EBR) systems. However, after a long period of practice on Taobao, we find that the performance of the EBR system is dramatically degraded due to (1) its low relevance to a given query and (2) the discrepancy between the training and inference phases. Therefore, we propose a novel and practical embedding-based product retrieval model, named Multi-Grained Deep Semantic Product Retrieval (MGDSPR). Specifically, we first identify the inconsistency between the training and inference stages, and then use the softmax cross-entropy loss as the training objective, which achieves better performance and faster convergence. Two efficient methods are further proposed to improve retrieval relevance, including smoothing noisy training data and generating relevance-improving hard negative samples without requiring extra knowledge or training procedures. We evaluate MGDSPR on Taobao Product Search with significant metric gains observed in offline experiments and online A/B tests. MGDSPR has been successfully deployed to the existing multi-channel retrieval system in Taobao Search. We also introduce the online deployment scheme and share practical lessons of our retrieval system to contribute to the community.

Debiasing Learning based Cross-domain Recommendation

As it becomes prevalent for user information to exist across multiple platforms or services, cross-domain recommendation has become an important task in industry. Although it is well known that users tend to show different preferences in different domains, existing studies seldom model how domain biases affect user preferences. Focusing on this issue, we develop a causality-based approach to mitigating domain biases when transferring user information across domains. To be specific, this paper presents a novel debiasing learning based cross-domain recommendation framework with causal embedding. In this framework, we design a novel Inverse-Propensity-Score (IPS) estimator tailored to the cross-domain scenario, and further propose three kinds of restrictions for propensity score learning. Our framework can be generally applied to various recommendation algorithms for cross-domain recommendation. Extensive experiments on both public and industry datasets have demonstrated the effectiveness of the proposed framework.
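
As background for the IPS idea, the sketch below shows the standard form of an inverse-propensity-weighted objective: each observed interaction is up-weighted by the inverse of its estimated observation probability, which debiases the empirical loss. The loss form, clipping, and names are illustrative assumptions; the paper's cross-domain estimator and its restrictions are more specific.

```python
# Minimal sketch of an inverse-propensity-score (IPS) weighted objective: each
# observed interaction is reweighted by 1 / propensity (its estimated probability
# of being observed), debiasing the empirical loss. Clipping and the binary
# cross-entropy form are assumptions for illustration.
import numpy as np

def ips_weighted_loss(pred, label, propensity, clip=0.05):
    """Binary cross-entropy reweighted by 1 / propensity."""
    p = np.clip(pred, 1e-7, 1 - 1e-7)
    w = 1.0 / np.clip(propensity, clip, 1.0)        # clipping keeps the variance bounded
    ce = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return float(np.mean(w * ce))

pred       = np.array([0.7, 0.2, 0.9])
label      = np.array([1.0, 0.0, 1.0])
propensity = np.array([0.8, 0.1, 0.4])              # how likely each sample was to be observed
print(ips_weighted_loss(pred, label, propensity))
```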

An Experimental Study of Quantitative Evaluations on Saliency Methods

It has long been argued that eXplainable AI (XAI) is an important technology for model and data exploration, validation, and debugging. To deploy XAI into actual systems, an executable and comprehensive evaluation of the quality of generated explanations is in high demand. In this paper, we briefly summarize the status quo of quantitative metrics for different properties of XAI, including evaluations of faithfulness, localization, sensitivity checks, and stability. With an exhaustive experimental study based on them, we conclude that, among all the typical methods we compare, no single explanation method dominates the others in all metrics. Nonetheless, Gradient-weighted Class Activation Mapping (Grad-CAM) and Randomized Input Sampling for Explanation (RISE) perform fairly well on most of the metrics. We further present a novel use of the evaluation results to diagnose the classification bases of models. We hope this work can serve as a guide for future research.

OpenBox: A Generalized Black-box Optimization Service

Black-box optimization (BBO) has a broad range of applications, including automatic machine learning, engineering, physics, and experimental design. However, it remains a challenge for users to apply BBO methods to their problems at hand with existing software packages, in terms of applicability, performance, and efficiency. In this paper, we build OpenBox, an open-source and general-purpose BBO service with improved usability. The modular design behind OpenBox also facilitates flexible abstraction and optimization of basic BBO components that are common in other existing systems. OpenBox is distributed, fault-tolerant, and scalable. To improve efficiency, OpenBox further utilizes "algorithm agnostic" parallelization and transfer learning. Our experimental results demonstrate the effectiveness and efficiency of OpenBox compared to existing systems.

Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding

Anomaly detection is a crucial task for monitoring the various statuses (i.e., metrics) of entities (e.g., manufacturing systems and Internet services), which are often characterized by multivariate time series (MTS). In practice, it is important to precisely detect anomalies, as well as to interpret the detected anomalies by localizing a group of the most anomalous metrics, to further assist failure troubleshooting. In this paper, we propose InterFusion, an unsupervised method that simultaneously models the inter-metric and temporal dependencies of MTS. Its core idea is to model the normal patterns inside MTS data through a hierarchical Variational AutoEncoder with two stochastic latent variables, each of which learns low-dimensional inter-metric or temporal embeddings. Furthermore, we propose an MCMC-based method to obtain reasonable embeddings and reconstructions at anomalous parts for MTS anomaly interpretation. Our evaluation experiments are conducted on four real-world datasets from different industrial domains (three existing and one newly published dataset collected through our pilot deployment of InterFusion). InterFusion achieves an average anomaly detection F1-Score higher than 0.94 and an anomaly interpretation performance of 0.87, significantly outperforming recent state-of-the-art MTS anomaly detection methods.

Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition

Named entity recognition (NER) is a fundamental component in many applications, such as web search and voice assistants. Although deep neural networks greatly improve the performance of NER, their requirement for large amounts of training data means they can hardly scale out to many languages in an industry setting. To tackle this challenge, cross-lingual NER transfers knowledge from a rich-resource language to low-resource languages through pre-trained multilingual language models. Instead of using training data in target languages, cross-lingual NER has to rely only on training data in source languages, optionally augmented with training data translated from the source languages. However, existing cross-lingual NER methods do not make good use of the rich unlabeled data in target languages, which is relatively easy to collect in industry applications. To address these opportunities and challenges, in this paper we describe our practice at Microsoft of leveraging such large amounts of unlabeled data in target languages in real production settings. To effectively extract weak supervision signals from the unlabeled data, we develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning. An empirical study on three benchmark datasets verifies that our approach establishes new state-of-the-art performance by clear margins. The NER techniques reported in this paper are now on their way to becoming a fundamental component for web ranking, the Entity Pane, Answers Triggering, and Question Answering in the Microsoft Bing search engine. Moreover, our techniques will also serve as part of the Spoken Language Understanding module for a commercial voice assistant. We plan to open-source the code of the prototype framework after deployment.

Unveiling Fake Accounts at the Time of Registration: An Unsupervised Approach

Online social networks (OSNs) are plagued by fake accounts. Existing fake account detection methods either require a manually labeled training set, which is time-consuming and costly, or rely on rich information about OSN accounts, e.g., content and behaviors, which incurs significant delay in detecting fake accounts. In this work, we propose UFA (Unveiling Fake Accounts) to detect fake accounts immediately after they are registered, in an unsupervised fashion. First, through a measurement study of registration patterns on a real-world registration dataset, we observe that fake accounts tend to cluster around outlier registration patterns, e.g., particular IP addresses and phone numbers. Then, we design an unsupervised learning algorithm to learn weights for all registration accounts and the features that reveal outlier registration patterns. Next, we construct a registration graph to capture the correlations between registration accounts and utilize a community detection method to detect fake accounts by analyzing the registration graph structure. We evaluate UFA using real-world WeChat datasets. Our results demonstrate that UFA achieves a precision of ~94% with a recall of ~80%, whereas a supervised variant requires 600K manual labels to obtain comparable performance. Moreover, UFA has been deployed by WeChat to detect fake accounts for more than one year. UFA detects 500K fake accounts per day with a precision of ~93% on average, as verified manually by the WeChat security team.

M6: Multi-Modality-to-Multi-Modality Multitask Mega-transformer for Unified Pretraining

Multimodal pretraining has demonstrated success in the downstream tasks of cross-modal representation learning. However, it has been limited to English data, and there is still a lack of large-scale datasets for multimodal pretraining in Chinese. In this work, we propose the largest dataset for pretraining in Chinese, which consists of over 1.9TB of images and 292GB of text. The dataset has broad coverage of domains, including encyclopedia articles, question answering, forum discussions, etc. Besides, we propose a method called M6, short for Multi-Modality-to-Multi-Modality Multitask Mega-transformer, for unified pretraining on data of a single modality and of multiple modalities. The model is pretrained with our proposed tasks, including text-to-text transfer, image-to-text transfer, and multi-modality-to-text transfer. These tasks endow the model with strong capabilities of understanding and generation. We scale the model to 10 billion parameters and build the largest pretrained model in Chinese. Experimental results show that our proposed M6 outperforms the baselines on a number of downstream tasks concerning both single and multiple modalities, and the 10B-parameter pretrained model demonstrates strong potential in the zero-shot learning setting.

PAM: Understanding Product Images in Cross Product Category Attribute Extraction

Understanding product attributes plays an important role in improving the online shopping experience for customers and serves as an integral part of constructing a product knowledge graph. Most existing methods focus on attribute extraction from text descriptions or utilize visual information from product images such as shape and color. Compared to the inputs considered in prior work, a product image in fact contains more information, represented by a rich mixture of words and visual clues with a layout carefully designed to impress customers. This work proposes a more inclusive framework that fully utilizes these different modalities for attribute extraction. Inspired by recent work in visual question answering, we use a transformer-based sequence-to-sequence model to fuse representations of the product text, Optical Character Recognition (OCR) tokens, and visual objects detected in the product image. The framework is further extended with the capability to extract attribute values across multiple product categories with a single model, by training the decoder to predict both product category and attribute value and conditioning its output on the product category. The model provides a unified attribute extraction solution desirable for an e-commerce platform that offers numerous product categories with a diverse body of product attributes. We evaluated the model on two product attributes, one with many possible values and one with a small set of possible values, over 14 product categories, and found the model achieves a 15% gain in recall and a 10% gain in F1 score compared to existing methods using text-only features.

Large-Scale Network Embedding in Apache Spark

Network embedding has been widely used in social recommendation and network analysis, for example in recommendation systems and anomaly detection with graphs. However, most previous approaches cannot handle large graphs efficiently, because (i) computation on graphs is often costly and (ii) the size of the graph or of the intermediate vector results can be prohibitively large, making them difficult to process on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark, which recursively partitions a graph into several small subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the embeddings of nodes at a linear cost. We demonstrate in various experiments that our proposed approach is able to handle graphs with billions of edges within a few hours and is at least 4 times faster than state-of-the-art approaches. Besides, it achieves up to 4.25% and 4.27% improvements on link prediction and node classification tasks, respectively. Finally, we deploy the proposed algorithm in two online games at Tencent for friend recommendation and item recommendation, where it improves on competing methods by up to 91.11% in running time and up to 12.80% in the corresponding evaluation metrics.

Intention-aware Heterogeneous Graph Attention Networks for Fraud Transactions Detection

Fraud transactions have been a major threat to the healthy development of e-commerce platforms; they not only damage the user experience but also disrupt the orderly operation of the market. User behavioral data is widely used to detect fraud transactions, and recent works show that accurately modeling user intentions in behavioral sequences can drive further performance improvements. However, most existing methods treat each transaction as an independent data instance without considering the transaction-level interactions expressed by transaction attributes, e.g., remark, logistics, payment, and device information, and may therefore fail to achieve satisfactory results in more complex scenarios. In this paper, a novel heterogeneous transaction-intention network is devised to leverage the cross-interaction information over transactions and intentions; it consists of two types of nodes, namely transaction and intention nodes, and two types of edges, i.e., transaction-intention and transaction-transaction edges. We then propose a graph neural method named IHGAT (Intention-aware Heterogeneous Graph ATtention networks) that not only perceives sequence-like intentions but also encodes the relationships among transactions. Extensive experiments on a real-world dataset from the Alibaba platform show that our proposed algorithm outperforms state-of-the-art methods in both offline and online modes.

JIZHI: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu

In modern internet industries, deep learning based recommender systems have become an indispensable building block for a wide spectrum of applications, such as search engines, news feeds, and short video clips. However, it remains challenging to serve well-trained deep models for online real-time inference, with respect to time-varying web-scale traffic from billions of users, in a cost-effective manner. In this work, we present JIZHI, a Model-as-a-Service system that handles hundreds of millions of online inference requests per second to huge deep models with trillions of sparse parameters, for over twenty real-time recommendation services at Baidu, Inc. In JIZHI, the inference workflow of every recommendation request is transformed into a Staged Event-Driven Pipeline (SEDP), where each node in the pipeline is a staged computation- or I/O-intensive task processor. As real-time inference requests arrive, each modularized processor can run fully asynchronously and be managed separately. Besides, JIZHI introduces heterogeneous and hierarchical storage to further accelerate the online inference process by reducing unnecessary computation and the potential data access latency induced by ultra-sparse model parameters. Moreover, an intelligent resource manager has been deployed to maximize the throughput of JIZHI over the shared infrastructure by searching for the optimal resource allocation plan from historical logs and fine-tuning the load shedding policies with intermediate system feedback. Extensive experiments demonstrate the advantages of JIZHI in terms of end-to-end service latency, system-wide throughput, and resource consumption. Since its launch in July 2019, JIZHI has helped Baidu save more than ten million US dollars in hardware and utility costs per year while handling 200% more traffic without sacrificing inference efficiency.

Categorization of Financial Transactions in QuickBooks

This paper shares our work on building a machine learning system to categorize transactions for Intuit's QuickBooks product. Transaction categorization is challenging due to the complexity of accounting, the need for personalization, and the diversity of customers. We have broken this monolithic problem down into smaller pieces based on customers' life-cycle stages and tailored solutions to address customer pain points in each. Modern machine learning techniques such as deep neural networks, transfer learning, and few-shot learning are adopted to enable accurate transaction categorization. Furthermore, our system learns from user actions in real time to provide relevant and timely category recommendations. This in-session learning capability reduces user workload, improves the customer experience, and helps to cultivate confidence.

KompaRe: A Knowledge Graph Comparative Reasoning System

Reasoning is a fundamental capability for harnessing valuable insights, knowledge, and patterns from knowledge graphs. Existing work has primarily focused on point-wise reasoning, including search, link prediction, entity prediction, subgraph matching, and so on. This paper introduces comparative reasoning over knowledge graphs, which aims to infer commonality and inconsistency with respect to multiple pieces of clues. We envision that comparative reasoning will complement and expand existing point-wise reasoning over knowledge graphs. In detail, we develop KompaRe, the first-of-its-kind prototype system that provides comparative reasoning capability over large knowledge graphs. We present both the system architecture and its core algorithms, including knowledge segment extraction, pairwise reasoning, and collective reasoning. Empirical evaluations demonstrate the efficacy of the proposed KompaRe.

Trustworthy and Powerful Online Marketplace Experimentation with Budget-split Design

Online experimentation, also known as A/B testing, is the gold standard for measuring product impact and making business decisions in the tech industry. The validity and utility of experiments, however, hinge on unbiasedness and sufficient power. In two-sided online marketplaces, both requirements are called into question. Bernoulli randomized experiments are biased because treatment units interfere with control units through market competition, violating the "stable unit treatment value assumption" (SUTVA). The experimental power on at least one side of the market is often insufficient because of disparate sample sizes on the two sides. Despite the importance of online marketplaces to the online economy and the crucial role experimentation plays in product development, there is no effective and practical solution to the bias and low-power problems in marketplace experimentation. In this paper we address this shortcoming by proposing the budget-split design, which is unbiased in any marketplace where buyers have a finite or infinite budget. We show that it is more powerful than all other unbiased designs in the literature. We then provide a generalizable system architecture for deploying this design to online marketplaces. Finally, we confirm the effectiveness of our proposal with empirical results from experiments run in two real-world online marketplaces. We demonstrate how it achieves over a 15x gain in experimental power and removes market-competition-induced bias, which can be up to 230% of the treatment effect size.

SESSION: ADS Track Papers

Lane Change Scheduling for Autonomous Vehicle: A Prediction-and-Search Framework

Automation in road vehicles is an emerging technology that has developed rapidly over the last decade. Autonomous vehicles (AVs) pose many inter-disciplinary challenges to the existing transportation infrastructure. In this paper, we conduct an algorithmic study of when and how an autonomous vehicle should change lanes, a fundamental problem in the vehicle automation field and the root cause of most 'phantom' traffic jams. We propose a prediction-and-search framework, called Cheetah (Change lane smart for autonomous vehicle), which aims to optimize the lane changing maneuvers of an autonomous vehicle while minimizing their impact on surrounding vehicles. In the prediction phase, Cheetah learns the spatio-temporal dynamics from historical trajectories of surrounding vehicles with a deep model (GAS-LED) and predicts their actions in the near future. A global attention mechanism and a state sharing strategy are also incorporated to achieve higher accuracy and better convergence efficiency. In the search phase, Cheetah then looks for optimal lane change maneuvers for the autonomous vehicle by taking into account factors such as speed, impact on other vehicles, and safety. A tree-based adaptive beam search algorithm is designed to reduce the search space and improve accuracy. Extensive experiments on real and synthetic data show that the proposed framework outperforms state-of-the-art competitors with respect to both effectiveness and efficiency.

Neural Auction: End-to-End Learning of Auction Mechanisms for E-Commerce Advertising

In e-commerce advertising, it is crucial to jointly consider various performance metrics, e.g., user experience, advertiser utility, and platform revenue. Traditional auction mechanisms, such as GSP and VCG auctions, can be suboptimal due to their fixed allocation rules, which optimize a single performance metric (e.g., revenue or social welfare). Recently, data-driven auctions, learned directly from auction outcomes to optimize multiple performance metrics, have attracted increasing research interest. However, auction mechanisms involve various discrete operations, making them challenging to integrate with the continuous optimization pipelines used in machine learning. In this paper, we design Deep Neural Auctions (DNAs) to enable end-to-end auction learning by proposing a differentiable model that relaxes the discrete sorting operation, a key component of auctions. We optimize the performance metrics by developing deep models to efficiently extract contexts from auctions, providing rich features for auction design. We further integrate game-theoretic conditions into the model design to guarantee the stability of the auctions. DNAs have been successfully deployed in the e-commerce advertising system at Taobao. Experimental evaluations on both a large-scale dataset and online A/B tests demonstrate that DNAs significantly outperform other mechanisms widely adopted in industry.
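
The key obstacle the abstract highlights is that sorting is discrete. One widely used continuous relaxation, in the spirit of NeuralSort, replaces the permutation matrix with row-wise softmaxes; the sketch below is illustrative only and is not claimed to be the exact operator used in DNAs:

    import numpy as np

    def soft_sort_matrix(scores, tau=1.0):
        """NeuralSort-style continuous relaxation of the sorting permutation.

        Returns a row-stochastic matrix P (n x n); P @ scores approximates the
        scores sorted in decreasing order, and P is differentiable in `scores`.
        """
        s = np.asarray(scores, dtype=float).reshape(-1, 1)          # (n, 1)
        n = s.shape[0]
        abs_diff = np.abs(s - s.T)                                   # |s_j - s_k|
        b = abs_diff.sum(axis=1)                                     # (n,)
        c = (n + 1 - 2 * np.arange(1, n + 1)).reshape(-1, 1)         # (n, 1)
        logits = (c * s.T - b.reshape(1, -1)) / tau                  # (n, n)
        logits -= logits.max(axis=1, keepdims=True)                  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    scores = np.array([0.2, 1.5, 0.7])
    P = soft_sort_matrix(scores, tau=0.1)
    print(P @ scores)   # approximately [1.5, 0.7, 0.2]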

Pre-trained Language Model for Web-scale Retrieval in Baidu Search

Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically related candidates in the retrieval stage is very promising for exposing more high-quality results to end users. However, building and deploying effective retrieval models for semantic matching in a real search engine remains a non-trivial challenge. In this paper, we describe the retrieval system that we developed and deployed in Baidu Search. The system exploits a recent state-of-the-art Chinese pretrained language model, namely Enhanced Representation through kNowledge IntEgration (ERNIE), which provides the system with expressive semantic matching. In particular, we developed an ERNIE-based retrieval model equipped with 1) expressive Transformer-based semantic encoders and 2) a comprehensive multi-stage training paradigm. More importantly, we present a practical system workflow for deploying the model in web-scale retrieval. The system is fully deployed in production, where rigorous offline and online experiments were conducted. The results show that the system can retrieve high-quality candidates, especially for tail queries with uncommon demands. Overall, the new retrieval system, facilitated by the pretrained language model (i.e., ERNIE), largely improves the usability and applicability of our search engine.

Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook

In this paper, we present Que2Search, a deployed query and product understanding system for search. Que2Search leverages multi-task and multi-modal learning approaches to train query and product representations. We achieve over 5% absolute offline relevance improvement and over 4% online engagement gain over the state-of-the-art Facebook product understanding system by combining the latest multilingual natural language understanding architectures, such as XLM and XLM-R, with multi-modal fusion techniques. We describe how we deploy an XLM-based search query understanding model that runs in under 1.5 ms at the 99th percentile (P99) on CPU at Facebook scale, which has been a significant challenge in the industry, and what model optimizations worked (and what did not) based on numerous offline and online A/B experiments. We deploy Que2Search to Facebook Marketplace Search and share our experience deploying to production, along with tuning tricks for achieving higher efficiency in online A/B experiments. Que2Search has demonstrated gains in production applications and operates at Facebook scale.

AliCoCo2: Commonsense Knowledge Extraction, Representation and Application in E-commerce

Commonsense knowledge used by humans while shopping online is valuable but difficult for existing systems running on e-commerce platforms to capture. While the construction of commonsense knowledge graphs in e-commerce is non-trivial, representation learning on such graphs poses unique challenges compared to well-studied open-domain knowledge graphs (e.g., Freebase). By leveraging commonsense knowledge and representation techniques, various applications in e-commerce can benefit. Based on AliCoCo, the large-scale e-commerce concept net assisting a series of core businesses in Alibaba, we further enrich it with more commonsense relations and present AliCoCo2, the first commonsense knowledge graph constructed for e-commerce use. We propose a multi-task encoder-decoder framework to provide effective representations for the nodes and edges of AliCoCo2. To explore the possibility of improving e-commerce businesses with commonsense knowledge, we apply the newly mined commonsense relations and learned embeddings to an e-commerce search engine and recommendation system in different ways. Experimental results demonstrate that our proposed representation learning method achieves state-of-the-art performance on the task of knowledge graph completion (KGC), and applications to search and recommendation indicate the great potential value of constructing and using a commonsense knowledge graph in e-commerce. Besides, we propose an e-commerce QA task with a new benchmark, built during the construction of AliCoCo2, for testing machine common sense in e-commerce, which can benefit the research community in exploring commonsense reasoning.

What Happened Next? Using Deep Learning to Value Defensive Actions in Football Event-Data

Objectively quantifying the value of player actions in football (soccer) is a challenging problem. To date, studies in football analytics have mainly focused on the attacking side of the game, with less work on event-driven metrics for valuing defensive actions (e.g., tackles and interceptions). In this paper, therefore, we use deep learning techniques to define a novel metric that values such defensive actions by studying the threat of the passages of play that preceded them. By doing so, we are able to value defensive actions based on what they prevented from happening in the game. Our Defensive Action Expected Threat (DAxT) model has been validated using real-world event data from the 2017/2018 and 2018/2019 English Premier League seasons, and we combine our model outputs with additional features to derive an overall rating of defensive ability for players. Overall, we find that our model is able to predict the impact of defensive actions, allowing us to better value defenders using event data.

VisualTextRank: Unsupervised Graph-based Content Extraction for Automating Ad Text to Image Search

Numerous online stock image libraries offer high-quality yet copyright-free images for use in marketing campaigns. To assist advertisers in navigating such third-party libraries, we study the problem of automatically fetching relevant ad images given the ad text (via a short textual query for images). Motivated by our observations in logged data on ad image search queries (given ad text), we formulate a keyword extraction problem, where a keyword extracted from the ad text (or its augmented version) serves as the ad image query. In this context, we propose VisualTextRank: an unsupervised method to (i) augment the input ad text using semantically similar ads, and (ii) extract the image query from the augmented ad text. VisualTextRank builds on prior work on graph-based context extraction (biased TextRank in particular) by leveraging both the text and the images of similar ads for better keyword extraction, and by using advertiser-category-specific biasing with sentence-BERT embeddings. Using data collected from the Verizon Media Native (Yahoo Gemini) ad platform's stock image search feature for onboarding advertisers, we demonstrate the superiority of VisualTextRank over competitive keyword extraction baselines (including an 11% accuracy lift over biased TextRank). For the case where the stock image library is restricted to English queries, we show the effectiveness of VisualTextRank on multilingual ads (translated to English) while leveraging semantically similar English ads. Online tests with a simplified version of VisualTextRank led to a 28.7% increase in the usage of stock image search and a 41.6% increase in the advertiser onboarding rate on the Verizon Media Native ad platform.
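
For readers unfamiliar with biased TextRank, the prior work that VisualTextRank extends, the core computation is a personalized PageRank over a candidate-similarity graph; below is a minimal sketch with hypothetical toy inputs, not the authors' implementation:

    import numpy as np

    def biased_textrank(sim_matrix, bias, damping=0.85, iters=100):
        """Personalized PageRank over a candidate-similarity graph.

        sim_matrix: (n, n) nonnegative similarities between candidate keywords
                    (e.g., cosine similarity of sentence-BERT embeddings)
        bias:       (n,) nonnegative relevance of each candidate to the focus
                    context (e.g., the advertiser category); acts as the
                    teleport distribution of the biased random walk
        """
        W = np.array(sim_matrix, dtype=float)
        np.fill_diagonal(W, 0.0)
        W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalize
        b = np.asarray(bias, dtype=float)
        b = b / max(b.sum(), 1e-12)
        r = np.full(len(b), 1.0 / len(b))
        for _ in range(iters):
            r = (1 - damping) * b + damping * (W.T @ r)
        return r      # higher score -> better keyword candidate

    sims = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
    bias = np.array([0.7, 0.2, 0.1])
    print(biased_textrank(sims, bias))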

Zero-shot Multi-lingual Interrogative Question Generation for "People Also Ask" at Bing

Multi-lingual question generation (QG) is the task of generating natural language questions for a given answer passage in any target language. In this paper, we design a system for supporting multi-lingual QG in the "People Also Ask" (PAA) module for Bing. In the zero-shot setting, the primary challenge is to transfer knowledge from a QG model trained in the pivot language to other languages without adding training data in those languages. Compared to other zero-shot tasks, the differentiating and challenging aspect of QG is preserving the question structure so that the resulting output is interrogative. Existing models for similar tasks tend to generate natural language queries or copy sub-spans of the passage, failing to preserve the question structure. In our work, we demonstrate how knowledge transfer in multi-lingual IQG (Interrogative QG) can be significantly improved using auxiliary tasks, either in a multi-task or a pre-training setting. We explore two kinds of tasks - cross-lingual translation and multi-lingual denoising auto-encoding of questions - especially when using translate-train. Using data for 13 languages from Bing PAA as well as online A/B tests, we show that both of these tasks significantly improve the quality of zero-shot IQG on non-trained languages.

Diversity driven Query Rewriting in Search Advertising

Retrieving keywords (bidwords) with the same intent as the query, referred to as close variant keywords, is of prime importance for effective targeted search advertising. For head and torso search queries, sponsored search engines use a huge repository of same-intent queries and keywords mined ahead of time. Online, this repository is used to rewrite the query and then look up the rewrite in a repository of bid keywords, contributing to significant revenue. Recently, generative retrieval models have been shown to be effective at the task of generating such query rewrites. We observe two main limitations of such generative models. First, rewrites generated by these models exhibit low lexical diversity and hence fail to retrieve relevant keywords that have diverse linguistic variations. Second, there is a misalignment between the training objective - the likelihood of the training data - and what we desire - improved quality and coverage of rewrites. In this work, we introduce CLOVER, a framework that generates both high-quality and diverse rewrites by optimizing for human assessments of rewrite quality using our diversity-driven reinforcement learning algorithm. We use an evaluation model, trained to predict human judgments, as the reward function to fine-tune the generation policy. We empirically show the effectiveness of our proposed approach through offline experiments on search queries across geographies spanning three major languages. We also perform online A/B experiments on Bing, a large commercial search engine, which show (i) better user engagement, with an average increase in clicks of 12.83% accompanied by an average defect reduction of 13.97%, and (ii) improved revenue by 21.29%.

SizeFlags: Reducing Size and Fit Related Returns in Fashion E-Commerce

E-commerce is growing at an unprecedented rate, and the fashion industry has recently witnessed a noticeable shift in customers' ordering behaviour towards online shopping. However, fashion articles ordered online do not always find their way into a customer's wardrobe; in fact, a large share of them end up being returned. Finding clothes that fit online is very challenging and is one of the main drivers of increased return rates in fashion e-commerce. Size and fit related returns severely impact (1) the customer experience and satisfaction with online shopping, (2) the environment, through an increased carbon footprint, and (3) the profitability of online fashion platforms. Due to poor fit, customers often end up returning articles that they like but that do not fit them, which they then have to re-order in a different size. To tackle this issue we introduce SizeFlags, a probabilistic Bayesian model based on weakly annotated large-scale customer data. Leveraging the advantages of the Bayesian framework, we extend our model to successfully integrate rich priors from human expert feedback and computer vision. Through extensive experimentation, large-scale A/B testing, and continuous evaluation of the model in production, we demonstrate the strong impact of the proposed approach in robustly reducing size-related returns in online fashion across 14 countries.

A Multi-Graph Attributed Reinforcement Learning based Optimization Algorithm for Large-scale Hybrid Flow Shop Scheduling Problem

The Hybrid Flow Shop Scheduling Problem (HFSP) is an essential problem in automated warehouse scheduling, aiming to optimize the sequence of jobs and the assignment of machines so as to minimize the makespan or other objectives. Existing algorithms adopt a fixed search paradigm based on expert knowledge to seek satisfactory solutions. However, given the varying data distributions and large scale of practical HFSP instances, these methods fail to guarantee solution quality under real-time requirements, especially when facing very different data distributions. To address this challenge, we propose a novel Multi-Graph Attributed Reinforcement Learning based Optimization (MGRO) algorithm to better tackle the practical large-scale HFSP and improve on existing algorithms. By combining reinforcement learning-based policy search with classic search operators and a powerful multi-graph based representation, MGRO is capable of adjusting the search paradigm to specific instances and enhancing search efficiency. Specifically, we formulate the Gantt chart of an instance as multi-graph-structured data. A Graph Neural Network (GNN) and attention-based adaptive weighted pooling are then employed to represent the state and make MGRO size-agnostic across instances of arbitrary size. In addition, a reward shaping approach is designed to facilitate model convergence. Extensive numerical experiments on both a publicly available dataset and a real industrial dataset from the Huawei Supply Chain Business Unit demonstrate the superiority of MGRO over existing baselines.

AttDMM: An Attentive Deep Markov Model for Risk Scoring in Intensive Care Units

Clinical practice in intensive care units (ICUs) requires early warnings when a patient's condition is about to deteriorate so that preventive measures can be taken. To this end, prediction algorithms have been developed that estimate the risk of mortality in ICUs. In this work, we propose a novel generative deep probabilistic model for real-time risk scoring in ICUs. Specifically, we develop an attentive deep Markov model called AttDMM. To the best of our knowledge, AttDMM is the first ICU prediction model that jointly learns both long-term disease dynamics (via attention) and different disease states in the health trajectory (via a latent variable model). Our evaluation is based on an established baseline dataset (MIMIC-III) with 53,423 ICU stays. The results confirm that AttDMM outperforms state-of-the-art baselines: it achieved an area under the receiver operating characteristic curve (AUROC) of 0.876, an improvement of 2.2% over the state-of-the-art method. In addition, the risk scores from AttDMM provided warnings several hours earlier. Our model thereby shows a path towards identifying at-risk patients so that health practitioners can intervene early and save patient lives.

Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization

Tuning complex machine learning systems is challenging. Machine learning typically requires setting hyperparameters, be they regularization, architecture, or optimization parameters, whose tuning is critical for good predictive performance. To democratize access to machine learning systems, it is essential to automate this tuning. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free optimization at scale. AMT finds the best version of a machine learning model by repeatedly evaluating it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best model, as measured by a user-chosen metric. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture, our design principles, and lessons learned. We also describe more advanced features of AMT, such as automated early stopping and warm-starting, and show in experiments their benefits to users.
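
As a library-agnostic illustration of the simpler of the two search strategies mentioned above (random search; this is not the SageMaker AMT API, and all names are hypothetical), a tuning loop reduces to sampling configurations and keeping the best one under a user-chosen metric:

    import random

    def random_search(train_and_eval, space, budget=20, seed=0):
        """Evaluate `budget` random hyperparameter configurations and keep the best.

        train_and_eval: callable taking a config dict and returning a metric to maximize
        space:          dict mapping hyperparameter name -> callable sampler
        """
        rng = random.Random(seed)
        best_cfg, best_score = None, float("-inf")
        for _ in range(budget):
            cfg = {name: sample(rng) for name, sample in space.items()}
            score = train_and_eval(cfg)
            if score > best_score:
                best_cfg, best_score = cfg, score
        return best_cfg, best_score

    # toy objective: pretend validation accuracy peaks at lr ~ 0.1 and depth ~ 6
    space = {
        "lr": lambda rng: 10 ** rng.uniform(-4, 0),
        "depth": lambda rng: rng.randint(2, 10),
    }
    obj = lambda cfg: -abs(cfg["lr"] - 0.1) - 0.01 * abs(cfg["depth"] - 6)
    print(random_search(obj, space, budget=50))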

User Consumption Intention Prediction in Meituan

For online life service platforms such as Meituan, user consumption intention, as the internal driving force of consumption behaviors, plays a significant role in understanding and predicting users' demand and purchases. However, predicting user consumption intention is quite challenging. Unlike consumption behaviors, consumption intention is implicit and often not reflected in behavioral data. Moreover, it is affected by both the user's intrinsic preference and the spatio-temporal context. To overcome these challenges, at Meituan we design a real-world system consisting of two stages: intention detection and intention prediction. At the intention-detection stage, we combine the knowledge of human experts with consumption information to obtain explicit intentions and match consumption with intentions based on user review data. At the intention-prediction stage, to collectively exploit the rich heterogeneous influencing factors, we design a graph neural network-based intention prediction model, GRIP, which captures user intrinsic preference and spatio-temporal context. Extensive offline evaluations demonstrate that our prediction model outperforms the best baseline by 10.26% and 33.28% on two metrics, and online A/B tests on millions of users validate the effectiveness of our system.

Bootstrapping Recommendations at Chrome Web Store

Google Chrome, one of the world's most popular web browsers, features an extension framework allowing third-party developers to enhance Chrome's functionality. Chrome extensions are distributed through the Chrome Web Store (CWS), a Google-operated online marketplace. In this paper, we describe how we developed and deployed three recommender systems for discovering relevant extensions in CWS: non-personalized recommendations, related extension recommendations, and personalized recommendations. Unlike most existing papers that focus on novel algorithms, this paper focuses on sharing practical experience in building large-scale recommender systems under various real-world constraints, such as privacy constraints, data sparsity and skewness issues, and product design choices (e.g., user interface). We show how these constraints make standard approaches difficult to succeed with in practice. We share success stories of turning negative live metrics into positive ones, including: 1) how we use interpretable neural models to bootstrap the systems, help identify pipeline issues, and pave the way for more advanced models; 2) a new item-item based algorithm for related recommendations that works under highly skewed data distributions; and 3) how the previous two techniques help bootstrap the personalized recommendations, which significantly reduces development cycles and bypasses various real-world difficulties. All the explorations in this work are verified in live traffic on millions of users. We believe that the findings in this paper can help practitioners build better large-scale recommender systems.

Lambda Learner: Fast Incremental Learning on Data Streams

One of the most well-established applications of machine learning is deciding what content to show website visitors. When observation data comes from high-velocity, user-generated data streams, machine learning methods must balance model complexity, training time, and computational cost. Furthermore, when model freshness is critical, training becomes time-constrained. Parallelized batch offline training, although horizontally scalable, is often neither timely nor cost-effective. In this paper, we propose Lambda Learner, a new framework for training models via incremental updates in response to mini-batches from data streams. We show that the resulting model closely approximates a periodically updated model trained on offline data and outperforms it when model updates are time-sensitive. We provide a theoretical proof that the incremental learning updates improve the loss function over a stale batch model. We present a large-scale deployment on the sponsored content platform of a large social network, serving hundreds of millions of users across different channels (e.g., desktop, mobile). We address challenges and complexities from both the algorithmic and infrastructure perspectives, describe the system details for computation, storage, and stream processing of training data, and open-source the system.
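
To make "incremental updates in response to mini-batches" concrete, the sketch below applies streaming gradient updates to a logistic scoring model starting from a (possibly stale) batch-trained weight vector; the learning rate, regularizer, and model form are illustrative assumptions, not the paper's exact update:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def incremental_update(w, X_batch, y_batch, lr=0.05, l2=1e-4):
        """One mini-batch update of a logistic model on top of a stale batch model.

        w:        current weights (initialized from the offline batch model)
        X_batch:  (m, d) features from the latest slice of the data stream
        y_batch:  (m,) binary labels (e.g., click / no click)
        """
        p = sigmoid(X_batch @ w)
        grad = X_batch.T @ (p - y_batch) / len(y_batch) + l2 * w
        return w - lr * grad

    # simulate a stream of mini-batches refreshing the model between full retrains
    rng = np.random.default_rng(0)
    w = np.zeros(5)                               # stale batch model (here: cold start)
    for _ in range(100):
        X = rng.normal(size=(32, 5))
        y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
        w = incremental_update(w, X, y)
    print(np.round(w, 2))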

RAPT: Pre-training of Time-Aware Transformer for Learning Robust Healthcare Representation

With the development of electronic health records (EHRs), prenatal care examination records have become available for developing automatic prediction and diagnosis approaches with machine learning methods. In this paper, we study how to effectively learn representations of EHR data that can be applied to various downstream tasks. Although several methods have been proposed in this direction, they usually adapt classic sequential models to solve one specific diagnosis task or to address particular EHR data issues. This makes it difficult to reuse these existing methods for the early diagnosis of pregnancy complications or to provide a general solution to the series of health problems caused by pregnancy complications. To this end, we propose a novel model, RAPT, which stands for RepresentAtion by Pre-training time-aware Transformer. To connect pre-training with EHR data, we design an architecture that is suitable for both modeling EHR data and pre-training, namely a time-aware Transformer. To handle the characteristics of EHR data, such as insufficiency, incompleteness, and short sequences, we carefully devise three pre-training tasks, namely similarity prediction, masked prediction, and reasonability checking. In this way, our representations can capture various characteristics of EHR data. Extensive experimental results on four downstream tasks show the effectiveness of the proposed approach. We also introduce sensitivity analysis to interpret the model and design an interface to show results and interpretations to doctors. Finally, we implement a diagnosis system for pregnancy complications based on our pre-training model. Doctors and pregnant women can benefit from the system in the early diagnosis of pregnancy complications.

A Bayesian Approach to In-Game Win Probability in Soccer

In-game win probability models, which provide a sports team's likelihood of winning at each point in a game based on historical observations, are becoming increasingly popular. In baseball, basketball and American football, they have become important tools to enhance fan experience, to evaluate in-game decision-making, and to inform coaching decisions. While equally relevant in soccer, the adoption of these models is held back by technical challenges arising from the low-scoring nature of the sport.

In this paper, we introduce an in-game win probability model for soccer that addresses the shortcomings of existing models. First, we demonstrate that in-game win probability models for other sports struggle to provide accurate estimates for soccer, especially towards the end of a game. Second, we introduce a novel Bayesian statistical framework that estimates running win, tie and loss probabilities by leveraging a set of contextual game state features. An empirical evaluation on eight seasons of data for the top-five soccer leagues demonstrates that our framework provides well-calibrated probabilities. Furthermore, two use cases show its ability to enhance fan experience and to evaluate performance in crucial game situations.
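
As a toy illustration of why the low-scoring nature of soccer matters for in-game win probability (a simplistic Poisson simulation with assumed goal rates, not the paper's Bayesian framework or its game-state features):

    import numpy as np

    def win_tie_loss(score_home, score_away, mins_left,
                     rate_home=1.4, rate_away=1.1, n_sims=100_000, seed=0):
        """Monte-Carlo in-game win/tie/loss probabilities under Poisson scoring.

        rate_home / rate_away: assumed expected goals per 90 minutes for each team.
        """
        rng = np.random.default_rng(seed)
        extra_home = rng.poisson(rate_home * mins_left / 90.0, n_sims)
        extra_away = rng.poisson(rate_away * mins_left / 90.0, n_sims)
        final_home = score_home + extra_home
        final_away = score_away + extra_away
        return (np.mean(final_home > final_away),
                np.mean(final_home == final_away),
                np.mean(final_home < final_away))

    # home team leading 1-0 with 20 minutes to play
    print(win_tie_loss(1, 0, 20))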

Contextual Bandit Applications in a Customer Support Bot

Virtual support agents have grown in popularity as a way for businesses to provide better and more accessible customer service. Challenges in this domain include ambiguous user queries as well as changing support topics and user behavior (non-stationarity). We do, however, have access to partial feedback provided by the user (clicks, surveys, and other events) which can be leveraged to improve the user experience. Adaptive learning techniques, like contextual bandits, are a natural fit for this problem setting. In this paper, we discuss real-world implementations of contextual bandits (CB) for the Microsoft virtual agent. These include intent disambiguation based on neural-linear bandits (NLB) and contextual recommendations based on a collection of multi-armed bandits (MAB). Our solutions have been deployed to production and have improved key business metrics of the Microsoft virtual agent, as confirmed by A/B experiments. Results include a relative increase of over 12% in the problem resolution rate and a relative decrease of over 4% in escalations to a human operator. While our current use cases focus on intent disambiguation and contextual recommendation for support bots, we believe our methods can be extended to other domains.
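
For context, the skeleton of a contextual bandit loop with an epsilon-greedy policy over per-arm linear reward models looks like the sketch below; it illustrates the problem setting only and is not Microsoft's neural-linear or MAB implementation:

    import numpy as np

    class EpsilonGreedyLinearBandit:
        """Per-arm ridge-regression reward models with epsilon-greedy exploration."""

        def __init__(self, n_arms, dim, epsilon=0.1, l2=1.0, seed=0):
            self.epsilon = epsilon
            self.rng = np.random.default_rng(seed)
            self.A = [l2 * np.eye(dim) for _ in range(n_arms)]   # X^T X + l2*I per arm
            self.b = [np.zeros(dim) for _ in range(n_arms)]      # X^T r per arm

        def select(self, context):
            if self.rng.random() < self.epsilon:
                return int(self.rng.integers(len(self.A)))       # explore
            est = [context @ np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
            return int(np.argmax(est))                           # exploit

        def update(self, arm, context, reward):                  # partial feedback
            self.A[arm] += np.outer(context, context)
            self.b[arm] += reward * context

    bandit = EpsilonGreedyLinearBandit(n_arms=3, dim=4)
    ctx = np.array([1.0, 0.2, -0.5, 0.3])
    arm = bandit.select(ctx)
    bandit.update(arm, ctx, reward=1.0)    # e.g., the user clicked the suggested answer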

Predicting COVID-19 Spread from Large-Scale Mobility Data

To manage the COVID-19 epidemic effectively, decision-makers in public health need accurate forecasts of case numbers. A potential near-real-time predictor of future case numbers is human mobility; however, research on the predictive power of mobility is lacking. To fill this gap, we introduce a novel model for epidemic forecasting based on mobility data, called the mobility-marked Hawkes model. The proposed model consists of three components: (1) a Hawkes process captures the transmission dynamics of infectious diseases; (2) a mark modulates the rate of infections, thus accounting for how the reproduction number R varies across space and time, and is modeled using a regularized Poisson regression based on mobility covariates; (3) a correction procedure incorporates new cases seeded by people traveling between regions. Our model was evaluated on the COVID-19 epidemic in Switzerland. Specifically, we used mobility data from February through April 2020, amounting to approximately 1.5 billion trips. Trip counts were derived from large-scale telecommunication data, i.e., cell phone pings from the Swisscom network, the largest telecommunication provider in Switzerland. We compared our model against various state-of-the-art baselines in terms of out-of-sample root mean squared error and found that our model outperformed the baselines by 15.52%. The improvement was consistent across forecast horizons between 5 and 21 days. In addition, we assessed the predictive power of conventional point-of-interest data, confirming that telecommunication data is superior. To the best of our knowledge, our work is the first to predict the spread of COVID-19 from telecommunication data. Altogether, our work contributes to previous research by developing a scalable early warning system for decision-makers in public health tasked with controlling the spread of infectious diseases.
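
A simplified sketch of the first two components, a Hawkes intensity whose self-exciting term is scaled by a mobility-derived mark (the kernel, parameters, and mark here are illustrative assumptions, not the fitted Swiss model):

    import numpy as np

    def hawkes_intensity(t, event_times, mu=0.5, R=1.2, mobility_mark=1.0,
                         kernel_scale=5.0):
        """Conditional intensity of a marked Hawkes process at time t (days).

        lambda(t) = mu + R * m(t) * sum_{t_i < t} g(t - t_i)
        where g is an exponential kernel (mean kernel_scale days, integrates to 1)
        and m(t) is a mark, e.g., derived from mobility covariates, that scales
        the effective reproduction number.
        """
        event_times = np.asarray(event_times, dtype=float)
        past = event_times[event_times < t]
        g = np.exp(-(t - past) / kernel_scale) / kernel_scale
        return mu + R * mobility_mark * g.sum()

    # intensity 10 days in, with reduced mobility damping the self-excitation
    print(hawkes_intensity(10.0, event_times=[1.0, 4.0, 8.5], mobility_mark=0.7))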

Does Air Quality Really Impact COVID-19 Clinical Severity: Coupling NASA Satellite Datasets with Geometric Deep Learning

Given that persons with a prior history of respiratory diseases tend to suffer more severe illness from COVID-19 and, hence, are at higher risk of serious symptoms, ambient air quality data from NASA's satellite observations might provide critical insight into which geographical areas may exhibit higher numbers of COVID-19 hospitalizations, how the expected severity of COVID-19 and the associated survival rates may vary across space in the future, and, most importantly, how, given this information, health professionals can distribute vaccines in a more efficient, timely, and fair manner.

Despite the utmost urgency of this problem, there exists no systematic analysis of the linkages among COVID-19 clinical severity, air quality, and other atmospheric conditions beyond relatively simplistic regression-based models.

The goal of this project is to glean a deeper insight into the sophisticated spatio-temporal dependencies among air quality, atmospheric conditions, and COVID-19 clinical severity using the machinery of Geometric Deep Learning (GDL), while providing quantitative uncertainty estimates. Our results, based on the GDL model at the county level in three US states (California, Pennsylvania, and Texas), indicate that AOD contributes to COVID-19 clinical severity in 39, 30, and 132 counties out of 58, 67, and 254 total counties, respectively. In turn, relative humidity is another important factor for understanding the dynamics of the clinical course and mortality risks due to COVID-19, whereas the predictive utility of temperature is noticeably lower. Our findings not only contribute to the understanding of latent factors behind COVID-19 progression but also open new perspectives on innovative uses of NASA's datasets for biosurveillance and social good.

Learning to Assign: Towards Fair Task Assignment in Large-Scale Ride Hailing

Ride hailing is a widespread shared-mobility application in which the central issue is to assign taxi requests to drivers under various objectives. Despite extensive research on task assignment in ride hailing, the fairness of earnings among drivers has largely been neglected. Pioneering studies on fair task assignment in ride hailing are ineffective and inefficient due to their myopic optimization perspective and time-consuming assignment techniques. In this work, we propose LAF, an effective and efficient task assignment scheme that optimizes both utility and fairness. We adopt reinforcement learning to make assignments in a holistic manner and propose a set of acceleration techniques to enable fast, fair assignment on large-scale data. Experiments show that LAF outperforms the state of the art by up to 86.7%, 29.1%, and 797% on fairness, utility, and efficiency, respectively.

Interpretable Drug Response Prediction using a Knowledge-based Neural Network

Predicting drug response based on the genomic profile of a cancer patient is one of the hallmarks of precision oncology. Although current methods for drug response prediction are becoming more accurate, there is still a need to move from 'black box' predictions to methods that offer high accuracy as well as interpretable predictions. This is of particular importance in real-world applications such as drug response prediction for cancer patients. In this paper, we propose BDKANN, a novel knowledge-based method that employs hierarchical information on how proteins form complexes and act together in pathways to form the architecture of a deep neural network. We employ BDKANN to predict cancer drug response from cell line gene expression data, and our experimental results demonstrate that BDKANN not only has a low prediction error compared to baseline models but also allows meaningful interpretation of the network. These interpretations can both explain the predictions made and uncover novel connections in the biological knowledge that may lead to new hypotheses about mechanisms of drug action.

Mondegreen: A Post-Processing Solution to Speech Recognition Error Correction for Voice Search Queries

As more and more online search queries come from voice, automatic speech recognition becomes a key component for delivering relevant search results. Errors introduced by automatic speech recognition (ASR) lead to irrelevant search results being returned to the user, causing dissatisfaction. In this paper, we introduce an approach, "Mondegreen", to correct voice queries in text space without depending on audio signals, which may not always be available due to system constraints or privacy or bandwidth considerations (for example, some ASR systems run on-device). We focus on voice queries transcribed via several proprietary commercial ASR systems. These queries come from users performing internet or online-service search queries. We first present an analysis showing how different the language distribution of user voice queries is from that of the traditional text corpora used to train off-the-shelf ASR systems. We then demonstrate that Mondegreen can achieve significant improvements in user interaction by correcting user voice queries in one of the largest search systems at Google. Finally, we see Mondegreen as complementing existing highly optimized production ASR systems, which may not be frequently retrained and can thus lag behind due to vocabulary drift.

Dynamic Social Media Monitoring for Fast-Evolving Online Discussions

Tracking and collecting fast-evolving online discussions provides vast data for studying social media usage and its role in people's public lives. However, collecting social media data using a static set of keywords fails to satisfy the growing need to monitor dynamic conversations and to study fast-changing topics. We propose a dynamic keyword search method to maximize the coverage of relevant information in fast-evolving online discussions. The method uses word embedding models to represent the semantic relations between keywords and predictive models to forecast the future trajectory of keywords. We also implement a visual user interface to aid the decision-making process in each round of keyword updates, allowing for both human-assisted tracking and fully automated data collection. In simulations using historical #MeToo data from 2017, our human-assisted tracking method significantly outperforms the traditional static baseline, achieving a 37.1% improvement in F1 score on the task of tracking the top trending keywords. We conduct a contemporary case study covering dynamic conversations about the recent Presidential Inauguration to test the dynamic data collection system. Our case studies reflect the effectiveness of our process and also point to potential challenges for future deployment.
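
The core expansion step, proposing new tracking keywords that are semantically close to the current set, can be illustrated with a simple cosine-similarity search over word vectors; the toy embeddings below are hypothetical, and the paper additionally forecasts keyword trajectories with predictive models:

    import numpy as np

    def expand_keywords(current, vocab_vectors, top_k=3):
        """Rank out-of-set vocabulary terms by cosine similarity to the centroid
        of the currently tracked keywords.

        vocab_vectors: dict mapping word -> embedding vector (e.g., from word2vec)
        """
        centroid = np.mean([vocab_vectors[w] for w in current], axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        scores = {}
        for word, vec in vocab_vectors.items():
            if word in current:
                continue
            scores[word] = float(vec @ centroid / np.linalg.norm(vec))
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    # toy 2-d embeddings standing in for a trained word-embedding model
    vocab = {"metoo": np.array([1.0, 0.1]), "harassment": np.array([0.9, 0.2]),
             "assault": np.array([0.8, 0.3]), "weather": np.array([0.0, 1.0])}
    print(expand_keywords({"metoo"}, vocab, top_k=2))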

MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph

Recent years have seen rapid growth in utilizing graph neural networks (GNNs) in the biomedical domain for tackling drug-related problems. However, like any other deep architecture, GNNs are data hungry. Since obtaining labels in the real world is often expensive, pretraining GNNs in an unsupervised manner has been actively explored. Among these approaches, graph contrastive learning, which maximizes the mutual information between paired graph augmentations, has been shown to be effective on various downstream tasks. However, the current graph contrastive learning framework has two limitations. First, the augmentations are designed for general graphs and thus may not be suitable or powerful enough for certain domains. Second, the contrastive scheme only learns representations that are invariant to local perturbations and thus does not consider the global structure of the dataset, which may also be useful for downstream tasks. In this paper, we study graph contrastive learning designed specifically for the biomedical domain, where molecular graphs are present. We propose a novel framework called MoCL, which utilizes domain knowledge at both the local and global level to assist representation learning. The local-level domain knowledge guides the augmentation process so that variation is introduced without changing graph semantics. The global-level knowledge encodes the similarity information between graphs in the entire dataset and helps to learn representations with richer semantics. The entire model is learned through a double-contrast objective. We evaluate MoCL on various molecular datasets under both linear and semi-supervised settings, and the results show that MoCL achieves state-of-the-art performance.
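
As background for the contrastive scheme discussed above, the sketch below shows the standard NT-Xent loss over paired graph-level embeddings from two augmented views; MoCL's actual double-contrast objective adds a global, dataset-level term, so this is only the local building block:

    import numpy as np

    def nt_xent(z1, z2, tau=0.5):
        """Normalized-temperature cross-entropy loss for paired augmentations.

        z1, z2: (n, d) embeddings of two augmented views of the same n graphs;
                row i of z1 and row i of z2 form a positive pair, while all other
                rows in the batch act as negatives.
        """
        z = np.concatenate([z1, z2], axis=0)
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        sim = z @ z.T / tau                                   # (2n, 2n) similarities
        n = z1.shape[0]
        np.fill_diagonal(sim, -np.inf)                        # exclude self-pairs
        pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
        log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
        return -np.mean(log_prob[np.arange(2 * n), pos])

    rng = np.random.default_rng(0)
    z_a, z_b = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
    print(nt_xent(z_a, z_b))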

A PLAN for Tackling the Locust Crisis in East Africa: Harnessing Spatiotemporal Deep Models for Locust Movement Forecasting

East Africa is experiencing the worst locust infestation in over 25 years, which has severely threatened the food security of millions of people across the region. The primary strategy adopted by human experts at the United Nations Food and Agriculture Organization (UN-FAO) to tackle locust outbreaks involves manually surveying at-risk geographical areas, followed by allocating and spraying pesticides in affected regions. In order to augment and assist human experts at the UN-FAO in this task, we utilize crowdsourced reports of locust observations collected by PlantVillage (the world's leading knowledge delivery system for East African farmers) and develop PLAN, a Machine Learning (ML) algorithm for forecasting future migration patterns of locusts at high spatial and temporal resolution across East Africa. PLAN's novel spatio-temporal deep learning architecture represents PlantVillage's crowdsourced locust observation data using image-based feature representations, and its design is informed by several unique insights about this problem domain. Experimental results show that PLAN achieves superior predictive performance against several baseline models, achieving an AUC score of 0.9 when used with a data augmentation method. PLAN represents a first step in using deep learning to assist and augment human expertise at PlantVillage (and the UN-FAO) in locust prediction, and its real-world usability is currently being evaluated by domain experts (including a potential idea to use the heatmaps created by PLAN in a Kenyan TV show). The source code is available at https://github.com/maryam-tabar/PLAN.

Value Function is All You Need: A Unified Learning Framework for Ride Hailing Platforms

Large ride-hailing platforms, such as DiDi, Uber and Lyft, connect tens of thousands of vehicles in a city to millions of ride demands throughout the day, holding great promise for improving transportation efficiency through the tasks of order dispatching and vehicle repositioning. Existing studies, however, usually consider the two tasks in simplified settings that hardly address the complex interactions between them, the real-time fluctuations between supply and demand, and the coordination required due to the large-scale nature of the problem. In this paper we propose a unified value-based dynamic learning framework (V1D3) for tackling both tasks. At the center of the framework is a globally shared value function that is updated continuously using online experiences generated from real-time platform transactions. To improve sample efficiency and robustness, we further propose a novel periodic ensemble method that combines fast online learning with a large-scale offline training scheme leveraging the abundant historical driver trajectory data. This allows the proposed framework to adapt quickly to the highly dynamic environment, to generalize robustly to recurrent patterns, and to drive implicit coordination among the population of managed vehicles. Extensive experiments based on real-world datasets show considerable improvements over other recently proposed methods on both tasks. In particular, V1D3 outperforms the first-prize winners of both the dispatching and repositioning tracks in the KDD Cup 2020 RL competition, achieving state-of-the-art results on both total driver income and user-experience-related metrics.

Recommending the Most Effective Intervention to Improve Employment for Job Seekers with Disability

In Disability Employment Services (DES), a growing problem is recommending to job seekers with disability which skill should be upgraded, and to what level, to increase their employment potential the most. This problem involves counterfactual reasoning to infer the causal effect of factors on employment status in order to recommend the most effective intervention. Related methods cannot solve our problem adequately since they are developed for non-counterfactual challenges, for binary causal factors, or for randomized trials. In this paper, we present a causality-based method to tackle the problem. The method has two stages: causal factors of employment status are first detected from data; we then combine a counterfactual reasoning framework with a machine learning approach to build an interpretable model for generating personalized recommendations. Experiments on both synthetic datasets and a real case study from a DES provider show consistently promising performance in improving the employability of job seekers with disability. Results from the case study disclose effective factors and their best levels for intervention to increase employability. The most effective intervention varies among job seekers. Our model can separate job seekers by degree of employability increase, which is helpful for DES providers in allocating resources for employment assistance. Moreover, causal interpretability makes our recommendations actionable in DES business practice.

Clockwork: A Delay-Based Global Scheduling Framework for More Consistent Landing Times in the Data Warehouse

Recurring batch data pipelines are a staple of the modern enterprise-scale data warehouse. As a data warehouse scales to support more products and services, a growing number of interdependent pipelines running at various cadences can give rise to periodic resource bottlenecks for the cluster. This resource contention results in pipelines starting at unpredictable times each day and consequently variable landing times for the data artifacts they produce. The variability gets compounded by the dependency structure of the workload, and the resulting unpredictability can disrupt the project workstreams which consume this data. We present Clockwork, a delay-based global scheduling framework for data pipelines which improves landing time stability by spreading out tasks throughout the day. Whereas most scheduling algorithms optimize for makespan or average job completion times, Clockwork's execution plan optimizes for stability in task completion times while also targeting predefined pipeline SLOs. We present this new problem formulation and design a list scheduling algorithm based on its analytic properties. We also discuss how we estimate the resource requirements for our recurring pipelines, and the architecture for integrating Clockwork with Dataswarm, Facebook's existing data workflow management service. Online experiments comparing this novel scheduling algorithm and a previously proposed greedy procrastinating heuristic show tasks complete almost an hour earlier on average, while exhibiting lower landing time variance and producing significantly less competition for resources in the cluster.

Bipartite Dynamic Representations for Abuse Detection

Abusive behavior in online retail websites and communities threatens the experience of regular community members. Such behavior often takes place within a complex, dynamic, and large-scale network of users interacting with items. Detecting abuse is challenging due to the scarcity of labeled abuse instances and complexity of combining temporal and network patterns while operating at a massive scale. Previous approaches to dynamic graph modeling either do not scale, do not effectively generalize from a few labeled instances, or compromise performance for scalability. Here we present BiDyn, a general method to detect abusive behavior in dynamic bipartite networks at scale, while generalizing from limited training labels. BiDyn develops an efficient hybrid RNN-GNN architecture trained via a novel stacked ensemble training scheme. We also propose a novel pre-training framework for dynamic graphs that helps to achieve superior performance at scale. Our approach outperforms recent large-scale dynamic graph baselines in an abuse classification task by up to 14% AUROC while requiring 10x less memory per training batch in both open and proprietary datasets.

MeLL: Large-scale Extensible User Intent Classification for Dialogue Systems with Meta Lifelong Learning

User intent detection is vital for understanding users' demands in dialogue systems. Although the User Intent Classification (UIC) task has been widely studied, it remains challenging for large-scale industrial applications. This is because user inputs in distinct domains may have different text distributions and target intent sets. When the underlying application evolves, new UIC tasks continuously emerge in large quantities. Hence, it is crucial to develop a framework for large-scale extensible UIC that continuously fits new tasks and avoids catastrophic forgetting with an acceptable parameter growth rate. In this paper, we introduce the Meta Lifelong Learning (MeLL) framework to address this task. In MeLL, a BERT-based text encoder is employed to learn robust text representations across tasks and is slowly updated for lifelong learning. We design global and local memory networks to capture the cross-task prototype representations of different classes, which are treated as meta-learners that quickly adapt to different tasks. Additionally, the Least Recently Used replacement policy is applied to manage the global memory so that the model size does not explode over time. Finally, each UIC task has its own task-specific output layer, with attentive summarization of various features. We have conducted extensive experiments on both open-source and real industry datasets. Results show that MeLL improves performance compared with strong baselines while also reducing the total number of parameters. We have also deployed MeLL in a real-world e-commerce dialogue system, AliMe, and observed significant improvements in terms of both F1 and resource usage.

Record: Joint Real-Time Repositioning and Charging for Electric Carsharing with Dynamic Deadlines

Electric carsharing, i.e., electric vehicle sharing, as an emerging mobility-on-demand service, has been proliferating worldwide recently. Though it provides convenient, low-cost, and environmentally friendly mobility, electric carsharing services still face potential roadblocks due to inefficient fleet management strategies, which relocate vehicles using predefined periodic schedules without adapting to highly dynamic user demand, and which leave many practical factors, such as time-variant charging pricing, insufficiently considered. To remedy these problems, in this paper we design Record, an effective fleet management system with joint Repositioning and Charging for electric carsharing based on dynamic deadlines, which improves operating profits while satisfying users' real-time pickup and return demand. Record considers not only the highly dynamic user demand for vehicle repositioning (i.e., where to relocate) but also the time-varying charging pricing for charging scheduling (i.e., where to charge). To perform the two tasks efficiently, in Record we design a dynamic deadline-based distributed deep reinforcement learning algorithm, which generates dynamic deadlines via usage prediction combined with an error compensation mechanism to adaptively search and learn the optimal locations for satisfying highly dynamic and unbalanced user demand in real time. We implement and evaluate the Record system with 10 months of real-world electric carsharing data, and the extensive experimental results show that Record reduces charging costs by 25.8% and vehicle movements by workers by 30.2%, while satisfying user demand and incurring only a small runtime overhead.

Live-Streaming Fraud Detection: A Heterogeneous Graph Neural Network Approach

Live-streaming platforms have recently gained significant popularity by attracting an increasing number of young users and have become a very promising form of online shopping. Similar to the traditional online shopping platforms such as Taobao, live-streaming platforms also suffer from online malicious fraudulent behaviors where many transactions are not genuine. The existing anti-fraud models proposed to recognize fraudulent transactions on traditional online shopping platforms are inapplicable on live-streaming platforms. This is mainly because live-streaming platforms are characterized by a unique type of heterogeneous live-streaming networks where multiple heterogeneous types of nodes such as users, live-streamers, and products are connected with multiple different types of edges associated with edge features. In this paper, we propose a new approach based on a heterogeneous graph neural network for LIve-streaming Fraud dEtection (called LIFE). LIFE designs an innovative heterogeneous graph learning model that fully utilizes various heterogeneous information of shopping transactions, users, streamers, and items from a given live-streaming platform. Moreover, a label propagation algorithm is employed within our LIFE framework to handle the limited number of labeled fraudulent transactions for model training. Extensive experimental results on a large-scale Taobao live-streaming platform demonstrate that the proposed method is superior to the baseline models in terms of fraud detection effectiveness on live-streaming platforms. Furthermore, we conduct a case study to show that the proposed method is able to effectively detect fraud communities for live-streaming e-commerce platforms.

Energy-Efficient 3D Vehicular Crowdsourcing for Disaster Response by Distributed Deep Reinforcement Learning

Fast and efficient access to environmental and life data is key to successful disaster response. Vehicular crowdsourcing (VC), in which a group of unmanned vehicles (UVs) such as drones and unmanned ground vehicles collect these data from Points-of-Interest (PoIs), e.g., possible survivor spots and fire sites, provides an efficient way to assist disaster rescue. In this paper, we explicitly consider navigating a group of UVs in a 3-dimensional (3D) disaster workzone to maximize the amount of collected data, geographical fairness, and energy efficiency, while minimizing data dropout due to limited transmission rate. We propose DRL-DisasterVC(3D), a distributed deep reinforcement learning framework with a repetitive experience replay (RER) to improve learning efficiency and a clipped target network to increase learning stability. We also use a 3D convolutional neural network (3D CNN) with multi-head relational attention (MHRA) for spatial modeling, and add auxiliary pixel control (PC) for spatial exploration. We design a novel disaster response simulator, called "DisasterSim", and conduct extensive experiments to show that DRL-DisasterVC(3D) outperforms all five baselines in terms of energy efficiency when varying the number of UVs, the number of PoIs, and the SNR threshold.

Tac-Valuer: Knowledge-based Stroke Evaluation in Table Tennis

Stroke evaluation is critical for coaches to evaluate players' performance in table tennis matches. However, current methods demand proficient knowledge of table tennis and are time-consuming. We collaborate with the Chinese national table tennis team and propose Tac-Valuer, an automatic stroke evaluation framework for analysts in table tennis teams. In particular, to integrate analysts' knowledge into the machine learning model, we employ the latest effective framework, named abductive learning, which shows promising performance. Based on abductive learning, Tac-Valuer combines state-of-the-art computer vision algorithms to extract and embed stroke features for evaluation. We evaluate the design choices of the approach and present Tac-Valuer's usability through use cases that analyze the performance of the top table tennis players in world-class events.

Reinforcing Pretrained Models for Generating Attractive Text Advertisements

We study how pretrained language models can be enhanced by using deep reinforcement learning to generate attractive text advertisements that reach the high quality standard of real-world advertiser mediums. To improve ad attractiveness without hampering user experience, we propose a model-based reinforcement learning framework for text ad generation, which constructs a model for the environment dynamics and avoids large sample complexity. Based on the framework, we develop Masked-Sequence Policy Gradient, a reinforcement learning algorithm that integrates efficiently with pretrained models and explores the action space effectively. Our method has been deployed to production in Microsoft Bing. Automatic offline experiments, human evaluation, and online experiments demonstrate the superior performance of our method.

Multimodal Emergent Fake News Detection via Meta Neural Process Networks

Fake news travels at unprecedented speeds, reaches global audiences and puts users and communities at great risk via social media platforms. Deep learning based models show good performance when trained on large amounts of labeled data on events of interest, whereas their performance tends to degrade on other events due to domain shift. Therefore, significant challenges are posed for existing detection approaches to detect fake news on emergent events, where large-scale labeled datasets are difficult to obtain. Moreover, adding the knowledge from newly emergent events requires building a new model from scratch or continuing to fine-tune the model, which can be challenging, expensive, and unrealistic for real-world settings. In order to address these challenges, we propose an end-to-end fake news detection framework named MetaFEND, which is able to learn quickly to detect fake news on emergent events with only a few verified posts. Specifically, the proposed model integrates meta-learning and neural process methods to enjoy the benefits of both approaches. In particular, a label embedding module and a hard attention mechanism are proposed to enhance effectiveness by handling categorical information and trimming irrelevant posts. Extensive experiments are conducted on multimedia datasets collected from Twitter and Weibo. The experimental results show that our proposed MetaFEND model can detect fake news on never-seen events effectively and outperforms the state-of-the-art methods.

Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature

Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, the biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, as in many other vertical domains, is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.

Multi-Scale One-Class Recurrent Neural Networks for Discrete Event Sequence Anomaly Detection

Discrete event sequences are ubiquitous, such as an ordered event series of process interactions in Information and Communication Technology systems. Recent years have witnessed increasing efforts in detecting anomalies with discrete event sequences. However, it remains an extremely difficult task due to several intrinsic challenges including data imbalance issues, discrete property of the events, and sequential nature of the data. To address these challenges, in this paper, we propose OC4Seq, a multi-scale one-class recurrent neural network for detecting anomalies in discrete event sequences. Specifically, OC4Seq integrates the anomaly detection objective with recurrent neural networks (RNNs) to embed the discrete event sequences into latent spaces, where anomalies can be easily detected. In addition, given that an anomalous sequence could be caused by either individual events, subsequences of events, or the whole sequence, we design a multi-scale RNN framework to capture different levels of sequential patterns simultaneously. We fully implement and evaluate OC4Seq on three real-world system log datasets. The results show that OC4Seq consistently outperforms various representative baselines by a large margin. Moreover, through both quantitative and qualitative analysis, the importance of capturing multi-scale sequential patterns for event anomaly detection is verified. To encourage reproducibility, we make the code and data publicly available.
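
A minimal sketch of the multi-scale one-class idea described above: one GRU embeds the whole event sequence and another embeds short sliding windows, and both embeddings are scored by distance to learnable centers in a Deep SVDD style. The window length, the use of GRUs, and the hypersphere-center objective are illustrative assumptions rather than OC4Seq's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleOneClassRNN(nn.Module):
    """Toy multi-scale one-class RNN: embed the whole sequence and short sliding
    windows, and score anomalies by distance to learnable centers."""
    def __init__(self, vocab_size, emb_dim=32, hidden=64, window=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.global_rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.local_rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.c_global = nn.Parameter(torch.randn(hidden))
        self.c_local = nn.Parameter(torch.randn(hidden))
        self.window = window

    def forward(self, seq):                               # seq: (B, T) of event ids
        x = self.embed(seq)                               # (B, T, E)
        _, h_g = self.global_rnn(x)                       # (1, B, H) sequence-level state
        g_score = ((h_g.squeeze(0) - self.c_global) ** 2).sum(dim=1)

        w = self.window
        wins = x.unfold(1, w, 1).permute(0, 1, 3, 2)      # (B, T-w+1, w, E) sliding windows
        B, n, _, E = wins.shape
        _, h_l = self.local_rnn(wins.reshape(B * n, w, E))
        l_score = ((h_l.squeeze(0) - self.c_local) ** 2).sum(dim=1).view(B, n).mean(dim=1)
        return g_score + l_score                          # higher = more anomalous

model = MultiScaleOneClassRNN(vocab_size=50)
scores = model(torch.randint(0, 50, (2, 12)))
print(scores.shape)  # torch.Size([2]); training minimizes scores on normal sequences
```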

Representation Learning for Predicting Customer Orders

The ability to predict future customer orders is of significant value to retailers in making many crucial operational decisions. Different from next basket prediction or temporal set prediction, which focus on predicting a subset of items for a single user, this paper aims for the distributional information of future orders, i.e., the possible subsets of items and their frequencies (probabilities), which is required for decisions such as assortment selection for front-end warehouses and capacity evaluation for fulfillment centers. Based on key statistics of a real order dataset from Tmall supermarket, we show the challenges of order prediction. Motivated by our analysis that biased models of the order distribution can still help improve the quality of order prediction, we design a generative model to capture the order distribution for customer order prediction. Our model utilizes representation learning to embed items into a Euclidean space, and we design a highly efficient SGD algorithm to learn the item embeddings. Future order prediction is done by calibrating orders obtained by random walks over the embedding graph. The experiments show that our model outperforms all the existing methods. The benefit of our model is also illustrated with an application to assortment selection for front-end warehouses.

Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising

In most real-world large-scale online applications (e.g., e-commerce or finance), customer acquisition is usually a multi-step conversion process of audiences. For example, audiences on e-commerce platforms usually follow an impression->click->purchase process. However, it is more difficult to acquire customers in financial advertising (e.g., credit card advertising) than in traditional advertising. On the one hand, the audience multi-step conversion path is longer: an impression->click->application->approval->activation process usually occurs during audience conversion for the credit card business in financial advertising. On the other hand, the positive feedback becomes sparser (class imbalance) step by step, and it is difficult to obtain the final positive feedback due to the delayed feedback of activation. Therefore, it is necessary to use the positive feedback information of the former step to alleviate the class imbalance of the latter step. Multi-task learning is a typical solution in this direction. While considerable multi-task efforts have been made in this direction, a long-standing challenge is how to explicitly model the long-path sequential dependence among audience multi-step conversions for improving the end-to-end conversion. In this paper, we propose an Adaptive Information Transfer Multi-task (AITM) framework, which models the sequential dependence among audience multi-step conversions via the Adaptive Information Transfer (AIT) module. The AIT module can adaptively learn what and how much information to transfer for different conversion stages. Besides, by combining the Behavioral Expectation Calibrator in the loss function, the AITM framework can yield more accurate end-to-end conversion identification. The proposed framework is deployed in the Meituan app, which uses it to show, in real time, a banner to audiences with a high end-to-end conversion rate for Meituan Co-Branded Credit Cards. Offline experimental results on both industrial and public real-world datasets clearly demonstrate that the proposed framework achieves significantly better performance compared with state-of-the-art baselines. Besides, online experiments also demonstrate significant improvement compared with existing online models. Furthermore, we have released the source code of the proposed framework at https://github.com/xidongbo/AITM.
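
To illustrate the flavor of sequential-dependence modeling with adaptive transfer, below is a toy two-step (click -> purchase) sketch in which an attention block decides how much information flows from the earlier tower into the later one; the two-step simplification, layer sizes, and module names are assumptions for illustration, not the released AITM code.

```python
import torch
import torch.nn as nn

class AITBlock(nn.Module):
    """Toy adaptive transfer: attend over the previous step's transferred
    information and the current tower's own representation."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, prev_info, cur_repr):
        cand = torch.stack([prev_info, cur_repr], dim=1)                      # (B, 2, D)
        attn = torch.softmax(
            (self.q(cand) * self.k(cand)).sum(-1) / cand.size(-1) ** 0.5, dim=1)  # (B, 2)
        return (attn.unsqueeze(-1) * self.v(cand)).sum(dim=1)                 # (B, D)

class SequentialConversionModel(nn.Module):
    """Two towers with an AIT-style transfer from the click step to the purchase step."""
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.click_tower = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.purchase_tower = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.transfer = nn.Linear(hidden, hidden)   # "what" to transfer
        self.ait = AITBlock(hidden)                 # "how much" to transfer
        self.click_head = nn.Linear(hidden, 1)
        self.purchase_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h_click = self.click_tower(x)
        h_purchase = self.ait(self.transfer(h_click), self.purchase_tower(x))
        return torch.sigmoid(self.click_head(h_click)), torch.sigmoid(self.purchase_head(h_purchase))

model = SequentialConversionModel(in_dim=16)
p_click, p_purchase = model(torch.randn(4, 16))
print(p_click.shape, p_purchase.shape)
```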

Tolerating Data Missing in Breast Cancer Diagnosis from Clinical Ultrasound Reports via Knowledge Graph Inference

Medical diagnosis through artificial intelligence has been drawing increasing attention. For breast lesions, clinical ultrasound reports are the most commonly used data in the diagnosis of breast cancer. Nevertheless, the input reports inevitably suffer from missing data. Unfortunately, despite the progress made by previous approaches on tackling data imprecision, nearly all of them cannot accept inputs with missing data. A common way to alleviate the missing data issue is to fill the missing values with artificial data. However, this data filling strategy actually introduces additional noise that does not exist in the raw data. Inspired by the advantage of the open-world assumption, we regard the missing data in clinical ultrasound reports as non-observed terms of facts, and propose a Knowledge Graph embedding based model, KGSeD, with the capability of tolerating missing data, which successfully circumvents the pollution caused by data filling. Our KGSeD is designed via an encoder-decoder framework, where the encoder incorporates structural information of the graph via embedding, and the decoder diagnoses patients by inferring their links to clinical outcomes. Comparative experiments show that KGSeD achieves noticeable diagnosis performance. When data are missing, KGSeD yields the most stable performance among existing approaches, showing better tolerance to missing data.

Medical Entity Relation Verification with Large-scale Machine Reading Comprehension

Medical entity relation verification is a crucial step in building a practical, enterprise-grade medical knowledge graph (MKG), because high-precision medical entity relations are a key requirement for many MKG-based applications. Existing relation verification approaches for general knowledge graphs are not designed to incorporate medical domain knowledge, although such knowledge is central to achieving high-quality entity relation verification for an MKG. To this end, in this paper, we introduce a system for medical entity relation verification with large-scale machine reading comprehension. The proposed system is tailored to overcome the unique challenges of medical relation verification, including the high variability of medical terms, the difficulty of evidence searching in complex medical documents, and the lack of evidence labels for supervision. To deal with the problem of variants of medical terms, we introduce a synonym-aware retrieval model to retrieve the potential evidence implicitly verifying the given claim. To better utilize medical domain knowledge, a relation-aware evidence detector and a medical ontology-enhanced aggregator are developed to improve the performance of the relation verification module. Moreover, to overcome the challenge of providing high-quality evidence despite the lack of labels, we introduce an interactive collaborative-training method to iteratively improve the evidence accuracy. Finally, we conduct extensive experiments to demonstrate that the performance of our proposed system is superior to all comparable models. We also demonstrate that our system can significantly reduce the annotation time required of medical experts in real-world verification tasks, improving efficiency by nearly 300%. In particular, our system has been embedded into the Baidu Clinical Decision Support System.

EXACTA: Explainable Column Annotation

Column annotation, the process of annotating tabular columns with labels, plays a fundamental role in digital marketing data governance. It has a direct impact on how customers manage their data and facilitates compliance with regulations, restrictions, and policies applicable to data use. Despite substantial gains in accuracy brought by recent deep learning-driven column annotation methods, their inability to explain why columns are matched with particular target labels has drawn concern, due to the black-box nature of deep neural networks. Such explainability is of particular importance in industrial marketing scenarios, where data stewards need to quickly verify and calibrate the annotation results to ascertain the correctness of downstream applications. This work sheds new light on the explainable column annotation problem, the first column annotation task of its kind. To achieve this, we propose a new approach called EXACTA, which conducts multi-hop knowledge graph reasoning using inverse reinforcement learning to find a path from a column to a potential target label while ensuring both annotation performance and explainability. We experiment on four benchmarks, both publicly available and real-world ones, and undertake a comprehensive analysis of the explainability. The results suggest that our method not only provides competitive annotation performance compared with existing deep learning-based models, but, more importantly, produces faithfully explainable paths for annotated columns to facilitate human examination.

DMBGN: Deep Multi-Behavior Graph Networks for Voucher Redemption Rate Prediction

In E-commerce, vouchers are important marketing tools to enhance users' engagement and boost sales and revenue. The likelihood that a user redeems a voucher is a key factor in voucher distribution decisions. User-item Click-Through-Rate (CTR) models are often applied to predict the user-voucher redemption rate. However, the voucher scenario involves more complicated relations among users, items and vouchers. Users' historical behavior in a voucher collection activity reflects their voucher usage patterns, which is nevertheless overlooked by the CTR-based solutions. In this paper, we propose Deep Multi-behavior Graph Networks (DMBGN) to shed light on this field for voucher redemption rate prediction. The complex structural user-voucher-item relationships are captured by a User-Behavior Voucher Graph (UVG). User behavior happening both before and after voucher collection is taken into consideration, and a high-level representation is extracted by higher-order Graph Neural Networks. On top of a sequence of UVGs, an attention network is built to help learn users' long-term voucher redemption preferences. Extensive experiments on three large-scale production datasets demonstrate that the proposed DMBGN model is effective, with 10% to 16% relative AUC improvement over Deep Neural Networks (DNN), and 2% to 4% AUC improvement over Deep Interest Network (DIN). Source code and a sample dataset are made publicly available to facilitate future research.

FIVES: Feature Interaction Via Edge Search for Large-Scale Tabular Data

High-order interactive features capture the correlation between different columns and thus are promising for enhancing various learning tasks on ubiquitous tabular data. To automate the generation of interactive features, existing works either explicitly traverse the feature space or implicitly express the interactions via the intermediate activations of some designed models. These two kinds of methods reveal an essential trade-off between feature interpretability and search efficiency. To combine the merits of both, we propose a novel method named Feature Interaction Via Edge Search (FIVES), which formulates the task of interactive feature generation as searching for edges on the defined feature graph. Specifically, we first present theoretical evidence that motivates us to search for useful interactive features with increasing order. We then instantiate this search strategy by optimizing both a dedicated graph neural network (GNN) and the adjacency tensor associated with the defined feature graph. In this way, the proposed FIVES method simplifies the time-consuming traversal into a typical GNN training course and enables explicit feature generation according to the learned adjacency tensor. Experimental results on both benchmark and real-world datasets show the advantages of FIVES over several state-of-the-art methods. Moreover, the interactive features identified by FIVES are deployed on the recommender system of Taobao, a worldwide leading e-commerce platform. Results of an online A/B test further verify the effectiveness of the proposed FIVES method, and we provide FIVES as an AI utility for the customers of Alibaba Cloud.

Learning Reliable User Representations from Volatile and Sparse Data to Accurately Predict Customer Lifetime Value

In industry, customer lifetime value (LTV) prediction is a challenging task, since user consumption data is usually volatile, noisy, or sparse. To address these issues, this paper presents a novel Temporal-Structural User Representation network (TSUR) to predict LTV. We utilize historical revenue time series and user attributes to learn temporal and structural user representations, respectively. Specifically, the temporal representation is learned with a temporal trend encoder based on a novel multi-channel Discrete Wavelet Transform (DWT) module, while the structural representation is derived with a Graph Attention Network (GAT) on an attribute similarity graph. Furthermore, a novel cluster-alignment regularization method is employed to align and enhance these two kinds of representations. In essence, this fusion can be viewed as associating the temporal and structural representations in the low-pass representation space, which also helps prevent data noise from being transferred across views. To our knowledge, this is the first time that temporal and structural user representations have been jointly learned for LTV prediction. Extensive offline experiments on two large-scale real-world datasets and online A/B tests have shown the superiority of our approach over a number of competitive baselines.
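
As a small illustration of the temporal side, here is a plain-NumPy one-level Haar DWT that splits a revenue series into a low-pass (trend) channel and a high-pass (fluctuation) channel, stacked across levels as encoder inputs; the Haar wavelet choice and the stacking scheme are assumptions made for brevity, not the paper's multi-channel DWT module.

```python
import numpy as np

def haar_dwt(series):
    """One-level Haar DWT: split a revenue series into trend and fluctuation channels."""
    x = np.asarray(series, dtype=float)
    if len(x) % 2:                       # pad to even length
        x = np.append(x, x[-1])
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2)   # low-frequency trend
    detail = (even - odd) / np.sqrt(2)   # high-frequency fluctuation/noise
    return approx, detail

def multi_channel_dwt(series, levels=2):
    """Collect detail channels across levels plus the final approximation."""
    channels, cur = [], series
    for _ in range(levels):
        cur, det = haar_dwt(cur)
        channels.append(det)
    channels.append(cur)
    return channels

revenue = [3.0, 5.0, 4.0, 8.0, 2.0, 1.0, 6.0, 7.0]
for ch in multi_channel_dwt(revenue):
    print(np.round(ch, 2))
```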

Towards the D-Optimal Online Experiment Design for Recommender Selection

Selecting the optimal recommender via online exploration-exploitation is attracting increasing attention, as traditional A/B testing can be slow and costly and offline evaluations are prone to the bias of history data. Finding the optimal online experiment is nontrivial since both the users and the displayed recommendations carry contextual features that are informative to the reward. While the problem can be formalized through the lens of multi-armed bandits, the existing solutions are found to be less satisfactory because the general methodologies do not account for the case-specific structures, particularly for the e-commerce recommendation setting we study. To fill in the gap, we leverage the D-optimal design from the classical statistics literature to achieve the maximum information gain during exploration, and reveal how it fits seamlessly with the modern infrastructure of online inference. To demonstrate the effectiveness of the optimal designs, we provide semi-synthetic simulation studies with published code and data for reproducibility purposes. We then use our deployment example on Walmart.com to fully illustrate the practical insights and effectiveness of the proposed methods.
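
For intuition, a greedy D-optimal selection can be sketched in a few lines: repeatedly add the candidate context that most increases log det of the accumulated information matrix, which is the information-gain criterion D-optimal design maximizes. The greedy heuristic, the ridge term, and the linear-reward assumption are illustrative simplifications rather than the paper's deployed procedure.

```python
import numpy as np

def greedy_d_optimal(candidates, k, ridge=1e-3):
    """Greedily pick k contexts maximizing log det(X^T X + ridge*I), i.e. the
    information gained about a linear reward model during exploration."""
    d = candidates.shape[1]
    A = ridge * np.eye(d)                  # running information matrix
    chosen = []
    for _ in range(k):
        gains = []
        for i, x in enumerate(candidates):
            if i in chosen:
                gains.append(-np.inf)
                continue
            _, logdet = np.linalg.slogdet(A + np.outer(x, x))
            gains.append(logdet)
        best = int(np.argmax(gains))
        chosen.append(best)
        A += np.outer(candidates[best], candidates[best])
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))               # candidate user/recommender contexts
print(greedy_d_optimal(X, k=5))
```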

PAMI: A Computational Module for Joint Estimation and Progression Prediction of Glaucoma

Glaucoma, which can cause irreversible damage to the sight of human eyes, is conventionally diagnosed by visual field (VF) sensitivity. However, it is labor-intensive and time-consuming to measure VF. Recently, optical coherence tomography (OCT) has been adopted to measure retinal layers thickness (RT) for assisting the diagnosis because glaucoma makes structural changes to RT and it is much less costly to obtain RT. In particular, RT can assist in mainly two manners. One is to estimate a VF from an RT such that clinical doctors only need to obtain an RT of a patient and then convert it to a VF for the diagnosis. The other is to predict future VFs by utilizing both past VFs and RTs, i.e., the prediction of progression of VF over time. The two computational tasks are performed as two data mining tasks because currently there is no knowledge about the exact form of the computations involved. In this paper, we study a novel problem which is the integration of the two data mining tasks. The motivation is that both the two data mining tasks deal with transforming information from the RT domain to the VF domain such that the knowledge discovered in one task can be useful for another. The integration is non-trivial because the two tasks do not share the way of transformation. To address this issue, we design a progression-agnostic and mode-independent (PAMI) module which facilitates cross-task knowledge utilization. We empirically demonstrate that our proposed method outperforms the state-of-the-art method for the estimation by 6.33% in terms of mean of the root mean square error on a real dataset, and outperforms the state-of-the-art method for the progression prediction by 3.49% for the best case.

Session-Aware Query Auto-completion using Extreme Multi-Label Ranking

Query auto-completion (QAC) is a fundamental feature in search engines where the task is to suggest plausible completions of a prefix typed in the search bar. Previous queries in the user session can provide useful context for the user's intent and can be leveraged to suggest auto-completions that are more relevant while adhering to the user's prefix. Such session-aware QACs can be generated by recent sequence-to-sequence deep learning models; however, these generative approaches often do not meet the stringent latency requirements of responding to each user keystroke. Moreover, these generative approaches pose the risk of showing nonsensical queries. One can pre-compute a relatively small subset of relevant queries for common prefixes and rank them based on the context. However, such an approach fails when no relevant queries for the current context are present in the pre-computed set.

In this paper, we provide a solution to this problem: we take the novel approach of modeling session-aware QAC as an eXtreme Multi-Label Ranking (XMR) problem where the input is the previous query in the session and the user's current prefix, while the output space is the set of tens of millions of queries entered by users in the recent past. We adapt a popular XMR algorithm for this purpose by proposing several modifications to the key steps in the algorithm. The proposed modifications yield a 10x improvement in terms of Mean Reciprocal Rank (MRR) over the baseline XMR approach on a public search logs dataset. We are able to maintain an inference latency of less than 10 ms while still using session context. When compared against baseline models of acceptable latency, we observed a 33% improvement in MRR for short prefixes of up to 3 characters. Moreover, our model yielded a statistically significant improvement of 2.81% over a production QAC system in terms of suggestion acceptance rate, when deployed on the search bar of an online shopping store as part of an A/B test.

FLOP: Federated Learning on Medical Datasets using Partial Networks

The outbreak of COVID-19 caused by the novel coronavirus has led to a shortage of medical resources. To aid and accelerate the diagnosis process, automatic diagnosis of COVID-19 via deep learning models has recently been explored by researchers across the world. While different data-driven deep learning models have been developed to assist the diagnosis of COVID-19, the data itself is still scarce due to patient privacy concerns. Federated Learning (FL) is a natural solution because it allows different organizations to cooperatively learn an effective deep learning model without sharing raw data. However, recent studies show that FL still lacks privacy protection and may cause data leakage. We investigate this challenging problem by proposing a simple yet effective algorithm, named Federated Learning on Medical Datasets using Partial Networks (FLOP), that shares only a partial model between the server and clients. Extensive experiments on benchmark data and real-world healthcare tasks show that our approach achieves comparable or better performance while reducing privacy and security risks. Of particular interest, we conduct experiments on the COVID-19 dataset and find that our FLOP algorithm allows different hospitals to collaboratively and effectively train a partially shared model without sharing local patients' data.
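
The core mechanism of sharing only a partial model can be sketched as federated averaging restricted to a shared backbone while each client keeps its own head private; the module names (`features`, `head`), the three-client toy setup, and plain FedAvg aggregation are assumptions for illustration, not the authors' exact protocol.

```python
import copy
import torch
import torch.nn as nn

class ClientNet(nn.Module):
    """Toy client model: a shared backbone plus a private task head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # shared part
        self.head = nn.Linear(32, 2)                                  # kept local

    def forward(self, x):
        return self.head(self.features(x))

def fedavg_partial(client_models, shared_prefix="features."):
    """Average only parameters under `shared_prefix` across clients and push the
    result back, so private heads never leave a client."""
    avg = copy.deepcopy(client_models[0].state_dict())
    for name in avg:
        if name.startswith(shared_prefix):
            avg[name] = torch.stack(
                [m.state_dict()[name].float() for m in client_models]
            ).mean(dim=0)
    for m in client_models:
        local = m.state_dict()
        for name in local:
            if name.startswith(shared_prefix):
                local[name] = avg[name]
        m.load_state_dict(local)

clients = [ClientNet() for _ in range(3)]   # e.g., three hospitals
fedavg_partial(clients)
print(clients[1].state_dict()["features.0.weight"].shape)
```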

Improving the Information Disclosure in Mobility-on-Demand Systems

Nowadays, the ubiquity of the sharing economy and the booming of ride-sharing services prompt Mobility-on-Demand (MoD) platforms to explore and develop new business modes. Different from forcing full-time drivers to serve the dispatched orders, these modes usually aim to attract part-time drivers to share their vehicles and employ a 'driver-choose-order' pattern by displaying a sequence of orders to drivers as a candidate set. A key issue here is to determine which orders should be displayed to each driver. In this work, we propose a novel framework to tackle this issue, known as the Information Disclosure problem in MoD systems. The problem is solved in two steps combining estimation with optimization: 1) in the estimation step, we investigate the drivers' choice behavior and estimate the probability of choosing an order or ignoring the displayed candidate set; 2) in the optimization step, we transform the problem into determining the optimal edge configuration in a bipartite graph, and develop a Minimal-Loss Edge Cutting (MLEC) algorithm to solve it. Extensive experiments on both simulated and real-world data from the Huolala business show that the proposed method remarkably improves user experience and platform efficiency. Based on these promising results, the proposed framework has been successfully deployed in the real-world MoD system at Huolala.

Device-Cloud Collaborative Learning for Recommendation

With the rapid development of storage and computing power on mobile devices, it has become critical and popular to deploy models on devices to avoid onerous communication latency and to capture real-time features. While many works have explored facilitating on-device learning and inference, most of them focus on dealing with response delay or privacy protection. Little has been done to model the collaboration between device and cloud modeling so that both sides benefit jointly. To bridge this gap, we make one of the first attempts to study the Device-Cloud Collaborative Learning (DCCL) framework. Specifically, we propose a novel MetaPatch learning approach on the device side to efficiently achieve "thousands of people with thousands of models'' given a centralized cloud model. Then, with billions of updated personalized device models, we propose a "model-over-models'' distillation algorithm, namely MoMoDistill, to update the centralized cloud model. Our extensive experiments over a range of datasets with different settings demonstrate the effectiveness of such collaboration on both cloud and devices, especially its superiority in modeling long-tailed users.

Semi-supervised Bearing Fault Diagnosis with Adversarially-Trained Phase-Consistent Network

In this study, we propose an adversarially-trained phase-consistent network (APCNet), a semi-supervised signal classification approach. The proposed classification model is trained with datasets that contain only a small fraction of labeled output, so as to design (1) an effective representation of the input time series (vibration signal) that extracts important factors for the model to discriminate between different bearing conditions, and (2) a latent representation for the data that reflects the true data distribution precisely. To achieve these goals, APCNet introduces three novelties: a vibration-specific encoder, phase-consistency regularization, and adversarially-trained latent distribution alignment of the labeled and unlabeled distributions. We conduct experiments on two public bearing datasets and one public motor operating dataset to evaluate the performance of APCNet. We interpret the model's capabilities with different data label ratios and latent distribution analysis. The results show that APCNet performs well on datasets with a small labeled-to-unlabeled data ratio. We also show that APCNet achieves our objectives of capturing important vibration signal features and modeling the true data distribution effectively.

Leveraging Tripartite Interaction Information from Live Stream E-Commerce for Improving Product Recommendation

Recently, a new form of online shopping that combines live streaming with E-commerce activity has become increasingly popular. The streamers introduce products and interact with their audiences, and hence greatly improve product sales. Despite successful applications in industry, live stream E-commerce has not been well studied in the data science community. To fill this gap, we investigate this brand-new scenario and collect a real-world Live Stream E-Commerce (LSEC) dataset. Different from conventional E-commerce activities, the streamers play a pivotal role in LSEC events. Hence, the key is to make full use of the rich interaction information among streamers, users, and products. We first conduct data analysis on the tripartite interaction data and quantify the streamer's influence on users' purchase behavior. Based on the analysis results, we model the tripartite information as a heterogeneous graph, which can be decomposed into multiple bipartite graphs to better capture the influence. We propose a novel Live Stream E-Commerce Graph Neural Network framework (LSEC-GNN) to learn the node representations of each bipartite graph, and further design a multi-task learning approach to improve product recommendation. Extensive experiments on two real-world datasets with different scales show that our method can significantly outperform various baseline approaches.

AliCG: Fine-grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba

Conceptual graphs, a particular type of knowledge graph, play an essential role in semantic search. Prior conceptual graph construction approaches typically extract high-frequency, coarse-grained, and time-invariant concepts from formal texts such as Wikipedia. In real applications, however, it is necessary to extract less-frequent, fine-grained, and time-varying conceptual knowledge and build the taxonomy in an evolving manner. In this paper, we introduce an approach to implementing and deploying the conceptual graph at Alibaba. Specifically, we propose a framework called AliCG which is capable of a) extracting fine-grained concepts with a novel bootstrapping-with-alignment-consensus approach, b) mining long-tail concepts with a novel low-resource phrase mining approach, and c) updating the graph dynamically via a concept distribution estimation method based on implicit and explicit user behaviors. We have deployed the conceptual graph at Alibaba UC Browser. Extensive offline evaluation as well as online A/B testing demonstrate the efficacy of our approach.

Talent Demand Forecasting with Attentive Neural Sequential Model

To cope with the fast-evolving business trend, it becomes critical for companies to continuously review their talent recruitment strategies through timely forecasts of talent demand in the recruitment market. While many efforts have been made on recruitment market analysis, due to the sparsity of fine-grained talent demand time series and the complex temporal correlations of the recruitment market, there is still no effective approach for fine-grained talent demand forecasting that can quantitatively model the dynamics of the recruitment market. To this end, in this paper, we propose a data-driven neural sequential approach, namely the Talent Demand Attention Network (TDAN), for forecasting fine-grained talent demand in the recruitment market. Specifically, we first propose to augment the univariate time series of talent demand at multiple levels of granularity and extract intrinsic attributes of both companies and job positions with matrix factorization techniques. Then, we design a Mixed Input Attention module to capture company trends and industry trends, alleviating the sparsity of fine-grained talent demand. Meanwhile, we design a Relation Temporal Attention module for modeling the complex temporal correlations that change with company and position. Finally, extensive experiments on a real-world recruitment dataset clearly validate the effectiveness of our approach for fine-grained talent demand forecasting, as well as its interpretability for modeling recruitment trends. In particular, TDAN has been deployed as an important functional component of a cooperative partner's intelligent recruitment system.

AsySQN: Faster Vertical Federated Learning Algorithms with Better Computation Resource Utilization

Vertical federated learning (VFL) is an effective paradigm for training emerging cross-organizational (e.g., different corporations, companies and organizations) collaborative learning with privacy preservation. Stochastic gradient descent (SGD) methods are popular choices for training VFL models because of their low per-iteration computation. However, existing SGD-based VFL algorithms are communication-expensive due to a large number of communication rounds. Meanwhile, most existing VFL algorithms use synchronous computation, which seriously hampers computation resource utilization in real-world applications. To address the challenges of communication and computation resource utilization, we propose an asynchronous stochastic quasi-Newton (AsySQN) framework for VFL, under which three algorithms, i.e., AsySQN-SGD, -SVRG and -SAGA, are proposed. The proposed AsySQN-type algorithms take descent steps scaled by approximate Hessian information (without calculating the inverse Hessian matrix explicitly), converge much faster than SGD-based methods in practice, and thus can dramatically reduce the number of communication rounds. Moreover, the adopted asynchronous computation makes better use of computation resources. We theoretically prove the convergence rates of our proposed algorithms for strongly convex problems. Extensive numerical experiments on real-world datasets demonstrate the lower communication costs and better computation resource utilization of our algorithms compared with state-of-the-art VFL algorithms.

MEOW: A Space-Efficient Nonparametric Bid Shading Algorithm

Bid Shading has become increasingly important in Online Advertising, with a large amount of commercial [4,12,13,29] and research work [11,20,28] recently published. Most approaches for solving the bid shading problem involve estimating the probability of win distribution, and then maximizing surplus [28]. These generally use parametric assumptions for the distribution, and there has been some discussion as to whether Log-Normal, Gamma, Beta, or other distributions are most effective [8,38,41,44]. In this paper, we show evidence that online auctions generally diverge in interesting ways from classic distributions. In particular, real auctions generally exhibit significant structure, due to the way that humans set up campaigns and inventory floor prices [16,26]. Using these insights, we present a nonparametric method for Bid Shading which enables the exploitation of this deep structure. The algorithm has low time and space complexity, and is designed to operate within the challenging millisecond Service Level Agreements of Real-Time Bid Servers. We deploy it in one of the largest Demand Side Platforms in the United States, and show that it reliably out-performs best in class Parametric benchmarks. We conclude by suggesting some ways that the best aspects of parametric and nonparametric approaches could be combined.
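
A minimal nonparametric version of the surplus-maximization step might look like the sketch below, where the win curve is simply the empirical CDF of historically observed clearing prices rather than a fitted Log-Normal/Gamma/Beta distribution; treating clearing prices as fully observed (uncensored) and the grid search over bids are simplifying assumptions, not MEOW's algorithm.

```python
import numpy as np

def empirical_win_curve(observed_prices, grid):
    """Nonparametric win curve: P(win | bid) is the empirical CDF of historically
    observed clearing prices, so structure such as floor-price spikes is preserved."""
    prices = np.sort(np.asarray(observed_prices, dtype=float))
    return np.searchsorted(prices, grid, side="right") / len(prices)

def shade_bid(value, observed_prices, n_grid=100):
    """Pick the bid maximizing expected surplus (value - bid) * P(win | bid)."""
    grid = np.linspace(0.0, value, n_grid)
    surplus = (value - grid) * empirical_win_curve(observed_prices, grid)
    return grid[int(np.argmax(surplus))]

rng = np.random.default_rng(1)
# synthetic clearing prices with a spike at a common inventory floor price
clearing_prices = np.r_[rng.normal(1.0, 0.1, 400), np.full(100, 1.25)]
print(round(shade_bid(value=2.0, observed_prices=clearing_prices), 3))
```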

MugRep: A Multi-Task Hierarchical Graph Representation Learning Framework for Real Estate Appraisal

Real estate appraisal refers to the process of developing an unbiased opinion of a real property's market value, which plays a vital role in decision-making for various players in the marketplace (e.g., real estate agents, appraisers, lenders, and buyers). However, accurate real estate appraisal is a non-trivial task because of three major challenges: (1) the complicated influencing factors for property value; (2) the asynchronously spatiotemporal dependencies among real estate transactions; (3) the diversified correlations between residential communities. To this end, we propose a Multi-Task Hierarchical Graph Representation Learning (MugRep) framework for accurate real estate appraisal. Specifically, by acquiring and integrating multi-source urban data, we first construct a rich feature set to profile the real estate from multiple perspectives (e.g., geographical distribution, human mobility distribution, and resident demographics distribution). Then, an evolving real estate transaction graph and a corresponding event graph convolution module are proposed to incorporate asynchronously spatiotemporal dependencies among real estate transactions. Moreover, to further incorporate valuable knowledge from the view of residential communities, we devise a hierarchical heterogeneous community graph convolution module to capture diversified correlations between residential communities. Finally, an urban district partitioned multi-task learning module is introduced to generate differently distributed value opinions for real estate. Extensive experiments on two real-world datasets demonstrate the effectiveness of MugRep and its components and features.

HALO: Hierarchy-aware Fault Localization for Cloud Systems

A typical cloud system has a large amount of telemetry data collected by pervasive software monitors that keep tracking the health status of the system. The telemetry data is essentially multi-dimensional data, which contains attributes and failure/success status of the system being monitored. By identifying the attribute value combinations where the failures are mostly concentrated (which we call fault-indicating combination), we can localize the cause of system failures into a smaller scope, thus facilitating fault diagnosis. However, due to the combinatorial explosion problem and the latent hierarchical structure in cloud telemetry data, it is still intractable to localize the fault to a proper granularity in an efficient way. In this paper, we propose HALO, a hierarchy-aware fault localization approach for locating the fault-indicating combinations from telemetry data. Our approach automatically learns the hierarchical relationship among attributes and leverages the hierarchy structure for precise and efficient fault localization. We have evaluated HALO on both industrial and synthetic datasets and the results confirm that HALO outperforms the existing methods. Furthermore, we have successfully deployed HALO to different services in Microsoft Azure and Microsoft 365, witnessed its impact in real-world practice.

AutoLoss: Automated Loss Function Search in Recommendations

Designing an effective loss function plays a crucial role in training deep recommender systems. Most existing works often leverage a predefined and fixed loss function that could lead to suboptimal recommendation quality and training efficiency. Some recent efforts rely on exhaustively or manually searched weights to fuse a group of candidate loss functions, which is exceptionally costly in computation and time. They also neglect the various convergence behaviors of different data examples. In this work, we propose an AutoLoss framework that can automatically and adaptively search for the appropriate loss function from a set of candidates. To be specific, we develop a novel controller network, which can dynamically adjust the loss probabilities in a differentiable manner. Unlike existing algorithms, the proposed controller can adaptively generate the loss probabilities for different data examples according to their varied convergence behaviors. Such a design improves the model's generalizability and transferability between deep recommender systems and datasets. We evaluate the proposed framework on two benchmark datasets. The results show that AutoLoss outperforms representative baselines. Further experiments have been conducted to deepen our understanding of AutoLoss, including its transferability, components and training efficiency.
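
A toy version of the controller idea: a small network maps a per-example state to a softmax over candidate losses, and the training loss is the probability-weighted combination, so gradients flow through both the recommender outputs and the controller. The two-candidate setup (BCE and MSE), the state features, and the joint update are illustrative assumptions rather than the AutoLoss framework itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LossController(nn.Module):
    """Toy controller mapping a per-example state (prediction, label) to
    probabilities over candidate losses."""
    def __init__(self, n_losses=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, n_losses))

    def forward(self, y_pred, y_true):
        state = torch.stack([y_pred, y_true], dim=1)      # (B, 2)
        return F.softmax(self.net(state), dim=1)          # (B, n_losses)

def auto_loss(y_pred, y_true, controller):
    # candidate losses computed per example, then adaptively weighted
    bce = F.binary_cross_entropy(y_pred, y_true, reduction="none")
    mse = F.mse_loss(y_pred, y_true, reduction="none")
    probs = controller(y_pred, y_true)
    return (probs[:, 0] * bce + probs[:, 1] * mse).mean()

controller = LossController()
y_pred = torch.rand(8, requires_grad=True)
y_true = torch.randint(0, 2, (8,)).float()
loss = auto_loss(y_pred, y_true, controller)
loss.backward()
print(loss.item())
```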

Incorporating Prior Financial Domain Knowledge into Neural Networks for Implied Volatility Surface Prediction

In this paper we develop a novel neural network model for predicting the implied volatility surface, taking prior financial domain knowledge into account. A new activation function that incorporates the volatility smile is proposed and used for the hidden nodes that process the underlying asset price. In addition, financial conditions, such as the absence of arbitrage, the boundaries, and the asymptotic slope, are embedded into the loss function. This is one of the very first studies to discuss a methodological framework that incorporates prior financial domain knowledge into neural network architecture design and model training. The proposed model outperforms the benchmark models on S&P 500 index option data spanning 20 years. More importantly, the domain knowledge is satisfied empirically, showing that the model is consistent with the existing financial theories and conditions related to the implied volatility surface.
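
One common way to embed a no-arbitrage condition into the loss is a soft penalty on violations; the sketch below penalizes total implied variance sigma^2 * T decreasing in maturity (a standard calendar-spread condition) on top of a data-fit term. This particular penalty and its weighting are assumptions for illustration and do not reproduce the paper's full set of financial constraints or its smile-aware activation.

```python
import torch

def iv_surface_loss(pred_iv, target_iv, maturities, lambda_cal=1.0):
    """Data fit plus a soft no-calendar-arbitrage penalty: total implied variance
    sigma^2 * T should be non-decreasing in maturity for fixed moneyness.
    pred_iv / target_iv: (n_strikes, n_maturities); maturities sorted ascending."""
    fit = torch.mean((pred_iv - target_iv) ** 2)
    total_var = pred_iv ** 2 * maturities                             # broadcast over strikes
    cal_violation = torch.relu(total_var[:, :-1] - total_var[:, 1:])  # decreases are violations
    return fit + lambda_cal * cal_violation.mean()

strikes, mats = 5, 4
pred = torch.rand(strikes, mats, requires_grad=True) * 0.3 + 0.1
target = torch.rand(strikes, mats) * 0.3 + 0.1
T = torch.tensor([0.25, 0.5, 1.0, 2.0])
loss = iv_surface_loss(pred, target, T)
loss.backward()
print(round(loss.item(), 4))
```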

AutoSmart: An Efficient and Automatic Machine Learning Framework for Temporal Relational Data

Temporal relational data, perhaps the most commonly used data type in industrial machine learning applications, needs labor-intensive feature engineering and data analysis to give precise model predictions. An automatic machine learning framework is needed to ease the manual effort in fine-tuning models so that experts can focus on problems that really need human engagement, such as problem definition, deployment, and business services. However, there are three main challenges in building automatic solutions for temporal relational data: 1) how to effectively and automatically mine useful information from multiple tables and the relations among them? 2) how to self-adjust to control time and memory consumption within a certain budget? and 3) how to give generic solutions to a wide range of tasks? In this work, we propose a solution that successfully addresses the above issues in an end-to-end automatic way. The proposed framework, AutoSmart, is the winning solution to the KDD Cup 2019 AutoML Track, one of the largest AutoML competitions to date (860 teams with around 4,955 submissions). The framework includes automatic data processing, table merging, feature engineering, and model tuning, with a time and memory controller for efficiently and automatically formulating the models. The proposed framework outperforms the baseline solution significantly on several datasets in various domains.

Contrastive Learning for Debiased Candidate Generation in Large-Scale Recommender Systems

Deep candidate generation (DCG) that narrows down the collection of relevant items from billions to hundreds via representation learning has become prevalent in industrial recommender systems. Standard approaches approximate maximum likelihood estimation (MLE) through sampling for better scalability and address the problem of DCG in a way similar to language modeling. However, live recommender systems face severe exposure bias and have a vocabulary several orders of magnitude larger than that of natural language, implying that MLE will preserve and even exacerbate the exposure bias in the long run in order to faithfully fit the observed samples. In this paper, we theoretically prove that a popular choice of contrastive loss is equivalent to reducing the exposure bias via inverse propensity weighting, which provides a new perspective for understanding the effectiveness of contrastive learning. Based on the theoretical discovery, we design CLRec, a contrastive learning method to improve DCG in terms of fairness, effectiveness and efficiency in recommender systems with extremely large candidate size. We further improve upon CLRec and propose Multi-CLRec, for accurate multi-intention aware bias reduction. Our methods have been successfully deployed in Taobao, where at least four-month online A/B tests and offline analyses demonstrate its substantial improvements, including a dramatic reduction in the Matthew effect.
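
The kind of contrastive loss analyzed here can be sketched as an in-batch sampled softmax, where each user's interacted item is the positive and the other items in the batch serve as negatives, so frequently exposed items are also proposed as negatives more often, which is the inverse-propensity-weighting effect the paper studies. The temperature, normalization, and in-batch negative sampling below are illustrative choices rather than CLRec's deployed configuration.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(user_emb, item_emb, tau=0.07):
    """In-batch-negative contrastive loss for deep candidate generation:
    row i's positive is item i; all other items in the batch act as negatives."""
    u = F.normalize(user_emb, dim=1)
    v = F.normalize(item_emb, dim=1)
    logits = u @ v.t() / tau                       # (B, B) scaled similarities
    labels = torch.arange(u.size(0), device=u.device)
    return F.cross_entropy(logits, labels)

u = torch.randn(16, 64)    # user/query encoder outputs
v = torch.randn(16, 64)    # embeddings of the items each user interacted with
print(in_batch_contrastive_loss(u, v).item())
```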

An Efficient Deep Distribution Network for Bid Shading in First-Price Auctions

Since 2019, most ad exchanges and sell-side platforms (SSPs) in the online advertising industry have shifted from second-price to first-price auctions. Due to the fundamental difference between these auctions, demand-side platforms (DSPs) have had to update their bidding strategies to avoid bidding unnecessarily high and hence overpaying. Bid shading was proposed to adjust the bid price intended for second-price auctions, in order to balance cost and winning probability in a first-price auction setup. In this study, we introduce a novel deep distribution network for optimal bidding in both open (non-censored) and closed (censored) online first-price auctions. Offline and online A/B testing results show that our algorithm outperforms previous state-of-the-art algorithms in terms of both surplus and effective cost per action (eCPX) metrics. Furthermore, the algorithm is optimized for run-time and has been deployed into the VerizonMedia DSP as a production algorithm, serving hundreds of billions of bid requests per day. Online A/B tests show that advertisers' ROI improves by +2.4%, +2.4%, and +8.6% for impression-based (CPM), click-based (CPC), and conversion-based (CPA) campaigns, respectively.
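
As a simplified illustration of the bid shading objective (not the proposed deep distribution network), the sketch below picks the bid that maximizes expected surplus P(win | bid) * (value - bid) under a toy distribution of the highest competing bid.

```python
import numpy as np
from scipy.stats import lognorm

def optimal_shaded_bid(value, win_cdf, grid):
    """Pick the bid that maximizes expected surplus P(win | bid) * (value - bid),
    where win_cdf estimates the CDF of the highest competing bid."""
    surplus = win_cdf(grid) * (value - grid)
    return grid[np.argmax(surplus)]

# Toy example: competing bids ~ lognormal, private value of $10 CPM.
competitor = lognorm(s=0.5, scale=5.0)
grid = np.linspace(0.01, 10.0, 1000)
print(optimal_shaded_bid(10.0, competitor.cdf, grid))
```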

Learning to Expand Audience via Meta Hybrid Experts and Critics for Recommendation and Advertising

In recommender systems and advertising platforms, marketers want to deliver products, content, or advertisements to potential audiences over media channels such as display, video, or social. Given a set of audiences or customers (seed users), the audience expansion technique (look-alike modeling) is a promising solution for identifying more potential audiences who are similar to the seed users and likely to achieve the business goal of the target campaign. However, look-alike modeling faces two challenges: (1) In practice, a company may run hundreds of marketing campaigns every day to promote content in completely different categories, e.g., sports, politics, and society, so it is difficult to use a single common method to expand audiences for all campaigns. (2) The seed set of a given campaign may cover only a limited number of users, so a customized approach based on such a seed set is prone to overfitting.

In this paper, to address these challenges, we propose a novel two-stage framework named Meta Hybrid Experts and Critics (MetaHeac), which has been deployed in the WeChat Look-alike System. In the offline stage, a general model that can capture the relationships among various tasks is trained from a meta-learning perspective on all existing campaign tasks. In the online stage, for a new campaign, a customized model is learned from the given seed set based on the general model. In both offline and online experiments, the proposed MetaHeac shows superior effectiveness for both content marketing campaigns in recommender systems and advertising campaigns on advertising platforms. Moreover, MetaHeac has been successfully deployed in WeChat for the promotion of both content and advertisements, leading to a great improvement in the quality of marketing. The code is available at https://github.com/easezyc/MetaHeac.
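
A minimal sketch of this two-stage workflow (illustrative only; the actual MetaHeac model uses hybrid experts and critics rather than a single scorer): a shared model is trained offline across historical campaign tasks, and a campaign-specific copy is fine-tuned online on the new campaign's seed set.

```python
import copy
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Toy audience scorer; MetaHeac's architecture is more elaborate."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def offline_stage(tasks, dim, epochs=5):
    """tasks: list of (features, labels) pairs, one per historical campaign;
    labels are {0, 1} floats indicating campaign engagement."""
    model = Scorer(dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for x, y in tasks:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def online_stage(general_model, seed_x, seed_y, steps=20):
    """Fine-tune a copy of the general model on the new campaign's seed set."""
    model = copy.deepcopy(general_model)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(seed_x), seed_y).backward()
        opt.step()
    return model
```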

Pre-trained Language Model based Ranking in Baidu Search

As the heart of a search engine, the ranking system plays a crucial role in satisfying users' information needs. Recently, neural rankers fine-tuned from pre-trained language models (PLMs) have established state-of-the-art ranking effectiveness. However, it is nontrivial to directly apply these PLM-based rankers to a large-scale web search system, due to the following challenges: (1) the prohibitively expensive computation of massive neural PLMs, especially for the long texts of web documents, prevents their deployment in an online ranking system that demands extremely low latency; (2) the discrepancy between existing ranking-agnostic pre-training objectives and ad-hoc retrieval scenarios that demand comprehensive relevance modeling is another main barrier to improving the online ranking system; (3) a real-world search engine typically involves a committee of ranking components, so the compatibility of an individually fine-tuned ranking model is critical for a cooperative ranking system. In this work, we contribute a series of successfully applied techniques for tackling these issues when deploying the state-of-the-art Chinese pre-trained language model, ERNIE, in an online search engine system. We first describe a novel practice to cost-efficiently summarize the web document and contextualize the resultant summary content with the query using a cheap yet powerful Pyramid-ERNIE architecture. Then we introduce an innovative paradigm to finely exploit large-scale noisy and biased post-click behavioral data for relevance-oriented pre-training. We also propose a human-anchored fine-tuning strategy tailored for the online ranking system, aiming to stabilize the ranking signals across various online components. Extensive offline and online experimental results show that the proposed techniques significantly boost the search engine's performance.
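
To make the summarization step concrete, here is a hypothetical query-biased summarizer (not Baidu's production code): it keeps the document sentences with the highest term overlap with the query, and the resulting short summary is what a PLM-based ranker would then score against the query instead of the full document.

```python
def query_biased_summary(query, doc_sentences, k=3):
    """Keep the k sentences with the highest word overlap with the query;
    the ranker scores the (query, summary) pair rather than the long
    document (a cheap sketch, not the production summarizer)."""
    q_terms = set(query.lower().split())
    scored = sorted(doc_sentences,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    return " ".join(scored[:k])

print(query_biased_summary("pre-trained language model ranking",
                           ["Pre-trained language models improve ranking.",
                            "The weather was nice.",
                            "Ranking systems need low latency."], k=2))
```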

SESSION: Tutorial Overviews

Software as a Medical Device: Regulating AI in Healthcare via Responsible AI

With the increased adoption of AI in healthcare, there is a growing recognition of and demand for regulating AI in healthcare to avoid potential harm and unfair bias against vulnerable populations. Around a hundred governmental bodies and commissions, as well as leaders in the tech sector, have proposed principles for creating responsible AI systems. However, most of these proposals are short on specifics, which has led to charges of ethics washing. In this tutorial we offer a guide to help navigate complex governmental regulations and explain the various practical elements of a responsible AI system in healthcare in light of the proposed regulations. Additionally, we break down and emphasize that the recommendations from regulatory bodies like the FDA or the EU are necessary but not sufficient elements of creating a responsible AI system. We elucidate how regulations and guidelines often focus on epistemic concerns to the detriment of practical concerns, e.g., a requirement for fairness without explicating what fairness constitutes for a use case. The FDA's Software as a Medical Device document and the EU's GDPR, among other AI governance documents, speak to the need for implementing sufficiently good machine learning practices. In this tutorial we elucidate what that would mean in practice for real-world use cases in healthcare throughout the machine learning lifecycle, i.e., data management, data specification, feature engineering, model evaluation, model specification, model explainability, model fairness, reproducibility, and checks for data leakage and model leakage. We note that conceptualizing responsible AI as a process rather than an end goal accords well with how AI systems are used in practice. We also discuss how a domain-centric stakeholder perspective translates into balancing requirements across multiple competing optimization criteria.

Data Science on Blockchains

Blockchain technology garners ever-increasing interest from researchers in various domains that benefit from scalable cooperation among trustless parties. As blockchains and their applications proliferate, so do the complexity and volume of the data they store. Analyzing this data has emerged as an important research topic, already leading to methodological advancements in the information sciences.

In this tutorial, we offer a holistic view of applied data science on blockchains. Starting with the core components of a blockchain, we will detail the state of the art in blockchain data analytics for the graph, security, and finance domains. Our examples will answer questions such as: how to parse, extract, and clean the data stored in blockchains; how to store and query blockchain data; and what features can be computed from blockchains.
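
As a small taste of the feature-computation part, the sketch below (a toy example, not tied to any specific chain or parser) derives simple per-address activity features from a transaction edge list using networkx.

```python
import networkx as nx

# Toy transaction edge list: (sender, receiver, amount).
edges = [("addr1", "addr2", 0.5), ("addr2", "addr3", 0.2), ("addr1", "addr3", 1.0)]

G = nx.DiGraph()
G.add_weighted_edges_from(edges)

# Per-address graph features: counts of outgoing/incoming transfers and
# total amount sent (illustrative; real pipelines compute far richer features).
features = {node: {"out_degree": G.out_degree(node),
                   "in_degree": G.in_degree(node),
                   "total_sent": G.out_degree(node, weight="weight")}
            for node in G.nodes}
print(features)
```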

We will share tutorial notes, collected meta-information, and further reading pointers on our tutorial website at https://blockchaintutorial.github.io/

Online Advertising Incrementality Testing And Experimentation: Industry Practical Lessons

Online advertising has historically been approached as user targeting and ad-to-user matching problems within sophisticated optimization algorithms. As the research area and the ad tech industry have progressed over the last couple of decades, advertisers have increasingly emphasized the causal effect estimation of their ads (aka incrementality) using controlled experiments (A/B testing). Even though observational approaches have been derived in marketing science since the 80s, including media mix models, the availability of online advertising personalization has enabled the deployment of more rigorous randomized controlled experiments with millions of individuals. These evolutions in marketing science, online advertising, and the ad tech industry have posed incredible challenges for engineers, data scientists, and marketers alike. With low effect percentage differences (or lift) and often sparse conversion rates, developing incrementality testing platforms at scale presents tremendous engineering challenges in measurement precision and implementation detail. Similarly, correctly interpreting results that address a business goal within the marketing science domain requires significant data science and experimentation research expertise. All of these challenges are compounded by the ongoing evolution of the online advertising industry and the heterogeneity of its sources (social, paid search, native, programmatic, etc.). In this tutorial, we present a practical, grounded view of the incrementality testing landscape, including: (1) the business need; (2) solutions in the literature; (3) design choices in the development of an incrementality testing platform; (4) the testing cycle, case studies, and recommendations for effective results delivery; and (5) the evolution of incrementality testing in the industry. We will provide first-hand lessons on developing and operationalizing such a platform in a major combined DSP and ad network, based on running tens of experiments lasting up to two months each over the last couple of years.
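
For concreteness, here is a minimal sketch of the basic incrementality readout (a simple normal-approximation estimate, not the platform's actual estimator): the relative lift in conversion rate between treatment and control, plus a confidence interval on the absolute difference.

```python
import numpy as np
from scipy import stats

def lift_with_ci(conv_t, n_t, conv_c, n_c, alpha=0.05):
    """Relative lift in conversion rate between treatment and control, with a
    normal-approximation CI on the absolute difference (illustrative sketch)."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    diff = p_t - p_c
    se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = stats.norm.ppf(1 - alpha / 2)
    return {"lift": diff / p_c, "diff_ci": (diff - z * se, diff + z * se)}

# Toy numbers: sparse conversions over a million users per arm.
print(lift_with_ci(conv_t=1200, n_t=1_000_000, conv_c=1100, n_c=1_000_000))
```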

Creating Recommender Systems Datasets in Scientific Fields

Recommender systems (RS) have been successfully explored in a vast number of domains, e.g., movies and TV shows, music, and e-commerce. In these domains, a large number of datasets are freely available for testing and evaluating new recommender algorithms, for example the MovieLens and Netflix datasets for movies, Spotify for music, and Amazon for e-commerce, which translates into a large number of algorithms applied to these fields. In scientific fields, such as health and chemistry, standard and open-access datasets with information about the preferences of users are scarce. First, it is important to understand the application domain, i.e., what the recommended item is. Second, who the end users are: researchers, pharmacists, clinicians, or policy makers. Third, the availability of data. Thus, if we wish to develop an algorithm for recommending scientific items, we do not have access to datasets with information about the past preferences of a group of users. Given this limitation, we developed a methodology called LIBRETTI - LIterature Based RecommEndaTion of scienTific Items - whose goal is the creation of <user, item, rating> datasets for scientific fields. These datasets are created from the major resource of knowledge that science has: scientific literature. We consider the users to be the authors of the publications, the items to be the scientific entities (for example, chemical compounds or diseases), and the ratings to be the number of publications an author wrote about an entity. In this tutorial we will cover state-of-the-art recommender systems in scientific fields, explain what Named Entity Recognition/Linking (NER/NEL) in research literature is, and demonstrate how to create a dataset for recommending drugs and diseases from research literature related to COVID-19. Our goal is to spread the use of the LIBRETTI methodology in order to help the development of recommender algorithms in scientific fields. More information about the tutorial is available at https://lasigebiotm.github.io/RecSys.Scifi/.
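
A minimal sketch of the <user, item, rating> construction described above (field names are illustrative; in LIBRETTI the entities would come from NER/NEL over the publication text):

```python
from collections import Counter

def build_ratings(publications):
    """Users are authors, items are scientific entities, and the rating is
    how many publications an author wrote mentioning the entity.
    `publications` is a list of {"authors": [...], "entities": [...]}."""
    counts = Counter()
    for pub in publications:
        for author in pub["authors"]:
            for entity in set(pub["entities"]):
                counts[(author, entity)] += 1
    return [(author, entity, rating) for (author, entity), rating in counts.items()]

# Toy COVID-19-flavoured example.
pubs = [
    {"authors": ["A. Silva"], "entities": ["remdesivir", "SARS-CoV-2"]},
    {"authors": ["A. Silva", "B. Costa"], "entities": ["remdesivir"]},
]
print(build_ratings(pubs))
```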

Challenges in KDD and ML for Sustainable Development

Artificial intelligence and machine learning techniques offer powerful tools for addressing the greatest challenges facing humanity and helping society adapt to a rapidly changing climate, respond to disasters and pandemic crises, and reach the United Nations (UN) Sustainable Development Goals (SDGs) by 2030. In recent approaches to mitigation and adaptation, data analytics and ML are only one part of a solution that requires interdisciplinary and methodological research and innovation. For example, challenges include multi-modal and multi-source data fusion to combine satellite imagery with other relevant data, handling noisy and missing ground data at various spatio-temporal scales, and ensembling multiple physical and ML models to improve prediction accuracy. Despite recognized successes, there are many areas where ML is not applicable, performs poorly, or gives insights that are not actionable. This tutorial will survey recent and significant contributions in KDD and ML for sustainable development and will highlight current challenges that need to be addressed to transform and equip engaged sustainability science with robust ML-based tools to support actionable decision-making for a more sustainable future.

Explainability for Natural Language Processing

This lecture-style tutorial, which mixes in an interactive literature browsing component, is intended for the many researchers and practitioners working with text data and on applications of natural language processing (NLP) in data science and knowledge discovery. The focus of the tutorial is on the issues of transparency and interpretability as they relate to building models for text and their applications to knowledge discovery. As black-box models have gained popularity for a broad range of tasks in recent years, both the research and industry communities have begun developing new techniques to render them more transparent and interpretable. Reporting from an interdisciplinary team of social science, human-computer interaction (HCI), and NLP/knowledge management researchers, our tutorial has two components: an introduction to explainable AI (XAI) in the NLP domain and a review of the state-of-the-art research; and findings from a qualitative interview study of individuals working on real-world NLP projects as applied to various knowledge extraction and discovery tasks at a large, multinational technology and consulting corporation. The first component will introduce core concepts related to explainability in NLP. Then, we will discuss explainability for NLP tasks and report on a systematic literature review of the state-of-the-art literature in AI, NLP, and HCI conferences. The second component reports on our qualitative interview study, which identifies practical challenges and concerns that arise in real-world development projects that require the modeling and understanding of text data.

Machine Learning Explainability and Robustness: Connected at the Hip

This tutorial examines the synergistic relationship between explainability methods for machine learning and a significant problem related to model quality: robustness against adversarial perturbations. We begin with a broad overview of approaches to explainable AI, before narrowing our focus to post-hoc explanation methods for predictive models. We discuss perspectives on what constitutes a "good" explanation in various settings, with an emphasis on axiomatic justifications for various explanation methods. In doing so, we will highlight the importance of an explanation method's faithfulness to the target model, as this property allows one to distinguish between explanations that are unintelligible because of the method used to produce them, and cases where a seemingly poor explanation points to model quality issues. Next, we introduce concepts surrounding adversarial robustness, including adversarial attacks as well as a range of corresponding state-of-the-art defenses. Finally, building on the knowledge presented thus far, we present key insights from the recent literature on the connections between explainability and robustness, showing that many commonly-perceived explainability issues may be caused by non-robust model behavior. Accordingly, a careful study of adversarial examples and robustness can lead to models whose explanations better appeal to human intuition and domain knowledge.
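
As a compact illustration of this connection (using standard definitions, not any specific paper covered in the tutorial): the input gradient yields both a simple saliency explanation and the direction of an FGSM adversarial perturbation, so brittle gradients show up simultaneously as noisy explanations and as easy attacks.

```python
import torch

def saliency_and_fgsm(model, x, y, eps=0.03):
    """The input gradient serves both as a simple saliency map and as the
    direction of an FGSM adversarial perturbation (sketch under standard
    definitions)."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    saliency = x.grad.abs()                      # per-feature attribution
    x_adv = (x + eps * x.grad.sign()).detach()   # FGSM-perturbed input
    return saliency, x_adv
```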

Fairness and Explanation in Clustering and Outlier Detection

As machines move towards replacing humans in decision making, the need to make intelligent systems transparent (explainable and fair) becomes paramount. However, fairness and explanation remain understudied problems for unsupervised learning, with a recent survey on explanation not covering the topic and the seminal papers on fairness appearing only in 2017. The work on outlier detection is even more recent, appearing only in the last year. The need for transparency in unsupervised learning is greater than in supervised learning, as the lack of supervision means there is no extrinsic measure of why a given model was chosen. Hence there is more room to be unfair and a greater demand for explanation. In this tutorial we consider fairness and explanation for classic unsupervised learning methods that are used extensively in data mining. The majority of published work is on clustering, but we will also cover newer work on unsupervised outlier detection. We will cover both explanation and fairness from multiple perspectives. We begin with the philosophical, legal, and ethical motivations for what we are trying to achieve with fairness and explanation. Then we move on to rigorous formal definitions of these problems and algorithmic solutions along with their limitations. We then overview example applications and future work.

New Frontiers of Multi-Network Mining: Recent Developments and Future Trend

Networks (i.e., graphs) are often collected from multiple sources and platforms, such as social networks extracted from multiple online platforms, team-specific collaboration networks within an organization, and inter-dependent infrastructure networks. Networks from different sources form multi-networks, which can exhibit unique patterns that are invisible if we mine each individual network separately. However, compared with single-network mining, multi-network mining is still under-explored due to its unique challenges. First (multi-network models), networks under different circumstances can be modeled