Taking into account relational structure during data mining can lead to better results, both in terms of quality and computational efficiency. This structure may be captured in the schema, in links between entities (e.g., graphs) or in rules describing the domain (e.g., knowledge graphs). Further, for richly structured prediction problems, there is often a need for a mix of both logical reasoning and statistical inference. In this talk, I will give an introduction to the field of Statistical Relational Learning (SRL), and I'll identify useful tips and tricks for exploiting structure in both the input and output space. I'll describe our recent work on highly scalable approaches for statistical relational inference. I'll close by introducing a broader interpretation of relational thinking that reveals new research opportunities (and challenges!).
With the maturing of AI and multiagent systems research, we have a tremendous opportunity to direct these advances towards addressing complex societal problems. I will focus on the domains of public health and conservation, and address one key cross-cutting challenge: how to effectively deploy our limited intervention resources in these problem domains. I will present results from work around the globe in using AI for challenges in public health, such as maternal and child care interventions and HIV prevention, and in conservation, such as endangered wildlife protection. Achieving social impact in these domains often requires methodological advances. To that end, I will highlight key research advances in multiagent reasoning and learning, in particular in restless multi-armed bandits, influence maximization in social networks, computational game theory, and decision-focused learning. In pushing this research agenda, our ultimate goal is to enable local communities and non-profits to directly benefit from advances in AI tools and techniques.
What are data and network models? What are efficient algorithms? What are meaningful solutions? Big Data, Network Sciences, and Machine Learning have fundamentally challenged the basic characterizations in computing, from the conventional graph-theoretical modeling of networks to the traditional polynomial-time worst-case measures of efficiency.
Representation learning of protein 3D structures is challenging and essential for applications such as computational protein design and protein engineering. Recently, geometric deep learning has achieved great success in non-Euclidean domains. Although a protein can naturally be represented as a graph, this representation remains under-explored, mainly due to the significant challenges in modeling the complex representations and capturing the inherent correlations in 3D structure modeling. Several challenges include: 1) it is difficult to extract and preserve multi-level rotation and translation equivariant information during learning; 2) it is difficult to develop appropriate tools to effectively leverage the input spatial representations to capture complex geometries across the spatial dimension; and 3) it is difficult to incorporate various geometric features while preserving the inherent structural relations. In this work, we introduce the geometric bottleneck perceptron and a general SO(3)-equivariant message passing neural network built on top of it for protein structure representation learning. The proposed geometric bottleneck perceptron can be incorporated into diverse network architecture backbones to process geometric data in different domains. This research sheds new light on geometric deep learning for 3D structure studies. Empirically, we demonstrate the strength of our proposed approach on three core downstream tasks, where our model achieves significant improvements and outperforms existing benchmarks. The implementation is available at https://github.com/sarpaykent/GBPNet.
Multi-task learning (MTL) is a framework in which multiple learning tasks share knowledge to improve their generalization abilities. While shallow multi-task learning can learn task relations, it can only handle pre-defined features. Modern deep multi-task learning can jointly learn latent features and task sharing, but the task relations it learns are opaque. Moreover, such methods pre-define which layers and neurons should be shared across tasks and cannot learn this adaptively. To address these challenges, this paper proposes a new multi-task learning framework that jointly learns latent features and explicit task relations, complementing the strengths of existing shallow and deep multi-task learning scenarios. Specifically, we propose to model the task relation as the similarity between tasks' input gradients, with a theoretical analysis of their equivalency. In addition, we propose a multi-task learning objective that explicitly learns task relations through a new regularizer. Theoretical analysis shows that the generalization error is reduced thanks to the proposed regularizer. Extensive experiments on several multi-task learning and image classification benchmarks demonstrate the proposed method's effectiveness, efficiency, and the reasonableness of the learned task relation patterns.
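To make the gradient-based notion of task relation concrete, the following is a minimal PyTorch sketch under our own simplifying assumptions (a shared encoder with per-task heads, cosine similarity of per-task input gradients as the relation score, and a simple dissimilarity penalty); it does not reproduce the paper's actual objective, regularizer, or equivalence analysis.

```python
import torch
import torch.nn.functional as F

def task_relation_matrix(encoder, heads, x, ys, loss_fn=F.cross_entropy):
    """Pairwise task relations as cosine similarity between per-task input gradients.

    encoder : shared feature extractor, heads : list of per-task output heads,
    x : shared input batch, ys : list of per-task label batches.
    """
    grads = []
    for head, y in zip(heads, ys):
        x_req = x.clone().detach().requires_grad_(True)
        loss = loss_fn(head(encoder(x_req)), y)
        (g,) = torch.autograd.grad(loss, x_req)
        grads.append(g.flatten())                      # one gradient direction per task
    T = len(grads)
    R = torch.zeros(T, T)
    for i in range(T):
        for j in range(T):
            R[i, j] = F.cosine_similarity(grads[i], grads[j], dim=0)
    return R                                           # R[i, j] near 1 => related tasks

# One possible regularized objective (illustrative only): the sum of task losses
# plus a penalty on dissimilarity, e.g.
#   total_loss = sum(task_losses) + lam * (1.0 - task_relation_matrix(...)).mean()
```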
Partial label learning induces a multi-class classifier from training examples, each associated with a candidate label set in which the ground-truth label is concealed. Feature selection improves the generalization ability of a learning system by selecting essential features for classification from the original feature set, while the task of partial label feature selection is challenging due to ambiguous labeling information. In this paper, the first attempt towards partial label feature selection is investigated via mutual-information-based dependency maximization. Specifically, the proposed approach SAUTE iteratively maximizes the dependency between selected features and labeling information, where the value of mutual information is estimated from confidence-based latent variable inference. In each iteration, near-optimal features are selected greedily according to properties of the submodular mutual information function, while the density of the latent label variable is inferred with the help of updated labeling confidences over candidate labels, by resorting to kNN aggregation in the induced lower-dimensional feature space. Extensive experiments over synthetic as well as real-world partial label datasets show that the generalization ability of well-established partial label learning algorithms can be significantly improved after coupling with the proposed feature selection approach.
Link prediction is one of the central problems in graph mining. However, recent studies highlight the importance of higher-order network analysis, where complex structures called motifs are the first-class citizens. We first show that existing link prediction schemes fail to effectively predict motifs. To alleviate this, we establish a general motif prediction problem and propose several heuristics that assess the chances for a specified motif to appear. To make the scores realistic, our heuristics consider, among others, correlations between links, i.e., the potential impact of some arriving links on the appearance of other links in a given motif. Finally, for the highest accuracy, we develop a graph neural network (GNN) architecture for motif prediction. Our architecture offers vertex features and sampling schemes that capture the rich structural properties of motifs. While our heuristics are fast and do not need any training, GNNs ensure the highest accuracy of predicting motifs, both for dense motifs (e.g., k-cliques) and for sparse ones (e.g., k-stars). We consistently outperform the best available competitor by more than 10% on average and up to 32% in area under the curve. Importantly, the advantages of our approach over schemes based on uncorrelated link prediction increase with motif size and complexity. We also successfully apply our architecture to predicting more arbitrary clusters and communities, illustrating its potential for graph mining beyond motif analysis.
With the enactment of privacy-preserving regulations, e.g., GDPR, federated SVD is proposed to enable SVD-based applications over different data sources without revealing the original data. However, many SVD-based applications cannot be well supported by existing federated SVD solutions. The crux is that these solutions, adopting either differential privacy (DP) or homomorphic encryption (HE), suffer from accuracy loss caused by unremovable noise or degraded efficiency due to inflated data.
In this paper, we propose FedSVD, a practical lossless federated SVD method over billion-scale data, which simultaneously achieves lossless accuracy and high efficiency. At the heart of FedSVD is a lossless matrix masking scheme delicately designed for SVD: 1) while adopting masks to protect private data, FedSVD completely removes them from the final SVD results to achieve lossless accuracy; and 2) as the masks do not inflate the data, FedSVD avoids extra computation and communication overhead during the factorization and maintains high efficiency. Experiments with real-world datasets show that FedSVD is over 10000x faster than the HE-based method and has 10 orders of magnitude smaller error than the DP-based solution (ε=0.1, δ=0.1) on SVD tasks. We further build and evaluate FedSVD over three real-world applications: principal component analysis (PCA), linear regression (LR), and latent semantic analysis (LSA), to show its superior performance in practice. On federated LR tasks, compared with two state-of-the-art solutions, FATE [17] and SecureML [19], FedSVD-LR is 100x faster than SecureML and 10x faster than FATE.
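The "lossless masking" property rests on a simple linear-algebra fact: multiplying a matrix by orthogonal masks rotates its singular vectors but leaves the singular values unchanged, so the masks can be removed from the factorization exactly. Below is a small NumPy sketch of that fact only (our own illustration; the actual FedSVD protocol, its distributed mask generation, and its privacy analysis are considerably more involved).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))                      # private data matrix

# Random orthogonal masks, e.g. taken from QR decompositions of Gaussian matrices.
P, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

A_masked = P @ A @ Q                                 # only the masked matrix is exposed
U_m, S, Vt_m = np.linalg.svd(A_masked, full_matrices=False)

# Peel the masks off the factors: A = (P^T U_m) diag(S) (Vt_m Q^T).
U, Vt = P.T @ U_m, Vt_m @ Q.T
assert np.allclose(U @ np.diag(S) @ Vt, A)           # exact (lossless) recovery
```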
Node embeddings are vectors, one per node, that capture a graph's structure. The basic structure is the adjacency matrix of the graph. Recent methods also make assumptions about the similarity of unlinked nodes. However, such assumptions can lead to unintentional but systematic biases against groups of nodes. Calculating similarities between far-off nodes is also difficult under privacy constraints and in dynamic graphs. Our proposed embedding, called NEWS, makes no similarity assumptions, avoiding potential risks to privacy and fairness. NEWS is parameter-free, enables fast link prediction, and has linear complexity. These gains from avoiding assumptions do not significantly affect accuracy, as we show via comparisons against several existing methods on $21$ real-world networks. Code is available at https://github.com/deepayan12/news.
The aspect-opinion extraction tasks extract aspect terms and opinion terms from reviews. Supervised extraction methods achieve state-of-the-art performance but require large-scale human-annotated training data, so they are ill-suited to open-domain tasks where such data is lacking. This work addresses this challenge and simultaneously mines aspect terms, opinion terms, and their correspondence in a joint model. We propose an Open-Domain Aspect-Opinion Co-Mining (ODAO) method with a double-layer span extraction framework. Instead of acquiring human annotations, ODAO first generates weak labels for the unannotated corpus by employing rules based on universal dependency parsing. Then, ODAO utilizes this weak supervision to train a double-layer span extraction framework to extract aspect terms (ATE), opinion terms (OTE), and aspect-opinion pairs (AOPE). To tackle the noisy weak supervision, ODAO applies canonical correlation analysis as an early-stopping indicator to prevent the model from over-fitting to the noise. To tackle the weak supervision bias, ODAO applies a self-training process to gradually enrich the training data. We conduct extensive experiments and demonstrate the power of the proposed ODAO. The results on four benchmark datasets for aspect-opinion co-extraction and pair extraction tasks show that ODAO can achieve competitive or even better performance compared with state-of-the-art fully supervised methods.
We formulate a new inference task in the domain of multivariate time series forecasting (MTSF), called Variable Subset Forecast (VSF), where only a small subset of the variables is available during inference. Variables can be absent during inference because of long-term data loss (e.g., sensor failures) or a high-to-low-resource domain shift between training and testing. To the best of our knowledge, the robustness of MTSF models in the presence of such failures has not been studied in the literature. Through extensive evaluation, we first show that the performance of state-of-the-art methods degrades significantly in the VSF setting. We propose a non-parametric, wrapper technique that can be applied on top of any existing forecasting model. Through systematic experiments across 4 datasets and 5 forecasting models, we show that our technique is able to recover close to 95% of the performance of the models even when only 15% of the original variables are present.
With the advancement of data collection techniques, end users are interested in how different types of data can collaborate to improve our life experiences. Multimodal Federated Learning (MFL) is an emerging area that allows many distributed clients, each of which can collect data from multiple types of sensors, to participate in the training of multimodal data-related models without sharing their data. In this paper, we address a novel challenging issue in MFL, modality incongruity, where clients may have heterogeneous sensor setups and their local data consist of different combinations of modalities. With modality incongruity, clients may solve different tasks on different parameter spaces, which escalates the difficulties of the statistical heterogeneity problem of federated learning; it also becomes hard to perform accurate model aggregation across different types of clients. To tackle these challenges, we propose the FedMSplit framework, which allows federated training over multimodal distributed data without assuming similar active sensors in all clients. The key idea is to employ a dynamic and multi-view graph structure to adaptively capture the correlations amongst multimodal client models. More specifically, we split client models into smaller shareable blocks and allow each type of block to provide a specific view on client relationships. With the graph representation, the underlying correlations between clients can be captured as the edge features in the multi-view graph and then be utilized to promote local model relations through neighborhood message passing in the graph. Our experimental results demonstrate the effectiveness of our method under different sensor setups with statistical heterogeneity.
Join order selection plays an important role in DBMS query optimizers. The problem aims to find the optimal join order with the minimum cost, and usually becomes NP-hard due to the exponentially increasing search space. Recent advanced studies attempt to use deep reinforcement learning (DRL) to generate better join plans than those provided by conventional query optimizers. However, DRL-based methods require time-consuming training, which is not suitable for online applications that need frequent periodic re-training. In this paper, we propose a novel framework, efficient Join Order selection learninG with Graph-basEd Representation (JOGGER). We first construct a schema graph based on the primary-foreign key relationships, from which table representations are learned to capture the correlations between tables. The second component is the state representation, where a graph convolutional network is utilized to encode the query graph and a tailored tree-based attention module is designed to encode the join plan. To speed up the convergence of the DRL training process, we exploit the idea of curriculum learning, in which queries are incrementally added into the training set according to their level of difficulty. We conduct extensive experiments on the JOB and TPC-H datasets, which demonstrate the effectiveness and efficiency of the proposed solutions.
Recent studies have shown that deep neural network-based recommender systems are vulnerable to adversarial attacks, where attackers can inject carefully crafted fake user profiles (i.e., sets of items that fake users have interacted with) into a target recommender system to achieve malicious purposes, such as promoting or demoting a set of target items. Due to security and privacy concerns, it is more practical to perform adversarial attacks under the black-box setting, where the architecture/parameters and training data of target systems cannot be easily accessed by attackers. However, generating high-quality fake user profiles under the black-box setting is rather challenging with only limited access to the target systems. To address this challenge, in this work, we introduce a novel strategy that leverages items' attribute information (i.e., an item knowledge graph), which can be publicly accessible and provides rich auxiliary knowledge to enhance the generation of fake user profiles. More specifically, we propose a knowledge graph-enhanced black-box attacking framework (KGAttack) to effectively learn attacking policies through deep reinforcement learning techniques, in which the knowledge graph is seamlessly integrated into hierarchical policy networks to generate fake user profiles for performing adversarial black-box attacks. Comprehensive experiments on various real-world datasets demonstrate the effectiveness of the proposed attacking framework under the black-box setting.
The booming of multi-modal knowledge graphs (MMKGs) has raised an imperative demand for multi-modal entity alignment techniques, which facilitate the integration of multiple MMKGs from separate data sources. Unfortunately, prior approaches harness multi-modal knowledge only via the heuristic merging of uni-modal feature embeddings, so inter-modal cues concealed in multi-modal knowledge can be largely ignored. To address this problem, in this paper we propose a novel Multi-modal Siamese Network for Entity Alignment (MSNEA) to align entities in different MMKGs, in which multi-modal knowledge is comprehensively leveraged by exploiting inter-modal effects. Specifically, we first devise a multi-modal knowledge embedding module to extract visual, relational, and attribute features of entities, generating holistic entity representations for distinct MMKGs. During this procedure, we employ inter-modal enhancement mechanisms to integrate visual features to guide relational feature learning and to adaptively assign attention weights that capture valuable attributes for alignment. Afterwards, we design a multi-modal contrastive learning module to achieve inter-modal enhancement fusion while avoiding the overwhelming impact of weak modalities. Experimental results on two public datasets demonstrate that our proposed MSNEA achieves state-of-the-art performance by a large margin compared with competitive baselines.
Multi-view subspace clustering aims to cluster data lying in a union of low-dimensional subspaces. Generally, an n × n affinity graph is constructed, on which spectral clustering is then performed to obtain the final clustering. Both the graph construction and the graph partitioning of spectral clustering suffer from quadratic or even cubic time and space complexity, making it difficult to cluster large-scale datasets. Some efforts have recently been made to capture the data distribution in multiple views by selecting key anchor bases beforehand with k-means or a uniform sampling strategy. Nevertheless, few of them pay attention to the algebraic properties of the anchors. How to learn a set of high-quality orthogonal bases in a unified framework, while maintaining scalability to very large datasets, remains a big challenge. In view of this, we propose an Efficient Orthogonal Multi-view Subspace Clustering (OMSC) model with almost linear complexity. Specifically, anchor learning, graph construction, and partitioning are jointly modeled in a unified framework. Through their mutual enhancement, a more discriminative and flexible anchor representation and cluster indicator can be jointly obtained. An alternate minimization strategy is developed to solve the optimization problem and is proven to have time complexity linear in the number of samples. Extensive experiments have been conducted to confirm the superiority of the proposed OMSC method. The source code and data are available at https://github.com/ManshengChen/Code-for-OMSC-master.
Unbiased learning to rank (ULTR) aims to train an unbiased ranking model from biased user click logs. Most current ULTR methods are based on the examination hypothesis (EH), which assumes that the click probability can be factorized into two scalar functions, one related to ranking features and the other related to bias factors. Unfortunately, the interactions among features, bias factors, and clicks are complicated in practice and usually cannot be factorized in this independent way. Fitting click data with the EH can lead to model misspecification and introduce approximation error.
In this paper, we propose a vector-based EH and formulate the click probability as a dot product of two vector functions. This solution is complete due to its universality in fitting arbitrary click functions. Based on it, we propose a novel model named Vectorization to adaptively learn the relevance embeddings and sort documents by projecting embeddings onto a base vector. Extensive experiments show that our method significantly outperforms the state-of-the-art ULTR methods on complex real clicks as well as simple simulated clicks.
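For concreteness (notation ours, not taken from the paper), the two hypotheses can be contrasted as follows: the classic scalar EH factorizes the click probability as $P(c=1 \mid x, t) = r(x)\,o(t)$, where $x$ are the ranking features, $t$ the bias factors, $r$ a relevance function, and $o$ an examination function; the vector-based EH instead writes $P(c=1 \mid x, t) = \langle \phi(x), \psi(t) \rangle = \sum_{k=1}^{d} \phi_k(x)\,\psi_k(t)$, which recovers the scalar EH as the special case $d=1$ and, as $d$ grows, can represent click functions that do not factorize independently. Ranking then amounts to projecting the learned relevance embeddings $\phi(x)$ onto a base vector, as described above.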
Time series forecasting is a critical and challenging problem in many real applications. Recently, Transformer-based models have prevailed in time series forecasting due to their strength in learning long-range dependencies. Besides, some models introduce series decomposition to further unveil reliable yet plain temporal dependencies. Unfortunately, few models can handle complicated periodical patterns, such as multiple periods, variable periods, and phase shifts in real-world datasets. Meanwhile, the notorious quadratic complexity of dot-product attention hampers long sequence modeling. To address these challenges, we design an innovative framework, Quaternion Transformer (Quatformer), with three major components: 1) learning-to-rotate attention (LRA) based on quaternions, which introduces learnable period and phase information to depict intricate periodical patterns; 2) trend normalization, which normalizes the series representations in the hidden layers of the model, considering the slowly varying characteristic of the trend; and 3) decoupling LRA using global memory to achieve linear complexity without losing prediction accuracy. We evaluate our framework on multiple real-world time series datasets and observe an average 8.1% and up to 18.5% MSE improvement over the best state-of-the-art baseline.
Empirical variance is a fundamental concept widely used in data management and data analytics, e.g., query optimization, approximate query processing, and feature selection. A direct solution to derive the empirical variance is scanning the whole data table, which is expensive when the data size is huge. Hence, most current works focus on approximate answers obtained by sampling. For results with approximation guarantees, the samples usually need to be uniform independent random samples, incurring high cache miss rates, especially in compact columnar-style layouts. An alternative uses block sampling to avoid this issue, which directly samples a block of consecutive records fitting page sizes instead of sampling one record at a time. However, this provides no theoretical guarantee, and existing studies show that the resulting estimates can be inaccurate in practice because the records within a block can be correlated.
Motivated by this, we investigate how to provide approximation guarantees for empirical variances with block sampling from a theoretical perspective. Our results show that if the records stored in a table are 4-wise independent of each other with respect to their keys, a slightly modified block sampling scheme can provide the same approximation guarantee with the same asymptotic sampling cost as independent random sampling. In practice, storing records via hash clusters or hash-organized tables is a typical scenario in modern commercial database systems. Thus, for data analysis on tables in a data lake or OLAP store that are exported from such hash-based storage, our strategy can be easily integrated to improve sampling efficiency. Based on our sampling strategy, we present an approximate algorithm for empirical variance and an approximate top-k algorithm that returns the k columns with the highest empirical variance scores. Extensive experiments show that our solutions outperform existing solutions by up to an order of magnitude.
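To make the block-sampling idea concrete, here is a minimal NumPy sketch (our own illustration, not the paper's algorithm or its modified scheme) that estimates the empirical variance of a column from a few randomly placed blocks of consecutive records instead of independently sampled rows; the guarantee discussed above applies only when the stored order behaves like a 4-wise independent hash of the keys.

```python
import numpy as np

def block_sample_variance(column, block_size, num_blocks, rng=None):
    """Estimate the empirical variance of `column` from randomly placed blocks of
    consecutive records; each block is one contiguous (cache-friendly) read."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(column)
    starts = rng.integers(0, n - block_size + 1, size=num_blocks)
    sample = np.concatenate([column[s:s + block_size] for s in starts])
    return sample.var(ddof=0)                        # plug-in estimate

# Example on a hash-like (randomly permuted) ordering; on adversarially sorted
# data the records inside a block can be correlated and the estimate may be biased.
rng = np.random.default_rng(1)
col = rng.permutation(1_000_000).astype(float)
print(block_sample_variance(col, block_size=256, num_blocks=40, rng=rng))
```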
Learning vectorized embeddings is at the core of various recommender systems for user-item matching. To perform efficient online inference, representation quantization, which aims to embed the latent features as a compact sequence of discrete numbers, has recently shown promising potential in reducing both memory and computation overheads. However, existing work merely focuses on the numerical quantization whilst ignoring the concomitant information loss, which consequently leads to conspicuous performance degradation. In this paper, we propose a novel quantization framework to learn Binarized Graph Representations for Top-K Recommendation (BiGeaR). We introduce multi-faceted quantization reinforcement at the pre-, mid-, and post-stage of binarized representation learning, which substantially retains informativeness against embedding binarization. In addition to reducing the memory footprint, it further enables solid online inference acceleration with bitwise operations, providing additional flexibility for realistic deployment. Empirical results over five large real-world benchmarks show that BiGeaR achieves about 22%~40% performance improvement over the state-of-the-art quantization-based recommender system, and recovers about 95%~102% of the performance of the best full-precision counterpart with over 8× time and space reduction.
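As background on why binarization buys inference speed, the sketch below shows the generic mechanism only (sign binarization plus an XOR/popcount inner product); it is our own illustration and does not reproduce BiGeaR's multi-stage quantization reinforcement.

```python
import numpy as np

def binarize(v):
    """Sign-binarize a real-valued embedding: bit 1 encodes a negative entry (-1),
    bit 0 encodes a non-negative entry (+1)."""
    return (v < 0).astype(np.uint8)

def binary_score(bits_u, bits_i):
    """Inner product of two {-1, +1} embeddings via XOR + popcount:
    <u, i> = d - 2 * hamming(u, i)."""
    d = bits_u.size
    return d - 2 * int(np.count_nonzero(bits_u ^ bits_i))

rng = np.random.default_rng(0)
u, i = rng.standard_normal(64), rng.standard_normal(64)
assert binary_score(binarize(u), binarize(i)) == int(np.sign(u) @ np.sign(i))
```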
Logical rules are widely used to represent domain knowledge and hypotheses, which is fundamental to the symbolic reasoning underlying human intelligence. Very recently, it has been demonstrated that integrating logical rules into regular learning tasks can further enhance learning performance in a label-efficient manner. Many attempts have been made to learn logical rules automatically from knowledge graphs (KGs). However, a majority of existing methods rely entirely on observed rule instances to define the score function for rule evaluation and thus lack generalization ability and suffer from severe computational inefficiency. Instead of completely relying on rule instances for rule evaluation, RLogic defines a predicate representation learning-based scoring model, which is trained on sampled rule instances. In addition, RLogic incorporates one of the most significant properties of logical rules, their deductive nature, into rule learning, which is critical especially when a rule lacks supporting evidence. To push deductive reasoning deeper into rule learning, RLogic recursively breaks a large sequential model into small atomic models. Extensive experiments demonstrate that RLogic is superior to existing state-of-the-art algorithms in terms of both efficiency and effectiveness.
Currently, the Vision Transformer (ViT) and its variants have demonstrated promising performance on various computer vision tasks. Nevertheless, task-irrelevant information, such as background nuisance and noise in patch tokens, can damage the performance of ViT-based models. In this paper, we develop the Sufficient Vision Transformer (Suf-ViT) as a new solution to address this issue. We propose Sufficiency Blocks (S-Blocks), applied across the depth of Suf-ViT, to disentangle and discard task-irrelevant information accurately. Besides, to boost the training of Suf-ViT, we formulate a Sufficient-Reduction Loss (SRLoss) leveraging the concept of Mutual Information (MI), which enables Suf-ViT to extract more reliable sufficient representations by removing task-irrelevant information. Extensive experiments on benchmark datasets such as ImageNet, ImageNet-C, and CIFAR-10 indicate that our method can achieve state-of-the-art or competitive performance compared with other baseline methods. Codes are available at: https://github.com/zhicheng2T0/Sufficient-Vision-Transformer.git
The problem of fitting distances by tree metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics, and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in the representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces, which transforms the original data into data that is "more" tree-like, when evaluated in terms of Gromov's δ-hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep, and T-REX, on both synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics, while the real-world datasets include Zoo, Iris, Glass, Segmentation, and SpamBase; on these datasets, the average improvement with respect to NJ is 125.94%.
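Since the quality of the denoised data is judged via Gromov's δ-hyperbolicity, a brute-force reminder of how that quantity is computed may be useful (standard four-point definition; the code is our own and is O(n⁴), so it is only practical for small metrics).

```python
import itertools
import numpy as np

def gromov_delta(D):
    """Four-point delta-hyperbolicity of a finite metric given as an n x n distance
    matrix D. Tree metrics have delta = 0; smaller delta means 'more tree-like'."""
    n = D.shape[0]
    delta = 0.0
    for w, x, y, z in itertools.combinations(range(n), 4):
        sums = sorted([D[w, x] + D[y, z], D[w, y] + D[x, z], D[w, z] + D[x, y]])
        delta = max(delta, (sums[2] - sums[1]) / 2.0)  # half the gap of the two largest sums
    return delta
```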
Time-series data contains temporal order information that can guide representation learning for predictive end tasks (e.g., classification, regression). Recently, there have been some attempts to leverage such order information by first pre-training time-series models to reconstruct the values of randomly masked time segments, followed by fine-tuning on the end task over the same dataset, demonstrating improved end-task performance. However, this learning paradigm decouples data reconstruction from the end task. We argue that the representations learnt in this way are not informed by the end task and may therefore be sub-optimal for end-task performance. In fact, the importance of different timestamps can vary significantly across end tasks. We believe that learning representations by reconstructing important timestamps would be a better strategy for improving end-task performance. In this work, we propose TARNet, Task-Aware Reconstruction Network, a new model using Transformers to learn task-aware data reconstruction that augments end-task performance. Specifically, we design a data-driven masking strategy that uses the self-attention score distribution from end-task training to sample timestamps deemed important by the end task. Then, we mask out data at those timestamps and reconstruct them, thereby making the reconstruction task-aware. This reconstruction task is trained alternately with the end task at every epoch, sharing parameters in a single model, allowing the representations learnt through reconstruction to improve end-task performance. Extensive experiments on tens of classification and regression datasets show that TARNet significantly outperforms state-of-the-art baseline models across all evaluation metrics.
We study the private k-median and k-means clustering problem in d-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2} \log n) \cdot \mathrm{OPT} + O(k d^2 \log^2 n / ε^2)$, where ε is the privacy guarantee. (The dimension term, d, can be replaced with O(log k) using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, Õ(nkd), time, and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.
The interactive graph search (IGS) problem aims to locate an initially unknown target node by leveraging human intelligence. In IGS, we can gradually find the target node by sequentially asking humans reachability queries like "is the target node reachable from a given node x?". However, human workers may make mistakes when answering these queries. Motivated by this concern, in this paper we study a noisy version of the IGS problem. Our objective is to minimize the query complexity while ensuring accuracy. We propose a method to select the query node so as to push the search process forward as much as possible, and an online method to infer which node is the target after collecting a new answer. By rigorous theoretical analysis, we show that the query complexity of our approach is near-optimal up to a constant factor. Extensive experiments on two real datasets also demonstrate the superiority of our approach.
Federated learning (FL) refers to the paradigm of learning models over a collaborative research network involving multiple clients without sacrificing privacy. Recently, there have been rising concerns about distributional discrepancies across different clients, which can even cause counterproductive consequences when collaborating with others. Since collaborating with all clients does not necessarily achieve the best performance, in this paper we study a rational collaboration called "collaboration equilibrium" (CE), where smaller collaboration coalitions are formed. Each client collaborates with certain members who maximally improve its model learning and isolates the others who make little contribution. We propose the concept of a benefit graph, which describes how each client can benefit from collaborating with other clients, and advance a Pareto optimization approach to identify the optimal collaborators. We then theoretically prove that a CE can be reached from the benefit graph through an iterative graph operation. Our framework provides a new way of setting up collaborations in a research network. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method.
Post-click conversion rate (CVR) prediction is an essential task for discovering user interests and increasing platform revenues in a range of industrial applications. One of the most challenging problems in this task is the severe selection bias caused by the inherent self-selection behavior of users and the item selection process of systems. Currently, doubly robust (DR) learning approaches achieve state-of-the-art performance for debiasing CVR prediction. However, in this paper, by theoretically analyzing the bias, variance, and generalization bounds of DR methods, we find that existing DR approaches may generalize poorly due to inaccurate estimation of propensity scores and imputation errors, which often occur in practice. Motivated by this analysis, we propose a generalized learning framework that not only unifies existing DR methods but also provides a valuable opportunity to develop a series of new debiasing techniques to accommodate different application scenarios. Based on this framework, we propose two new DR methods, DR-BIAS and DR-MSE. DR-BIAS directly controls the bias of the DR loss, while DR-MSE balances the bias and variance flexibly, achieving better generalization performance. In addition, we propose a novel tri-level joint learning optimization method for DR-MSE in CVR prediction, along with an efficient training algorithm. We conduct extensive experiments on both real-world and semi-synthetic datasets, which validate the effectiveness of our proposed methods.
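For reference (notation ours), the doubly robust estimator that these methods build on combines an imputation model $\hat e_{u,i}$ for the prediction error with inverse propensity weighting: $\mathcal{E}_{\mathrm{DR}} = \frac{1}{|\mathcal{D}|}\sum_{(u,i)\in\mathcal{D}} \big[\hat e_{u,i} + \frac{o_{u,i}(e_{u,i} - \hat e_{u,i})}{\hat p_{u,i}}\big]$, where $o_{u,i}$ indicates whether the feedback of pair $(u,i)$ is observed, $e_{u,i}$ is the prediction error on observed pairs, and $\hat p_{u,i}$ is the estimated propensity. The estimator is unbiased whenever either the imputed errors or the propensities are accurate; the analysis above concerns what happens when both are inaccurate, which DR-BIAS and DR-MSE are designed to control.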
We are interested in discovering those patterns in data whose empirical frequency is significantly different from what is expected. To avoid spurious results, yet achieve high statistical power, we propose to sequentially control for false discoveries during the search. To avoid redundancy, we propose to update our expectations whenever we discover a significant pattern. To efficiently consider the exponentially sized search space, we employ an easy-to-compute upper bound on significance and propose an effective search strategy for sets of significant patterns. Through an extensive set of experiments on synthetic data, we show that our method, Spass, recovers the ground truth reliably, does so efficiently, and without redundancy. On real-world data we show that it works well on both single and multiple classes and on low- and high-dimensional data, and through case studies that it discovers meaningful results.
Bidirectional Transformer architectures are state-of-the-art sequential recommendation models that use a bidirectional representation capacity based on the Cloze task, a.k.a. Masked Language Modeling, which aims to predict randomly masked items within the sequence. Because they assume that the truly interacted item is the most relevant one, an exposure bias results, where non-interacted items with low exposure propensities are assumed to be irrelevant. The most common approach to mitigating exposure bias in recommendation is Inverse Propensity Scoring (IPS), which down-weights the interacted predictions in the loss function in proportion to their propensities of exposure, yielding theoretically unbiased learning. In this work, we argue and prove that IPS does not extend to sequential recommendation because it fails to account for the temporal nature of the problem. We then propose a novel propensity scoring mechanism, which can theoretically debias the Cloze task in sequential recommendation. Finally, we empirically demonstrate the debiasing capabilities of our proposed approach and its robustness to the severity of exposure bias.
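As background (notation ours), the vanilla IPS correction referred to above reweights each interacted item's loss term by the inverse of its exposure propensity, e.g. $\mathcal{L}_{\mathrm{IPS}} = \sum_{(u,i):\,o_{u,i}=1} \ell(\hat y_{u,i}, y_{u,i}) / p_{u,i}$, where $p_{u,i}$ is the probability that item $i$ was exposed to user $u$; in expectation over the exposure process this recovers the loss over all relevant items. The argument above is that such a static, per-item propensity cannot capture the step-wise, temporal exposure mechanism underlying the Cloze task, which is what the proposed propensity scoring mechanism addresses.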
The problem of algorithmic recourse has been explored for supervised machine learning models to provide more interpretable, transparent, and robust outcomes from decision support systems. An unexplored area is algorithmic recourse for anomaly detection, specifically for tabular data with only discrete feature values. Here the problem is to present a set of counterfactuals that are deemed normal by the underlying anomaly detection model, so that applications can utilize this information for explanation purposes or to recommend countermeasures. We present an approach, Context-preserving Algorithmic Recourse for Anomalies in Tabular data (CARAT), that is effective, scalable, and agnostic to the underlying anomaly detection model. CARAT uses a transformer-based encoder-decoder model to explain an anomaly by finding features with low likelihood. Subsequently, semantically coherent counterfactuals are generated by modifying the highlighted features, using the overall context of the features in the anomalous instance(s). Extensive experiments demonstrate the efficacy of CARAT.
Data-driven societal event forecasting methods exploit relevant historical information to predict future events. These methods rely on historical labeled data and cannot accurately predict events when data are limited or of poor quality. Studying causal effects between events goes beyond correlation analysis and can contribute to a more robust prediction of events. However, incorporating causality analysis in data-driven event forecasting is challenging due to several factors: (i) events occur in a complex and dynamic social environment, where many unobserved variables, i.e., hidden confounders, affect both potential causes and outcomes; (ii) given spatiotemporal non-independent and identically distributed (non-IID) data, modeling hidden confounders for accurate causal effect estimation is not trivial. In this work, we introduce a deep learning framework that integrates causal effect estimation into event forecasting. We first study the problem of Individual Treatment Effect (ITE) estimation from observational event data with spatiotemporal attributes and present a novel causal inference model to estimate ITEs. We then incorporate the learned event-related causal information into event prediction as prior knowledge. Two robust learning modules, a feature reweighting module and an approximate constraint loss, are introduced to enable prior knowledge injection. We evaluate the proposed causal inference model on real-world event datasets and validate the effectiveness of the proposed robust learning modules in event prediction by feeding the learned causal information into different deep learning methods. Experimental results demonstrate the strengths of the proposed causal inference model for ITE estimation in societal events and showcase the beneficial properties of the robust learning modules in societal event forecasting.
Recommender systems should answer the intervention question "if we recommend an item to a user, what would the feedback be?", calling for estimating the causal effect of a recommendation on user feedback. Generally, this requires blocking the effect of confounders that simultaneously affect the recommendation and the feedback. To mitigate the confounding bias, a common strategy is to incorporate propensities into model learning. However, existing methods overlook possible unmeasured confounders (e.g., user financial status), which can result in biased propensities and hurt recommendation performance. This work combats the risk of unmeasured confounders in recommender systems.
Towards this end, we propose the Robust Deconfounder (RD), which accounts for the effect of unmeasured confounders on propensities under the mild assumption that the effect is bounded. It estimates the bound with sensitivity analysis and learns a recommender model that is robust to unmeasured confounders within the bound via adversarial learning. However, pursuing robustness within a bound may restrict model accuracy. To avoid the trade-off between robustness and accuracy, we further propose the Benchmarked RD (BRD), which incorporates a pre-trained model into the learning as a benchmark. Theoretical analyses prove the stronger robustness of our methods compared to existing propensity-based deconfounders, as well as the no-harm property of BRD. Our methods are applicable to any propensity-based estimator, of which we select three representative ones: IPS, Doubly Robust, and AutoDebias. We conduct experiments on three real-world datasets to demonstrate the effectiveness of our methods.
Graph Neural Networks (GNNs) have shown satisfying performance in various graph analytical problems and have become the de facto solution in a variety of decision-making scenarios. However, GNNs can yield biased results against certain demographic subgroups. Some recent works have empirically shown that the biased structure of the input network is a significant source of bias for GNNs. Nevertheless, no studies have systematically scrutinized which part of the input network structure leads to biased predictions for any given node. This low transparency regarding how the structure of the input network influences bias in GNN outcomes largely limits the safe adoption of GNNs in decision-critical scenarios. In this paper, we study the novel research problem of structural explanation of bias in GNNs. Specifically, we propose a post-hoc explanation framework that identifies two edge sets that maximally account for the exhibited bias and maximally contribute to the fairness level of the GNN prediction for any given node, respectively. Such explanations not only provide a comprehensive understanding of the bias/fairness of GNN predictions but also have practical significance in building an effective yet fair GNN model. Extensive experiments on real-world datasets validate that the proposed framework delivers effective structural explanations for the bias of GNNs. Open-source code can be found at https://github.com/yushundong/REFEREE.
The widespread use of machine learning algorithms in settings that directly affect human lives has instigated significant interest in designing variants of these algorithms that are provably fair. Recent work in this direction has produced numerous algorithms for the fundamental problem of clustering under many different notions of fairness. Perhaps the most common family of notions currently studied is group fairness, in which proportional group representation is ensured in every cluster. We extend this direction by considering the downstream application of clustering and how group fairness should be ensured in such a setting. Specifically, we consider a common setting in which a decision-maker runs a clustering algorithm, inspects the center of each cluster, and decides an appropriate outcome (label) for its corresponding cluster. In hiring, for example, there could be two outcomes, positive (hire) or negative (reject), and each cluster would be assigned one of these two outcomes. To ensure group fairness in such a setting, we desire proportional group representation in every label, but not necessarily in every cluster as is done in group-fair clustering. We provide algorithms for such problems and show that, in contrast to their NP-hard counterparts in group-fair clustering, they permit efficient solutions. We also consider a well-motivated alternative setting where the decision-maker is free to assign labels to the clusters regardless of the centers' positions in the metric space. We show that this setting exhibits interesting transitions from computationally hard to easy according to additional constraints on the problem. Moreover, when the constraint parameters take on natural values, we give a randomized algorithm for this setting that always achieves an optimal clustering and satisfies the fairness constraints in expectation. Finally, we run experiments on real-world datasets that validate the effectiveness of our algorithms.
Regression models are learned over multiple variables, e.g., using engine torque and speed to predict fuel consumption. In practice, the values of these variables are often collected separately, e.g., by different sensors in a vehicle, and need to be aligned into tuples before learning. Unfortunately, owing to various issues like network delays, values generated at the same time can be recorded with different timestamps, making the alignment difficult. According to our study at a vehicle manufacturer, engine torque, speed, and fuel consumption values are mostly not recorded with the same timestamps. Aligning tuples by simply concatenating values of variables with equal timestamps leaves limited data for learning the regression model. To deal with timestamp variations, existing time series matching techniques rely on the similarity of values and timestamps, which unfortunately is very likely to be absent among the variables in regression (there is no similarity between engine torque and speed values). We therefore propose to bridge tuple alignment and regression. Rather than relying on similar values and timestamps, we align the values of different variables into a tuple if they (i) are recorded within a short period, i.e., the time constraint, and, more importantly, (ii) coincide well with the regression model, known as the model constraint. Our theoretical and technical contributions include (1) formulating the problem of tuple alignment with time and model constraints, (2) proving NP-completeness of the problem, (3) devising an approximation algorithm with a performance guarantee, and (4) proposing efficient pruning strategies for the algorithm. Experiments over real-world datasets, including the aforementioned engine data collected by a vehicle manufacturer, demonstrate that our proposal outperforms existing methods on alignment accuracy and improves regression precision.
Deep learning based trajectory similarity computation holds the potential for improved efficiency and adaptability over traditional similarity computation. However, existing learning-based trajectory similarity solutions prioritize spatial similarity over temporal similarity, making them suboptimal for time-aware analyses. To this end, we propose ST2Vec, a representation learning based solution that considers fine-grained spatial and temporal relations between trajectories to enable spatio-temporal similarity computation in road networks. Specifically, ST2Vec encompasses two steps: (i) spatial and temporal modeling that encodes the spatial and temporal information of trajectories, where a generic temporal modeling module is proposed for the first time; and (ii) spatio-temporal co-attention fusion, where two fusion strategies are designed to enable the generation of unified spatio-temporal embeddings of trajectories. Further, under the guidance of triplet loss, ST2Vec employs curriculum learning in model optimization to improve convergence and effectiveness. An experimental study offers evidence that ST2Vec outperforms state-of-the-art competitors substantially in terms of effectiveness and efficiency, while showing low parameter sensitivity and good model robustness. Moreover, case studies involving similarity, including top-k querying and DBSCAN clustering, offer further insight into the capabilities of ST2Vec.
Knowledge distillation (KD) has demonstrated its effectiveness in boosting the performance of graph neural networks (GNNs), where the goal is to distill knowledge from a deeper teacher GNN into a shallower student GNN. However, it is often difficult to train a satisfactory teacher GNN due to the well-known over-parameterization and over-smoothing issues, leading to invalid knowledge transfer in practical applications. In this paper, we propose the first Free-direction Knowledge Distillation framework via Reinforcement learning for GNNs, called FreeKD, which no longer requires a deeper, well-optimized teacher GNN. The core idea of our work is to collaboratively build two shallower GNNs that exchange knowledge with each other via reinforcement learning in a hierarchical way. As we observe that a typical GNN model often performs better at some nodes and worse at others during training, we devise a dynamic and free-direction knowledge transfer strategy that consists of two levels of actions: 1) a node-level action determines the direction of knowledge transfer between the corresponding nodes of the two networks; and 2) a structure-level action determines which of the local structures generated by the node-level actions should be propagated. In essence, FreeKD is a general and principled framework that is naturally compatible with GNNs of different architectures. Extensive experiments on five benchmark datasets demonstrate that FreeKD outperforms the two base GNNs by a large margin, and show its efficacy for various GNNs. More surprisingly, FreeKD achieves comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.
Graph metric learning methods aim to learn the distance metric over graphs such that similar (e.g., same class) graphs are closer and dissimilar (e.g., different class) graphs are farther apart. This is of critical importance in many graph classification applications such as drug discovery and epidemics categorization. Most, if not all, graph metric learning techniques consider the input graph as static, and largely ignore the intrinsic dynamics of temporal graphs. However, in practice, a graph typically has heterogeneous dynamics (e.g., microscopic and macroscopic evolution patterns). As such, labeling a temporal graph is usually expensive and also requires background knowledge. To learn a good metric over temporal graphs, we propose a temporal graph metric learning framework, Temp-GFSM. With only a few labeled temporal graphs, Temp-GFSM outputs a good metric that can accurately classify different temporal graphs and be adapted to discover new subspaces for unseen classes. Each proposed component in Temp-GFSM answers the following questions: What patterns are evolving in a temporal graph? How to weigh these patterns to represent the characteristics of different temporal classes? And how to learn the metric with the guidance from only a few labels? Finally, the experimental results on real-world temporal graph classification tasks from various domains show the effectiveness of our Temp-GFSM.
Protein engineering has important applications in drug discovery. Among others, inverse protein folding is a fundamental task in protein design, which aims at generating a protein's amino acid sequence given its 3D graph structure. However, most existing methods for inverse protein folding are based on sequential generative models and are therefore limited in uncertainty quantification and in their ability to explore the entire protein space. To address these issues, we propose a sampling method for inverse protein folding (SIPF). Specifically, we formulate inverse protein folding as a sampling problem and design two pretrained neural networks as the Markov Chain Monte Carlo (MCMC) proposal distribution. To ensure sampling efficiency, we further design (i) an adaptive sampling scheme to select variables for sampling and (ii) an approximate target distribution as a surrogate for the unavailable target distribution. Empirical studies validate the effectiveness of SIPF, achieving 7.4% relative improvement in recovery rate and 6.4% relative reduction in perplexity compared to the best baseline.
In recent years, therapeutic antibodies have become one of the fastest-growing classes of drugs and have been approved for the treatment of a wide range of indications, from cancer to autoimmune diseases. Complementarity-determining regions (CDRs) are part of the variable chains in antibodies and determine specific antibody-antigen binding. Some explorations use in silico methods to design antibody CDR loops. However, existing methods face the challenge of maintaining the specific geometric shape of the CDR loops. This paper proposes a Constrained Energy Model (CEM) to address this issue. Specifically, we design a constrained manifold to characterize the geometric constraints of the CDR loops. We then define the energy model on the constrained manifold, depicting the energy landscape of the manifold only, rather than of the whole space as in a vanilla energy model. The geometric shape of the generated CDR loops is thus automatically preserved. Theoretical analysis shows that learning on the constrained manifold requires lower sample complexity than the unconstrained alternative. CEM's superiority is validated via thorough empirical studies, achieving consistent and significant improvement with up to 33.4% relative reduction in 3D geometry error (Root Mean Square Deviation, RMSD) and 8.4% relative reduction in an amino acid sequence metric (perplexity) compared to the best baseline method. The code is publicly available at https://github.com/futianfan/energy_model4antibody_design
Recent years have seen a renewed interest in interpretable machine learning, which seeks insight into how a model achieves a prediction. Here, we focus on the relatively unexplored case of interpretable clustering. In our approach, the cluster assignments of the training instances are constrained to be the output of a decision tree. This has two advantages: 1) it makes it possible to understand globally how an instance is mapped to a cluster, in particular to see which features are used for which cluster; 2) it forces the clusters to respect a hierarchical structure while optimizing the original clustering objective function. Rather than the traditional axis-aligned trees, we use sparse oblique trees, which have far more modelling power, particularly with high-dimensional data, while remaining interpretable. Our approach applies to any clustering method which is defined by optimizing a cost function and we demonstrate it with two k-means variants.
The lottery ticket hypothesis (LTH) states that a randomly initialized dense network contains sub-networks that can be trained in isolation to match the performance of the dense network. In this paper, to achieve rapid learning at lower computational cost, we explore LTH in the context of meta learning. First, we experimentally show that there are sparse sub-networks, which we call meta winning tickets, that can be meta-trained to match the few-shot classification accuracy of the original backbone. Applying LTH to meta learning enables the adaptation of meta-trained networks on various IoT devices with less computation. However, the standard approach to identifying winning tickets requires iterative training and pruning, which is particularly expensive for finding meta winning tickets. To this end, we investigate the inter- and intra-layer patterns among different meta winning tickets and propose a scheme for early detection of a meta winning ticket. The proposed scheme enables efficient training on resource-limited devices. We also design a lightweight solution to search for the meta winning ticket. Evaluations on standard few-shot classification benchmarks show that we can find competitive meta winning tickets with 20% of the weights of the original backbone, while incurring only 8%-14% (Conv-4) and 19%-29% (ResNet-12) of the computation overhead (measured by FLOPs) of the standard winning ticket finding scheme.
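For context, the sketch below shows the basic building block of winning-ticket search: one round of global magnitude pruning that keeps 20% of the weights and returns a binary mask. It is a generic illustration of the iterative procedure whose cost the early-detection scheme above is designed to avoid, not the proposed scheme itself.

import numpy as np

def magnitude_prune(weights, keep_ratio=0.2):
    # Keep the top `keep_ratio` fraction of weights (globally) by magnitude
    # and return binary masks; pruned positions are rewound before retraining.
    flat = np.concatenate([w.ravel() for w in weights])
    threshold = np.quantile(np.abs(flat), 1.0 - keep_ratio)
    return [(np.abs(w) >= threshold).astype(np.float32) for w in weights]

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)), rng.normal(size=(64, 10))]
masks = magnitude_prune(layers, keep_ratio=0.2)
sparsity = 1.0 - sum(m.sum() for m in masks) / sum(m.size for m in masks)
print(f"overall sparsity after pruning: {sparsity:.2f}")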
Entity alignment (EA) aims at finding equivalent entities in different knowledge graphs (KGs). Embedding-based approaches have dominated the EA task in recent years. Those methods face problems that come from the geometric properties of embedding vectors, including hubness and isolation. To solve these geometric problems, many normalization approaches have been adopted for EA. However, the increasing scale of KGs renders it hard for EA models to adopt the normalization processes, thus limiting their usage in real-world applications. To tackle this challenge, we present ClusterEA, a general framework that is capable of scaling up EA models and enhancing their results by leveraging normalization methods on mini-batches with a high entity equivalent rate. ClusterEA contains three components to align entities between large-scale KGs, including stochastic training, ClusterSampler, and SparseFusion. It first trains a large-scale Siamese GNN for EA in a stochastic fashion to produce entity embeddings. Based on the embeddings, a novel ClusterSampler strategy is proposed for sampling highly overlapped mini-batches. Finally, ClusterEA incorporates SparseFusion, which normalizes local and global similarity and then fuses all similarity matrices to obtain the final similarity matrix. Extensive experiments with real-life datasets on EA benchmarks offer insight into the proposed framework, and suggest that it is capable of outperforming the state-of-the-art scalable EA framework by up to 8 times in terms of Hits@1.
Despite the fast progress of explanation techniques in modern Deep Neural Networks (DNNs), where the main focus is handling "how to generate the explanations", advanced research questions that examine the quality of the explanation itself (e.g., "whether the explanations are accurate") and improve the explanation quality (e.g., "how to adjust the model to generate more accurate explanations when explanations are inaccurate") are still relatively under-explored. To guide the model toward better explanations, techniques in explanation supervision - which add supervision signals on the model explanation - have started to show promising effects on improving both the generalizability and the intrinsic interpretability of Deep Neural Networks. However, research on supervising explanations, especially in vision-based applications represented through saliency maps, is in its early stage due to several inherent challenges: 1) inaccuracy of the human explanation annotation boundary, 2) incompleteness of the human explanation annotation region, and 3) inconsistency of the data distribution between human annotation and model explanation maps. To address these challenges, we propose a generic RES framework for guiding visual explanation by developing a novel objective that handles inaccurate boundaries, incomplete regions, and inconsistent distributions of human annotations, with a theoretical justification on model generalizability. Extensive experiments on two real-world image datasets demonstrate the effectiveness of the proposed framework in enhancing both the reasonability of the explanation and the performance of the backbone DNN model.
Knowledge Graph (KG) and its variant of ontology have been widely used for knowledge representation, and have been shown to be quite effective in augmenting Zero-shot Learning (ZSL). However, existing ZSL methods that utilize KGs all neglect the intrinsic complexity of inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL, and propose to learn disentangled ontology embeddings guided by ontology properties to capture and utilize more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions based on generative models and graph propagation models, respectively, for effectively utilizing the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often achieves better performance than the state-of-the-art, and its components have been verified by ablation studies and case studies. Our codes and datasets are available at https://github.com/zjukg/DOZSL.
The emerging meta- and multi-verse landscape is yet another step towards the more prevalent use of already ubiquitous online markets. In such markets, recommender systems play critical roles by offering items of interest to the users, thereby narrowing down a vast search space that comprises hundreds of thousands of products. Recommender systems are usually designed to learn common user behaviors and rely on them for inference. This approach, while effective, is oblivious to subtle idiosyncrasies that differentiate humans from each other. Focusing on this observation, we propose an architecture that relies on common patterns as well as individual behaviors to tailor its recommendations for each person. Simulations under a controlled environment show that our proposed model learns interpretable personalized user behaviors. Our empirical results on Nielsen Consumer Panel dataset indicate that the proposed approach achieves up to 27.9% performance improvement compared to the state-of-the-art.
Machine Learning is beginning to provide state-of-the-art performance in a range of environmental applications such as streamflow prediction in a hydrologic basin. However, building accurate broad-scale models for streamflow remains challenging in practice due to the variability in the dominant hydrologic processes, which are best captured by sets of process-related basin characteristics. Existing basin characteristics suffer from noise and uncertainty, among many other things, which adversely impact model performance. To tackle the above challenges, in this paper, we propose a novel Knowledge-guided Self-Supervised Learning (KGSSL) inverse framework to extract system characteristics from driver (input) and response (output) data. This first-of-its-kind framework achieves robust performance even when characteristics are corrupted or missing. We evaluate the KGSSL framework in the context of streamflow modeling using CAMELS (Catchment Attributes and MEteorology for Large-sample Studies), a widely used hydrology benchmark dataset. Specifically, KGSSL outperforms the baseline by 16% in predicting missing characteristics. Furthermore, in the context of forward modelling, KGSSL-inferred characteristics provide a 35% improvement in performance over a standard baseline when the static characteristics are unknown.
Tracking a targeted subset of nodes in an evolving graph is important for many real-world applications. Existing methods typically focus either on identifying anomalous edges or on finding anomalous graph snapshots in a streaming fashion. However, edge-oriented methods cannot quantify how individual nodes change over time, while snapshot-oriented methods must maintain representations of the whole graph at all times and are thus computationally inefficient.
This paper proposes DynAnom, an efficient framework to quantify changes and localize per-node anomalies over large dynamic weighted graphs. Thanks to recent advances in dynamic representation learning based on Personalized PageRank, DynAnom is 1) efficient: its time complexity is linear in the number of edge events and independent of the number of nodes in the input graph; 2) effective: DynAnom can successfully track topological changes reflecting real-world anomalies; 3) flexible: different types of anomaly score functions can be defined for various applications. Experiments demonstrate these properties on both benchmark graph datasets and a new large real-world dynamic graph. Specifically, an instantiation of DynAnom achieves an accuracy of 0.5425, compared with 0.2790 for the best baseline, on the task of node-level anomaly localization while running 2.3 times faster than the baseline. We present a real-world case study and further demonstrate the usability of DynAnom for anomaly discovery over large-scale graphs.
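The sketch below illustrates, under simplifying assumptions, how a Personalized PageRank (PPR) vector can be turned into a per-node change score: the score is the L1 shift of a node's PPR vector after an edge event. It uses dense power iteration on a toy graph, whereas DynAnom maintains these vectors incrementally to reach a per-event cost independent of graph size.

import numpy as np

def personalized_pagerank(A, source, alpha=0.15, iters=100):
    # Dense power iteration for the PPR vector of `source`;
    # DynAnom instead updates these vectors incrementally per edge event.
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A, dtype=float), where=deg > 0)
    e = np.zeros(n)
    e[source] = 1.0
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * P.T @ pi
    return pi

def node_change_score(A_before, A_after, node):
    # One possible anomaly score: L1 shift of the node's PPR vector.
    return np.abs(personalized_pagerank(A_after, node)
                  - personalized_pagerank(A_before, node)).sum()

A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
A_new = A.copy()
A_new[0, 4] = A_new[4, 0] = 1.0  # a new edge event
print(node_change_score(A, A_new, node=0))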
Representation learning has transformed the problem of information retrieval into one of finding the approximate set of nearest neighbors in a high-dimensional vector space. With limited hardware resources and time-critical queries, retrieval engines face an inherent tension between latency, accuracy, scalability, compactness, and the ability to load balance in distributed settings. To improve the trade-off, we propose a new algorithm, called BaLanced Index for Scalable Search (BLISS), a highly tunable indexing algorithm with enviably small index sizes, making it easy to scale to billions of vectors. It iteratively refines partitions of items by learning the relevant buckets directly from the query-item relevance data. To ensure that the buckets are balanced, BLISS uses the power-of-K choices strategy. We show that BLISS provides superior load balancing with high probability (and under very benign assumptions). Due to its design, BLISS can be employed for both near-neighbor retrieval (the ANN problem) and extreme classification (the XML problem). For ANN, we train and index 4 datasets with a billion vectors each. We compare the recall, inference time, indexing time, and index size of BLISS with the two most popular and well-optimized libraries: the Hierarchical Navigable Small World (HNSW) graph and Facebook's FAISS. BLISS requires 100x less RAM than HNSW, making it fit in memory on commodity machines while taking a similar inference time as HNSW for the same recall. Against FAISS-IVF, BLISS achieves similar performance with 3-4x less memory. BLISS is both data and model parallel, making it ideal for distributed implementations of training and inference. For XML, BLISS surpasses the best baselines' precision while being 5x faster at inference on popular multi-label datasets with half a million classes.
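To illustrate the load-balancing ingredient, here is a minimal power-of-K-choices assignment in which every item looks at K candidate buckets and joins the least loaded one. The candidates are drawn at random for simplicity; BLISS would instead rank buckets by learned query-item relevance.

import numpy as np

def power_of_k_assign(num_items, num_buckets, k=2, seed=0):
    rng = np.random.default_rng(seed)
    loads = np.zeros(num_buckets, dtype=int)
    assignment = np.empty(num_items, dtype=int)
    for item in range(num_items):
        # K candidate buckets per item; the least loaded one wins.
        candidates = rng.choice(num_buckets, size=k, replace=False)
        chosen = candidates[np.argmin(loads[candidates])]
        assignment[item] = chosen
        loads[chosen] += 1
    return assignment, loads

_, loads = power_of_k_assign(num_items=100_000, num_buckets=1_000, k=2)
print("max bucket load:", loads.max(), "vs. ideal", 100_000 // 1_000)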
Any human activity can be represented as a temporal sequence of actions performed to achieve a certain goal. Unlike machine-made time series, these action sequences are highly disparate, as the time taken to finish a similar action might vary across different people. Therefore, understanding the dynamics of these sequences is essential for many downstream tasks such as activity length prediction, goal prediction, etc. Existing neural approaches that model an activity sequence are either limited to visual data or are task-specific, i.e., limited to next-action or goal prediction. In this paper, we present ProActive, a neural marked temporal point process (MTPP) framework for modeling the continuous-time distribution of actions in an activity sequence while simultaneously addressing three high-impact problems - next action prediction, sequence-goal prediction, and end-to-end sequence generation. Specifically, we utilize a self-attention module with temporal normalizing flows to model the influence and the inter-arrival times between actions in a sequence. Moreover, for time-sensitive prediction, we perform an early detection of the sequence goal via a constrained margin-based optimization procedure. This, in turn, allows ProActive to predict the sequence goal using a limited number of actions. Extensive experiments on sequences derived from three activity recognition datasets show the significant accuracy boost of ProActive over the state-of-the-art in terms of action and goal prediction, and the first-ever application of end-to-end action sequence generation.
Due to the curse of statistical heterogeneity across clients, adopting a personalized federated learning method has become an essential choice for the successful deployment of federated learning-based services. Among the diverse branches of personalization techniques, model mixture-based personalization is preferred because each client obtains its own personalized model as a result of federated learning. It usually requires both a local model and a federated model, but this approach is either limited to partial parameter exchange or requires additional local updates, which is of no help to novel clients and burdensome to clients' computational capacity. Since the existence of a connected subspace containing diverse low-loss solutions between two or more independent deep networks has been discovered, we combine this property with the model mixture-based personalized federated learning method to improve personalization performance. We propose SuPerFed, a personalized federated learning method that induces an explicit connection between the optima of the local and the federated model in weight space so that they boost each other. Through extensive experiments on several benchmark datasets, we demonstrate that our method achieves consistent gains in both personalization performance and robustness to problematic scenarios that can arise in realistic services.
Traffic demand forecasting with deep neural networks has attracted widespread interest in both academia and industry. Among these problems, pairwise Origin-Destination (OD) demand prediction is a valuable but challenging one due to several factors: (i) the large number of possible OD pairs, (ii) the implicitness of spatial dependence, and (iii) the complexity of traffic states. To address the above issues, this paper proposes a Continuous-time and Multi-level dynamic graph representation learning method for Origin-Destination demand prediction (CMOD). Firstly, a continuous-time dynamic graph representation learning framework is constructed, which maintains a dynamic state vector for each traffic node (metro stations or taxi zones). The state vectors keep historical transaction information and are continuously updated according to the most recent transactions. Secondly, a multi-level structure learning module is proposed to model the spatial dependency of station-level nodes. It can not only exploit relations between nodes adaptively from data, but also share messages and representations via cluster-level and area-level virtual nodes. Lastly, a cross-level fusion module is designed to integrate multi-level memories and generate comprehensive node representations for the final prediction. Extensive experiments are conducted on two real-world datasets from Beijing Subway and New York Taxi, and the results demonstrate the superiority of our model over state-of-the-art approaches.
Hierarchical clustering produces a cluster tree with different granularities. As a result, hierarchical clustering provides richer information and insight into a dataset than partitioning clustering. However, hierarchical clustering algorithms often have two weaknesses: scalability and the capacity to handle clusters of varying densities. This is because they rely on pairwise point-based similarity calculations and the similarity measure is independent of the data distribution. In this paper, we aim to overcome these weaknesses and propose a novel efficient hierarchical clustering algorithm, called StreaKHC, that enables massive streaming data to be mined. The enabling factor is the use of a scalable point-set kernel to measure the similarity between an existing cluster in the cluster tree and a new point in the data stream. It also has an efficient mechanism to update the hierarchical structure so that a high-quality cluster tree can be maintained in real time. Our extensive empirical evaluation shows that StreaKHC is more accurate and more efficient than existing hierarchical clustering algorithms.
Deep graph neural networks (GNNs) have been shown to be expressive for modeling graph-structured data. Nevertheless, the overstacked architecture of deep graph models makes it difficult to deploy and rapidly test them on mobile or embedded systems. To compress over-stacked GNNs, knowledge distillation via a teacher-student architecture turns out to be an effective technique, where the key step is to measure the discrepancy between the teacher and student networks with predefined distance functions. However, using the same distance for graphs of various structures may be ill-suited, and the optimal distance formulation is hard to determine. To tackle these problems, we propose a novel Adversarial Knowledge Distillation framework for graph models named GraphAKD, which adversarially trains a discriminator and a generator to adaptively detect and decrease the discrepancy. Specifically, noticing that the well-captured inter-node and inter-class correlations favor the success of deep GNNs, we propose to criticize the inherited knowledge from node-level and class-level views with a trainable discriminator. The discriminator distinguishes between teacher knowledge and what the student inherits, while the student GNN works as a generator and aims to fool the discriminator. Experiments on node-level and graph-level classification benchmarks demonstrate that GraphAKD improves the student performance by a large margin. The results imply that GraphAKD can precisely transfer knowledge from a complicated teacher GNN to a compact student GNN.
Partial-label learning (PLL) solves the problem where each training instance is assigned a candidate label set, among which only one is the ground-truth label. The core of PLL is to learn efficient feature representations to facilitate label disambiguation. However, existing PLL methods only learn plain representations from coarse supervision, which are incapable of capturing sufficiently distinguishable representations, especially when confronted with knotty label ambiguity, i.e., certain candidate labels share similar visual patterns. In this paper, we propose a novel framework for partial-label learning with semantic label representations, dubbed ParSE, which consists of two synergistic processes: visual-semantic representation learning and powerful label disambiguation. In the former process, we propose a novel weighted calibration rank loss that has two implications. First, it implies a progressive calibration strategy that utilizes the disambiguated label confidence to weight the similarity between each image feature embedding and the semantic label representations of all its candidates. Second, it also considers the ranking relationship between candidate and non-candidate labels. Based on the learned visual-semantic representations, the subsequent label disambiguation is desirably endowed with more powerful abilities. Experiments on benchmarks show that ParSE outperforms state-of-the-art counterparts.
Given raster imagery features and imperfect vector training labels with registration uncertainty, this paper studies a deep learning framework that can quantify and reduce the registration uncertainty of training labels while simultaneously training the neural network parameters. The problem is important in broad applications such as streamline classification on Earth imagery or tissue segmentation on medical imagery, where annotating precise vector labels is expensive and time-consuming. However, the problem is challenging due to the gap between the vector representation of class labels and the raster representation of image features, and the need to train neural networks with uncertain label locations. Existing research on uncertain training labels often focuses on uncertainty in label class semantics or characterizes label registration uncertainty at the pixel level (not for contiguous vectors). To fill the gap, this paper proposes a novel learning framework that explicitly quantifies vector labels' registration uncertainty. We propose a registration-uncertainty-aware loss function and design an iterative uncertainty reduction algorithm by re-estimating the posterior distribution of the true vector label locations based on a Gaussian process. Evaluations on real-world datasets for National Hydrography Dataset refinement show that the proposed approach significantly outperforms several baselines in both registration uncertainty estimation and classification performance.
We propose a new kernel that quantifies success for the task of computing a core-periphery partition for an undirected network. Finding the associated optimal partitioning may be expressed in the form of a quadratic unconstrained binary optimization (QUBO) problem, to which a state-of-the-art quantum annealer may be applied. We therefore make use of the new objective function to (a) judge the performance of a quantum annealer, and (b) compare this approach with existing heuristic core-periphery partitioning methods. The quantum annealing is performed on a commercially available D-Wave machine. The QUBO problem involves a full matrix even when the underlying network is sparse. Hence, we develop and test a sparsified version of the original QUBO which increases the available problem dimension for the quantum annealer. Results are provided on both synthetic and real data sets, and we conclude that the QUBO/quantum annealing approach offers benefits in terms of optimizing this new quantity of interest.
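For readers unfamiliar with the QUBO form, the following toy sketch minimizes x^T Q x over binary vectors by exhaustive search; the Q shown is a hypothetical placeholder, not the paper's core-periphery objective, and brute force is only feasible at sizes far below what an annealer targets.

import itertools
import numpy as np

def solve_qubo_bruteforce(Q):
    # Minimize x^T Q x over binary x; feasible only for small n,
    # which is exactly the regime an annealer is meant to go beyond.
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        val = x @ Q @ x
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy QUBO (placeholder objective): reward selecting well-connected
# nodes while penalizing the size of the selected "core".
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
Q = -A + 1.5 * np.eye(5)
x, val = solve_qubo_bruteforce(Q)
print("core indicator:", x, "objective value:", val)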
Recurrent neural networks (RNNs) are widely used for handling sequence data. However, their black-box nature makes it difficult for users to interpret the decision-making process. We propose a new method that constructs deterministic finite automata to explain RNNs. In an automaton, states are abstracted from hidden states produced by the RNN, and transitions represent input symbols. Thus, users can follow the paths of transitions, called patterns, to understand how a prediction is produced. Existing methods for extracting automata partition the hidden state space at the beginning of the extraction, which often leads to solutions that are either inaccurate or too large to comprehend. Unlike previous methods, our approach allows the automaton states to be formed adaptively during the extraction. Instead of defining patterns on pre-determined clusters, our proposed model, AdaAX, identifies small sets of hidden states determined by patterns with finer granularity in the data. These small sets are then gradually merged to form states, allowing users to trade fidelity for lower complexity. Experiments show that our automata can achieve higher fidelity while being significantly smaller than baseline methods on synthetic and complex real datasets.
To develop effective sequential recommenders, a series of sequence representation learning (SRL) methods have been proposed to model historical user behaviors. Most existing SRL methods rely on explicit item IDs to develop sequence models that better capture user preference. Though effective to some extent, these methods are difficult to transfer to new recommendation scenarios because of their reliance on explicit modeling of item IDs. To tackle this issue, we present a novel universal sequence representation learning approach, named UniSRec. The proposed approach utilizes the associated description text of items to learn transferable representations across different recommendation scenarios. For learning universal item representations, we design a lightweight item encoding architecture based on parametric whitening and a mixture-of-experts enhanced adaptor. For learning universal sequence representations, we introduce two contrastive pre-training tasks by sampling multi-domain negatives. With the pre-trained universal sequence representation model, our approach can be effectively transferred to new recommendation domains or platforms in a parameter-efficient way, under either inductive or transductive settings. Extensive experiments conducted on real-world datasets demonstrate the effectiveness of the proposed approach. In particular, our approach also leads to a performance improvement in a cross-platform setting, showing the strong transferability of the proposed universal SRL method. The code and pre-trained model are available at: https://github.com/RUCAIBox/UniSRec.
Self-supervised learning (SSL) has been extensively explored in recent years. In particular, generative SSL has seen emerging success in natural language processing and other fields, such as the wide adoption of BERT and GPT. Despite this, contrastive learning---which heavily relies on structural data augmentation and complicated training strategies---has been the dominant approach in graph SSL, while the progress of generative SSL on graphs, especially graph autoencoders (GAEs), has thus far not reached the potential promised in other fields. In this paper, we identify and examine the issues that negatively impact the development of GAEs, including their reconstruction objective, training robustness, and error metric. We present a masked graph autoencoder, GraphMAE (code is publicly available at https://github.com/THUDM/GraphMAE), that mitigates these issues for generative self-supervised graph learning. Instead of reconstructing structures, we propose to focus on feature reconstruction with both a masking strategy and a scaled cosine error, which benefit the robust training of GraphMAE. We conduct extensive experiments on 21 public datasets for three different graph learning tasks. The results show that GraphMAE---a simple graph autoencoder with our careful designs---consistently outperforms both contrastive and generative state-of-the-art baselines. This study provides an understanding of graph autoencoders and demonstrates the potential of generative self-supervised learning on graphs.
We study the problem of few-shot Fine-grained Entity Typing (FET), where only a few annotated entity mentions with contexts are given for each entity type. Recently, prompt-based tuning has demonstrated superior performance to standard fine-tuning in few-shot scenarios by formulating the entity type classification task as a ''fill-in-the-blank'' problem. This allows effective utilization of the strong language modeling capability of Pre-trained Language Models (PLMs). Despite the success of current prompt-based tuning approaches, two major challenges remain: (1) the verbalizer in prompts is either manually designed or constructed from external knowledge bases, without considering the target corpus and label hierarchy information, and (2) current approaches mainly utilize the representation power of PLMs, but have not explored their generation power acquired through extensive general-domain pre-training. In this work, we propose a novel framework for few-shot FET consisting of two modules: (1) an entity type label interpretation module automatically learns to relate type labels to the vocabulary by jointly leveraging few-shot instances and the label hierarchy, and (2) a type-based contextualized instance generator produces new instances based on given instances to enlarge the training set for better generalization. On three benchmark datasets, our model outperforms existing methods by significant margins.
Logical reasoning over Knowledge Graphs (KGs) for first-order logic (FOL) queries performs query inference over KGs with logical operators, including conjunction, disjunction, existential quantification and negation, to approximate true answers in embedding spaces. However, most existing work imposes strong distributional assumptions (e.g., a Beta distribution), forcing entities and queries into a presumed distributional shape, which limits their expressive power. Moreover, query embeddings are challenging due to the relational complexities in multi-relational KGs (e.g., symmetry, anti-symmetry and transitivity). To bridge the gap, we propose a logical query reasoning framework, Line Embedding (LinE), for FOL queries. To relax the distributional assumptions, we introduce a logic space transformation layer, a generic neural function that converts embeddings from a probabilistic distribution space to the LinE embedding space. To tackle multi-relational and logical complexities, we formulate neural relation-specific projections and individual logical operators to truthfully ground LinE query embeddings on logical regularities and KG factoids. Lastly, to verify the LinE embedding quality, we generate a FOL query dataset from WordNet, which richly encompasses hierarchical relations. Extensive experiments show the superior reasoning sensitivity of LinE on three benchmarks against strong baselines, particularly for multi-hop relational queries and negation-related queries.
In recent years, specific evaluation metrics for time series anomaly detection algorithms have been developed to handle the limitations of the classical precision and recall. However, such metrics are heuristically built as an aggregate of multiple desirable aspects, introduce parameters and wipe out the interpretability of the output. In this article, we first highlight the limitations of the classical precision/recall, as well as the main issues of the recent event-based metrics -- for instance, we show that an adversary algorithm can reach high precision and recall on almost any dataset under weak assumptions. To cope with the above problems, we propose a theoretically grounded, robust, parameter-free and interpretable extension to precision/recall metrics, based on the concept of "affiliation" between the ground truth and the prediction sets. Our metrics leverage measures of duration between ground truth and predictions, and thus have an intuitive interpretation. By further comparison against random sampling, we obtain a normalized precision/recall, quantifying how much a given set of results is better than a random baseline prediction. By construction, our approach keeps the evaluation local with respect to ground truth events, enabling fine-grained visualization and interpretation of algorithmic results. We evaluate our proposal on various public time series anomaly detection datasets, comparing against existing algorithms and metrics. We further derive theoretical properties of the affiliation metrics that give explicit expectations about their behavior and ensure robustness against adversary strategies.
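As a simplified, hypothetical illustration of a duration-based score in the spirit of affiliation (not the paper's exact metric), the snippet below measures the average temporal distance from each predicted anomaly timestamp to its closest ground-truth event interval.

import numpy as np

def mean_distance_to_events(predicted_times, event_intervals):
    # For each predicted timestamp, temporal distance to the nearest
    # ground-truth interval (0 if inside the interval); lower is better.
    dists = []
    for t in predicted_times:
        d = min(max(start - t, 0.0, t - end) for start, end in event_intervals)
        dists.append(d)
    return float(np.mean(dists))

events = [(100.0, 120.0), (300.0, 310.0)]             # ground-truth anomalies
good_preds = [105.0, 118.0, 302.0]
noisy_preds = [50.0, 200.0, 400.0]
print(mean_distance_to_events(good_preds, events))    # 0.0
print(mean_distance_to_events(noisy_preds, events))   # much larger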
Tensor decomposition aims to factorize an input tensor into a number of latent factors. Due to the low-rank nature of tensors in real applications, the latent factors can be used to perform tensor completion in numerous tasks, such as knowledge graph completion and timely recommendation. However, existing works solve the problem in Euclidean space, where the tensor is decomposed into Euclidean vectors. Recent studies show that hyperbolic space is roomier than Euclidean space: with the same dimension, a hyperbolic vector can represent richer information (e.g., hierarchical structure) than a Euclidean vector. In this paper, we propose to decompose tensors in hyperbolic space. Considering that the most popular optimization tools (e.g., SGD, Adam) have not been generalized to hyperbolic space, we design an adaptive optimization algorithm according to the distinctive properties of the hyperbolic manifold. To address the non-convexity of the problem, we adopt gradient ascent in our optimization algorithm to avoid getting trapped in poor local optima. We conduct experiments on various tensor completion tasks, and the results validate the superiority of our method over baselines that solve the problem in Euclidean space.
We propose an extension to the transformer neural network architecture for general-purpose graph learning by adding a dedicated pathway for pairwise structural information, called edge channels. The resultant framework - which we call Edge-augmented Graph Transformer (EGT) - can directly accept, process and output structural information of arbitrary form, which is important for effective learning on graph-structured data. Our model exclusively uses global self-attention as an aggregation mechanism rather than static localized convolutional aggregation. This allows for unconstrained long-range dynamic interactions between nodes. Moreover, the edge channels allow the structural information to evolve from layer to layer, and prediction tasks on edges/links can be performed directly from the output embeddings of these channels. We verify the performance of EGT in a wide range of graph-learning experiments on benchmark datasets, in which it outperforms Convolutional/Message-Passing Graph Neural Networks. EGT sets a new state-of-the-art for the quantum-chemical regression task on the OGB-LSC PCQM4Mv2 dataset containing 3.8 million molecular graphs. Our findings indicate that global self-attention based aggregation can serve as a flexible, adaptive and effective replacement of graph convolution for general-purpose graph learning. Therefore, convolutional local neighborhood aggregation is not an essential inductive bias.
Decision tree ensembles are widely used and competitive learning models. Despite their success, popular toolkits for learning tree ensembles have limited modeling capabilities. For instance, these toolkits support a limited number of loss functions and are restricted to single-task learning. We propose a flexible framework for learning tree ensembles, which goes beyond existing toolkits to support arbitrary loss functions, missing responses, and multi-task learning. Our framework builds on differentiable (a.k.a. soft) tree ensembles, which can be trained using first-order methods. However, unlike classical trees, differentiable trees are difficult to scale. We therefore propose a novel tensor-based formulation of differentiable trees that allows for efficient vectorization on GPUs. We introduce FASTEL, a new toolkit (based on TensorFlow 2) for learning differentiable tree ensembles. We perform experiments on a collection of 28 real open-source and proprietary datasets, which demonstrate that our framework can lead to 100x more compact and 23% more expressive tree ensembles than those obtained by popular toolkits.
Two-view knowledge graphs (KGs) jointly represent two components: an ontology view for abstract and commonsense concepts, and an instance view for specific entities that are instantiated from ontological concepts. As such, these KGs contain heterogeneous structures that are hierarchical, from the ontology view, and cyclical, from the instance view. Despite these various structures in KGs, recent works on embedding KGs assume that the entire KG belongs to only one of the two views but not both simultaneously. For works that seek to put both views of the KG together, the instance and ontology views are assumed to belong to the same geometric space, such as all nodes embedded in the same Euclidean space or non-Euclidean product space, an assumption that is no longer reasonable for two-view KGs where different portions of the graph exhibit different structures. To address this issue, we define and construct a dual-geometric space embedding model (DGS) that models two-view KGs using a complex non-Euclidean geometric space, by embedding different portions of the KG in different geometric spaces. DGS utilizes the spherical space, hyperbolic space, and their intersecting space in a unified framework for learning embeddings. Furthermore, for the spherical space, we propose novel closed spherical space operators that directly decompose to using properties of the spherical space without the need for mapping to an approximate tangent space. Experiments on public datasets show that DGS significantly outperforms previous state-of-the-art baseline models on KG completion tasks, demonstrating its ability to better model heterogeneous structures in KGs.
Cash-out fraud refers to the withdrawal of cash from a credit card through illegitimate payments to merchants. Conventional data-driven approaches to cash-out detection commonly construct a classifier with domain-specific feature engineering. To further spot cash-out behaviors in complex scenarios, recent efforts adopt graph models to exploit the interaction relations rich in financial transactions. However, most existing graph-based methods are proposed for online payment activities in internet financial institutions. Moreover, these methods commonly rely on a large amount of online user data, which makes them not well suited to the traditional credit card services of commercial banks. In this paper, we focus on discerning fraudulent cash-out users by taking advantage of only the personal credit card data from banks. To alleviate the scarcity of available labeled data, we formulate the cash-out detection problem as identifying dense blocks. First, we define a bipartite multigraph to hold transactions between users and merchants, where cash-out activities generate cyclically intensive and high-volume flows. Second, we give a formal definition of cash-out behaviors from four perspectives: time, capital, cyclicity, and topotaxy. Then, we develop ANTICO, with a class of metrics to capture suspicious signals of the activities and a greedy algorithm to spot suspicious blocks by optimizing the proposed metric. Theoretical analysis shows a provable upper bound of ANTICO on the effectiveness of detecting cash-out users. Experimental results show that ANTICO outperforms state-of-the-art methods in accurately detecting cash-out users on both synthetic and real-world banking data.
Network representation learning has played a critical role in studying networks. One way to study a graph is to focus on its spectrum, i.e., the eigenvalue distribution of its associated matrices. Recent advances in spectral graph theory show that the spectral moments of a network can be used to capture the network structure and various graph properties. However, networks with different structures or sizes can sometimes have the same or similar spectral moments, not to mention the existence of cospectral graphs. To address such problems, we propose a 3D network representation that relies on the spectral information of subgraphs: the Spectral Path, a path connecting the spectral moments of the network and those of its subgraphs of different sizes. We show that the spectral path is interpretable and can capture the relationship between a network and its subgraphs, for which we present a theoretical foundation. We demonstrate the effectiveness of the spectral path in applications such as network visualization and network identification.
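The basic quantity the spectral path builds on can be computed in a few lines: the k-th spectral moment of an adjacency matrix is the average k-th power of its eigenvalues, equivalently trace(A^k)/n. The full method additionally tracks the moments of sampled subgraphs of different sizes, which this sketch omits.

import numpy as np

def spectral_moments(A, orders=(1, 2, 3, 4)):
    # k-th spectral moment: average of eigenvalue^k, i.e., trace(A^k) / n.
    eigvals = np.linalg.eigvalsh(A)
    return {k: float(np.mean(eigvals ** k)) for k in orders}

# Small undirected graph: a 4-cycle plus one chord.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
print(spectral_moments(A))
# The 3rd moment counts closed walks of length 3: 6 * (number of triangles) / n.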
Recent years have witnessed remarkable success achieved by graph neural networks (GNNs) in many real-world applications such as recommendation and drug discovery. Despite the success, oversmoothing has been identified as one of the key issues which limit the performance of deep GNNs. It indicates that the learned node representations are highly indistinguishable due to the stacked aggregators. In this paper, we propose a new perspective to look at the performance degradation of deep GNNs, i.e., feature overcorrelation. Through empirical and theoretical study on this matter, we demonstrate the existence of feature overcorrelation in deeper GNNs and reveal potential reasons leading to this issue. To reduce the feature correlation, we propose a general framework DeCorr which can encourage GNNs to encode less redundant information. Extensive experiments have demonstrated that DeCorr can help enable deeper GNNs and is complementary to existing techniques tackling the oversmoothing issue.
As training deep learning models on large datasets takes a lot of time and resources, it is desirable to construct a small synthetic dataset with which we can train deep learning models sufficiently well. Recent works have explored solutions for condensing image datasets through complex bi-level optimization. For instance, dataset condensation (DC) matches network gradients w.r.t. large real data and small synthetic data, where the network weights are optimized for multiple steps at each outer iteration. However, existing approaches have inherent limitations: (1) they are not directly applicable to graphs, where the data is discrete; and (2) the condensation process is computationally expensive due to the involved nested optimization. To bridge the gap, we investigate efficient dataset condensation tailored for graph datasets, where we model the discrete graph structure as a probabilistic model. We further propose a one-step gradient matching scheme, which performs gradient matching for only a single step without training the network weights. Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs. Extensive experiments on various graph datasets demonstrate the effectiveness and efficiency of the proposed method. In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance, and our method is significantly faster than multi-step gradient matching (e.g., 15× on CIFAR10 for synthesizing 500 graphs).
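A minimal sketch of the one-step gradient matching idea, using a plain logistic-regression model whose gradients are analytic so no deep learning library is needed: gradients on real and synthetic data are compared at the same randomly initialized weights, with no inner training loop. The paper applies this to GNNs with a probabilistic graph structure and updates the synthetic data by minimizing this matching loss.

import numpy as np

def logreg_grad(X, y, w):
    # Analytic gradient of the mean logistic loss for a linear model w.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def gradient_matching_loss(X_real, y_real, X_syn, y_syn, w):
    # One-step matching: compare gradients at the same (e.g., randomly
    # initialized) weights, without any inner training of w.
    g_real = logreg_grad(X_real, y_real, w)
    g_syn = logreg_grad(X_syn, y_syn, w)
    cos = g_real @ g_syn / (np.linalg.norm(g_real) * np.linalg.norm(g_syn) + 1e-12)
    return 1.0 - cos  # minimized w.r.t. the synthetic data during condensation

rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 8))
y_real = (X_real[:, 0] > 0).astype(float)
X_syn = rng.normal(size=(10, 8))
y_syn = (np.arange(10) % 2).astype(float)
w = rng.normal(size=8)
print(gradient_matching_loss(X_real, y_real, X_syn, y_syn, w))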
Deep learning models have been demonstrated powerful in modeling complex spatio-temporal data for traffic prediction. In practice, effective deep traffic prediction models rely on large-scale traffic data, which is not always available in real-world scenarios. To alleviate the data scarcity issue, a promising way is to use cross-city transfer learning methods to fine-tune well-trained models from source cities with abundant data. However, existing approaches overlook the divergence between source and target cities, and thus, the trained model from source cities may contain noise or even harmful source knowledge. To address the problem, we propose CrossTReS, a selective transfer learning framework for traffic prediction that adaptively re-weights source regions to assist target fine-tuning. As a general framework for fine-tuning-based cross-city transfer learning, CrossTReS consists of a feature network, a weighting network, and a prediction model. We train the feature network with node- and edge-level domain adaptation techniques to learn generalizable spatial features for both source and target cities. We further train the weighting network via source-target joint meta-learning such that source regions helpful to target fine-tuning are assigned high weights. Finally, the prediction model is selectively trained on the source city with the learned weights to initialize target fine-tuning. We evaluate CrossTReS using real-world taxi and bike data, where under the same settings, CrossTReS outperforms state-of-the-art baselines by up to 8%. Moreover, the learned region weights offer interpretable visualization.
Graph Convolutional Network (GCN) has exhibited strong empirical performance in many real-world applications. The vast majority of existing works on GCN primarily focus on the accuracy while ignoring how confident or uncertain a GCN is with respect to its predictions. Despite being a cornerstone of trustworthy graph mining, uncertainty quantification on GCN has not been well studied and the scarce existing efforts either fail to provide deterministic quantification or have to change the training procedure of GCN by introducing additional parameters or architectures. In this paper, we propose the first frequentist-based approach named JuryGCN in quantifying the uncertainty of GCN, where the key idea is to quantify the uncertainty of a node as the width of confidence interval by a jackknife estimator. Moreover, we leverage the influence functions to estimate the change in GCN parameters without re-training to scale up the computation. The proposed JuryGCN is capable of quantifying uncertainty deterministically without modifying the GCN architecture or introducing additional parameters. We perform extensive experimental evaluation on real-world datasets in the tasks of both active learning and semi-supervised node classification, which demonstrate the efficacy of the proposed method.
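To make the jackknife idea concrete, the sketch below computes leave-one-out estimates of a simple statistic and the resulting confidence interval width, which can serve as an uncertainty score. JuryGCN obtains the leave-one-out effect on GCN predictions via influence functions rather than actual re-training, which this toy example does not model.

import numpy as np

def jackknife_interval_width(values, z=1.96):
    # Leave-one-out estimates of the mean, then the standard jackknife
    # variance; the resulting interval width serves as an uncertainty score.
    values = np.asarray(values, dtype=float)
    n = len(values)
    loo_means = (values.sum() - values) / (n - 1)
    theta_bar = loo_means.mean()
    var_jack = (n - 1) / n * np.sum((loo_means - theta_bar) ** 2)
    return 2 * z * np.sqrt(var_jack)

confident_preds = [0.91, 0.92, 0.90, 0.93, 0.91]
uncertain_preds = [0.10, 0.95, 0.40, 0.80, 0.25]
print(jackknife_interval_width(confident_preds))  # narrow interval
print(jackknife_interval_width(uncertain_preds))  # wide interval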
We present HyperLogLogLog, a practical compression of the HyperLogLog sketch that compresses the sketch from $O(m \log\log n)$ bits down to $m \log_2\log_2\log_2 m + O(m + \log\log n)$ bits for estimating the number of distinct elements $n$ using $m$ registers. The algorithm works as a drop-in replacement that preserves all estimation properties of the HyperLogLog sketch, it is possible to convert back and forth between the compressed and uncompressed representations, and the compressed sketch maintains mergeability in the compressed domain. The compressed sketch can be updated in amortized constant time, assuming $n$ is sufficiently larger than $m$. We provide a C++ implementation of the sketch, and show by experimental evaluation against well-known implementations by Google and Apache that our implementation provides small sketches while maintaining competitive update and merge times. Concretely, we observed approximately a 40% reduction in the sketch size. Furthermore, we obtain as a corollary a theoretical algorithm that compresses the sketch down to $m \log_2\log_2\log_2\log_2 m + O(m \log\log\log m / \log\log m + \log\log n)$ bits.
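For reference, a minimal uncompressed HyperLogLog in Python (standard register update and raw estimator, small-range corrections omitted) shows what the compressed representation has to preserve; it is an illustrative sketch, not the paper's C++ implementation.

import hashlib
import math

class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                      # m = 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # standard bias constant

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                     # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rho = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self):
        z = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / z

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))   # close to 100,000 (within a few percent)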
Score-based generative models (SGMs) are a recent breakthrough in generating fake images. SGMs are known to surpass other generative models, e.g., generative adversarial networks (GANs) and variational autoencoders (VAEs). Inspired by their success, in this work we fully customize them for generating fake tabular data. In particular, we are interested in oversampling minor classes, since imbalanced classes frequently lead to sub-optimal training outcomes. To our knowledge, we are the first to present a score-based tabular data oversampling method. Firstly, we re-design our own score network, since we have to process tabular data. Secondly, we propose two options for our generation method: the former is equivalent to a style transfer for tabular data and the latter uses the standard generative policy of SGMs. Lastly, we define a fine-tuning method, which further enhances the oversampling quality. In our experiments with 6 datasets and 10 baselines, our method outperforms other oversampling methods in all cases.
Graph representations of a target domain often project it to a set of entities (nodes) and their relations (edges). However, such projections often miss important and rich information. For example, in graph representations used in missing value imputation, items --- represented as nodes --- may contain rich textual information. However, when processing graphs with graph neural networks (GNN), such information is either ignored or summarized into a single vector representation used to initialize the GNN. Towards addressing this, we present CoRGi, a GNN that considers the rich data within nodes in the context of their neighbors. This is achieved by endowing CoRGi's message passing with a personalized attention mechanism over the content of each node. This way, CoRGi assigns user-item-specific attention scores with respect to the words that appear in an item's content. We evaluate CoRGi on two edge-value prediction tasks and show that CoRGi is better at making edge-value predictions over existing methods, especially on sparse regions of the graph.
Efficient deployment of transformer models in practice is challenging due to their inference cost, including memory footprint, latency, and power consumption, which scales quadratically with input sequence length. To address this, we present a novel token reduction method dubbed Learned Token Pruning (LTP), which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold, whose value is learned for each layer during training. Our threshold-based method allows the length of the pruned sequence to vary adaptively based on the input sequence, and avoids algorithmically expensive operations such as top-k token selection. We extensively test the performance of LTP on GLUE and SQuAD tasks and show that our method outperforms the prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy with the same amount of FLOPs. In particular, LTP achieves up to 2.1× FLOPs reduction with less than 1% accuracy drop, which results in up to 1.9× and 2.0× throughput improvement on Intel Haswell CPUs and NVIDIA V100 GPUs, respectively. Furthermore, we demonstrate that LTP is more robust than prior methods to variations in input sequence lengths. Our code has been developed in PyTorch and open-sourced.
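The sketch below shows threshold-based token pruning on a given attention tensor: a token's importance is the attention it receives, averaged over heads and query positions, and tokens below a threshold are dropped. Here the threshold is a fixed constant for illustration, whereas LTP learns one per layer and applies the operation inside the transformer.

import numpy as np

def prune_tokens(hidden, attn_probs, threshold):
    # hidden: (seq_len, dim); attn_probs: (heads, seq_len, seq_len).
    # Importance of token j = attention it receives, averaged over heads
    # and query positions; tokens below `threshold` are dropped.
    importance = attn_probs.mean(axis=(0, 1))          # shape (seq_len,)
    keep = importance >= threshold
    return hidden[keep], keep

rng = np.random.default_rng(0)
seq_len, dim, heads = 16, 32, 4
hidden = rng.normal(size=(seq_len, dim))
logits = rng.normal(size=(heads, seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # softmax
pruned, keep = prune_tokens(hidden, attn, threshold=1.0 / seq_len)
print(f"kept {keep.sum()} of {seq_len} tokens")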
Triangular meshes have been actively used in computer graphics to represent 3D shapes. However, due to their non-uniform and irregular nature, learning such data with a Deep Neural Network is not straightforward. Transforming mesh data to simpler structures (e.g., voxel grids, point clouds, or multi-view 2D images) leads to other issues including spatial information loss and scalability. Traditional descriptors for mesh data simply extract hand-crafted features, which might not be effective in various environments. Several deep architectures that directly consume mesh data have been proposed, but their input features are still heuristic and unable to fully capture both geodesic and geometric characteristics of a mesh. In addition, their model architectures are not designed to be capable of providing visual explanations of their decision making.
In this paper, we propose ExMeshCNN, a novel and explainable CNN structure for learning 3D meshes. In the first layer, we implement a descriptor layer composed of two types of learnable descriptors, which focus on the geodesic and geometric characteristics of a mesh, respectively. A series of convolution layers then follow the descriptor layer to learn local features, where the convolution operations are carefully designed to be performed in a per-face manner. The final layer consists simply of a Global Average Pooling operation and the softmax output. In this way, ExMeshCNN learns mesh data in a completely end-to-end manner while retaining spatial information, and each layer is capable of computing face-level activations and gradients. Owing to these properties, existing visual attribution methods for model interpretability, such as LRP and Grad-CAM, can be easily applied to ExMeshCNN to highlight the salient surfaces of a 3D mesh for the corresponding prediction. Experimental results show that ExMeshCNN not only exhibits state-of-the-art or comparable performance in 3D mesh classification and segmentation with the smallest number of parameters, but also provides visual explanations of why it makes a specific prediction in 3D space.
Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, the use of diversity-based sampling is indispensable because there are many redundant or highly similar samples. The core-set approach is a promising diversity-based method that selects diverse samples by considering the distance between samples. However, the approach performs poorly compared to uncertainty-based methods, which select the most difficult samples, i.e., those on which neural models have low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to have more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density-awareness and propose a density-aware core-set (DACS), which estimates the density of the unlabeled samples and selects diverse samples mainly from sparse regions, which are treated as the informative regions. To reduce the computational bottleneck in estimating the density, we introduce a new density approximation based on locality-sensitive hashing. Experimental results demonstrate the efficacy of DACS in both classification and regression tasks and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS is only weakly dependent on the architecture, we also present a simple yet effective combination method to show that existing methods can be beneficially combined with DACS.
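A hypothetical sketch of an LSH-based density proxy in the spirit described above: random-hyperplane signatures bucket the unlabeled pool, bucket sizes approximate local density, and selection favors samples from sparse buckets. The exact hashing and selection scheme of DACS may differ.

import numpy as np

def lsh_density(features, num_bits=8, seed=0):
    # Random-hyperplane LSH: points sharing a signature land in one bucket,
    # and the bucket size is used as a cheap proxy for local density.
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(features.shape[1], num_bits))
    codes = (features @ planes > 0).astype(int) @ (1 << np.arange(num_bits))
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    return counts[inverse]  # per-sample bucket population

rng = np.random.default_rng(1)
dense_blob = rng.normal(5.0, 0.1, size=(500, 16))   # tight cluster
sparse_pts = rng.normal(0.0, 3.0, size=(50, 16))    # scattered points
X = np.vstack([dense_blob, sparse_pts])
density = lsh_density(X)
# Prefer labeling candidates from sparse buckets (low estimated density).
candidates = np.argsort(density)[:10]
print("selected indices:", candidates)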
Flow graphs capture the directed flow of a quantity of interest (e.g., water, power, vehicles) being transported through an underlying network. Modeling and generating realistic flow graphs is key to many applications in infrastructure design, transportation, and the biomedical and social sciences. However, they pose a great challenge to existing generative models due to complex dynamics that are often governed by domain-specific physical laws or patterns. We introduce FlowGEN, an implicit generative model for flow graphs, that learns how to jointly generate graph topologies and flows with diverse dynamics directly from data using a novel (flow) graph neural network. Experiments show that our approach is able to effectively reproduce relevant local and global properties of flow graphs, including flow conservation, cyclic trends, and congestion around hotspots.
Graph Neural Networks (GNNs) are one of the prominent methods for semi-supervised learning on graphs. However, most existing GNN models need sufficient observed data to allow for effective learning and generalization. In real-world scenarios where a complete input graph structure and sufficient node labels might not be easily obtained, GNN models encounter severe performance degradation. To address this problem, we propose WSGNN, short for weakly-supervised graph neural network. WSGNN is a flexible probabilistic generative framework which harnesses a variational inference approach to solve graph semi-supervised learning in a label-structure joint estimation manner. It collaboratively learns task-related new graph structure and node representations through a two-branch network, and targets a composite variational objective derived from the underlying data generation distribution concerning the inter-dependence between scarce observed data and massive missing data. In particular, under a weakly-supervised low-data regime where labeled nodes and observed edges are both very limited, extensive experimental results on node classification and link prediction over common benchmarks demonstrate the state-of-the-art performance of WSGNN over strong competitors. Concretely, when only 1 label per class and 1% of edges are observed on Cora, WSGNN maintains a decent 52.00% classification accuracy, exceeding GCN by 75.6%.
Modeling how network-level traffic flow changes in the urban environment is useful for decision-making in transportation, public safety and urban planning. The traffic flow system can be viewed as a dynamic process that transits between states (e.g., traffic volumes on each road segment) over time. In the real-world traffic system with traffic operation actions like traffic signal control or reversible lane changing, the system's state is influenced by both the historical states and the actions of traffic operations. In this paper, we consider the problem of modeling network-level traffic flow under a real-world setting, where the available data is sparse (i.e., only part of the traffic system is observed). We present DTIGNN, an approach that can predict network-level traffic flows from sparse data. DTIGNN models the traffic system as a dynamic graph influenced by traffic signals, learns the transition models grounded by fundamental transition equations from transportation, and predicts future traffic states with imputation in the process. Through comprehensive experiments, we demonstrate that our method outperforms state-of-the-art methods and can better support decision-making in transportation.
Hartigan's Dip-test of unimodality has gained increasing interest in unsupervised learning over the past few years. It is free from complex parameterization and does not require an a priori distribution assumption. A useful property is that the gradient of the resulting Dip-value can be computed, making it possible to find a projection axis that identifies multimodal structures in the data set. In this paper, we show how to apply the gradient not only with respect to the projection axis but also with respect to the data to improve the cluster structure. By tightly coupling the Dip-test with an autoencoder, we obtain an embedding that clearly separates all clusters in the data set. This method, called DipEncoder, is the basis of a novel deep clustering algorithm. Extensive experiments show that the DipEncoder is highly competitive with state-of-the-art methods.
Designing accurate deep learning models for molecular property prediction plays an increasingly essential role in drug and material discovery. Recently, due to the scarcity of labeled molecules, self-supervised learning methods for learning generalizable and transferable representations of molecular graphs have attracted lots of attention. In this paper, we argue that there exist two major issues hindering current self-supervised learning methods from obtaining desired performance on molecular property prediction, that is, the ill-defined pre-training tasks and the limited model capacity. To this end, we introduce Knowledge-guided Pre-training of Graph Transformer (KPGT), a novel self-supervised learning framework for molecular graph representation learning, to alleviate the aforementioned issues and improve the performance on the downstream molecular property prediction tasks. More specifically, we first introduce a high-capacity model, named Line Graph Transformer (LiGhT), which emphasizes the importance of chemical bonds and is mainly designed to model the structural information of molecular graphs. Then, a knowledge-guided pre-training strategy is proposed to exploit the additional knowledge of molecules to guide the model to capture the abundant structural and semantic information from large-scale unlabeled molecular graphs. Extensive computational tests demonstrated that KPGT can offer superior performance over current state-of-the-art methods on several molecular property prediction tasks.
Physical systems are extending their monitoring capacities to edge areas with low-cost, low-power sensors and advanced data mining and machine learning techniques. However, new systems often have limited data for training the model, calling for effective knowledge transfer from other relevant grids. Specifically, Domain Adaptation (DA) seeks domain-invariant features to boost the model performance in the target domain. Nonetheless, existing DA techniques face significant challenges due to the unique characteristics of physical datasets: (1) complex spatial-temporal correlations, (2) diverse data sources including node/edge measurements and labels, and (3) large-scale data sizes. In this paper, we propose a novel cross-graph DA based on two core designs of graph kernels and graph coarsening. The former design handles spatial-temporal correlations and can incorporate networked measurements and labels conveniently. The spatial structures, temporal trends, measurement similarity, and label information together determine the similarity of two graphs, guiding the DA to find domain-invariant features. Mathematically, we construct a Graph kerNel-based distribution Adaptation (GNA) with a specifically-designed graph kernel. Then, we prove the proposed kernel is positive definite and universal, which strictly guarantees the feasibility of the used DA measure. However, the computation cost of the kernel is prohibitive for large systems. In response, we propose a novel coarsening process to obtain much smaller graphs for GNA. Finally, we report the superiority of GNA in diversified systems, including power systems, mass-damper systems, and human-activity sensing systems.
In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV can be broadly classified into two categories: i) scanning-based methods, which scan the entire data and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from a sample rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation is preferable in applications with a large data volume and a permissible error restriction due to its higher scalability. However, while sampling-based methods are effective on a single machine, they are less practical in a distributed environment with massive data volumes: to obtain the final NDV estimators, the entire sample must be transferred throughout the distributed system, incurring a prohibitive communication cost when the sampling rate is high. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method leverages a sketch-based algorithm to estimate the sample's frequency of frequency in the distributed streaming model, which is compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
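To make the role of the frequency-of-frequency profile concrete, the sketch below computes it for a tiny sample and plugs it into one classical sampling-based estimator (Chao's lower-bound estimator). This only illustrates the statistics that a distributed sketch would need to summarize; it is not the paper's sketch or its communication protocol.

    from collections import Counter

    def frequency_of_frequency(sample):
        """f[j] = number of distinct values that occur exactly j times in the sample."""
        value_counts = Counter(sample)                   # how often each value appears
        return Counter(value_counts.values())

    def chao_ndv_estimate(sample):
        """A classical sampling-based NDV estimator built from the frequency-of-frequency profile."""
        f = frequency_of_frequency(sample)
        d = sum(f.values())                              # number of distinct values observed
        f1, f2 = f.get(1, 0), f.get(2, 0)
        if f2 > 0:
            return d + (f1 * f1) / (2.0 * f2)            # classic Chao1 lower bound
        return d + f1 * (f1 - 1) / 2.0                   # bias-corrected form when f2 == 0

    sample = ["a", "b", "a", "c", "d", "d", "e"]
    print(frequency_of_frequency(sample))                # Counter({1: 3, 2: 2})
    print(chao_ndv_estimate(sample))                     # 7.25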
Cognitive diagnostic assessment is a fundamental task in intelligent education, which aims at quantifying students' cognitive level on knowledge attributes. Since there exists learning dependency among knowledge attributes, it is crucial for cognitive diagnosis models (CDMs) to incorporate the attribute hierarchy when assessing students. The attribute hierarchy has been explored by only a few CDMs, such as the Attribute Hierarchy Method, and these methods still have two significant limitations. First, their time complexity becomes unbearable when the number of attributes is large. Second, the assumption used to model the attribute hierarchy is so strong that it may lose information about the hierarchy and is not flexible enough to fit all situations. To address these limitations, we propose a novel Bayesian network-based Hierarchical Cognitive Diagnosis Framework (HierCDF), which enables many traditional diagnostic models to flexibly integrate the attribute hierarchy for better diagnosis. Specifically, we first use an efficient Bayesian network to model the influence of the attribute hierarchy on students' cognitive states. Then we design a CDM adaptor to bridge the gap between students' cognitive states and the input features of existing diagnostic models. Finally, we analyze the generality and complexity of HierCDF to show its effectiveness in modeling hierarchy information. The performance of HierCDF is experimentally demonstrated on real-world large-scale datasets.
Federated learning (FL) is a promising privacy-preserving machine learning paradigm over distributed data. In FL, the data is kept locally by each user. This protects user privacy, but also makes it difficult for the server to verify data quality, in particular whether the data are correctly labeled. Training with corrupted labels is harmful to the federated learning task; however, little attention has been paid to FL in the presence of label noise. In this paper, we focus on this problem and propose a learning-based reweighting approach to mitigate the effect of noisy labels in FL. More precisely, we tune a weight for each training sample such that the learned model has optimal generalization performance over a validation set. Formally, this process can be formulated as a federated bilevel optimization problem, i.e., an optimization problem with two levels of entangled sub-problems. Non-distributed bilevel problems have witnessed notable progress recently with new efficient algorithms; however, solving bilevel optimization problems in the federated learning setting is under-investigated. We identify the high communication cost of hypergradient evaluation as the major bottleneck, and propose Comm-FedBiO to solve general federated bilevel optimization problems; more specifically, we propose two communication-efficient subroutines to estimate the hypergradient. Convergence analysis of the proposed algorithms is also provided. Finally, we apply the proposed algorithms to solve the noisy label problem. Our approach has shown superior performance on several real-world datasets compared to various baselines.
Benefiting from the message passing mechanism, Graph Neural Networks (GNNs) have been successful on a wide range of tasks over graph data. However, recent studies have shown that attackers can catastrophically degrade the performance of GNNs by maliciously modifying the graph structure. A straightforward solution to remedy this issue is to model the edge weights by learning a metric function between pairwise representations of two end nodes, which attempts to assign low weights to adversarial edges. Existing methods use either raw features or representations learned by supervised GNNs to model the edge weights. However, both strategies face immediate problems: raw features cannot represent various properties of nodes (e.g., structural information), and representations learned by supervised GNNs may suffer from the poor performance of the classifier on the poisoned graph. We need representations that carry both feature information and as much correct structural information as possible and are insensitive to structural perturbations. To this end, we propose an unsupervised pipeline, named STABLE, to optimize the graph structure. Finally, we input the well-refined graph into a downstream classifier. For this part, we design an advanced GCN that significantly enhances the robustness of the vanilla GCN [24] without increasing the time complexity. Extensive experiments on four real-world graph benchmarks demonstrate that STABLE outperforms the state-of-the-art methods and successfully defends against various attacks.
Modeling complex spatial and temporal dependencies is indispensable for location-bound time series learning. Existing methods, typically relying on graph neural networks (GNNs) and temporal learning modules based on recurrent neural networks, have achieved significant performance improvements. However, their representation capabilities and prediction results are limited when pre-defined graphs are unavailable. Unlike spatio-temporal GNNs that focus on designing complex architectures, we propose a novel adaptive graph construction strategy: Self-Paced Graph Contrastive Learning (SPGCL). It learns informative relations by maximizing the distinguishing margin between positive and negative neighbors and generates an optimal graph with a self-paced strategy. Specifically, the existing neighborhoods iteratively absorb more reliable nodes with the highest affinity scores as new neighbors to generate the next-round neighborhoods, and augmentations are applied to improve transferability and robustness. As the adaptively self-paced graph approaches the optimal graph for prediction, the mutual information between nodes and their corresponding neighbors is maximized. Our work provides a new perspective on addressing spatio-temporal learning problems beyond information aggregation in Euclidean space and can be generalized to different tasks. Extensive experiments conducted on two typical spatio-temporal learning tasks (traffic forecasting and land displacement prediction) demonstrate the superior performance of SPGCL against the state-of-the-art.
Anomaly detection is essential for preventing hazardous outcomes for safety-critical applications like autonomous driving. Given their safety-criticality, these applications benefit from provable bounds on various errors in anomaly detection. To achieve this goal in the semi-supervised setting, we propose to provide Probably Approximately Correct (PAC) guarantees on the false negative and false positive detection rates for anomaly detection algorithms. Our method (PAC-Wrap) can wrap around virtually any existing semi-supervised and unsupervised anomaly detection method, endowing it with rigorous guarantees. Our experiments with various anomaly detectors and datasets indicate that PAC-Wrap is broadly effective.
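As a rough illustration of what a finite-sample guarantee on a detection error can look like, the sketch below calibrates a score threshold on held-out normal data and adds a Hoeffding-style correction to the empirical false positive rate. It is a simplified stand-in written under stated assumptions (a single-threshold bound, no union bound over the scanned thresholds), not PAC-Wrap's actual procedure or its false negative guarantee.

    import numpy as np

    def pac_fpr_threshold(normal_scores, target_fpr=0.05, delta=0.05):
        """Pick a threshold whose empirical FPR plus a Hoeffding slack stays below target_fpr."""
        scores = np.sort(np.asarray(normal_scores, dtype=float))
        n = len(scores)
        slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))     # Hoeffding deviation term
        for i, t in enumerate(scores):
            empirical_fpr = (n - i - 1) / n                  # calibration scores strictly above t
            if empirical_fpr + slack <= target_fpr:
                return t                                     # smallest such threshold
        return scores[-1]                                    # fall back to the largest score

    rng = np.random.default_rng(0)
    calibration_scores = rng.normal(size=5000)               # anomaly scores of held-out normal data
    print(pac_fpr_threshold(calibration_scores, target_fpr=0.05, delta=0.05))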
With the extensive application of machine learning models, automatic hyperparameter optimization (HPO) has become increasingly important. Motivated by the tuning behaviors of human experts, it is intuitive to leverage auxiliary knowledge from past HPO tasks to accelerate the current HPO task. In this paper, we propose TransBO, a novel two-phase transfer learning framework for HPO, which simultaneously handles the complementary nature of source tasks and the dynamics of knowledge aggregation. This framework extracts and aggregates source and target knowledge jointly and adaptively, where the weights can be learned in a principled manner. Extensive experiments, including static and dynamic transfer learning settings and neural architecture search, demonstrate the superiority of TransBO over state-of-the-art methods.
The tuning of hyperparameters becomes increasingly important as machine learning (ML) models are extensively applied in data mining applications. Among various approaches, Bayesian optimization (BO) is a successful methodology for tuning hyperparameters automatically. While traditional methods optimize each tuning task in isolation, there has been recent interest in speeding up BO by transferring knowledge across previous tasks. In this work, we introduce an automatic method to design the BO search space with the aid of tuning history from past tasks. This simple yet effective approach can be used to endow many existing BO methods with transfer learning capabilities. In addition, it enjoys three advantages: universality, generality, and safeness. Extensive experiments show that our approach considerably boosts BO by designing a promising and compact search space instead of using the entire space, and outperforms state-of-the-art methods on a wide range of benchmarks, including machine learning and deep learning tuning tasks and neural architecture search.
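One simple instantiation of a history-informed search space (an illustrative assumption, not necessarily the paper's construction) is to bound the new space by the region spanned by the top-performing configurations from past tasks, as sketched below.

    import numpy as np

    def promising_search_space(past_configs, past_scores, top_frac=0.2, margin=0.1):
        """past_configs: (n, d) configurations normalized to [0, 1]; past_scores: larger is better.
        Returns per-dimension (lower, upper) bounds of a compact sub-box for the new task."""
        k = max(1, int(top_frac * len(past_scores)))
        top = np.asarray(past_configs)[np.argsort(past_scores)[-k:]]   # best k past configurations
        lower = np.clip(top.min(axis=0) - margin, 0.0, 1.0)
        upper = np.clip(top.max(axis=0) + margin, 0.0, 1.0)
        return lower, upper

    configs = np.random.rand(200, 3)                  # past hyperparameter settings
    scores = -((configs - 0.7) ** 2).sum(axis=1)      # toy objective: best configurations near 0.7
    print(promising_search_space(configs, scores))    # a compact box around 0.7 in each dimension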
Weakly supervised named entity recognition methods train label models to aggregate the token annotations of multiple noisy labeling functions (LFs) without seeing any manually annotated labels. To work well, the label model needs to contextually identify and emphasize well-performing LFs while down-weighting the under-performers. However, evaluating the LFs is challenging due to the lack of ground truth. To address this issue, we propose the sparse conditional hidden Markov model (Sparse-CHMM). Instead of predicting the entire emission matrix as other HMM-based methods do, Sparse-CHMM focuses on estimating its diagonal elements, which are regarded as the reliability scores of the LFs. The sparse scores are then expanded to the full-fledged emission matrix with pre-defined expansion functions. We also augment the emission with weighted XOR scores, which track the probabilities of an LF observing incorrect entities. Sparse-CHMM is optimized through unsupervised learning with a three-stage training pipeline that reduces the training difficulty and prevents the model from falling into local optima. Compared with the baselines in the Wrench benchmark, Sparse-CHMM achieves a 3.01 average F1 score improvement on five comprehensive datasets. Experiments show that each component of Sparse-CHMM is effective, and the estimated LF reliabilities strongly correlate with true LF F1 scores.
Graph Convolutional Networks (GCNs) have fueled a surge of research interest due to their encouraging performance on graph learning tasks, but they have also been shown to be vulnerable to adversarial attacks. In this paper, an effective graph structural attack is investigated to disrupt graph spectral filters in the Fourier domain, which are the theoretical foundation of GCNs. We define the notion of spectral distance based on the eigenvalues of the graph Laplacian to measure the disruption of spectral filters. We realize the attack by maximizing the spectral distance and propose an efficient approximation to reduce the time complexity brought by eigen-decomposition. The experiments demonstrate the remarkable effectiveness of the proposed attack in both black-box and white-box settings, for both test-time evasion attacks and training-time poisoning attacks. Our qualitative analysis suggests a connection between the imposed spectral changes in the Fourier domain and the attack behavior in the spatial domain, providing empirical evidence that maximizing the spectral distance is an effective way to change the structural properties of the graph and thus disturb the frequency components used by graph filters, affecting the learning of GCNs.
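The attacked quantity itself is easy to compute. The sketch below is illustrative only; the choice of the symmetric normalized Laplacian and the L2 norm over eigenvalues is an assumption about the exact definition, and the attack's optimization and its efficient approximation are not shown.

    import numpy as np

    def normalized_laplacian_spectrum(adj):
        """Eigenvalues of the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
        deg = adj.sum(axis=1)
        d_inv_sqrt = np.zeros_like(deg)
        nz = deg > 0
        d_inv_sqrt[nz] = 1.0 / np.sqrt(deg[nz])
        lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        return np.linalg.eigvalsh(lap)

    def spectral_distance(adj, adj_perturbed):
        """L2 distance between the two Laplacian spectra."""
        return np.linalg.norm(normalized_laplacian_spectrum(adj)
                              - normalized_laplacian_spectrum(adj_perturbed))

    rng = np.random.default_rng(1)
    A = (rng.random((30, 30)) < 0.1).astype(float)
    A = np.triu(A, 1); A = A + A.T                                  # random undirected graph
    A_atk = A.copy(); A_atk[0, 1] = A_atk[1, 0] = 1.0 - A_atk[0, 1]  # flip one edge
    print(spectral_distance(A, A_atk))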
Finding an appropriate representation of dynamic activities in the brain is crucial for many downstream applications. Due to its highly dynamic nature, temporally averaged fMRI (functional magnetic resonance imaging) can only provide a narrow view of underlying brain activities. Previous works lack the ability to learn and interpret the latent dynamics in brain architectures. This paper builds an efficient graph neural network model that incorporates both region-mapped fMRI sequences and structural connectivities obtained from DWI (diffusion-weighted imaging) as inputs. We find good representations of the latent brain dynamics through learning sample-level adaptive adjacency matrices and performing a novel multi-resolution inner cluster smoothing. We also attribute inputs with integrated gradients, which enables us to infer (1) highly involved brain connections and subnetworks for each task, (2) temporal keyframes of imaging sequences that characterize tasks, and (3) subnetworks that discriminate between individual subjects. This ability to identify critical subnetworks that characterize signal states across heterogeneous tasks and individuals is of great importance to neuroscience and other scientific domains. Extensive experiments and ablation studies demonstrate our proposed method's superiority and efficiency in spatial-temporal graph signal modeling with insightful interpretations of brain dynamics.
Graph diffusion problems such as the propagation of rumors, computer viruses, or smart grid failures are ubiquitous and societally important. Hence it is usually crucial to identify diffusion sources from the current graph diffusion observations. Despite its tremendous necessity and significance in practice, source localization, as the inverse problem of graph diffusion, is extremely challenging because it is ill-posed: different sources may lead to the same graph diffusion patterns. Different from most traditional source localization methods, this paper adopts a probabilistic manner to account for the uncertainty of different candidate sources. Such an endeavor requires overcoming significant challenges: 1) the uncertainty in graph diffusion source localization is hard to quantify; 2) the complex patterns of graph diffusion sources are difficult to characterize probabilistically; 3) generalization under arbitrary underlying diffusion patterns is hard to impose. To solve these challenges, this paper presents a generic framework, the Source Localization Variational AutoEncoder (SL-VAE), for locating diffusion sources under arbitrary diffusion patterns. In particular, we propose a probabilistic model that leverages a forward diffusion estimation model along with deep generative models to approximate the diffusion source distribution for quantifying the uncertainty. SL-VAE further utilizes prior knowledge of the source-observation pairs to characterize the complex patterns of diffusion sources by a learned generative prior. Lastly, a unified objective that integrates the forward diffusion estimation model is derived to enforce the model to generalize under arbitrary diffusion patterns. Extensive experiments are conducted on 7 real-world datasets to demonstrate the superiority of SL-VAE in reconstructing the diffusion sources, exceeding the state-of-the-art methods by 20% on average in AUC score. The code and data are available at: https://github.com/triplej0079/SLVAE.
Generalizability to new databases is of vital importance to Text-to-SQL systems, which aim to parse human utterances into SQL statements. Existing works achieve this goal by leveraging exact matching to identify lexical matches between the question words and the schema items. However, these methods fail in other challenging scenarios, such as synonym substitution, in which the surface forms of the corresponding question words and schema items differ. In this paper, we propose a framework named ISESL-SQL to iteratively build a semantically enhanced schema-linking graph between question tokens and database schemas. First, we extract a schema linking graph from PLMs through a probing procedure in an unsupervised manner. Then the schema linking graph is further optimized during the training process through a deep graph learning method. Meanwhile, we also design an auxiliary task called graph regularization to improve the schema information captured by the schema-linking graph. Extensive experiments on three benchmarks demonstrate that ISESL-SQL consistently outperforms the baselines, and further investigations show its generalizability and robustness.
This paper studies strongly-convex-strongly-concave minimax optimization with unbalanced dimensionality. Such problems arise in several popular data science applications, such as few-shot learning and fairness-aware machine learning. The design of conventional iterative algorithms for minimax optimization typically focuses on reducing the total number of oracle calls, which ignores the unbalanced computational cost of accessing information from the two different variables of the minimax problem. We propose a novel second-order optimization algorithm, called the Partial-Quasi-Newton (PQN) method, which takes advantage of the unbalanced structure of the problem to establish the Hessian estimate efficiently. We theoretically prove that our PQN method converges to the saddle point faster than existing minimax optimization algorithms. Numerical experiments on real-world applications show that the proposed PQN performs significantly better than state-of-the-art methods.
Spatial-temporal forecasting plays an important role in improving the quality and performance of Intelligent Transportation Systems. This task is rather challenging due to the complicated and long-range spatial-temporal dependencies in traffic networks. Existing studies typically employ different deep neural networks to learn spatial and temporal representations so as to capture the complex and dynamic dependencies. In this paper, we argue that it is insufficient to capture long-range spatial dependencies from the implicit representations learned by temporal extraction modules. To address this problem, we propose the Multi-Step Dependency Relation (MSDR), a brand new variant of recurrent neural network. Instead of only looking at the hidden state from the single latest time step, MSDR explicitly takes the hidden states of multiple historical time steps as the input of each time unit. We also develop two strategies to incorporate spatial information into the dependency relation embedding between multiple historical time steps and the current one in MSDR. On this basis, we propose the Graph-based MSDR (GMSDR) framework to support general spatial-temporal forecasting applications by seamlessly integrating graph-based neural networks with MSDR. We evaluate our proposed approach on several popular datasets. The results show that the proposed GMSDR framework outperforms state-of-the-art methods by a clear margin.
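To make the recurrence over multiple historical hidden states concrete, here is a heavily simplified PyTorch cell; it is an assumption-laden sketch rather than the paper's MSDR cell, and it omits the graph-based spatial components. Each update conditions on the last K hidden states instead of only the latest one.

    import torch
    import torch.nn as nn

    class MultiStepRNNCell(nn.Module):
        def __init__(self, input_dim, hidden_dim, k_steps=3):
            super().__init__()
            self.k_steps = k_steps
            # project the concatenation of the last K hidden states, plus the usual input projection
            self.hist_proj = nn.Linear(k_steps * hidden_dim, hidden_dim)
            self.in_proj = nn.Linear(input_dim, hidden_dim)

        def forward(self, x, hist):
            """x: (batch, input_dim); hist: list of previous hidden states, each (batch, hidden_dim)."""
            h_cat = torch.cat(hist[-self.k_steps:], dim=-1)
            return torch.tanh(self.in_proj(x) + self.hist_proj(h_cat))

    cell = MultiStepRNNCell(input_dim=8, hidden_dim=16, k_steps=3)
    hist = [torch.zeros(4, 16) for _ in range(3)]          # warm-up history for a batch of 4
    for t in range(10):
        h = cell(torch.randn(4, 8), hist)
        hist.append(h)
    print(h.shape)                                         # torch.Size([4, 16])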
Most methods for context-aware recommendation focus on improving the feature interaction layer but overlook the embedding layer. However, an embedding layer with random initialization often suffers in practice from the sparsity of the contextual features, as well as the sparsity of the interactions between the users (or items) and contexts. In this paper, we propose a novel user-event graph embedding learning (UEG-EL) framework to address these two sparsity challenges. Specifically, our UEG-EL contains three modules: 1) a graph construction module is used to obtain a user-event graph containing nodes for users, intents and items, where the intent nodes are generated by applying intent node attention (INA) to the nodes of the contextual features; 2) a user-event collaborative graph convolution module is designed to obtain the refined embeddings of all features by executing a new convolution strategy on the user-event graph, where each intent node acts as a hub to efficiently propagate information among different features; 3) a recommendation module is equipped to integrate existing context-aware recommendation models, where the feature embeddings are directly initialized with the obtained refined embeddings. Moreover, we identify a unique challenge of the basic framework, that is, the contextual features associated with too many instances may suffer from noise when aggregating the information. We thus further propose a simple but effective variant, i.e., UEG-EL-V, in order to prune the information propagation of the contextual features. Finally, we conduct extensive experiments on three public datasets to verify the effectiveness and compatibility of our UEG-EL and its variant.
Gene Ontology (GO) is the primary gene function knowledge base that enables computational tasks in biomedicine. The basic element of GO is a term, which includes a set of genes with the same function. Existing research efforts of GO mainly focus on predicting gene term associations. Other tasks, such as generating descriptions of new terms, are rarely pursued. In this paper, we propose a novel task: GO term description generation. This task aims to automatically generate a sentence that describes the function of a GO term belonging to one of the three categories, i.e., molecular function, biological process, and cellular component. To address this task, we propose a Graph-in-Graph network that can efficiently leverage the structural information of GO. The proposed network introduces a two-layer graph: the first layer is a graph of GO terms where each node is also a graph (gene graph). Such a Graph-in-Graph network can derive the biological functions of GO terms and generate proper descriptions. To validate the effectiveness of the proposed network, we build three large-scale benchmark datasets. By incorporating the proposed Graph-in-Graph network, the performances of seven different sequence-to-sequence models can be substantially boosted across all evaluation metrics, with up to 34.7%, 14.5%, and 39.1% relative improvements in BLEU, ROUGE-L, and METEOR, respectively.
A rationale is defined as a subset of input features that best explains or supports the prediction of a machine learning model. Rationale identification has improved the generalizability and interpretability of neural networks on vision and language data. In graph applications such as molecule and polymer property prediction, identifying representative subgraph structures, known as graph rationales, plays an essential role in the performance of graph neural networks. Existing graph pooling and/or distribution intervention methods suffer from a lack of examples from which to learn to identify optimal graph rationales. In this work, we introduce a new augmentation operation called environment replacement that automatically creates virtual data examples to improve rationale identification. We propose an efficient framework that performs rationale-environment separation and representation learning on the real and augmented examples in latent spaces to avoid the high complexity of explicit graph decoding and encoding. Compared with recent techniques, experiments on seven molecular and four polymer datasets demonstrate the effectiveness and efficiency of the proposed augmentation-based graph rationalization framework. Data and the implementation of the proposed framework are publicly available at https://github.com/liugangcode/GREA.
Multi-label aspect category detection allows a given review sentence to contain multiple aspect categories, which is more practical in sentiment analysis and has attracted increasing attention. As annotating large amounts of data is time-consuming and labor-intensive, data scarcity occurs frequently in real-world scenarios, which motivates multi-label few-shot aspect category detection. However, research on this problem is still in its infancy and few methods are available. In this paper, we propose a novel label-enhanced prototypical network (LPN) for multi-label few-shot aspect category detection. The highlights of LPN can be summarized as follows. First, it leverages label descriptions as auxiliary knowledge to learn more discriminative prototypes, which can retain aspect-relevant information while eliminating the harmful effect caused by irrelevant aspects. Second, it integrates contrastive learning, which encourages sentences with the same aspect label to be pulled together in the embedding space while simultaneously pushing apart sentences with different aspect labels. In addition, it introduces an adaptive multi-label inference module to predict the aspect count in the sentence, which is simple yet effective. Extensive experimental results on three datasets demonstrate that our proposed model LPN can consistently achieve state-of-the-art performance.
Learning fair representations is an essential task to reduce bias in data-oriented decision making. It protects minority subgroups by requiring the learned representations to be independent of sensitive attributes. To achieve independence, the vast majority of the existing work primarily relaxes it to the minimization of the mutual information between sensitive attributes and learned representations. However, direct computation of mutual information is computationally intractable, and various upper bounds currently used either are still intractable or contradict the utility of the learned representations. In this paper, we introduce distance covariance as a new dependence measure into fair representation learning. By observing that sensitive attributes (e.g., gender, race, and age group) are typically categorical, the distance covariance can be converted to a tractable penalty term without contradicting the utility desideratum. Based on the tractable penalty, we propose FairDisCo, a variational method to learn fair representations. Experiments demonstrate that FairDisCo outperforms existing competitors for fair representation learning.
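For readers unfamiliar with the dependence measure, the snippet below computes the squared sample distance covariance between a batch of representations and a one-hot-encoded categorical sensitive attribute. It only illustrates the ingredients of such a penalty term and makes no claim about FairDisCo's exact estimator or training loss.

    import numpy as np

    def pairwise_dist(x):
        """Euclidean distance matrix between the rows of x."""
        return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

    def double_center(d):
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

    def distance_covariance_sq(z, s_onehot):
        """Squared sample distance covariance between representations z (n, d)
        and one-hot sensitive attributes s_onehot (n, k)."""
        a = double_center(pairwise_dist(z))
        b = double_center(pairwise_dist(s_onehot))
        return (a * b).mean()

    n = 200
    z = np.random.randn(n, 5)                          # learned representations
    s = np.eye(3)[np.random.randint(0, 3, size=n)]     # random categorical attribute, one-hot
    print(distance_covariance_sq(z, s))                # close to 0 when z is independent of s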
Knowledge graph reasoning plays a pivotal role in many real-world applications, such as network alignment, computational fact-checking, recommendation, and many more. Among these applications, knowledge graph completion (KGC) and multi-hop question answering over knowledge graphs (multi-hop KGQA) are two representative reasoning tasks. In the vast majority of existing works, the two tasks are considered separately with different models or algorithms. However, we envision that KGC and multi-hop KGQA are closely related to each other, and therefore the two tasks will benefit from each other if they are approached adequately. In this work, we propose a neural model named BiNet to jointly handle KGC and multi-hop KGQA, and formulate it as a multi-task learning problem. Specifically, our proposed model leverages a shared embedding space and an answer scoring module, which allows the two tasks to automatically share latent features and learn the interactions between the natural language question decoder and the answer scoring module. Compared to existing methods, the proposed BiNet model addresses both multi-hop KGQA and KGC tasks simultaneously with superior performance. Experimental results show that BiNet outperforms state-of-the-art methods on a wide range of KGQA and KGC benchmark datasets.
Heterogeneous graph streams are very common in applications today. Although representation learning has advantages in prediction accuracy, it is inherently deficient in its ability to interpret or to reason well. It has long been realized, as far back as 1990 by Marvin Minsky, that connectionist networks and symbolic rules should co-exist in a system and overcome each other's deficiencies. The goal of this paper is to show that it is feasible to simultaneously and efficiently perform representation learning (for connectionist networks) and rule learning spontaneously out of the same online training process for graph streams. We devise such a system called RL$^2$, and show, both analytically and empirically, that it is highly efficient and responsive for graph streams, and produces good results for both representation learning and rule learning in terms of prediction accuracy and returning top-quality rules for interpretation and building dynamic Bayesian networks.
Knowledge graph (KG) embeddings have been a mainstream approach for reasoning over incomplete KGs. However, limited by their inherently shallow and static architectures, they can hardly deal with the rising focus on complex logical queries, which comprise logical operators, imputed edges, multiple source entities, and unknown intermediate entities. In this work, we present the Knowledge Graph Transformer (kgTransformer) with masked pre-training and fine-tuning strategies. We design a KG triple transformation method to enable the Transformer to handle KGs, which is further strengthened by Mixture-of-Experts (MoE) sparse activation. We then formulate complex logical queries as masked prediction and introduce a two-stage masked pre-training strategy to improve transferability and generalizability. Extensive experiments on two benchmarks demonstrate that kgTransformer can consistently outperform both KG embedding-based baselines and advanced encoders on nine in-domain and out-of-domain reasoning tasks. Additionally, kgTransformer can reason with explainability by providing the full reasoning paths to interpret given answers.
Recent studies on Graph Neural Networks (GNNs) point out that most GNNs depend on the homophily assumption but fail to generalize to graphs with heterophily where dissimilar nodes connect. The concept of homophily or heterophily defined previously is a global measurement of the whole graph and cannot describe the local connectivity of a node. From the node-level perspective, we find that real-world graph structures exhibit a mixture of homophily and heterophily, which refers to the co-existence of both homophilous and heterophilous nodes. Under such a mixture, we reveal that GNNs are severely biased towards homophilous nodes, suffering a sharp performance drop on heterophilous nodes. To mitigate the bias issue, we explore an Uncertainty-aware Debiasing (UD) framework, which retains the knowledge of the biased model on certain nodes and compensates for the nodes with high uncertainty. In particular, UD estimates the uncertainty of the GNN output to recognize heterophilous nodes. UD then trains a debiased GNN by pruning the biased parameters with certain nodes and retraining the pruned parameters on nodes with high uncertainty. We apply UD on both homophilous GNNs (GCN and GAT) and heterophilous GNNs (Mixhop and GPR-GNN) and conduct extensive experiments on synthetic and benchmark datasets, where the debiased model consistently performs better and narrows the performance gap between homophilous and heterophilous nodes.
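One common way to obtain the per-node uncertainty that such a debiasing framework relies on is the predictive entropy of the trained model's softmax output. The snippet below sketches this signal; it is an illustrative choice, not necessarily the exact estimator used by UD.

    import torch

    def predictive_entropy(logits):
        """Entropy of the softmax distribution per node; higher means more uncertain."""
        p = torch.softmax(logits, dim=-1)
        return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

    logits = torch.randn(1000, 7)                                     # output of a trained GNN on 1000 nodes
    uncertainty = predictive_entropy(logits)
    high_uncertainty_nodes = torch.topk(uncertainty, k=100).indices   # candidates for retraining focus
    print(high_uncertainty_nodes[:10])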
For building recommender systems, a critical task is to learn a policy with collected feedback (e.g., ratings, clicks) to decide which items to be recommended to users. However, it has been shown that the selection bias in the collected feedback leads to biased learning and thus a sub-optimal policy. To deal with this issue, counterfactual learning has received much attention, where existing approaches can be categorized as either value learning or policy learning approaches. This work studies policy learning approaches for top-K recommendations with a large item space and points out several difficulties related to importance weight explosion, observation insufficiency, and training efficiency. A practical framework for policy learning is then proposed to overcome these difficulties. Our experiments confirm the effectiveness and efficiency of the proposed framework.
With the tremendous expansion of graph data, node classification shows its great importance in many real-world applications. Existing graph neural network based methods mainly focus on classifying unlabeled nodes within a fixed set of classes with abundant labeling. However, in many practical scenarios, the graph evolves with the emergence of new nodes and edges. Novel classes appear incrementally, along with few labels due to their recent emergence or lack of exploration. In this paper, we focus on this challenging but practical graph few-shot class-incremental learning (GFSCIL) problem and propose a novel method called Geometer. Instead of replacing and retraining the fully connected neural network classifier, Geometer predicts the label of a node by finding the nearest class prototype. A prototype is a vector representing a class in the metric space. With the pop-up of novel classes, Geometer learns and adjusts the attention-based prototypes by observing geometric proximity, uniformity and separability. Teacher-student knowledge distillation and biased sampling are further introduced to mitigate the catastrophic forgetting and unbalanced labeling problems, respectively. Experimental results on four public datasets demonstrate that Geometer achieves a substantial improvement of 9.46% to 27.60% over state-of-the-art methods.
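The nearest-prototype idea at the core of such metric-based classifiers can be sketched in a few lines. The code below is illustrative only and omits Geometer's attention, distillation, and incremental-learning machinery: it builds class prototypes as mean embeddings of the labeled nodes and assigns each node to the closest prototype.

    import torch

    def class_prototypes(embeddings, labels, num_classes):
        """Prototype of class c = mean embedding of the labeled nodes of class c."""
        return torch.stack([embeddings[labels == c].mean(dim=0) for c in range(num_classes)])

    def nearest_prototype_predict(embeddings, protos):
        dists = torch.cdist(embeddings, protos)     # (num_nodes, num_classes)
        return dists.argmin(dim=1)

    emb = torch.randn(100, 32)                      # node embeddings from some GNN encoder
    labels = torch.randint(0, 5, (100,))            # labels of the (few) labeled nodes
    protos = class_prototypes(emb, labels, num_classes=5)
    pred = nearest_prototype_predict(emb, protos)
    print(pred.shape)                               # torch.Size([100])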
Spatio-temporal graph learning is a key method for urban computing tasks, such as traffic flow, taxi demand and air quality forecasting. Due to the high cost of data collection, some developing cities have little available data, which makes it infeasible to train a well-performing model. To address this challenge, cross-city knowledge transfer has shown its promise, where a model learned from data-sufficient cities is leveraged to benefit the learning process of data-scarce cities. However, the spatio-temporal graphs of different cities show irregular structures and varied features, which limits the feasibility of existing Few-Shot Learning (FSL) methods. Therefore, we propose ST-GFSL, a model-agnostic few-shot learning framework for spatio-temporal graphs. Specifically, to enhance feature extraction by transferring cross-city knowledge, ST-GFSL generates non-shared parameters based on node-level meta knowledge. The nodes in the target city transfer knowledge via parameter matching, retrieving parameters from nodes with similar spatio-temporal characteristics. Furthermore, we propose to reconstruct the graph structure during meta-learning. The graph reconstruction loss is defined to guide structure-aware learning, avoiding structure deviation among different datasets. We conduct comprehensive experiments on four traffic speed prediction benchmarks, and the results demonstrate the effectiveness of ST-GFSL compared with state-of-the-art methods.
Time series anomaly detection remains one of the most active areas of research in data mining. In spite of the dozens of creative solutions proposed for this problem, recent empirical evidence suggests that time series discords, a relatively simple twenty-year-old distance-based technique, remain among the state-of-the-art techniques. While there are many algorithms for computing time series discords, they all have limitations. First, they are limited to the batch case, whereas the online case is more actionable. Second, these algorithms exhibit poor scalability beyond tens of thousands of datapoints. In this work we introduce DAMP, a novel algorithm that addresses both of these issues. DAMP computes exact left-discords on fast-arriving streams, at up to 300,000 Hz using a commodity desktop. This allows us to find time series discords in datasets with trillions of datapoints for the first time. We will demonstrate the utility of our algorithm with the most ambitious set of time series anomaly detection experiments ever conducted.
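To fix ideas on what a left-discord is, the brute-force sketch below scores each subsequence by its z-normalized Euclidean distance to its nearest earlier, non-overlapping subsequence; the highest score marks the discord. DAMP's pruning and early-abandoning tricks that make this feasible at streaming rates are deliberately omitted, so this is only a definitional illustration.

    import numpy as np

    def znorm(x):
        s = x.std()
        return (x - x.mean()) / s if s > 0 else x - x.mean()

    def left_discord_scores(ts, m):
        """For each subsequence start i >= m, the distance to its nearest left (earlier) neighbor."""
        ts = np.asarray(ts, dtype=float)
        subs = np.array([znorm(ts[i:i + m]) for i in range(len(ts) - m + 1)])
        scores = np.full(len(subs), np.nan)
        for i in range(m, len(subs)):                 # only compare against the past
            past = subs[:i - m + 1]                   # exclude trivially overlapping neighbors
            scores[i] = np.min(np.linalg.norm(past - subs[i], axis=1))
        return scores

    rng = np.random.default_rng(0)
    ts = np.sin(np.linspace(0, 30, 1000)) + 0.05 * rng.standard_normal(1000)
    ts[600:610] += 3.0                                # inject an anomaly
    scores = left_discord_scores(ts, m=50)
    print(np.nanargmax(scores))                       # start of the most anomalous subsequence, near the bump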
Collaborative multi-agent reinforcement learning (MARL) has been widely used in many practical applications, where each agent makes a decision based on its own observation. Most mainstream methods treat each local observation as an entirety when modeling the decentralized local utility functions. However, they ignore the fact that local observation information can be further divided into several entities, and that only part of these entities is helpful for model inference. Moreover, the importance of different entities may change over time. To improve the performance of decentralized policies, the attention mechanism is used to capture features of local information. Nevertheless, existing attention models rely on dense fully connected graphs and struggle to focus on the important states. To this end, we propose a sparse state based MARL (S2RL) framework, which utilizes a sparse attention mechanism to discard irrelevant information in local observations. The local utility functions are estimated through the self-attention and sparse attention mechanisms separately, and are then combined into a standard joint value function and an auxiliary joint value function in the central critic. We design the S2RL framework as a plug-and-play module, making it general enough to be applied to various methods. Extensive experiments on StarCraft II show that S2RL can significantly improve the performance of many state-of-the-art methods.
Modeling sequential patterns from data is at the core of various time series forecasting tasks. Deep learning models have greatly outperformed many traditional models, but these black-box models generally lack explainability in prediction and decision making. To reveal the underlying trend with understandable mathematical expressions, scientists and economists tend to use partial differential equations (PDEs) to explain the highly nonlinear dynamics of sequential patterns. However, this usually requires domain expert knowledge and a series of simplified assumptions, which is not always practical and can deviate from the ever-changing world. Is it possible to learn the differential relations from data dynamically to explain the time-evolving dynamics? In this work, we propose a learning framework that can automatically obtain interpretable PDE models from sequential data. In particular, this framework is comprised of learnable differential blocks, named P-blocks, which are proven to be able to approximate any time-evolving complex continuous function in theory. Moreover, to capture the dynamics shift, this framework introduces a meta-learning controller to dynamically optimize the hyper-parameters of a hybrid PDE model. Extensive experiments on time series forecasting of financial, engineering, and health data show that our model can provide valuable interpretability and achieve performance comparable to state-of-the-art models. From empirical studies, we find that learning a few differential operators may capture the major trend of sequential dynamics without massive computational complexity.
Hypergraphs provide an effective abstraction for modeling multi-way group interactions among nodes, where each hyperedge can connect any number of nodes. Different from most existing studies which leverage statistical dependencies, we study hypergraphs from the perspective of causality. Specifically, in this paper, we focus on the problem of individual treatment effect (ITE) estimation on hypergraphs, aiming to estimate how much an intervention (e.g., wearing face covering) would causally affect an outcome (e.g., COVID-19 infection) of each individual node. Existing works on ITE estimation either assume that the outcome on one individual should not be influenced by the treatment assignments on other individuals (i.e., no interference), or assume the interference only exists between pairs of connected individuals in an ordinary graph. We argue that these assumptions can be unrealistic on real-world hypergraphs, where higher-order interference can affect the ultimate ITE estimations due to the presence of group interactions. In this work, we investigate high-order interference modeling, and propose a new causality learning framework powered by hypergraph neural networks. Extensive experiments on real-world hypergraphs verify the superiority of our framework over existing baselines.
Causal skeleton learning aims to identify the undirected graph of the underlying causal Bayesian network (BN) from observational data. It plays a pivotal role in causal discovery and many other downstream applications. The methods for causal skeleton learning fall into three primary categories: constraint-based, score-based, and gradient-based methods. This paper, for the first time, advocates for learning a causal skeleton in a supervision-based setting, where the algorithm learns from additional datasets associated with ground-truth BNs (complementary to the input observational data). Concretizing a supervision-based method is non-trivial due to the high complexity of the problem itself and the potential "domain shift" between training data (i.e., additional datasets associated with ground-truth BNs) and test data (i.e., observational data) in the supervision-based setting. First, it is well known that skeleton learning suffers worst-case exponential complexity. Second, conventional supervised learning assumes an independent and identical distribution (i.i.d.) on test data, which is not easily attainable due to the divergent underlying causal mechanisms between training and test data. Our proposed framework, ML4S, adopts order-based cascade classifiers and pruning strategies that can withstand high computational overhead without sacrificing accuracy. To address the "domain shift" challenge, we generate training data from vicinal graphs w.r.t. the target BN. The associated datasets of vicinal graphs share similar joint distributions with the observational data. We evaluate ML4S on a variety of datasets and observe that it remarkably outperforms the state of the art, demonstrating the great potential of the supervision-based skeleton learning paradigm.
Modeling sequential data is essential to many applications such as natural language processing, recommendation systems, time series prediction, and anomaly detection. When processing sequential data, one of the critical issues is how to capture the temporal correlation among events. Though prevalent and effective in many applications, conventional approaches such as RNNs and Transformers struggle to handle non-stationary characteristics (i.e., temporal correlations among events that change over time), which are indeed encountered in many real-world scenarios. In this paper, we present a non-stationary time-aware kernelized attention approach for input sequences of neural networks. By constructing the Generalized Spectral Mixture Kernel (GSMK) and integrating it into the attention mechanism, we mathematically reveal its representation capability in terms of time-dependent temporal correlation. Following that, a novel neural network structure is proposed, which enables us to encode both stationary and non-stationary time event series. Finally, we demonstrate the performance of the proposed method both on synthetic data, which illustrates the theoretical insights, and on a variety of real-world datasets, which shows its competitive performance against related work.
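As a reference point, the sketch below evaluates the standard stationary spectral mixture kernel of Wilson and Adams, k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q), over a set of irregular event times. The paper's Generalized Spectral Mixture Kernel extends this with input-dependent parameters to model non-stationarity, which is not reproduced here; folding the resulting kernel matrix into attention logits is likewise only one illustrative option.

    import numpy as np

    def spectral_mixture_kernel(t1, t2, weights, means, variances):
        """Stationary spectral mixture kernel over 1-D timestamps with Q mixture components."""
        tau = t1[:, None] - t2[None, :]                  # pairwise time differences
        k = np.zeros_like(tau, dtype=float)
        for w, mu, v in zip(weights, means, variances):
            k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
        return k

    t = np.sort(np.random.rand(16) * 10.0)               # irregular event times
    K = spectral_mixture_kernel(t, t, weights=[0.6, 0.4], means=[0.1, 0.5], variances=[0.05, 0.2])
    print(K.shape)                                        # (16, 16) kernel matrix, usable as an attention bias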
Bundle recommendation aims to recommend a bundle of related items to users, which can satisfy the users' various needs with one-stop convenience. Recent methods usually take advantage of both user-bundle and user-item interaction information to obtain informative representations for users and bundles, corresponding to the bundle view and the item view, respectively. However, they either use a unified view without differentiation or loosely combine the predictions of two separate views, while the crucial cooperative association between the two views' representations is overlooked.
In this work, we propose to model the cooperative association between the two different views through cross-view contrastive learning. By encouraging the alignment of the two separately learned views, each view can distill complementary information from the other view, achieving mutual enhancement. Moreover, by enlarging the dispersion of different users/bundles, the self-discrimination of representations is enhanced. Extensive experiments on three public datasets demonstrate that our method outperforms SOTA baselines by a large margin. Meanwhile, our method requires only a minimal set of parameters, namely three sets of embeddings (user, bundle, and item), and the computational costs are largely reduced due to the more concise graph structure and graph learning module. In addition, various ablation and model studies demystify the working mechanism and justify our hypothesis. Codes and datasets are available at https://github.com/mysbupt/CrossCBR.
While invariance of causal mechanisms has inspired recent work in both robust machine learning and causal inference, causal mechanisms may also vary over domains due to, for example, population-specific differences, the context of data collection, or intervention. To discover invariant and changing mechanisms from data, we propose extending the algorithmic model for causation to mechanism changes and instantiating it using Minimum Description Length. In essence, for a continuous variable Y in multiple contexts C, we identify variables X as causal if the regression functions g : X → Y have succinct descriptions in all contexts. In empirical evaluations we show that our method, VARIO, finds invariant variable sets, reveals mechanism changes, and discovers causal networks, such as on real-world data that gives insight into the signaling pathways in human immune cells.
AI systems that can capture human-like behavior are becoming increasingly useful in situations where humans may want to learn from these systems, collaborate with them, or engage with them as partners for an extended duration. In order to develop human-oriented AI systems, the problem of predicting human actions---as opposed to predicting optimal actions---has received considerable attention. Existing work has focused on capturing human behavior in an aggregate sense, which potentially limits the benefit any particular individual could gain from interaction with these systems. We extend this line of work by developing highly accurate predictive models of individual human behavior in chess. Chess is a rich domain for exploring human-AI interaction because it combines a unique set of properties: AI systems achieved superhuman performance many years ago, and yet humans still interact with them closely, both as opponents and as preparation tools, and there is an enormous corpus of recorded data on individual player games. Starting with Maia, an open-source version of AlphaZero trained on a population of human players, we demonstrate that we can significantly improve prediction accuracy of a particular player's moves by applying a series of fine-tuning methods. Furthermore, our personalized models can be used to perform stylometry---predicting who made a given set of moves---indicating that they capture human decision-making at an individual level. Our work demonstrates a way to bring AI systems into better alignment with the behavior of individual people, which could lead to large improvements in human-AI interaction.
A primary challenge in metagenomics is reconstructing individual microbial genomes from the mixture of short fragments created by sequencing. Recent work leverages the sparsity of the assembly graph to find r-dominating sets which enable rapid approximate queries through a dominator-centric graph partition. In this paper, we consider two problems related to reducing uncertainty and improving scalability in this setting.
First, we observe that nodes with multiple closest dominators necessitate arbitrary tie-breaking in the existing pipeline. As such, we propose finding sparse dominating sets which minimize this effect via a new congestion parameter (see the sketch after this abstract for a small illustration of the ambiguity involved). We prove that minimizing congestion is NP-hard, and give an O(√Δr) approximation algorithm, where Δ is the max degree.
To improve scalability, the graph should be partitioned into uniformly sized pieces, subject to placing each vertex with a closest dominator. This leads to balanced neighborhood partitioning: given an r-dominating set, find a partition into connected subgraphs with optimal uniformity such that each vertex is co-assigned with some closest dominator. Using the variance of piece sizes to measure uniformity, we show this problem is NP-hard if and only if r is greater than 1. We design and analyze several algorithms, including a polynomial-time approach which is exact when r=1 (and heuristic otherwise).
We complement our theoretical results with computational experiments on a corpus of real-world networks showing sparse dominating sets lead to more balanced neighborhood partitionings. Further, on the metagenome fHuSB1, our approach maintains high query containment and similarity while reducing piece size variance.
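To illustrate the tie-breaking issue behind the congestion parameter mentioned above, the sketch below (using networkx) finds, for each vertex, its set of closest dominators within radius r and counts how many vertices are ambiguous. The tie count is only an illustrative proxy for the effect being minimized, not the paper's exact congestion definition, and the approximation algorithm itself is not reproduced.

    import networkx as nx

    def closest_dominators(G, dominators, r):
        """Map each dominated vertex to its set of dominators at minimum distance (within radius r)."""
        dist = {d: nx.single_source_shortest_path_length(G, d, cutoff=r) for d in dominators}
        closest = {}
        for v in G.nodes:
            reachable = {d: dist[d][v] for d in dominators if v in dist[d]}
            if reachable:                                # v is dominated within radius r
                best = min(reachable.values())
                closest[v] = {d for d, dv in reachable.items() if dv == best}
        return closest

    def tie_count(G, dominators, r):
        """Number of vertices whose closest dominator is ambiguous (requires tie-breaking)."""
        return sum(1 for ds in closest_dominators(G, dominators, r).values() if len(ds) > 1)

    G = nx.erdos_renyi_graph(200, 0.03, seed=1)
    doms = list(nx.dominating_set(G))                    # a (1-)dominating set as a stand-in
    print(tie_count(G, doms, r=1))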
Conversational search and recommendation systems can ask clarifying questions through the conversation and collect valuable information from users. However, an important question remains: how can we extract relevant information from the user's utterances and use it in retrieval or recommendation in the next turn of the conversation? Utilizing relevant information from users' utterances leads the system to better results at the end of the conversation. In this paper, we propose a reinforcement learning based model, namely RelInCo, which takes the user's utterances and the context of the conversation and classifies each word in the user's utterances as belonging to the relevant or non-relevant class. RelInCo uses two Actors: 1) an Arrangement-Actor, which finds the most relevant order of words in the user's utterances, and 2) a Selector-Actor, which determines which words, in the order provided by the Arrangement-Actor, can bring the system closer to the target of the conversation. In this way, we can find relevant information in the user's utterance and use it in the conversation. The objective function in our model is designed in such a way that it can maximize any desired retrieval and recommendation metric (i.e., the ultimate goal of the conversation).
Extrapolation to predict unseen data outside the training distribution is a common challenge in real-world scientific applications of physics and chemistry. However, the extrapolation capabilities of neural networks have not been extensively studied in machine learning. Although it has recently been shown that neural networks behave like linear regression in extrapolation problems, a universally applicable method to support the extrapolation of neural networks in general regression settings has not been investigated. In this paper, we propose the automated nonlinearity encoder (ANE), a data-agnostic embedding method that improves the extrapolation capabilities of neural networks by conversely linearizing the original input-to-target relationships, without architectural modifications of the prediction models. ANE achieved state-of-the-art extrapolation accuracy in extensive scientific applications with various data formats. As a real-world application, we applied ANE to high-throughput screening to discover novel solar cell materials, and ANE significantly improved the screening accuracy.
Learning fair representations is crucial for achieving fairness or debiasing sensitive information. Most existing works rely on adversarial representation learning to inject some invariance into the representation. However, adversarial learning methods are known to suffer from relatively unstable training, and this might harm the balance between the fairness and the predictiveness of the representation. We propose a new approach, learning FAir Representations via a distributional CONtrastive Variational AutoEncoder (FarconVAE), which induces the latent space to be disentangled into sensitive and non-sensitive parts. We first construct pairs of observations with different sensitive attributes but with the same labels. Then, FarconVAE enforces the non-sensitive latents of each pair to be close, while sensitive latents are pushed far from each other and also far from the non-sensitive latents, by contrasting their distributions. We provide a new type of contrastive loss motivated by Gaussian and Student-t kernels for distributional contrastive learning, with theoretical analysis. Besides, we adopt a new swap-reconstruction loss to further boost the disentanglement. FarconVAE shows superior performance on fairness, pretrained model debiasing, and domain generalization tasks across various modalities, including tabular, image, and text data.
Opinion formation and propagation are crucial phenomena in social networks and have been extensively studied across several disciplines. Traditionally, theoretical models of opinion dynamics have been proposed to describe the interactions between individuals (i.e., social interaction) and their impact on the evolution of collective opinions. Although these models can incorporate sociological and psychological knowledge on the mechanisms of social interaction, they demand extensive calibration with real data to make reliable predictions, requiring much time and effort. Recently, the widespread use of social media platforms has provided new opportunities for learning deep learning models from large volumes of social media data. However, these methods ignore any scientific knowledge about the mechanisms of social interaction. In this work, we present the first hybrid method, called Sociologically-Informed Neural Network (SINN), which integrates theoretical models and social media data by transporting the concepts of physics-informed neural networks (PINNs) from natural science (i.e., physics) into social science (i.e., sociology and social psychology). In particular, we recast theoretical models as ordinary differential equations (ODEs). Then we train a neural network that simultaneously approximates the data and conforms to the ODEs that represent the social scientific knowledge. In addition, we extend PINNs by integrating matrix factorization and a language model to incorporate rich side information (e.g., user profiles) and structural knowledge (e.g., cluster structure of the social interaction network). Moreover, we develop an end-to-end training procedure for SINN, which involves Gumbel-Softmax approximation to include stochastic mechanisms of social interaction. Extensive experiments on real-world and synthetic datasets show SINN outperforms six baseline methods in predicting opinion dynamics.
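As a rough illustration of the physics-informed training idea that SINN builds on, the sketch below fits a network to opinion observations while penalizing the residual of a simple DeGroot-style linear ODE computed by autograd; the network architecture, the ODE, and the loss weighting are placeholders, not the paper's actual model.

```python
import torch
import torch.nn as nn

class OpinionNet(nn.Module):
    """Maps time t (shape [B, 1]) to the opinions of n_users individuals (shape [B, n_users])."""
    def __init__(self, n_users, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_users))

    def forward(self, t):
        return self.net(t)

def sinn_style_loss(model, t_data, x_data, t_colloc, A, lam=1.0):
    """Data-fit term plus the residual of an illustrative DeGroot-style ODE dx/dt = A x - x,
    evaluated at collocation points via autograd (A: [n_users, n_users] influence matrix)."""
    data_loss = ((model(t_data) - x_data) ** 2).mean()

    t = t_colloc.detach().clone().requires_grad_(True)    # [B, 1]
    x = model(t)                                          # [B, n_users]
    dxdt = torch.stack(
        [torch.autograd.grad(x[:, i].sum(), t, create_graph=True)[0].squeeze(-1)
         for i in range(x.shape[1])], dim=1)              # [B, n_users]
    residual = dxdt - (x @ A.T - x)                       # ODE residual
    return data_loss + lam * (residual ** 2).mean()
```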
Node embedding aims to map nodes in a complex graph into low-dimensional representations. Real-world large-scale graphs and the difficulty of labeling motivate wide study of unsupervised node embedding problems. Nevertheless, previous efforts mostly operate in a centralized setting where a complete graph is given. With the growing awareness of data privacy, data holders, each of whom can be represented by one vertex in the graph and only sees its own neighbors, demand greater privacy protection. In this paper, we introduce FedWalk, a random-walk-based unsupervised node embedding algorithm that operates in such a node-level visibility graph with raw graph information remaining local. FedWalk is designed to offer graph representation capability competitive with centralized approaches, along with data privacy protection and great communication efficiency. FedWalk instantiates the prevalent federated paradigm and contains three modules. We first design a hierarchical clustering tree (HCT) constructor to extract the structural feature of each node. A dynamic time warping algorithm seamlessly handles the structural heterogeneity across different nodes. Based on the constructed HCT, we then design a random walk generator, wherein a sequence encoder is designed to preserve privacy and a two-hop neighbor predictor is designed to save communication cost. The generated random walks are then used to update node embeddings based on a SkipGram model. Extensive experiments on two large graphs demonstrate that FedWalk achieves representativeness competitive with a centralized node embedding algorithm, with only up to a 1.8% Micro-F1 and 4.4% Macro-F1 score loss, while reducing inter-device communication per walk by about 6.7 times.
Protecting the intellectual property (IP) of deep neural networks (DNNs) has become an urgent concern for IT corporations. For model piracy forensics, previous model fingerprinting schemes commonly use adversarial examples constructed for the owner's model as the fingerprint, and verify whether a suspect model is pirated from the original model by matching the two models' behavioral patterns on the fingerprint examples. However, these methods heavily rely on the characteristics of classification tasks, which inhibits their application to more general scenarios. To address this issue, we present MetaV, the first task-agnostic model fingerprinting framework, which enables fingerprinting on a much wider range of DNNs independent of the downstream learning task and exhibits strong robustness against a variety of ownership obfuscation techniques. Specifically, we generalize previous schemes into two critical design components in MetaV: the adaptive fingerprint and the meta-verifier, which are jointly optimized such that the meta-verifier learns to determine whether a suspect model is stolen based on the concatenated outputs of the suspect model on the adaptive fingerprint. Key to being task-agnostic, the full process makes no assumption on the internals of the models in the ensemble, as long as they have the same input and output dimensions. Spanning classification, regression, and generative modeling, extensive experimental results validate the substantially improved performance of MetaV over state-of-the-art fingerprinting schemes and demonstrate the enhanced generality of MetaV for providing task-agnostic fingerprinting. For example, on fingerprinting a ResNet-18 trained for skin cancer diagnosis, MetaV simultaneously achieves 100% true positives and 100% true negatives on a diverse test set of 70 suspect models, a relative improvement of about 220% in ARUC over the optimal baseline.
We introduce a random hypergraph model for core-periphery structure. By leveraging our model's sufficient statistics, we develop a novel statistical inference algorithm that scales to large hypergraphs, with runtime that is practically linear in the number of nodes in the graph after a preprocessing step that is almost linear in the number of hyperedges, as well as a scalable sampling algorithm. Our inference algorithm is capable of learning embeddings that correspond to the reputation (rank) of a node within the hypergraph. We also give theoretical bounds on the size of the core of hypergraphs generated by our model. We experiment with hypergraph data with up to ∼10^5 hyperedges mined from the Microsoft Academic Graph, Stack Exchange, and GitHub, and show that our model outperforms baselines with respect to producing good fits.
The capability of neural networks for symbolic computation has emerged in much recent work. However, symbolic computation is typically treated as an end-to-end blackbox prediction task, where human-like symbolic deductive logic is missing. In this paper, we argue that any complex symbolic computation can be broken down into a finite sequence of Fundamental Computation Transformations (FCTs), which are grounded as certain mathematical expression computation transformations. The entire computation sequence represents a full, human-understandable symbolic deduction process. Instead of studying different end-to-end neural network applications, this paper focuses on approximating FCTs, which in turn build up symbolic deductive logic. To better mimic symbolic computations with math expression transformations, we propose a novel tree representation learning architecture, GATE (Graph Aggregation Transformer Encoder), for math expressions. We generate a large-scale math expression transformation dataset for training purposes and collect a real-world dataset for validation. Experiments demonstrate the feasibility of producing step-by-step human-like symbolic deduction sequences with the proposed approach, which outperforms other neural network approaches and heuristic approaches.
Using only a well-trained classifier, model-inversion (MI) attacks can recover the data used to train the classifier, leading to privacy leakage of the training data. To defend against MI attacks, previous work utilizes a unilateral dependency optimization strategy, i.e., minimizing the dependency between inputs (i.e., features) and outputs (i.e., labels) when training the classifier. However, such a minimization process conflicts with minimizing the supervised loss, which aims to maximize the dependency between inputs and outputs, causing an explicit trade-off between model robustness against MI attacks and model utility on classification tasks. In this paper, we aim to minimize the dependency between the latent representations and the inputs while maximizing the dependency between the latent representations and the outputs, a strategy we name bilateral dependency optimization (BiDO). In particular, we use the dependency constraints as a universally applicable regularizer in addition to commonly used losses for deep neural networks (e.g., cross-entropy), which can be instantiated with appropriate dependency criteria according to different tasks. To verify the efficacy of our strategy, we propose two implementations of BiDO, using two different dependency measures: BiDO with constrained covariance (BiDO-COCO) and BiDO with the Hilbert-Schmidt Independence Criterion (BiDO-HSIC). Experiments show that BiDO achieves state-of-the-art defense performance for a variety of datasets, classifiers, and MI attacks while suffering only a minor classification-accuracy drop compared to the well-trained classifier with no defense, which opens a new road to defending against MI attacks.
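For intuition on how a dependency measure can serve as a regularizer in the BiDO spirit, here is a minimal sketch of the standard biased empirical HSIC estimator with Gaussian kernels; the bandwidths, the weights, and the way it is combined with cross-entropy are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def rbf_gram(x, sigma=1.0):
    """RBF kernel Gram matrix for row vectors x: [n, d] -> [n, n]."""
    d2 = torch.cdist(x, x) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = x.shape[0]
    K, L = rbf_gram(x, sigma_x), rbf_gram(y, sigma_y)
    H = torch.eye(n, device=x.device) - torch.full((n, n), 1.0 / n, device=x.device)
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# A BiDO-style objective would add such terms to the usual supervised loss, e.g.
#   loss = cross_entropy(logits, labels)
#          + lam_x * hsic(latent, inputs.flatten(1))   # discourage latent-input dependency
#          - lam_y * hsic(latent, one_hot_labels)      # encourage latent-output dependency
```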
Estimating the accuracy of an automatically constructed knowledge graph (KG) is a challenging task, as the KG often contains a large number of entities and triples. Generally, two major components, information extraction (IE) and entity linking (EL), are involved in KG construction. However, existing approaches focus only on evaluating the triple accuracy that indicates the IE quality, completely ignoring the entity accuracy. Motivated by the fact that machines excel at large-scale computation while humans are skilled at correctness verification, we propose an efficient interactive method to reduce the overall cost of evaluating KG quality, which produces accuracy estimates with a statistical guarantee for both triples and entities. Instead of annotating triples and entities separately, we design a general annotation cost that blends triples and entities generated from the same source text. During human verification, the machine can pre-compute and infer the triples to be annotated in the next round by anticipating human feedback. The human-machine collaborative mechanism is optimized by formulating an order selection problem over triples, which is NP-hard; thus, a Monte Carlo Tree Search is proposed to guide the annotation process by finding an approximate solution. Extensive experiments demonstrate that our method incurs less annotation cost while yielding higher-quality accuracy estimates compared to state-of-the-art approaches.
Contextual bandits aim to identify, among a set of arms, the optimal one with the highest reward based on their contextual information. Motivated by the fact that the arms usually exhibit group behaviors and that mutual impacts exist among groups, we introduce a new model, the Arm Group Graph (AGG), where the nodes represent the groups of arms and the weighted edges formulate the correlations among groups. To leverage the rich information in AGG, we propose a bandit algorithm, AGG-UCB, where neural networks are designed to estimate rewards, and we propose to utilize graph neural networks (GNNs) to learn the representations of arm groups with correlations. To solve the exploitation-exploration dilemma in bandits, we derive a new upper confidence bound (UCB) built on neural networks (exploitation) for exploration. Furthermore, we prove that AGG-UCB can achieve a near-optimal regret bound with over-parameterized neural networks, and provide a convergence analysis of GNNs with fully-connected layers, which may be of independent interest. Finally, we conduct extensive experiments against state-of-the-art baselines on multiple public datasets, showing the effectiveness of the proposed algorithm.
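AGG-UCB itself couples GNN representations with a neural confidence bound; as background for the exploit-plus-bonus structure it builds on, the sketch below shows the classical LinUCB score with linear features (all hyperparameters illustrative, and this is not the paper's algorithm).

```python
import numpy as np

class LinUCB:
    """Classical linear UCB: exploit via a ridge-regression estimate, explore via a confidence
    bonus. AGG-UCB replaces the linear predictor with neural/GNN representations but keeps the
    same 'predicted reward + confidence width' structure."""
    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha
        self.A = lam * np.eye(dim)     # regularized design matrix
        self.b = np.zeros(dim)

    def select(self, contexts):        # contexts: [n_arms, dim]
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        width = np.sqrt(np.einsum('ij,jk,ik->i', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * width))

    def update(self, x, reward):       # x: context of the pulled arm
        self.A += np.outer(x, x)
        self.b += reward * x
```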
Driven by the exponential growth of software and the advent of the pull-based development system Git, a large amount of open-source software has emerged on various social coding platforms. GitHub, as the largest platform, not only attracts developers and researchers to contribute legitimate software and research-related source code, but has also become a popular platform for an increasing number of cybercriminals to perform continuous cyberattacks. Hence, some tools have recently been developed to learn representations of GitHub repositories for various related applications (e.g., malicious repository detection). However, most of them merely focus on code content while ignoring the rich relational data among repositories. In addition, they usually require substantial resources to obtain sufficient labeled data for model training, while ignoring the readily available unlabeled data. To this end, we propose a novel model, Rep2Vec, which integrates the code content, the structural relations, and the unlabeled data to learn repository representations. First, to comprehensively model the repository data, we build a repository heterogeneous graph (Rep-HG), which is encoded by a graph neural network. Afterwards, to fully exploit the unlabeled data in Rep-HG, we introduce adversarial attacks to generate more challenging contrastive pairs for the contrastive learning module, training the encoder in the node view and the meta-path view simultaneously. To alleviate the workload of the encoder against attacks, we further design a dual-stream contrastive learning module that integrates contrastive learning on the adversarial graph and the original graph. Finally, the pre-trained encoder is fine-tuned on the downstream task and further enhanced by a knowledge distillation module. Extensive experiments on the dataset collected from GitHub demonstrate the effectiveness of Rep2Vec in comparison with state-of-the-art methods on multiple repository tasks.
Tabular pre-training models have received increasing attention due to their wide-ranging applications in tabular data analysis. However, most existing solutions are built directly upon tabular data that mixes non-semantic and semantic content. According to statistics, only 30% of the tabular data in WikiTables consists of semantic entities, which are surrounded and isolated by large amounts of irregular content such as numbers, strings, and symbols. Despite this small portion, such semantic entities are crucial for table understanding. This paper attempts to enhance existing tabular pre-training models by injecting common-sense knowledge from external sources. Compared with knowledge injection in natural language pre-training models, the tabular setting requires overcoming the domain gap between external knowledge and tabular data, which differ significantly in both structure and content. To this end, we propose dual adapters inserted within the pre-trained tabular model for flexible and efficient knowledge injection. The two parallel adapters are trained on knowledge graph triplets and semantically augmented tables, respectively, for infusion and alignment with the tabular data. In addition, a path-wise attention layer is attached below to fuse the cross-domain representations with weighted contributions. Finally, to verify the effectiveness of our proposed knowledge injection framework, we extensively test it on 5 different application scenarios covering both zero-shot and finetuning-based tabular understanding tasks at the cell, column, and table levels.
Prior work on private data release has only studied counting queries or linear queries, where each tuple in the dataset contributes a value in [0,1] and a query returns the sum of the values. However, many data analytical tasks involve numerical values that are arbitrary real numbers. In this paper, we present a new mechanism to privatize a dataset D for a given set Q of numerical queries, achieving an error of Õ(√n · Δ_w(D)) for each query w ∈ Q, where Δ_w(D) is the maximum contribution of any tuple in D queried by w. This instance- and query-specific error bound not only is theoretically appealing, but also leads to excellent practical performance.
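The mechanism achieving the bound above is instance- and query-specific; for contrast, the following is the textbook Laplace baseline for a single clipped numerical sum query, included only to make the roles of per-tuple contribution and calibrated noise concrete (the clipping threshold and query form are assumptions, not the paper's construction).

```python
import numpy as np

def laplace_sum_query(values, weights, epsilon, clip):
    """Answer the weighted-sum query sum_i w_i * v_i under epsilon-DP: clip each tuple's
    contribution to [-clip, clip] (so the sensitivity is `clip`) and add Laplace noise
    with scale clip / epsilon. This is the textbook baseline, not the paper's mechanism."""
    contrib = np.clip(np.asarray(weights) * np.asarray(values), -clip, clip)
    return contrib.sum() + np.random.laplace(loc=0.0, scale=clip / epsilon)
```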
Policy distillation (PD) has been widely studied in deep reinforcement learning (RL), but existing PD approaches assume that the demonstration data (i.e., state-action pairs in frames) in a decision-making sequence is uniformly distributed. This may bring in unwanted bias, since RL is a reward-maximizing process instead of simple label matching. Given this issue, we define the importance of a frame as its contribution to the expected reward, and hypothesize that adapting to such frame importance could benefit the performance of the distilled student policy. To verify our hypothesis, we analyze why and how frame importance matters in RL settings. Based on the analysis, we propose an importance-prioritized PD framework that highlights the training on important frames, so as to learn efficiently. In particular, the frame importance is measured by the reciprocal of the weighted Shannon entropy of the teacher policy's action prescriptions. Experiments on Atari games and policy compression tasks show that capturing the frame importance significantly boosts the performance of the distilled policies.
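A minimal sketch of how entropy-based frame importance could re-weight a distillation loss is given below; the reciprocal-entropy weighting and the KL objective follow the description above in spirit, but the normalization and the "weighted" entropy are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def frame_importance(teacher_logits, eps=1e-6):
    """Importance of each frame as the reciprocal of the teacher's action entropy:
    frames with confident (low-entropy) prescriptions are treated as more important."""
    p = F.softmax(teacher_logits, dim=-1)
    entropy = -(p * torch.log(p + eps)).sum(dim=-1)       # [num_frames]
    return 1.0 / (entropy + eps)

def prioritized_distill_loss(student_logits, teacher_logits):
    """Per-frame KL distillation loss re-weighted by (normalized) frame importance."""
    w = frame_importance(teacher_logits)
    w = w / w.sum()
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction='none').sum(dim=-1)           # KL(teacher || student) per frame
    return (w * kl).sum()
```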
Adversarial examples in automatic speech recognition (ASR) sound natural to humans yet are capable of fooling well-trained ASR models into transcribing incorrectly. Existing audio adversarial examples are typically constructed by adding constrained perturbations to benign audio inputs. Such attacks are therefore generated under an audio-dependent assumption. For the first time, we propose the Speech Synthesising based Attack (SSA), a novel threat model that constructs audio adversarial examples entirely from scratch, i.e., without depending on any existing audio, to fool cutting-edge ASR models. To this end, we introduce a conditional variational auto-encoder (CVAE) as the speech synthesiser. Meanwhile, an adaptive sign gradient descent algorithm is proposed to solve the adversarial audio synthesis task. Experiments on three datasets (i.e., Audio Mnist, Common Voice, and Librispeech) show that our method can synthesise natural-sounding audio adversarial examples that mislead state-of-the-art ASR models. Our web page containing generated audio demos is at https://sites.google.com/view/ssa-asr/home.
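To illustrate the kind of optimization involved, the sketch below runs a plain (non-adaptive) sign-gradient attack on a synthesiser's latent code; `decoder` and `asr_loss_fn` are hypothetical stand-ins for the CVAE decoder and the victim ASR loss, and the fixed step size differs from the adaptive schedule proposed in the paper.

```python
import torch

def sign_gradient_synthesis(decoder, asr_loss_fn, z_init, target_text, steps=100, lr=1e-2):
    """Optimize a latent code z so that the synthesised audio decoder(z) is transcribed as
    target_text by the victim ASR model. `decoder` and `asr_loss_fn` (e.g., a CTC loss) are
    hypothetical stand-ins; the step size here is fixed rather than adaptive."""
    z = z_init.detach().clone().requires_grad_(True)
    for _ in range(steps):
        loss = asr_loss_fn(decoder(z), target_text)
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            z -= lr * grad.sign()        # sign-gradient step on the latent code
    return decoder(z).detach()           # the synthesised adversarial audio
```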
Data collected by IoT devices are often private and have a large diversity across users. Therefore, learning requires pre-training a model with available representative data samples, deploying the pre-trained model on IoT devices, and adapting the deployed model on the device with local data. Such on-device adaptation for deep learning empowered applications demands data and memory efficiency. However, existing gradient-based meta learning schemes fail to support memory-efficient adaptation. To this end, we propose p-Meta, a new meta learning method that enforces structure-wise partial parameter updates while ensuring fast generalization to unseen tasks. Evaluations on few-shot image classification and reinforcement learning tasks show that p-Meta not only improves the accuracy but also substantially reduces the peak dynamic memory, by a factor of 2.5 on average, compared to state-of-the-art few-shot adaptation methods.
Survival analysis aims to predict the risk of an event, such as death due to cancer, in the presence of censoring. Recent research has shown that existing survival techniques are prone to unintentional biases towards protected attributes such as age, race, and/or gender. For example, the assumption that censoring is unrelated to the prognosis and covariates (typically violated in real data) often leads to overestimated and biased survival predictions for different protected groups. In order to attenuate harmful bias and ensure fair survival predictions, we introduce fairness definitions based on survival functions and censoring. We propose novel fair and interpretable survival models which use pseudo value-based objective functions with fairness definitions as constraints for predicting subject-specific survival probabilities. Experiments on three real-world survival datasets demonstrate that our proposed fair survival models show significant improvement over existing survival techniques in terms of accuracy and fairness measures. We show that our proposed models provide fair predictions for protected attributes under different types and amounts of censoring. Furthermore, we study the interplay between interpretability and fairness, and investigate how fairness and censoring impact survival predictions for different protected attributes.
Next Point-of-Interest (POI) recommendation plays an important role in location-based applications; it aims to recommend to users the next POIs they are most likely to visit based on their historical trajectories. Existing methods usually use rich side information, or customized POI graphs, to capture the sequential patterns among POIs. However, these graphs only focus on connectivity between POIs. Few studies propose to explicitly learn a weighted POI graph, which could reflect the transition patterns among POIs and show the importance of each POI's different neighbors. In addition, these approaches simply utilize user characteristics for personalized POI recommendation without sufficient consideration. To this end, we construct a novel User-POI knowledge graph with strong representation ability, called the Spatial-Temporal Knowledge Graph (STKG). STKG is used to learn the representations of each node (i.e., user, POI) and each edge. Then, we design a similarity function to construct our POI transition graph based on the learned representations. To incorporate the learned graph into the sequential model, we propose a novel network, Graph-Flashback, for recommendation. Graph-Flashback applies a simplified Graph Convolution Network (GCN) on the POI transition graph to enrich the representation of each POI. Further, we define a similarity function to consider both spatiotemporal information and user preference in modelling sequential regularity. Experimental results on two real-world datasets show that our proposed method achieves state-of-the-art performance and significantly outperforms all existing solutions.
Knowledge graphs (KGs) capture knowledge in the form of head--relation--tail triples and are a crucial component in many AI systems. There are two important reasoning tasks on KGs: (1) single-hop knowledge graph completion, which involves predicting individual links in the KG; and (2) multi-hop reasoning, where the goal is to predict which KG entities satisfy a given logical query. Embedding-based methods solve both tasks by first computing an embedding for each entity and relation, then using them to form predictions. However, existing scalable KG embedding frameworks only support single-hop knowledge graph completion and cannot be applied to the more challenging multi-hop reasoning task. Here we present Scalable Multi-hOp REasoning (SMORE), the first general framework for both single-hop and multi-hop reasoning in KGs. Using a single machine, SMORE can perform multi-hop reasoning on the Freebase KG (86M entities, 338M edges), which is 1,500x larger than previously considered KGs. The key to SMORE's runtime performance is a novel bidirectional rejection sampling that achieves a square-root reduction in the complexity of online training data generation. Furthermore, SMORE exploits asynchronous scheduling, overlapping CPU-based data sampling, GPU-based embedding computation, and frequent CPU--GPU IO. SMORE increases throughput (i.e., training speed) over prior multi-hop KG frameworks by 2.2x with minimal GPU memory requirements (2GB for training 400-dim embeddings on the 86M-node Freebase) and achieves near-linear speed-up with the number of GPUs. Moreover, on the simpler single-hop knowledge graph completion task, SMORE achieves runtime performance comparable to or even better than state-of-the-art frameworks in both single-GPU and multi-GPU settings.
The adversarial attack reveals the vulnerability of deep models by incurring test-domain shift, while the delusive attack relieves privacy concerns about personal data by injecting malicious noise into the training domain to make the data unexploitable. However, beyond their successful applications, both attacks can be easily defended against by adversarial training (AT). Yet AT is not a panacea: it suffers from poor robust generalization. Regarding the limitations of attack and defense, we argue that, to fit data well, DNNs learn spurious relations between inputs and outputs, which are consequently exploited by attack and defense and degrade their effectiveness; moreover, DNNs cannot easily capture causal relations, as humans do, to make robust decisions under attack. In this paper, to better understand and improve attack and defense, we first take a bottom-up perspective to describe the correlations between latent factors and observed data, then analyze the effect of attack-induced domain shift on DNNs, and finally develop our causal graph, namely the Domain-attack Invariant Causal Model (DICM). Based on DICM, we propose a coherent causal invariant principle, which guides our algorithm design to infer human-like causal relations. We call our algorithm Domain-attack Invariant Causal Learning (DICE), and the experimental results on two attacks and one defense task verify its effectiveness.
This paper introduces a novel approach embedding flow-based models in hierarchical structures. The proposed model learns the representation of high-dimensional data via a message-passing scheme by integrating flow-based functions through variational inference. Meanwhile, our model produces a representation of the data using a lower dimension, thus overcoming the drawbacks of many flow-based models, which usually require a high-dimensional latent space involving many trivial variables. With the proposed aggregation nodes, our model provides a new approach for distribution modeling and numerical inference on datasets. Multiple experiments on synthetic and real-world datasets show the benefits of our proposed method and potentially broad applications.
In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data are not always i.i.d. and drift gradually over time; and 4) the storage of historical streams is limited. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such a setting as the semi-supervised drifted stream learning with short lookback (SDSL) problem. SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning and continuous learning: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model that leverages supervised knowledge from previously labeled data, unsupervised knowledge from new data, and structural knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat-region search problem. We propose a novel minimax game-based replay objective function to solve the flat-region search problem and develop an effective optimization solver. Experimental results demonstrate the effectiveness of the proposed method.
Rankings have become the primary interface in two-sided online markets. Many have noted that the rankings not only affect the satisfaction of the users (e.g., customers, listeners, employers, travelers), but that the position in the ranking allocates exposure -- and thus economic opportunity -- to the ranked items (e.g., articles, products, songs, job seekers, restaurants, hotels). This has raised questions of fairness to the items, and most existing works have addressed fairness by explicitly linking item exposure to item relevance. However, we argue that any particular choice of such a link function may be difficult to defend, and we show that the resulting rankings can still be unfair. To avoid these shortcomings, we develop a new axiomatic approach that is rooted in principles of fair division. This not only avoids the need to choose a link function, but also more meaningfully quantifies the impact on the items beyond exposure. Our axioms of envy-freeness and dominance over uniform ranking postulate that for a fair ranking policy every item should prefer its own rank allocation over that of any other item, and that no item should be actively disadvantaged by the rankings. To compute ranking policies that are fair according to these axioms, we propose a new ranking objective related to the Nash Social Welfare. We show that the solution has guarantees regarding its envy-freeness, its dominance over uniform rankings for every item, and its Pareto optimality. In contrast, we show that conventional exposure-based fairness can produce large amounts of envy and have a highly disparate impact on the items. Beyond these theoretical results, we illustrate empirically how our framework controls the trade-off between impact-based individual item fairness and user utility.
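As a rough illustration of a Nash-Social-Welfare-style objective over a stochastic ranking policy, the sketch below treats the policy as an item-by-position marginal matrix and defines an item's impact as its relevance-weighted expected exposure; the exact definition of impact and the optimization procedure in the paper may differ.

```python
import torch

def nsw_ranking_objective(P, pos_exposure, relevance, eps=1e-8):
    """P: [n_items, n_positions] marginals of a stochastic ranking policy (doubly stochastic);
    pos_exposure: exposure weight of each position; relevance: per-item relevance/utility.
    Each item's impact is taken to be its relevance-weighted expected exposure; the NSW-style
    objective maximizes the sum of log impacts, trading items off multiplicatively."""
    impact = relevance * (P @ pos_exposure)   # [n_items]
    return torch.log(impact + eps).sum()      # maximize, e.g., by projected gradient ascent
```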
Retraining a classifier with new data is an inseparable part of ML/AI applications, but most existing ML methods do not take the backward compatibility of predictions into account. That is, although the overall performance of a new classifier is improved, users will be confused by the wrong predictions of the new classifier, especially when the predictions of the old classifier are correct for the same samples. To this end, several metrics and learning methods for backward compatibility have been actively studied recently. Despite the significant interest in backward compatibility, these metrics and methods are not well understood from a theoretical perspective. In this paper, we first analyze the existing backward compatibility metrics and reveal that these metrics essentially assess the same quantity between old and new models. In addition, to obtain a unified view of backward compatibility metrics, we propose a generalized backward compatibility (GBC) metric that can represent the existing backward compatibility metrics. We formulate a learning objective based on the GBC metric and derive an estimation error bound; the result is then applied to one of the existing methods. Through further analysis, we reveal that the existing backward compatibility metrics are not suitable for imbalanced classification. We then design a backward compatibility metric for imbalanced classification on the basis of the GBC metric and empirically demonstrate the practicality of the proposed metric.
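The GBC metric itself is not spelled out in the abstract; for context, the sketch below computes two commonly used backward compatibility scores (backward trust compatibility and backward error compatibility) of the kind such a generalized metric is meant to subsume.

```python
import numpy as np

def backward_compatibility_scores(y_true, old_pred, new_pred):
    """BTC: among samples the old model classified correctly, the fraction the new model also
    classifies correctly. BEC: among samples the old model got wrong, the fraction the new
    model also gets wrong. Inputs are 1-D integer label arrays of equal length."""
    old_ok = old_pred == y_true
    new_ok = new_pred == y_true
    btc = (old_ok & new_ok).sum() / max(old_ok.sum(), 1)
    bec = (~old_ok & ~new_ok).sum() / max((~old_ok).sum(), 1)
    return float(btc), float(bec)
```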
As a widely used weakly supervised learning scheme, modern multiple instance learning (MIL) models achieve competitive performance at the bag level. However, instance-level prediction, which is essential for many important applications, remains largely unsatisfactory. We propose to conduct novel active deep multiple instance learning that samples a small subset of informative instances for annotation, aiming to significantly boost the instance-level prediction. A variance regularized loss function is designed to properly balance the bias and variance of instance-level predictions, aiming to effectively accommodate the highly imbalanced instance distribution in MIL and other fundamental challenges. Instead of directly minimizing the variance regularized loss that is non-convex, we optimize a distributionally robust bag level likelihood as its convex surrogate. The robust bag likelihood provides a good approximation of the variance based MIL loss with a strong theoretical guarantee. It also automatically balances bias and variance, making it effective to identify the potentially positive instances to support active sampling. The robust bag likelihood can be naturally integrated with a deep architecture to support deep model training using mini-batches of positive-negative bag pairs. Finally, a novel P-F sampling function is developed that combines a probability vector and predicted instance scores, obtained by optimizing the robust bag likelihood. By leveraging the key MIL assumption, the sampling function can explore the most challenging bags and effectively detect their positive instances for annotation, which significantly improves the instance-level prediction. Experiments conducted over multiple real-world datasets clearly demonstrate the state-of-the-art instance-level prediction achieved by the proposed model.
The propensity model introduced by Jain et al. has become a standard approach for dealing with missing and long-tail labels in extreme multi-label classification (XMLC). In this paper, we critically revise this approach, showing that despite its theoretical soundness, its application in contemporary XMLC works is debatable. We exhaustively discuss the flaws of the propensity-based approach and present several recipes, some of them related to solutions used in search engines and recommender systems, that we believe constitute promising alternatives to be followed in XMLC.
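For readers unfamiliar with the propensity model's main use, the sketch below computes propensity-scored precision@k for a single instance; in practice this score is usually normalized by its maximum attainable value, which is omitted here for brevity.

```python
import numpy as np

def psp_at_k(y_true, scores, propensities, k=5):
    """Propensity-scored precision@k for a single instance: relevant labels in the top-k are
    each up-weighted by 1/p_l, so that rarely observed (long-tail) labels count more.
    y_true: binary label vector; scores: predicted scores; propensities: p_l per label."""
    topk = np.argsort(-scores)[:k]
    return float(np.sum(y_true[topk] / propensities[topk]) / k)
```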
Successful machine learning typically relies on a fixed data distribution. However, due to unforeseen situations in the open world, distribution shift often occurs in applications. For instance, in an image recognition task, an unpredictable distribution shift may occur due to changes in background or lighting. Furthermore, the resource budget for alleviating the harm of distribution shift is not infinite and is often constrained. To cope with this novel problem, Resource Constrained Adaptation under Unknown Shift, in this paper we study active model adaptation both theoretically and empirically. First, we present a generalization analysis of active model adaptation for distribution shift. In theory, we show that active model adaptation could improve the generalization error from O(1/√N) to O(1/N), with only a few queried samples. Second, based on the theoretical analysis, we present a systematic solution, Auto, consisting of three sub-steps: distribution tracking, sample selection, and model adaptation. Specifically, we design a shifted-distribution detection module to locate the distribution-shifted samples. To fit the labeling budget, we employ a core-set algorithm to enhance the informativeness of the selected samples. Finally, we update the model with the newly queried labeled data. We conduct empirical studies of nine existing active strategies on diverse real-world datasets, and the results show that Auto remarkably outperforms all the baselines.
Multivariate Time Series (MTS) forecasting plays a vital role in a wide range of applications. Recently, Spatial-Temporal Graph Neural Networks (STGNNs) have become increasingly popular MTS forecasting methods. STGNNs jointly model the spatial and temporal patterns of MTS through graph neural networks and sequential models, significantly improving the prediction accuracy. However, limited by model complexity, most STGNNs only consider short-term historical MTS data, such as data from the past hour. Yet the patterns of time series and the dependencies between them (i.e., the temporal and spatial patterns) need to be analyzed based on long-term historical MTS data. To address this issue, we propose a novel framework, in which STGNN is Enhanced by a scalable time series Pre-training model (STEP). Specifically, we design a pre-training model to efficiently learn temporal patterns from very long-term historical time series (e.g., the past two weeks) and generate segment-level representations. These representations provide contextual information for the short-term time series input to STGNNs and facilitate modeling dependencies between time series. Experiments on three public real-world datasets demonstrate that our framework is capable of significantly enhancing downstream STGNNs, and that our pre-training model aptly captures temporal patterns.
Open information extraction (OIE) methods extract plenty of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. However, these two views of knowledge have so far been leveraged in isolation by existing works. In this paper, we propose CMVC, a novel unsupervised framework that leverages these two views of knowledge jointly for canonicalizing OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering their different clustering qualities. In order to further enhance the canonicalization performance, we propose a training data optimization strategy in terms of data quantity and data quality, respectively, in each particular view to refine the learned view-specific embeddings in an iterative manner. Additionally, we propose a Log-Jump algorithm to predict the optimal number of clusters in a data-driven way without requiring any labels. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB datasets against state-of-the-art methods.
Most existing brain imaging work focuses on resting-state fMRI (rs-fMRI) data, where the subject is at rest in the scanner, typically for disease diagnosis problems. Here we analyze task fMRI (t-fMRI) data, where the subject performs a multi-event task over multiple trials. t-fMRI data allow exploring more challenging applications such as prognosis of treatment, but at the cost of being more complex to analyze. Not only do multiple types of trials exist, but the trials of each type are repeated a varying number of times for each subject. This leads to a multi-view (multiple types of trials) and multi-instance (multiple trials of each type for each subject) setting. We propose a deep multi-model architecture to encode multi-view brain activities from t-fMRI data and a multi-layer perceptron ensemble model to combine these view models and make subject-wise predictions. We explore domain adaptation transfer learning between models to address unbalanced views and a novel way to make predictions from multi-instance embeddings. We evaluate our model using subject-wise cross-validation to accurately determine its performance. The experimental results show the proposed method outperforms published methods on the AX-CPT fMRI data for the prognosis problem of predicting treatment improvement in recent-onset childhood schizophrenia. To our knowledge, this is the first data-driven study of the aforementioned task on voxel-wise t-fMRI data of the whole brain.
Unsupervised domain adaptation (UDA) has become an appealing approach for knowledge transfer from a labeled source domain to an unlabeled target domain. However, when the classes in the source and target domains are imbalanced, most existing UDA methods experience a significant performance drop, as the decision boundary usually favors the majority classes. Some recent class-imbalanced domain adaptation (CDA) methods aim to tackle the challenge of biased label distributions by exploiting pseudo-labeled target samples during the training process. However, these methods suffer from unreliable pseudo labels and error accumulation during training. In this paper, we propose a pairwise adversarial training approach for class-imbalanced domain adaptation. Unlike conventional adversarial training, in which the adversarial samples are obtained from the ℓp ball of the original samples, we generate adversarial samples along the interpolated line between aligned pairwise samples from the source and target domains. Pairwise adversarial training (PAT) is a novel data-augmentation method that can be integrated into existing UDA models to tackle the CDA problem. Experimental results and ablation studies show that UDA models integrated with our method achieve considerable improvements on benchmarks compared with the original models as well as state-of-the-art CDA methods. Our source code is available at: https://github.com/DamoSWL/Pairwise-Adversarial-Training
The majority of trading in financial markets is executed through a limit order book (LOB). The LOB is an event-based, continuously-updating system that records contemporaneous demand (`bids' to buy) and supply (`asks' to sell) for a financial asset. Following recent successes in the literature that combine stochastic point processes with neural networks to model event stream patterns, we propose a novel state-dependent parallel neural Hawkes process to predict LOB events and simulate realistic LOB data. The model is characterized by: (1) separate intensity rate modelling for each event type through a parallel structure of continuous-time LSTM units; and (2) an event-state interaction mechanism that improves prediction accuracy and enables efficient sampling of the event-state stream. We first demonstrate the superiority of the proposed model over traditional stochastic or deep learning models for predicting event type and time on a real-world LOB dataset. Using stochastic point sampling from a well-trained model, we then develop a realistic deep learning-based LOB simulator that exhibits multiple stylized facts found in real LOB data.
Recent advances in deep learning have brought remarkable performance improvements in named entity recognition (NER), specifically in token-level classification problems. However, deep learning models often require a large amount of annotated data to achieve satisfactory performance, and NER annotation is significantly time-consuming and labor-intensive due to the fine-grained labels. To address this issue, we propose a textual data augmentation method that can automatically generate informative synthetic samples, which contribute to the development of a robust classifier. The proposed method generates additional training data by estimating the optimal level of worst-case transformation of training data while preserving the original annotation, and includes them into training to construct a robust decision boundary. Extensive experiments conducted on two benchmark datasets in a low-resource environment reveal that the proposed method outperforms two baseline augmentation methods including human annotation, which is typically considered to provide a decent amount of performance boost. To elucidate the processes, we also present in-depth analyses of the generated samples and estimated model parameters.
Graph Neural Networks (GNNs) are playing increasingly important roles in critical decision-making scenarios due to their exceptional performance and end-to-end design. However, concerns have been raised that GNNs could make biased decisions against underprivileged groups or individuals. To remedy this issue, researchers have proposed various fairness notions, including individual fairness, which requires similar predictions for similar individuals. However, existing individual fairness methods rely on the Lipschitz condition: they only optimize overall individual fairness and disregard the equality of individual fairness between groups. This leads to drastically different levels of individual fairness among groups. We tackle this problem by proposing a novel GNN framework, GUIDE, to achieve group equality informed individual fairness in GNNs. We aim to not only achieve individual fairness but also equalize the levels of individual fairness among groups. Specifically, our framework operates on the similarity matrix of individuals to learn personalized attention to achieve individual fairness without group-level disparity. Comprehensive experiments on real-world datasets demonstrate that GUIDE obtains a good balance of group equality informed individual fairness and model utility. The open-source implementation of GUIDE can be found here: https://github.com/mikesong724/GUIDE.
Graph Neural Networks (GNNs) are state-of-the-art models for performing prediction tasks on graphs. While existing GNNs have shown great performance on various tasks related to graphs, little attention has been paid to the scenario where out-of-distribution (OOD) nodes exist in the graph during training and inference. Borrowing the concept from CV and NLP, we define OOD nodes as nodes with labels unseen in the training set. Since many networks are automatically constructed by programs, real-world graphs are often noisy and may contain nodes from unknown distributions. In this work, we define the problem of graph learning with out-of-distribution nodes. Specifically, we aim to accomplish two tasks: 1) detect nodes which do not belong to the known distribution and 2) classify the remaining nodes into one of the known classes. We demonstrate that the connection patterns in graphs are informative for outlier detection, and propose the Out-of-Distribution Graph Attention Network (OODGAT), a novel GNN model which explicitly models the interaction between different kinds of nodes and separates inliers from outliers during feature propagation. Extensive experiments show that OODGAT outperforms existing outlier detection methods by a large margin, while being better or comparable in terms of in-distribution classification.
Recent years have witnessed the burgeoning of data visualization (DV) systems in both the research and the industrial communities since they provide vivid and powerful tools to convey the insights behind the massive data. A necessary step to visualize data is through creating suitable specifications in some declarative visualization languages (DVLs, e.g., Vega-Lite, ECharts). Due to the steep learning curve of mastering DVLs, automatically generating DVs via natural language questions, or text-to-vis, has been proposed and received great attention. However, existing neural network-based text-to-vis models, such as Seq2Vis or ncNet, usually generate DVs from scratch, limiting their performance due to the complex nature of this problem. Inspired by how developers reuse previously validated source code snippets from code search engines or a large-scale codebase when they conduct software development, we provide a novel hybrid retrieval-generation framework named RGVisNet for text-to-vis. It retrieves the most relevant DV query candidate as a prototype from the DV query codebase, and then revises the prototype to generate the desired DV query. Specifically, the DV query retrieval model is a neural ranking model which employs a schema-aware encoder for the NL question, and a GNN-based DV query encoder to capture the structure information of a DV query. At the same time, the DV query revision model shares the same structure and parameters of the encoders, and employs a DV grammar-aware decoder to reuse the retrieved prototype. Experimental evaluation on the public NVBench dataset validates that RGVisNet can significantly outperform existing generative text-to-vis models such as ncNet, by up to 74.28% relative improvement in terms of overall accuracy. To the best of our knowledge, RGVisNet is the first framework that seamlessly integrates the retrieval- with the generative-based approach for the text-to-vis task.
Graph Neural Networks (GNNs) have demonstrated great power for the semi-supervised node classification task. However, most GNN methods are sensitive to noise in graph structures. Graph structure learning (GSL) has therefore been introduced for robustification; it contains two major parts: recovering the optimal graph and fine-tuning the GNN parameters on this generated graph for the downstream task. Nonetheless, most of the existing GSL solutions merely focus on the node features in the first module for graph generation and exploit label information only through back-propagation in the second module for GNN training. They neglect the different roles that labeled and unlabeled nodes could play in GSL for the semi-supervised task, leading to a sub-optimal graph under this setting. In this paper, we give a precise definition of the optimality of the refined graph and provide the exact form of an optimal asymmetric graph structure designed explicitly for semi-supervised node classification, by distinguishing the different roles of labeled and unlabeled nodes through theoretical analysis. We propose a probabilistic model to infer the edge weights in this graph, which can be jointly trained with the subsequent node classification component. Extensive experimental results demonstrate the effectiveness of our method and the rationality of the optimal graph.
Brain extraction and registration are important preprocessing steps in neuroimaging data analysis, where the goal is to extract the brain regions from MRI scans (i.e., the extraction step) and align them with a target brain image (i.e., the registration step). Conventional research mainly focuses on developing methods for the extraction and registration tasks separately under supervised settings. The performance of these methods highly depends on the number of training samples and the visual inspections performed by experts for error correction. However, in many medical studies, collecting voxel-level labels and conducting manual quality control on high-dimensional neuroimages (e.g., 3D MRI) are very expensive and time-consuming. Moreover, brain extraction and registration are highly related tasks in neuroimaging data and should be solved collectively. In this paper, we study the problem of unsupervised collective extraction and registration in neuroimaging data. We propose a unified end-to-end framework, called ERNet (Extraction-Registration Network), to jointly optimize the extraction and registration tasks, allowing feedback between them. Specifically, we use a pair of multi-stage extraction and registration modules to learn the extraction mask and the transformation, where the extraction network improves the extraction accuracy incrementally and the registration network successively warps the extracted image until it is well-aligned with the target image. Experimental results on real-world datasets show that our proposed method can effectively improve the performance on extraction and registration tasks in neuroimaging data.
Detecting beneficial feature interactions is essential in recommender systems, and existing approaches achieve this by examining all possible feature interactions. However, the cost of examining all possible higher-order feature interactions is prohibitive (growing exponentially with the order). Hence, existing approaches only detect beneficial feature interactions of limited order (e.g., combinations of up to four features), which may miss beneficial feature interactions of higher order. In this paper, we propose a hypergraph neural network based model named HIRS. HIRS is the first work that directly generates beneficial feature interactions of arbitrary orders and makes recommendation predictions accordingly. The number of generated feature interactions can be specified to be much smaller than the number of all possible interactions, and hence our model admits a much lower running time. To achieve an effective algorithm, we exploit three properties of beneficial feature interactions and propose deep-infomax-based methods to guide the interaction generation. Our experimental results show that HIRS outperforms state-of-the-art algorithms by up to 5% in terms of recommendation accuracy.
Search result diversification focuses on reducing redundancy and improving subtopic richness in the results for a given query. Most existing approaches measure document diversity mainly based on text or pre-trained representations. However, some underlying relationships between the query and documents are difficult for a model to capture from the content alone. Given that a knowledge base can offer well-defined entities and explicit relationships between entities, we exploit knowledge to model the relationship between documents and the query and propose a knowledge-enhanced search result diversification approach, KEDIV. Concretely, we build a query-specific relation graph to model the complicated query-document relationship from an entity view. Then a graph neural network and a node weight adjustment algorithm are applied to the relation graph to obtain context-aware entity representations and document representations at each selection step. The diversity features are derived from the updated node representations of the relation graph. In this way, we can take advantage of entities' abundant information to model document diversity in search result diversification. Experimental results on commonly used datasets show that our proposed approach outperforms state-of-the-art methods.
In graph classification, attention- and pooling-based graph neural networks (GNNs) prevail as means to extract the critical features from the input graph and support the prediction. They mostly follow the paradigm of learning to attend, which maximizes the mutual information between the attended graph and the ground-truth label. However, this paradigm makes GNN classifiers recklessly absorb all the statistical correlations between input features and labels in the training data, without distinguishing the causal and noncausal effects of features. Instead of underscoring the causal features, the attended graphs are prone to rely on the noncausal features as shortcuts to predictions. Such shortcut features might easily change outside the training distribution, thereby making GNN classifiers suffer from poor generalization.
In this work, we take a causal look at GNN modeling for graph classification. Under our causal assumption, the shortcut feature serves as a confounder between the causal feature and the prediction. It tricks the classifier into learning spurious correlations that facilitate prediction in in-distribution (ID) test evaluation, while causing a performance drop on out-of-distribution (OOD) test data. To endow the classifier with better interpretation and generalization, we propose the Causal Attention Learning (CAL) strategy, which discovers the causal patterns and mitigates the confounding effect of shortcuts. Specifically, we employ attention modules to estimate the causal and shortcut features of the input graph. We then parameterize the backdoor adjustment from causal theory, combining each causal feature with various shortcut features. This encourages stable relationships between the causal estimation and the prediction, regardless of changes in the shortcut parts and distributions. Extensive experiments on synthetic and real-world datasets demonstrate the effectiveness of CAL.
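A minimal sketch of the intervention step described above is given below: each graph's causal representation is paired with shortcut representations drawn from other graphs in the batch, and the classifier is asked to predict the label regardless of the attached shortcut. Combining the two parts by addition and the number of pairings are illustrative choices, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

def causal_intervention_loss(h_causal, h_shortcut, labels, classifier, n_pairings=4):
    """h_causal, h_shortcut: [B, d] graph-level features from the two attention branches.
    Pair each causal feature with shortcut features drawn from other graphs in the batch and
    require the prediction to follow the causal part only (addition is one possible way to
    combine the two parts)."""
    B = h_causal.shape[0]
    loss = 0.0
    for _ in range(n_pairings):
        perm = torch.randperm(B, device=h_causal.device)
        logits = classifier(h_causal + h_shortcut[perm])
        loss = loss + F.cross_entropy(logits, labels)
    return loss / n_pairings
```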
This paper studies the convergence and generalization of a large class of Stochastic Gradient Descent (SGD) momentum schemes, in both learning from scratch and transferring representations with fine-tuning. Momentum-based acceleration of SGD is the default optimizer for many deep learning models. However, there is a lack of general convergence guarantees for many existing momentum variants in conjunction with stochastic gradients. It is also unclear how the momentum methods may affect the generalization error. In this paper, we give a unified analysis of several popular optimizers, e.g., Polyak's heavy ball momentum and Nesterov's accelerated gradient. Our contribution is threefold. First, we give a unified convergence guarantee for a large class of momentum variants in the stochastic setting. Notably, our results cover both convex and nonconvex objectives. Second, we prove a generalization bound for neural networks trained by momentum variants. We analyze how hyperparameters affect the generalization bound and consequently propose guidelines on how to tune these hyperparameters in various momentum schemes to generalize well. We provide extensive empirical evidence for our proposed guidelines. Third, this study fills the gap of a formal analysis of fine-tuning in the literature. To the best of our knowledge, our work is the first systematic generalizability analysis of momentum methods that covers both learning from scratch and fine-tuning. Our code is available at https://github.com/jsycsjh/Demystify-Hyperparameters-for-Stochastic-Optimization-with-Transferable-Representations.
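For reference, the two momentum schemes named above are shown in their textbook form (deterministic step written out; in the stochastic setting the gradient is replaced by a mini-batch estimate, and all hyperparameters are illustrative):

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr=0.1, beta=0.9):
    """Polyak's heavy ball: v <- beta * v - lr * g(w);  w <- w + v."""
    v = beta * v - lr * grad
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov's accelerated gradient: evaluate the gradient at the look-ahead point w + beta * v."""
    grad = grad_fn(w + beta * v)
    v = beta * v - lr * grad
    return w + v, v
```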
Despite the promising representation learning of graph neural networks (GNNs), the supervised training of GNNs notoriously requires large amounts of labeled data for each application. An effective solution is to apply transfer learning on graphs: use easily accessible information to pre-train GNNs, and fine-tune them to optimize the downstream task with only a few labels. Recently, much effort has been devoted to designing self-supervised pretext tasks that encode universal graph knowledge across various applications. However, these efforts rarely notice the inherent training objective gap between the pretext and downstream tasks. This significant gap often requires costly fine-tuning to adapt the pre-trained model to the downstream problem, which prevents the efficient elicitation of pre-trained knowledge and leads to poor performance. Even worse, a naive pre-training strategy can deteriorate the downstream task and damage the reliability of transfer learning on graph data. To bridge the task gap, we propose a novel transfer learning paradigm to generalize GNNs, namely graph pre-training and prompt tuning (GPPT). Specifically, we first adopt masked edge prediction, the simplest and most popular pretext task, to pre-train GNNs. Based on the pre-trained model, we propose a graph prompting function that reformulates the standalone node into a token pair, so that downstream node classification looks the same as edge prediction. The token pair consists of a candidate label class and the node entity. Therefore, the pre-trained GNNs can be applied without tedious fine-tuning to evaluate the linking probability of the token pair and produce the node classification decision. Extensive experiments on eight benchmark datasets demonstrate the superiority of GPPT, delivering an average improvement of 4.29% in few-shot graph analysis and accelerating model convergence by up to 4.32x. The code is available at: https://github.com/MingChen-Sun/GPPT.
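To make the prompting idea more tangible, here is a minimal sketch of how node classification could be scored as a link-prediction problem over (class prototype, node) token pairs. The cosine-similarity scorer, the frozen-GNN assumption, and the tensor shapes are illustrative choices rather than the GPPT implementation.

```python
import torch
import torch.nn.functional as F

def prompt_node_classification(node_emb, class_prototypes):
    """Score each (class prototype, node) token pair the way a link-prediction
    head would score an edge, then read off class logits.
    node_emb:         [N, d] embeddings from a (frozen) pre-trained GNN
    class_prototypes: [C, d] learnable label-class tokens
    Returns logits of shape [N, C]; cosine similarity is a hypothetical scorer."""
    node_emb = F.normalize(node_emb, dim=-1)
    class_prototypes = F.normalize(class_prototypes, dim=-1)
    return node_emb @ class_prototypes.t()

# Illustrative usage with random tensors.
logits = prompt_node_classification(torch.randn(8, 16), torch.randn(3, 16))
pred = logits.argmax(dim=-1)
```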
Including pairwise or higher-order interactions among predictors of a Generalized Additive Model (GAM) is gaining increasing attention in the literature. However, existing models face an identifiability challenge. In this paper, we propose pureGAM, an inherently pure additive model of both main effects and higher-order interactions. By imposing the pureness condition to constrain each component function, pureGAM is proved to be identifiable without compromising accuracy. Furthermore, the pureness condition introduces additional interpretability in terms of simplicity. Practically, pureGAM is a unified model supporting both numerical and categorical features, with a novel learning procedure to achieve optimal performance. Evaluations show that pureGAM outperforms other GAMs and has very competitive performance even compared with opaque models, while its interpretability remarkably outperforms competitors in terms of pureness. We also share a successful adoption of pureGAM in a real-world application.
The variational autoencoder (VAE) is a powerful latent variable model for unsupervised representation learning. However, it does not work well in the case of insufficient data points. To improve the performance in such situations, the conditional VAE (CVAE) is widely used; it aims to share task-invariant knowledge across multiple tasks through a task-invariant latent variable. In the CVAE, the posterior of the latent variable given the data point and task is regularized by a task-invariant prior, which is modeled by the standard Gaussian distribution. Although this regularization encourages independence between the latent variable and the task, the latent variable remains dependent on the task. To reduce this task-dependency, previous work introduced an additional regularizer. However, its learned representation does not work well on the target tasks. In this study, we theoretically investigate why the CVAE cannot sufficiently reduce the task-dependency and show that the simple standard Gaussian prior is one of the causes. Based on this, we propose a theoretically optimal prior for reducing the task-dependency. In addition, we theoretically show that, unlike the previous work, our learned representation works well on the target tasks. Experiments on various datasets show that our approach obtains better task-invariant representations, which improves the performance of various downstream applications such as density estimation and classification.
We study a variant of classical clustering formulations in the context of algorithmic fairness, known as diversity-aware clustering. In this variant we are given a collection of facility subsets, and a solution must contain at least a specified number of facilities from each subset while simultaneously minimizing the clustering objective (k-median or k-means). We investigate the fixed-parameter tractability of these problems and show several negative hardness and inapproximability results, even when we afford exponential running time with respect to some parameters.
Motivated by these results we identify natural parameters of the problem, and present fixed-parameter approximation algorithms with approximation ratios (1 + 2/e + ε) and (1 + 8/e + ε) for diversity-aware k-median and diversity-aware k-means respectively, and argue that these ratios are essentially tight assuming the Gap-Exponential Time Hypothesis. We also present a simple and more practical bicriteria approximation algorithm with better running time bounds. We finally propose efficient and practical heuristics. We evaluate the scalability and effectiveness of our methods in a wide variety of rigorously conducted experiments, on both real and synthetic data.
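For intuition about the problem setting (not the fixed-parameter algorithms above), the following simple greedy heuristic opens facilities that most reduce the k-median cost while first satisfying each group's lower bound; the two-phase structure, names, and the assumption that the lower bounds fit within the budget are all choices made for this illustrative sketch.

```python
import numpy as np

def greedy_diverse_kmedian(dist, groups, lower_bounds, k):
    """Illustrative heuristic for diversity-aware k-median (not the paper's
    FPT algorithm). Assumes sum(lower_bounds) <= k and each group has enough
    facilities to meet its lower bound.
    dist:   [n_clients, n_facilities] distance matrix
    groups: list of facility-index lists; lower_bounds: required picks per group
    """
    n_clients, n_fac = dist.shape
    opened = []
    cost_to_open = np.full(n_clients, 1e18)  # current connection cost per client

    def marginal_gain(f):
        return np.sum(cost_to_open - np.minimum(cost_to_open, dist[:, f]))

    def open_best(candidates):
        nonlocal cost_to_open
        best = max(candidates, key=marginal_gain)
        opened.append(best)
        cost_to_open = np.minimum(cost_to_open, dist[:, best])

    # Phase 1: satisfy each group's lower bound greedily.
    for grp, lb in zip(groups, lower_bounds):
        for _ in range(lb):
            open_best([f for f in grp if f not in opened])
    # Phase 2: spend the remaining budget on globally best facilities.
    while len(opened) < k:
        open_best([f for f in range(n_fac) if f not in opened])
    return opened, cost_to_open.sum()
```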
Cognitive diagnosis, which aims to reveal the proficiency level of learners on knowledge concepts, plays an important role in the intelligent education area and has recently received more and more attention. Although a number of works have been proposed in recent years, most contemporary works acquire the trait parameters of learners and items in a transductive way, which is only suitable for stationary data. However, in real scenarios, the data is collected online, where learners, test items and interactions usually grow continuously, which can rarely meet the stationary condition. To this end, we propose a novel framework, Incremental Cognitive Diagnosis (ICD), to tailor cognitive diagnosis to the online scenario of intelligent education. Specifically, we first design a Deep Trait Network (DTN), which acquires the trait parameters in an inductive way rather than a transductive way. Then, we propose an Incremental Update Algorithm (IUA) to balance effectiveness and training efficiency. We carry out Turning Point (TP) analysis to reduce the update frequency, deriving the minimum update condition based on the monotonicity theory of cognitive diagnosis. Meanwhile, we use a momentum update strategy on the incremental data to decrease update time without sacrificing effectiveness. Moreover, to keep the trait parameters as stable as possible, we refine the loss function in the incremental updating stage. Last but not least, our ICD is a general framework which can be applied to most contemporary cognitive diagnosis models. To the best of our knowledge, this is the first attempt to investigate the incremental cognitive diagnosis problem with theoretical results about the update condition and a tailored incremental learning strategy. Extensive experiments demonstrate the effectiveness and robustness of our method.
Estimating how a treatment affects units individually, known as heterogeneous treatment effect (HTE) estimation, is an essential part of decision-making and policy implementation. The accumulation of large amounts of data in many domains, such as healthcare and e-commerce, has led to increased interest in developing data-driven algorithms for estimating heterogeneous effects from observational and experimental data. However, these methods often make strong assumptions about the observed features and ignore the underlying causal model structure, which can lead to biased HTE estimation. At the same time, accounting for the causal structure of real-world data is rarely trivial since the causal mechanisms that gave rise to the data are typically unknown. To address this problem, we develop a feature selection method that considers each feature's value for HTE estimation and learns the relevant parts of the causal structure from data. We provide strong empirical evidence that our method improves existing data-driven HTE estimation methods under arbitrary underlying causal structures. Our results on synthetic, semi-synthetic, and real-world datasets show that our feature selection algorithm leads to lower HTE estimation error.
Classical recommendation methods typically render user representation as a single vector in latent space. Oftentimes, a user's interactions with items are influenced by several hidden factors. To better uncover these hidden factors, we seek disentangled representations. Existing disentanglement methods for recommendations are mainly concerned with user-item interactions alone. To further improve not only the effectiveness of recommendations but also the interpretability of the representations, we propose to learn a second set of disentangled user representations from textual content and to align the two sets of representations with one another. The purpose of this coupling is two-fold. For one benefit, we leverage textual content to resolve the sparsity of user-item interactions, leading to higher recommendation accuracy. For another benefit, by regularizing factors learned from user-item interactions with factors learned from textual content, we map uninterpretable dimensions of the user representation to words. An attention-based alignment is introduced to align and enrich the hidden factor representations. A series of experiments conducted on four real-world datasets show the efficacy of our methods in improving recommendation quality.
Atmospheric winds are a key physical phenomenon impacting natural hazards, energy transport, ocean currents, large-scale circulation, and ecosystem fluxes. Observing winds is a complex process and remains a large gap in NASA's Earth Observation System. Atmospheric motion vectors (AMVs) aim to fill this gap by numerically estimating cloud movement between sequences of multi-spectral satellite images, tracking clouds and water vapor. Recent imaging hardware and software advancements have enabled the use of numerical optical flow techniques to produce accurate and dense vector fields outperforming traditional methods. This work presents WindFlow, the first machine learning based system for feature tracking of atmospheric motion using optical flow. Due to the lack of large-scale satellite-based observations, we leverage high-resolution numerical simulations from NASA's GEOS-5 Nature Run to perform supervised learning and transfer to satellite images. We demonstrate that our approach using deep learning based optical flow scales to ultra-high-resolution images of size 2881x5760 with less than 1 m/s bias and 2.5 m/s average error. Four network and learning architectures are compared, and recurrent all-pairs field transforms (RAFT) produces the lowest errors on all metrics for wind speed and direction. Results on held-out numerical outputs show RAFT's good performance in each of the spatial, temporal, and physical dimensions. A comparison between WindFlow and an operational AMV product against rawinsonde observations shows that RAFT transfers across simulations and thermal infrared satellite observations. This work shows that machine learning based optical flow is an efficient approach to generating robust feature tracking for AMVs consistently over large regions.
Collaborative filtering (CF) plays a critical role in the development of recommender systems. Most CF methods utilize an encoder to embed users and items into the same representation space, and the Bayesian personalized ranking (BPR) loss is usually adopted as the objective function to learn informative encoders. Existing studies mainly focus on designing more powerful encoders (e.g., graph neural networks) to learn better representations. However, few efforts have been devoted to investigating the desired properties of representations in CF, which is important for understanding the rationale of existing CF methods and designing new learning objectives. In this paper, we measure the representation quality in CF from the perspective of alignment and uniformity on the hypersphere. We first theoretically reveal the connection between the BPR loss and these two properties. Then, we empirically analyze the learning dynamics of typical CF methods in terms of quantified alignment and uniformity, which shows that better alignment or uniformity both contribute to higher recommendation performance. Based on these analysis results, a learning objective that directly optimizes the two properties is proposed, named DirectAU. We conduct extensive experiments on three public datasets, and the proposed learning framework with a simple matrix factorization model leads to significant performance improvements compared to state-of-the-art CF methods.
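The alignment and uniformity quantities referenced above have standard formulations on the hypersphere, which the sketch below implements in PyTorch: alignment as the mean squared distance between normalized positive user-item pairs, and uniformity as the log of the mean Gaussian potential over pairwise distances. The trade-off weight gamma and the use of in-batch embeddings are illustrative assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def alignment(user_emb, item_emb):
    """Mean squared distance between normalized embeddings of positive pairs."""
    u, i = F.normalize(user_emb, dim=-1), F.normalize(item_emb, dim=-1)
    return (u - i).pow(2).sum(dim=-1).mean()

def uniformity(emb, t=2.0):
    """log E[exp(-t * squared pairwise distance)] over normalized embeddings."""
    e = F.normalize(emb, dim=-1)
    return torch.pdist(e, p=2).pow(2).mul(-t).exp().mean().log()

def alignment_uniformity_loss(user_emb, item_emb, gamma=1.0):
    """Illustrative alignment-plus-uniformity objective; gamma is a tunable
    trade-off weight assumed here, not a prescribed value."""
    return alignment(user_emb, item_emb) + gamma * (
        uniformity(user_emb) + uniformity(item_emb)) / 2
```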
Representation (feature) space is an environment where data points are vectorized, distances are computed, patterns are characterized, and geometric structures are embedded. Extracting a good representation space is critical to address the curse of dimensionality, improve model generalization, overcome data sparsity, and increase the availability of classic models. Existing literature, such as feature engineering and representation learning, is limited in achieving full automation (e.g., heavy reliance on intensive labor and empirical experience), explainable explicitness (e.g., a traceable reconstruction process and explainable new features), and flexible optimality (e.g., optimal feature space reconstruction is not embedded into downstream tasks). Can we simultaneously address the automation, explicitness, and optimality challenges in representation space reconstruction for a machine learning task? To answer this question, we propose a group-wise reinforcement generation perspective. We reformulate representation space reconstruction as an interactive process of nested feature generation and selection, where feature generation is to generate new meaningful and explicit features, and feature selection is to eliminate redundant features to control feature sizes. We develop a cascading reinforcement learning method that leverages three cascading Markov Decision Processes to learn optimal generation policies to automate the selection of features and operations and the feature crossing. We design a group-wise generation strategy to cross a feature group, an operation, and another feature group to generate new features, and find that this strategy can enhance exploration efficiency and augment the reward signals of the cascading agents. Finally, we present extensive experiments to demonstrate the effectiveness, efficiency, traceability, and explicitness of our system.
Topic mining extracts patterns and insights from text data (e.g., documents, emails and product reviews), which can be used in various applications such as intent detection. However, topic mining can pose severe privacy threats to the users who have contributed to the text corpus, since they can be re-identified from the text data with certain background knowledge. To the best of our knowledge, we propose the first differentially private topic mining technique (namely TopicDP), which injects well-calibrated Gaussian noise into the matrix output of any topic mining algorithm to ensure differential privacy and good utility. Specifically, we smoothen the sensitivity for the Gaussian mechanism via sensitivity sampling, which addresses the major challenges resulting from the high sensitivity of topic mining under differential privacy. Furthermore, we theoretically prove the differential privacy guarantee under the Rényi differential privacy mechanism and the utility error bounds of TopicDP. Finally, we conduct extensive experiments on two real-world text datasets (Enron email and Amazon Reviews), and the experimental results demonstrate that TopicDP is a model-agnostic framework that achieves better privacy-preserving performance for topic mining compared against other differential privacy mechanisms.
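To illustrate the general shape of the perturbation step (though not the paper's sensitivity-sampling procedure or its Rényi-DP accounting), the snippet below applies the classical (ε, δ) Gaussian mechanism to a topic matrix, treating the sensitivity as a given input.

```python
import numpy as np

def gaussian_mechanism(topic_matrix, sensitivity, epsilon, delta, rng=None):
    """Add calibrated Gaussian noise to a topic-word (or document-topic) matrix.
    Uses the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon;
    the paper's sensitivity-sampling step is not reproduced here, and
    `sensitivity` is treated as an externally supplied value."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return topic_matrix + rng.normal(0.0, sigma, size=topic_matrix.shape)
```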
Data augmentation has been proven to be an effective technique for developing machine learning models that are robust to known classes of distributional shifts (e.g., rotations of images), and alignment regularization is a technique often used together with data augmentation to further help the model learn representations invariant to the shifts used to augment the data. In this paper, motivated by the proliferation of alignment regularization options, we evaluate the performance of several popular design choices along the dimensions of robustness and invariance, for which we introduce a new test procedure. Our synthetic experiment results speak to the benefits of squared ℓ2 norm regularization. Further, we also formally analyze the behavior of alignment regularization to complement our empirical study under assumptions we consider realistic. Finally, we test the simple technique we identify (worst-case data augmentation with squared ℓ2 norm alignment regularization) and show that its benefits exceed those of specially designed methods. We also release a software package in both TensorFlow and PyTorch that lets users apply the method with a couple of lines of code at https://github.com/jyanln/AlignReg.
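A minimal sketch of squared ℓ2 alignment regularization combined with an augmented view is shown below; the weight lam and the choice to align logits rather than intermediate features are assumptions of this sketch, not prescriptions from the paper.

```python
import torch
import torch.nn.functional as F

def alignreg_loss(model, x, x_aug, y, lam=1.0):
    """Cross-entropy on clean inputs plus a squared-l2 penalty aligning the
    model's outputs on clean and augmented views of the same examples."""
    logits = model(x)
    logits_aug = model(x_aug)
    ce = F.cross_entropy(logits, y)
    align = (logits - logits_aug).pow(2).sum(dim=-1).mean()
    return ce + lam * align
```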
Learning the individualized causal effect (ICE) plays a vital role in various fields of big data analysis, ranging from fine-grained policy evaluation to personalized treatment development. However, the presence of unmeasured confounders increases the difficulty of estimating the ICE in real-world scenarios. A wide range of methods have been proposed to address the unmeasured confounders with the aid of instrumental variables (IVs), which stem from treatment randomization. The performance of these methods relies on well-predefined IVs that satisfy the unconfounded instruments assumption (i.e., the IVs are independent of the unmeasured confounders given observed covariates), which is untestable and makes finding a valid IV an art rather than a science. In this paper, we focus on estimating the ICE with confounded instruments that violate the unconfounded instruments assumption. By considering the conditional independence between the set of confounded instruments and the outcome variable, we propose a novel method, named CVAE-IV, to generate a substitute for the unmeasured confounder with a conditional variational autoencoder. Our theoretical analysis guarantees that the generated confounder substitute identifies an unbiased ICE. Extensive experiments on bias demand prediction and Mendelian randomization analysis verify the effectiveness of our method.
The item fairness issue has become one of the significant concerns in the development of recommender systems in recent years, focusing on whether items' exposure is consistent with their utility. The measurement of item unfairness thus depends on how item utility is modeled, and most previous approaches estimate item utility simply from user-item interaction logs in recommender systems, with the click-through rate (CTR) being the most popular choice. However, we argue that these types of item utility measurements (named observed utility here) may result in unfair exposure of items: the number of exposures for each item is uneven, and recommendation methods select the exposure audiences (users).
In this work, we propose the concept of items' fair utility, defined as the proportion of users who are interested in the item among all users. Firstly, we conduct a large-scale random exposure experiment to collect the fair utility in a real-world recommender application. Significant differences are observed between the fair utility and the widely used observed utility (CTR). Then, intending to obtain fair utility at a low cost, we propose an exploratory task for real-time estimations of fair utility with handy historical interaction logs. Encouraging results are achieved, validating the feasibility of fair utility projections. Furthermore, we present a fairness-aware re-distribution framework and conduct abundant simulation experiments, adopting fair utility to improve fairness and overall recommendation performance at the same time. Online and offline results show that both item fairness and recommendation quality can be improved simultaneously by introducing item fair utility.
Training Graph Neural Networks (GNNs) incrementally is a particularly urgent problem, because real-world graph data usually arrives in a streaming fashion, and inefficient updating of the models results in out-of-date embeddings, thus degrading performance in downstream tasks. Traditional incremental learning methods gradually forget old knowledge when learning new patterns, which is the catastrophic forgetting problem. Although saving and revisiting historical graph data alleviates the problem, storage limitations in real-world applications reduce the amount of saved data, causing the GNN to forget other knowledge. In this paper, we propose a streaming GNN based on generative replay, which can incrementally learn new patterns while maintaining existing knowledge without accessing historical data. Specifically, our model consists of the main model (a GNN) and an auxiliary generative model. The generative model, based on random walks with restart, can learn and generate fake historical samples (i.e., nodes and their neighborhoods), which can be trained together with real data to avoid the forgetting problem. Besides, we also design an incremental update algorithm for the generative model to maintain the graph distribution and for the GNN to capture the current patterns. Our model is evaluated on different streaming datasets. The node classification results prove that our model can update efficiently and achieve comparable performance to model retraining. Code is available at https://github.com/Junshan-Wang/SGNN-GR.
The importance of building text-to-SQL parsers which can be applied to new databases has long been acknowledged, and a critical step to achieve this goal is schema linking, i.e., properly recognizing mentions of unseen columns or tables when generating SQLs. In this work, we propose a novel framework to elicit relational structures from large-scale pre-trained language models (PLMs) via a probing procedure based on Poincaré distance metric, and use the induced relations to augment current graph-based parsers for better schema linking. Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences, even when surface forms of mentions and entities differ. Moreover, our probing procedure is entirely unsupervised and requires no additional parameters. Extensive experiments show that our framework sets new state-of-the-art performance on three benchmarks. We empirically verify that our probing procedure can indeed find desired relational structures through qualitative analysis.
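The probing procedure above relies on the Poincaré distance metric, whose closed form is easy to state; the small NumPy helper below computes it for two points inside the unit ball and is provided only as a reference for the metric, not as part of the probing pipeline.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance in the Poincare ball model:
    d(u, v) = arccosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2) (1 - ||v||^2))).
    Assumes ||u||, ||v|| < 1; eps guards against division by zero."""
    sq = np.sum((u - v) ** 2)
    denom = max((1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2)), eps)
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Illustrative usage with two points inside the unit ball.
print(poincare_distance(np.array([0.1, 0.2]), np.array([-0.3, 0.4])))
```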
The increased integration of renewable energy poses a slew of technical challenges for the operation of power distribution networks. Among them, voltage fluctuations caused by the instability of renewable energy are receiving increasing attention. Utilizing MARL algorithms to coordinate multiple control units in the grid, which can handle rapid changes in power systems, has recently been widely studied for the active voltage control task. However, existing MARL-based approaches ignore the unique nature of the grid and achieve limited performance. In this paper, we introduce the transformer architecture to extract representations adapted to power network problems and propose a Transformer-based Multi-Agent Actor-Critic framework (T-MAAC) to stabilize voltage in power distribution networks. In addition, we adopt a novel auxiliary-task training process tailored to the voltage control task, which improves sample efficiency and facilitates the representation learning of the transformer-based model. We couple T-MAAC with different multi-agent actor-critic algorithms, and the consistent improvements on the active voltage control task demonstrate the effectiveness of the proposed method.
Node classification is of great importance among various graph mining tasks. In practice, real-world graphs generally follow the long-tail distribution, where a large number of classes only consist of limited labeled nodes. Although Graph Neural Networks (GNNs) have achieved significant improvements in node classification, their performance decreases substantially in such a few-shot scenario. The main reason can be attributed to the vast generalization gap between meta-training and meta-test due to the task variance caused by different node/class distributions in meta-tasks (i.e., node-level and class-level variance). Therefore, to effectively alleviate the impact of task variance, we propose a task-adaptive node classification framework under the few-shot learning setting. Specifically, we first accumulate meta-knowledge across classes with abundant labeled nodes. Then we transfer such knowledge to the classes with limited labeled nodes via our proposed task-adaptive modules. In particular, to accommodate the different node/class distributions among meta-tasks, we propose three essential modules to perform node-level, class-level, and task-level adaptations in each meta-task, respectively. In this way, our framework can adapt to different meta-tasks and thus advance the model generalization performance on meta-test tasks. Extensive experiments on four prevalent node classification datasets demonstrate the superiority of our framework over the state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/TENT.
Partial label learning is a weakly supervised learning framework where each training example is associated with multiple candidate labels, among which only one is valid. Existing works on partial label learning mainly focus on classification model induction by disambiguating candidate label sets in the output space. Nevertheless, the feature representations of partial label training examples may be less informative of the ground-truth labels, which may negatively influence the disambiguation process. To circumvent this difficulty, the first attempt towards discrimination augmentation for partial label learning is investigated in this paper. The feature space is enriched with confidence-rated class prototype features to replenish discriminative characteristics of the underlying ground-truth labels for partial label training examples. Specifically, an optimization formulation is proposed to jointly optimize the class prototypes and estimate the labeling confidence over partial label training examples, which enforces both global consistency in the feature space and local consistency in the label space. We show that the class prototypes and the labeling confidence can be solved via alternating optimization. Extensive experiments on synthetic as well as real-world data sets validate the effectiveness of the proposed approach for improving the generalization performance of state-of-the-art partial label learning algorithms.
Conversational recommender systems (CRS) aim to proactively elicit user preference and recommend high-quality items through natural language conversations. Typically, a CRS consists of a recommendation module to predict preferred items for users and a conversation module to generate appropriate responses. To develop an effective CRS, it is essential to seamlessly integrate the two modules. Existing works either design semantic alignment strategies, or share knowledge resources and representations between the two modules. However, these approaches still rely on different architectures or techniques to develop the two modules, making it difficult for effective module integration. To address this problem, we propose a unified CRS model named UniCRS based on knowledge-enhanced prompt learning. Our approach unifies the recommendation and conversation subtasks into the prompt learning paradigm, and utilizes knowledge-enhanced prompts based on a fixed pre-trained language model (PLM) to fulfill both subtasks in a unified approach. In the prompt design, we include fused knowledge representations, task-specific soft tokens, and the dialogue context, which can provide sufficient contextual information to adapt the PLM for the CRS task. Besides, for the recommendation subtask, we also incorporate the generated response template as an important part of the prompt, to enhance the information interaction between the two subtasks. Extensive experiments on two public CRS datasets have demonstrated the effectiveness of our approach. Our code is publicly available at the link: https://github.com/RUCAIBox/UniCRS.
Graph Neural Networks (GNNs) have shown great power in learning node representations on graphs. However, they may inherit historical prejudices from training data, leading to discriminatory bias in predictions. Although some works have developed fair GNNs, most of them directly borrow fair representation learning techniques from non-graph domains without considering the potential problem of sensitive attribute leakage caused by feature propagation in GNNs. Indeed, we empirically observe that feature propagation can vary the correlation of previously innocuous non-sensitive features with the sensitive ones. This can be viewed as a leakage of sensitive information which could further exacerbate discrimination in predictions. Thus, we design two feature masking strategies according to feature correlations to highlight the importance of considering feature propagation and correlation variation in alleviating discrimination. Motivated by our analysis, we propose the Fair View Graph Neural Network (FairVGNN) to generate fair views of features by automatically identifying and masking sensitive-correlated features while accounting for correlation variation after feature propagation. Given the learned fair views, we adaptively clamp the weights of the encoder to avoid using sensitive-related features. Experiments on real-world datasets demonstrate that FairVGNN enjoys a better trade-off between model utility and fairness.
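To illustrate why correlation variation after propagation matters, the sketch below propagates features over a normalized adjacency matrix, measures each channel's absolute correlation with the sensitive attribute, and masks the most correlated channels. This is a hand-rolled illustration of the motivation, not FairVGNN's learned masking; the number of hops and the top-k cutoff are assumptions.

```python
import numpy as np

def mask_sensitive_correlated(features, adj_norm, sensitive, hops=2, top_k=5):
    """Propagate features, compute |Pearson correlation| of each propagated
    channel with the sensitive attribute, and zero out the top-k channels.
    features: [n, d], adj_norm: [n, n] normalized adjacency, sensitive: [n]."""
    h = features.copy()
    for _ in range(hops):
        h = adj_norm @ h                      # simple feature propagation
    s = (sensitive - sensitive.mean()) / (sensitive.std() + 1e-9)
    hc = (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-9)
    corr = np.abs(hc.T @ s) / len(s)          # |correlation| per channel
    masked = features.copy()
    masked[:, np.argsort(-corr)[:top_k]] = 0.0
    return masked, corr
```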
Recently, Neural Architecture Search (NAS) for GNNs has gained increasing popularity as it can seek an optimal architecture for a given new graph. However, the optimal architecture is applied to all the instances (i.e., nodes, in the context of graphs) equally, which might be insufficient to handle the diverse local patterns ingrained in a graph, as shown in this paper and some very recent studies. Thus, we argue for the necessity of node-wise architecture search for GNNs. Nevertheless, node-wise architectures cannot be realized by trivially applying NAS methods node by node, due to the scalability issue and the need to determine test nodes' architectures. To tackle these challenges, we propose a framework wherein parametric controllers decide the GNN architecture for each node based on its local patterns. We instantiate our framework with depth, aggregator and resolution controllers, and then elaborate on learning the backbone GNN model and the controllers to encourage their cooperation. Empirically, we justify the effects of node-wise architecture through the performance improvements introduced by the three controllers, respectively. Moreover, our proposed framework significantly outperforms state-of-the-art methods on five of the ten real-world datasets, where the diversity of these datasets has hindered any graph convolution-based method from leading on all of them simultaneously. This result further confirms that node-wise architecture can help GNNs become versatile models.
Learned recommender systems may inadvertently leak information about their training data, leading to privacy violations. We investigate privacy threats faced by recommender systems through the lens of membership inference. In such attacks, an adversary aims to infer whether a user's data is used to train the target recommender. To achieve this, previous work has used a shadow recommender to derive training data for the attack model, and then predicts membership by calculating difference vectors between users' historical interactions and recommended items. State-of-the-art methods face two challenging problems: (i) training data for the attack model is biased due to the gap between shadow and target recommenders, and (ii) hidden states in recommenders are not observable, resulting in inaccurate estimation of difference vectors.
To address the above limitations, we propose a Debiasing Learning for Membership Inference Attacks against recommender systems (DL-MIA) framework that has four main components: (i) a difference vector generator, (ii) a disentangled encoder, (iii) a weight estimator, and (iv) an attack model. To mitigate the gap between recommenders, a variational auto-encoder (VAE) based disentangled encoder is devised to identify recommender-invariant and recommender-specific features. To reduce the estimation bias, we design a weight estimator, assigning a truth-level score to each difference vector to indicate estimation accuracy. We evaluate DL-MIA against both general recommenders and sequential recommenders on three real-world datasets. Experimental results show that DL-MIA effectively alleviates training and estimation biases simultaneously, and achieves state-of-the-art attack performance.
Current recommender systems have achieved great success in online services, such as e-commerce and social media. However, they still suffer from performance degradation in real scenarios, because various biases arise in the generation process of user behaviors. Despite recent progress in addressing specific types of bias, a variety of data biases, some of which are even unknown, are often mixed together in real applications. Although uniform (or unbiased) data may help for the purpose of general debiasing, such data can either be hardly available or induce high experimental cost. In this paper, we consider a more practical setting where we aim to conduct general debiasing with the biased observational data alone. We assume that the observational user behaviors are determined by invariant preference (i.e., a user's true preference) and variant preference (affected by some unobserved confounders). We propose a novel recommendation framework called InvPref which iteratively decomposes the invariant preference and variant preference from biased observational user behaviors by estimating heterogeneous environments corresponding to different types of latent bias. Extensive experiments, covering both general debiasing and specific debiasing settings, verify the advantages of our method.
Reducing sensor requirements while keeping optimal control performance is crucial for many industrial control applications to achieve robust, low-cost, and computation-efficient controllers. However, existing feature selection solutions for the typical machine learning domain can hardly be applied in the domain of control with changing dynamics. In this paper, a novel framework, namely the Dual-world embedded Attentive Feature Selection (D-AFS), can efficiently select the most relevant sensors for the system under dynamic control. Rather than the single world used in most Deep Reinforcement Learning (DRL) algorithms, D-AFS has both the real world and its virtual peer with twisted features. By analyzing the DRL's response in the two worlds, D-AFS can quantitatively identify each feature's importance towards control. A well-known active flow control problem, cylinder drag reduction, is used for evaluation. Results show that D-AFS successfully finds an optimized five-probe layout with 18.7% higher drag reduction than the state-of-the-art solution with 151 probes and 49.2% higher than a five-probe layout designed by human experts. We also apply this solution to four OpenAI classical control cases. In all cases, D-AFS achieves the same or better sensor configurations than the originally provided solutions. These results highlight, we argue, a new way to achieve efficient and optimal sensor designs for experimental or industrial systems. Our source code is made publicly available at https://github.com/G-AILab/DAFSFluid.
In recommender systems, one common challenge is the cold-start problem, where interactions are very limited for fresh users in the system. To address this challenge, many recent works introduce the meta-optimization idea into recommendation scenarios, i.e., learning to learn the user preference from only a few past interacted items. The core idea is to learn global shared meta-initialization parameters for all users and rapidly adapt them into local parameters for each user respectively. They aim at deriving general knowledge across the preference learning of various users, so as to rapidly adapt to a future new user with the learned prior and a small amount of training data. However, previous works have shown that recommender systems are generally vulnerable to bias and unfairness. Despite the success of meta-learning at improving recommendation performance with cold-start, the fairness issues are largely overlooked.
In this paper, we propose a comprehensive fair meta-learning framework, named CLOVER, for ensuring the fairness of meta-learned recommendation models. We systematically study three kinds of fairness - individual fairness, counterfactual fairness, and group fairness in the recommender systems, and propose to satisfy all three kinds via a multi-task adversarial learning scheme. Our framework offers a generic training paradigm that is applicable to different meta-learned recommender systems. We demonstrate the effectiveness of CLOVER on the representative meta-learned user preference estimator on three real-world data sets. Empirical results show that CLOVER achieves comprehensive fairness without deteriorating the overall cold-start recommendation performance.
Relation extraction (RE) is an important task for many natural language processing applications. The document-level relation extraction task aims to extract the relations within a document and poses many challenges to RE as it requires reasoning across sentences and handling multiple relations expressed in the same document. Existing state-of-the-art document-level RE models use the graph structure to better connect long-distance correlations. In this work, we propose the SagDRE model, which further considers and captures the original sequential information from the text. The proposed model learns sentence-level directional edges to capture the information flow in the document and uses the token-level sequential information to encode the shortest paths from one entity to the other. In addition, we propose an adaptive margin loss to address the long-tailed multi-label problem of document-level RE tasks, where multiple relations can be expressed in a document for an entity pair and only a few relations are popular. The loss function aims to encourage separation between positive and negative classes. The experimental results on datasets from various domains demonstrate the effectiveness of the proposed methods.
Opioids (e.g., oxycodone and morphine) are highly addictive prescription (aka Rx) drugs which can be easily overprescribed and lead to opioid overdose. Recently, the opioid epidemic has become increasingly serious across the US, as its related deaths have risen at alarming rates. To combat the deadly opioid epidemic, a state-run prescription drug monitoring program (PDMP) has been established to alleviate the drug over-prescribing problem in the US. Although the PDMP provides a detailed prescription history related to opioids, it is still not enough to prevent opioid overdose because it cannot predict over-prescribing risk. In addition, existing machine learning-based methods mainly focus on drug doses while ignoring other prescribing patterns behind patients' historical records, thus resulting in suboptimal performance. To this end, we propose a novel model, DDHGNN (Disentangled Dynamic Heterogeneous Graph Neural Network), for over-prescribing prediction. Specifically, we abstract the PDMP data into a dynamic heterogeneous graph which comprehensively depicts the prescribing and dispensing (P&D) relationships. Then, we design a dynamic heterogeneous graph neural network to learn patients' representations. Furthermore, we devise an adversarial disentangler to learn a disentangled representation that is particularly related to the prescribing patterns. Extensive experiments on a 1-year anonymized PDMP dataset demonstrate that DDHGNN outperforms state-of-the-art methods, revealing its promising future in preventing opioid overdose.
Zero-inflated, heavy-tailed spatiotemporal data is common across science and engineering, from climate science to meteorology and seismology. A central modeling objective in such settings is to forecast the intensity, frequency, and timing of extreme and non-extreme events; yet in the context of deep learning, this objective presents several key challenges. First, a deep learning framework applied to such data must unify a mixture of distributions characterizing the zero events, moderate events, and extreme events. Second, the framework must be capable of enforcing parameter constraints across each component of the mixture distribution. Finally, the framework must be flexible enough to accommodate changes in the threshold used to define an extreme event after training. To address these challenges, we propose the Deep Extreme Mixture Model (DEMM), fusing a deep learning-based hurdle model with extreme value theory to enable point and distribution prediction of zero-inflated, heavy-tailed spatiotemporal variables. The framework enables users to dynamically set a threshold for defining extreme events at inference time without the need for retraining. We present an extensive experimental analysis applying DEMM to precipitation forecasting, and observe significant improvements in point and distribution prediction. All code is available at https://github.com/andrewmcdonald27/DeepExtremeMixtureModel.
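One way to picture the three-part mixture described above is the illustrative density below, which combines a point mass at zero, a Gamma body below a threshold, and a generalized Pareto tail above it. The specific component families and the way probability mass is split are assumptions of this sketch, not DEMM's parameterization.

```python
import numpy as np
from scipy.stats import gamma, genpareto

def hurdle_mixture_density(x, p_zero, p_extreme, threshold,
                           gamma_shape, gamma_scale, gpd_shape, gpd_scale):
    """Illustrative three-part hurdle density: a point mass at zero (returned
    separately via p_zero), a Gamma body renormalized to (0, threshold], and a
    generalized Pareto tail for exceedances above the threshold."""
    x = np.asarray(x, dtype=float)
    dens = np.zeros_like(x)
    body = (x > 0) & (x <= threshold)
    tail = x > threshold
    p_body = 1.0 - p_zero - p_extreme
    # Gamma body, renormalized to the interval (0, threshold].
    z_body = gamma.cdf(threshold, gamma_shape, scale=gamma_scale)
    dens[body] = p_body * gamma.pdf(x[body], gamma_shape, scale=gamma_scale) / z_body
    # Generalized Pareto tail over exceedances of the threshold.
    dens[tail] = p_extreme * genpareto.pdf(x[tail] - threshold, gpd_shape, scale=gpd_scale)
    return dens  # the mass p_zero at exactly zero is a point mass, not a density
```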
Science and engineering fields use computer simulation extensively. These simulations are often run at multiple levels of sophistication to balance accuracy and efficiency. Multi-fidelity surrogate modeling reduces the computational cost by fusing different simulation outputs. Cheap data generated by low-fidelity simulators can be combined with limited high-quality data generated by an expensive high-fidelity simulator. Existing methods based on Gaussian processes rely on strong assumptions about the kernel functions and can hardly scale to high-dimensional settings. We propose Multi-fidelity Hierarchical Neural Processes (MF-HNP), a unified neural latent variable model for multi-fidelity surrogate modeling. MF-HNP inherits the flexibility and scalability of Neural Processes. The latent variables transform the correlations among different fidelity levels from the observation space to the latent space. The predictions across fidelities are conditionally independent given the latent states, which helps alleviate the error propagation issue of existing methods. MF-HNP is flexible enough to handle non-nested, high-dimensional data at different fidelity levels with varying input and output dimensions. We evaluate MF-HNP on epidemiology and climate modeling tasks, achieving competitive performance in terms of accuracy and uncertainty estimation. In contrast to deep Gaussian processes, which have only been applied to low-dimensional (< 10) tasks, our method shows great promise for speeding up high-dimensional complex simulations (over 7000 for epidemiology modeling and 45000 for climate modeling).
Open-set domain adaptation aims to improve the generalization performance of a learning algorithm on a target task of interest by leveraging the label information from a relevant source task with only a subset of classes. However, most existing works are designed for the static setting and can hardly be extended to the dynamic setting commonly seen in many real-world applications. In this paper, we focus on the more realistic open-set domain adaptation setting with a static source task and a time-evolving target task where novel unknown target classes appear over time. Specifically, we show that the classification error of the new target task can be tightly bounded in terms of positive-unlabeled classification errors for historical tasks and the open-set domain discrepancy across tasks. By empirically minimizing the upper bound of the target error, we propose a novel positive-unlabeled learning based algorithm named OuterAdapter for dynamic open-set domain adaptation with time-evolving unknown classes. Extensive experiments on various data sets demonstrate the effectiveness and efficiency of our proposed OuterAdapter algorithm over state-of-the-art domain adaptation baselines.
Exploration-Exploitation (E&E) algorithms are commonly adopted to deal with the feedback-loop issue in large-scale online recommender systems. Most existing studies believe that high uncertainty can be a good indicator of potential reward, and thus primarily focus on the estimation of model uncertainty. We argue that such an approach overlooks the subsequent effect of exploration on model training. From the perspective of online learning, the adoption of an exploration strategy also affects the collection of training data, which further influences model learning. To understand the interaction between exploration and training, we design a Pseudo-Exploration module that simulates the model updating process after a certain item is explored and the corresponding feedback is received. We further show that such a process is equivalent to adding an adversarial perturbation to the model input, and thereby name our proposed approach Adversarial Gradient Driven Exploration (AGE). For production deployment, we propose a dynamic gating unit to pre-determine the utility of an exploration. This enables us to utilize the limited amount of resources for exploration and avoid wasting pageview resources on ineffective exploration. The effectiveness of AGE was first examined through an extensive number of ablation studies on an academic dataset. Meanwhile, AGE has also been deployed to one of the world-leading display advertising platforms, and we observe significant improvements on various top-line evaluation metrics.
Community detection refers to the task of discovering closely related subgraphs to understand the networks. However, traditional community detection algorithms fail to pinpoint a particular kind of community. This limits their applicability in real-world networks, e.g., distinguishing fraud groups from normal ones in transaction networks. Recently, semi-supervised community detection has emerged as a solution. It aims to seek other similar communities in the network with a few labeled communities as training data. Existing works can be regarded as seed-based: they locate seed nodes and then develop communities around the seeds. However, these methods are quite sensitive to the quality of the selected seeds, since communities generated around a mis-detected seed may be irrelevant. Besides, they have individual issues, e.g., inflexibility and high computational overhead. To address these issues, we propose CLARE, which consists of two key components, a Community Locator and a Community Rewriter. Our idea is to first locate potential communities and then refine them. Therefore, the community locator is proposed for quickly locating potential communities by seeking subgraphs that are similar to training ones in the network. To further adjust these located communities, we devise the community rewriter. Enhanced by deep reinforcement learning, it suggests intelligent decisions, such as adding or dropping nodes, to refine community structures flexibly. Extensive experiments verify both the effectiveness and efficiency of our work compared with prior state-of-the-art approaches on multiple real-world datasets.
Recently discovered polyhedral structures of the value function for finite discounted Markov decision processes (MDP) shed light on understanding the success of reinforcement learning. We investigate the value function polytope in greater detail and characterize the polytope boundary using a hyperplane arrangement. We further show that the value space is a union of finitely many cells of the same hyperplane arrangement, and relate it to the polytope of the classical linear programming formulation for MDPs. Inspired by these geometric properties, we propose a new algorithm, Geometric Policy Iteration (GPI), to solve discounted MDPs. GPI updates the policy of a single state by switching to an action that is mapped to the boundary of the value function polytope, followed by an immediate update of the value function. This new update rule aims at a faster value improvement without compromising computational efficiency. Moreover, our algorithm allows asynchronous updates of state values which is more flexible and advantageous compared to traditional policy iteration when the state set is large. We prove that the complexity of GPI achieves the best known bound O(|𝓐| / (1 − γ) · log(1 / (1 − γ))) of policy iteration and empirically demonstrate the strength of GPI on MDPs of various sizes.
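For readers less familiar with the baseline that GPI is compared against, the snippet below implements classical policy iteration (exact policy evaluation plus greedy improvement) on a finite MDP in NumPy; GPI itself differs by switching a single state's action toward the polytope boundary and updating values immediately, which is not reproduced here.

```python
import numpy as np

def policy_value(P, r, policy, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^-1 r_pi.
    P: [A, S, S] transition tensor, r: [S, A] rewards, policy: [S] action indices."""
    S = r.shape[0]
    P_pi = P[policy, np.arange(S), :]
    r_pi = r[np.arange(S), policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_iteration(P, r, gamma, iters=100):
    """Classical policy iteration: evaluate the current policy exactly, then
    improve it greedily with respect to the resulting Q-values."""
    A, S, _ = P.shape
    policy = np.zeros(S, dtype=int)
    V = policy_value(P, r, policy, gamma)
    for _ in range(iters):
        V = policy_value(P, r, policy, gamma)
        Q = r + gamma * np.einsum("asz,z->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V
```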
A/B tests, also known as online controlled experiments, have been used at scale by data-driven enterprises to guide decisions and test innovative ideas. Meanwhile, nonstationarity, such as the time-of-day effect, can commonly arise in various business metrics. We show that inadequately addressing nonstationarity can cause A/B tests to be statistically inefficient or invalid, leading to wrong conclusions. To address these issues, we develop a new framework that provides appropriate modeling and adequate statistical analysis for nonstationary A/B tests. Without changing the infrastructure of any existing A/B test procedure, we propose a new estimator that views time as a continuous covariate to perform post stratification with a sample-dependent number of stratification levels. We prove a central limit theorem in a natural limiting regime under nonstationarity, so that valid large-sample statistical inference is available. We show that the proposed estimator achieves the optimal asymptotic variance among all estimators. When the experiment design phase of an A/B test allows, we propose a new time-grouped randomization approach to achieve a better balance of treatment and control assignments in the presence of time nonstationarity. A brief account of numerical experiments is provided to illustrate the theoretical analysis.
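A minimal sketch of post stratification with time as the stratification covariate is given below: bin time into quantile strata, take the within-stratum treatment-control difference, and weight by stratum size. Using a fixed number of strata (rather than the paper's sample-dependent choice) and skipping strata without overlap are simplifying assumptions.

```python
import numpy as np

def post_stratified_ate(y, treated, t, n_strata):
    """Post-stratified difference-in-means for an A/B test.
    y: [n] outcomes, treated: [n] boolean assignment, t: [n] timestamps."""
    edges = np.quantile(t, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, t, side="right") - 1, 0, n_strata - 1)
    est, n_total = 0.0, len(y)
    for s in range(n_strata):
        m = strata == s
        if treated[m].sum() == 0 or (~treated[m]).sum() == 0:
            continue  # skip strata with no treatment-control overlap (a simplification)
        diff = y[m & treated].mean() - y[m & ~treated].mean()
        est += (m.sum() / n_total) * diff
    return est
```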
Graph Neural Networks (GNNs) have exhibited their powerful ability to tackle nontrivial problems on graphs. However, as an extension of deep learning models to graphs, GNNs are vulnerable to noise or adversarial attacks due to the underlying perturbations propagating through the message passing scheme, which can dramatically affect the ultimate performance. Thus, it is vital to study a robust GNN framework to defend against various perturbations. In this paper, we propose a Robust Tensor Graph Convolutional Network (RT-GCN) model to improve the robustness. On the one hand, we utilize multi-view augmentation to reduce the augmentation variance and organize the views as a third-order tensor, followed by truncated T-SVD to capture the low-rankness of the multi-view augmented graph, which improves robustness from the perspective of graph preprocessing. On the other hand, to effectively capture the inter-view and intra-view information on the multi-view augmented graph, we propose a tensor GCN (TGCN) framework and analyze the mathematical relationship between TGCN and vanilla GCN, which improves robustness from the perspective of model architecture. Extensive experimental results verify the effectiveness of RT-GCN on various datasets, demonstrating its superiority over state-of-the-art models under diverse adversarial attacks on graphs.
Graph Neural Networks (GNNs) have been shown to be promising solutions for collaborative filtering (CF) with the modeling of user-item interaction graphs. The key idea of existing GNN-based recommender systems is to recursively perform message passing along user-item interaction edges to refine the encoded embeddings. Despite their effectiveness, however, most current recommendation models rely on sufficient and high-quality training data, such that the learned representations can well capture accurate user preference. User behavior data in many practical recommendation scenarios is often noisy and exhibits skewed distribution, which may result in suboptimal representation performance in GNN-based models. In this paper, we propose SHT, a novel Self-Supervised Hypergraph Transformer framework which augments user representations by exploring the global collaborative relationships in an explicit way. Specifically, we first empower the graph neural CF paradigm to maintain global collaborative effects among users and items with a hypergraph transformer network. With the distilled global context, a cross-view generative self-supervised learning component is proposed for data augmentation over the user-item interaction graph, so as to enhance the robustness of recommender systems. Extensive experiments demonstrate that SHT can significantly improve the performance over various state-of-the-art baselines. Further ablation studies show the superior representation ability of our SHT recommendation framework in alleviating the data sparsity and noise issues. The source code and evaluation datasets are available at: https://github.com/akaxlh/SHT.
Estimating the kernel mean in a reproducing kernel Hilbert space is central to many kernel-based learning algorithms. Given a finite sample, an empirical average is used as a standard estimation of the target kernel mean. Prior works have shown that better estimators can be constructed by shrinkage methods. In this work, we propose to corrupt data examples with noise from known distributions and present a new kernel mean estimator, called the marginalized kernel mean estimator, which estimates kernel mean under the corrupted distributions. Theoretically, we justify that the marginalized kernel mean estimator introduces implicit regularization in kernel mean estimation. Empirically, on a variety of tasks, we show that the marginalized kernel mean estimator is sample-efficient and obtains much lower estimation errors than the existing estimators.
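A sampling-based stand-in for the marginalized kernel mean estimator is sketched below: each sample is corrupted with Gaussian noise of a known scale, and the kernel values of the noisy copies are averaged at the query points, alongside the standard empirical kernel mean for comparison. The RBF kernel, the noise scale, and the Monte Carlo approximation (rather than any closed-form marginalization) are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def empirical_kernel_mean(X, query, bandwidth=1.0):
    """Standard estimator: average kernel values of the original samples."""
    return rbf_kernel(X, query, bandwidth).mean(axis=0)

def marginalized_kernel_mean(X, query, noise_std=0.1, n_copies=50,
                             bandwidth=1.0, rng=None):
    """Monte Carlo stand-in: corrupt each sample with Gaussian noise of known
    scale and average the kernel values of all noisy copies at the queries."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = X[None, :, :] + rng.normal(0.0, noise_std, size=(n_copies, *X.shape))
    vals = [rbf_kernel(noisy[c], query, bandwidth) for c in range(n_copies)]
    return np.mean(vals, axis=(0, 1))   # average over copies and samples
```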
Retrosynthetic planning, which aims to find a reaction pathway to synthesize a target molecule, plays an important role in chemistry and drug discovery. This task is usually modeled as a search problem. Recently, data-driven methods have attracted much research interest and shown promising results for retrosynthetic planning. We observe that the same intermediate molecules are visited many times during the search process, and they are usually treated independently in previous tree-based methods (e.g., AND-OR tree search, Monte Carlo tree search). Such redundancies make the search process inefficient. We propose a graph-based search policy that eliminates the redundant exploration of any intermediate molecule. As searching over a graph is more complicated than over a tree, we further adopt a graph neural network to guide the search over graphs. Meanwhile, our method can search a batch of targets together in the graph and remove the inter-target duplication of tree-based search methods. Experimental results on two datasets demonstrate the effectiveness of our method. Especially on the widely used USPTO benchmark, we improve the search success rate to 99.47%, advancing the previous state-of-the-art performance by 2.6 points.
Recent knowledge graph (KG) embeddings have been advanced by hyperbolic geometry due to its superior capability for representing hierarchies. The topological structures of real-world KGs, however, are rather heterogeneous, i.e., a KG is composed of multiple distinct hierarchies and non-hierarchical graph structures. Therefore, a homogeneous (either Euclidean or hyperbolic) geometry is not sufficient for fairly representing such heterogeneous structures. To capture the topological heterogeneity of KGs, we present an ultrahyperbolic KG embedding (UltraE) in an ultrahyperbolic (or pseudo-Riemannian) manifold that seamlessly interleaves hyperbolic and spherical manifolds. In particular, we model each relation as a pseudo-orthogonal transformation that preserves the pseudo-Riemannian bilinear form. The pseudo-orthogonal transformation is decomposed into various operators (i.e., circular rotations, reflections and hyperbolic rotations), allowing for simultaneously modeling heterogeneous structures as well as complex relational patterns. Experimental results on three standard KGs show that UltraE outperforms previous Euclidean, hyperbolic, and mixed-curvature KG embedding approaches.
Convolutional kernel networks (CKN) have been proposed to solve image classification tasks, and have shown competitive performance over classical neural networks while being easy to train and robust to overfitting. In real-world ordinal regression problems, we usually have plenty of unlabeled data but a limited number of labeled ordered data. Although recent research works have shown that directly optimizing AUC can impose a better ranking on the data than optimizing traditional error rate, it is still an open question to design an efficient semi-supervised ordinal regression AUC maximization algorithm based on CKN with convergence guarantee. To address this question, in this paper, we propose a new semi-supervised ordinal regression CKN algorithm (S^2 CKNOR) with end-to-end AUC maximization. Specifically, we decompose the ordinal regression into a series of binary classification subproblems and propose an unbiased non-convex objective function to optimize AUC, such that both labeled and unlabeled data can be used to enhance the model performance. Further, we propose a nested alternating minimization algorithm to solve the non-convex objective, where each (convex) subproblem is solved by a quadruply stochastic gradient algorithm, and the non-convex one is solved by the stochastic projected gradient method. Importantly, we prove that our S^2 CKNOR algorithm can finally converge to a critical point of the non-convex objective. Extensive experimental results demonstrate that our S^2 CKNOR achieves the best AUC results on various real-world datasets.
Trajectory prediction is a fundamental problem for a wide spectrum of location-based applications. Existing methods can achieve inspiring results when predicting personal frequent routes conditioned on massive historical data. However, trajectory estimation may involve cold-start routes or users due to the data sparsity problem, which severely limits the performance of spatial trajectory prediction. Although meta-learning models can alleviate the cold-start problem, they simply utilize the same initialization for all tasks and thus cannot fit each user well due to users' varying travel preferences. To this end, we propose an adaptive meta-optimized model called MetaPTP for personalized spatial trajectory prediction. Specifically, it adopts a soft-clustering-based method to guide the network initialization at a finer granularity, so that shared knowledge can be better transferred across users with similar travel preferences. Besides, for model fine-tuning, an effective trajectory sampling method is introduced to generate a meaningful support set, which simultaneously considers user preference and spatial trace similarities to provide task-related information for model adaptation. In addition, we design a weight generator to adaptively assign reasonable weights to trajectories in the support set, avoiding the sub-optimal results that occur when fine-tuning the initial network with the same weight for trajectories with different user preferences and spatial distributions. Finally, extensive experiments on two real-world datasets demonstrate the superiority of our model.
Graph representation learning has been extensively studied, and recent models can well incorporate both node features and graph structures. Despite this progress, the inherent scalability challenge of processing graph data and solving the downstream tasks (many of which are NP-hard) on classical computers is still a bottleneck for existing classical graph learning models. On the other hand, quantum computing is known as a promising direction owing to its theoretically verified scalability as well as the increasing evidence for access to physical quantum machines in the near term. Different from many existing classical-quantum hybrid machine learning models on graphs, in this paper we take a more aggressive initiative and develop a native quantum paradigm for (attributed) graph representation learning, which, to the best of our knowledge, has not been achieved in the literature yet. Specifically, our model adopts well-established theory and techniques from quantum computing, e.g., the quantum random walk, and adapts them to attributed graphs. The node attribute quantum state sequence is then fed into a quantum recurrent network to obtain the final node embedding. Experimental results on three public datasets show the effectiveness of our quantum model, which also notably outperforms the classical learning approach GraphRNA in terms of efficiency even on a classical computer. Although model parameters are still trained with the classical loss-based gradient descent paradigm, our computing scheme itself is compatible with quantum computing without involving classical computers. This is in fact largely in contrast to many hybrid quantum graph learning models, which often involve many steps and modules that have to be performed on classical computers.
This paper investigates a critical resource allocation problem in the first-party cloud: scheduling containers to machines. There are tens of services, and each service runs a set of homogeneous containers with dynamic resource usage; containers of a service are scheduled daily in a batch fashion. This problem can be naturally formulated as a Stochastic Bin Packing Problem (SBPP). However, traditional SBPP research often focuses on the case of empty machines, whose objective, i.e., minimizing the number of used machines, is not well-defined for the more common reality with nonempty machines. This paper aims to close this gap. First, we define a new objective metric, Used Capacity at Confidence (UCaC), which measures the maximum used resources at a given confidence level, prove that it is consistent for both empty and nonempty machines, and reformulate the SBPP under chance constraints. Second, by modeling the container resource usage distribution in a generative approach, we show that UCaC can be approximated with a Gaussian, which is verified by trace data of real-world applications. Third, we propose an exact solver that solves the equivalent cutting stock variant, as well as two heuristics-based solvers: UCaC best fit and bi-level heuristics. We experimentally evaluate these solvers on both synthetic datasets and real application traces, demonstrating our methodology's advantage over the traditional SBPP optimal solver that minimizes the number of used machines, with a low rate of resource violations.
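To make the UCaC objective concrete, here is a minimal sketch under the Gaussian approximation described above: the alpha-confidence used capacity of a machine is the alpha-quantile of the sum of its containers' usages (assumed independent here), and a hypothetical "UCaC best fit" placement rule is included for illustration. The function names, the independence assumption, and the placement rule are mine, not the paper's specification.

```python
import math
from statistics import NormalDist

def ucac(means, variances, alpha=0.99):
    # Used Capacity at Confidence under a Gaussian approximation: the
    # alpha-quantile of total usage, assuming independent container usages.
    mu = sum(means)
    sigma = math.sqrt(sum(variances))
    return mu + NormalDist().inv_cdf(alpha) * sigma

def ucac_best_fit(machines, new_container, capacity, alpha=0.99):
    # Illustrative placement rule: put the new container on the machine whose
    # post-placement UCaC is the largest while still fitting under capacity,
    # leaving looser machines free for future containers.
    new_mean, new_var = new_container
    best_idx, best_val = None, -1.0
    for idx, (means, variances) in enumerate(machines):
        val = ucac(means + [new_mean], variances + [new_var], alpha)
        if val <= capacity and val > best_val:
            best_idx, best_val = idx, val
    return best_idx  # None means a new machine has to be opened

machines = [([2.0, 3.0], [0.5, 0.2]), ([6.0], [1.0])]   # nonempty machines
print(ucac_best_fit(machines, new_container=(2.5, 0.4), capacity=10.0))
```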
Cloud-based learning is currently the mainstream in both academia and industry. However, the global data distribution, as a mixture of all the users' data distributions, used for training a global model may deviate from each user's local distribution at inference time, making the global model non-optimal for each individual user. To mitigate the distribution discrepancy, on-device training over local data for model personalization is a potential solution, but it suffers from serious overfitting. In this work, we propose a new device-cloud collaborative learning framework under the paradigm of domain adaptation, called MPDA, to break the dilemmas of purely cloud-based learning and on-device training. From the perspective of a certain user, the general idea of MPDA is to retrieve some similar data from the cloud's global pool, which functions as large-scale source domains, to augment the user's local data as the target domain. The key principle for choosing which outside data to retrieve is whether a model trained over these data can generalize well on the local data. We theoretically analyze that MPDA can reduce distribution discrepancy and overfitting risk. We also extensively evaluate over the public MovieLens 20M and Amazon Electronics datasets, as well as an industrial dataset collected from Mobile Taobao over a period of 30 days. We finally build a device-tunnel-cloud system pipeline, deploy MPDA in the icon area of Mobile Taobao for click-through rate prediction, and conduct online A/B testing. Both offline and online results demonstrate that MPDA outperforms the baselines of cloud-based learning and on-device training over local data only, in terms of multiple offline and online metrics.
Recently, many machine learning-based approaches that effectively solve graph optimization problems have been proposed. These approaches are usually trained on graphs randomly generated with graph generators or sampled from existing datasets. However, we observe that such training graphs lead to poor testing performance if the testing graphs are not generated analogously, i.e., the generalizability of models trained on those randomly generated training graphs is very limited. To address this critical issue, in this paper, we propose a new framework, named Learning with Iterative Graph Diversification (LIGD), and formulate a new research problem, named the Diverse Graph Modification Problem (DGMP), which iteratively generates diversified training graphs and trains models that solve graph optimization problems, significantly improving their performance. We propose three approaches to solve DGMP by considering both the performance of the machine-learning approaches and the structural properties of the training graphs. Experimental results on well-known problems show that our proposed approaches significantly boost the performance of both supervised and reinforcement learning approaches and produce near-optimal results, significantly outperforming the baseline approaches, such as graph augmentation and deep learning-based graph generation approaches.
Researchers recently started developing deep learning models capable of handling non-Euclidean data. However, because of existing framework limitations on model representations and learning algorithms, few have explored causal discovery on non-Euclidean data. This paper is the first attempt to do so. We start by proposing the Non-Euclidean Causal Model (NECM) which describes the causal generative relationship of non-Euclidean data and creates a new tensor data type along with a mapping process for the non-Euclidean causal mechanism. Second, within the NECM, we propose the non-Euclidean Hybrid Learning (NEHL) method, a causal discovery algorithm relying on the concept of the ball covariance recently introduced in the statistics field. Third, we generate two types of non-Euclidean datasets: Functional Data and Symmetric Positive Definite manifold data in conformity with the NECM. Finally, experimental results on the generated data and real-world data demonstrate the effectiveness of the proposed NEHL method.
Considering the prevalence of the power-law distribution in user-item networks, hyperbolic space has attracted considerable attention and achieved impressive performance in recommender systems recently. The advantage of hyperbolic recommendation lies in the fact that its exponentially increasing capacity is well-suited to describe power-law distributed user-item networks, whereas the Euclidean equivalent is deficient. Nonetheless, it remains unclear which kinds of items can be effectively recommended by the hyperbolic model and which cannot. To address the above concerns, we take the most basic recommendation technique, collaborative filtering, as a medium to investigate the behaviors of hyperbolic and Euclidean recommendation models. The results reveal that (1) tail items get more emphasis in hyperbolic space than in Euclidean space, but there is still ample room for improvement; (2) head items receive modest attention in hyperbolic space, which could be considerably improved; and (3) nonetheless, the hyperbolic models show more competitive performance than Euclidean models. Driven by the above observations, we design a novel learning method, named hyperbolic informative collaborative filtering (HICF), aiming to compensate for the recommendation effectiveness on head items while at the same time improving the performance on tail items. The main idea is to adapt hyperbolic margin ranking learning, making its pull and push procedure geometric-aware and providing informative guidance for the learning of both head and tail items. Extensive experiments back up the analytic findings and also show the effectiveness of the proposed method. The work is valuable for personalized recommendations, since it reveals that hyperbolic space facilitates modeling the tail items, which often represent user-customized preferences or new products.
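As a point of reference for the margin ranking learning mentioned above, a standard (non-geometric-aware) hyperbolic margin ranking objective in the Poincaré ball can be written as follows; HICF's contribution is to make the pull and push terms geometry-aware, which is not reflected in this baseline form.

```latex
\[
d_{\mathbb{B}}(\mathbf{u}, \mathbf{v})
 = \operatorname{arcosh}\!\Big(1 + \frac{2\,\lVert \mathbf{u}-\mathbf{v}\rVert^{2}}
   {(1-\lVert \mathbf{u}\rVert^{2})(1-\lVert \mathbf{v}\rVert^{2})}\Big),
\qquad
\mathcal{L} = \sum_{(u,\,i^{+},\,i^{-})}
 \max\big(0,\; d_{\mathbb{B}}(\mathbf{u}, \mathbf{i}^{+})
   - d_{\mathbb{B}}(\mathbf{u}, \mathbf{i}^{-}) + m\big),
\]
where \(i^{+}\) is an observed (positive) item, \(i^{-}\) a sampled negative item, and \(m\) the margin.
```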
Recently, research on dialogue systems has received wide attention, especially task-oriented dialogue systems, which have attracted increased interest due to their broad application prospects. As a core component, dialogue state tracking (DST) plays a key role in task-oriented dialogue systems, and its function is to parse natural language dialogues into dialogue states formed by slot-value pairs. Dialogue state tracking has been extensively studied and explored on current benchmark datasets such as MultiWOZ. However, almost all current research completely ignores the user negative feedback utterances that arise in real-life conversations when a system error occurs, which often contain user-provided corrective information for the system error. Obviously, user negative feedback utterances can be used to correct the inevitable errors in automatic speech recognition and model generalization. Thus, in this paper, we explore the role of negative feedback utterances in dialogue state tracking in detail through simulated negative feedback utterances. Specifically, due to the lack of datasets involving negative feedback utterances, we first define the schema of user negative feedback utterances and propose a joint modeling method for feedback utterance generation and filtering. Then, we explore three aspects of the interaction mechanism that should be considered in real-life conversations involving negative feedback utterances and propose evaluation metrics related to negative feedback utterances. Finally, on the WOZ2.0 and MultiWOZ2.1 datasets, by constructing simulated negative feedback utterances in training and testing, we not only verify the important role of negative feedback utterances in dialogue state tracking, but also analyze the advantages and disadvantages of different interaction mechanisms involving negative feedback utterances, shedding light on future research on negative feedback utterances.
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility in the table layout hinders the machine from understanding this valuable data. In order to unlock and utilize knowledge from tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of intricate correlations between cells. The correlations are presented implicitly in texts and visual appearances of tables, which can be roughly classified into Hierarchy and Juxtaposition. Although many studies have made considerable progress in data extraction from tables, most of them only consider hierarchical relationships but neglect the juxtapositions. Meanwhile, they only evaluate their methods on relatively small corpora. This paper proposes a new framework to extract numerical tuples from tables and evaluate it on a large test set. Specifically, we convert this task into a relation extraction problem between cells. To represent cells with their intricate correlations in tables, we propose a BERT-based pre-trained language model, TableLM, to encode tables with diverse layouts. To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on the dataset are conducted to demonstrate the superiority of our framework compared to a well-designed baseline.
Generalization across different environments with the same tasks is critical for the successful application of visual reinforcement learning (RL) in real scenarios. However, visual distractions---which are common in real scenes---in high-dimensional observations can be harmful to the learned representations in visual RL, thus degrading generalization performance. To tackle this problem, we propose a novel approach, namely Characteristic Reward Sequence Prediction (CRESP), to extract the task-relevant information by learning reward sequence distributions (RSDs), as reward signals are task-relevant in RL and invariant to visual distractions. Specifically, to effectively capture the task-relevant information via RSDs, CRESP introduces an auxiliary task---that is, predicting the characteristic functions of RSDs---to learn task-relevant representations, because we can well approximate the high-dimensional distributions by leveraging the corresponding characteristic functions. Experiments demonstrate that CRESP significantly improves generalization performance on unseen environments, outperforming several state-of-the-art methods on DeepMind Control tasks with different visual distractions.
The wide spread of fake news has caused serious societal issues. We propose a subgraph reasoning paradigm for fake news detection, which provides crystal-clear explainability by revealing which subgraphs of the news propagation network are most important for news verification, and concurrently improves the generalization and discrimination power of graph-based detection models by removing task-irrelevant information. In particular, we propose a reinforced subgraph generation method and perform fine-grained modeling on the generated subgraphs by developing a Hierarchical Path-aware Kernel Graph Attention Network. We also design a curriculum-based optimization method to ensure better convergence and train the two parts in an end-to-end manner.
Learning dynamic user preferences has become an increasingly important component for many online platforms (e.g., video-sharing sites, e-commerce systems) to make sequential recommendations. Previous works have made many efforts to model item-item transitions over user interaction sequences, based on various architectures, e.g., recurrent neural networks and the self-attention mechanism. Recently emerged graph neural networks also serve as useful backbone models to capture item dependencies in sequential recommendation scenarios. Despite their effectiveness, existing methods have thus far focused on item sequence representation with a single type of interaction, and are therefore limited in capturing the dynamic heterogeneous relational structures between users and items (e.g., page view, add-to-favorite, purchase). To tackle this challenge, we design a Multi-Behavior Hypergraph-enhanced Transformer framework (MBHT) to capture both short-term and long-term cross-type behavior dependencies. Specifically, a multi-scale Transformer is equipped with low-rank self-attention to jointly encode behavior-aware sequential patterns from fine-grained and coarse-grained levels. Additionally, we incorporate the global multi-behavior dependency into the hypergraph neural architecture to capture hierarchical long-range item correlations in a customized manner. Experimental results demonstrate the superiority of our MBHT over various state-of-the-art recommendation solutions across different settings. Further ablation studies validate the effectiveness of our model design and the benefits of the new MBHT framework. Our implementation code is released at: https://github.com/yuh-yang/MBHT-KDD22.
Computing trajectory similarities is a critical and fundamental task for various spatial-temporal applications, such as clustering, prediction, and anomaly detection. Traditional similarity metrics, e.g., DTW and Hausdorff, suffer from quadratic computation complexity, making them unable to handle large-scale data. To solve this problem, many trajectory representation learning techniques have been proposed to approximate the metric space while reducing the complexity of similarity computation. Nevertheless, these works are designed on top of RNN backbones, resulting in a serious performance decline on long trajectories. In this paper, we propose a novel graph-based method, namely TrajGAT, to explicitly model the hierarchical spatial structure and improve the performance of long trajectory similarity computation. TrajGAT consists of two main modules, i.e., graph construction and trajectory encoding. For graph construction, TrajGAT first employs a PR quadtree to build the hierarchical structure of the whole spatial area, and then constructs a graph for each trajectory based on the original records and the leaf nodes of the quadtree. For trajectory encoding, we replace the self-attention in the Transformer with graph attention and design an encoder to represent the generated graph trajectory. With these two modules, TrajGAT can capture the long-term dependencies of trajectories while reducing the GPU memory usage of the Transformer. Our experiments on two real-life datasets show that TrajGAT not only improves the performance on long trajectories but also significantly outperforms the state-of-the-art methods on mixed trajectories.
We consider training a binary classifier under delayed feedback (DF learning). In DF learning, we observe samples over time and learn a classifier at some point; samples initially arrive as negatives, and some of them later turn positive. For example, in conversion prediction for online ads, a sample corresponding to a user who clicked an ad but did not buy the item is initially received as negative, and it changes to positive once the user buys the item. This problem is conceivable in various real-world applications such as online advertising, where the user action takes place long after the first click. Owing to the delayed feedback, naive classification of the positive and negative samples returns a biased classifier. One solution is to use only samples that have been observed for more than a certain time window, assuming these samples are correctly labeled. However, existing studies reported that simply using a subset of all samples based on the time window assumption does not perform well, and that using all samples along with the time window assumption improves empirical performance. We extend these existing studies and propose a method with an unbiased and convex empirical risk constructed from all samples under the time window assumption. To demonstrate the soundness of the proposed method, we provide experimental results on a synthetic dataset and an open dataset of real traffic logs from online advertising.
Recent studies have shown great promise in applying graph neural networks to multivariate time series forecasting, where the interactions among time series are described by a graph structure and the variables are represented as graph nodes. Along this line, existing methods usually assume that the graph structure (or the adjacency matrix), which determines the aggregation manner of the graph neural network, is fixed either by definition or by self-learning. However, the interactions among variables can be dynamic and evolutionary in real-world scenarios. Furthermore, the interactions among time series are quite different when they are observed at different time scales. To equip the graph neural network with a flexible and practical graph structure, in this paper, we investigate how to model the evolutionary and multi-scale interactions of time series. In particular, we first provide a hierarchical graph structure coupled with dilated convolution to capture the scale-specific correlations among time series. Then, a series of adjacency matrices are constructed in a recurrent manner to represent the evolving correlations at each layer. Moreover, a unified neural network is provided to integrate the components above and obtain the final prediction. In this way, we can capture the pair-wise correlations and temporal dependencies simultaneously. Finally, experiments on both single-step and multi-step forecasting tasks demonstrate the superiority of our method over state-of-the-art approaches.
Generating text adversarial examples in the hard-label setting is a more realistic and challenging black-box adversarial attack problem, whose difficulty comes from the fact that gradients cannot be directly calculated from discrete word replacements. Consequently, the effectiveness of gradient-based methods for this problem still awaits improvement. In this paper, we propose a gradient-based optimization method named LeapAttack to craft high-quality text adversarial examples in the hard-label setting. Specifically, LeapAttack employs the word embedding space to characterize the semantic deviation between the two words of each perturbed substitution by their difference vector. Facilitated by this expression, LeapAttack gradually updates the perturbation direction and constructs adversarial examples in an iterative round trip: first, the gradient is estimated by transforming randomly sampled word candidates into continuous difference vectors after moving the current adversarial example near the decision boundary; second, the estimated gradient is mapped back to a new substitution word based on the cosine similarity metric. Extensive experimental results show that in the general case LeapAttack can efficiently generate high-quality text adversarial examples with the highest semantic similarity and the lowest perturbation rate in the hard-label setting.
Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regime (DTR) with reinforcement learning (RL) on electronic health records (EHR) has attracted interest from both the healthcare industry and machine learning research community. However, most learned DTR policies might be biased due to the existence of confounders. Although some treatment actions non-survivors received may be helpful, if confounders cause the mortality, the training of RL models guided by long-term outcomes (e.g., 90-day mortality) would punish those treatment actions causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions non-survivors received, we design a short-term reward to capture patients' immediate health state changes. Combining short-term with long-term rewards could further improve the model performance. Moreover, we introduce a policy adaptation method to successfully transfer the learned model to new-source small-scale datasets. The experimental results on one semi-synthetic and two different real-world datasets show the proposed model outperforms the state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.
This paper describes a new method for representing the embedding tables of graph neural networks (GNNs) more compactly via tensor-train (TT) decomposition. We consider the scenario where (a) the graph data lack node features, thereby requiring the learning of embeddings during training; and (b) we wish to exploit GPU platforms, where smaller tables are needed to reduce host-to-GPU communication even for large-memory GPUs. The use of TT enables a compact parameterization of the embedding, rendering it small enough to fit entirely on modern GPUs even for massive graphs. When combined with judicious schemes for initialization and hierarchical graph partitioning, this approach can reduce the size of node embedding vectors by 1,659 to 81,362 times on large publicly available benchmark datasets, achieving comparable or better accuracy and significant speedups on multi-GPU systems. In some cases, our model without explicit node features on input can even match the accuracy of models that use node features.
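To illustrate the tensor-train trick behind the compact embedding tables described above, the sketch below reconstructs one embedding row from small TT cores instead of storing the full table. The core shapes, the mixed-radix index factorization, and the toy sizes follow generic TT conventions chosen for illustration; the paper's initialization and graph-partitioning schemes are not reproduced here.

```python
import numpy as np

def tt_embedding_row(cores, index, row_shape):
    # cores[k] has shape (r_{k-1}, n_k, d_k, r_k) with r_0 = r_K = 1; the full
    # table would have prod(n_k) rows and prod(d_k) columns, but is never stored.
    # Factor the flat row index into mixed-radix digits (i_1, ..., i_K).
    digits = []
    for n_k in reversed(row_shape):
        digits.append(index % n_k)
        index //= n_k
    digits = digits[::-1]

    out = cores[0][:, digits[0], :, :]                    # (1, d_1, r_1)
    for k in range(1, len(cores)):
        core_slice = cores[k][:, digits[k], :, :]         # (r_{k-1}, d_k, r_k)
        out = np.einsum('apq,qbr->apbr', out, core_slice)
        out = out.reshape(1, -1, core_slice.shape[-1])    # (1, prod of d's, r_k)
    return out.reshape(-1)                                # embedding of length prod(d_k)

# Toy example: a 24 x 8 table represented by three small cores.
rng = np.random.default_rng(0)
cores = [rng.standard_normal((1, 2, 2, 3)),
         rng.standard_normal((3, 3, 2, 3)),
         rng.standard_normal((3, 4, 2, 1))]
print(tt_embedding_row(cores, index=17, row_shape=(2, 3, 4)).shape)  # (8,)
```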
Given a graph with partial observations of node features, how can we estimate the missing features accurately? Feature estimation is a crucial problem for analyzing real-world graphs whose features are commonly missing during the data collection process. Accurate estimation not only provides diverse information of nodes but also supports the inference of graph neural networks that require the full observation of node features. However, designing an effective approach for estimating high-dimensional features is challenging, since it requires an estimator to have large representation power, increasing the risk of overfitting. In this work, we propose SVGA (Structured Variational Graph Autoencoder), an accurate method for feature estimation. SVGA applies strong regularization to the distribution of latent variables by structured variational inference, which models the prior of variables as Gaussian Markov random field based on the graph structure. As a result, SVGA combines the advantages of probabilistic inference and graph neural networks, achieving state-of-the-art performance in real datasets.
Online anomaly detection from a data stream is critical for the safety and security of many applications but faces severe challenges due to complex and evolving data streams from IoT devices and cloud-based infrastructures. Unfortunately, existing approaches fall short of these challenges: online anomaly detection methods bear the burden of handling the complexity, while offline deep anomaly detection methods suffer from the evolving data distribution. This paper presents a framework for online deep anomaly detection, ARCUS, which can be instantiated with any autoencoder-based deep anomaly detection method. It handles complex and evolving data streams using an adaptive model pooling approach with two novel techniques: concept-driven inference and drift-aware model pool update; the former detects anomalies with a combination of models most appropriate for the complexity, and the latter adapts the model pool dynamically to fit the evolving data streams. In comprehensive experiments with ten data sets that are both high-dimensional and concept-drifted, ARCUS improved the anomaly detection accuracy of the streaming variants of state-of-the-art autoencoder-based methods and that of the state-of-the-art streaming anomaly detection methods by up to 22% and 37%, respectively.
Graph Neural Networks (GNNs) have been successfully applied to many real-world static graphs. However, the success of static graphs has not fully translated to dynamic graphs due to the limitations in model design, evaluation settings, and training strategies. Concretely, existing dynamic GNNs do not incorporate state-of-the-art designs from static GNNs, which limits their performance. Current evaluation settings for dynamic GNNs do not fully reflect the evolving nature of dynamic graphs. Finally, commonly used training methods for dynamic GNNs are not scalable. Here we propose ROLAND, an effective graph representation learning framework for real-world dynamic graphs. At its core, the ROLAND framework can help researchers easily repurpose any static GNN to dynamic graphs. Our insight is to view the node embeddings at different GNN layers as hierarchical node states and then recurrently update them over time. We then introduce a live-update evaluation setting for dynamic graphs that mimics real-world use cases, where GNNs are making predictions and being updated on a rolling basis. Finally, we propose a scalable and efficient training approach for dynamic GNNs via incremental training and meta-learning. We conduct experiments over eight different dynamic graph datasets on future link prediction tasks. Models built using the ROLAND framework achieve on average 62.7% relative mean reciprocal rank (MRR) improvement over state-of-the-art baselines under the standard evaluation settings on three datasets. We find state-of-the-art baselines experience out-of-memory errors for larger datasets, while ROLAND can easily scale to dynamic graphs with 56 million edges. After re-implementing these baselines using the ROLAND training strategy, ROLAND models still achieve on average 15.5% relative MRR improvement over the baselines.
Availability attacks, which poison the training data with imperceptible perturbations, can make the data not exploitable by machine learning algorithms so as to prevent unauthorized use of data. In this work, we investigate why these perturbations work in principle. We are the first to unveil an important population property of the perturbations of these attacks: they are almost linearly separable when assigned with the target labels of the corresponding samples, which hence can work as shortcuts for the learning objective. We further verify that linear separability is indeed the workhorse for availability attacks. We synthesize linearly-separable perturbations as attacks and show that they are as powerful as the deliberately crafted attacks. Moreover, such synthetic perturbations are much easier to generate. For example, previous attacks need dozens of hours to generate perturbations for ImageNet while our algorithm only needs several seconds. Our finding also suggests that the shortcut learning is more widely present than previously believed as deep models would rely on shortcuts even if they are of an imperceptible scale and mixed together with the normal features. Our source code is published at https://github.com/dayu11/Availability-Attacks-Create-Shortcuts.
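To make the "linearly separable perturbations as shortcuts" finding concrete, the sketch below shows one simple way to synthesize class-wise linearly separable perturbations: assign every class its own random direction and scale it to be imperceptible. The function name, the per-class random-direction construction, and the epsilon value are illustrative assumptions; the paper's synthesis procedure may differ in detail.

```python
import numpy as np

def synthesize_class_perturbations(labels, dim, num_classes, eps=8 / 255, seed=0):
    # Give every class its own random unit direction; the resulting perturbations
    # are linearly separable w.r.t. the labels by construction.
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((num_classes, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return eps * directions[labels]            # shape: (num_samples, dim)

labels = np.random.default_rng(1).integers(0, 10, size=1000)
delta = synthesize_class_perturbations(labels, dim=3 * 32 * 32, num_classes=10)
# Training on x + delta lets a model latch onto delta as a shortcut for the
# label instead of learning from x, which is the essence of the attack.
print(delta.shape)
```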
Heterogeneous graph convolutional networks have gained great popularity in tackling various network analytical tasks on heterogeneous network data, ranging from link prediction to node classification. However, most existing works ignore the relation heterogeneity of multiplex networks between multi-typed nodes and the differing importance of relations in meta-paths for node embedding, and thus can hardly capture the heterogeneous structural signals across different relations. To tackle this challenge, this work proposes a Multiplex Heterogeneous Graph Convolutional Network (MHGCN) for heterogeneous network embedding. Our MHGCN can automatically learn useful heterogeneous meta-path interactions of different lengths in multiplex heterogeneous networks through multi-layer convolution aggregation. Additionally, we effectively integrate both multi-relation structural signals and attribute semantics into the learned node embeddings under both unsupervised and semi-supervised learning paradigms. Extensive experiments on five real-world datasets with various network analytical tasks demonstrate the significant superiority of MHGCN over state-of-the-art embedding baselines in terms of all evaluation metrics. The source code of our method is available at: https://github.com/NSSSJSS/MHGCN.
In the ecology of short video platforms, the optimal exposure proportion of each video category is crucial to guide recommendation systems and content production in a macroscopic way. Though extensive studies on recommendation systems are devoted to providing the most well-matched videos for each view request, fitting the data without considering inherent biases such as selection bias and exposure bias will result in serious issues. In this paper, we formalize the exposure proportion strategy as a policy-making problem with multi-dimensional continuous treatment under certain constraints from a causal inference point of view. We propose a novel ensemble policy learning method based on causal trees, called Maximum Difference of Preference Point Forest (MDP2 Forest), which overcomes the shortcomings of existing policy learning approaches. Experimental results on both simulated and synthetic datasets show the superiority of our algorithm compared to other policy learning or causal inference methods in terms of the treatment estimation accuracy and the mean regret. Furthermore, the proposed MDP2 Forest method can also adapt to a wide range of business settings such as imposing different kinds of constraints on the multi-dimensional treatment.
In modern complex physical systems, advanced sensing technologies extend the sensor coverage but also increase the difficulties of improving system monitoring capabilities based on real-time data availability. Traditional model-based methods of sensor management are limited to specific systems/settings, which can be challenged when system knowledge is intractable. Fortunately, the large amount of data collected in real-time allows machine learning methods to be a complement. Especially, reinforcement learning-based control is recognized for its capability to dynamically interact with systems. However, the direct implementation of learning methods easily overfits and results in inaccurate physics modeling for sensor management. Although physical regularization is a popular direction to bridge the gap, learning-based sensor control still suffers from convergence failure under highly complex and uncertain scenarios. This paper develops physics-embedded and self-supervised reinforcement learning for sensor management using an intrinsic reward. Specifically, the intrinsic-motivated sensor management (IMSM) constructs the local surprise information from the physical latent features, which captures hidden states in observations, and thus intrinsically motivates the agent to speed-up exploration. We show that the designs can not only relieve the lack of consistency with underlying physics/physical dynamics, but also adapt the global objective of maximizing monitoring capabilities to local environment changes. We demonstrate its effectiveness by experiments on physical system sensor control. The proposed model is implemented for the sensor management of unmanned vehicles and sensor rescheduling in complex/settled power systems, with or without observability constraints. Numerical results show that our model provides consistently higher threat detection accuracy and better observability recovery, as compared to existing methods.
Zero-shot node classification is a very important challenge for classical semi-supervised node classification algorithms, such as the Graph Convolutional Network (GCN), which has been widely applied to node classification. In order to predict unlabeled nodes from unseen classes, zero-shot node classification needs to transfer knowledge from seen classes to unseen classes. It is crucial to consider the relations between the classes in zero-shot node classification. However, the GCN only considers the relations between the nodes, not the relations between the classes. Therefore, the GCN cannot handle zero-shot node classification effectively. This paper proposes Dual Bidirectional Graph Convolutional Networks (DBiGCN), which consist of dual BiGCNs from the perspective of the nodes and the classes, respectively. The BiGCN can integrate the relations between the nodes and between the classes simultaneously in a unified network. In addition, to make the dual BiGCNs work collaboratively, a label consistency loss is introduced, which can achieve mutual guidance and mutual improvement between the dual BiGCNs. Finally, the experimental results on real-world graph data sets verify the effectiveness of the proposed method.
Multimodal electronic health record (EHR) data are widely used in clinical applications. Conventional methods usually assume that each sample (patient) is associated with the unified observed modalities, and all modalities are available for each sample. However, missing modality caused by various clinical and social reasons is a common issue in real-world clinical scenarios. Existing methods mostly rely on solving a generative model that learns a mapping from the latent space to the original input space, which is an unstable ill-posed inverse problem. To relieve the underdetermined system, we propose a model solving a direct problem, dubbed learning with Missing Modalities in Multimodal healthcare data (M3Care). M3Care is an end-to-end model compensating the missing information of the patients with missing modalities to perform clinical analysis. Instead of generating raw missing data, M3Care imputes the task-related information of the missing modalities in the latent space by the auxiliary information from each patient's similar neighbors, measured by a task-guided modality-adaptive similarity metric, and thence conducts the clinical tasks. The task-guided modality-adaptive similarity metric utilizes the uncensored modalities of the patient and the other patients who also have the same uncensored modalities to find similar patients. Experiments on real-world datasets show that M3Care outperforms the state-of-the-art baselines. Moreover, the findings discovered by M3Care are consistent with experts and medical knowledge, demonstrating the capability and the potential of providing useful insights and explanations.
While the Variational Graph Auto-Encoder (VGAE) has presented a promising ability to learn representations for documents, most existing VGAE methods do not model a latent topic structure and therefore lack semantic interpretability. Exploring hidden topics within documents and discovering key words associated with each topic allow us to develop a semantic interpretation of the corpus. Moreover, documents are usually associated with authors. For example, news reports have journalists specializing in writing certain types of events, academic papers have authors with expertise in certain research topics, etc. Modeling authorship information could benefit topic modeling, since documents by the same authors tend to reveal similar semantics. This observation also holds for documents published at the same venues. However, most topic models ignore the auxiliary authorship and publication venues. Given the above two challenges, we propose a Variational Graph Author Topic Model for documents to integrate both semantic interpretability and authorship and venue modeling into a unified VGAE framework. For authorship and venue modeling, we construct a hierarchical multi-layered document graph with both intra- and cross-layer topic propagation. For semantic interpretability, three word relations (contextual, syntactic, semantic) are modeled and constitute three word sub-layers in the document graph. We further propose three alternatives for the variational divergence. Experiments verify the effectiveness of our model on supervised and unsupervised tasks.
Crowd simulation acts as the basic component in traffic management, urban planning, and emergency management. Most existing approaches use physics-based models due to their robustness and strong generalizability, yet they fall short in fidelity since human behaviors are too complex and heterogeneous for a universal physical model to describe. Recent research tries to solve this problem by deep learning methods. However, they are still unable to generalize well beyond training distributions. In this work, we propose to jointly leverage the strength of the physical and neural network models for crowd simulation by a Physics-Infused Machine Learning (PIML) framework. The key idea is to let the two models learn from each other by iteratively going through a physics-informed machine learning process and a machine-learning-aided physics discovery process. We present our realization of the framework with a novel neural network model, Physics-informed Crowd Simulator (PCS), and tailored interaction mechanisms enabling the two models to facilitate each other. Specifically, our designs enable the neural network model to identify generalizable signals from real-world data better and yield physically consistent simulations with the physical model's form and simulation results as a prior. Further, by performing symbolic regression on the well-trained neural network, we obtain improved physical models that better describe crowd dynamics. Extensive experiments on two publicly available large-scale real-world datasets show that, with the framework, we successfully obtain a neural network model with strong generalizability and a new physical model with valid physical meanings at the same time. Both models outperform existing state-of-the-art simulation methods in accuracy, fidelity, and generalizability, which demonstrates the effectiveness of the PIML framework for improving simulation performance and its capability for facilitating scientific discovery and deepening our understandings of crowd dynamics. We release the codes at https://github.com/tsinghua-fib-lab/PIML.
Graph few-shot learning seeks to alleviate the label scarcity problem resulting from the difficulty and high cost of data annotation in graph learning. However, the overwhelming majority of solutions in graph few-shot learning focus on homogeneous graphs, ignoring the ubiquitous heterogeneous graphs (HGs), which represent real-world complex systems and domain knowledge with multi-typed nodes interconnected by multi-typed edges. To this end, we study the cross-domain few-shot learning problem over HGs and develop a novel model for Cross-domain Heterogeneous Graph Meta learning (CrossHG-Meta). The general idea is to promote HG node classification in the data-scarce target domain by transferring meta-knowledge from a series of HGs in data-rich source domains. The key challenges are to 1) combat the heterogeneity in HGs to acquire transferable meta-knowledge; 2) handle the domain shifts between the source HGs and the target HG; and 3) fast adapt to novel target tasks with few-shot annotated examples. Regarding graph heterogeneity, CrossHG-Meta first builds a graph encoder to aggregate heterogeneous neighborhood information from multiple semantic contexts. Second, to tackle domain shifts, a cross-domain meta-learning strategy is proposed that includes a domain critic, which is designed to explicitly guide cross-domain adaptation for meta-tasks in different domains and improve model generalizability. Last, to further alleviate data scarcity, CrossHG-Meta leverages unlabeled information in source domains with an auxiliary self-supervised learning task to provide cross-domain contrastive regularization alongside the meta-optimization process and facilitate node embedding. Extensive experimental results on three multi-domain HG datasets demonstrate that the proposed model outperforms various state-of-the-art baselines for multiple few-shot node classification tasks under the cross-domain setting.
Negative pairs, especially hard negatives combined with common (easy-to-discriminate) negatives, are essential in contrastive learning, as they help avoid degenerate solutions in which representations are constant across different instances. Inspired by recent hard negative mining methods based on pairwise mixup operations in vision, we propose M-Mix, which dynamically generates a sequence of hard negatives. Compared with previous methods, M-Mix mainly has three features: 1) it adaptively chooses samples to mix; 2) it simultaneously mixes multiple samples; 3) it automatically assigns different mixing weights to the selected samples. We evaluate our method on two image datasets (CIFAR-10, CIFAR-100), five node classification datasets (PPI, DBLP, Pubmed, etc.), five graph classification datasets (IMDB, PTC_MR, etc.), and two downstream combinatorial tasks (graph edit distance and node clustering). Results show that it achieves state-of-the-art performance under self-supervised settings. Code is available at: https://github.com/Sherrylone/m-mix.
Electric Vehicles (EVs) have been emerging as a promising low-carbon transport option. While a large number of public charging stations are available, the use of these stations is often imbalanced, causing many problems for Charging Station Operators (CSOs). To this end, in this paper, we propose a Multi-Agent Graph Convolutional Reinforcement Learning (MAGC) framework to enable CSOs to achieve more effective use of these stations by providing dynamic pricing for each of the continuously arising charging requests while optimizing multiple long-term commercial goals. Specifically, we first formulate this charging station request-specific dynamic pricing problem as a mixed competitive-cooperative multi-agent reinforcement learning task, where each charging station is regarded as an agent. Moreover, by modeling the whole charging market as a dynamic heterogeneous graph, we devise a multi-view heterogeneous graph attention network to integrate the complex interplay between agents induced by their diversified relationships. Then, we propose a shared meta generator to generate individually customized dynamic pricing policies for large-scale yet diverse agents based on the extracted meta characteristics. Finally, we design a contrastive heterogeneous graph pooling representation module to learn a condensed yet effective state-action representation to facilitate policy learning for large-scale agents. Extensive experiments on two real-world datasets demonstrate the effectiveness of MAGC and empirically show that the overall use of stations can be improved if all the charging stations in a charging market embrace our dynamic pricing policy.
Simulating urban morphology with location attributes is a challenging task in urban science. Recent studies have shown that Generative Adversarial Networks (GANs) have the potential to shed light on this task. However, existing GAN-based models are limited by the sparsity of urban data and instability in model training, hampering their applications. Here, we propose a GAN framework with geographical knowledge, namely Metropolitan GAN (MetroGAN), for urban morphology simulation. We incorporate a progressive growing structure to learn hierarchical features and design a geographical loss to impose the constraints of water areas. Besides, we propose a comprehensive evaluation framework for the complex structure of urban systems. Results show that MetroGAN outperforms the state-of-the-art urban simulation methods by over 20% in all metrics. Encouragingly, using physical geography features alone, MetroGAN can still generate the shapes of cities. These results demonstrate that MetroGAN solves the instability problem of previous urban simulation GANs and is generalizable enough to deal with various urban attributes.
Graph Neural Networks (GNNs) have achieved great success in various graph mining tasks. However, drastic performance degradation is always observed when a GNN is stacked with many layers. As a result, most GNNs only have shallow architectures, which limits their expressive power and exploitation of deep neighborhoods. Most recent studies attribute the performance degradation of deep GNNs to the over-smoothing issue. In this paper, we disentangle the conventional graph convolution operation into two independent operations: Propagation (P) and Transformation (T). Following this, the depth of a GNN can be split into the propagation depth (Dp) and the transformation depth (Dt). Through extensive experiments, we find that the major cause for the performance degradation of deep GNNs is the model degradation issue caused by large Dt rather than the over-smoothing issue mainly caused by large Dp. Further, we present Adaptive Initial Residual (AIR), a plug-and-play module compatible with all kinds of GNN architectures, to alleviate the model degradation issue and the over-smoothing issue simultaneously. Experimental results on six real-world datasets demonstrate that GNNs equipped with AIR outperform most GNNs with shallow architectures owing to the benefits of both large Dp and Dt, while the time costs associated with AIR can be ignored.
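A minimal sketch of the Propagation/Transformation decoupling described above, with an initial-residual connection in the propagation step in the spirit of AIR. The fixed scalar residual weight, the ReLU transformation stack, and all function names here are illustrative assumptions; AIR presumably learns adaptive weights, which is not reproduced.

```python
import numpy as np

def decoupled_gnn(adj_norm, x, weights, dp, alpha=0.1):
    # dp controls the propagation depth; len(weights) controls the transformation
    # depth. alpha is a fixed initial-residual weight in this sketch.
    h0 = x
    h = x
    for _ in range(dp):                                  # P: parameter-free smoothing
        h = (1 - alpha) * (adj_norm @ h) + alpha * h0    # initial-residual style
    for W in weights:                                    # T: feature transformations
        h = np.maximum(h @ W, 0.0)                       # ReLU
    return h

# Toy usage on a 4-node cycle graph.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
A_norm = A / np.sqrt(np.outer(deg, deg))                 # symmetric normalization
X = np.random.default_rng(0).standard_normal((4, 8))
Ws = [np.random.default_rng(1).standard_normal((8, 16)),
      np.random.default_rng(2).standard_normal((16, 4))]
print(decoupled_gnn(A_norm, X, Ws, dp=10).shape)         # (4, 4)
```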
In streaming media applications, like music Apps, songs are recommended in a continuous way in users' daily life. The recommended songs are played automatically although users may not pay any attention to them, posing a challenge of user attention bias in training recommendation models, i.e., the training instances contain a large number of false-positive labels (users' feedback). Existing approaches either directly use the auto-feedbacks or heuristically delete the potential false-positive labels. Both of the approaches lead to biased results because the false-positive labels cause the shift of training data distribution, hurting the accuracy of the recommendation models. In this paper, we propose a learning-based counterfactual approach to adjusting the user auto-feedbacks and learning the recommendation models using Neural Dueling Bandit algorithm, called NDB. Specifically, NDB maintains two neural networks: a user attention network for computing the importance weights that are used for modifying the original rewards, and another random network trained with dueling bandit for conducting online recommendations based on the modified rewards. Theoretical analysis showed that the modified rewards are statistically unbiased, and the learned bandit policy enjoys a sub-linear regret bound. Experimental results demonstrated that NDB can significantly outperform the state-of-the-art baselines.
Graph neural networks (GNNs) are powerful tools for many web research problems. However, existing GNNs are not fully suitable for many real-world web applications. For example, over-smoothing may affect personalized recommendations, and the lack of an explanation for GNN predictions hinders the understanding of many business scenarios. To address these problems, in this paper, we propose a new second-order continuous GNN which naturally avoids over-smoothing and enjoys better interpretability. There has been research interest in continuous graph neural networks inspired by the recent success of neural ordinary differential equations (ODEs). However, there are some remaining problems w.r.t. the prevailing first-order continuous GNN frameworks. First, augmenting node features is an essential, yet heuristic, step for the numerical stability of current frameworks; second, first-order methods characterize a diffusion process, in which the over-smoothing effect on node representations is intrinsic; and third, there are difficulties in integrating the topology of graphs into the ODEs. Therefore, we propose a framework employing second-order graph neural networks, which usually learn a less stiff transformation than their first-order counterparts. Our method can also be viewed as a coupled first-order model, which is easy to implement. We propose a semi-model-agnostic method based on our model to enhance prediction explanation using high-order information. We construct an analogy between continuous GNNs and some famous partial differential equations and discuss some properties of the first- and second-order models. Extensive experiments demonstrate the effectiveness of our proposed method, and the results outperform related baselines.
Graph contrastive learning (GCL) improves graph representation learning, leading to state-of-the-art results on various downstream tasks. Graph augmentation is a vital but scarcely studied step of GCL. In this paper, we show that the node embeddings obtained via graph augmentations are highly biased, somewhat limiting contrastive models from learning discriminative features for downstream tasks. Thus, instead of investigating graph augmentation in the input space, we alternatively propose to perform augmentations on the hidden features (feature augmentation). Inspired by so-called matrix sketching, we propose COSTA, a novel Covariance-preServing feaTure space Augmentation framework for GCL, which generates augmented features by maintaining a "good sketch" of the original features. To highlight the superiority of feature augmentation with COSTA, we investigate a single-view setting (in addition to the multi-view one) which conserves memory and computation. We show that feature augmentation with COSTA achieves comparable or better results than graph-augmentation-based models.
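The idea of a covariance-preserving sketch can be illustrated with a few lines of numpy: project the N node embeddings onto k random directions so that second-order statistics are preserved in expectation. The sketching matrix choice, the dimension k, and the function name below are illustrative assumptions rather than COSTA's exact construction.

```python
import numpy as np

def sketch_features(h, k, seed=0):
    # h: (N, d) node embeddings from the encoder. Returns a (k, d) sketched
    # "view" R @ h, where R has i.i.d. N(0, 1/k) entries, so that
    # E[(R h).T (R h)] = h.T h, i.e., the feature covariance is preserved
    # in expectation while the view itself is randomized.
    n = h.shape[0]
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, n)) / np.sqrt(k)
    return R @ h

H = np.random.default_rng(0).standard_normal((2708, 128))   # e.g., Cora-sized embeddings
H_aug = sketch_features(H, k=1024)
err = np.linalg.norm(H_aug.T @ H_aug - H.T @ H) / np.linalg.norm(H.T @ H)
print(H_aug.shape, f"relative covariance error: {err:.3f}")  # error shrinks as k grows
```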
Automated event detection from news corpora is a crucial task towards mining fast-evolving structured knowledge. As real-world events have different granularities, from the top-level themes to key events and then to event mentions corresponding to concrete actions, there are generally two lines of research: (1) theme detection tries to identify from a news corpus major themes (e.g., "2019 Hong Kong Protests" versus "2020 U.S. Presidential Election") which have very distinct semantics; and (2) action extraction aims to extract from a single document mention-level actions (e.g., "the police hit the left arm of the protester") that are often too fine-grained for comprehending the real-world event. In this paper, we propose a new task, key event detection at the intermediate level, which aims to detect from a news corpus key events (e.g., "HK Airport Protest on Aug. 12-14"), each happening at a particular time/location and focusing on the same topic. This task can bridge event understanding and structuring and is inherently challenging because of (1) the thematic and temporal closeness of different key events and (2) the scarcity of labeled data due to the fast-evolving nature of news articles. To address these challenges, we develop an unsupervised key event detection framework, EvMine, that (1) extracts temporally frequent peak phrases using a novel ttf-itf score, (2) merges peak phrases into event-indicative feature sets by detecting communities from our designed peak phrase graph that captures document co-occurrences, semantic similarities, and temporal closeness signals, and (3) iteratively retrieves documents related to each key event by training a classifier with automatically generated pseudo labels from the event-indicative feature sets and refining the detected key events using the retrieved documents in each iteration. Extensive experiments and case studies show EvMine outperforms all the baseline methods and its ablations on two real-world news corpora.
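Since the abstract does not spell out the ttf-itf score, the snippet below gives one plausible reading purely for illustration: a phrase is "temporally frequent" if it has a high term frequency within a day but occurs in few days overall. The formula, function name, and data layout are hypothetical and should not be taken as EvMine's actual definition.

```python
import math
from collections import Counter

def ttf_itf_scores(phrase_counts_by_day):
    # phrase_counts_by_day: dict mapping a day to a Counter of phrase counts.
    # Score(day, phrase) = (term frequency within the day)
    #                      * log(#days / #days containing the phrase).
    days = list(phrase_counts_by_day)
    day_df = Counter()
    for day in days:
        day_df.update(set(phrase_counts_by_day[day]))
    scores = {}
    for day in days:
        total = sum(phrase_counts_by_day[day].values()) or 1
        for phrase, count in phrase_counts_by_day[day].items():
            tf = count / total
            itf = math.log(len(days) / day_df[phrase])
            scores[(day, phrase)] = tf * itf
    return scores

counts = {"08-12": Counter({"airport protest": 9, "police": 4}),
          "08-13": Counter({"airport protest": 7, "police": 5}),
          "09-01": Counter({"strike": 6, "police": 3})}
top = sorted(ttf_itf_scores(counts).items(), key=lambda kv: -kv[1])[:3]
print(top)  # phrases concentrated in few days score highest
```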
Federated learning (FL) is vulnerable to model poisoning attacks, in which malicious clients corrupt the global model by sending manipulated model updates to the server. Existing defenses mainly rely on Byzantine-robust or provably robust FL methods, which aim to learn an accurate global model even if some clients are malicious. However, they can only resist a small number of malicious clients. How to defend against model poisoning attacks with a large number of malicious clients remains an open challenge. Our FLDetector addresses this challenge by detecting malicious clients. FLDetector aims to detect and remove the majority of the malicious clients such that a Byzantine-robust or provably robust FL method can learn an accurate global model using the remaining clients. Our key observation is that, in model poisoning attacks, the model updates from a client across multiple iterations are inconsistent. Therefore, FLDetector detects malicious clients by checking the consistency of their model updates. Roughly speaking, the server predicts a client's model update in each iteration based on historical model updates and flags a client as malicious if the received model update and the predicted model update are inconsistent over multiple iterations. Our extensive experiments on three benchmark datasets show that FLDetector can accurately detect malicious clients under multiple state-of-the-art model poisoning attacks and adaptive attacks tailored to FLDetector. After removing the detected malicious clients, existing Byzantine-robust FL methods can learn accurate global models.
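A minimal sketch of the consistency-checking logic, with a deliberately naive predictor that reuses each client's previous update (FLDetector's actual update predictor is more sophisticated; function names and the threshold are our assumptions):

```python
# Sketch: flag clients whose received updates repeatedly disagree with a predicted update.
import numpy as np

def suspicion_scores(history, current):
    """history: dict client_id -> previous update (1D array); current: dict client_id -> update."""
    scores = {}
    for cid, upd in current.items():
        pred = history.get(cid, np.zeros_like(upd))          # naive prediction: last update
        scores[cid] = np.linalg.norm(upd - pred) / (np.linalg.norm(pred) + 1e-8)
    return scores

def flag_malicious(score_window, threshold=2.0):
    """score_window: dict client_id -> list of suspicion scores over recent rounds."""
    return [cid for cid, s in score_window.items() if np.mean(s) > threshold]
```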
In many real-world applications, data are collected in a streaming fashion, and their accurate labels are hard to obtain. For instance, in environmental monitoring tasks, sensors collect data continuously, yet labels are scarce because the labeling process requires human effort and may introduce annotation errors. This paper investigates the problem of learning with weakly labeled data streams, in which data are continuously collected and only a limited subset of the streaming data is labeled, potentially with noise. This setting is challenging and of great importance but rarely studied in the literature. When data are constantly gathered with unknown label noise, it is quite challenging to design algorithms that obtain a well-generalized classifier. To address this difficulty, we propose a novel noise transition matrix estimation approach for data streams with scarce noisy labels based on online anchor point identification. Building on that, we propose an adaptive learning algorithm for weakly labeled data streams via model reuse that effectively alleviates the negative influence of label noise with unlabeled data. Both theoretical analysis and extensive experiments justify and validate the effectiveness of the proposed approach.
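For context, a minimal offline sketch of anchor-point-based transition matrix estimation is shown below; the paper's online variant for streams with scarce noisy labels is more involved, and the function name is ours. Given a model's noisy-class posteriors, the anchor for true class i is the sample with the highest posterior for i, and row i of T is that sample's posterior vector.

```python
# Sketch: estimate the label-noise transition matrix T from noisy-class posteriors
# using anchor points (generic offline version, not the paper's streaming method).
import numpy as np

def estimate_transition_matrix(noisy_posteriors):
    """noisy_posteriors: (N, C) estimated P(noisy label = j | x)."""
    n, c = noisy_posteriors.shape
    T = np.zeros((c, c))
    for i in range(c):
        anchor = np.argmax(noisy_posteriors[:, i])   # likely a clean class-i sample
        T[i] = noisy_posteriors[anchor]              # row i: P(noisy = j | true = i)
    return T / T.sum(axis=1, keepdims=True)
```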
The fairness-aware online learning framework has arisen as a powerful tool for the continual lifelong learning setting. The learner's goal is to sequentially learn new tasks as they arrive over time, while ensuring statistical parity of each new task across different protected sub-populations (e.g., race and gender). A major drawback of existing methods is that they make heavy use of the i.i.d. assumption for the data and hence provide only static regret analysis for the framework. However, low static regret does not imply good performance in changing environments where tasks are sampled from heterogeneous distributions. To address the fairness-aware online learning problem in changing environments, in this paper, we first construct a novel regret metric, FairSAR, by adding long-term fairness constraints onto a strongly adapted loss regret. Furthermore, to determine good model parameters at each round, we propose a novel adaptive fairness-aware online meta-learning algorithm, FairSAOML, which is able to adapt to changing environments in both bias control and model precision. The problem is formulated as a bi-level convex-concave optimization with respect to the model's primal and dual parameters, which are associated with the model's accuracy and fairness, respectively. The theoretical analysis provides sub-linear upper bounds for both the loss regret and the violation of cumulative fairness constraints. Our experimental evaluation on different real-world datasets with changing environments suggests that the proposed FairSAOML significantly outperforms alternatives based on the best prior online learning approaches.
With the increasing demand for the protection of personal network metadata, encrypted networks have grown in popularity, and so has the challenge of monitoring and analyzing encrypted network traffic. Several deep learning-based methods have been proposed to leverage statistical features for encrypted traffic classification, which are barely affected by encryption techniques. However, these works still suffer from two main intrinsic limitations: (1) the feature extraction process lacks a mechanism that accounts for correlations between flows in a flow sequence; and (2) a large volume of manually labeled data is required to train an effective deep classifier. In this paper, we propose a novel semi-supervised framework to address these problems. Specifically, an efficient classifier with an attention mechanism is proposed to extract features from flow sequences at low computational cost. Then, a Mean Teacher-style semi-supervised framework is adopted to exploit the unlabeled traffic data, where a spatiotemporal data augmentation method is designed as the key component to explore the spatial and temporal relationships within the unlabeled traffic data. Experimental results on two real-world traffic datasets demonstrate that our method outperforms state-of-the-art methods by a large margin.
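As a generic sketch of the Mean Teacher-style component (not the paper's exact setup), the snippet below keeps the teacher as an exponential moving average of the student and applies a consistency loss on unlabeled flow sequences; `augment` is an assumed augmentation callable standing in for the paper's spatiotemporal augmentation.

```python
# Generic Mean Teacher sketch: EMA teacher weights + consistency loss on unlabeled data.
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)        # teacher is never updated by gradients
    return teacher

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(decay).add_(sp, alpha=1 - decay)   # exponential moving average

def consistency_loss(student, teacher, x_unlabeled, augment):
    with torch.no_grad():
        target = F.softmax(teacher(augment(x_unlabeled)), dim=-1)
    pred = F.softmax(student(augment(x_unlabeled)), dim=-1)
    return F.mse_loss(pred, target)
```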
Tree models are widely used in machine learning and data mining practice. In this paper, we study the problem of model integrity authentication in tree models. In general, model integrity authentication is the design and implementation of mechanisms for checking/detecting whether the model deployed for end-users has been tampered with or compromised, e.g., through malicious modifications of the model. We propose an authentication framework that enables model builders/distributors to embed a signature into a tree model and authenticate the existence of the signature by making only a small number of black-box queries to the model. To the best of our knowledge, this is the first study of signature embedding in tree models. Our proposed method simply locates a collection of leaves and modifies their prediction values, which requires neither training/testing data nor any re-training. Experiments on a large number of public classification datasets confirm that the proposed signature embedding process has a high success rate while introducing only a minimal accuracy loss.
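A toy illustration of the leaf-modification idea (ours, not the paper's algorithm) on a scikit-learn decision tree: pick a few secret key inputs, overwrite the prediction values of the leaves they reach to encode a bit pattern, then authenticate with black-box predict() queries. It assumes the key inputs land in distinct leaves.

```python
# Toy signature embedding in a decision tree by editing leaf values in place.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 10))          # secret key inputs (assumed to hit distinct leaves)
signature = np.array([1, 0, 1, 1])       # signature bits to embed

leaf_ids = clf.apply(keys)               # leaves reached by the key inputs
for leaf, bit in zip(leaf_ids, signature):
    onehot = np.zeros_like(clf.tree_.value[leaf])
    onehot[0, bit] = 1.0                 # force the leaf to predict class `bit`
    clf.tree_.value[leaf] = onehot       # in-place modification of leaf prediction values

assert np.array_equal(clf.predict(keys), signature)   # black-box authentication
```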
With the advent of big data across multiple high-impact applications, we often face the challenge of complex heterogeneity. Newly collected data usually consist of multiple modalities and are characterized by multiple labels, thus exhibiting the co-existence of multiple types of heterogeneity. Although state-of-the-art techniques are good at modeling complex heterogeneity with sufficient label information, such label information can be quite expensive to obtain in real applications. Recently, researchers have paid great attention to contrastive learning due to its prominent performance in utilizing rich unlabeled data. However, existing work on contrastive learning is not able to address the problem of false-negative pairs, i.e., some 'negative' pairs may have similar representations if they share the same label. To overcome these issues, in this paper, we propose a unified heterogeneous learning framework, which combines a weighted unsupervised contrastive loss and a weighted supervised contrastive loss to model multiple types of heterogeneity. We first provide a theoretical analysis showing that the vanilla contrastive learning loss easily leads to sub-optimal solutions in the presence of false-negative pairs, whereas the proposed weighted loss automatically adjusts the weights based on the similarity of the learned representations to mitigate this issue. Experimental results on real-world datasets demonstrate the effectiveness and efficiency of the proposed framework in modeling multiple types of heterogeneity.
Graph Neural Networks (GNNs) have been widely used for modeling graph-structured data. Recent breakthroughs have improved the scalability of GNNs to graphs with millions of nodes. However, how to instantly represent continuous changes of large-scale dynamic graphs with GNNs is still an open problem. Existing dynamic GNNs focus on modeling the periodic evolution of graphs, often on a snapshot basis. Such methods suffer from two drawbacks: first, there is a substantial delay before changes in the graph are reflected in the graph representations, resulting in a loss of model accuracy; second, repeatedly calculating the representation matrix over the entire graph for each snapshot is prohibitively time-consuming and severely limits scalability. In this paper, we propose Instant Graph Neural Network (InstantGNN), an incremental computation approach for the graph representation matrix of dynamic graphs. Designed for dynamic graphs under the edge-arrival model, our method avoids time-consuming, repetitive computations and allows instant updates of the representation and instant predictions. Graphs with dynamic structures and dynamic attributes are both supported. We also provide upper bounds on the time complexity of these updates. Furthermore, our method provides an adaptive training strategy, which guides the model to retrain at moments when it can make the greatest performance gains. We conduct extensive experiments on several real-world and synthetic datasets. Empirical results demonstrate that our model achieves state-of-the-art accuracy while being orders of magnitude more efficient than existing methods.
A common workflow for single-cell RNA-sequencing (sc-RNA-seq) data analysis is to orchestrate a three-step pipeline. First, conduct a dimension reduction of the input cell profile matrix; second, cluster the cells in the latent space; and third, extract the "gene panels" that distinguish a certain cluster from others. This workflow has the primary drawback that the three steps are performed independently, neglecting the dependencies among the steps and among the marker genes or gene panels. In our system, KRATOS, we alter the three-step workflow to a two-step one, where we jointly optimize the first two steps and add the third (interpretability) step to form an integrated sc-RNA-seq analysis pipeline. We show that the more compact workflow of KRATOS extracts marker genes that can better discriminate the target cluster, distilling underlying mechanisms guiding cluster membership. In doing so, KRATOS is significantly better than the two SOTA baselines we compare against, specifically 5.62% superior to Global Counterfactual Explanation (GCE) [ICML-20], and 3.31% better than Adversarial Clustering Explanation (ACE) [ICML-21], measured by the AUROC of a kernel-SVM classifier. We opensource our code and datasets here: https://github.com/icanforce/single-cell-genomics-kratos.
Molecular representation learning has attracted much attention recently. A molecule can be viewed as a 2D graph with nodes/atoms connected by edges/bonds, and can also be represented by a 3D conformation with 3-dimensional coordinates of all atoms. We note that most previous work handles 2D and 3D information separately, while jointly leveraging these two sources may foster a more informative representation. In this work, we explore this appealing idea and propose a new representation learning method based on a unified 2D and 3D pre-training. Atom coordinates and interatomic distances are encoded and then fused with atomic representations through graph neural networks. The model is pre-trained on three tasks: reconstruction of masked atoms and coordinates, 3D conformation generation conditioned on 2D graph, and 2D graph generation conditioned on 3D conformation. We evaluate our method on 11 downstream molecular property prediction tasks: 7 with 2D information only and 4 with both 2D and 3D information. Our method achieves state-of-the-art results on 10 tasks, and the average improvement on 2D-only tasks is 8.3%. Our method also achieves significant improvement on two 3D conformation generation tasks.
We bridge two research directions on graph neural networks (GNNs), by formalizing the relation between heterophily of node labels (i.e., connected nodes tend to have dissimilar labels) and the robustness of GNNs to adversarial attacks. Our theoretical and empirical analyses show that for homophilous graph data, impactful structural attacks always lead to reduced homophily, while for heterophilous graph data the change in the homophily level depends on the node degrees. These insights have practical implications for defending against attacks on real-world graphs: we deduce that separate aggregators for ego- and neighbor-embeddings, a design principle which has been identified to significantly improve prediction for heterophilous graph data, can also offer increased robustness to GNNs. Our comprehensive experiments show that GNNs merely adopting this design achieve improved empirical and certifiable robustness compared to the best-performing unvaccinated model. Additionally, combining this design with explicit defense mechanisms against adversarial attacks leads to an improved robustness with up to 18.33% performance increase under attacks compared to the best-performing vaccinated model.
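The design principle referenced above can be sketched as a single message-passing layer that keeps separate weight matrices for the ego embedding and the aggregated neighbor embedding; the layer and variable names below are ours, not a specific model from the paper.

```python
# Sketch: separate ego- and neighbor-embedding aggregation in one GNN layer.
import torch
import torch.nn as nn

class EgoNeighborLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_ego = nn.Linear(in_dim, out_dim)   # transforms the node's own embedding
        self.w_nbr = nn.Linear(in_dim, out_dim)   # transforms the aggregated neighbors

    def forward(self, x, a_hat):
        """x: (N, in_dim) node features; a_hat: (N, N) row-normalized adjacency with
        zero diagonal, so ego and neighbor information are aggregated separately."""
        return torch.relu(self.w_ego(x) + self.w_nbr(a_hat @ x))
```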
With the tremendous prevalence of online social media platforms, interactions among individuals have been unprecedentedly enhanced. People freely interact with acquaintances and express and exchange their opinions through commenting, liking, and retweeting on online social media, leading to resistance, controversy, and other important phenomena around controversial social issues, which have been the subject of many recent works. In this paper, we study the problem of minimizing the risk of conflict in social networks by modifying the initial opinions of a small number of nodes. We show that the objective function of this combinatorial optimization problem is monotone and supermodular. We then propose a naive greedy algorithm with a (1-1/e) approximation ratio that solves the problem in cubic time. To overcome the computational challenge for large networks, we further integrate several effective approximation strategies to provide a nearly linear time algorithm with a (1-1/e-ε) approximation ratio for any error parameter ε>0. Extensive experiments on various real-world datasets demonstrate both the efficiency and effectiveness of our algorithms. In particular, the faster algorithm scales to large networks with more than two million nodes and achieves up to 20x speed-up over the state-of-the-art algorithm.
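For readers unfamiliar with the greedy scheme referenced above, the following sketch shows the generic marginal-gain loop for selecting k nodes, in the spirit of the naive greedy variant; `cost_fn` is an assumed black-box conflict-risk oracle, and the fast nearly-linear-time algorithm uses additional approximation machinery not shown here.

```python
# Generic greedy selection: at each step, add the node whose opinion modification
# yields the largest marginal drop in the conflict-risk objective.
def greedy_select(candidates, k, cost_fn):
    """candidates: iterable of node ids; cost_fn(S) -> conflict risk for node set S."""
    selected, current = set(), cost_fn(set())
    for _ in range(k):
        best_node, best_cost = None, current
        for v in candidates:
            if v in selected:
                continue
            c = cost_fn(selected | {v})
            if c < best_cost:
                best_node, best_cost = v, c
        if best_node is None:            # no further improvement possible
            break
        selected.add(best_node)
        current = best_cost
    return selected
```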
Business processes in workflows comprise an ordered sequence of tasks and decisions to accomplish certain business goals. Each decision point requires the input of a decision-maker, who must distill complex case information and make an optimal decision given their experience, organizational policy, and external contexts. Overlooking some of the essential factors, or a lack of knowledge, can impact throughput and business outcomes. Therefore, we propose an end-to-end automated decision support system with explanations for business processes. The system uses the proposed process-aware feature engineering methodology, which extracts features from process and business data attributes. The system helps a decision-maker make quick and high-quality decisions by predicting the decision and providing an explanation of the factors that led to the prediction. We provide offline and online training methods that are robust to data drift and can also incorporate user feedback. The system also supports predictions with live instance data, i.e., it allows decision-makers to conduct trials on the current data instance by modifying its business data attribute values. We evaluate our system on real-world and synthetic datasets and benchmark the performance, achieving an average improvement of 15% over baselines.
The rapid increase in mobile data traffic and the number of connected devices and applications in networks is putting significant pressure on current network management approaches, which heavily rely on human operators. Consequently, an automated network management system that can efficiently predict and detect anomalies is needed. In this paper, we propose RCAD, a novel distributed architecture for detecting anomalies in network data-forwarding latency in an unsupervised fashion. RCAD employs the hierarchical temporal memory (HTM) algorithm for the online detection of anomalies. It also involves a collaborative distributed learning module that facilitates knowledge sharing across the system. We implement and evaluate RCAD on real-world measurements from a commercial mobile network. RCAD achieves an F1 score of over 0.7, significantly outperforming current state-of-the-art methods.
In recent years, the deep reinforcement learning community has achieved impressive success in tackling real-world challenges. In this work, we propose a novel deep reinforcement learning agent to perform floorplanning, one of the early stages of VLSI physical design. Traditional methods for the floorplanning problem are intractable for large circuit netlists and cannot learn from past experience. We adopt the domain knowledge of floorplanning representations and propose a learning-based method that directly predicts block id and location through an RL framework. The resulting solutions are platform-independent and can be converted into a layout within $O(n)$ time. We encode the hypernet information in the circuit netlist in a one-to-one mapping through hypergraph neural networks. Furthermore, we deploy transformer-like action selection to allow for transferability and generalization across netlist circuits of different sizes and to handle the large discrete action space. This allows the parameter space of our model to remain the same regardless of the number of blocks. Our RL agent is able to transfer previously learned knowledge to quickly optimize a new design of different size and purpose. To our knowledge, this is the first work to select both block id and position with an entirely end-to-end learning-based framework that can generalize. Results on the publicly available GSRC and MCNC benchmarks demonstrate that our method outperforms the baselines while being able to generalize.
Learned embeddings for products are an important building block for web-scale e-commerce recommendation systems. At Pinterest, we build a single set of product embeddings called ItemSage to provide relevant recommendations in all shopping use cases including user, image and search based recommendations. This approach has led to significant improvements in engagement and conversion metrics, while reducing both infrastructure and maintenance cost. While most prior work focuses on building product embeddings from features coming from a single modality, we introduce a transformer-based architecture capable of aggregating information from both text and image modalities and show that it significantly outperforms single modality baselines. We also utilize multi-task learning to make ItemSage optimized for several engagement types, leading to a candidate generation system that is efficient for all of the engagement objectives of the end-to-end recommendation system. Extensive offline experiments are conducted to illustrate the effectiveness of our approach and results from online A/B experiments show substantial gains in key business metrics (up to +7% gross merchandise value/user and +11% click volume).
Company financial risk is ubiquitous, and early risk assessment for listed companies can avoid considerable losses. Traditional methods mainly focus on the financial statements of companies and ignore the complex relationships among them. Moreover, financial statements are often biased and lagged, making it difficult to identify risks accurately and in a timely manner. To address these challenges, we redefine the problem as company financial risk assessment on a tribe-style graph by taking each listed company and its shareholders as a tribe and leveraging financial news to build inter-tribe connections. Such tribe-style graphs present distinct patterns that distinguish risky companies from normal ones. However, most nodes in the tribe-style graph lack attributes, making it difficult to directly adopt existing graph learning methods (e.g., Graph Neural Networks (GNNs)). In this paper, we propose a novel hierarchical graph neural network for tribe-style graphs (TH-GNN) with two levels: the first level encodes the structural patterns of the tribes with contrastive learning, and the second level diffuses information based on the inter-tribe relations, achieving effective and efficient risk assessment. Extensive experiments on a real-world company dataset show that our method achieves significant improvements in financial risk assessment over previous competing methods. Extensive ablation studies and visualizations further demonstrate the effectiveness of our method.
Chit-chat has been shown effective in engaging users in human-computer interaction. We find with a user study that generating appropriate chit-chat for news articles can help expand user interest and increase the probability that a user reads a recommended news article. Based on this observation, we propose a method to generate personalized chit-chat for news recommendation. Different from existing methods for personalized text generation, our method only requires an external chat corpus obtained from an online forum, which can be disconnected from the recommendation dataset from both the user and item (news) perspectives. This is achieved by designing a weak supervision method for estimating users' personalized interest in a chit-chat post by transferring knowledge learned by a news recommendation model. Based on the method for estimating user interest, a reinforcement learning framework is proposed to generate personalized chit-chat. Extensive experiments, including the automatic offline evaluation and user studies, demonstrate the effectiveness of our method.
Click-Through Rate (CTR) prediction, which estimates the probability of a user clicking on an item, plays a key role in sponsored search. E-commerce platforms display organic search results and advertisements (ads), collectively called items, together as a mixed list. The items displayed around the predicted ad, i.e., external items, may affect whether the user clicks on the predicted ad. Previous CTR models assume the user click relies only on the ad itself, which overlooks the effects of external items, referred to as external effects, or externalities. At prediction time, the organic results have already been generated by the organic system, while the final displayed ads on the multiple ad slots have not yet been determined, which leads to two challenges: 1) the predicted (target) ad may win any ad slot, bringing about diverse externalities; and 2) external ads are undetermined, resulting in incomplete externalities. Facing the above challenges and inspired by the Transformer, we propose the EXternality TRansformer (EXTR), which regards the target ad with all slots as the query and external items as the key and value to model externalities in all exposure situations in parallel. Furthermore, we design a Potential Allocation Generator (PAG) for EXTR to learn the allocation of potential external ads and complete the externalities. Extensive experimental results on Alibaba datasets demonstrate the effectiveness of modeling externalities in CTR prediction and illustrate that our proposed approach can bring significant profits to a real-world e-commerce platform. EXTR has now been successfully deployed in the online search advertising system at Alibaba, serving the main traffic.
Epilepsy is one of the most serious neurological diseases, affecting 1-2% of the world's population. The diagnosis of epilepsy depends heavily on the recognition of epileptic waves, i.e., disordered electrical brainwave activity in the patient's brain. Existing works have begun to employ machine learning models to detect epileptic waves via cortical electroencephalogram (EEG), which refers to brain data obtained from a noninvasive examination performed on the patient's scalp surface to record electrical activity in the brain. However, the recently developed stereoelectrocorticography (SEEG) method provides information in stereo that is more precise than conventional EEG, and has been broadly applied in clinical practice. Therefore, in this paper, we propose the first data-driven study to detect epileptic waves in a real-world SEEG dataset. While offering new opportunities, SEEG also poses several challenges. In clinical practice, epileptic wave activities are considered to propagate between different regions in the brain. These propagation paths, also known as the epileptogenic network, are deemed to be a key factor in the context of epilepsy surgery. However, the question of how to extract an exact epileptogenic network for each patient remains an open problem in the field of neuroscience. Moreover, the nature of epileptic waves and SEEG data inevitably leads to extremely imbalanced labels and severe noise. To address these challenges, we propose a novel model (BrainNet) that jointly learns the dynamic diffusion graphs and models the brain wave diffusion patterns. In addition, our model effectively aids in resisting label imbalance and severe noise by employing several self-supervised learning tasks and a hierarchical framework. By experimenting with the extensive real SEEG dataset obtained from multiple patients, we find that BrainNet outperforms several latest state-of-the-art baselines derived from time-series analysis.
This paper proposes a graph-based meta learning approach to separately predict water quantity and quality variables for river segments in stream networks. Given the heterogeneous water dynamic patterns in large-scale basins, we introduce an additional meta-learning condition based on physical characteristics of stream segments, which allows learning different sets of initial parameters for different stream segments. Specifically, we develop a representation learning method that leverages physical simulations to embed the physical characteristics of each segment. The obtained embeddings are then used to cluster river segments and add the condition for the meta-learning process. We have tested the performance of the proposed method for predicting daily water temperature and streamflow for the Delaware River Basin (DRB) over a 14 year period. The results confirm the effectiveness of our method in predicting target variables even using sparse training samples. We also show that our method can achieve robust performance with different numbers of clusterings.
Benford's law describes the distribution of the first digit of numbers appearing in a wide variety of numerical data, including tax records and election outcomes, and has been used to raise "red flags" about potential anomalies in the data, such as tax evasion. In this work, we ask the following novel question:
Given a large transaction or financial graph, how do we find a set of nodes that perform many transactions among each other that also deviate significantly from Benford's law?
We propose the AntiBenford subgraph framework that is founded on well-established statistical principles. Furthermore, we design an efficient algorithm that finds AntiBenford subgraphs in near-linear time on real data. We evaluate our framework on both real and synthetic data against a variety of competitors. We show empirically that our proposed framework enables the detection of anomalous subgraphs in cryptocurrency transaction networks that go undetected by state-of-the-art graph-based anomaly detection methods. Our empirical findings show that our AntiBenford framework is able to mine anomalous subgraphs, and provide novel insights into financial transaction data.
The code and the datasets are available at https://github.com/tsourakakis-lab/antibenford-subgraphs.
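As a concrete illustration of the statistical principle behind the framework (not the full AntiBenford algorithm), the following sketch scores how strongly a set of transaction amounts deviates from Benford's first-digit distribution using a chi-square goodness-of-fit test; the function names are ours.

```python
# Sketch: chi-square test of first-digit frequencies against Benford's law.
import numpy as np
from scipy.stats import chisquare

BENFORD = np.log10(1 + 1 / np.arange(1, 10))   # P(first digit = d), d = 1..9

def first_digits(amounts):
    a = np.abs(np.asarray(amounts, dtype=float))
    a = a[a > 0]
    return (a / 10 ** np.floor(np.log10(a))).astype(int)   # leading digit in 1..9

def benford_deviation(amounts):
    digits = first_digits(amounts)
    observed = np.bincount(digits, minlength=10)[1:10]
    expected = BENFORD * observed.sum()
    stat, pvalue = chisquare(observed, f_exp=expected)
    return stat, pvalue   # small p-value => deviates from Benford's law

amounts = np.random.lognormal(mean=5, sigma=2, size=5000)   # roughly Benford-conforming
print(benford_deviation(amounts))
```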
Estimating the time of arrival is a crucial task in intelligent transportation systems. Although considerable efforts have been made to solve this problem, most of them decompose a trajectory into several segments and then compute the travel time by integrating the attributes from all segments. The segment view, though being able to depict the local traffic conditions straightforwardly, is insufficient to embody the intrinsic structure of trajectories on the road network. To overcome the limitation, this study proposes multi-view trajectory representation that comprehensively interprets a trajectory from the segment-, link-, and intersection-views. To fulfill the purpose, we design a hierarchical self-attention network (HierETA) that accurately models the local traffic conditions and the underlying trajectory structure. Specifically, a segment encoder is developed to capture the spatio-temporal dependencies at a fine granularity, within which an adaptive self-attention module is designed to boost performance. Further, a joint link-intersection encoder is developed to characterize the natural trajectory structure consisting of alternatively arranged links and intersections. Afterward, a hierarchy-aware attention decoder is designed to realize a tradeoff between the multi-view spatio-temporal features. The hierarchical encoders and the attentive decoder are simultaneously learned to achieve an overall optimality. Experiments on two large-scale practical datasets show the superiority of HierETA over the state-of-the-arts.
Incremental learning is one paradigm for building and updating models at scale with streaming data. For end-to-end automatic speech recognition (ASR) tasks, the absence of human-annotated labels, along with the need for privacy-preserving policies for model building, makes this a daunting challenge. Motivated by these challenges, in this paper we use a cloud-based framework for production systems to demonstrate insights from privacy-preserving incremental learning for automatic speech recognition (ILASR). By privacy-preserving, we mean the use of ephemeral data that are not human annotated. This system is a step forward for production-level ASR models in incremental/continual learning, offering a near real-time test-bed for experimentation in the cloud for end-to-end ASR while adhering to privacy-preserving policies. We show that the proposed system can improve the production models significantly (3%) over a new time period of six months, even in the absence of human-annotated labels, with varying levels of weak supervision and large batch sizes in incremental learning. This improvement is 20% over test sets with new words and phrases in the new time period. We demonstrate the effectiveness of model building in a privacy-preserving incremental fashion for ASR while further exploring the utility of having an effective teacher model and the use of large batch sizes.
The large-scale nature of the product catalog and the changing demands of customer queries make product search a challenging problem. Customer queries are ambiguous and implicit: customers may be looking for an exact match of their query, a functional equivalent (i.e., a substitute), or an accessory to go with it (i.e., a complement). It is important to distinguish these three categories rather than merely classifying an item as relevant or not for a customer query. This information can help direct the customer and improve search applications' understanding of the customer mission. In this paper, we formulate search relevance as a multi-class classification problem and propose a graph-based solution to classify a given query-item pair as exact, substitute, complement, or irrelevant (ESCI). The customer engagement (clicks, add-to-cart, and purchases) between queries and items serves as crucial information for this problem. However, existing approaches rely purely on textual information (such as BERT) and do not sufficiently focus on the structural relationships. Another challenge in including structural information is the sparsity of such data in some regions. We propose the Structure-Aware multilingual LAnguage Model (SALAM), which utilizes a language model along with a graph neural network to extract region-specific semantics as well as relational information for the classification of query-product pairs. Our model is first pre-trained on a large region-agnostic dataset and behavioral graph data and then fine-tuned on region-specific versions to address the sparsity. We show in our experiments that SALAM significantly outperforms current matching frameworks on the ESCI classification task in several regions. We also demonstrate the effectiveness of a two-phased training setup (i.e., pre-training and fine-tuning) in capturing region-specific information. Finally, we discuss various challenges and solutions for using the model in an industrial setting and outline its contribution to the e-commerce engine.
Automated fact-checking systems have been proposed that quickly provide veracity prediction at scale to mitigate the negative influence of fake news on people and on public opinion. However, most studies focus on veracity classifiers of those systems, which merely predict the truthfulness of news articles. We posit that effective fact checking also relies on people's understanding of the predictions. In this paper, we propose elucidating fact-checking predictions using counterfactual explanations to help people understand why a specific piece of news was identified as fake.
In this work, generating counterfactual explanations for fake news involves three steps: asking good questions, finding contradictions, and reasoning appropriately. We frame this research question as contradicted entailment reasoning through question answering (QA). We first ask questions about the false claim and retrieve potential answers from the relevant evidence documents. Then, we identify the answer most contradictory to the false claim using an entailment classifier. Finally, a counterfactual explanation is created from a matched QA pair using three different counterfactual explanation forms. Experiments are conducted on the FEVER dataset for both system and human evaluations. Results suggest that the proposed approach generates the most helpful explanations compared to state-of-the-art methods. Our code and data are publicly available at https://github.com/yilihsu/AsktoKnowMore.
Mathematical decision-optimization (DO) models provide decision support in a wide range of scenarios. Often, hard-to-model constraints and objectives are learned from data. Learning, however, can give rise to DO models that fail to capture the real system, leading to poor recommendations. We introduce an open-source framework designed for large-scale testing and solution quality analysis of DO model learning algorithms. Our framework produces multiple optimization problems at random, feeds them to the user's algorithm and collects its predicted optima. By comparing predictions against the ground truth, our framework delivers a comprehensive prediction profile of the algorithm. Thus, it provides a playground for researchers and data scientists to develop, test, and tune their DO model learning algorithms. Our contributions include: (1) an open-source testing framework implementation, (2) a novel way to generate DO ground truth, and (3) a first-of-its-kind, generic, cloud-distributed Ray and Rayvens architecture. We demonstrate the use of our testing framework on two open-source DO model learning algorithms.
In this paper, we introduce Shop the Look, a web-scale fashion and home product visual search system deployed at Amazon. Building such a system poses great challenges to both science and engineering practices. We leverage large-scale image data from the Amazon product catalog and adopt effective strategies to reduce the human effort required to annotate data. By employing state-of-the-art computer vision techniques, we train detection, recognition, and feature extraction models to bridge the domain gap between in-the-wild query images and product images which are taken under controlled settings. Our system is designed to achieve a balance between result accuracy and efficiency. The run-time service is optimized to provide retrieval results to users with low-latency. The scalable offline index-building pipeline adapts to the dynamic Amazon catalog that contains billions of products. We present both quantitative and qualitative evaluation results to demonstrate the performance of our system. We believe that the fast-growing Shop the Look service is shaping the way that customers shop on Amazon.
People come to social media to satisfy a variety of needs, such as being informed, entertained and inspired, or connected to their friends and community. Hence, to design a ranking function that gives useful and personalized post recommendations, it would be helpful to be able to predict the affective response a user may have to a post (e.g., entertained, informed, angered). This paper describes the challenges and solutions we developed to apply Affective Computing to social media recommendation systems.
We address several types of challenges. First, we devise a taxonomy of affects that is small (for practical purposes) yet covers the important nuances needed for the application. Second, to collect training data for our models, we balance between signals that are already available to us (namely, different types of user engagement) and data we collected through a carefully crafted human annotation effort on 800k posts. We demonstrate that affective response information learned from this dataset improves a module in the recommendation system by more than 8%. Online experimentation also demonstrates statistically significant decreases in surfaced violating content and increases in surfaced content that users find valuable.
Social networks, such as Twitter, form a heterogeneous information network (HIN) where nodes represent domain entities (e.g., user, content, advertiser, etc.) and edges represent one of many entity interactions (e.g, a user re-sharing content or "following" another). Interactions from multiple relation types can encode valuable information about social network entities not fully captured by a single relation; for instance, a user's preference for accounts to follow may depend on both user-content engagement interactions and the other users they follow. In this work, we investigate knowledge-graph embeddings for entities in the Twitter HIN (TwHIN); we show that these pretrained representations yield significant offline and online improvement for a diverse range of downstream recommendation and classification tasks: personalized ads rankings, account follow-recommendation, offensive content detection, and search ranking. We discuss design choices and practical challenges of deploying industry-scale HIN embeddings, including compressing them to reduce end-to-end model latency and handling parameter drift across versions.
Product images are essential for providing a desirable user experience on an e-commerce platform. For a platform with billions of products, it is extremely time-consuming and labor-intensive to manually pick and organize qualified images. Furthermore, a product image must comply with numerous and complicated image rules in order to be generated/selected. To address these challenges, in this paper we present a new learning framework to achieve Automatic Generation of Product-Image Sequence (AGPIS) in e-commerce. To this end, we propose a Multi-modality Unified Image-sequence Classifier (MUIsC), which is able to simultaneously detect all categories of rule violations through learning. MUIsC leverages textual review feedback as an additional training target and utilizes product textual descriptions to provide extra semantic information. Based on offline evaluations, we show that the proposed MUIsC significantly outperforms various baselines. Besides MUIsC, we also integrate several other important modules into the proposed framework, such as primary image selection, non-compliant content detection, and image deduplication. With all these modules, our framework works effectively and efficiently in the JD.com recommendation platform. By Dec 2021, our AGPIS framework has generated high-standard images for about 1.5 million products and achieves a 13.6% reject rate. Code for this work is available at https://github.com/efan3000/muisc.
The goal of a spatially explainable artificial intelligence (AI) classification approach is to build a classifier that distinguishes two classes (e.g., responder, non-responder) based on their spatial arrangements (e.g., spatial interactions between different point categories), given multi-category point data from the two classes. This problem is important for generating hypotheses towards discovering new immunotherapies for cancer treatment, as well as for other applications in biomedical research and microbial ecology. The problem is challenging due to the exponential number of category subsets, which may vary in the strength of their spatial interactions. Most prior efforts use human-selected spatial association measures, which may not be sufficient for capturing the relevant spatial interactions (e.g., surrounded by) that may be of biological significance. In addition, the related deep neural networks are limited to category pairs and do not explore larger subsets of point categories. To overcome these limitations, we propose a Spatial-interaction Aware Multi-Category deep neural Network (SAMCNet) architecture and contribute novel local reference frame characterization and point pair prioritization layers for spatially explainable classification. Experimental results on multiple cancer datasets (e.g., MxIF) show that the proposed architecture provides higher prediction accuracy than baseline methods. A real-world case study demonstrates that the proposed work discovers patterns missed by existing methods and has the potential to inspire new scientific discoveries.
In this paper we present AMPNet, an acoustic abnormality detection model deployed at ACV Auctions to automatically identify engine faults of vehicles listed on the ACV Auctions platform. We investigate the problem of engine fault detection and discuss our approach of deep-learning based audio classification on a large-scale automobile dataset collected at ACV Auctions. Specifically, we discuss our data collection pipeline and its challenges, dataset preprocessing and training procedures, and deployment of our trained models into a production setting. We perform empirical evaluations of AMPNet and demonstrate that our framework is able to successfully capture various engine anomalies agnostic of vehicle type. Finally we demonstrate the effectiveness and impact of AMPNet in the real world, specifically showing a 20.85% reduction in vehicle arbitrations on ACV Auctions' live auction platform.
To control the outbreak of COVID-19, efficient individual-level mobility intervention strategies for EPidemic Control (EPC) are of great importance; they cut off contact among people at epidemic risk and reduce infections by intervening in the mobility of individuals. Reinforcement Learning (RL) is powerful for decision making; however, there are two major challenges in developing an RL-based EPC strategy: (1) the unobservable information about asymptomatic infections in the incubation period makes it difficult for RL to make decisions, and (2) the delayed rewards hinder RL training. Since the results of EPC are reflected in both daily infections (including unobservable asymptomatic infections) and long-term cumulative cases of COVID-19, it is quite daunting to design an RL model for precise mobility intervention. In this paper, we propose a Variational hiErarcHICal reinforcement Learning method for Epidemic control via individual-level mobility intervention, namely Vehicle. To tackle the above challenges, Vehicle first exploits an information rebuilding module that consists of a contact-risk bipartite graph neural network and a variational LSTM to restore the unobservable information. The contact-risk bipartite graph neural network estimates the possibility of an individual being an asymptomatic infection and the risk of this individual spreading the epidemic, which serves as the current state for RL. The variational LSTM then encodes the state sequence to model the latency of epidemic spreading caused by unobservable asymptomatic infections. Finally, a hierarchical reinforcement learning framework with dual-level agents is employed to train Vehicle and solve the delayed reward problem. Extensive experimental results demonstrate that Vehicle can effectively control the spread of the epidemic. Vehicle outperforms state-of-the-art baseline methods with remarkably high-precision mobility interventions on both symptomatic and asymptomatic infections.
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistillBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
Predicting disease progression is key to provide stratified patient care and enable good utilization of healthcare resources. The availability of longitudinal images has enabled image-based disease progression prediction. In this work, we propose a framework called DP-GAT to identify regions containing significant biological structures and model the relationships among these regions as a graph along with their respective contexts. We perform reasoning via Graph Attention Network to generate representations that enable accurate disease progression prediction. We further extend DP-GAT to perform 3D medical volume segmentation. Experiments on real world medical image datasets demonstrate the advantage of our approach over strong baseline methods for both disease progression prediction and 3D segmentation tasks.
Autonomous Mobility-on-Demand (AMoD) systems represent an attractive alternative to existing transportation paradigms, currently challenged by urbanization and increasing travel needs. By centrally controlling a fleet of self-driving vehicles, these systems provide mobility service to customers and are currently starting to be deployed in a number of cities around the world. Current learning-based approaches for controlling AMoD systems are limited to the single-city scenario, whereby the service operator is allowed to take an unlimited amount of operational decisions within the same transportation system. However, real-world system operators can hardly afford to fully re-train AMoD controllers for every city they operate in, as this could result in a high number of poor-quality decisions during training, making the single-city strategy a potentially impractical solution. To address these limitations, we propose to formalize the multi-city AMoD problem through the lens of meta-reinforcement learning (meta-RL) and devise an actor-critic algorithm based on recurrent graph neural networks. In our approach, AMoD controllers are explicitly trained such that a small amount of experience within a new city will produce good system performance. Empirically, we show how control policies learned through meta-RL are able to achieve near-optimal performance on unseen cities by learning rapidly adaptable policies, thus making them more robust not only to novel environments, but also to distribution shifts common in real-world operations, such as special events, unexpected congestion, and dynamic pricing schemes.
On-demand food delivery services serve people's daily needs worldwide; for example, customers placed over 40 million online orders per day on the Meituan food delivery platform in Q3 of 2021. Accurately predicting the food preparation time (FPT) of each order is very important for the courier and customer experience on the platform. However, two challenges, namely incomplete labels and huge uncertainty in FPT data, make FPT prediction difficult in practice. In this paper, we apply probabilistic forecasting to FPT for the first time and propose a non-parametric method based on deep learning. Apart from the data with precise FPT labels, we make full use of the lower/upper bounds of orders without precise labels during feature extraction and model construction. A number of categories of meaningful features are extracted based on detailed data analysis to produce sharp probability distributions. For probabilistic forecasting, we propose S-QL and prove its relationship with S-CRPS for interval-censored data for the first time, which serves the quantile discretization of S-CRPS and the optimization of the constructed neural network model. Extensive offline experiments on a large-scale real-world dataset and an online A/B test both demonstrate the effectiveness of our proposed method.
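For background on the quantile-based objective mentioned above, the sketch below shows a standard pinball (quantile) loss over a grid of quantile levels; it is generic and does not reproduce the paper's S-QL formulation for interval-censored labels.

```python
# Generic pinball (quantile) loss for producing a discretized predictive distribution.
import torch

def pinball_loss(pred, target, quantiles):
    """pred: (B, Q) predicted quantile values; target: (B,) observed FPT;
    quantiles: (Q,) quantile levels in (0, 1)."""
    diff = target.unsqueeze(1) - pred                   # (B, Q) residuals y - y_hat
    q = quantiles.unsqueeze(0)                          # (1, Q)
    return torch.maximum(q * diff, (q - 1) * diff).mean()

quantiles = torch.tensor([0.1, 0.5, 0.9])
pred = torch.tensor([[8.0, 12.0, 18.0]])                # predicted 10/50/90% quantiles (minutes)
target = torch.tensor([15.0])                           # observed preparation time
print(pinball_loss(pred, target, quantiles))
```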
Annotating decent amounts of data to satisfy sophisticated learning models can be cost-prohibitive for many real-world applications. Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means of alleviating the data-hungry problem. Some recent studies have explored the potential of combining AL and SSL to better probe the unlabeled data. However, almost all of these contemporary SSL-AL works use a simple combination strategy, ignoring the inherent relation between SSL and AL. Furthermore, other methods suffer from high computational costs when dealing with large-scale, high-dimensional datasets. Motivated by the industry practice of labeling data, we propose an innovative Inconsistency-based virtual aDvErsarial Active Learning (IDEAL) algorithm to further investigate SSL-AL's potential superiority and achieve mutual enhancement of AL and SSL, i.e., SSL propagates label information to unlabeled samples and provides smoothed embeddings for AL, while AL excludes samples with inconsistent predictions and considerable uncertainty for SSL. We estimate unlabeled samples' inconsistency through augmentation strategies of different granularities, including fine-grained continuous perturbation exploration and coarse-grained data transformations. Extensive experiments in both text and image domains validate the effectiveness of the proposed algorithm against state-of-the-art baselines. Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
Automatic product description generation for e-commerce has witnessed significant advancement in the past decade. Product copywriting aims to attract users' interest and improve user experience by highlighting product characteristics with textual descriptions. As the services provided by e-commerce platforms become diverse, it is necessary to adapt the patterns of automatically generated descriptions dynamically. In this paper, we report our experience in deploying an E-commerce Prefix-based Controllable Copywriting Generation (EPCCG) system on the JD.com e-commerce product recommendation platform. The development of the system contains four main components: 1) copywriting aspect extraction; 2) weakly supervised aspect labeling; 3) text generation with a prefix-based language model; and 4) copywriting quality control. We conduct experiments to validate the effectiveness of the proposed EPCCG. In addition, we introduce the deployed architecture that integrates EPCCG into the real-time JD.com e-commerce recommendation platform and report the significant payoff since deployment. The code for the implementation is provided at https://github.com/xguo7/Automatic-Controllable-Product-Copywriting-for-E-Commerce.git.
Talent demand and supply forecasting aims to model the variation of the labor market, which is crucial to companies for recruitment strategy adjustment and to job seekers for proactive career path planning. However, existing approaches either focus on talent demand or supply forecasting, but overlook the interconnection between demand-supply sequences among different companies and positions. To this end, in this paper, we propose a Dynamic Heterogeneous Graph Enhanced Meta-learning (DH-GEM) framework for fine-grained talent demand-supply joint prediction. Specifically, we first propose a Demand-Supply Joint Encoder-Decoder (DSJED) and a Dynamic Company-Position Heterogeneous Graph Convolutional Network (DyCP-HGCN) to respectively capture the intrinsic correlation between demand and supply sequences and company-position pairs. Moreover, a Loss-Driven Sampling based Meta-learner (LDSM) is proposed to optimize long-tail forecasting tasks with a few training data. Extensive experiments have been conducted on three real-world datasets to demonstrate the effectiveness of our approach compared with five baselines. DH-GEM has been deployed as a core component of the intelligent human resource system of a cooperative partner.
In this paper, we present Online Supply Values (OSV), a system for estimating the return of available rideshare drivers to match drivers to ride requests at Lyft. Because a future driver state can be accurately predicted from a request destination, it is possible to estimate the expected action value of assigning a ride request to an available driver as a Markov Decision Process using the Bellman Equation. These estimates are updated using temporal difference and are shown to adapt to changing marketplace conditions in real-time. While reinforcement learning has been studied for rideshare dispatch, fully-online approaches without offline priors or other guardrails had never been evaluated in the real world. This work presents the algorithmic changes needed to bridge this gap. OSV is now deployed globally as a core component of Lyft's dispatch matching system. Our A/B user experiments in major US cities measure a +(0.96±0.53)% increase in the request fulfillment rate and a +(0.73±0.22)% increase to profit per passenger session over the previous algorithm.
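As a rough illustration of the value-estimation idea (ours, not Lyft's production code), the sketch below maintains a table of supply values updated online with TD(0) and scores an assignment using a one-step Bellman backup; the state keys and hyperparameters are placeholders.

```python
# Sketch: online temporal-difference estimation of driver supply values.
from collections import defaultdict

class SupplyValues:
    def __init__(self, alpha=0.05, gamma=0.99):
        self.v = defaultdict(float)     # V(driver state), e.g. keyed by (region, time bucket)
        self.alpha, self.gamma = alpha, gamma

    def td_update(self, state, reward, next_state):
        """TD(0) update from an observed (state, reward, next_state) transition."""
        target = reward + self.gamma * self.v[next_state]     # one-step Bellman backup
        self.v[state] += self.alpha * (target - self.v[state])

    def assignment_value(self, ride_reward, dest_state):
        """Expected action value of an assignment: immediate reward plus the
        discounted value of the predicted post-ride driver state."""
        return ride_reward + self.gamma * self.v[dest_state]
```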
Anomaly detection in high-dimensional time series is typically tackled using either reconstruction- or forecasting-based algorithms due to their abilities to learn compressed data representations and model temporal dependencies, respectively. However, most existing methods disregard the relationships between features, information that would be extremely useful when incorporated into a model. In this work, we introduce Fused Sparse Autoencoder and Graph Net (FuSAGNet), which jointly optimizes reconstruction and forecasting while explicitly modeling the relationships within multivariate time series. Our approach combines Sparse Autoencoder and Graph Neural Network, the latter of which predicts future time series behavior from sparse latent representations learned by the former as well as graph structures learned through recurrent feature embedding. Experimenting on three real-world cyber-physical system datasets, we empirically demonstrate that the proposed method enhances the overall anomaly detection performance, outperforming baseline approaches. Moreover, we show that mining sparse latent patterns from high-dimensional time series improves the robustness of the graph-based forecasting model. Lastly, we conduct visual analyses to investigate the interpretability of both recurrent feature embeddings and sparse latent representations.
The performance of logistics depends heavily on time efficiency; hence, considerable effort has been devoted to ensuring on-time delivery in the modern logistics industry. However, delays in logistics transportation and delivery can still occur due to various practical issues, which significantly impact the quality of logistics service. To address this issue, this work investigates the root causes impacting time efficiency, thereby helping logistics operators allocate resources appropriately to improve performance. The proposed solution comprises three stages. Statistical methods are employed in the first stage to analyze the pattern of the on-time delivery rate and detect abnormalities induced by operational non-idealities. Subsequently, a machine learning model is trained to capture the underlying correlations between time efficiency and potential impacting factors. Finally, explainable machine learning techniques are utilized to quantify the contributions of the impacting factors to time efficiency, thereby recognizing the root causes. The proposed method is comprehensively studied on real JD Logistics data through experiments, where it identifies the root causes that impact the time efficiency of logistics delivery with high accuracy. Furthermore, it is also demonstrated to outperform baselines, including a recent state-of-the-art method.
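The abstract does not specify which explainability technique quantifies factor contributions; one common choice is permutation importance, sketched below on synthetic data with hypothetical factor names (backlog, weather, headcount).

```python
# Hedged sketch: quantify factor contributions with permutation importance.
# The paper's exact explainability method is unspecified; data and feature
# names here are synthetic and hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(size=n),   # e.g. sorting-center backlog
    rng.normal(size=n),   # e.g. weather severity index
    rng.normal(size=n),   # e.g. courier headcount
])
# Synthetic on-time rate driven mostly by the first two factors.
y = 0.9 - 0.05 * X[:, 0] - 0.03 * X[:, 1] + 0.01 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for name, imp in zip(["backlog", "weather", "headcount"], result.importances_mean):
    print(f"{name}: {imp:.4f}")   # larger drop in score => larger contribution
```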
Online education, which serves students who cannot be present at school, has become an important supplement to traditional education. Without the direct supervision and instruction of teachers, online education is persistently concerned with potential distractions and misunderstandings. Learning Style Classification (LSC) is proposed to analyze the learning behavior patterns of online learning users, based on which personalized learning paths are generated to help them learn and maintain their interest.
Existing LSC studies rely on labor-intensive expert labeling, which is infeasible in large-scale applications, so we resort to unsupervised classification techniques. However, current unsupervised classification methods are not applicable due to two important challenges: C1) they are unaware of the LSC problem formulation and pedagogical domain knowledge; C2) the absence of any supervision signals. In this paper, we give a formal definition of the unsupervised LSC problem and summarize the domain knowledge into problem-solving heuristics (which addresses C1). A rule-based approach is first designed to provide a tentative solution in a principled manner (which addresses C2). On top of that, a novel Deep Unsupervised Classifier with domain Knowledge (DUCK) is proposed to convert the discovered conclusions and domain knowledge into learnable model components (which addresses both C1 and C2), significantly improving effectiveness, efficiency, and robustness. Extensive offline experiments on both public and industrial datasets demonstrate the superiority of our proposed methods. Moreover, the proposed methods are now deployed in the Huawei Education Center, and the ongoing A/B testing results verify their effectiveness.
Forecasts help businesses allocate resources and achieve objectives. At LinkedIn, product owners use forecasts to set business targets, track outlook, and monitor health. Engineers use forecasts to efficiently provision hardware. Developing a forecasting solution to meet these needs requires accurate and interpretable forecasts on diverse time series with sub-hourly to quarterly frequencies. We present Greykite, an open-source Python library for forecasting that has been deployed on over twenty use cases at LinkedIn. Its flagship algorithm, Silverkite, provides interpretable, fast, and highly flexible univariate forecasts that capture effects such as time-varying growth and seasonality, autocorrelation, holidays, and regressors. The library enables self-serve accuracy and trust by facilitating data exploration, model configuration, execution, and interpretation. Our benchmark results show excellent out-of-the-box speed and accuracy on datasets from a variety of domains. Over the past two years, Greykite forecasts have been trusted by Finance, Engineering, and Product teams for resource planning and allocation, target setting and progress tracking, anomaly detection and root cause analysis. We expect Greykite to be useful to forecast practitioners with similar applications who need accurate, interpretable forecasts that capture complex dynamics common to time series related to human activity.
Embeddings, low-dimensional vector representations of objects, are fundamental in building modern machine learning systems. In industrial settings, there is usually an embedding team that trains an embedding model to solve intended tasks (e.g., product recommendation). The produced embeddings are then widely consumed by consumer teams to solve their unintended tasks (e.g., fraud detection). However, as the embedding model gets updated and retrained to improve performance on the intended task, the newly-generated embeddings are no longer compatible with the existing consumer models. This means that either historical versions of the embeddings can never be retired or all consumer teams have to retrain their models to make them compatible with the latest version of the embeddings, both of which are extremely costly in practice.
Here we study the problem of embedding version updates and their backward compatibility. We formalize the problem where the goal is for the embedding team to keep updating the embedding version, while the consumer teams do not have to retrain their models. We develop a solution based on learning backward compatible embeddings, which allows the embedding model version to be updated frequently, while also allowing the latest version of the embedding to be quickly transformed into any backward compatible historical version of it, so that consumer teams do not have to retrain their models. Our key idea is that whenever a new embedding model is trained, we learn it together with a light-weight backward compatibility transformation that aligns the new embedding to the previous version of it. Our learned backward transformations can then be composed to produce any historical version of embedding. Under our framework, we explore six methods and systematically evaluate them on a real-world recommender system application. We show that the best method, which we call BC-Aligner, maintains backward compatibility with existing unintended tasks even after multiple model version updates. Simultaneously, BC-Aligner achieves the intended task performance similar to the embedding model that is solely optimized for the intended task. Code is publicly available at https://github.com/snap-stanford/bc-emb
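The key mechanism described above, learning a lightweight transformation that maps a new embedding version back to its predecessor and composing such maps to reach any historical version, can be sketched as follows. This is a minimal illustration with a simple linear map fit post hoc, not the authors' exact BC-Aligner, which learns the transformation jointly with the embedding model.

```python
# Minimal sketch of backward-compatible embedding alignment (not the exact
# BC-Aligner): align version k to version k-1 with a light-weight linear map,
# then compose maps to recover any older version.
import torch
import torch.nn as nn

class BackwardTransform(nn.Module):
    """Maps embeddings of version k into the space of version k-1."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)

    def forward(self, z_new):
        return self.linear(z_new)

def fit_alignment(z_new, z_old, epochs=200, lr=1e-2):
    """Fit the transform so that T(z_new) approximates z_old on shared entities."""
    T = BackwardTransform(z_new.shape[1])
    opt = torch.optim.Adam(T.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(T(z_new), z_old)
        loss.backward()
        opt.step()
    return T

# Compose learned transforms to reach any historical version:
# z_v1 is approximated by T_2to1(T_3to2(z_v3)) for a model at version 3.
dim = 64
z_v3, z_v2, z_v1 = (torch.randn(1000, dim) for _ in range(3))  # toy embeddings
T_3to2 = fit_alignment(z_v3, z_v2)
T_2to1 = fit_alignment(z_v2, z_v1)
z_v1_hat = T_2to1(T_3to2(z_v3))
```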
Pre-trained models (PTMs) have become a fundamental backbone for downstream tasks in natural language processing and computer vision. Despite initial gains that were obtained by applying generic PTMs to geo-related tasks at Baidu Maps, a clear performance plateau over time was observed. One of the main reasons for this plateau is the lack of readily available geographic knowledge in generic PTMs. To address this problem, in this paper, we present ERNIE-GeoL, which is a geography-and-language pre-trained model designed and developed for improving the geo-related tasks at Baidu Maps. ERNIE-GeoL is elaborately designed to learn a universal representation of geography-language by pre-training on large-scale data generated from a heterogeneous graph that contains abundant geographic knowledge. Extensive quantitative and qualitative experiments conducted on large-scale real-world datasets demonstrate the superiority and effectiveness of ERNIE-GeoL. ERNIE-GeoL has already been deployed in production at Baidu Maps since April 2021, which significantly benefits the performance of various downstream tasks. This demonstrates that ERNIE-GeoL can serve as a fundamental backbone for a wide range of geo-related tasks.
Mobile map apps such as the Baidu Maps app have become a ubiquitous and essential tool for users to find optimal routes and get turn-by-turn navigation services while driving. However, interacting with such apps while driving through visual-manual interaction modality inevitably causes driver distraction, due to the highly conspicuous nature of the time-sharing, multi-tasking behavior of the driver. In this paper, we present our efforts and findings of a 4-year longitudinal study on designing and implementing DuIVA, which is an intelligent voice assistant (IVA) embedded in the Baidu Maps app for hands-free, eyes-free human-to-app interaction in a fully voice-controlled manner. Specifically, DuIVA is designed to enable users to control the functionalities of Baidu Maps (e.g., navigation and location search) through voice interaction, rather than visual-manual interaction, which minimizes driver distraction and promotes safe driving by allowing the driver to keep "eyes on the road and hands on the wheel'' while interacting with the Baidu Maps app. DuIVA has already been deployed in production at Baidu Maps since November 2017, which facilitates a better interaction modality with the Baidu Maps app and improves the accessibility and usability of the app by providing users with in-app voice activation, natural language queries, and multi-round dialogue. As of December 31, 2021, over 530 million users have used DuIVA, which demonstrates that DuIVA is an industrial-grade and production-proven solution for in-app intelligent voice assistants.
Rax is a library for composable Learning-to-Rank (LTR) written entirely in JAX. The goal of Rax is to facilitate easy prototyping of LTR systems by leveraging the flexibility and simplicity of JAX. Rax provides a diverse set of popular ranking metrics and losses that integrate well with the rest of the JAX ecosystem. Furthermore, Rax implements a system of ranking-specific function transformations which allows fine-grained customization of ranking losses and metrics. Most notably Rax provides approx_t12n: a function transformation (t12n) that can transform any of our ranking metrics into an approximate and differentiable form that can be optimized. This provides a systematic way to directly optimize neural ranking models for ranking metrics that are not easily optimizable in other libraries. We empirically demonstrate the effectiveness of Rax by benchmarking neural models implemented using Flax and trained using Rax on two popular LTR benchmarks: WEB30K and Istella. Furthermore, we show that integrating ranking losses with T5, a large language model, can improve overall ranking performance on the MS MARCO passage ranking task. We are sharing the Rax library with the open source community as part of the larger JAX ecosystem at https://github.com/google/rax.
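The snippet below sketches how the approx_t12n transformation described above is used: a ranking metric is turned into a differentiable surrogate whose gradient can drive training. It relies on the public functions named in the abstract (rax.softmax_loss, rax.ndcg_metric, rax.approx_t12n); shapes and values are illustrative.

```python
# Illustrative use of Rax's metric-to-loss transformation; data is toy-sized.
import jax
import jax.numpy as jnp
import rax

scores = jnp.asarray([[2.0, 1.0, 0.5, 3.0]])   # model scores for one query
labels = jnp.asarray([[1.0, 0.0, 0.0, 2.0]])   # graded relevance labels

# A standard ranking loss and a standard ranking metric.
loss = rax.softmax_loss(scores, labels)
ndcg = rax.ndcg_metric(scores, labels)

# approx_t12n turns the (non-differentiable) metric into a differentiable
# surrogate that can be optimized directly with gradient descent.
approx_ndcg_loss = rax.approx_t12n(rax.ndcg_metric)
grads = jax.grad(lambda s: approx_ndcg_loss(s, labels))(scores)
print(loss, ndcg, grads)
```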
Neural networks can leverage self-supervision to learn integrated representations across multiple data modalities. This makes them suitable to uncover complex relationships between vastly different data types, thus lowering the dependency on labor-intensive feature engineering methods. Leveraging deep representation learning, we propose a generic, robust and systematic model that is able to combine multiple data modalities in a permutation- and mode-number-invariant fashion, both fundamental properties for properly handling changes in data type content and availability. To this end, we treat each multi-modal data sample as a set and utilise autoencoders to learn a fixed-size, permutation-invariant representation that can be used in any decision making process. We build upon previous work that demonstrates the feasibility of presenting a set as an input to autoencoders through content-based attention mechanisms. However, since model inputs and outputs are permutation invariant, we develop an end-to-end architecture that approximates the solution of a linear sum assignment problem, i.e., a minimum-cost bijective mapping problem, to ensure a match between the elements of the input and the output set for effective loss calculation. We demonstrate the model's capability to learn a combined representation while preserving individual mode characteristics, focusing on the task of reconstructing multi-omic cancer data. The code is made publicly available on GitHub (https://github.com/PaccMann/fdsa).
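To illustrate the matching step, the sketch below computes a set-reconstruction loss by finding a minimum-cost bijection between input and output elements. The paper approximates this assignment end-to-end inside the network; here SciPy's exact Hungarian solver is used purely for clarity.

```python
# Illustrative set-reconstruction loss via minimum-cost bijective matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_reconstruction_loss(inputs, outputs):
    """inputs, outputs: arrays of shape (set_size, feature_dim)."""
    # Pairwise squared-error cost between every input and every output element.
    cost = ((inputs[:, None, :] - outputs[None, :, :]) ** 2).sum(-1)
    row_idx, col_idx = linear_sum_assignment(cost)   # minimum-cost bijection
    return cost[row_idx, col_idx].mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))            # one multi-modal sample treated as a set
x_hat = x[rng.permutation(5)] + 0.01   # decoder output in arbitrary order
print(set_reconstruction_loss(x, x_hat))  # small, despite the permuted order
```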
With the unprecedented development of industrialization and urbanization, many hazardous chemicals have become an indispensable part of our daily life. They are produced, transported, and consumed in modern cities every day, which gives rise to many unknown hazardous chemicals-related locations (HCLs) that fall outside the supervision of management departments and pose serious threats to urban safety. Recognizing these unknown HCLs and identifying their risk levels is an essential task for urban hazardous chemicals management. To accomplish this task, in this work, we propose a system named CityShield to discover hidden HCLs and classify their risk levels based on trajectories of hazardous chemicals transportation vehicles. The CityShield system consists of three components. The first component is Data Pre-processing, which filters noise in raw trajectories and extracts stable stay points of transportation vehicles from massive, uncertain GPS points. The second is HCL Recognition, which adopts the proposed HCL-Rec algorithm to cluster stay points into polygonal HCLs and avoids the improper location merging problem caused by the skewed spatial distribution of HCLs. The third component is HCL Classification, which introduces the HCL relation graph as auxiliary information to overcome the label scarcity problem of HCLs. It adopts a self-supervised method consisting of four pre-training tasks to learn high-quality representations for HCLs from the graph, which are finally used to classify the categories and risk levels of HCLs.
The CityShield system has been deployed in Nantong, an important hazardous chemicals import and export city in China. Experiments and case studies on two large-scale real-world datasets collected from Nantong demonstrate the effectiveness of the proposed system. In real-world applications, the CityShield system discovered 173 high-risk unknown HCLs for the Nantong government and successfully shifted Nantong's hazardous chemicals management from emergency response toward prevention.
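For intuition on the Data Pre-processing stage of CityShield, the sketch below implements a generic stay-point rule: consecutive GPS points that stay within a distance threshold for long enough collapse into one stay point. The thresholds and filtering logic are hypothetical; the abstract does not specify CityShield's actual rules.

```python
# Generic stay-point detection sketch (hypothetical thresholds, not CityShield's).
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * math.asin(math.sqrt(a))

def detect_stay_points(track, dist_m=200, min_dur_s=600):
    """track: list of (timestamp_s, lat, lon) sorted by time.
    Points within dist_m of an anchor for at least min_dur_s form a stay point."""
    stays, i = [], 0
    while i < len(track):
        j = i + 1
        while j < len(track) and haversine_m(track[i][1:], track[j][1:]) <= dist_m:
            j += 1
        if track[j - 1][0] - track[i][0] >= min_dur_s:
            pts = track[i:j]
            stays.append((sum(p[1] for p in pts) / len(pts),   # mean latitude
                          sum(p[2] for p in pts) / len(pts)))  # mean longitude
            i = j
        else:
            i += 1
    return stays

print(detect_stay_points([(0, 31.98, 120.89), (400, 31.981, 120.891), (900, 31.98, 120.89)]))
```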
With the increasing complexity of modern software systems, it is essential yet hard to detect anomalies and diagnose problems precisely. Existing log-based anomaly detection approaches rely on a few key assumptions about system logs and perform well in some experimental systems. However, real-world industrial systems often have poor logging quality: system logs are noisy and frequently violate the assumptions of existing approaches, rendering those approaches ineffective. This paper first conducts a comprehensive study on the system logs of three large-scale industrial software systems. Through the study, we identify four typical anti-patterns that affect detection results the most. Based on these patterns, we propose HiLog, an effective human-in-the-loop log-based anomaly detection approach that integrates human knowledge to augment anomaly detection models. With little human labeling effort, our approach can significantly improve the effectiveness of existing models. Experimental results on three large-scale industrial software systems show that our method improves the precision rate by over 50% on average.
Predicting the interactions between T-cell receptors (TCRs) and peptides is crucial for the development of personalized medicine and targeted vaccines in immunotherapy. Current datasets for training deep learning models for this purpose lack diverse TCRs and peptides. To combat this data scarcity, we propose to extend the training dataset by physical modeling of TCR-peptide pairs. Specifically, we compute the docking energies between auxiliary unknown TCR-peptide pairs as surrogate training labels. Then, we use these extended example-label pairs to train our model in a supervised fashion. Finally, we find that the AUC score of the model's predictions can be further improved by pseudo-labeling such unknown TCR-peptide pairs (with a trained teacher model) and re-training the model with those pseudo-labeled TCR-peptide pairs. Our proposed method, which trains the deep neural network with physical modeling and data-augmented pseudo-labeling, improves over baselines on the two available datasets. We also introduce a new dataset that contains over 80,000 unknown TCR-peptide pairs with docking energy scores.
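The pseudo-labeling step can be sketched generically: a trained teacher scores unlabeled pairs, confident predictions become pseudo-labels, and the model is retrained on the enlarged set. The featurization, classifier, and confidence threshold below are hypothetical stand-ins, not the paper's architecture.

```python
# Hedged sketch of teacher-based pseudo-labeling; features, model, and
# thresholds are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(500, 32))                                   # featurized labeled pairs
y_lab = (X_lab[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)   # binds / does not bind
X_unl = rng.normal(size=(5000, 32))                                  # featurized unknown pairs

teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_lab, y_lab)
probs = teacher.predict_proba(X_unl)[:, 1]
confident = (probs > 0.9) | (probs < 0.1)        # keep only confident pseudo-labels

X_aug = np.vstack([X_lab, X_unl[confident]])
y_aug = np.concatenate([y_lab, (probs[confident] > 0.5).astype(int)])
student = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y_aug)
```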
A network motif is a kind of frequently occurring subgraph that reflects local topology in graphs. Although network motifs have been studied in graph analytics, e.g., for social networks and biological networks, it is yet unclear whether they are useful for analyzing the online transaction networks generated by applications such as instant messaging and e-commerce. In this work, we analyze online transaction networks from the perspective of network motifs. We define vertex features based on size-2 and size-3 motifs, and introduce motif-based centrality measurements. We further design a motif-based vertex embedding that integrates weighted motif counts and centrality measurements. Afterward, we implement a distributed framework for motif detection in large-scale online transaction networks. To understand the effectiveness of motifs for analyzing online transaction networks, we study the statistical distribution of motifs in various kinds of graphs at Tencent and assess the benefit of motif-based embedding in a range of downstream graph analytical tasks. Empirical results show that our proposed method can efficiently find motifs in large-scale graphs, help interpretability, and benefit downstream tasks.
In the pharmaceutical industry, the maintenance of production machines must be audited by the regulator. In this context, the problem of predictive maintenance is not when to maintain a machine, but what parts to maintain at a given point in time. The focus shifts from the entire machine to its component parts, and prediction becomes a classification problem. In this paper, we focus on rolling-element bearings and propose a framework for predicting their degradation stages automatically. Our main contribution is a k-means bearing lifetime segmentation method based on high-frequency bearing vibration signals embedded in a low-dimensional latent subspace using an AutoEncoder. Given high-frequency vibration data, our framework generates a labeled dataset that is used to train a supervised model for bearing degradation stage detection. Our experimental results, based on the publicly available FEMTO Bearing run-to-failure dataset, show that our framework is scalable and that it provides reliable and actionable predictions for a range of different bearings.
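The core labeling idea admits a compact sketch: embed vibration windows with an autoencoder, then run k-means on the latent codes to segment the bearing's life into degradation stages. The window length, architecture sizes, and number of stages below are hypothetical, not the paper's configuration.

```python
# Minimal sketch: autoencoder latent codes + k-means lifetime segmentation.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

WINDOW = 256   # samples per vibration window (hypothetical)
LATENT = 8

class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(WINDOW, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, WINDOW))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def label_degradation_stages(windows, n_stages=4, epochs=50):
    """windows: float tensor (n_windows, WINDOW), ordered over the run-to-failure life."""
    model = AE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(windows)
        nn.functional.mse_loss(recon, windows).backward()
        opt.step()
    with torch.no_grad():
        _, z = model(windows)
    # Cluster latent codes into degradation stages; these labels then supervise
    # a downstream stage classifier.
    return KMeans(n_clusters=n_stages, n_init=10, random_state=0).fit_predict(z.numpy())

stages = label_degradation_stages(torch.randn(500, WINDOW))
```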
Here we leverage the power of the crowd: online users who are willing to answer questions about dish availability at restaurants they have visited. While motivated users are happy to contribute knowledge, they are much less likely to respond to "silly" or embarrassing questions (e.g., "Does Pizza Hut serve pizza?" or "Does Mike's Vegan Restaurant serve steak?").
In this paper, we study the problem of Vexation-Aware Active Learning (VAAL), where judiciously selected questions are targeted towards improving restaurant-dish model prediction, subject to a limit on the percentage of "unsure" answers or "dismissals" (e.g., swiping the app closed) measuring user vexation. We formalize the selection problem as an integer program and solve it efficiently using a distributed solution that scales linearly with the number of candidate questions. Since our algorithm relies on an accurate estimation of the unsure-dismiss rate (UDR), we present a regression model that provides high-quality results compared to baselines including collaborative filtering. Finally, we demonstrate in a live system that our proposed VAAL strategy performs competitively against classical (margin-based) active learning approaches while reducing the UDR for the questions being asked.
Online ads are essential to all businesses, and ad headlines are one of their core creative components. Existing methods can generate headlines automatically and also optimize their click-through rate (CTR) and quality. However, evolving ad formats and changing creative requirements make it difficult to generate optimized and customized headlines. We propose a novel method that uses prefix control tokens along with BART [16] fine-tuning. It yields the highest CTR and also allows users to control the length of generated headlines for use across different ad formats. The method is also flexible and can easily be adapted to other architectures, creative requirements, and optimization criteria. Our experiments demonstrate a 25.82% improvement in Rouge-L and a 5.82% improvement in estimated CTR over a previously published strong ad headline generation baseline.
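A hedged sketch of the control-token idea follows: register prefix tokens for the desired property (here, headline length), prepend them to the input, and fine-tune a BART model as usual. The token names and the tiny example data are hypothetical and are not the paper's exact setup.

```python
# Illustrative prefix-control fine-tuning with Hugging Face BART; control tokens
# and data are hypothetical placeholders.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Register hypothetical control tokens for target headline length.
tokenizer.add_special_tokens({"additional_special_tokens": ["<len_short>", "<len_long>"]})
model.resize_token_embeddings(len(tokenizer))

src = "<len_short> Landing page text: Affordable running shoes with free returns."
tgt = "Run Far, Pay Less"

batch = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss   # standard fine-tuning objective
loss.backward()

# At inference, the prefix token steers the length of the generated headline.
generated = model.generate(**tokenizer("<len_short> New vegan menu now available.",
                                       return_tensors="pt"), max_length=16)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```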
Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. Training an MTL model requires having the training data for all tasks available at the same time. As systems usually evolve over time (e.g., to support new functionalities), adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks, which can be time-consuming and computationally expensive. Moreover, in some scenarios, the data used to train the original model may no longer be available, for example, due to storage or privacy concerns.
In this paper, we approach the problem of incrementally expanding an MTL model's capability to solve new tasks over time by distilling the knowledge of a model already trained on n tasks into a new one that solves n+1 tasks. To avoid catastrophic forgetting, we propose to exploit unlabeled data from the same distributions as the old tasks. Our experiments on publicly available benchmarks show that such a technique dramatically benefits the distillation by preserving the already acquired knowledge (i.e., preventing performance drops of up to 20% on the old tasks) while obtaining good performance on the incrementally added tasks. Further, we also show that our approach is beneficial in practical settings by using data from a leading voice assistant.
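The objective can be sketched as a sum of a distillation term on unlabeled data (the expanded student matches the frozen n-task teacher on the old heads) and a supervised term on the new task. The toy multi-head models, temperature, and weighting below are hypothetical, not the paper's exact formulation.

```python
# Hedged sketch of distillation-based MTL expansion; models and hyperparameters
# are hypothetical.
import torch
import torch.nn.functional as F

def expansion_loss(student, teacher, unlabeled_x, new_x, new_y, T=2.0, alpha=1.0):
    # Knowledge distillation on the old-task heads (no labels required).
    with torch.no_grad():
        teacher_logits = teacher(unlabeled_x)          # list: one tensor per old task
    student_old = student(unlabeled_x)[:-1]            # all heads except the new one
    kd = sum(
        F.kl_div(F.log_softmax(s / T, dim=-1), F.softmax(t / T, dim=-1),
                 reduction="batchmean") * T * T
        for s, t in zip(student_old, teacher_logits)
    )
    # Standard supervised loss on the newly added task.
    ce = F.cross_entropy(student(new_x)[-1], new_y)
    return ce + alpha * kd

class MultiHead(torch.nn.Module):
    """Toy shared body with one classification head per task."""
    def __init__(self, n_tasks, dim=16, n_classes=3):
        super().__init__()
        self.body = torch.nn.Linear(dim, 32)
        self.heads = torch.nn.ModuleList([torch.nn.Linear(32, n_classes) for _ in range(n_tasks)])
    def forward(self, x):
        h = torch.relu(self.body(x))
        return [head(h) for head in self.heads]

teacher, student = MultiHead(2), MultiHead(3)          # expand from 2 to 3 tasks
loss = expansion_loss(student, teacher,
                      unlabeled_x=torch.randn(8, 16),
                      new_x=torch.randn(8, 16),
                      new_y=torch.randint(0, 3, (8,)))
loss.backward()
```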
In fluid team sports such as soccer and basketball, analyzing team formation is one of the most intuitive ways to understand tactics from domain participants' point of view. However, existing approaches either assume that team formation is consistent throughout a match or assign formations frame-by-frame, neither of which reflects real situations. To tackle this issue, we propose a change-point detection framework named SoccerCPD that distinguishes tactically intended formation and role changes from temporary changes in soccer matches. We first assign roles to players frame-by-frame and perform two-step change-point detection: (1) formation change-point detection based on the sequence of role-adjacency matrices and (2) role change-point detection based on the sequence of role permutations. The evaluation of SoccerCPD using ground truth annotated by domain experts shows that our method accurately detects the points of tactical changes and estimates the formation and role assignment per segment. Lastly, we introduce practical use cases that domain participants can easily interpret and utilize.
Given a large, semi-infinite collection of co-evolving epidemiological data containing the daily counts of cases/deaths/recovered in multiple locations, how can we incrementally monitor current dynamical patterns and forecast future behavior? The world faces the rapid spread of infectious diseases such as SARS-CoV-2 (COVID-19), where a crucial goal is to predict potential future outbreaks and pandemics, as quickly as possible, using available data collected throughout the world. In this paper, we propose a new streaming algorithm, EPICAST, which is able to model, understand and forecast dynamical patterns in large co-evolving epidemiological data streams. Our proposed method is designed as a dynamic and flexible system, and is based on a unified non-linear differential equation. Our method has the following properties: (a) Effective: it operates on large co-evolving epidemiological data streams, and captures important world-wide trends, as well as location-specific patterns. It also performs real-time and long-term forecasting; (b) Adaptive: it incrementally monitors current dynamical patterns, and also identifies any abrupt changes in streams; (c) Scalable: our algorithm does not depend on data size, and thus is applicable to very large data streams. In extensive experiments on real datasets, we demonstrate that EPICAST outperforms the best existing state-of-the-art methods as regards accuracy and execution speed.
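EPICAST itself is built on a unified non-linear differential equation over co-evolving streams. As a generic illustration of differential-equation-based epidemic forecasting (not the EPICAST model), the sketch below fits a simple SIR model to an observed infection stream and extrapolates it forward.

```python
# Generic SIR-fitting illustration; parameters, population size, and noise level
# are synthetic, and this is not the EPICAST algorithm.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

def sir(y, t, beta, gamma):
    S, I, R = y
    N = S + I + R
    return [-beta * S * I / N, beta * S * I / N - gamma * I, gamma * I]

def simulate(beta, gamma, y0, n_days):
    """Return the currently-infected trajectory I(t) over n_days."""
    return odeint(sir, y0, np.arange(n_days), args=(beta, gamma))[:, 1]

# Toy observed stream of currently-infected counts (first 40 days).
true = simulate(0.3, 0.1, y0=[99000, 1000, 0], n_days=60)
obs = true[:40] * np.random.default_rng(0).normal(1.0, 0.02, 40)

def loss(params):
    beta, gamma = params
    return np.mean((simulate(beta, gamma, [99000, 1000, 0], 40) - obs) ** 2)

fit = minimize(loss, x0=[0.2, 0.05], bounds=[(1e-3, 2.0), (1e-3, 1.0)])
forecast = simulate(*fit.x, y0=[99000, 1000, 0], n_days=60)[40:]   # 20-day forecast
print(fit.x, forecast[:5])
```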
A/B tests, or online controlled experiments, are heavily used in industry to evaluate implementations of ideas. While the statistics behind controlled experiments are well documented and some basic pitfalls known, we have observed some seemingly intuitive concepts being touted, including by A/B tool vendors and agencies, which are misleading, often badly so. Our goal is to describe these misunderstandings, the "intuition" behind them, and to explain and bust that intuition with solid statistical reasoning. We provide recommendations that experimentation platform designers can implement to make it harder for experimenters to make these intuitive mistakes.
Prior work in Dense Retrieval usually encodes queries and documents using single-vector representations (also called embeddings) and performs retrieval in the embedding space using approximate nearest neighbor search. This paradigm enables efficient semantic retrieval. However, single-vector representations can be ineffective at capturing different aspects of the queries and documents in relevance matching, especially for some vertical domains. For example, in e-commerce search, these aspects could be category, brand, and color. Given the query "white nike socks", a Dense Retrieval model may mistakenly retrieve some "white adidas socks" while missing the intended brand. We propose to explicitly represent multiple aspects using one embedding per aspect. We introduce an aspect prediction task to teach the model to capture aspect information with particular aspect embeddings. We design a lightweight network to fuse the aspect embeddings for representing queries and documents. Our evaluation using an e-commerce dataset shows impressive improvements over strong Dense Retrieval baselines. We also discover that the proposed aspect embeddings can enhance the interpretability of Dense Retrieval models as a byproduct.
Multi-lingual text advertisement generation is a critical task for international companies such as Microsoft. Due to the lack of training data, scaling out text advertisement generation to low-resource languages is a grand challenge in real industry settings. Although some methods transfer knowledge from rich-resource languages to low-resource languages through a pre-trained multi-lingual language model, they fail to balance transferability from the source language with smooth expression in the target languages. In this paper, we propose a unified Self-Supervised Augmentation and Generation (SAG) architecture to handle the multi-lingual text advertisement generation task in a real production scenario. To alleviate the problem of data scarcity, we employ multiple data augmentation strategies to synthesize training data in target languages. Moreover, a self-supervised adaptive filtering structure is developed to alleviate the impact of noise in the augmented data. New state-of-the-art results on a well-known benchmark verify the effectiveness and generalizability of our proposed framework, and deployment in Microsoft Bing demonstrates the superior performance of our method.
On the world wide web, toxic content detectors are a crucial line of defense against potentially hateful and offensive messages. As such, building highly effective classifiers that enable a safer internet is an important research area. Moreover, the web is a highly multilingual, cross-cultural community that develops its own lingo over time. As such, it is crucial to develop models that are effective across a diverse range of languages, usages, and styles. In this paper, we present the fundamentals behind the next version of the Perspective API from Google Jigsaw. At the heart of the approach is a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks. We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings. We additionally outline the techniques employed to make such a byte-level model efficient and feasible for productionization. Through extensive experiments on multilingual toxic comment classification benchmarks derived from real API traffic and evaluation on an array of code-switching, covert toxicity, emoji-based hate, human-readable obfuscation, distribution shift, and bias evaluation settings, we show that our proposed approach outperforms strong baselines. Finally, we present our findings from deploying this system in production.
Mobile edge computing (MEC) offers the infrastructure for improving data caching performance structurally by deploying edge servers at the network edge within users' close geographic proximity. Popular data like viral videos can be cached on edge servers to serve users with low latency. Investigating the integrity of these edge data is critical and challenging, as edge servers often suffer from unreliability and constrained resources. Meanwhile, EDI (edge data integrity) investigation must be performed by edge servers collaboratively at the edge to avoid excessive backhaul network traffic. There are two main challenges in practice: 1) there is a lack of Byzantine-tolerant collaborative investigation methods; and 2) edge servers may be reluctant to collaborate without proper incentives. To tackle these challenges systematically, this paper proposes a novel scheme named EdgeWatch to enable robust and collaborative EDI investigation in a decentralized manner based on blockchain. Under EdgeWatch, edge servers collaborate on EDI investigation following a novel integrity consensus. A blockchain system comprising three main components is built as the infrastructure to facilitate this integrity consensus: 1) an incentive mechanism that motivates edge servers to participate in EDI investigation; 2) a reputation system that elects reliable leaders for block consensus; and 3) a leader randomization technique that protects leaders from targeted attacks. We evaluate EdgeWatch experimentally against three representative schemes. The results demonstrate the high precision, efficiency, and robustness of EdgeWatch.
Deep sequence networks such as multi-head self-attention networks provide a promising way to extract effective representations from raw sequence data in an end-to-end fashion and have shown great success in various domains such as natural language processing and computer vision. However, in domains such as financial risk management and anti-fraud, where expert-derived features are heavily relied on, deep sequence models struggle to dominate the game. In this paper, we introduce a simple framework called symbolic testing to verify the learnability of certain expert-derived features over sequence data. A systematic investigation over simulated data reveals that the self-attention architecture fails to learn some standard symbolic expressions, such as the count-distinct operation. To overcome this deficiency, we propose a novel architecture named SHORING, which contains two components: an event network and a sequence network. The event network efficiently learns arbitrary high-order event-level conditional embeddings via a reparameterization trick, while the sequence network integrates domain-specific aggregations into the sequence-level representation, thereby providing richer inductive biases compared to standard sequence architectures like self-attention. We conduct comprehensive experiments and ablation studies on synthetic datasets that mimic sequence data commonly seen in the anti-fraud domain and on three real-world datasets. The results show that SHORING learns commonly used symbolic features well and experimentally outperforms state-of-the-art methods by a significant margin on real-world online transaction datasets. The symbolic testing framework and SHORING have been applied in anti-fraud model development at Alipay and have improved the performance of models for real-time fraud detection.
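The symbolic-testing idea can be sketched as follows: generate synthetic event sequences, define an expert-style symbolic target (here, count-distinct over a categorical field), and measure how well a candidate sequence architecture regresses it. The data sizes and the tiny self-attention regressor are illustrative, not the paper's exact protocol.

```python
# Hedged sketch of symbolic testing on the count-distinct operation.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
n_seq, seq_len, vocab = 2048, 20, 50
events = rng.integers(0, vocab, size=(n_seq, seq_len))
target = np.array([len(set(s)) for s in events], dtype=np.float32)  # count distinct

x = torch.tensor(events)
y = torch.tensor(target)

class SeqRegressor(nn.Module):
    """A minimal self-attention regressor used as the model under test."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, 32)
        self.attn = nn.MultiheadAttention(32, num_heads=4, batch_first=True)
        self.head = nn.Linear(32, 1)

    def forward(self, tokens):
        h = self.emb(tokens)
        h, _ = self.attn(h, h, h)
        return self.head(h.mean(dim=1)).squeeze(-1)

model = SeqRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
# A large residual error indicates the architecture struggles to learn the
# symbolic feature, which is exactly what symbolic testing is meant to expose.
print("final MSE on the count-distinct target:", float(loss))
```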
Fault diagnosis is critical in many domains, as faults may lead to safety threats or economic losses. In the field of online service systems, operators rely on enormous monitoring data to detect and mitigate failures. Quickly recognizing a small set of root cause indicators for the underlying fault can save much time for failure mitigation. In this paper, we formulate the root cause analysis problem as a new causal inference task named intervention recognition. We propose a novel unsupervised causal inference-based method named Causal Inference-based Root Cause Analysis (CIRCA). The core idea is a sufficient condition for a monitoring variable to be a root cause indicator, i.e., a change in its probability distribution conditioned on its parents in the Causal Bayesian Network (CBN). Towards the application in online service systems, CIRCA constructs a graph among monitoring metrics based on knowledge of the system architecture and a set of causal assumptions. A simulation study illustrates the theoretical reliability of CIRCA. The performance on a real-world dataset further shows that CIRCA can improve the recall of the top-1 recommendation by 25% over the best baseline method.
Industrial search and recommendation systems mostly follow the classic multi-stage information retrieval paradigm: matching, pre-ranking, ranking, and re-ranking stages. To account for system efficiency, simple vector-product based models are commonly deployed in the pre-ranking stage. Recent works consider distilling the knowledge of large ranking models into small pre-ranking models for better effectiveness. However, two major challenges in pre-ranking systems still exist: (i) without explicitly modeling the performance gain versus computation cost, the predefined latency constraint in the pre-ranking stage inevitably leads to suboptimal solutions; (ii) transferring the ranking teacher's knowledge to a pre-ranking student with a predetermined handcrafted architecture still suffers from a loss of model performance. In this work, a novel framework AutoFAS is proposed which jointly optimizes the efficiency and effectiveness of the pre-ranking model: (i) AutoFAS for the first time simultaneously selects the most valuable features and network architectures using the Neural Architecture Search (NAS) technique; (ii) equipped with a ranking-model-guided reward during the NAS procedure, AutoFAS can select the best pre-ranking architecture for a given ranking teacher without any computation overhead. Experimental results in our real-world search system show AutoFAS consistently outperforms the previous state-of-the-art (SOTA) approaches at a lower computing cost. Notably, our model has been adopted in the pre-ranking module of the search system of Meituan, bringing significant improvements.
The purpose of Inventory Pricing is to bid the right prices for online ad opportunities, which is crucial for a Demand-Side Platform (DSP) to win advertising auctions in Real-Time Bidding (RTB). In the planning stage, advertisers need the forecasts of probabilistic models to make bidding decisions. However, most previous works made strong assumptions about the distributional form of the winning price, which reduced their accuracy and weakened their ability to generalize. Though some works recently tried to fit the distribution directly, their complex structures lacked efficiency in online inference, which is critical for advertising systems. In this paper, we devise a novel loss function, Neighborhood Likelihood Loss (NLL), collaborating with a proposed framework, Arbitrary Distribution Modeling (ADM), to predict the winning price distribution under censorship with no pre-assumption required. We conducted experiments on two real-world experimental datasets and one large-scale, non-simulated production dataset in our system. Experiments showed that ADM outperformed the baselines on both algorithmic and business metrics. This method has been in production for one year and has produced good yield in our system. Without any pre-assumed specific distribution form, ADM showed significant advantages in effectiveness and efficiency, demonstrating its great capability in modeling sophisticated price landscapes.
Consumption intent, defined as the decision-driven force of consumption behaviors, is crucial for improving the explainability and performance of user-modeling systems, with various downstream applications like recommendation and targeted marketing. However, consumption intent is implicit, and only a few known intents have been explored from the user consumption data in Meituan. Hence, discovering new consumption intents is a crucial but challenging task, which suffers from two critical challenges: 1) how to encode the consumption intent related to multiple aspects of preferences, and 2) how to discover the new intents with only a few known ones. In Meituan, we designed the AutoIntent system, consisting of the disentangled intent encoder and intent discovery decoder, to address the above challenges. Specifically, for the disentangled intent encoder, we construct three groups of dual hypergraphs to capture the high-order relations under the three aspects of preferences and then utilize the designed hypergraph neural networks to extract disentangled intent features. For the intent discovery decoder, we propose to build intent-pair pseudo labels based on the denoised feature similarities to transfer knowledge from known intents to new ones. Extensive offline evaluations verify that AutoIntent can effectively discover unknown consumption intents. Moreover, we deploy AutoIntent in the recommendation engine of the Meituan APP, and the further online evaluation verifies its effectiveness.
Promising progress has been made toward learning efficient time series representations in recent years, but the learned representations often lack interpretability and do not encode the semantic meanings that arise from the complex interactions of many latent factors. Learning representations that disentangle these latent factors can yield semantically rich representations of time series and further enhance interpretability. However, directly adopting sequential models, such as the Long Short-Term Memory Variational AutoEncoder (LSTM-VAE), encounters a Kullback-Leibler (KL) vanishing problem: the LSTM decoder often generates sequential data without efficiently using the latent representations, and the latent space can sometimes even be independent of the observation space. Moreover, traditional disentanglement methods may intensify the trend of KL vanishing along with the disentanglement process, because they tend to penalize the mutual information between the latent space and the observations. In this paper, we propose Disentangle Time-Series (DTS), a novel disentanglement enhancement framework for time series data. Our framework achieves multi-level disentanglement by covering both individual latent factors and group semantic segments. We propose augmenting the original VAE objective by decomposing the evidence lower bound and extracting evidence linking factorial representations to disentanglement. Additionally, we introduce a mutual information maximization term between the observation space and the latent space to alleviate the KL vanishing problem while preserving the disentanglement property. Experimental results on five real-world IoT datasets demonstrate that the representations learned by DTS achieve superior performance in various tasks with better interpretability.
Taxonomies describe the definitions of entities, entities' attributes, and the relations among entities, and thus play an important role in building a knowledge graph. In this paper, we tackle the task of taxonomy entity translation, which is to translate the names of taxonomy entities in a source language into a target language. The translations can then be utilized to build a knowledge graph in the target language. Despite its importance, taxonomy entity translation remains a hard problem for AI models due to two major challenges. One challenge is understanding the semantic context in very short entity names. The other is acquiring a deep understanding of the domain where the knowledge graph is built.
We present TaxoTrans, a novel method for taxonomy entity translation that can capture the context in entity names and the domain knowledge in the taxonomy. To achieve this, TaxoTrans creates a heterogeneous graph to connect entities and formulates the entity name translation problem as link prediction in this heterogeneous graph: given a pair of entity names across two languages, TaxoTrans applies a graph neural network to determine whether they form a translation pair or not. Because of this graph, TaxoTrans can capture both the semantic context and the domain knowledge. Our offline experiments on LinkedIn's skill and title taxonomies show that by modeling semantic information and domain knowledge in the heterogeneous graph, TaxoTrans outperforms state-of-the-art translation methods by ∼10%. Human annotation and A/B test results further demonstrate that the accurately translated entities significantly improve user engagement and advertising revenue at LinkedIn.
Recent years have witnessed an exponential growth of model scale in deep learning-based recommender systems---from Google's 2016 model with 1 billion parameters to the latest Facebook model with 12 trillion parameters. A significant quality boost has come with each jump in model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, the training of such models is challenging even within industrial-scale data centers. We resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, to ensure both training efficiency and training accuracy, we design a novel hybrid training algorithm, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; we then build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical demonstrations and empirical studies with up to 100 trillion parameters have been conducted to justify the system design and implementation of Persia. We make Persia publicly available (at github.com/PersiaML/Persia) so that anyone can easily train a recommender model at the scale of 100 trillion parameters.
In this paper, we present Duplex Conversation, a multi-turn, multimodal spoken dialogue system that enables telephone-based agents to interact with customers like a human. We use the concept of full-duplex in telecommunication to demonstrate what a human-like interactive experience should be and how to achieve smooth turn-taking through three subtasks: user state detection, backchannel selection, and barge-in detection. In addition, we propose semi-supervised learning with multimodal data augmentation to leverage unlabeled data and increase model generalization. Experimental results on the three subtasks show that the proposed method achieves consistent improvements over baselines. We have deployed Duplex Conversation in Alibaba's intelligent customer service and share lessons learned in production. Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
Feature selection plays an important role in deep recommender systems: it selects a subset of the most predictive features so as to boost recommendation performance and accelerate model optimization. The majority of existing feature selection methods, however, aim to select only a fixed subset of features. This setting cannot fit the dynamic and complex environments of practical recommender systems, where the contribution of a specific feature varies significantly across user-item interactions. In this paper, we propose an adaptive feature selection framework, AdaFS, for deep recommender systems. To be specific, we develop a novel controller network to automatically select the most relevant features from the whole feature space, which fits the dynamic recommendation environment better. Besides, different from classic feature selection approaches, the proposed controller can adaptively score each example of user-item interactions and identify the most informative features correspondingly for subsequent recommendation tasks. We conduct extensive experiments based on two public benchmark datasets from a real-world recommender system. Experimental results demonstrate the effectiveness of AdaFS and its excellent transferability to the most popular deep recommendation models.
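The adaptive-selection idea admits a compact sketch: a small controller scores each feature field per user-item example and re-weights the field embeddings before the recommendation model consumes them. The layer sizes and softmax weighting below are hypothetical, not the exact AdaFS architecture.

```python
# Hedged sketch of per-example adaptive feature weighting (not the exact AdaFS).
import torch
import torch.nn as nn

class FeatureController(nn.Module):
    def __init__(self, n_fields, emb_dim):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(n_fields * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, n_fields),
        )

    def forward(self, field_embs):
        """field_embs: (batch, n_fields, emb_dim) embeddings of each feature field."""
        flat = field_embs.flatten(start_dim=1)
        weights = torch.softmax(self.scorer(flat), dim=-1)   # per-example field scores
        return field_embs * weights.unsqueeze(-1)            # softly selected features

batch, n_fields, emb_dim = 32, 10, 16
field_embs = torch.randn(batch, n_fields, emb_dim)
selected = FeatureController(n_fields, emb_dim)(field_embs)  # feed to the recommender
```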
The most notable neural data-to-text approaches generate natural language from structured data by relying on the surface form of the structured content, which ignores the underlying logical correlation between the input data and the target text. Moreover, identifying such logical associations and explaining them in natural language is desirable but not yet studied. In this paper, we introduce a practical data-to-text method for logic-critical scenarios, specifically for anti-money laundering applications. It involves detecting risks from input data and explaining any abnormal behaviors in natural language. The proposed method is a Logic Aware Neural Generation framework (LANG), which is a preliminary attempt to explore the integration of logic modeling and text generation. Concretely, we first convert expert rules into a logic graph. Then, the model utilizes a meta-path based encoder to exploit the expert knowledge. Besides, a retriever module with the encoded logic knowledge is used to bridge the gap between numeric input and target text. Finally, a rule-constrained loss is leveraged to improve the generation probability of tokens in rule-recalled statements to ensure accuracy. We conduct extensive experiments on anti-money laundering data. Results show that the proposed method significantly outperforms baselines in both objective measures, with a relative 35% improvement in F1 score, and subjective measures, with a 30% improvement in human preference.
Relevant recommendation is a special recommendation scenario that provides relevant items when users express interest in one target item (e.g., click, like, or purchase). Besides considering the relevance between recommendations and the trigger item, the recommendations should also be diversified to avoid information cocoons. However, existing diversified recommendation methods mainly focus on item-level diversity, which is insufficient when the recommended items are all relevant to the target item. Moreover, redundant or noisy item features might affect the performance of simple feature-aware recommendation approaches. Faced with these issues, we propose a Feature Disentanglement Self-Balancing Re-ranking framework (FDSB) to capture feature-aware diversity. The framework consists of two major modules, namely a disentangled attention encoder (DAE) and a self-balanced multi-aspect ranker. In DAE, we use multi-head attention to learn disentangled aspects from rich item features. In the ranker, we develop an aspect-specific ranking mechanism that is able to adaptively balance relevance and diversity for each aspect. In experiments, we conduct offline evaluation on the collected dataset and deploy FDSB on the KuaiShou app for an online A/B test on the relevant recommendation function. The significant improvements in both recommendation quality and user experience verify the effectiveness of our approach.
The practice of continuous deployment has enabled companies to reduce time-to-market by increasing the rate at which software can be deployed. However, deploying more frequently bears the risk that occasionally defective changes are released. For Internet companies, this has the potential to degrade the user experience and increase user abandonment. Therefore, quality control gates are an important component of the software delivery process. These are used to build confidence in the reliability of a release or change. Towards this end, a common approach is to perform a canary test to evaluate new software under production workloads. Detecting defects as early as possible is necessary to reduce exposure and to provide immediate feedback to the developer.
We present a statistical framework for rapidly detecting regressions in software deployments. Our approach is based on sequential tests of stochastic order and of equality in distribution. This enables canary tests to be continuously monitored, permitting regressions to be rapidly detected while strictly controlling the false detection probability throughout. The utility of this approach is demonstrated based on two case studies at Netflix.
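For intuition on sequential monitoring with controlled false detection, the sketch below runs a one-sample Wald sequential probability ratio test on the canary's error stream. It is a simplified illustration of the general idea, not the stochastic-order or equality-in-distribution tests used in the framework; hypothesized rates and error probabilities are hypothetical.

```python
# Simplified SPRT illustration of continuously monitored canary checks.
import math
import random

def sprt_thresholds(alpha=0.01, beta=0.05):
    """Wald's decision thresholds from the target error probabilities."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def sequential_canary(errors, p0=0.01, p1=0.02, alpha=0.01, beta=0.05):
    """errors: iterable of 0/1 request-level failure indicators from the canary."""
    lower, upper = sprt_thresholds(alpha, beta)
    llr, n = 0.0, 0
    for n, e in enumerate(errors, start=1):
        # Log-likelihood ratio of the regressed rate p1 vs the baseline rate p0.
        llr += math.log((p1 if e else 1 - p1) / (p0 if e else 1 - p0))
        if llr >= upper:
            return n, "regression detected"
        if llr <= lower:
            return n, "no regression"
    return n, "undecided"

# Example: a stream whose true failure rate matches the regressed hypothesis.
random.seed(0)
stream = (1 if random.random() < 0.02 else 0 for _ in range(100000))
print(sequential_canary(stream))
```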
This paper reports our recent practice of recommending articles to cold-start users at Tencent. Transferring knowledge from information-rich domains to help user modeling is an effective way to address the user-side cold-start problem. Our previous work demonstrated that general-purpose user embeddings based on mobile app usage helped article recommendations. However, high-dimensional embeddings are cumbersome for online usage, thus limiting the adoption. On the other hand, user clustering, which partitions users into several groups, can provide a lightweight, online-friendly, and explainable way to help recommendations. Effective user clustering for article recommendations based on mobile app usage faces unique challenges, including (1) the gap between an active user's behavior of mobile app usage and article reading, and (2) the gap between mobile app usage patterns of active and cold-start users. To address the challenges, we propose a tailored Dual Alignment User Clustering (DAUC) model, which applies a sample-wise contrastive alignment to eliminate the gap between active users' mobile app usage and article reading behavior, and a distribution-wise adversarial alignment to eliminate the gap between active users' and cold-start users' app usage behavior. With DAUC, cold-start recommendation-optimized user clustering based on mobile app usage can be achieved. On top of the user clusters, we further build candidate generation strategies, real-time features, and corresponding ranking models without much engineering difficulty. Both online and offline experiments demonstrate the effectiveness of our work.
The outbreak of COVID-19 has given rise to newborn services on online platforms and has simultaneously fueled multifarious online fraud activities. Due to the rapid technological and commercial innovation that opens up an ever-expanding set of products, insufficient labeling data renders existing supervised or semi-supervised fraud detection models ineffective in these emerging services. However, the ever-accumulating user behavioral data on online platforms might be helpful in improving the performance of fraud detection for newborn services. To this end, in this paper, we propose to pre-train user behavior sequences, which consist of orderly arranged actions, from large-scale unlabeled data sources for online fraud detection. Recent studies illustrate that accurate extraction of user intentions (formed by consecutive actions) in behavioral sequences can propel improvements in the performance of online fraud detection. By analyzing the characteristics of online fraud activities, we devise a model named UB-PTM that learns knowledge of fraud activities through three agent tasks at different granularities, i.e., the action, intention, and sequence levels, from large-scale unlabeled data. Extensive experiments on three downstream transaction- and user-level online fraud detection tasks demonstrate that our UB-PTM is able to outperform state-of-the-art models designed for specific tasks.
In online information systems, users make decisions based on factors from several specific aspects, such as brand, price, etc. Existing recommendation engines ignore the explicit modeling of these factors, leading to sub-optimal recommendation performance. In this paper, we focus on the real-world scenario where these factors can be explicitly captured (users are exposed to decision-factor-based persuasion texts, i.e., persuasion factors). Although this allows us to explicitly model the user decision process, there are critical challenges, including representation learning and effect estimation for the persuasion factors, along with the data-sparsity problem. To address them, in this work, we present our POEM (short for Persuasion factOr Effect Modeling) system. We first propose persuasion-factor graph convolutional layers for encoding and learning representations from the persuasion-aware interaction data. Then we develop a prediction layer that fully considers the user's sensitivity to the persuasion factors. Finally, to address the data-sparsity issue, we propose a counterfactual learning-based data augmentation method to enhance the supervision signal. Real-world experiments demonstrate the effectiveness of our proposed framework in modeling the effect of persuasion factors.
Burnout is a significant public health concern affecting nearly half of the healthcare workforce. This paper presents the first end-to-end deep learning framework for predicting physician burnout based on electronic health record (EHR) activity logs, digital traces of physician work activities that are available in any EHR system. In contrast to prior approaches that exclusively relied on surveys for burnout measurement, our framework directly learns deep representations of physician behaviors from large-scale clinician activity logs to predict burnout. We propose the Hierarchical burnout Prediction based on Activity Logs (HiPAL), featuring a pre-trained time-dependent activity embedding mechanism tailored for activity logs and a hierarchical predictive model, which mirrors the natural hierarchical structure of clinician activity logs and captures physicians' evolving burnout risk at both short-term and long-term levels. To utilize the large amount of unlabeled activity logs, we propose a semi-supervised framework that learns to transfer knowledge extracted from unlabeled clinician activities to the HiPAL-based prediction model. The experiment on over 15 million clinician activity logs collected from the EHR at a large academic medical center demonstrates the advantages of our proposed framework in predictive performance of physician burnout and training efficiency over state-of-the-art approaches.
Deep Neural Network (DNN) based recommendation systems are widely used in the modern internet industry for a variety of services. However, the rapid expansion of application scenarios and the explosive growth of global internet traffic have caused the industry to face increasing challenges in serving complicated recommendation workflows with respect to online recommendation efficiency and compute resource overhead. In this paper, we present a GPU-accelerated online serving system, namely Lion, which consists of a staged event-driven heterogeneous pipeline, a unified memory manager, and an automatic execution optimizer to handle web-scale traffic in a real-time and cost-effective way. Moreover, Lion provides a heterogeneous template library to enable fast development and migration for diverse in-house web-scale recommendation systems without requiring knowledge of heterogeneous programming. The system is currently deployed at Baidu, supporting over twenty recommendation services, including news feed, short video clips, and the search engine. Extensive experimental studies on five real-world deployed online recommendation services demonstrate the superiority of the proposed GPU-accelerated online serving system. Since its launch in early 2020, Lion has answered billions of recommendation requests per day and has helped Baidu save millions of U.S. dollars in hardware and utility costs per year.
Federated learning (FL) is an important paradigm for training global models from decentralized data in a privacy-preserving way. Existing FL methods usually assume the global model can be trained on any participating client. However, in real applications, client devices are usually heterogeneous and have different computing power. Although big models like BERT have achieved huge success in AI, it is difficult to apply them to heterogeneous FL with weak clients. Straightforward solutions such as removing the weak clients or using a small model to fit all clients lead to problems such as under-representation of the dropped clients and inferior accuracy due to data loss or limited model representation ability. In this work, we propose InclusiveFL, a client-inclusive federated learning method to handle this problem. The core idea of InclusiveFL is to assign models of different sizes to clients with different computing capabilities, bigger models for powerful clients and smaller ones for weak clients. We also propose an effective method to share knowledge among local models of different sizes. In this way, all the clients can participate in FL training, and the final model can be big and powerful enough. Besides, we propose a momentum knowledge distillation method to better transfer knowledge from the big models on powerful clients to the small models on weak clients. Extensive experiments on many real-world benchmark datasets demonstrate the effectiveness of InclusiveFL in learning accurate models from clients with heterogeneous devices under the FL framework.
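As a concrete illustration of the momentum knowledge distillation idea sketched in the abstract above, the following Python snippet shows one plausible way to distill a large client model into a small one while maintaining an exponential-moving-average teacher. All names, model sizes, and hyperparameters here are illustrative assumptions, not details from the InclusiveFL paper.

```python
# Hypothetical sketch of momentum knowledge distillation from a large "teacher"
# model (powerful client) to a small "student" model (weak client).
import copy
import torch
import torch.nn.functional as F

def update_momentum_teacher(teacher, big_model, momentum=0.99):
    """EMA update of the teacher from the latest big-model weights."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), big_model.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual supervised loss with a soft-label distillation term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Usage sketch: the teacher starts as a frozen copy of the big model.
big_model = torch.nn.Linear(16, 4)
small_model = torch.nn.Linear(16, 4)
teacher = copy.deepcopy(big_model)

x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
loss = distillation_loss(small_model(x), teacher(x).detach(), y)
loss.backward()
update_momentum_teacher(teacher, big_model)
```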
On-demand delivery is a new form of logistics where customers place orders through online platforms and the platform arranges couriers to deliver them within a short time. The acquisition of couriers' indoor status (i.e., arrival at or departure from merchants) plays an important role in order dispatching and route planning. Bluetooth Low Energy (BLE) devices are a promising solution for city-wide indoor status estimation due to their low hardware and deployment costs and low power consumption. However, environment and smartphone model heterogeneities affect the status characteristics contained in the Bluetooth signal, resulting in a decline in status estimation performance. Previous methods for alleviating this heterogeneity are not suitable for city-wide scenarios with thousands of merchants and hundreds of smartphone models. In this paper, we propose Para-Pred, an indoor status estimation framework based on a graph neural network, which directly Predicts the effective indoor status estimation model Parameters for unseen scenarios. Our key idea is to utilize the similarity between the influence patterns of heterogeneities on the Bluetooth signal to directly infer the influence patterns of unseen scenarios. We evaluate Para-Pred on 109,378 couriers with 672 smartphone models in 12,109 merchants from an on-demand delivery company. The evaluation results show that, across environment and smartphone model heterogeneities, the accuracy and recall of our method reach 93.62% and 95.20%, outperforming state-of-the-art solutions.
Academic Knowledge Services have substantially facilitated the development of human science and technology, providing a plenitude of useful research tools. However, many applications highly depend on ad-hoc models and expensive human labeling to understand professional content, hindering deployment in the real world. To create a unified backbone language model for various knowledge-intensive academic knowledge mining challenges, we pre-train an academic language model, namely OAG-BERT, based on the world's largest public academic graph, the Open Academic Graph (OAG), to integrate massive heterogeneous entity knowledge beyond scientific corpora. We develop novel pre-training strategies along with zero-shot inference techniques. OAG-BERT's superior performance on 9 knowledge-intensive academic tasks (including 2 demo applications) demonstrates its qualification to serve as a foundation for academic knowledge services. Its zero-shot capability also offers great potential to mitigate the need for costly annotations. OAG-BERT has been deployed in multiple real-world applications, such as reviewer recommendation for the NSFC (National Natural Science Foundation of China) and paper tagging in the AMiner system. All code and pre-trained models are available via CogDL.
The importance of modeling contextual information within a search session has been widely acknowledged. However, learning representations of multi-query multi-modal (MM) search, in which Mobile Taobao users repeatedly submit textual and visual queries, remains unexplored in the literature. Previous work that learns task-specific representations of textual query sessions fails to capture the diverse query types and correlations in MM search sessions. This paper proposes to represent MM search sessions with a heterogeneous graph neural network (HGN). A multi-view contrastive learning framework is proposed to pretrain the HGN, with two views that model intra-query, inter-query, and inter-modality information diffusion in MM search. Extensive experiments demonstrate that the pretrained session representation can benefit state-of-the-art baselines on various downstream tasks, such as personalized click prediction, query suggestion, and intent classification.
One of the most common threats to the reliability of online service systems is disk failure. Many disk failure prediction techniques have been developed to predict failures before they actually occur, allowing proactive steps to be taken to minimize service disruption and increase service reliability. Existing approaches for disk failure prediction do not differentiate among various types of disk failure. In industrial practice, however, different product teams treat distinct types of disk failures as different prediction tasks in large-scale online service systems like Microsoft 365. For example, the hardware operation team is concerned with physical disk errors, while the database service team focuses on I/O delay. In this paper, we propose MTHC (Multi-Task Hierarchical Classification) to enhance the performance of disk failure prediction for each task via multi-task learning. In addition, MTHC introduces a novel hierarchy-aware mechanism to deal with the data imbalance problem, which is a severe issue in the area of disk failure prediction. We show that MTHC can be easily utilized to enhance most state-of-the-art disk failure prediction models. Our experiments on both industrial and public datasets demonstrate that disk failure prediction models enhanced by MTHC perform much better than the same models without MTHC. Furthermore, our experiments also show that the hierarchy-aware mechanism underlying MTHC alleviates the data imbalance problem and thus improves the practical performance of various disk failure prediction models. More encouragingly, the proposed MTHC has been successfully applied to Microsoft 365 online service systems and reduces the number of virtual machine interruptions by an average of 10% per month.
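To make the multi-task setup concrete, the following sketch shows a generic shared-encoder model with one binary head per failure-related task (e.g., physical error vs. I/O delay), which is the usual starting point for this kind of system. It is only an illustration of multi-task learning over disk telemetry; MTHC's hierarchy-aware mechanism is not reproduced here, and all feature sizes and task names are placeholders.

```python
# Illustrative multi-task model for disk-failure prediction: a shared encoder
# over disk telemetry features with one binary head per prediction task.
import torch
import torch.nn as nn

class MultiTaskDiskModel(nn.Module):
    def __init__(self, num_features, task_names, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per prediction task (different failure definitions).
        self.heads = nn.ModuleDict({name: nn.Linear(hidden, 1) for name in task_names})

    def forward(self, x):
        shared = self.encoder(x)
        return {name: head(shared).squeeze(-1) for name, head in self.heads.items()}

model = MultiTaskDiskModel(num_features=32, task_names=["physical_error", "io_delay"])
x = torch.randn(4, 32)
logits = model(x)
labels = {"physical_error": torch.randint(0, 2, (4,)).float(),
          "io_delay": torch.randint(0, 2, (4,)).float()}
loss = sum(nn.functional.binary_cross_entropy_with_logits(logits[t], labels[t])
           for t in logits)
loss.backward()
```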
Managing discount promotional events ("markdown") is a significant part of running an e-commerce business, and inefficiencies here can significantly hamper a retailer's profitability. Traditional approaches for tackling this problem rely heavily on price elasticity modelling. However, the partial-information nature of price elasticity modelling, together with the non-negotiable responsibility for protecting profitability, means that machine learning practitioners must often go to great lengths to define strategies for measuring offline model quality. In the face of this, many retailers fall back on rule-based methods, thus forgoing significant gains in profitability that can be captured by machine learning. In this paper, we introduce two novel end-to-end markdown management systems for optimising markdown at different stages of a retailer's journey. The first system, "Ithax," enacts a rational supply-side pricing strategy without demand estimation, and can be usefully deployed as a "cold start" solution to collect markdown data while maintaining revenue control. The second system, "Promotheus," presents a full framework for markdown optimization with price elasticity. We describe in detail the specific modelling and validation procedures that, in our experience, have been crucial to building a system that performs robustly in the real world. Both markdown systems achieve superior profitability compared to decisions made by our experienced operations teams in a controlled online test, with improvements of 86% (Promotheus) and 79% (Ithax) relative to manual strategies. These systems have been deployed to manage markdown at ASOS.com, and both systems can be fruitfully deployed for price optimization across a wide variety of retail e-commerce settings.
Preference diversity has attracted much research attention in recent years, as it is believed to be closely related to many profound problems such as user activeness in social media and recommendation systems. However, due to the lack of large-scale data with comprehensive user behavior logs and accurate content labels, the real quantitative effect of preference diversity on user activeness is still largely unknown. This paper studies the heterogeneous effect of preference diversity on user activeness in social media. We examine large-scale real-world datasets collected from two of the most popular video-sharing social platforms in China, including the behavior logs of more than 787 thousand users and 1.95 million videos with accurate content category information. We investigate the distribution and evolution of preference diversity, and find rich heterogeneity in the effect of preference diversity on dynamic activeness. Furthermore, we discover that the mechanisms of preference diversity diverge for the same user under different usage scenarios, such as active (where users actively seek information) and passive (where users passively receive information) modes. Unlike existing qualitative studies, we propose a universal mixture model capable of accurately fitting dynamic activeness curves while reflecting the heterogeneous patterns of preference diversity. To the best of our knowledge, this is the first quantitative model that incorporates the effect of preference diversity on user activeness. With the model parameters, we are able to make accurate churn and activeness predictions and provide decision support for increasing user activity through interventions on diversity. Our findings and model comprehensively reveal the significance of preference diversity and provide potential implications for the design of future recommendation systems and social media.
In recent years, machine learning methods have been widely used in modern electronic health record (EHR) systems and have shown more accurate prediction performance on disease risk assessment tasks than traditional methods. However, most existing machine learning methods make the assessment solely based on features of the target case and ignore the cross-sample feature interactions between the target case and other similar cases, which is inconsistent with the general practice of evidence-based medicine of making diagnoses based on existing clinical experience. Moreover, current methods that focus on mining cross-sample information rely on deep neural networks to extract cross-sample feature interactions, which suffer from data insufficiency, data heterogeneity, and lack of interpretability in disease risk assessment tasks. In this work, we propose a novel retrieval-based gradient boosting decision trees (RB-GBDT) model with a cross-sample extractor to mine cross-sample information while exploiting the robustness, generalization, and interpretability of GBDT. Experiments on real-world clinical datasets show the superiority and efficacy of RB-GBDT on disease risk assessment tasks. The developed software has been deployed in a hospital as an auxiliary diagnosis tool for risk assessment of venous thromboembolism.
In this paper, we introduce a large-scale online multi-task deep learning framework for modeling multiple feed ads auction prediction tasks on an industry-scale feed ads recommendation platform. Multiple prediction tasks are combined into one single model which is continuously trained on real-time new ads data. Multi-tasking ads auction models in real time faces many real-world challenges. For example, each task may be trained on a different set of training data; the labels of different tasks may arrive at different times due to label delay; different tasks interact with each other; and combining the losses of the tasks is non-trivial. We tackle these challenges using practical and novel techniques such as multi-stage training for handling label delay, Multi-gate Mixture-of-Experts (MMoE) to optimize model interaction, and an auto-parameter learning algorithm to optimize the loss weights of different tasks. We demonstrate that our proposed techniques lead to quality improvements and substantial resource savings compared to modeling each task independently.
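Since the abstract above leans on Multi-gate Mixture-of-Experts, here is a minimal MMoE sketch in Python: shared experts mixed per task by softmax gates, each feeding a small task tower. Dimensions, the number of experts, and the two task outputs (e.g., click and conversion logits) are made-up placeholders, not details of the paper's production model.

```python
# Minimal Multi-gate Mixture-of-Experts (MMoE) sketch for combining several
# ads-auction prediction tasks in one model.
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim, num_experts=4, expert_dim=32, num_tasks=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)]
        )
        # One softmax gate per task mixes the shared experts.
        self.gates = nn.ModuleList([nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])
        self.towers = nn.ModuleList([nn.Linear(expert_dim, 1) for _ in range(num_tasks)])

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [B, E, D]
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # [B, E, 1]
            mixed = (w * expert_out).sum(dim=1)                        # [B, D]
            outputs.append(tower(mixed).squeeze(-1))                   # per-task logit
        return outputs

model = MMoE(in_dim=64)
ctr_logit, cvr_logit = model(torch.randn(8, 64))
```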
In a social network environment, member status represents a member's social value in the network. A member's abilities represent the member's potential to project his/her social value to others, and also the level of credibility and authority required to hold a certain status. Therefore, the concepts of status and ability are deeply related and should be consistent with each other. In this paper, we establish consistency models among different member statuses and their abilities by analyzing member data and integrating domain knowledge. We use these models to help our members refine their inconsistent statuses and, at the same time, identify ability gaps. To reliably refine a member's status, we introduce a practical, human-in-the-loop methodology to build a status hierarchy. Conditioned on the hierarchical structure, our modeling process exploits the associations between status and abilities. We applied the technique to LinkedIn member titles -- one of the major types of member status -- and member skills -- the main ability representation at LinkedIn. We show that our models are intuitive and perform well. The skill gaps identified are actionable and concise. In this paper, we also discuss the practical aspects of building such systems and how we could deploy the models in production.
In product search, the retrieval of candidate products before re-ranking is more mission-critical and challenging than in other search applications such as web search, especially for tail queries, which have a complex and specific search intent. In this paper, we present a hybrid system for e-commerce search deployed at Walmart that combines a traditional inverted index with embedding-based neural retrieval to better answer user tail queries. Our system significantly improved the relevance of the search engine, as measured by both offline and online evaluations. The improvements were achieved through a combination of different approaches. We present a new technique to train the neural model at scale, and describe how the system was deployed in production with little impact on response time. We highlight multiple learnings and practical tricks that were used in the deployment of this system.
Pre-trained language models like BERT have reported state-of-the-art performance on several Natural Language Processing (NLP) tasks, but their high computational demands hinder widespread adoption for large-scale NLP tasks. In this work, we propose a novel routing-based early-exit model called BE3R (BERT-based Early-Exit using Expert Routing), where we learn to dynamically exit in the earlier layers without needing to traverse the entire model. Unlike existing early-exit methods, our approach can be extended to a batch inference setting. We consider the specific application of search relevance filtering in Amazon India marketplace services (a large e-commerce website). Our experimental results show that BE3R improves batch inference throughput by 46.5% over the BERT-Base model and 35.89% over the DistilBERT-Base model on a large dataset with 50 million samples, without any trade-off on the performance metric. We conduct thorough experimentation using various architectural choices and loss functions, and perform qualitative analysis. We perform experiments on the public GLUE benchmark and demonstrate comparable performance to the corresponding baseline models with a 23% average throughput improvement across tasks in the batch inference setting.
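For readers unfamiliar with early exiting, the sketch below shows the simplest variant: per-layer classifiers with a confidence threshold that stops computation once the whole batch is confident. BE3R instead learns an expert-routing module; this threshold rule, the layer count, and the dimensions are assumptions used only to illustrate the general early-exit idea.

```python
# Generic confidence-based early-exit sketch over a stack of transformer layers
# with one lightweight classifier ("exit head") per layer.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, dim=64, num_layers=6, num_classes=2, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        self.exits = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(num_layers)])
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = layer(x)
            probs = torch.softmax(exit_head(x[:, 0]), dim=-1)  # first-token ("CLS") logits
            if probs.max(dim=-1).values.min() >= self.threshold:
                return probs, depth  # every item in the batch is confident enough
        return probs, depth

model = EarlyExitEncoder()
probs, exit_layer = model(torch.randn(2, 16, 64))
```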
Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure, and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support fine-grained product-metric evaluation, and (iii) optimize for product goals. To address shortcomings of prior platforms, we introduce general principles for and the architecture of an ML platform, Looper, with simple APIs for decision-making and feedback collection. Looper covers the end-to-end ML lifecycle from collecting training data and model training to deployment and inference, and extends support to personalization, causal evaluation with heterogeneous treatment effects, and Bayesian tuning for product goals. During its 2021 production deployment, Looper simultaneously hosted 440-1,000 ML models that made 4-6 million real-time decisions per second. We sum up the experiences of platform adopters and describe their learning curve.
Curbing online hate speech has become the need of the hour; however, a blanket ban on such activities is infeasible for several geopolitical and cultural reasons. To reduce the severity of the problem, in this paper, we introduce a novel task, hate speech normalization, that aims to weaken the intensity of hatred exhibited by an online post. The intention of hate speech normalization is not to support hate but instead to provide the users with a stepping stone towards non-hate while giving online platforms more time to monitor any improvement in the user's behavior. To this end, we manually curated a parallel corpus - hate texts and their normalized counterparts (a normalized text is less hateful and more benign). We introduce NACL, a simple yet efficient hate speech normalization model that operates in three stages - first, it measures the hate intensity of the original sample; second, it identifies the hate span(s) within it; and finally, it reduces hate intensity by paraphrasing the hate spans. We perform extensive experiments to measure the efficacy of NACL via three-way evaluation (intrinsic, extrinsic, and human-study). We observe that NACL outperforms six baselines - NACL yields a score of 0.1365 RMSE for the intensity prediction, 0.622 F1-score in the span identification, and 82.27 BLEU and 80.05 perplexity for the normalized text generation. We further show the generalizability of NACL across other platforms (Reddit, Facebook, Gab). An interactive prototype of NACL was put together for the user study. Further, the tool is being deployed in a real-world setting at Wipro AI as a part of its mission to tackle harmful content on online platforms.
Systems that utilize and manage predictive models have become increasingly significant in industry. In the services offered by Yahoo! JAPAN, once a predictive model has been utilized for recommendations, it is thrown away. Such models could, however, be reused to expand the coverage of other recommendations. Here, our goal is to construct recommendation systems that expand the coverage of recommendations by effectively utilizing models which would otherwise be discarded. Another goal is to deploy such a recommendation system on real services and make practical use of it. In this paper, we describe a recommendation system that achieves these two goals by overcoming the challenges facing its deployment on real services. Specifically, we developed an optimization method that alleviates the psychological barrier against using the recommendation system and clarified the performance of our method in making real recommendations. An offline test and a large-scale online test on real recommendations showed that our method substantially expands the coverage of recommendations. As a highlight of the results, our method made recommendations to 76.9 times more users at the same level of recommendation performance as the recommendation system currently used by the service. Overall, the results show that our method has a huge impact on services and can be applied to real recommendations.
With the surging development of information technology, there are increasing demands and challenges for network analysis in order to provide a high quality of network services. As all data on the Internet are encapsulated and transferred in network packets, packets are widely used for various network traffic analysis tasks, from application identification to intrusion detection. Since the choice of features and how to represent them can greatly affect the performance of downstream tasks, it is critical to learn high-quality packet representations. However, existing packet-level works ignore packet representations and instead focus on obtaining good performance by analyzing different classification tasks independently. In the real world, although a packet may have different class labels for different tasks, the packet representation learned from one task can also help capture its complex packet patterns in other tasks, yet existing works fail to leverage this.
Taking advantage of this potential, in this work we propose a novel framework for packet representation learning across various traffic classification tasks. We learn a packet representation that preserves both the semantic and byte patterns of each packet, and utilize a contrastive loss with a sample selector to optimize the learned representations so that similar packets are closer in the latent semantic space. In addition, the representations are further jointly optimized with the class labels of multiple tasks through losses on the reconstructed representations and the class probabilities. Evaluations demonstrate that the packet representations learned by our proposed framework outperform state-of-the-art baseline methods on a wide range of popular downstream classification tasks by a wide margin in both closed-world and open-world scenarios.
Graph Neural Networks (GNNs) have shown success in learning from graph-structured data, with applications to fraud detection, recommendation, and knowledge graph reasoning. However, training GNNs efficiently is challenging because: 1) GPU memory capacity is limited and can be insufficient for large datasets, and 2) the graph-based data structure causes irregular data access patterns. In this work, we provide a method to statistically analyze and identify the more frequently accessed data ahead of GNN training. Our data tiering method utilizes not only the structure of the input graph but also insights gained from the actual GNN training process to achieve higher prediction accuracy. With our data tiering method, we additionally provide a new data placement and access strategy to further minimize the CPU-GPU communication overhead. We also take multi-GPU GNN training into account and demonstrate the effectiveness of our strategy in a multi-GPU system. The evaluation results show that our work reduces CPU-GPU traffic by 87-95% and improves the training speed of GNNs over existing solutions by 1.6-2.1x on graphs with hundreds of millions of nodes and billions of edges.
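The following sketch illustrates the general data-tiering idea in Python: estimate per-node feature access frequency by simulating neighbor sampling on a CSR graph, then keep the hottest nodes' features on the GPU and the cold remainder in pinned host memory. The sampling procedure, budgets, and thresholds are assumptions for illustration and do not reproduce the paper's statistical analysis.

```python
# Hypothetical data-tiering sketch for GNN training inputs.
import numpy as np
import torch

def estimate_access_frequency(indptr, indices, train_nodes, num_nodes, trials=1000, fanout=10):
    """Count feature accesses over simulated one-hop neighbor sampling from training nodes."""
    rng = np.random.default_rng(0)
    counts = np.zeros(num_nodes, dtype=np.int64)
    for seed in rng.choice(train_nodes, size=trials):
        neigh = indices[indptr[seed]:indptr[seed + 1]]
        if len(neigh) > fanout:
            neigh = rng.choice(neigh, size=fanout, replace=False)
        counts[neigh] += 1
        counts[seed] += 1
    return counts

def tier_features(features, counts, gpu_budget_nodes):
    """Keep the hottest nodes' features on the GPU, the rest in pinned host memory."""
    order = np.argsort(-counts)
    hot, cold = order[:gpu_budget_nodes], order[gpu_budget_nodes:]
    hot_feats = features[torch.from_numpy(hot)]
    cold_feats = features[torch.from_numpy(cold)]
    if torch.cuda.is_available():
        hot_feats = hot_feats.cuda()
        cold_feats = cold_feats.pin_memory()
    return hot, hot_feats, cold, cold_feats

# Tiny CSR graph: 4 nodes on a cycle.
indptr = np.array([0, 2, 4, 6, 8])
indices = np.array([1, 3, 0, 2, 1, 3, 0, 2])
counts = estimate_access_frequency(indptr, indices, train_nodes=np.array([0, 2]),
                                   num_nodes=4, trials=100)
hot, hot_feats, cold, cold_feats = tier_features(torch.randn(4, 8), counts, gpu_budget_nodes=2)
```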
This paper introduces a thermographic vision system to detect different types of hotspots on a variety of cable junctions commonly found in Hydro-Québec's underground electrical distribution network. Cable junctions in underground distribution networks operate in harsh conditions, potentially leading to failure over time. Faults can be prevented by the timely detection of local hotspots on these junctions. Hotspot detection is carried out by means of image segmentation using a deep neural network. Special care is given to uncertainty estimation and validation. Uncertainty is used to assess the quality of a segmentation to avoid misdiagnosis or returning to the field to recapture images. It is also proposed as a tool to evaluate whether unannotated images should be included in the dataset. System performance has been evaluated on a test dataset as well as in the field by regular inspection teams. The promising results obtained so far led to the deployment of the vision system on a fleet of five inspection trucks performing inspections across the province over the last year. Authorization was granted to scale the solution to 35 trucks starting this year.
Continuous evolution in modern software often causes documentation, tutorials, and examples to be out of sync with changing interfaces and frameworks. Relying on outdated documentation and examples can lead programs to fail or to be less efficient or even less secure. In response, programmers need to regularly turn to other resources on the web, such as StackOverflow, for examples to guide them in writing software. We recognize that this inconvenient, error-prone, and expensive process can be improved by using machine learning applied to software usage data. In this paper, we present a practical system which uses machine learning on large-scale telemetry data and documentation corpora to generate appropriate and complex examples that can be used to improve documentation. We discuss both feature-based and transformer-based machine learning approaches and demonstrate that our system achieves 100% coverage of the functionalities used in the product, provides up-to-date examples upon every release, and reduces the number of PRs submitted by software owners writing and editing documentation by >68%. We also share valuable lessons learnt during the 3 years that our production-quality system has been deployed for the Azure Cloud Command Line Interface (Azure CLI).
Speed of delivery is critical for the success of e-commerce platforms. A faster delivery promise to the customer results in increased conversion and revenue. There are typically two mechanisms to control delivery speed: a) replication of products across warehouses, and b) air-shipping the product. In this paper, we present a machine learning based framework to recommend air-shipping eligibility for products. Specifically, we develop a causal inference framework (referred to as Air Shipping Recommendation or ASPIRE) that balances the trade-off between revenue or conversion and delivery cost to decide whether a product should be shipped via air. We propose a doubly-robust estimation technique followed by an optimization algorithm to determine the air eligibility of products and calculate the uplift in revenue and shipping cost.
We ran extensive experiments (both offline and online) to demonstrate the superiority of our technique as compared to the incumbent policies and baseline approaches. ASPIRE resulted in a lift of +79 bps of revenue as measured through an A/B experiment in an emerging marketplace on Amazon.
DNA-stabilized silver nanoclusters (AgN-DNAs) are a class of nanomaterials comprised of 10-30 silver atoms held together by short synthetic DNA template strands. AgN-DNAs are promising biosensors and fluorophores due to their small sizes, natural compatibility with DNA, and bright fluorescence---the property of absorbing light and re-emitting light of a different color. The sequence of the DNA template acts as a "genome" for AgN-DNAs, tuning the size of the encapsulated silver nanocluster, and thus its fluorescence color. However, current understanding of the AgN-DNA genome is still limited. Only a minority of DNA sequences produce highly fluorescent AgN-DNAs, and the bulky DNA strands and complex DNA-silver interactions make it challenging to use first principles chemical calculations to understand and design AgN-DNAs. Thus, a major challenge for researchers studying these nanomaterials is to develop methods to employ observational data about studied AgN-DNAs to design new nanoclusters for targeted applications.
In this work, we present an approach to design AgN-DNAs by employing variational autoencoders (VAEs) as generative models. Specifically, we employ an LSTM-based β-VAE architecture and regularize its latent space to correlate with AgN-DNA properties such as color and brightness. The regularization is adaptive to skewed sample distributions of available observational data along our design axes of properties. We employ our model for design of AgN-DNAs in the near-infrared (NIR) band, where relatively few AgN-DNAs have been observed to date. Wet lab experiments validate that when employed for designing new AgN-DNAs, our model significantly shifts the distribution of AgN-DNA colors towards the NIR while simultaneously achieving bright fluorescence. This work shows that VAE-based generative models are well-suited for the design of AgN-DNAs with multiple targeted properties, with significant potential to advance the promising applications of these nanomaterials for bioimaging, biosensing, and other critical technologies.
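To give a concrete sense of the property-regularized β-VAE objective described above, the toy sketch below combines reconstruction, a β-weighted KL term, and an auxiliary regression head that ties the latent space to a measured property (e.g., fluorescence color). The MLP encoder/decoder, dimensions, and loss weights are placeholders; the paper uses an LSTM-based architecture with adaptive regularization that is not reproduced here.

```python
# Simplified sketch of a property-regularized beta-VAE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PropertyVAE(nn.Module):
    def __init__(self, in_dim=40, latent_dim=8):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # outputs mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)
        self.prop_head = nn.Linear(latent_dim, 1)      # predicts a target property

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar, self.prop_head(z).squeeze(-1)

def loss_fn(x, recon, mu, logvar, prop_pred, prop_true, beta=4.0, gamma=1.0):
    recon_loss = F.mse_loss(recon, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    prop_loss = F.mse_loss(prop_pred, prop_true)       # latent-property regularizer
    return recon_loss + beta * kl + gamma * prop_loss

model = PropertyVAE()
x = torch.randn(16, 40)
prop = torch.randn(16)
recon, mu, logvar, prop_pred = model(x)
loss = loss_fn(x, recon, mu, logvar, prop_pred, prop)
loss.backward()
```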
We present GradMask, a simple adversarial example detection scheme for natural language processing (NLP) models. It uses gradient signals to detect adversarially perturbed tokens in an input sequence and occludes such tokens via a masking process. GradMask provides several advantages over existing methods, including improved detection performance and an interpretation of its decisions, at only a moderate computational cost. Its approximate inference cost is no more than a single forward and backward pass through the target model, without requiring any additional detection module. Extensive evaluation on widely adopted NLP benchmark datasets demonstrates the efficiency and effectiveness of GradMask. Code and models are available at https://github.com/Han8931/grad_mask_detection
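The snippet below illustrates the general gradient-signal-plus-occlusion idea in Python: score each token by the gradient norm of the loss with respect to its embedding, then mask the highest-scoring tokens and re-run the classifier. The tiny model, vocabulary, and the top-k decision rule are assumptions for illustration; GradMask's exact scoring and decision procedure may differ (see the released code).

```python
# Rough sketch of gradient-based token occlusion for adversarial detection.
import torch
import torch.nn as nn

vocab_size, dim, mask_id = 1000, 32, 0
embedding = nn.Embedding(vocab_size, dim)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(16 * dim, 2))

tokens = torch.randint(1, vocab_size, (1, 16))
label = torch.tensor([1])

emb = embedding(tokens)
emb.retain_grad()                              # keep gradients on the embedding activations
loss = nn.functional.cross_entropy(classifier(emb), label)
loss.backward()

saliency = emb.grad.norm(dim=-1)               # per-token gradient norm
suspect = saliency.topk(k=3, dim=-1).indices   # most suspicious token positions
masked_tokens = tokens.clone()
masked_tokens[0, suspect[0]] = mask_id         # occlude them and re-classify
with torch.no_grad():
    logits = classifier(embedding(masked_tokens))
```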
The ability to accurately pinpoint the location of an event (e.g., a loss, fault, or bug) is a fundamental requirement in many systems. While we have state-of-the-art models to predict the likelihood of an outcome, being able to pinpoint the entity responsible for the outcome is also important. For example, in an e-commerce setup, a lost package detection system needs to infer the reason or location (delivery station, sort center, trucks) of a missing item, a network management system needs to diagnose faulty nodes based on end-to-end packet flow traces, and a compiler needs to point out the exact location of erroneous code. In this paper, we present an attention-based neural architecture for entity localization to accurately pinpoint the location of package loss in a delivery network and of bugs in erroneous programs. Our model performs well in scenarios where there is no annotation/ground truth for the entities to localize. It can also adapt itself if annotations/ground truth are available for even a subset of entities by leveraging semi-supervision. The core of our model is a ladder-style architecture that helps us achieve state-of-the-art performance in both entity localization and detection. Further, to show the generality of our approach, we demonstrate its performance on a bug localization task for software programs. On a publicly available dataset, our solution outperforms the state-of-the-art technique by a significant margin.
In several e-commerce scenarios, pricing long-tail products effectively is a central task for companies, and there is broad agreement that Artificial Intelligence (AI) will play a prominent role in doing so in the near future. Nevertheless, dealing with long-tail products raises major open technical issues due to data scarcity, which precludes the adoption of mainstream approaches that usually require a huge amount of data, such as deep learning. In this paper, we provide a novel online learning algorithm for dynamic pricing that deals with non-stationary settings due to, e.g., seasonality or adaptive competitors, and is very data-efficient thanks to assumptions, such as the monotonicity of the demand curve in the price, that are customarily satisfied in long-tail markets. Furthermore, our dynamic pricing algorithm is paired with a clustering algorithm for long-tail products which aggregates similar products so that the data of all the products in the same cluster are merged and used to choose their best price. We first evaluate our algorithms in an offline synthetic setting, comparing their performance with the state of the art and showing that our algorithms are more robust and data-efficient in long-tail settings. Subsequently, we evaluate our algorithms in an online setting with more than 8,000 products, including popular and long-tail ones, in an A/B test with humans for about two months. The increase in revenue thanks to our algorithms is about 18% for the popular products and about 90% for the long-tail products.
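As a toy illustration of how the monotone-demand assumption can be exploited for data-efficient online pricing, the sketch below maintains Beta posteriors over conversion at a few candidate prices (for one product or one cluster of pooled products) and enforces monotonicity with a running minimum across increasing prices. This is a generic Thompson-sampling-style illustration under those assumptions, not the paper's algorithm.

```python
# Toy monotone-demand bandit over a discrete price grid.
import numpy as np

rng = np.random.default_rng(0)
prices = np.array([9.9, 11.9, 13.9, 15.9])
successes = np.ones(len(prices))   # Beta(1, 1) priors on conversion per price
failures = np.ones(len(prices))

def choose_price():
    sampled = rng.beta(successes, failures)
    sampled = np.minimum.accumulate(sampled)      # demand cannot increase with price
    expected_revenue = prices * sampled
    return int(np.argmax(expected_revenue))

def update(price_idx, converted):
    if converted:
        successes[price_idx] += 1
    else:
        failures[price_idx] += 1

# One simulated interaction: pick a price, observe a (fake) conversion, update.
idx = choose_price()
update(idx, converted=rng.random() < 0.3)
```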
Estimation of the treatment efficacy of real-world clinical interventions involves working with continuous time-to-event outcomes such as time-to-death, re-hospitalization, or a composite event that may be subject to censoring. Counterfactual reasoning in such scenarios requires decoupling the effects of confounding physiological characteristics that affect baseline survival rates from the effects of the interventions being assessed. In this paper, we present a latent variable approach to model heterogeneous treatment effects by proposing that an individual can belong to one of several latent clusters with distinct response characteristics. We show that this latent structure can mediate the base survival rates and help determine the effects of an intervention. We demonstrate the ability of our approach to discover actionable phenotypes of individuals based on their treatment response using multiple large randomized clinical trials originally conducted to assess appropriate treatment strategies for reducing cardiovascular risk.
We study a crowdsourcing setting where we need to infer the latent truth about a task given observed labels together with context in the form of a classifier score. We present Theodon, a hierarchical non-parametric Bayesian model, developed and deployed at Meta, that captures both the prevalence of label categories and the accuracy of labelers as functions of the classifier score. Theodon uses Gaussian processes to model the non-uniformity of mistakes over the range of classifier scores. For our experiments, we used data generated from integrity applications at Meta as well as public datasets. We showed that Theodon (1) obtains 1-4% improvement in AUC-PR predictions on items' true labels compared to state-of-the-art baselines for public datasets, (2) is effective as a calibration method, and (3) provides detailed insights on labelers' performances.
With the increasing adoption of machine learning (ML) models and systems in high-stakes settings across different industries, guaranteeing a model's performance after deployment has become crucial. Monitoring models in production is a critical aspect of ensuring their continued performance and reliability. We present Amazon SageMaker Model Monitor, a fully managed service that continuously monitors the quality of machine learning models hosted on Amazon SageMaker. Our system automatically detects data, concept, bias, and feature attribution drift in models in real-time and provides alerts so that model owners can take corrective actions and thereby maintain high quality models. We describe the key requirements obtained from customers, system design and architecture, and methodology for detecting different types of drift. Further, we provide quantitative evaluations followed by use cases, insights, and lessons learned from more than two years of production deployment.
Predictive maintenance (PdM) is the task of scheduling maintenance operations based on a statistical analysis of the system's condition. We propose a human-in-the-loop PdM approach in which a machine learning system predicts future problems in sets of workstations (computers, laptops, and servers). Our system interacts with domain experts to improve predictions and elicit their knowledge. In our approach, domain experts are included in the loop not only as providers of correct labels, as in traditional active learning, but also as a source of explicit decision rule feedback. The system is automated and designed to be easily extended to novel domains, such as maintaining the workstations of several organizations. In addition, we develop a simulator for reproducible experiments in a controlled environment and deploy the system in a large-scale, real-life workstation PdM setting with thousands of workstations across dozens of companies.
Despite advances in the field of Graph Neural Networks (GNNs), only a small number (~5) of datasets are currently used to evaluate new models. This continued reliance on a handful of datasets provides minimal insight into the performance differences between models, and is especially challenging for industrial practitioners who are likely to have datasets which are very different from academic benchmarks. In the course of our work on GNN infrastructure and open-source software at Google, we have sought to develop benchmarks that are robust, tunable, scalable, and generalizable.
In this work we introduce GraphWorld, a novel methodology and system for benchmarking GNN models on an arbitrarily large population of synthetic graphs for any conceivable GNN task. GraphWorld allows a user to efficiently generate a world with millions of statistically diverse datasets. It is accessible, scalable, and easy to use. GraphWorld can be run on a single machine without specialized hardware, or it can be easily scaled up to run on arbitrary clusters or cloud frameworks. Using GraphWorld, a user has fine-grained control over graph generator parameters, and can benchmark arbitrary GNN models with built-in hyperparameter tuning. We present insights from GraphWorld experiments regarding the performance characteristics of thirteen GNN models and baselines over millions of benchmark datasets. We further show that GraphWorld efficiently explores regions of benchmark dataset space not covered by standard benchmarks, revealing comparisons between models that have not been historically obtainable. Using GraphWorld, we are also able to study in detail the relationship between graph properties and task performance metrics, which is nearly impossible with the classic collection of real-world benchmarks.
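The recipe below conveys the spirit of generating a population of statistically diverse synthetic benchmark graphs by sweeping the parameters of a random-graph generator (here a two-block stochastic block model via networkx). It is not GraphWorld's actual API or generator family; block sizes and the parameter grid are placeholders.

```python
# Illustrative sweep over SBM parameters to build a small population of
# synthetic benchmark graphs, GraphWorld-style.
import itertools
import networkx as nx

def generate_population(block_sizes=(100, 100)):
    graphs = []
    for p_in, p_out in itertools.product([0.05, 0.1, 0.2], [0.005, 0.01, 0.02]):
        probs = [[p_in, p_out], [p_out, p_in]]   # within- vs. between-block edge probabilities
        g = nx.stochastic_block_model(list(block_sizes), probs, seed=42)
        graphs.append({"graph": g, "p_in": p_in, "p_out": p_out})
    return graphs

population = generate_population()
print(len(population), "synthetic benchmark graphs")
```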
Sequential models have become increasingly popular in powering personalized recommendation systems over the past several years. These approaches traditionally model a user's actions on a website as a sequence to predict the user's next action. While conceptually simple, these models are quite challenging to deploy in production, commonly requiring streaming infrastructure to reflect the latest user activity and potentially managing mutable data for encoding a user's hidden state. Here we introduce PinnerFormer, a user representation trained to predict a user's future long-term engagement using a sequential model of the user's recent actions. Unlike prior approaches, we adapt our modeling to a batch infrastructure via our new dense all-action loss, modeling long-term future actions instead of next-action prediction. We show that by doing so, we significantly close the gap between batch user embeddings that are generated once a day and real-time user embeddings generated whenever a user takes an action. We describe our design decisions via extensive offline experimentation and ablations, and validate the efficacy of our approach in A/B experiments showing substantial improvements in Pinterest's user retention and engagement when comparing PinnerFormer against our previous user representation. PinnerFormer is deployed in production as of Fall 2021.
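The sketch below gives one plausible reading of a "dense all-action"-style objective: the user embedding at every sequence position is trained to score all of that user's future positive items above in-batch negatives. The shapes, normalization, and in-batch negative sampling are simplifying assumptions; the paper's exact loss and negative-sampling scheme may differ.

```python
# Hedged sketch of a dense all-action training objective.
import torch
import torch.nn.functional as F

def dense_all_action_loss(user_states, future_items, temperature=0.1):
    """
    user_states:  [B, T, D]  per-position user embeddings from the sequence model
    future_items: [B, K, D]  embeddings of K future positive items per user
    """
    u = F.normalize(user_states, dim=-1)                           # [B, T, D]
    B, T, D = u.shape
    K = future_items.shape[1]
    items = F.normalize(future_items, dim=-1).reshape(B * K, D)
    scores = torch.einsum("btd,nd->btn", u, items) / temperature   # [B, T, B*K]
    loss = 0.0
    for k in range(K):
        # The k-th future positive of user b sits at row b*K + k of `items`;
        # every other row in the batch acts as a negative.
        labels = (torch.arange(B) * K + k).unsqueeze(1).expand(B, T)
        loss = loss + F.cross_entropy(scores.reshape(B * T, -1), labels.reshape(-1))
    return loss / K

loss = dense_all_action_loss(torch.randn(4, 8, 16), torch.randn(4, 3, 16))
loss.backward()
```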
As the fundamental basis of sponsored search, relevance modeling measures the closeness between input queries and candidate ads. Conventional relevance models rely solely on textual data and thus suffer from the scarce semantic signals within short queries. Recently, user historical click behaviors have been incorporated in the form of click graphs to provide additional correlations beyond pure textual semantics, which helps advance relevance modeling performance. However, user behaviors are usually arbitrary and unpredictable, leading to noisy and sparse graph topology. In addition, there exist other types of user behaviors besides clicks, which may also provide complementary information. In this paper, we study the novel problem of heterogeneous behavior graph learning to facilitate the relevance modeling task. Our motivation lies in learning an optimal and task-relevant heterogeneous behavior graph consisting of multiple types of user behaviors. We further propose a novel HBGLR model to learn the behavior graph structure by mining the sophisticated correlations between node semantics and graph topology, and to encode the textual semantics and structural heterogeneity into the learned representations. Our proposal is evaluated on real-world industry datasets, and has been mainstreamed in Bing Ads. Both offline and online experimental results demonstrate its superiority.
We introduce temporal multimodal multivariate learning, a new family of decision-making models that can indirectly learn and transfer online information from simultaneous observations of a probability distribution with more than one peak or more than one outcome variable from one time stage to another. We approximate the posterior by sequentially removing additional uncertainties across different variables and time, based on data-physics-driven correlation, to address a broader class of challenging time-dependent decision-making problems under uncertainty. Extensive experiments on real-world datasets (i.e., urban traffic data and hurricane ensemble forecasting data) demonstrate the superior performance of the proposed targeted decision-making over state-of-the-art baseline prediction methods across various settings.
Modern climate models offer simulation results that provide unprecedented details at the local level. However, even with powerful supercomputing facilities, their computational complexity and associated costs pose a limit on simulation resolution that is needed for agile planning of resource allocation, parameter calibration, and model reproduction. As regional information is vital for policymakers, data from coarse-grained resolution simulations undergo the process of "statistical downscaling" to generate higher-resolution projection at a local level. We present a new method for downscaling climate simulations called GINE (Geospatial INformation Encoded statistical downscaling). To preserve the characteristics of climate simulation data during this process, our model applies the latest computer vision techniques over topography-driven spatial and local-level information. The comprehensive evaluations on 2x, 4x, and 8x resolution factors show that our model substantially improves performance in terms of RMSE and the visual quality of downscaled data.
Accurate document layout analysis is a key requirement for high-quality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced only from scientific article repositories such as PubMed and arXiv. Consequently, the accuracy of the layout segmentation drops significantly when these models are applied to more challenging and diverse layouts. In this paper, we present DocLayNet, a new, publicly available, document-layout annotation dataset in COCO format. It contains 80,863 manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding-boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement. In multiple experiments, we provide baseline accuracy scores (in mAP) for a set of popular object detection models. We also demonstrate that these models fall approximately 10% behind the inter-annotator agreement. Furthermore, we provide evidence that DocLayNet is of sufficient size. Lastly, we compare models trained on PubLayNet, DocBank, and DocLayNet, showing that layout predictions of the DocLayNet-trained models are more robust and thus the preferred choice for general-purpose document-layout analysis.
Mobile notification systems play a major role in a variety of applications to communicate, send alerts and reminders to the users to inform them about news, events or messages. In this paper, we formulate the near-real-time notification decision problem as a Markov Decision Process where we optimize for multiple objectives in the rewards. We propose an end-to-end offline reinforcement learning framework to optimize sequential notification decisions. We address the challenge of offline learning using a Double Deep Q-network method based on Conservative Q-learning that mitigates the distributional shift problem and Q-value overestimation. We illustrate our fully-deployed system and demonstrate the performance and benefits of the proposed approach through both offline and online experiments.
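To make the offline RL component above concrete, the following sketch adds a Conservative Q-Learning (CQL) regularizer on top of a Double DQN temporal-difference loss, which is one standard way to mitigate Q-value overestimation and distributional shift when training only on logged decisions. The network sizes, action space, and alpha weight are placeholders, not the deployed system's configuration.

```python
# Sketch of a CQL-regularized Double DQN loss for offline training on logged data.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cql_double_dqn_loss(q_net, target_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s_next, done = batch                       # tensors from the offline log
    q_all = q_net(s)                                    # [B, num_actions]
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online net picks the action, the target net evaluates it.
        next_a = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, next_a).squeeze(1)
        td_target = r + gamma * (1.0 - done) * q_next

    td_loss = F.smooth_l1_loss(q_sa, td_target)
    # CQL term: push down Q-values of unseen actions relative to logged actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_sa).mean()
    return td_loss + alpha * conservative

num_actions = 3
q_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, num_actions))
target_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, num_actions))
batch = (torch.randn(16, 8), torch.randint(0, num_actions, (16,)),
         torch.randn(16), torch.randn(16, 8), torch.zeros(16))
loss = cql_double_dqn_loss(q_net, target_net, batch)
loss.backward()
```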
Self-supervised learning establishes a new paradigm of learning representations with much fewer or even no label annotations. Recently there has been remarkable progress on large-scale contrastive learning models which require substantial computing resources, yet such models are not practically optimal for small-scale tasks. To fill the gap, we aim to study contrastive learning on the wearable-based activity recognition task. Specifically, we conduct an in-depth study of contrastive learning from both algorithmic-level and task-level perspectives. For algorithmic-level analysis, we decompose contrastive models into several key components and conduct rigorous experimental evaluations to better understand the efficacy and rationale behind contrastive learning. More importantly, for task-level analysis, we show that the wearable-based signals bring unique challenges and opportunities to existing contrastive models, which cannot be readily solved by existing algorithms. Our thorough empirical studies suggest important practices and shed light on future research challenges. In the meantime, this paper presents an open-source PyTorch library CL-HAR, which can serve as a practical tool for researchers. The library is highly modularized and easy to use, which opens up avenues for exploring novel contrastive models quickly in the future.
Waterfall Recommender Systems (RS), a popular form of RS in mobile applications, present a stream of recommended items as successive pages that can be browsed by scrolling. In a waterfall RS, when a user finishes browsing a page, the edge (e.g., mobile phone) sends a request to the cloud server to get a new page of recommendations, known as the paging request mechanism. RSs typically put a large number of items into one page to reduce the excessive resource consumption caused by numerous paging requests, which, however, diminishes the RS's ability to renew the recommendations in a timely manner according to users' real-time interest and leads to a poor user experience. Intuitively, inserting additional requests inside pages to update the recommendations with a higher frequency can alleviate the problem. However, previous attempts, which include only non-adaptive strategies (e.g., inserting requests uniformly), would eventually lead to resource overconsumption. To this end, we envision a new learning task of edge intelligence named Intelligent Request Strategy Design (IRSD). It aims to improve the effectiveness of waterfall RSs by determining the appropriate occasions for request insertion based on users' real-time intention. Moreover, we propose a new paradigm of adaptive request insertion strategy named the Uplift-based On-edge Smart Request Framework (AdaRequest). AdaRequest 1) captures the dynamic change of users' intentions by matching their real-time behaviors with their historical interests based on attention-based neural networks; 2) estimates the counterfactual uplift in user purchases brought by an inserted request based on causal inference; and 3) determines the final request insertion strategy by maximizing a utility function under online resource constraints. We conduct extensive experiments on both an offline dataset and an online A/B test to verify the effectiveness of AdaRequest. Remarkably, AdaRequest has been deployed on the waterfall RS of Taobao and brought an over 3% lift in Gross Merchandise Value (GMV).
In this paper we develop a framework for analyzing patterns of a disease or pandemic such as Covid. Given a dataset which records information about the spread of a disease over a set of locations, we consider the problem of identifying both the disease's intrinsic waves (temporal patterns) and their respective spatial epicenters. To do so we introduce a new method of spatio-temporal decomposition which we call diffusion NMF (D-NMF). Building upon classic matrix factorization methods, D-NMF takes into consideration a spatial structuring of locations (features) in the data and supports the idea that locations which are spatially close are more likely to experience the same set of waves. To illustrate the use of D-NMF, we analyze Covid case data at various spatial granularities. Our results demonstrate that D-NMF is very useful in separating the waves of an epidemic and identifying a few centers for each wave.
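The sketch below shows a graph-regularized NMF in the spirit of the spatio-temporal decomposition described above: a locations-by-time case matrix X is factored as W H, with rows of W (locations) encouraged to be smooth over a spatial adjacency graph A. The multiplicative updates follow the classic graph-regularized NMF recipe; D-NMF's exact diffusion formulation may differ, and the graph, rank, and regularization weight here are placeholders.

```python
# Graph-regularized NMF sketch: X ~ W H with spatial smoothness on W.
import numpy as np

def graph_regularized_nmf(X, A, rank=3, lam=0.1, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n_loc, n_time = X.shape
    W = rng.random((n_loc, rank))        # spatial (epicenter) loadings
    H = rng.random((rank, n_time))       # temporal wave profiles
    D = np.diag(A.sum(axis=1))           # degree matrix of the spatial graph
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T + lam * A @ W) / (W @ H @ H.T + lam * D @ W + eps)
    return W, H

# Tiny synthetic example: 5 locations on a chain graph, 30 time steps.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
X = np.abs(np.random.default_rng(1).random((5, 30)))
W, H = graph_regularized_nmf(X, A)
```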
In this paper, we present NxtPost, a deployed user-to-post content-based sequential recommender system for Facebook Groups. Inspired by recent advances in NLP, we have adapted a Transformer-based model to the domain of sequential recommendation. We explore causal masked multi-head attention that optimizes for both short- and long-term user interests. From a user's past activities, validated by a defined safety process, NxtPost seeks to learn a representation of the user's dynamic content preference and to predict the next post the user may be interested in. In contrast to previous Transformer-based methods, we do not assume that the recommendable posts have a fixed corpus. Accordingly, we use an external item/token embedding to extend a sequence-based approach to a large vocabulary. We achieve a 49% absolute improvement in offline evaluation. As a result of NxtPost deployment, 0.6% more users are meeting new people, engaging with the community, sharing knowledge, and getting support. The paper shares our experience in developing a personalized sequential recommender system, lessons learned from deploying the model for cold-start users, how to deal with freshness, and tuning strategies to reach higher efficiency in online A/B experiments.
With the rise of deep learning (DL), machine learning (ML) has become compute and data intensive, typically requiring multi-node multi-GPU clusters. As state-of-the-art models grow in size in the order of trillions of parameters, their computational complexity and cost also increase rapidly. Since 2012, the cost of deep learning doubled roughly every quarter, and this trend is likely to continue. ML practitioners have to cope with common challenges of efficient resource utilization when training such large models. In this paper, we propose a new profiling tool that cross-correlates relevant system utilization metrics and framework operations. The tool supports profiling DL models at scale, identifies performance bottlenecks, and provides insights with recommendations. We deployed the profiling functionality as an add-on to Amazon SageMaker Debugger, a fully-managed service that leverages an on-the-fly analysis system (called rules) to automatically identify complex issues in DL training jobs. By presenting deployment results and customer case studies, we show that it enables users to identify and fix issues caused by inefficient hardware resource usage, thereby reducing training time and cost.
In recent years, automatic computational systems based on deep learning have been widely used in medical fields, such as automatic diagnosis and disease prediction. Most of these systems are designed for data-sufficient scenarios. However, due to disease rarity or privacy, medical data are always insufficient. Applying these data-hungry deep learning models to insufficient data is likely to lead to over-fitting and cause serious performance problems. Many data augmentation methods have been proposed to solve the data insufficiency problem, such as using GANs (Generative Adversarial Networks) to generate training data. However, the augmented data usually contain a lot of noise, and directly using them to train sensitive medical models makes it very difficult to achieve satisfactory results.
To overcome this problem, we propose a novel deep model learning method for insufficient EHR (Electronic Health Record) data modeling, namely GRACE, which stands for GeneRative Adversarial networks enhanCed prE-training. In this method, we propose an item-relation-aware GAN to capture changing trends and correlations among data for generating high-quality EHR records. Furthermore, we design a pre-training mechanism consisting of a masked records prediction task and a real-fake contrastive learning task to learn representations for EHR data using both generated and real data. After the pre-training, only the representations of real data are used to train the final prediction model. In this way, we can fully exploit the useful information in generated data through pre-training, and also avoid the problems caused by directly using noisy generated data to train the final prediction model. The effectiveness of the proposed method is evaluated through extensive experiments on three healthcare-related real-world datasets. We also deploy our method in a maternal and child health care hospital for an online test. Both offline and online experimental results demonstrate the effectiveness of the proposed method. We believe doctors and patients can benefit from our effective learning method in various healthcare-related applications.
In this paper, we introduce ChemicalX, a PyTorch-based deep learning library designed to provide a range of state-of-the-art models for the drug pair scoring task. The primary objective of the library is to make deep drug pair scoring models accessible to machine learning researchers and practitioners in a streamlined framework. The design of ChemicalX reuses existing high-level model training utilities, geometric deep learning layers, and deep chemistry layers from the PyTorch ecosystem. Our system provides neural network layers, custom pair scoring architectures, data loaders, and batch iterators for end users. We showcase these features with example code snippets and case studies that highlight the characteristics of ChemicalX. A range of experiments on real-world drug-drug interaction, polypharmacy side effect, and combination synergy prediction tasks demonstrate that the models available in ChemicalX are effective at solving the pair scoring task. Finally, we show that ChemicalX can be used to train and score machine learning models on large drug pair datasets with hundreds of thousands of compounds on commodity hardware.
Service time is a component of the time cost in last-mile delivery: the time spent delivering parcels at a certain location. Predicting the service time is fundamental for many downstream logistics applications, e.g., route planning with time windows, courier workload balancing, and delivery time prediction. Nevertheless, it is non-trivial given the complex delivery circumstances, location heterogeneity, and skewed observations in space. The existing solution trains a supervised model based on aggregated features extracted from the parcels to deliver, which cannot handle the above challenges well. In this paper, we propose MetaSTP, a meta-learning based neural network model to predict the service time. MetaSTP treats service time prediction at each location as a learning task, leverages a Transformer-based representation layer to encode the complex delivery circumstances, and devises a model-based meta-learning method enhanced by location prior knowledge to preserve the uniqueness of each location and handle the imbalanced distribution issue. Experiments show MetaSTP outperforms baselines by at least 9.5% and 7.6% on two real-world datasets. Finally, an intelligent waybill assignment system based on MetaSTP has been deployed and is used internally at JD Logistics.
In this study, a scalable and real-time dispatching algorithm based on reinforcement learning is proposed and, for the first time, deployed at large scale. Current dispatching methods in ride-hailing platforms are predominantly based on myopic or rule-based non-myopic approaches. Reinforcement learning enables dispatching policies that are informed by historical data and able to employ the learned information to optimize the returns of expected future trajectories. Previous studies in this field have yielded promising results, yet have left room for further improvement in terms of performance gain, self-dependency, transferability, and scalable deployment mechanisms. The present study proposes a standalone RL-based dispatching solution equipped with multiple novel mechanisms to ensure robust and efficient on-policy learning and inference while being adaptable for full-scale deployment. In particular, a new form of value updating based on temporal difference is proposed that is better adapted to the inherent uncertainty of the problem. For the driver-order assignment problem, a customized utility function is proposed that, when tuned to the statistics of the market, results in remarkable performance improvement and interpretability. In addition, to reduce the risk of cancellation after drivers' assignment, an adaptive graph pruning strategy based on the multi-armed bandit problem is introduced. The method is evaluated using offline simulation with real data and yields notable performance improvement. In addition, the algorithm is deployed online in multiple cities under DiDi's operation for A/B testing and, more recently, has been launched in one of the major international markets as the primary mode of dispatch. The deployed algorithm shows over 1.3% improvement in total driver income in A/B testing. In addition, by causal inference analysis, as much as 5.3% improvement in major performance metrics is detected after full-scale deployment.
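As a point of reference for the value-updating idea, here is a minimal sketch of a vanilla TD(0) update over a spatiotemporal value table; the paper's modified update differs, and the (zone, time-slot) state keys and numbers below are hypothetical.

```python
# Vanilla TD(0) value update for a spatiotemporal dispatching value function.
# The paper proposes a modified update adapted to uncertainty; this sketch shows
# only the standard TD(0) baseline, with hypothetical (zone, time-slot) state keys.
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.05
V = defaultdict(float)  # state -> estimated long-term value

def td0_update(origin_state, dest_state, trip_reward):
    """Move V(origin) toward trip_reward + gamma * V(destination)."""
    target = trip_reward + GAMMA * V[dest_state]
    V[origin_state] += ALPHA * (target - V[origin_state])

# A completed trip from zone 12 at time slot 8 to zone 30 at slot 9, earning 14.5.
td0_update(origin_state=(12, 8), dest_state=(30, 9), trip_reward=14.5)
```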
Selecting an appropriate and relevant context forms an essential component for the efficacy of several information retrieval applications like Question Answering (QA) systems. The problem of Answer Sentence Selection (AS2) refers to the task of selecting sentences, from a larger text, that are relevant and contain the answer to users' queries. While there has been a lot of success in building AS2 systems trained on open-domain data (e.g., SQuAD, NQ), they do not generalize well in closed-domain settings, since domain adaptation can be challenging due to poor availability and annotation expense of domain-specific data. This paper proposes SEDAN, an effective self-learning framework to adapt AS2 models for domain-specific applications. We leverage large pre-trained language models to automatically generate domain-specific QA pairs for domain adaptation. We further fine-tune a pre-trained Sentence-BERT architecture to capture semantic relatedness between questions and answer sentences for AS2. Extensive experiments demonstrate the effectiveness of our proposed approach (over existing state-of-the-art AS2 baselines) on different Question Answering benchmark datasets.
Anomalous sound detection (ASD) is one of the most significant tasks in the monitoring and maintenance of mechanical equipment in complex industrial systems. In practice, it is vital to efficiently identify the abnormal status of a working mechanical system, which further facilitates failure troubleshooting. In this paper, we propose a multi-pattern adversarial learning one-class classification framework, which allows us to use both the generator and the discriminator of an adversarial model for efficient ASD. The core idea is to learn to reconstruct the normal patterns of acoustic data through two different patterns from auto-encoding generators, which generalizes the fundamental role of the discriminator from identifying real and fake data to distinguishing between regional and local pattern reconstructions. Moreover, we design a novel balanceable detection strategy using both generators and the discriminator to achieve anomaly detection efficiently. Furthermore, we present a global filter layer for long-term interactions in the frequency domain, which learns directly from the original data without introducing any human priors. Extensive experiments are performed on four real-world datasets from different industrial domains (three cavitation datasets from SAMSON AG, and one publicly available dataset), all showing superior results and outperforming recent state-of-the-art ASD methods.
We introduce the generalized deep mixed model (GDMix), a class of machine learning models for large-scale recommender systems that combines the power of deep neural networks and the efficiency of logistic regression. GDMix leverages state-of-the-art deep neural networks (DNNs) as the global models (fixed effects), and further improves performance by adding entity-specific personalized models (random effects). For instance, the click response from a particular user m to a job posting j may consist of contributions from a DNN model common to all users and job postings, a model specific to the user m, and a model specific to the job j. GDMix models not only possess powerful modeling capabilities but also enjoy high training efficiency, especially for web-scale recommender systems. We demonstrate these capabilities by detailing their use in Feed and Ads recommendation at LinkedIn. The source code for the GDMix training framework is available at https://github.com/linkedin/gdmix under the BSD-2-Clause License.
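The fixed-plus-random-effects composition can be sketched as follows; the class names, the shared feature vector x, and the per-entity linear heads are illustrative assumptions rather than the GDMix implementation.

```python
# Sketch of a fixed-effect + random-effect score composition (illustrative only).
import torch
import torch.nn as nn

class GlobalDNN(nn.Module):
    """Fixed-effect model shared by all users and job postings."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def gdmix_score(x, user_id, job_id, global_dnn, user_models, job_models):
    """Click probability from: global DNN logit + per-user and per-job logits."""
    logit = global_dnn(x)
    logit = logit + user_models[user_id](x).squeeze(-1)  # user-specific random effect
    logit = logit + job_models[job_id](x).squeeze(-1)    # job-specific random effect
    return torch.sigmoid(logit)

dim = 16
global_dnn = GlobalDNN(dim)
user_models = {"user_m": nn.Linear(dim, 1)}  # one small model per entity (hypothetical ids)
job_models = {"job_j": nn.Linear(dim, 1)}
prob = gdmix_score(torch.rand(4, dim), "user_m", "job_j", global_dnn, user_models, job_models)
```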
With the current advancements in mobile and sensing technologies used to collect real-time data in offline stores, retailers and wholesalers have attempted to develop recommender systems to enhance sales and customer experience. However, existing studies on recommender systems have primarily focused on e-commerce platforms and other online services. They did not consider the unique features of indoor shopping in real stores such as the physical environments and objects, which significantly affect the movement and purchase behaviors of customers, thereby representing the "spatiotemporal contexts" that are critical to identifying recommendable items. In this study, we propose a gamification approach wherein a real store is emulated in a pixel world and a recurrent convolutional network is trained to learn the spatiotemporal representation of offline shopping. The superiority and advantages of our method over existing sequential recommender systems are demonstrated through a real-world application in a hypermarket. We believe that our work can significantly contribute to promoting the practice of providing recommendations in offline stores and services.
The depth of a seismic event is an essential feature to discriminate natural earthquakes from events induced or created by humans. However, estimating the depth of a seismic event with a sparse set of seismic stations is a daunting task, and there is no globally usable method. This paper focuses on developing a machine learning model to accurately estimate the depth of arbitrary seismic events directly from seismograms. Our proposed deep learning architecture is not-so-deep compared to commonly found models in the literature for related tasks, consisting of two loosely connected levels of neural networks, associated with the seismic stations at the higher level and the individual channels of a station at the lower level. Thus, the model has significant advantages, including a reduced number of parameters for tuning and better interpretability to geophysicists. We evaluate our solution on seismic data collected from the SCEDC (Southern California Earthquake Data Center) catalog for regional events in California. The model can learn waveform features specific to a set of stations, while it struggles to generalize to completely novel sets of event sources and stations. In a simplified setting of separating shallow events from deep ones, the model achieved an 86.5% F1-score using the Southern California stations.
Soccer is a sport characterised by open and dynamic play, with player actions and roles aligned to team strategies simultaneously, at multiple temporal scales, and with high spatial freedom. This complexity presents an analytics challenge, which to date has largely been addressed by decomposing the game according to specific criteria to analyse specific problems. We propose a more holistic approach, utilising Transformer or RNN components in the novel Seq2Event model, in which the next match event is predicted given prior match events and context. We show how a general-purpose context-aware model can be used to create deployable metrics, and demonstrate the development of the poss-util metric using a Seq2Event model. Summarising the expectation of key attacking events (shot, cross) during each possession, our metric is shown to correlate over matches (r=0.91, n=190) with the popular xG metric. An example practical application of poss-util to analysing behaviour over possessions and matches is given. Its potential in sports with stronger sequentiality, such as rugby union, is discussed.
Friend recommendation services play an important role in shaping and facilitating the growth of online social networks. Graph embedding models, which can learn low-dimensional embeddings for nodes in the social graph to effectively represent the proximity between nodes, have been widely adopted for friend recommendation. Recently, Graph Neural Networks (GNNs) have demonstrated superiority over shallow graph embedding methods, thanks to their ability to explicitly encode neighborhood context. This is also verified in our Xbox friend recommendation scenario, where some simplified GNNs, such as LightGCN and PPRGo, achieve the best performance. However, we observe that many GNN variants, including LightGCN and PPRGo, use a static and pre-defined normalizer in neighborhood aggregation, which is decoupled from the representation learning process and can cause a scale distortion issue. As a consequence, the true power of GNNs has not yet been fully demonstrated in friend recommendation.
In this paper, we propose a simple but effective self-rescaling network (SSNet) to alleviate the scale distortion issue. At the core of SSNet is a generalized self-rescaling mechanism, which bridges the neighborhood aggregator's normalization with the node embedding learning process in an end-to-end framework. Meanwhile, we provide theoretical analysis to help understand the benefit of SSNet. We conduct extensive offline experiments on three large-scale real-world datasets. Results demonstrate that our proposed method can significantly improve the accuracy of various GNNs. When deployed online for a one-month A/B test, our method achieves a 24% uplift in the action of adding suggested friends. Finally, we share some interesting findings and hope the experience can motivate future applications and research in social link prediction.
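To illustrate the contrast between a static normalizer and a rescaling factor learned jointly with the embeddings, here is a simplified sketch; it conveys the idea only and is not the exact SSNet mechanism.

```python
# Static degree-based normalization vs. a simple learnable rescaling (illustrative).
import torch
import torch.nn as nn

def static_mean_aggregate(h, neighbor_lists):
    """LightGCN/PPRGo-style: the normalizer (here 1/degree) is fixed by the graph."""
    return torch.stack([h[idx].mean(dim=0) for idx in neighbor_lists])

class SelfRescalingAggregator(nn.Module):
    """Sum neighbor embeddings, then rescale with a factor learned end-to-end."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())
    def forward(self, h, neighbor_lists):
        summed = torch.stack([h[idx].sum(dim=0) for idx in neighbor_lists])
        return self.scale(summed) * summed  # learned scale replaces the static normalizer

h = torch.rand(6, 8)                                    # node embeddings
neighbors = [torch.tensor([1, 2]), torch.tensor([0, 3, 4])]
out = SelfRescalingAggregator(8)(h, neighbors)          # shape (2, 8)
```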
The psychotherapy intervention technique is a multifaceted conversation between a therapist and a patient. Unlike general clinical discussions, psychotherapy's core components (viz. symptoms) are hard to distinguish, which makes them a complex problem to summarize later. A structured counseling conversation may contain discussions about symptoms, the history of mental health issues, or the discovery of the patient's behavior. It may also contain discussion filler words irrelevant to a clinical summary. We refer to these elements of structured psychotherapy as counseling components. In this paper, we aim at mental health counseling summarization that builds upon domain knowledge and helps clinicians quickly glean meaning. We create a new dataset by annotating 12.9K utterances with counseling components and reference summaries for each dialogue. Further, we propose ConSum, a novel counseling-component guided summarization model. ConSum comprises three independent modules. The first, to assess the presence of depressive symptoms, filters utterances using the Patient Health Questionnaire (PHQ-9), while the second and third modules classify counseling components. Finally, we propose a problem-specific Mental Health Information Capture (MHIC) evaluation metric for counseling summaries. Our comparative study shows that we improve performance and generate cohesive, semantically consistent, and coherent summaries. We comprehensively analyze the generated summaries to investigate how psychotherapy elements are captured. Human and clinical evaluations of the summaries show that ConSum generates quality summaries. Further, mental health experts validate the clinical acceptability of ConSum. Lastly, we discuss the uniqueness of mental health counseling summarization in the real world and show evidence of its deployment in an online application with the support of mpathic.ai
Huawei is currently undertaking an effort to build map and web search services using query understanding and semantic search techniques. We present our efforts to build a low-latency type mention detection and linking service for map search. In addition to latency challenges, we only had access to low-quality and biased training data, and we had to support 13 languages. Consequently, our service is based mostly on unsupervised term- and vector-based methods. Nevertheless, we trained a Transformer-based query tagger, which we integrated with the rest of the pipeline using a reward and penalisation approach. We present techniques that we designed to address challenges with the type dictionary, incompatibilities in scoring between the term-based and vector-based methods, as well as over-segmentation issues in Thai, Chinese, and Japanese. We have evaluated our approach on the Huawei map search use case as well as on community Question Answering benchmarks.
With the rise of smartphones, mobile games have attracted billions of players and now occupy most of the market share for game companies. On the other hand, mobile game cheating, which aims to gain improper advantages by using programs that simulate players' inputs, severely damages the game's fairness and harms the user experience. Therefore, detecting mobile game cheating is of great importance for mobile game companies. Many PC game-oriented cheating detection methods have been proposed in past decades; however, they cannot be directly adopted in mobile games due to privacy concerns and the power and memory limitations of mobile devices. Even worse, in practice, cheating programs are quickly updated, leading to label scarcity for novel cheating patterns. To handle such issues, in this paper we introduce a mobile game cheating detection framework, namely FCDGame, to detect cheating under a few-shot learning framework. FCDGame consumes only screen sensor data, recording users' touch trajectories, which is less sensitive and more general for almost all mobile games. Moreover, a Hierarchical Trajectory Encoder and a Cross-pattern Meta Learner are designed in FCDGame to capture the intrinsic characteristics of mobile games and solve the label scarcity problem, respectively. Extensive experiments on two real online games show that FCDGame achieves almost 10% improvement in detection accuracy with only a few fine-tuning samples.
The ride-hailing service offered by mobility-on-demand platforms, such as Uber and Didi Chuxing, has greatly facilitated people's traveling and commuting, and become increasingly popular in recent years. Efficiency (e.g., gross merchandise volume) has always been an important metric for such platforms. However, only focusing on the efficiency inevitably ignores the fairness of driver incomes, which could impair the sustainability of the overall ride-hailing system in the long run. To optimize the aforementioned two essential metrics, order dispatching and driver repositioning play an important role, as they impact not only the immediate, but also the future order-serving outcomes of drivers. Thus, in this paper, we aim to exploit joint order dispatching and driver repositioning to optimize both the long-term efficiency and fairness for ride-hailing platforms. To address this problem, we propose a novel multi-agent reinforcement learning framework, referred to as JDRL, to help drivers make distributed order selection and repositioning decisions. Specifically, to cope with the variable action space, JDRL segments the action space into a fixed number of action groups, and fixes the policy output dimension for order selection as the number of action groups. In terms of the fairness criterion, JDRL adopts the max-min fairness, and augments the vanilla policy gradient to an iterative training algorithm that alternates between a minimization step and a policy improvement step to maximize both the worst and the overall performance of agents. In addition, we provide the theoretical convergence guarantee of our JDRL training algorithm even under non-convex policy networks and stochastic gradient updating. Extensive experiments are conducted with three public real-world ride-hailing order datasets, including over 2 million orders in Haikou, China, over 5 million orders in Chengdu, China, and over 6 million orders in New York City, USA. Experimental results show that JDRL demonstrates a consistent advantage compared to state-of-the-art baselines in terms of both efficiency and fairness. To the best of our knowledge, this is the first work that exploits joint order dispatching and driver repositioning to optimize both the long-term efficiency and fairness in a ride-hailing system.
Games are one of the safest sources of self-esteem and relaxation at the same time. An online gaming platform typically has massive data coming in, e.g., in-game actions, player moves, clickstreams, transactions, etc. Interestingly, something as simple as data on gaming moves can help create a psychological imprint of the user at that moment, based on her impulsive reactions and responses to situations in the game. Mining this knowledge can: (a) immediately help better explain observed and predicted player behavior; and (b) consequently deepen understanding of players' experience, growth, and protection.
To this effect, we focus on the discovery of "game behaviours" as micro-patterns formed by continuous sequences of games, and of players' persistent "play styles" as a sequence of such sequences, on an online skill gaming platform for Rummy. These complex sequences of intricate sequences are analysed through a novel collaborative two-stage deep neural network, CognitionNet. The first stage focuses on mining game behaviours as cluster representations in a latent space, while the second aggregates over these micro-patterns (e.g., transitions across patterns) to discover play styles via a supervised classification objective around player engagement. The dual objective allows CognitionNet to reveal several decision-making patterns and tactics inspired by player psychology. To our knowledge, this is the first research to fully automate the discovery of: (i) player psychology and game tactics from telemetry data; and (ii) relevant diagnostic explanations for players' engagement predictions. The collaborative training of the two networks with differential input dimensions is enabled using a novel formulation of "bridge loss". The network plays a pivotal role in obtaining homogeneous and consistent play-style definitions and significantly outperforms the SOTA baselines wherever applicable.
Drug recommendation is an important task in AI for healthcare. To recommend proper drugs, existing methods rely on various clinical records (e.g., diagnoses and procedures), which are commonly found in data such as electronic health records (EHRs). However, such detailed records are often not available, and the inputs might merely include a set of symptoms provided by doctors. Moreover, existing drug recommender systems usually treat drugs as individual items, ignoring the unique requirement that drug recommendation must produce a set of items (drugs) that should be as small as possible and safe, i.e., free of harmful drug-drug interactions (DDIs).
To deal with the challenges above, in this paper we propose a novel framework of Symptom-based Set-to-set Small and Safe drug recommendation (4SDrug). To enable set-to-set comparison, we design set-oriented representation and similarity measurement for both symptoms and drugs. Further, for the symptom sets, we devise importance-based set aggregation to enhance the accuracy of symptom set representation; for the drug sets, we devise intersection-based set augmentation to ensure smaller drug sets, and apply knowledge-based and data-driven penalties to ensure safer drug sets. Extensive experiments on two real-world EHR datasets, i.e., the public MIMIC-III benchmark and the industrial large-scale NELL dataset, show drastic performance gains brought by 4SDrug, which outperforms all baselines on most effectiveness measures, while yielding the smallest sets of recommended drugs and a 26.83% DDI rate reduction relative to the ground-truth data.
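As an illustration of importance-based set aggregation, the sketch below weights each symptom embedding with a learned importance score and sums the set into one representation; names and dimensions are assumptions, not the 4SDrug implementation.

```python
# Importance-weighted aggregation of a symptom set into one set representation.
import torch
import torch.nn as nn

class ImportanceSetAggregator(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.importance = nn.Linear(dim, 1)   # scores each symptom's importance

    def forward(self, symptom_embs):          # (set_size, dim)
        weights = torch.softmax(self.importance(symptom_embs), dim=0)
        return (weights * symptom_embs).sum(dim=0)   # (dim,)

# Aggregate a toy set of 6 symptom embeddings of dimension 32.
set_repr = ImportanceSetAggregator(dim=32)(torch.rand(6, 32))
```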
In Disability Employment Services (DES), an emerging problem is recommending to disabled workers the right skill to upgrade and the right upgrade level to achieve a maximum increase in their job retention time. This problem involves causal reasoning to estimate the individual causal effect (ICE) on the survival outcome, i.e., job retention time, to determine the most effective intervention for a worker. Existing methods are not suitable to solve our problem. They are mostly developed for non-causal or non-survival challenges, while methods for causal survival analysis are under-explored. This paper proposes a representation learning method for recommending personalized interventions that can generate a maximum increase in job retention time for workers with disability. In our method, observed covariates are disentangled into latent variables based on which confounding and censoring biases are eliminated, and the ICE prediction model is built. Since true ICE values are not directly measurable in observational data, a reverse engineering technique is developed to estimate ICE for training samples. These estimated ICE values are then used as the pseudo ground truth to train the prediction model. Experiments with a case study of Australian workers with disability show that by adopting personalized interventions recommended by our method, disabled workers can increase their job retention time by up to 2.8 months. Additional evaluations with public datasets also show the technical strengths of our method in other applications.
The transition from conventional mobility to electromobility largely depends on charging infrastructure availability and optimal placement. This paper examines the optimal placement of charging stations in urban areas. We maximise the charging infrastructure supply over the area and minimise waiting, travel, and charging times while setting budget constraints. Moreover, we include the possibility of charging vehicles at home to obtain a more refined estimation of the actual charging demand throughout the urban area. We formulate the Placement of Charging Stations problem as a non-linear integer optimisation problem that seeks the optimal positions for charging stations and the optimal number of charging piles of different charging types. We design a novel Deep Reinforcement Learning approach to solve the charging station placement problem (PCRL). Extensive experiments on real-world datasets show how the PCRL reduces the waiting and travel time while increasing the benefit of the charging plan compared to five baselines. Compared to the existing infrastructure, we can reduce the waiting time by up to 97% and increase the benefit up to 497%.
Offline user identification is a scenario in which users use their biometric information, such as faces, as identification in offline venues; it has been applied in many offline settings such as verification in banks, check-in at hotels, and making purchases at offline merchants. In such a scenario, designing an extremely accurate identification approach is critical. Most scenarios use faces to identify users, and previous algorithms are mainly based on visual features and computer-vision models. However, due to large variations such as pose, illumination, and occlusion in offline scenarios, it remains challenging for existing computer-vision algorithms to achieve satisfactory accuracy in real-world settings. Furthermore, billion-scale candidate pools also demand high efficiency as well as high accuracy from the approach.
In offline scenarios, users, venues, and their different kinds of interactions can form a heterogeneous graph. Mining this graph reveals much about users' offline habits and behaviors, which can serve as a valuable supplementary signal for user identification. In this paper, we carefully design an offline identification framework with two aspects in mind. First, given a scanned face, we propose a 'local-global' retrieval mechanism to find one user among billion-scale candidates. Second, and most importantly, to verify the match among the scanned face, the retrieved user, and the venue, we propose a novel Wide & Deep Based Graph Convolution Network to model both the visual information and the heterogeneous graph. Extensive offline and online A/B experimental results on a real-world industrial dataset demonstrate the effectiveness of our proposed approach. The whole approach has now been deployed in the industrial production environment, serving billion-scale users for offline identification.
In sponsored search engines, pre-trained language models have shown promising performance improvements on Click-Through-Rate (CTR) prediction. A widely used approach for utilizing pre-trained language models in CTR prediction consists of fine-tuning the language models with click labels and early stopping at the peak value of the obtained Area Under the ROC Curve (AUC). Thereafter, the output of these fine-tuned models, i.e., the final score or intermediate embedding generated by the language model, is used as a new Natural Language Processing (NLP) feature in the CTR prediction baseline. This cascade approach avoids complicating the CTR prediction baseline, while keeping flexibility and agility. However, we show in this work that separately calibrating the language model based on its peak single-model AUC does not always yield NLP features that ultimately give the best performance in the CTR prediction model. Our analysis reveals that the misalignment is due to overlap and redundancy between the new NLP features and the existing features in the CTR prediction baseline. In other words, the NLP features can improve CTR prediction more if such overlap can be reduced.
For this purpose, we introduce a simple and general joint-training framework that fine-tunes language models together with the existing features in the CTR prediction baseline, so as to extract supplementary knowledge for the NLP feature. Moreover, we develop an efficient Supplementary Knowledge Distillation (SuKD) method that transfers the supplementary knowledge learned by a heavy language model to a light and serviceable model. Comprehensive experiments on both public and commercial data presented in this work demonstrate that the new NLP features resulting from the joint-training framework significantly outperform those from independent fine-tuning based on click labels. We also show that the light model distilled with SuKD provides a clear AUC improvement in CTR prediction over traditional feature-based knowledge distillation.
Real-Time Bidding (RTB) is an important mechanism in modern online advertising systems. Advertisers employ bidding strategies in RTB to optimize their advertising effects subject to various financial requirements, especially the return-on-investment (ROI) constraint. ROIs change non-monotonically during the sequential bidding process, and often induce a see-saw effect between constraint satisfaction and objective optimization. While some existing approaches show promising results in static or mildly changing ad markets, they fail to generalize to highly dynamic ad markets with ROI constraints, due to their inability to adaptively balance constraints and objectives amidst non-stationarity and partial observability. In this work, we specialize in ROI-Constrained Bidding in non-stationary markets. Based on a Partially Observable Constrained Markov Decision Process, our method exploits an indicator-augmented reward function free of extra trade-off parameters and develops a Curriculum-Guided Bayesian Reinforcement Learning (CBRL) framework to adaptively control the constraint-objective trade-off in non-stationary ad markets. Extensive experiments on a large-scale industrial dataset with two problem settings reveal that CBRL generalizes well in both in-distribution and out-of-distribution data regimes, and enjoys superior learning efficiency and stability.
Large-scale distributed systems, such as Microsoft 365's database system, require timely mitigation solutions to address failures and improve service availability and reliability. Still, mitigation actions can be costly, as they may cause temporary performance degradation and even incur monetary expenses. Mitigation actions can be administered either in a reactive fashion to contain detected failures or in a proactive fashion to reduce potential failures. The proactive mitigation approach typically relies on a two-stage strategy: a prediction model first identifies instances (such as databases or disks) with high failure risk, and then appropriate mitigation actions, chosen by engineers or an automatic bandit learning model, are applied. As information is not fully shared across these two stages, important factors such as mitigation costs and the states of instances are often ignored in one of the two stages. To address these issues, we propose NENYA, an end-to-end mitigation solution for a large-scale database system powered by a novel cascade reinforcement learning model. Taking the states of databases as input, NENYA directly outputs mitigation actions and is optimized based on jointly cumulative feedback on mitigation costs and failure rates. As the overwhelming majority of databases do not require mitigation actions, NENYA utilizes a novel cascade decision structure to first reliably filter out such databases and then focus on choosing appropriate mitigation actions for the rest. Extensive offline and online experiments have shown that our method outperforms existing practices in reducing both database failure rates and mitigation costs. NENYA has been integrated into Microsoft 365, a productivity platform, with resounding success.
Traffic congestion incurs long delays in travel time, which seriously affect our daily travel experience. Exploring why traffic congestion occurs is important for effectively addressing the problem and improving user experience. Traditional approaches to mining congestion causes depend on human effort, which is time-consuming and costly. Hence, we aim to discover both known and unknown causes of traffic congestion in a systematic way. Achieving this, however, involves three challenges: 1) traffic congestion is affected by several factors with complex spatio-temporal relations; 2) the amount of congestion data with known causes is small due to the limits of human labeling; 3) many unknown congestion causes remain unexplored since numerous factors contribute to traffic congestion. To address the above challenges, we design a congestion cause discovery system consisting of two modules: 1) congestion feature extraction, which extracts the important features influencing congestion; and 2) congestion cause discovery, which utilizes a deep semi-supervised learning based method to discover the causes of traffic congestion with limited labeled causes. Specifically, it first leverages a small amount of labeled data as prior knowledge to pre-train the model, and then applies the k-means algorithm to produce clusters. Extensive experiments show that the performance of our proposed method is superior to the baselines. Additionally, our system is deployed and used in the production environment at Amap.
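A minimal sketch of this pre-train-then-cluster recipe, assuming generic congestion feature vectors and a simple logistic-regression feature extractor; the concrete models and feature spaces in the deployed system will differ.

```python
# Pre-train on the few labeled causes, then cluster all cases to surface unknown causes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_all = rng.normal(size=(500, 16))                   # congestion feature vectors
X_lab, y_lab = X_all[:50], rng.integers(0, 3, 50)    # few human-labeled known causes

# Prior-knowledge pre-training on the labeled subset.
clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Cluster all congestion cases using raw features plus prior-informed scores.
feats = np.hstack([X_all, clf.predict_proba(X_all)])
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(feats)
```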
Real-time Vehicle-of-Interest (VoI) detection is becoming a core application in smart cities, especially in areas with high accident rates. With the increasing number of surveillance cameras and advances in edge computing, video analysis tasks are preferably run on edge devices close to the cameras due to bandwidth, latency, and privacy constraints. However, resource-constrained edge devices cannot cope with dynamic traffic loads using resource-intensive video analysis models. To address this challenge, we propose RT-VeD, a real-time VoI detection system that operates under the limited resources of edge nodes. RT-VeD utilizes multi-granularity computer vision models with different resource-accuracy trade-offs. It schedules vehicle tasks with a traffic-aware actor-critic framework to maximize the accuracy of VoI detection while meeting an inference time bound. To evaluate RT-VeD, we conduct extensive experiments on a real-world vehicle dataset. The experimental results demonstrate that our model outperforms other competitive methods.
In this work, we study how to find the k most representative routes over large-scale trajectory data, which is a fundamental operation that benefits various real-world applications, such as traffic monitoring and public transportation planning. The operation is time-sensitive, as it must be able to adapt its results as traffic conditions change. We first prove the NP-hardness of the problem, and then propose a range of effective approximate solutions with rapid response times. Specifically, we first build a lookup table that stores the trajectories covered by each edge in a given road network. Rather than performing a depth-first search over all possible routes, we find a 1/η approximate solution by developing a maximum-weight algorithm. Since each edge in a route may be close to several trajectories, we further propose a coverage-first algorithm to locate the edges with the greatest coverage gain for the solution route set. Observing that, in real road networks, each edge is connected to only a few other edges, we develop a connect-first algorithm that finds consecutive edges for the k representative routes by greedily selecting edges with the maximum marginal gain for each route. Finally, comprehensive experiments over two real-world datasets are conducted to verify the effectiveness and efficiency of our proposed algorithms, and to provide evidence of the usefulness of our solution and its rapid response times in traffic monitoring tasks.
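The marginal-gain selection underlying the coverage-first idea can be illustrated with a standard greedy max-coverage sketch; this is illustrative only, since the paper's algorithms additionally enforce route connectivity and build full routes.

```python
# Greedy max-coverage over edges: repeatedly pick the edge whose trajectory set
# adds the most uncovered trajectories (illustrative of the coverage-first idea).
def greedy_edge_coverage(edge_to_trajs, k):
    covered, chosen = set(), []
    for _ in range(k):
        best_edge, best_gain = None, 0
        for edge, trajs in edge_to_trajs.items():
            gain = len(trajs - covered)          # marginal coverage gain
            if gain > best_gain:
                best_edge, best_gain = edge, gain
        if best_edge is None:                    # nothing left to gain
            break
        chosen.append(best_edge)
        covered |= edge_to_trajs[best_edge]
    return chosen, covered

edges = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {5}}
print(greedy_edge_coverage(edges, k=2))          # picks e1 first, then a gain-1 edge
```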
Guaranteed delivery (GD) and real-time bidding (RTB) constitute two parallel profit streams for the publisher. The diverse advertiser demands (brand or instant effect) result in different selling (in bulk or via auction) and pricing (fixed unit price or various bids) patterns, which naturally raises the fusion allocation problem of breaking the barrier between the two markets and selling each impression at the globally highest price to boost total revenue. The fusion process complicates the competition between GD and RTB, as well as among GD contracts with overlapping targeting. Non-stationary user traffic and bid landscapes further worsen the situation, making the assignment unsupervised and hard to evaluate. Thus, the static policies or coarse-grained modeling of existing work are ill-suited to the above challenges.
This paper proposes CONFLUX, a fusion framework located at the confluence of the parallel GD and RTB markets. CONFLUX functions in a cascaded process: a paradigm is first forged via linear programming to supervise CONFLUX's training, then a cumbersome network distills this paradigm by precisely modeling the competition at the request level and further transfers its generalization ability to a lightweight student via knowledge distillation. Finally, fine-tuning is periodically executed at the online stage to remedy the student's degradation, and a temporal distillation loss between the current and the previous model serves as a regularizer to prevent over-fitting. The procedure is analogous to a cascade of distillations, hence the name. CONFLUX has been deployed on the Tencent advertising system for over six months and evaluated through extensive experiments. Online A/B tests show lifts of 3.29%, 1.77%, and 3.63% in ad income, overall click-through rate, and cost-per-mille, respectively, which jointly contribute a revenue increase of hundreds of thousands of RMB per day. Our code is publicly available at https://github.com/zslomo/CONFLUX.
Learning-based order dispatching has witnessed tremendous success in ride-hailing. However, this success remains confined to individual ride-hailing platforms, because sharing raw order dispatching data across platforms may leak user privacy and business secrets. Such data isolation not only impairs user experience but also decreases the potential revenues of the platforms. In this paper, we advocate federated order dispatching for cross-platform ride-hailing, where multiple platforms collaboratively make dispatching decisions without sharing their local data. Realizing this concept calls for new federated learning strategies that tackle the unique challenges of effectiveness, privacy, and efficiency in the context of order dispatching. In response, we devise Federated Learning-to-Dispatch (Fed-LTD), a framework that enables effective order dispatching by sharing both dispatching models and decisions while providing privacy protection for raw data and high efficiency. We validate Fed-LTD via large-scale trace-driven experiments with the Didi GAIA dataset. Extensive evaluations show that Fed-LTD outperforms single-platform order dispatching by 10.24% to 54.07% in terms of total revenue.
Building appropriate scenarios to meet the personalized demands of different user groups is a common practice. While serving various scenarios brings personalized service, it also creates challenges for recommendation across multiple scenarios, especially scenarios with limited traffic. To provide desirable recommendation service for all scenarios and reduce resource consumption, it becomes critical to leverage information from multiple scenarios to construct a unified model. Unfortunately, the performance of existing multi-scenario recommendation approaches is poor, since they introduce unnecessary information from other scenarios into the target scenario. In this paper, we show it is possible to selectively utilize information from different scenarios to construct scenario-aware estimators in a unified model. Specifically, we first analyze multi-scenario modeling with a causal graph from the perspective of users and modeling processes, and then propose the Causal Inspired Intervention (CausalInt) framework for multi-scenario recommendation. CausalInt consists of three modules: (1) an Invariant Representation Modeling module that squeezes out scenario-aware information through disentangled representation learning to obtain a scenario-invariant representation; (2) a Negative Effects Mitigating module that resolves conflicts between different scenarios and between scenario-specific and scenario-invariant representations via gradient-based orthogonal regularization and model-agnostic meta-learning, respectively; (3) an Inter-Scenario Transferring module that designs a novel TransNet to simulate a counterfactual intervention and effectively fuse information from other scenarios. Offline experiments on two real-world datasets and an online A/B test demonstrate the superiority of CausalInt.
Over the years, we have seen recommender systems shift focus from optimizing short-term engagement toward improving long-term user experience on platforms. While defining good long-term user experience is still an active research area, we focus here on one specific aspect of improved long-term user experience: users revisiting the platform. These long-term outcomes, however, are much harder to optimize due to the sparsity of observed events and the low signal-to-noise ratio (weak connection) between a long-term outcome and a single recommendation. To address these challenges, we propose to establish the association between these long-term outcomes and a set of more immediate user behavior signals that can serve as surrogates for optimization.
To this end, we conduct a large-scale study of user behavior logs on one of the largest industrial recommendation platforms serving billions of users. We study a broad set of sequential user behavior patterns and standardize a procedure to pinpoint the subset that has strong predictive power for the change in users' long-term visiting frequency. Specifically, these behaviors are predictive of users' increased visits to the platform over 5 months, among users who begin with the same visiting frequency. We validate the identified subset of user behaviors by incorporating them as reward surrogates for long-term user experience in a reinforcement learning (RL) based recommender. Results from multiple live experiments on the industrial recommendation platform demonstrate the effectiveness of the proposed surrogates in improving long-term user experience.
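One simple way to fold such surrogates into an RL reward is a weighted shaping term, sketched below; the signal names and weights are hypothetical and are not the behaviors identified in the study.

```python
# Reward shaping with surrogate behavior signals (names and weights are hypothetical).
def shaped_reward(immediate_engagement, surrogate_signals, weights):
    """Reward = immediate term + weighted surrogates that predict long-term revisits."""
    return immediate_engagement + sum(weights[name] * value
                                      for name, value in surrogate_signals.items())

r = shaped_reward(
    immediate_engagement=0.4,
    surrogate_signals={"diverse_topic_click": 1.0, "follow_action": 0.0},
    weights={"diverse_topic_click": 0.2, "follow_action": 0.5},
)
```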
The rapid development of federated learning (FL) has benefited various tasks in the domains of computer vision and natural language processing, and existing frameworks such as TFF and FATE have made deployment easy in real-world applications. However, federated graph learning (FGL), even though graph data are prevalent, has not been well supported due to its unique characteristics and requirements. The lack of an FGL-oriented framework increases the effort required for reproducible research and for deployment in real-world applications. Motivated by this strong demand, in this paper we first discuss the challenges in creating an easy-to-use FGL package and accordingly present our implemented package FederatedScope-GNN (FS-G), which provides (1) a unified view for modularizing and expressing FGL algorithms; (2) a comprehensive DataZoo and ModelZoo for out-of-the-box FGL capability; (3) an efficient model auto-tuning component; and (4) off-the-shelf privacy attack and defense abilities. We validate the effectiveness of FS-G through extensive experiments, which also yield many valuable insights about FGL for the community. Moreover, we employ FS-G to serve FGL applications in real-world e-commerce scenarios, where the attained improvements indicate great potential business benefits. We publicly release FS-G, as submodules of FederatedScope, at https://github.com/alibaba/FederatedScope to promote FGL research and enable broad applications that would otherwise be infeasible due to the lack of a dedicated package.
Pinpointing the geographic location of an IP address is important for a range of location-aware applications, from targeted advertising to fraud prevention. Most traditional measurement-based and recent learning-based methods either focus on the efficient use of topology or rely on data mining to find clues about the target IP in publicly available sources. Motivated by the limitations of existing work, we propose a novel framework named GraphGeo, which provides a complete processing methodology for street-level IP geolocation based on graph neural networks. It incorporates knowledge about IP hosts and various kinds of neighborhood relationships into the graph to infer spatial topology for high-quality geolocation prediction. We explicitly consider and alleviate the negative impact of uncertainty caused by network jitter and congestion, which are pervasive in complicated network environments. Extensive evaluations across three large-scale real-world datasets demonstrate that GraphGeo significantly reduces geolocation errors compared to state-of-the-art methods. Moreover, the proposed framework has been deployed on a web platform as an online service for 6 months.
Machine learning (ML) interpretability techniques can reveal undesirable patterns in the data that models exploit to make predictions, potentially causing harm once deployed. However, how to take action to address these patterns is not always clear. In a collaboration between ML and human-computer interaction researchers, physicians, and data scientists, we develop GAM Changer, the first interactive system to help domain experts and data scientists easily and responsibly edit Generalized Additive Models (GAMs) and fix problematic patterns. With novel interaction techniques, our tool puts interpretability into action, empowering users to analyze, validate, and align model behaviors with their knowledge and values. Physicians have started to use our tool to investigate and fix pneumonia and sepsis risk prediction models, and an evaluation with 7 data scientists working in diverse domains highlights that our tool is easy to use, meets their model editing needs, and fits into their current workflows. Built with modern web technologies, our tool runs locally in users' web browsers or computational notebooks, lowering the barrier to use. GAM Changer is available at the following public demo link: https://interpret.ml/gam-changer.
Pick-up and delivery (P&D) services such as food delivery have achieved explosive growth in recent years by providing customers with daily-life convenience. Although many service providers have invested considerably in routing tools, more and more practitioners realize that significant deviations exist between workers' actual routes and the planned ones. It is therefore unwise to feed "optimal routes" into downstream tasks (e.g., arrival-time prediction and order dispatching) as if they were workers' actual service routes, since the performance of these tasks depends on the accuracy of route prediction, i.e., predicting the future service route over a worker's unfinished tasks. To meet the rising call for route prediction models that can capture workers' future routing behaviors, in this paper we formulate the Pick-up and Delivery Route Prediction task (PDRP task for short) from the graph perspective for the first time, and then propose a dynamic spatial-temporal graph-based model, named Graph2Route. Unlike previous sequence-based models, our model incorporates the underlying graph structure and features into the encoding and decoding process. Moreover, its dynamic graph-based nature naturally describes the evolving relationships between different problem instances. As a result, abundant decision context information and various spatial-temporal information about nodes and edges can be fully utilized in Graph2Route to improve prediction performance. Offline experiments over two real-world industry-scale datasets from different P&D services (i.e., food delivery and package pick-up) and an online A/B test demonstrate the superiority of our proposed model.
Recent advances in multimodal single-cell technologies have enabled simultaneous acquisition of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn joint representations from the multimodal data, to model the relationships between modalities, and, more importantly, to incorporate the vast amount of single-modality datasets into downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: Modality prediction, Modality matching, and Joint embedding. In this work, we present a general Graph Neural Network framework, scMoGNN, to tackle these three tasks, and show that scMoGNN achieves superior results in all three tasks compared with state-of-the-art and conventional approaches. Our method is the official winner in the overall ranking of Modality prediction in the NeurIPS 2021 Competition (https://openproblems.bio/neurips_2021/), and all implementations of our methods have been integrated into the DANCE package (https://github.com/OmicsML/dance).
Federated learning (FL) is a feasible technique for learning personalized recommendation models from decentralized user data. Unfortunately, federated recommender systems are vulnerable to poisoning attacks by malicious clients. Existing recommender system poisoning methods mainly focus on promoting the recommendation chances of target items due to financial incentives. In fact, in real-world scenarios, the attacker may also attempt to degrade the overall performance of recommender systems. However, existing general FL poisoning methods for degrading model performance are either ineffective or not covert when poisoning federated recommender systems. In this paper, we propose a simple yet effective and covert poisoning attack on federated recommendation, named FedAttack. Its core idea is to use globally hardest samples to subvert model training. More specifically, the malicious clients first infer user embeddings based on local user profiles. Next, they choose the candidate items that are most relevant to the user embeddings as hardest negative samples, and the candidates farthest from the user embeddings as hardest positive samples. The model gradients inferred from these poisoned samples are then uploaded for aggregation. Extensive experiments on two benchmark datasets show that FedAttack can effectively degrade the performance of various federated recommender systems, while it cannot be effectively detected or defended against by many existing methods.
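The hardest-sample selection can be sketched in a few lines of tensor code, as below; the embedding shapes and the cosine-similarity relevance criterion are assumptions used for illustration.

```python
# Select globally hardest samples: most similar items become hardest negatives,
# least similar items become hardest positives (illustrative sketch).
import torch
import torch.nn.functional as F

def select_poison_samples(user_emb, item_embs, n):
    sims = F.cosine_similarity(user_emb.unsqueeze(0), item_embs, dim=-1)
    hardest_negatives = torch.topk(sims, n).indices     # items most relevant to the user
    hardest_positives = torch.topk(-sims, n).indices    # items least relevant to the user
    return hardest_negatives, hardest_positives

neg_idx, pos_idx = select_poison_samples(torch.rand(64), torch.rand(1000, 64), n=5)
```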
Black-box heterogeneous treatment effect (HTE) models are increasingly being used to create personalized policies that assign individuals to their optimal treatments. However, they are difficult to understand, and can be burdensome to maintain in a production environment. In this paper, we present a scalable, interpretable personalized experimentation system, implemented and deployed in production at Meta. The system works in a multiple treatment, multiple outcome setting typical at Meta to: (1) learn explanations for black-box HTE models; (2) generate interpretable personalized policies. We evaluate the methods used in the system on publicly available data and Meta use cases, and discuss lessons learnt during the development of the system.
Subsurface simulations use computational models to predict the flow of fluids (e.g., oil, water, gas) through porous media. These simulations are pivotal in industrial applications such as petroleum production, where fast and accurate models are needed for high-stakes decision making, for example, for well placement optimization and field development planning. Classical finite-difference numerical simulators require massive computational resources to model large-scale real-world reservoirs. Alternatively, streamline simulators and data-driven surrogate models are computationally more efficient because they rely on approximate physics models; however, they are insufficient for modeling complex reservoir dynamics at scale.
Here we introduce Hybrid Graph Network Simulator (HGNS), which is a data-driven surrogate model for learning reservoir simulations of 3D subsurface fluid flows. To model complex reservoir dynamics at both local and global scale, HGNS consists of a subsurface graph neural network (SGNN) to model the evolution of fluid flows, and a 3D-U-Net to model the evolution of pressure. HGNS is able to scale to grids with millions of cells per time step, two orders of magnitude higher than previous surrogate models, and can accurately predict the fluid flow for tens of time steps (years into the future).
Using an industry-standard subsurface flow dataset (SPE-10) with 1.1 million cells, we demonstrate that HGNS is able to reduce the inference time up to 18 times compared to standard subsurface simulators, and that it outperforms other learning-based models by reducing long-term prediction errors by up to 21%.
Online meal delivery is undergoing explosive growth as the service becomes increasingly popular. A meal delivery platform aims to provide excellent and stable service for customers and restaurants. However, in reality, several hundred thousand orders are canceled per day on the Meituan meal delivery platform because they are not accepted by crowdsourcing drivers. These cancellations are highly detrimental to customers' repurchase rate and to the reputation of the Meituan meal delivery platform. To solve this problem, a specific amount of funds is provided by Meituan's business managers to encourage crowdsourcing drivers to accept more orders. To make better use of these funds, in this work we propose a framework to deal with the multi-stage bonus allocation problem for a meal delivery platform. The objective of this framework is to maximize the number of accepted orders within a limited bonus budget. The framework consists of a semi-black-box acceptance probability model, a Lagrangian dual-based dynamic programming algorithm, and an online allocation algorithm. The semi-black-box acceptance probability model forecasts the relationship between the bonus allocated to an order and its acceptance probability; the Lagrangian dual-based dynamic programming algorithm calculates the empirical Lagrange multiplier for each allocation stage offline based on the historical dataset; and the online allocation algorithm uses the results obtained in the offline part to calculate a proper delivery bonus for each order. To verify the effectiveness and efficiency of our framework, both offline experiments on a real-world dataset and online A/B tests on the Meituan meal delivery platform are conducted. Our results show that, using the proposed framework, total order cancellations can be decreased by more than 25% in practice.
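The online step can be sketched as a per-order argmax under the dual price estimated offline; the acceptance-probability function below is a toy placeholder for the semi-black-box model, and the bonus levels and multiplier are hypothetical.

```python
# Per-order bonus choice given an offline-estimated Lagrange multiplier (toy sketch).
import math

def accept_prob(bonus):
    """Toy placeholder for the semi-black-box acceptance probability model."""
    return 1.0 - math.exp(-0.3 * (1.0 + bonus))

def choose_bonus(bonus_levels, lam):
    """Pick the bonus maximizing acceptance probability minus the dual-priced cost."""
    return max(bonus_levels, key=lambda b: accept_prob(b) - lam * b)

print(choose_bonus(bonus_levels=[0, 1, 2, 3, 4, 5], lam=0.05))
```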
An emerging dilemma facing practitioners in large-scale online experimentation for e-commerce is whether to use Multi-Armed Bandit (MAB) algorithms or traditional A/B testing (A/B). This paper provides a comprehensive comparison between the two, from the perspectives of confidence intervals, hypothesis test power, and their relationships with traffic split and sample size, both theoretically and numerically. We first compare MAB with A/B tests in terms of the conditions under which disjoint confidence intervals occur, and analyze their connection with the traffic split. Then we explore the relationship between the two in terms of the sample sizes needed to achieve the required hypothesis test power, and analyze under what conditions MAB can have higher test power than A/B given the same sample size. Based on this theoretical analysis, we propose two new MAB algorithms that combine the strengths of traditional MAB and A/B, with higher (or equal) test power and higher (or equal) expected rewards than A/B testing under certain common conditions in e-commerce. Lastly, we evaluate and compare the performance of classical MAB algorithms, our newly proposed MAB algorithms, and A/B testing in terms of the accuracy of identifying the ground-truth winner with practical significance, the power-reward trade-off, sample sizes, etc., on both simulated datasets and industrial historical datasets. We hope this work not only facilitates a better understanding of the pros and cons of MAB and A/B testing, but also helps build connections between the two and provides approaches that can leverage the best of both worlds.
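A toy simulation makes the basic trade-off concrete: a fixed 50/50 split gathers balanced evidence, while a bandit policy shifts traffic toward the better arm and collects more reward. The arm probabilities and epsilon-greedy policy below are illustrative, not the paper's proposed algorithms.

```python
# Fixed 50/50 A/B split vs. epsilon-greedy traffic allocation on two Bernoulli arms.
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.10, 0.12])          # true conversion rates of control / treatment
N = 20_000

# A/B: equal traffic split.
ab_rewards = rng.binomial(1, p[rng.integers(0, 2, N)]).sum()

# Epsilon-greedy MAB: explore 10% of the time, otherwise exploit the current best arm.
counts, successes, mab_rewards = np.zeros(2), np.zeros(2), 0
for _ in range(N):
    if rng.random() < 0.1:
        arm = rng.integers(0, 2)
    else:
        arm = int(np.argmax(successes / np.maximum(counts, 1)))
    reward = rng.binomial(1, p[arm])
    counts[arm] += 1
    successes[arm] += reward
    mab_rewards += reward

print(ab_rewards, mab_rewards)  # the bandit typically collects more reward, with unequal samples
```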
News recommendation calls for deep insights of news articles' underlying semantics. Therefore, pretrained language models (PLMs), like BERT and RoBERTa, may substantially contribute to the recommendation quality. However, it's extremely challenging to have news recommenders trained together with such big models: the learning of news recommenders requires intensive news encoding operations, whose cost is prohibitive if PLMs are used as the news encoder. In this paper, we propose a novel framework, SpeedyFeed, which efficiently trains PLMs-based news recommenders of superior quality. SpeedyFeed is highlighted for its light-weight encoding pipeline, which gives rise to three major advantages. Firstly, it makes the intermediate results fully reusable for the training workflow, which removes most of the repetitive but redundant encoding operations. Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding. Thirdly, it further saves the cost by leveraging simplified news encoding and compact news representation.
SpeedyFeed leads to more than 100x acceleration of the training process, which enables big models to be trained efficiently and effectively over massive user data. The well-trained PLMs-based model significantly outperforms the state-of-the-art news recommenders in comprehensive offline experiments. It is applied to Microsoft News to empower the training of large-scale production models, which demonstrate highly competitive online performances. SpeedyFeed is also a model-agnostic framework, thus being potentially applicable to a wide spectrum of content-based recommender systems. We've made the source code open to the public so as to facilitate research and applications in related areas.
Cross-domain recommendation (CDR) aims to provide better recommendation results in the target domain with the help of the source domain, and is widely used and explored in real-world systems. However, CDR in the matching (i.e., candidate generation) module struggles with data sparsity and popularity bias in both representation learning and knowledge transfer. In this work, we propose a novel Contrastive Cross-Domain Recommendation (CCDR) framework for CDR in matching. Specifically, we build a huge diversified preference network to capture multiple types of information reflecting users' diverse interests, and design an intra-domain contrastive learning (intra-CL) task and three inter-domain contrastive learning (inter-CL) tasks for better representation learning and knowledge transfer. The intra-CL enables more effective and balanced training inside the target domain via graph augmentation, while the inter-CL builds different types of cross-domain interactions from user, taxonomy, and neighbor aspects. In experiments, CCDR achieves significant improvements on both offline and online evaluations in a real-world system. We have deployed CCDR on WeChat Top Stories, where it affects a large number of users. The source code is available at https://github.com/lqfarmer/CCDR.
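A generic sketch of the intra-domain contrastive ingredient (graph augmentation plus InfoNCE between two views) is given below; the edge-dropout augmentation, temperature, and placeholder encoder are assumptions for illustration, not the CCDR implementation.

```python
import torch
import torch.nn.functional as F

def edge_dropout(edge_index: torch.Tensor, keep_prob: float = 0.8) -> torch.Tensor:
    """Randomly drop edges to create an augmented view of the graph ([2, E] edge list)."""
    mask = torch.rand(edge_index.size(1)) < keep_prob
    return edge_index[:, mask]

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """InfoNCE between two views; matching rows are positives, the rest are in-batch negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                  # [N, N] similarity matrix
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

# Usage with a placeholder encoder (any GNN mapping (x, edge_index) -> node embeddings):
# z1 = encoder(x, edge_dropout(edge_index))
# z2 = encoder(x, edge_dropout(edge_index))
# loss = info_nce(z1, z2)
```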
Hotel search ranking is the core function of Online Travel Platforms (OTPs), and the geographical information of the location entities involved plays a critically important role in guaranteeing ranking quality. The closest line of work to the hotel search ranking problem is the next POI (or location) recommendation problem, which has been studied extensively but fails to cope with two new challenges in the hotel search ranking scenario: the consideration of additional location entity types and the effective utilization of geographical information. To this end, we propose a General Geography-aware representation NETwork (G2NET for short) to better represent the geographical information of location entities and thereby optimize hotel search ranking. In G2NET, to address the first challenge, we first propose the concept of a Geography Interaction Schema (GIS), a meta template for representing an arbitrary number of location entity types and their interactions. A novel geography interaction encoder is then devised to provide general representation ability for instances of the GIS, followed by an attentive operation that aggregates, in a weighted manner, the representations of instances corresponding to all hotels a user has historically interacted with. The second challenge is handled by the combined application of three proposed geography embedding modules in G2NET, each of which computes embeddings of location entities based on a particular aspect of their geographical information. Moreover, a self-attention layer is deployed in G2NET to capture correlations among a user's historically interacted hotels, which is essential for understanding the user's behaviors. Both offline and online experiments show that G2NET outperforms state-of-the-art methods. G2NET has been successfully deployed to provide high-quality hotel search ranking at Fliggy, one of the most popular OTPs in China, serving tens of millions of users.
In the medical insurance industry, a great deal of human labor is required to collect information from claimants. Human assessors need to converse with claimants in order to record key information and organize it into a structured summary. To help save this labor, we propose the task of conversation-oriented structured summarization, which aims to automatically produce the desired structured summary from a conversation. One major challenge of the task is that the structured summary contains multiple fields of different types. To tackle this problem, we propose a unified approach, COSSUM, based on prompting, which generates the values of all fields simultaneously. By learning all fields together, our approach can capture the inherent relationships between them. Moreover, we propose a specially designed curriculum learning strategy for model training. Both automatic and human evaluations are performed, and the results show the effectiveness of our proposed approach.
In industrial applications such as online advertising and recommendation systems, diverse and accurate user profiles can greatly improve personalization. Deep learning is widely applied to mine expressive tags for users from their historical interactions in the system, e.g., click and conversion actions in the advertising chain. The usual approach is to take a certain action as the objective and introduce multiple independent two-tower models to predict the probability of users acting on tags (known as CTR or CVR prediction). The tags predicted to be most attractive to a user are then used to represent the user's preferences. However, such single-action models cannot learn complementarily and do not support effective training on data-sparse actions. Besides, limited by the lack of information fusion between the two towers, the model learns insufficient representations of users' preferences across various tag topics. This paper introduces a novel multi-task model called Mixture of Virtual-Kernel Experts (MVKE) to learn user preferences on various actions and topics jointly. In MVKE, we propose the concept of a Virtual-Kernel Expert, which focuses on modeling one particular facet of the user's preferences, and all experts learn in coordination. Besides, the gate-based structure used in MVKE builds an information fusion bridge between the two towers, improving the model's capability while maintaining high efficiency. We apply the model in the Tencent Advertising System, where both online and offline evaluations show that our method significantly improves over existing ones and brings an obvious lift to actual advertising revenue.
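To make the gate-based expert idea concrete, below is a toy user tower in which several "virtual-kernel experts" each produce a candidate user vector and a tag-conditioned gate mixes them before the two-tower dot product. All dimensions and layer choices are hypothetical; this is a sketch of the general pattern, not the Tencent implementation.

```python
import torch
import torch.nn as nn

class VirtualKernelUserTower(nn.Module):
    """Toy user tower: each expert models one facet of user preference, and a
    gate conditioned on the tag-tower output mixes the experts into one user vector."""
    def __init__(self, user_dim=64, tag_dim=32, hidden=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(user_dim, hidden), nn.ReLU(),
                           nn.Linear(hidden, tag_dim)) for _ in range(n_experts)]
        )
        self.gate = nn.Linear(tag_dim, n_experts)   # gate sees the tag embedding

    def forward(self, user_feat, tag_emb):
        expert_out = torch.stack([e(user_feat) for e in self.experts], dim=1)  # [B, E, D]
        weights = torch.softmax(self.gate(tag_emb), dim=-1).unsqueeze(-1)      # [B, E, 1]
        user_vec = (weights * expert_out).sum(dim=1)                            # [B, D]
        return torch.sigmoid((user_vec * tag_emb).sum(dim=-1))                  # pCTR/pCVR-style score

# score = VirtualKernelUserTower()(torch.randn(8, 64), torch.randn(8, 32))
```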
Given the risks and cost of hospitalization, there has been significant interest in exploiting machine learning models to improve perioperative care. However, due to the high dimensionality and noisiness of perioperative data, it remains a challenge to develop accurate and robust encodings for surgical predictions. Furthermore, it is important for the encoding to be interpretable by perioperative care practitioners to facilitate their decision-making process. We propose the clinical variational autoencoder (cVAE), a deep latent variable model that addresses the challenges of surgical applications through two salient features. (1) To overcome the performance limitations of the traditional VAE, it is prediction-guided, with explicit expression of the predicted outcome in the latent representation. (2) It disentangles the latent space so that it can be interpreted in a clinically meaningful fashion. We apply cVAE to two real-world perioperative datasets to evaluate its efficacy and performance in predicting outcomes that are important to perioperative care, including postoperative complications and surgery duration. To demonstrate generality and facilitate reproducibility, we also apply cVAE to the open MIMIC-III dataset for predicting ICU duration and mortality. Our results show that the latent representation provided by cVAE leads to superior performance in classification, regression, and multi-task predictions. The two features of cVAE are mutually beneficial and eliminate the need for a separate predictor. We further demonstrate the interpretability of the disentangled representation and its capability to capture intrinsic characteristics of hospitalized patients. While this work is motivated by and evaluated in the context of clinical applications, the proposed approach may be generalized to other fields with high-dimensional, noisy data and a need for interpretable representations.
Recurring outbreaks of COVID-19 have had enduring effects on global society, which calls for a predictor of pandemic waves that uses various data sources with early availability. Existing prediction models that forecast the first outbreak wave using mobility data may not be applicable to multiwave prediction, because evidence from the USA and Japan has shown that mobility patterns across different waves exhibit varying relationships with fluctuations in infection cases. Therefore, to predict multiwave pandemics, we propose a Social Awareness-Based Graph Neural Network (SAB-GNN) that considers the decay of symptom-related web search frequency to capture the changes in public awareness across multiple waves. Our model combines a GNN and an LSTM to model the complex relationships among urban districts, inter-district mobility patterns, web search history, and future COVID-19 infections. We train our model to predict future pandemic outbreaks in the Tokyo area using its mobility and web search data from April 2020 to May 2021 across four pandemic waves, collected by Yahoo Japan Corporation under strict privacy protection rules. Results demonstrate that our model outperforms state-of-the-art baselines such as ST-GNN, MPNN, and GraphLSTM. Moreover, the model is not computationally expensive (only 3 layers and 10 hidden neurons), and it enables public agencies to anticipate and prepare for future pandemic outbreaks.
Predictive autoscaling (autoscaling with workload forecasting) is an important mechanism that supports autonomous adjustment of computing resources in accordance with fluctuating workload demands in the Cloud. In recent works, Reinforcement Learning (RL) has been introduced as a promising approach for learning resource management policies that guide scaling actions in the dynamic and uncertain cloud environment. However, RL methods face several challenges in steering predictive autoscaling: a lack of accuracy in decision-making, inefficient sampling, and significant variability in workload patterns that may cause policies to fail at test time. To this end, we propose an end-to-end predictive meta model-based RL algorithm that aims to optimally allocate resources to maintain a stable CPU utilization level. It incorporates a specially designed deep periodic workload prediction model as the input and embeds the Neural Process [11, 16] to guide the learning of optimal scaling actions over numerous application services in the Cloud. Our algorithm not only ensures the predictability and accuracy of the scaling strategy, but also enables the scaling decisions to adapt to changing workloads with high sample efficiency. Our method has achieved significant performance improvement over existing algorithms and has been deployed online at Alipay, supporting the autoscaling of applications for the world-leading payment platform.
Learning-to-Rank (LTR) systems are ubiquitous in web applications nowadays. The existing literature mainly focuses on improving ranking performance by trying to generate the optimal order of candidate items. However, virtually all advanced ranking functions are not scale calibrated. For example, rankers have the freedom to add a constant to all item scores without changing their relative order. This property has resulted in several limitations in deploying advanced ranking methods in practice. On the one hand, it limits the use of effective ranking functions in important applications. For example, in ads ranking, predicted Click-Through Rate (pCTR) is used for ranking and is required to be calibrated for the downstream ads auction. This is a major reason that existing ads ranking methods use scale calibrated pointwise loss functions that may sacrifice ranking performance. On the other hand, popular ranking losses are translation-invariant. We rigorously show that, both theoretically and empirically, this property leads to training instability that may cause severe practical issues.
In this paper, we study how to perform scale calibration of deep ranking models to address the above concerns. We design three different formulations to calibrate ranking models through calibrated ranking losses. Unlike existing post-processing methods, our calibration is performed during training, which can resolve the training instability issue without any additional processing. We conduct experiments on the standard LTR benchmark datasets and one of the largest sponsored search ads datasets from Google. Our results show that the proposed calibrated ranking losses achieve nearly optimal results in terms of both ranking quality and score scale calibration.
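One plausible way to realize training-time score calibration, in the spirit of (but not necessarily identical to) the paper's formulations, is to add a pointwise calibration anchor to a translation-invariant listwise loss, as sketched below; the loss weight and label handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def calibrated_ranking_loss(scores, labels, alpha=0.5):
    """Listwise softmax cross-entropy (translation-invariant) plus a pointwise
    sigmoid cross-entropy anchor that pins scores to an absolute scale.
    scores, labels: [batch, list_size]; labels are click / graded relevance labels."""
    # Ranking term: softmax over the list, target distribution derived from labels.
    target = labels / labels.sum(dim=1, keepdim=True).clamp(min=1e-9)
    rank_loss = -(target * F.log_softmax(scores, dim=1)).sum(dim=1).mean()
    # Calibration term: treat scores as logits of click probabilities.
    calib_loss = F.binary_cross_entropy_with_logits(scores, (labels > 0).float())
    return rank_loss + alpha * calib_loss

# loss = calibrated_ranking_loss(torch.randn(4, 10), torch.randint(0, 2, (4, 10)).float())
```

The calibration term breaks the translation invariance of the softmax term, which is the mechanism by which training-time calibration can also remove the instability discussed above.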
In large-scale online services, crucial metrics, a.k.a., key performance indicators (KPIs), are monitored periodically to check the running statuses. Generally, KPIs are aggregated along multiple dimensions and derived by complex calculations among fundamental metrics from the raw data. Once abnormal KPI values are observed, root cause analysis (RCA) can be applied to identify the reasons for anomalies, so that we can troubleshoot quickly. Recently, several automatic RCA techniques were proposed to localize the related dimensions (or a combination of dimensions) to explain the anomalies. However, their analyses are limited to the data on the abnormal metric and ignore the data of other metrics which are also related to the anomalies, leading to imprecise or even incorrect root causes. To this end, we propose a cross-metric multi-dimensional root cause analysis method, named CMMD, which consists of two key components: 1) relationship modeling, which utilizes graph neural network (GNN) to model the unknown complex calculation among metrics and aggregation function among dimensions from historical data; 2) root cause localization, which adopts the genetic algorithm to efficiently and effectively dive into the raw data and localize the abnormal dimension(s) once the KPI anomalies are detected. Experiments on synthetic datasets, real-world datasets and online production environments demonstrate the superiority of our proposed CMMD method compared with baselines. Currently, CMMD is running as an online service in Microsoft Azure.
The task of road extraction has attracted remarkable attention due to its critical role in facilitating urban development and up-to-date map maintenance, with widespread applications such as navigation and autonomous driving. Existing solutions either rely on a single source of data for road graph extraction or fuse multimodal information in a sub-optimal way. In this paper, we present an automatic road extraction solution named DuARE, designed to exploit multimodal knowledge for underlying road extraction in a fully automatic manner. Specifically, we collect a large-scale real-world dataset of paired aerial images and trajectory data, covering over 33,000 km² in more than 80 cities. First, road extraction is performed on the abundant spatial-temporal trajectory data adaptively, based on the density distribution. Then, a coarse-to-fine road graph learner from aerial images is proposed to take advantage of local and global context. Finally, our cross-check-based fusion approach keeps the optimal state of each modality while revisiting the original trajectory map under the guidance of aerial predictions to further improve performance. Extensive experiments conducted on large-scale real-world datasets demonstrate the superiority and effectiveness of DuARE. In addition, DuARE has been deployed in production at Baidu Maps since June 2021 and updates the road network at a rate of 100,000 km per month. This confirms that DuARE is a practical and industrial-grade solution for large-scale, cost-effective road extraction from multimodal data.
Although conceptualization has been widely studied in semantics and knowledge representation, it is still challenging to find the most accurate concept terms to tag fast-growing social media content. This is partly because most traditional knowledge bases contain general terms about the world, such as trees and cars, which are not interesting to users and lack defining power for social media content. Another reason is that the intricate use of tense, negation, and grammar in social media content may change the logic or emphasis of the content, thereby shifting its main idea. In this paper, we present TAG, a high-quality concept matching dataset consisting of 10,000 labeled pairs of fine-grained concepts and web-styled natural language sentences, mined from open-domain social media content. The concepts we provide are trending terms on social media and have the right granularity to define user interests, e.g., highly educated actors instead of just actors. In addition, TAG offers a concept graph that interconnects these fine-grained concepts and entities to provide contextual information. We evaluate a wide range of neural text matching models as well as pre-trained language models on the concept matching task on TAG, and show that they are insufficient for tagging social media content with its main idea. We further propose a novel graph-graph matching framework that demonstrates superior abstraction and generalization performance by better utilizing both the structural information in the concept graph and the logical interactions between semantic units in the natural language sentence via syntactic dependency parsing.
Multi-touch attribution (MTA), which aims to estimate the contribution of each advertisement touchpoint in conversion journeys, is essential for budget allocation and automated advertising. Existing methods first train a model on historical data to predict the conversion probability of advertisement journeys, and then calculate the attribution of each touchpoint using counterfactual predictions. These works assume that the conversion prediction model is unbiased, i.e., that it gives accurate predictions on any randomly assigned journey, both factual and counterfactual. Nevertheless, this assumption does not always hold, as user preferences act as a common cause of both ad generation and user conversion, introducing confounding bias and leading to an out-of-distribution (OOD) problem in counterfactual prediction. In this paper, we define the causal MTA task and propose CausalMTA to solve this problem. It systematically eliminates the confounding bias from both static and dynamic perspectives and learns an unbiased conversion prediction model from historical data. We also provide a theoretical analysis proving the effectiveness of CausalMTA given sufficient ad journeys. Extensive experiments on both synthetic and real data from the Alibaba advertising platform show that CausalMTA not only achieves better prediction performance than state-of-the-art methods but also generates meaningful attribution credits across different advertising channels.
On-device machine learning enables the lightweight deployment of recommendation models on local clients, which reduces the burden on cloud-based recommenders and simultaneously incorporates more real-time user features. Nevertheless, cloud-based recommendation in industry remains very important, considering its powerful model capacity and efficient candidate generation from a billion-scale item pool. Previous attempts to integrate the merits of both paradigms mainly resort to a sequential mechanism, which builds the on-device recommender on top of the cloud-based recommendation. However, such a design is inflexible when user interests change dramatically: the on-device model is limited by its small item cache, while the cloud-based recommender, despite its large item pool, does not respond until new refresh feedback arrives. To overcome this issue, we propose a meta controller to dynamically manage the collaboration between the on-device recommender and the cloud-based recommender, and introduce a novel efficient sample construction from a causal perspective to solve the dataset-absence issue of the meta controller. On the basis of the counterfactual samples and the extended training, extensive experiments in industrial recommendation scenarios show the promise of the meta controller for device-cloud collaboration.
Text relevance, or text matching between query and product, is an essential technique for e-commerce search engines; it helps users find the desired products and is crucial to ensuring user experience. A major difficulty for e-commerce text relevance is the severe vocabulary gap between queries and products. Recently, neural networks have become the mainstream approach for text matching owing to their better performance on semantic matching. Practical e-commerce relevance models usually adopt representation-based architectures, which can pre-compute representations offline and are therefore efficient online. Interaction-based models, although they achieve better performance, are mostly time-consuming and hard to deploy online. Recently, BERT has achieved significant progress on many NLP tasks including text matching, and it is of great value but also a big challenge to deploy BERT for the e-commerce relevance task. To realize this goal, we propose ReprBERT, which combines excellent performance with low latency by distilling the interaction-based BERT model into a representation-based architecture. To reduce the performance decline, we investigate its key causes and propose two novel interaction strategies to resolve the absence of representation interaction and low-level semantic interaction. Finally, ReprBERT loses only about 1.5% AUC relative to the interaction-based BERT, while gaining more than 10% AUC over previous state-of-the-art representation-based models. ReprBERT has been deployed on the Taobao search engine and serves the entire search traffic, achieving significant gains in user experience and business profit.
As we move toward a cookie-less world, the ability to track users' online activities for behavior targeting will be drastically reduced, making contextual targeting an appealing alternative for advertising platforms. Category-based contextual targeting displays ads on web pages that are relevant to advertiser-targeted categories, according to a pre-defined taxonomy. Accurate web page classification is key to the success of this approach. In this paper, we use multilingual Transformer-based transfer learning models to classify web pages in five high-impact languages. We adopt multiple data sampling techniques to increase coverage for rare categories, and modify the loss using class-based re-weighting to smooth the influence of frequent versus rare categories. Offline evaluation shows that these are crucial for improving our classifiers. We leverage knowledge distillation to train accurate models that are lightweight in terms of (i) model size, and (ii) the input text used. Classifying web pages using only text from the URL addresses a unique challenge for contextual targeting in that bid requests come to ad systems as URLs without content, while crawling is time consuming and costly. We launched the proposed models for contextual targeting in the Yahoo DSP, significantly increasing its revenue.
Spaced repetition is a mnemonic technique where long-term memory can be efficiently formed by following review schedules. For greater memorization efficiency, spaced repetition schedulers need to model students' long-term memory and optimize the review cost. We have collected 220 million students' memory behavior logs with time-series features and built a memory model with Markov property. Based on the model, we design a spaced repetition scheduler guaranteed to minimize the review cost by a stochastic shortest path algorithm. Experimental results have shown a 12.6% performance improvement over the state-of-the-art methods. The scheduler has been successfully deployed in the online language-learning app MaiMemo to help millions of students.
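The core optimization described here is a stochastic shortest-path problem over a Markov memory model. The toy value-iteration sketch below uses an invented forgetting curve and invented promotion rules (not MaiMemo's fitted model from 220 million logs) purely to show how a review cost can be minimized state by state.

```python
import numpy as np

# Toy spaced-repetition MDP (all numbers invented for illustration):
# memory states 0..N-1 have growing half-lives; state N is "mastered" (absorbing).
# Action = review interval (days); each review costs 1. Successful recall after a
# longer gap promotes the state more (a crude spacing effect); a lapse drops one level.
N = 10
halflife = 1.6 ** np.arange(N)                 # days
intervals = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
promote = np.array([1, 1, 2, 2, 3])

value = np.zeros(N + 1)                        # expected #reviews until mastery
for _ in range(5000):                          # value iteration = stochastic shortest path
    new_value = value.copy()
    for s in range(N):
        p = 2.0 ** (-intervals / halflife[s])  # recall probability for each interval
        nxt_ok = np.minimum(s + promote, N)
        nxt_bad = max(s - 1, 0)
        q = 1 + p * value[nxt_ok] + (1 - p) * value[nxt_bad]
        new_value[s] = q.min()                 # choose the cheapest review interval
    if np.max(np.abs(new_value - value)) < 1e-9:
        break
    value = new_value

print("Expected reviews to mastery per state:", np.round(value[:N], 2))
```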
Graph neural networks (GNNs) are deep learning models designed specifically for graph data, and they typically rely on node features as the input to the first layer. When applying such a type of network on the graph without node features, one can extract simple graph-based node features (e.g., number of degrees) or learn the input node representations (i.e., embeddings) when training the network. While the latter approach, which trains node embeddings, more likely leads to better performance, the number of parameters associated with the embeddings grows linearly with the number of nodes. It is therefore impractical to train the input node embeddings together with GNNs within graphics processing unit (GPU) memory in an end-to-end fashion when dealing with industrial-scale graph data. Inspired by the embedding compression methods developed for natural language processing (NLP) tasks, we develop a node embedding compression method where each node is compactly represented with a bit vector instead of a floating-point vector. The parameters utilized in the compression method can be trained together with GNNs. We show that the proposed node embedding compression method achieves superior performance compared to the alternatives.
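As a rough illustration of representing each node with a trainable bit vector, the sketch below binarizes a real-valued code with a straight-through estimator and projects the bits to the GNN input dimension. It is a generic sketch, not the paper's compression scheme, and it does not reproduce the paper's actual memory savings (the real-valued codes here are still stored as floats during training).

```python
import torch
import torch.nn as nn

class BinaryNodeEmbedding(nn.Module):
    """Toy compressed node embedding: a learnable code per node is binarized to a
    bit vector in the forward pass (straight-through estimator), then projected
    to the GNN input dimension so it can be trained jointly with the GNN."""
    def __init__(self, num_nodes, code_bits=64, out_dim=128):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_nodes, code_bits) * 0.1)
        self.proj = nn.Linear(code_bits, out_dim)

    def forward(self, node_ids):
        real = self.codes[node_ids]
        bits = (real > 0).float()               # {0,1} bit vector per node
        bits = bits + (real - real.detach())    # straight-through gradient
        return self.proj(bits)

# emb = BinaryNodeEmbedding(num_nodes=10_000)(torch.tensor([0, 42, 7]))
```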
Age-related macular degeneration (AMD) is the leading cause of irreversible blindness in developed countries. Identifying patients at high risk of progression to late AMD, the sight-threatening stage, is critical for clinical actions, including medical interventions and timely monitoring. Recently, deep-learning-based models have been developed and have achieved superior performance for late AMD prediction. However, most existing methods are limited to the color fundus photography (CFP) from the last ophthalmic visit and do not include the longitudinal CFP history and AMD progression from previous visits. Patients in different AMD subphenotypes might progress at different speeds through the stages of AMD, so capturing the progression information from previous visits can be useful for predicting AMD progression. In this work, we propose a Contrastive-Attention-based Time-aware Long Short-Term Memory network (CAT-LSTM) to predict AMD progression. First, we adopt a convolutional neural network (CNN) model with a contrastive attention module (CA) to extract abnormal features from CFPs. Then we utilize a time-aware LSTM (T-LSTM) to model the patients' history and incorporate the AMD progression information. The combination of disease progression, genotype information, demographics, and CFP features is fed to the T-LSTM. Moreover, we leverage an auto-encoder to represent temporal CFP sequences as fixed-size vectors and adopt k-means to cluster them into subphenotypes. We evaluate the proposed model on real-world datasets, and the results show that it achieves an area under the receiver operating characteristic curve (AUROC) of 0.925 for 5-year late-AMD prediction, outperforming state-of-the-art methods by more than 3%, which demonstrates the effectiveness of the proposed CAT-LSTM. After analyzing the patient representations learned by the auto-encoder, we identify 3 novel subphenotypes of AMD patients with different characteristics and progression rates to late AMD, paving the way for improved personalization of AMD management. The code of CAT-LSTM can be found at https://github.com/yinchangchang/CAT-LSTM.
Large-scale vehicle trajectories bring great benefits for understanding urban mobility and can be used to promote a wide range of applications in building intelligent transportation systems. Traditional approaches cannot recover the trajectories of all vehicles on the roads since they are based on partial trajectory data. To address this, we study all-vehicle trajectory recovery based on traffic camera video data. There are two challenges in this study. First, the quality of the images captured by traffic cameras is uneven, so it is hard to identify the same vehicles. Second, the traffic camera observations are sparse due to incomplete camera coverage and possible vehicle misses by the cameras. To deal with these challenges, we design a novel system to recover vehicle trajectories at the granularity of road intersections. In this system, we propose an iterative framework to jointly optimize the vehicle re-identification and trajectory recovery tasks. In the vehicle re-identification task, we propose an effective strategy to guide the vehicle clustering based on visual features and the spatio-temporal constraint features updated by the trajectory recovery task. In the trajectory recovery task, we model the spatial and temporal relations as well as the vehicle-miss problem with a probabilistic approach to recover the trajectories. Extensive experiments demonstrate that our framework outperforms existing state-of-the-art solutions. Finally, our system is deployed in practical applications at SenseTime, China, including traffic congestion analysis and traffic signal control.
Large-scale pre-trained language models (PLMs) have shown promising advances on various downstream tasks, among which dialogue is one of the most prominent. However, it remains challenging for individual developers to create a knowledge-grounded dialogue system on top of such big models, because of the expensive cost of collecting the knowledge resources needed to support the system and of tuning these large models for the task. To tackle these obstacles, we propose XDAI, a knowledge-grounded dialogue system that is equipped with prompt-aware, tuning-free PLM exploitation and supported by ready-to-use open-domain external knowledge resources plus an easy-to-change domain-specific mechanism. With XDAI, developers can leverage PLMs without any fine-tuning cost to quickly create open-domain dialogue systems, as well as easily customize their own domain-specific systems. Extensive experiments including human evaluation, a Turing test, and online evaluation have demonstrated the competitive performance of XDAI compared with state-of-the-art general PLMs and PLMs specialized for dialogue. XDAI pilots studies on the exploitation of PLMs and yields intriguing findings that could inspire future research on other PLM-based applications.
Developers and related researchers can get access to our repository at https://github.com/THUDM/XDAI, which presents a series of APIs, incremental toolkits and chatbot service of XDAI platform.
We introduce CommerceMM - a multimodal model capable of providing a diverse and granular understanding of commerce topics associated with a given piece of content (image, text, or image+text), and of generalizing to a wide range of tasks, including Multimodal Categorization, Image-Text Retrieval, Query-to-Product Retrieval, Image-to-Product Retrieval, etc. We follow the pre-training + fine-tuning regime and present 5 effective pre-training tasks on image-text pairs. To embrace more common and diverse commerce data with text-to-multimodal, image-to-multimodal, and multimodal-to-multimodal mappings, we propose another 9 novel cross-modal and cross-pair retrieval tasks, called Omni-Retrieval pre-training. We also propose a novel modality-randomization approach to dynamically adjust our model under different efficiency constraints. The pre-training is conducted in an efficient manner, with only two forward/backward updates for the combined 14 tasks. Extensive experiments and analysis show the effectiveness of each task. When combining all pre-training tasks, our model achieves state-of-the-art performance on 7 commerce-related downstream tasks after fine-tuning.
Video advertisements can grasp customers' attention instantly and are often favored by advertisers. Since the corpus is vast, achieving efficient query-to-video search can be challenging. Traditional approximate nearest neighbor (ANN) search methods are based on simple similarities (e.g., cosine or inner products) between embedding vectors; they are often insufficient for bridging the modal gap between a text query and video advertisements, and typically achieve only sub-optimal performance in query-to-video search. The tree-based deep model (TDM) overcomes the limited matching capability of embedding-based methods but suffers from the data sparsity problem. The deep retrieval model adopts a graph-based structure that overcomes the data sparsity problem in TDM by sharing nodes, but the shared nodes entangle features of different items, making it difficult to distinguish similar items. In this work, we enhance the graph-based model through sub-path embedding to differentiate similar videos. The added sub-path embedding provides personalized characteristics, which are beneficial for modeling fine-grained details to discriminate similar items. After launching the enhanced graph model (EGM), the click-through rate (CTR) increased relatively by 1.33% and the conversion rate (CVR) by 1.07%.
Corporate credit ratings issued by third-party rating agencies are quantified assessments of a company's creditworthiness. Credit ratings highly correlate with the likelihood of a company defaulting on its debt obligations. These ratings play a critical role in investment decision-making as one of the key risk factors. They are also central to regulatory frameworks such as BASEL II in calculating the necessary capital for financial institutions. Being able to predict rating changes would greatly benefit both investors and regulators. In this paper, we consider the corporate credit rating migration early prediction problem, which predicts whether the credit rating of an issuer will be upgraded, unchanged, or downgraded after 12 months based on its latest financial reporting information at the time. We investigate the effectiveness of different standard machine learning algorithms and find that these models deliver inferior performance. As part of our contribution, we propose a new Multi-task Envisioning Transformer-based Autoencoder (META) model to tackle this challenging problem. META consists of Positional Encoding, a Transformer-based Autoencoder, and Multi-task Prediction to learn effective representations for both migration prediction and rating prediction. This enables META to better exploit the historical data in the training stage for one-year-ahead prediction. Experimental results show that META outperforms all baseline models.
Embedding learning is an important technique in deep recommendation models for mapping categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which become the storage and efficiency bottleneck. Distributed training solutions have been adopted to partition the embedding tables across multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge for distributed systems, known as embedding table sharding, i.e., how to partition the embedding tables to balance the costs across devices. It is a non-trivial task because 1) it is hard to efficiently and precisely measure the cost, and 2) the partition problem is known to be NP-hard. In this work, we introduce our novel practice at Meta, namely AutoShard, which uses a neural cost model to directly predict multi-table costs and leverages deep reinforcement learning to solve the partition problem. Experimental results on an open-sourced large-scale synthetic dataset and Meta's production dataset demonstrate the superiority of AutoShard over the heuristics. Moreover, the learned policy of AutoShard can transfer to sharding tasks with various numbers of tables and different ratios of unseen tables without any fine-tuning. Furthermore, AutoShard can efficiently shard hundreds of tables in seconds. The effectiveness, transferability, and efficiency of AutoShard make it desirable for production use. Our algorithms have been deployed in the Meta production environment. A prototype is available at https://github.com/daochenzha/autoshard
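AutoShard itself uses a learned cost model plus deep RL; for context, the sketch below shows the kind of greedy heuristic baseline such a policy competes with: assign each table, in descending order of (predicted) cost, to the currently least-loaded device. The table names and per-table costs are placeholders.

```python
import heapq

def greedy_shard(table_costs, num_devices):
    """Longest-processing-time greedy: place each embedding table (largest
    estimated cost first) on the device with the smallest current load."""
    heap = [(0.0, d, []) for d in range(num_devices)]   # (load, device_id, tables)
    heapq.heapify(heap)
    for table, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, dev, tables = heapq.heappop(heap)
        tables.append(table)
        heapq.heappush(heap, (load + cost, dev, tables))
    return sorted(heap, key=lambda x: x[1])

# Placeholder per-table costs (e.g., predicted lookup latency in ms):
costs = {"user_id": 9.0, "item_id": 7.5, "ad_id": 3.0, "city": 0.5, "device": 0.7}
for load, dev, tables in greedy_shard(costs, num_devices=2):
    print(f"device {dev}: load={load:.1f}, tables={tables}")
```

Such heuristics assume per-table costs are known and additive; the paper's point is that neither assumption holds in practice, which is why a learned cost model and a learned policy help.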
Watch-time prediction remains a key factor in reinforcing user engagement via video recommendations. It has become increasingly important given the ever-growing popularity of online videos. However, the prediction of watch time not only depends on the match between the user and the video but is often misled by the duration of the video itself. With the goal of improving watch time, recommendation is always biased towards videos with long duration. Models trained on this imbalanced data face the risk of bias amplification, which misguides platforms to over-recommend long videos while overlooking the underlying user interests. This paper presents the first work to study duration bias in watch-time prediction for video recommendation. We employ a causal graph showing that duration is a confounding factor that concurrently affects video exposure and watch-time prediction: the effect on exposure causes the bias issue and should be eliminated, while the effect on watch time originates from the video's intrinsic characteristics and should be preserved. To remove the undesired bias while retaining the natural effect, we propose a Duration-Deconfounded Quantile-based (D2Q) watch-time prediction framework, which scales to industrial production systems. Through extensive offline evaluation and live experiments, we showcase the effectiveness of this duration-deconfounding framework by significantly outperforming state-of-the-art baselines. We have fully launched our approach on the Kuaishou App, where it has substantially improved real-time video consumption due to more accurate watch-time predictions.
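As a rough reading of the duration-deconfounding idea (not Kuaishou's production code), the sketch below groups samples into equal-frequency duration buckets and converts watch time into its within-bucket quantile, so the target reflects relative engagement rather than raw duration; the data and bucket count are synthetic placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"duration": rng.uniform(10, 600, 10_000)})      # video length (s)
# Placeholder watch time that grows with duration (the confounding effect).
df["watch_time"] = df["duration"] * rng.beta(2, 5, len(df))

# Equal-frequency duration buckets, then within-bucket watch-time quantiles.
# The quantile is (by construction) roughly independent of duration, so a model
# trained on it is no longer rewarded simply for favoring long videos.
df["dur_bucket"] = pd.qcut(df["duration"], q=10, labels=False)
df["wt_quantile"] = df.groupby("dur_bucket")["watch_time"].rank(pct=True)

print(df.groupby("dur_bucket")["wt_quantile"].mean())               # ~0.5 in every bucket
```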
Oracle Bone Inscriptions (OBI) constitute one of the oldest scripts in the world. The rejoining of Oracle Bone (OB) fragments is of vital importance to the research of ancient scripts and history. Although significant progress has been achieved in the past decades, the rejoining work still heavily relies on domain knowledge and manual work, and thus remains an inefficient and time-consuming process. Therefore, an automatic and practical algorithm/system for OB rejoining is of great value to the OBI community. To this end, we collect a real-world dataset for rejoining Oracle Bone fragments, namely OB-Rejoin, which consists of 998 OB rubbing images of low quality, due to intrinsic underground erosion over time and extrinsic imaging conditions in the past. Moreover, a practical Self-Supervised Splicing Network, S3-Net, is proposed to rejoin the OB fragments based on the shape similarity of their borderlines. Specifically, we first transform the manually annotated borderline strokes of OB images into time-series-style shape representations, which are fed as input to a Generative Adversarial Network to augment positive pairs of rejoinable OBs for each OB fragment that does not have a rejoinable counterpart. A Siamese network is trained on the augmented data in a contrastive learning manner to retrieve the matching OB fragments of an unseen query from an OB fragment gallery. Experiments on the OB-Rejoin benchmark show that our data-driven approach outperforms two recent methods for time-series analysis. To demonstrate its practical potential, we deploy the proposed S3-Net in real tests and ultimately discover dozens of new rejoinings missed by domain experts for decades.
Embedding-based retrieval (EBR) is a fundamental building block in many web applications. However, EBR in sponsored search is distinguished from other generic scenarios and is technically challenging due to the need to serve multiple retrieval purposes: first, it has to retrieve high-relevance ads, which may exactly serve the user's search intent; second, it needs to retrieve high-CTR ads so as to maximize overall user clicks. In this paper, we present a novel representation learning framework, Uni-Retriever, developed for Bing Search, which unifies two different training modes, knowledge distillation and contrastive learning, to realize both objectives. On one hand, the capability of high-relevance retrieval is established by distilling knowledge from the "relevance teacher model". On the other hand, the capability of high-CTR retrieval is optimized by learning to discriminate users' clicked ads from the entire corpus. The two training modes are jointly performed as a multi-objective learning process, such that ads of high relevance and high CTR are favored by the generated embeddings. Beyond the learning strategy, we also elaborate our solution for the EBR serving pipeline built upon the substantially optimized DiskANN, where massive-scale EBR can be performed with competitive time and memory efficiency and high quality. We conduct comprehensive offline and online experiments to evaluate the proposed techniques, whose findings may provide useful insights for the future development of EBR systems. Uni-Retriever has been mainstreamed as the major retrieval path in Bing's production system thanks to the notable improvements in representation and EBR serving quality.
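A minimal sketch of how the two training modes could be combined into one objective is shown below: an in-batch contrastive term for the click (CTR) objective plus a KL-based distillation term toward a relevance teacher's scores. The temperature, weighting, and teacher interface are assumptions, not Bing's production configuration.

```python
import torch
import torch.nn.functional as F

def joint_retrieval_loss(q, d, teacher_scores, tau=0.05, alpha=1.0):
    """Sketch of a joint objective: an in-batch contrastive term (clicked ad vs.
    other ads in the batch) plus a distillation term matching the student's score
    distribution to a relevance teacher.
    q, d: [B, dim] query / ad embeddings; teacher_scores: [B, B] teacher logits."""
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    logits = q @ d.t() / tau
    labels = torch.arange(q.size(0))
    ctr_loss = F.cross_entropy(logits, labels)                     # high-CTR objective
    distill = F.kl_div(F.log_softmax(logits, dim=-1),
                       F.softmax(teacher_scores, dim=-1),
                       reduction="batchmean")                      # high-relevance objective
    return ctr_loss + alpha * distill

# loss = joint_retrieval_loss(torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 16))
```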
Felicitas is a distributed cross-device Federated Learning (FL) framework designed to solve the industrial difficulties of FL in large-scale device deployment scenarios. In Felicitas, FL clients are deployed on mobile or embedded devices, while the FL server is deployed on a cloud platform. We also summarize the challenges of FL deployment in industrial cross-device scenarios (massive parallelism, stateless clients, non-use of client identifiers, high unreliability, and unsteady and complex deployment), and provide reliable solutions. We provide the source code and documents at https://www.mindspore.cn/. In addition, Felicitas has been deployed on mobile phones in the real world. At the end of the paper, we demonstrate the validity of the framework through experiments.
A Recommender System (RS) is an important online application that affects billions of users every day. The mainstream RS ranking framework is composed of two parts: a Multi-Task Learning (MTL) model that predicts various kinds of user feedback, e.g., clicks, likes, and shares, and a Multi-Task Fusion (MTF) model that combines the multi-task outputs into one final ranking score with respect to user satisfaction. There has not been much research on the fusion model, although it has a great impact on the final recommendation as the last crucial step of ranking. To optimize long-term user satisfaction rather than greedily pursue instant returns, we formulate the MTF task as a Markov Decision Process (MDP) within a recommendation session and propose a Batch Reinforcement Learning (RL) based Multi-Task Fusion framework (BatchRL-MTF) that includes a Batch RL framework and online exploration. The former exploits Batch RL to learn an optimal recommendation policy offline from fixed batch data for long-term user satisfaction, while the latter explores potentially high-value actions online to break out of the local-optimum dilemma. With a comprehensive investigation of user behaviors, we model the user satisfaction reward with subtle heuristics from two aspects: user stickiness and user activeness. Finally, we conduct extensive experiments on a billion-sample real-world dataset to show the effectiveness of our model. We propose a conservative offline policy estimator (Conservative-OPEstimator) to test our model offline. Furthermore, we run online experiments in a real recommendation environment to compare the performance of different models. As one of the few successful applications of Batch RL to the MTF task, our model has also been deployed on a large-scale industrial short-video platform, serving hundreds of millions of users.
On e-commerce platforms, predicting whether two products are compatible with each other is an important capability for achieving trustworthy product recommendation and search experiences for consumers. However, accurately predicting product compatibility is difficult due to heterogeneous product data and the lack of manually curated training data. We study the problem of discovering effective labeling rules that can enable weakly-supervised product compatibility prediction. We develop AMRule, a multi-view rule discovery framework that can (1) adaptively and iteratively discover novel rules that complement the current weakly-supervised model to improve compatibility prediction; and (2) discover interpretable rules from both structured attribute tables and unstructured product descriptions. AMRule adaptively discovers labeling rules from large-error instances via a boosting-style strategy; the high-quality rules can remedy the current model's weak spots and refine the model iteratively. For rule discovery from structured product attributes, we generate composable high-order rules from decision trees; for rule discovery from unstructured product descriptions, we generate prompt-based rules from a pre-trained language model. Experiments on 4 real-world datasets show that AMRule outperforms the baselines by 5.98% on average and improves rule quality and rule proposal efficiency.
There is no shortage of outlier detection (OD) algorithms in the literature, yet a vast body of them are designed for a single machine. With the increasing reality of already cloud-resident datasets comes the need for distributed OD techniques. This area, however, is not only understudied but also short of public-domain implementations for practical use. This paper aims to fill this gap: we design Sparx, a data-parallel OD algorithm suitable for shared-nothing infrastructures, which we specifically implement in Apache Spark. Through extensive experiments on three real-world datasets, with several billions of points and millions of features, we show that existing open-source solutions fail to scale up, either to a large number of points or to high dimensionality, whereas Sparx yields scalable and effective performance. To facilitate practical use of OD on modern-scale datasets, we open-source Sparx under the Apache license at https://tinyurl.com/sparx2022.
It is critical to detect anomalies in event sequences, which have become widely available in many application domains. Indeed, various efforts have been made to capture abnormal patterns from event sequences through sequential pattern analysis or event representation learning. However, existing approaches usually ignore the semantic information of event content. To this end, in this paper, we propose a self-attentive encoder-decoder transformer framework, the Content-Aware Transformer (CAT), for anomaly detection in event sequences. In CAT, the encoder learns preamble event sequence representations with content awareness, and the decoder embeds sequences under detection into a latent space where anomalies are distinguishable. Specifically, the event content is first fed to a content-awareness layer, generating representations of each event. The encoder accepts the preamble event representation sequence and generates feature maps. In the decoder, an additional token is added at the beginning of the sequence under detection, denoting the sequence status. A one-class objective together with a sequence reconstruction loss is applied to train our framework under a label-efficient scheme. Furthermore, CAT is optimized under a scalable and efficient setting. Finally, extensive experiments on three real-world datasets demonstrate the superiority of CAT.
Leveraging artificial intelligence (AI) techniques in medical applications is helping our world deal with the shortage of healthcare workers and improve the efficiency and productivity of healthcare delivery. Intelligent pre-consultation (IPC) is a relatively new application deployed on mobile terminals for collecting a patient's information before a face-to-face consultation. It takes advantage of state-of-the-art machine learning techniques to assist doctors in clinical decision-making. One of the key functions of IPC is to detect medical symptoms from patient queries. By extracting symptoms from patient queries, IPC is able to collect more information on the patient's health status by asking symptom-related questions. All collected information is summarized as a medical record for doctors to make clinical decisions, which saves a great deal of time for both doctors and patients. Detecting symptoms from patients' queries is challenging, as most patients lack medical background and often tend to use colloquial language to describe their symptoms. In this work, we formulate symptom detection as a retrieval problem and propose a bi-directional hard-negative enforced noise contrastive estimation method (Bi-hardNCE) to tackle it. Bi-hardNCE has both forward contrastive estimation and backward contrastive estimation, which forces the model to distinguish the true symptom from negative symptoms and, at the same time, distinguish the true query from negative queries. To include more informative negatives, our Bi-hardNCE adopts a hard-negative mining strategy and a false-negative elimination strategy, which achieves a significant improvement in performance. Our proposed model outperforms commonly used retrieval models by a large margin.
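The bidirectional part of Bi-hardNCE can be sketched as a symmetric InfoNCE over a batch of (query, symptom) pairs, as below; the hard-negative mining and false-negative elimination steps described above are omitted, and the function name and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bi_directional_nce(q, s, tau=0.05):
    """Sketch of a bidirectional contrastive objective: the forward term asks each
    query to identify its true symptom among in-batch negatives, and the backward
    term asks each symptom to identify its true query.
    q: [B, dim] query embeddings; s: [B, dim] embeddings of the paired symptoms."""
    q, s = F.normalize(q, dim=-1), F.normalize(s, dim=-1)
    logits = q @ s.t() / tau
    labels = torch.arange(q.size(0))
    forward_loss = F.cross_entropy(logits, labels)        # query -> symptom
    backward_loss = F.cross_entropy(logits.t(), labels)   # symptom -> query
    return 0.5 * (forward_loss + backward_loss)

# loss = bi_directional_nce(torch.randn(32, 256), torch.randn(32, 256))
```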
Graph neural networks (GNNs) have achieved great success in many graph-based applications. However, the enormous size and high sparsity level of graphs hinder their applications under industrial scenarios. Although some scalable GNNs are proposed for large-scale graphs, they adopt a fixed K-hop neighborhood for each node, thus facing the over-smoothing issue when adopting large propagation depths for nodes within sparse regions. To tackle the above issue, we propose a new GNN architecture --- Graph Attention Multi-Layer Perceptron (GAMLP), which can capture the underlying correlations between different scales of graph knowledge. We have deployed GAMLP in Tencent with the Angel platform, and we further evaluate GAMLP on both real-world datasets and large-scale industrial datasets. Extensive experiments on these 14 graph datasets demonstrate that GAMLP achieves state-of-the-art performance while enjoying high scalability and efficiency. Specifically, it outperforms GAT by 1.3% regarding predictive accuracy on our large-scale Tencent Video dataset while achieving up to 50x training speedup. Besides, it ranks top-1 on both the leaderboards of the largest homogeneous and heterogeneous graph (i.e., ogbn-papers100M and ogbn-mag) of Open Graph Benchmark.
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM) for effectively understanding and representing mathematical problems. Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement. Typically, it requires complex mathematical logic and background knowledge for solving mathematical problems.
Considering the complex nature of mathematical texts, we design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses. Specifically, we first perform token-level pre-training based on a position-biased masking strategy, and then design logic-based pre-training tasks that aim to recover the shuffled sentences and formulas, respectively. Finally, we introduce a more difficult pre-training task that requires the PLM to detect and correct the errors in its generated solutions. We conduct extensive experiments on offline evaluation (including nine math-related tasks) and an online A/B test. Experimental results demonstrate the effectiveness of our approach compared with a number of competitive baselines. Our code is available at: https://github.com/RUCAIBox/JiuZhang.
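As an illustration of what a position-biased masking strategy might look like (the direction and strength of the bias here are assumptions, not the JiuZhang recipe), the sketch below masks later tokens with a higher probability than earlier ones.

```python
import torch

def position_biased_mask(input_ids, base_rate=0.15, mask_token_id=103):
    """Toy position-biased masking: later tokens are masked more often, under the
    assumption that the question and answer of a math problem tend to appear late
    in the text. The linear bias, 15% base rate, and mask id 103 (BERT's [MASK])
    are illustrative choices. Returns masked ids and the boolean mask."""
    seq_len = input_ids.size(0)
    positions = torch.arange(seq_len, dtype=torch.float)
    rate = base_rate * (0.5 + positions / max(seq_len - 1, 1))   # 0.5x .. 1.5x base rate
    mask = torch.rand(seq_len) < rate
    masked = input_ids.clone()
    masked[mask] = mask_token_id
    return masked, mask

# masked_ids, mask = position_biased_mask(torch.randint(1000, 2000, (64,)))
```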
Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large and heterogeneous, containing many millions or billions of vertices and edges of different types. To tackle this challenge, we develop DistDGLv2, a system that extends DistDGL for training GNNs on massive heterogeneous graphs in a mini-batch fashion, using distributed hybrid CPU/GPU training. DistDGLv2 places graph data in distributed CPU memory and performs mini-batch computation on GPUs. For ease of use, DistDGLv2 adopts an API compatible with the Deep Graph Library (DGL)'s mini-batch training and heterogeneous graph API, which enables distributed training with almost no code modification. To ensure model accuracy, DistDGLv2 follows a synchronous training approach and allows ego-networks forming mini-batches to include non-local vertices. To ensure data locality and load balancing, DistDGLv2 partitions heterogeneous graphs using a multi-level partitioning algorithm with min-edge cut and multiple balancing constraints. DistDGLv2 deploys an asynchronous mini-batch generation pipeline that makes computation and data access asynchronous to fully utilize all hardware (CPU, GPU, network, PCIe). We demonstrate DistDGLv2 on various GNN workloads. Our results show that DistDGLv2 achieves a 2-3x speedup over DistDGL and an 18x speedup over Euler. It takes only 5-10 seconds to complete an epoch on graphs with hundreds of millions of vertices on a cluster with 64 GPUs.
Online medical consultation, which enables patients to remotely consult doctors in the form of web chatting, has become an indispensable part of the social health care system. Recommending suitable doctor candidates for patients is a crucial step, especially given the severe patient cold-start challenge caused by limited historical records and insufficient descriptions of patient conditions. Along this line, in this paper, we propose a novel Dialogue-based Doctor Recommendation (DDR) model, which comprehensively integrates three types of information: the profile and chief complaint of the patient, the historical records of doctors, and the patient-doctor dialogue. Accordingly, we propose 1) a patient encoder that represents the patient's condition and medical requirements; 2) a doctor encoder that distills the doctor's expertise and communication skills; and 3) a dialogue encoder that extracts textual features from doctor-patient conversations. Specifically, since the patient-doctor dialogue is not available at testing time, we propose to simulate the dialogue embedding from the patient embedding via a contrastive-learning-based module. Experimental results on a real-world data set show that the proposed DDR model outperforms state-of-the-art recommendation-based methods. Moreover, considering the difference in accessibility of online medical consultation services between the youth and the elderly, we also conduct a fairness study on the proposed DDR model.
We present Deep network Dynamic Graph Partitioning (DDGP), a novel algorithm for optimizing the division of large graphs for mixture-of-experts graph neural networks. Our work is motivated by the observation that real-world graphs suffer from spatial concept drift, which is detrimental to neural network training. We answer the question of how to divide a graph so that the vertices in each subgraph share a similar distribution and an expert network trained on each subgraph yields the best learning outcome. DDGP is a two-pronged algorithm that consists of cluster merging followed by cluster boundary refinement. We use the training performance of each expert model as feedback to iteratively refine partition boundaries among subgraphs. These partitions are distinct for each model and graph network. We provide a theoretical proof of convergence for DDGP's boundary refinement as a guarantee of model training stability. Finally, we demonstrate experimentally that DDGP outperforms state-of-the-art graph partitioning algorithms for a regression task on multiple large real-world graphs, with GraphSage and Graph Attention as our expert models.
Causal inference has wide applications in areas such as e-commerce and precision medicine, and its performance heavily relies on accurate estimation of the Individual Treatment Effect (ITE). Conventionally, ITE is predicted by modeling the treated and control response functions separately in their individual sample spaces. However, such an approach usually encounters two issues in practice: divergent distributions between the treated and control groups due to treatment bias, and significant imbalance in their population sizes. This paper proposes Deep Entire Space Cross Networks (DESCN) to model treatment effects from an end-to-end perspective. DESCN captures the integrated information of the treatment propensity, the response, and the hidden treatment effect through a cross network in a multi-task learning manner. Our method jointly learns the treatment and response functions in the entire sample space to avoid treatment bias and employs an intermediate pseudo treatment effect prediction network to relieve sample imbalance. Extensive experiments are conducted on a synthetic dataset and a large-scale production dataset from the e-commerce voucher distribution business. The results indicate that DESCN can successfully enhance the accuracy of ITE estimation and improve the uplift ranking performance. A sample of the production dataset and the source code are released to facilitate future research in the community; to the best of our knowledge, this is the first large-scale public biased treatment dataset for causal inference.
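For readers less familiar with ITE estimation, the sketch below shows the conventional two-model baseline that the abstract contrasts with: fitting treated and control response functions separately on synthetic data. It is a generic T-learner illustration, not an implementation of DESCN.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
t = rng.binomial(1, 0.3, size=n)                  # treatment assignment (imbalanced)
tau = 2.0 * X[:, 0]                               # true individual treatment effect
y = X[:, 1] + tau * t + rng.normal(size=n)        # observed outcome

# Two-model (T-learner) baseline: fit separate response models
# in the treated and control sample spaces.
mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])

ite_hat = mu1.predict(X) - mu0.predict(X)         # estimated ITE
print("mean abs ITE error:", np.abs(ite_hat - tau).mean())
```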
As one of the fundamental trends for the future development of recommender systems, Fashion Clothes Matching Recommendation for click-through rate (CTR) prediction has become an increasingly essential task. Unlike traditional single-item recommendation, a combo item, composed of a top item (e.g., a shirt) and a bottom item (e.g., a skirt), is recommended. In such a task, the matching effect between these two single items plays a crucial role and greatly influences users' preferences; however, it is usually neglected by previous approaches to CTR prediction. In this work, we tackle this problem by designing a novel algorithm called Combo-Fashion, which extracts the matching effect by introducing the matching history of the combo item with two cascaded modules: (i) a Matching Search Module (MSM), which seeks popular combo items and undesirable ones as a positive set and a negative set, respectively; and (ii) a Matching Prediction Module (MPM), which models the precise relationship between the candidate combo item and the positive/negative set via an attention-based deep model. Besides, the CPM Fashion Attribute, covering characteristic, pattern, and material, is applied to further capture the matching effect. As part of this work, we release two large-scale datasets consisting of 3.56 million and 6.01 million user behaviors with rich context and fashion information over millions of combo items. The experimental results on these two real-world datasets demonstrate the superiority of our proposed model with significant improvements. Furthermore, we have deployed Combo-Fashion on the Taobao platform to recommend combo items to users, where an 8-day online A/B test proved the effectiveness of Combo-Fashion with improvements of 1.02% in pCTR and 0.70% in uCTR.
User-tag profile modeling has become one of the novel and significant trends for the future development of industrial recommendation systems, and it can be divided into two fundamental tasks in practical scenarios: User Preferred Tag (UPT) and Tag Preferred User (TPU). In most existing deep learning models for user-tag profiling, the network takes all the combined tags of an item together with the user features as input during training, but takes only one tag with the user features during testing to evaluate the user's preference for a single tag. This leads to a data discrepancy between the training and testing samples. To address this issue, we first attempt a novel Random Masking Model (RMM) that retains only one tag at training time via masking. However, this introduces two other serious downsides. First, not all tags attached to the same item are equally predictive; irrelevant tags may introduce noisy signals and thus cause performance degradation. Second, it neglects the impact of combined tags aggregated together, which may be an essential factor leading to user clicks. Therefore, we further propose a framework called Contrast Weighted Tag Masking (CWTM) in this work, which tackles these two issues with two modules: (i) a Weighted Masking Module (WMM), which introduces an importance network to compute a score for each tag attached to the item and then samples from these tags in a weighted manner according to the scores; and (ii) a Contrast Module (CM), which makes use of a contrastive learning architecture to inherit and distill an understanding of the effect of aggregated tags. Offline experiments on four datasets (three public datasets and one proprietary industrial dataset) demonstrate the superiority and effectiveness of CWTM over state-of-the-art baselines. Moreover, CWTM has been deployed on the training platform of Alibaba's advertising systems and achieved substantial improvements in ROI and CVR of 16.8% and 9.6%, respectively.
Origin-Destination (O-D) travel demand prediction is a fundamental challenge in transportation. Recently, spatial-temporal deep learning models have demonstrated tremendous potential to enhance prediction accuracy. However, few studies have tackled the uncertainty and sparsity issues in fine-grained O-D matrices. This is a serious problem, because the vast number of zeros deviates from the Gaussian assumption underlying deterministic deep learning models. To address this issue, we design a Spatial-Temporal Zero-Inflated Negative Binomial Graph Neural Network (STZINB-GNN) to quantify the uncertainty of sparse travel demand. It analyzes spatial and temporal correlations using diffusion and temporal convolution networks, which are then fused to parameterize the probabilistic distributions of travel demand. The STZINB-GNN is examined on two real-world datasets with various spatial and temporal resolutions. The results demonstrate the superiority of STZINB-GNN over benchmark models, especially at high spatial-temporal resolutions, because of its high accuracy, tight confidence intervals, and interpretable parameters. The sparsity parameter of the STZINB-GNN has a physical interpretation for various transportation applications.
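As background on the zero-inflated negative binomial output used by STZINB-GNN, here is a minimal sketch of the ZINB probability mass function, which mixes a point mass at zero (governed by a sparsity parameter pi) with a standard negative binomial; the parameter values below are illustrative.

```python
from scipy.stats import nbinom

def zinb_pmf(k, pi, n, p):
    """Zero-inflated negative binomial PMF.

    pi: probability of the structural-zero (sparsity) component
    n, p: shape and success-probability parameters of the NB distribution
    """
    base = nbinom.pmf(k, n, p)
    return pi * (k == 0) + (1.0 - pi) * base

# A sparse demand cell: 70% structural zeros plus a low-mean NB component.
for k in range(5):
    print(k, round(zinb_pmf(k, pi=0.7, n=2, p=0.6), 4))
```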
Large-scale vehicle routing problems (VRPs) extend the classical VRPs to thousands of customers. They are important optimization problems in modern logistics systems, since efficiently obtaining high-quality solutions can greatly reduce operating expenses and improve customer satisfaction. Most existing algorithms, including traditional non-learning heuristics and learning-based methods, only perform well on small-scale instances with usually no more than hundreds of customers. In this paper, we present a novel Rewriting-by-Generating (RBG) framework which solves large-scale VRPs hierarchically. RBG consists of a rewriter agent that refines the customer division globally and an elementary generator that infers regional solutions locally. It is also flexible across multiple CVRP variants and can be continuously evolved with more up-to-date generator designs. We conduct extensive experiments on both synthetic and real-world data to demonstrate the effectiveness and efficiency of our proposed RBG framework. It outperforms HGS, one of the best heuristic methods for CVRPs, and also shortens the inference time. Online evaluation is also conducted on a deployed express platform in Guangdong, China, where RBG shows advantages over the other built-in algorithms.
We study allocation of COVID-19 vaccines to individuals based on the structural properties of their underlying social contact network. Using a realistic representation of a social contact network for the Commonwealth of Virginia, we study how a limited number of vaccine doses can be strategically distributed to individuals to reduce the overall burden of the pandemic. We show that allocation of vaccines based on individuals' degree (number of social contacts) and total social proximity time is significantly more effective than the commonly used age-based allocation strategy in reducing the number of infections, hospitalizations, and deaths. The overall strategy is robust: (i) even if the social contacts are not estimated correctly; (ii) even if the vaccine efficacy is lower than expected or only a single dose is given; (iii) even if there is a delay in vaccine production and deployment; and (iv) regardless of whether non-pharmaceutical interventions continue as vaccines are deployed. For reasons of implementability, we use degree, a simple structural measure that can be easily estimated using several methods, including the digital technology available today. These results are significant, especially for resource-poor countries, where vaccines are less available, have lower efficacy, and are more slowly distributed.
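A minimal sketch of the degree-based prioritization discussed above, using networkx on a toy contact network; the synthetic graph and dose budget are stand-ins for the realistic Virginia network used in the paper.

```python
import networkx as nx

def allocate_by_degree(G, num_doses):
    """Return the vertices to vaccinate first, ranked by number of social contacts."""
    ranked = sorted(G.degree, key=lambda item: item[1], reverse=True)
    return [node for node, _deg in ranked[:num_doses]]

# Toy contact network standing in for a realistic synthetic population.
G = nx.barabasi_albert_graph(n=1000, m=3, seed=42)
print(allocate_by_degree(G, num_doses=10))
```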
In the fight against the COVID-19 pandemic, vaccines are the most critical resource but are still in short supply around the world. Therefore, efficient vaccine allocation strategies are urgently needed, especially in large metropolises where uneven health risk is manifested across nearby neighborhoods. However, there are several key challenges in solving this problem: (1) the great complexity of the large-scale scenario adds to the difficulty of experts' vaccine allocation decision making; (2) the heterogeneous information from all aspects of the metropolis' contact network makes information utilization difficult in decision making; and (3) when utilizing the strong decision-making ability of reinforcement learning (RL) to solve the problem, poor explainability limits the credibility of the RL strategies. In this paper, we propose a reinforcement learning enhanced experts method. We deal with the great complexity via a specially designed algorithm that aggregates blocks in the metropolis into communities, and we hierarchically integrate RL among the communities with expert solutions within each community. We design a self-supervised contact network representation algorithm to fuse the heterogeneous information for efficient vaccine allocation decision making. We conduct extensive experiments in three metropolises with real-world data and show that our method outperforms the best baseline, reducing infections by 9.01% and deaths by 12.27%. We further demonstrate the explainability of the RL model, adding to its credibility and also enlightening the experts in turn.
For those seeking healthcare advice online, AI-based dialogue agents capable of interacting with patients to perform automatic disease diagnosis are a viable option. This application necessitates efficient inquiry of relevant disease symptoms in order to make accurate diagnosis recommendations. It can be formulated as a problem of sequential feature (symptom) selection and classification, for which reinforcement learning (RL) approaches have been proposed as a natural solution. They perform well when the feature space is small, that is, when the number of symptoms and diagnosable disease categories is limited, but they frequently fail in assignments with a large number of features. To address this challenge, we propose a Multi-Model-Fused Actor-Critic (MMF-AC) RL framework that consists of a generative actor network and a diagnostic critic network. The actor incorporates a Variational AutoEncoder (VAE) to model the uncertainty induced by partial observations of features, thereby facilitating appropriate inquiries. In the critic network, a supervised diagnosis model for disease prediction is incorporated to precisely estimate the state-value function. Furthermore, inspired by the medical concept of differential diagnosis, we combine the generative and diagnosis models to create a novel reward shaping mechanism that addresses the sparse reward problem in large search spaces. We conduct extensive experiments on both synthetic and real-world datasets for empirical evaluation. The results demonstrate that our approach outperforms state-of-the-art methods in terms of diagnostic accuracy and interaction efficiency while also scaling more effectively to large search spaces. Besides, our method is adaptable to both categorical and continuous features, making it well suited for online applications.
Mobile health apps are revolutionizing the healthcare ecosystem by improving communication, efficiency, and quality of service. In low- and middle-income countries, they also play a unique role as a source of information about the health outcomes and behaviors of patients and healthcare workers, while providing a suitable channel to deliver both personalized and collective policy interventions. We propose a framework to study user engagement with mobile health, focusing on healthcare workers and digital health apps designed to support them in resource-poor settings. The behavioral logs produced by these apps can be transformed into daily time series characterizing each user's activity. We use probabilistic and survival analysis to build multiple personalized measures of meaningful engagement, which could serve to tailor content and digital interventions to each health worker's specific needs. Special attention is given to the problem of detecting churn, understood as a marker of complete disengagement. We discuss the application of our methods to the Indian and Ethiopian users of the Safe Delivery App, a capacity-building tool for skilled birth attendants. This work represents an important step towards a full characterization of user engagement in mobile health applications, which can significantly enhance the abilities of health workers and, ultimately, save lives.
Electronic health records (EHRs) provide rich clinical information and the opportunity to extract epidemiological patterns in order to understand and predict patient disease risks with suitable machine learning methods such as topic models. However, existing topic models do not generate identifiable topics, each predicting a unique phenotype. One promising direction is to use known phenotype concepts to guide topic inference. We present a seed-guided Bayesian topic model called MixEHR-Seed with three contributions: (1) for each phenotype, we infer a dual form of topic distribution: a seed-topic distribution over a small set of key EHR codes and a regular topic distribution over the entire EHR vocabulary; (2) we model age-dependent disease progression as Markovian dynamic topic priors; and (3) we infer seed-guided multi-modal topics over distinct EHR data types. For inference, we developed a variational inference algorithm. Using MixEHR-Seed, we inferred 1569 PheCode-guided phenotype topics from an EHR database in Quebec, Canada, covering 1.3 million patients with up to 20 years of follow-up and 122 million records over 8539 unique diagnostic codes and 1126 unique drug codes. We observed (1) accurate phenotype prediction by the guided topics, (2) clinically relevant PheCode-guided disease topics, and (3) meaningful age-dependent disease prevalence. Source code is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Seed.
Leveraging computational methods to generate small molecules with desired properties has been an active research area in the drug discovery field. Towards real-world applications, however, efficient generation of molecules that satisfy multiple property requirements simultaneously remains a key challenge. In this paper, we tackle this challenge using a search-based approach and propose a simple yet effective framework called MolSearch for multi-objective molecular generation (optimization). We show that, given proper design and sufficient domain information, search-based methods can achieve performance comparable to or even better than deep learning methods while being computationally efficient. Such efficiency enables massive exploration of chemical space under constrained computational resources. In particular, MolSearch starts with existing molecules and uses a two-stage search strategy to gradually modify them into new ones, based on transformation rules derived systematically and exhaustively from large compound libraries. We evaluate MolSearch in multiple benchmark generation settings and demonstrate its effectiveness and efficiency.
Global monitoring of novel diseases and outbreaks is crucial for pandemic prevention. To this end, movement data from cell-phones is already used to augment epidemiological models. Recent work has posed individual cell-phone metadata as a universal data source for syndromic surveillance for two key reasons: (1) these records are already collected for billing purposes in virtually every country and (2) they could allow deviations from people's routine behaviors during symptomatic illness to be detected, both in terms of mobility and social interactions. In this paper, we develop the necessary models to conduct population-level infectious disease surveillance by using cell-phone metadata individually linked with health outcomes. Specifically, we propose GraphDNA---a model that builds Graph neural networks (GNNs) into Dynamic Network Anomaly detection. Using cell-phone call records (CDR) linked with diagnostic information from Iceland during the H1N1v influenza outbreak, we show that GraphDNA outperforms state-of-the-art baselines on individual Date-of-Diagnosis (DoD) prediction, while tracking the epidemic signal in the overall population. Our results suggest that proper modeling of the universal CDR data could inform public health officials and bolster epidemic preparedness measures.
Brain networks characterize complex connectivities among brain regions as graph structures, which provide a powerful means to study brain connectomes. In recent years, graph neural networks have emerged as a prevalent paradigm for learning with structured data. However, most brain network datasets are limited in sample size due to the relatively high cost of data collection, which hinders deep learning models from sufficient training. Inspired by meta-learning, which learns new concepts quickly from limited training examples, this paper studies data-efficient training strategies for analyzing brain connectomes in a cross-dataset setting. Specifically, we propose to meta-train the model on datasets of large sample sizes and transfer the knowledge to small datasets. In addition, we also explore brain-network-oriented designs, including atlas transformation and adaptive task reweighing. Compared to other pre-training strategies, our meta-learning-based approach achieves higher and more stable performance, which demonstrates the effectiveness of our proposed solutions. The framework is also able to derive new insights regarding the similarities among datasets and diseases in a data-driven fashion.
Human daily activities, such as working, eating out, and traveling, play an essential role in contact tracing and modeling the diffusion patterns of the COVID-19 pandemic. However, individual-level activity data collected from real scenarios are highly limited due to privacy issues and commercial concerns. In this paper, we present a novel framework based on generative adversarial imitation learning, to generate artificial activity trajectories that retain both the fidelity and utility of the real-world data. To tackle the inherent randomness and sparsity of irregular-sampled activities, we innovatively capture the spatiotemporal dynamics underlying trajectories by leveraging neural differential equations. We incorporate the dynamics of continuous flow between consecutive activities and instantaneous updates at observed activity points in temporal evolution and spatial transformation. Extensive experiments on two real-world datasets show that our proposed framework achieves superior performance over state-of-the-art baselines in terms of improving the data fidelity and data utility in facilitating practical applications. Moreover, we apply the synthetic data to model the COVID-19 spreading, and it achieves better performance by reducing the simulation MAPE over the baseline by more than 50%. The source code is available online: https://github.com/tsinghua-fib-lab/Activity-Trajectory-Generation.
Medical dialogue generation is an important yet challenging task. Most previous works rely on the attention mechanism and large-scale pretrained language models. However, these methods often fail to acquire pivotal information from a long dialogue history to yield an accurate and informative response, because the medical entities are usually scattered across multiple utterances and linked by complex relationships. To mitigate this problem, we propose a medical response generation model with Pivotal Information Recalling (MedPIR), which is built on two components: a knowledge-aware dialogue graph encoder and a recall-enhanced generator. The knowledge-aware dialogue graph encoder constructs a dialogue graph by exploiting the knowledge relationships between entities in the utterances and encodes it with a graph attention network. Then, the recall-enhanced generator strengthens the use of this pivotal information by generating a summary of the dialogue before producing the actual response. Experimental results on two large-scale medical dialogue datasets show that MedPIR outperforms strong baselines in BLEU scores and medical entity F1 measure.
How can we build and optimize a recommender system that must rapidly fill slates (i.e., banners) of personalized recommendations? The combination of deep learning stacks with fast maximum inner product search (MIPS) algorithms has shown that it is possible to deploy flexible models in production that rapidly deliver personalized recommendations to users. Albeit promising, this methodology is unfortunately not sufficient to build a recommender system that maximizes the reward, e.g., the probability of click. Instead, a proxy loss is usually optimized and A/B testing is used to check whether the system actually improved performance. This tutorial takes participants through the steps necessary to model the reward and directly optimize the reward of recommendation engines built upon fast search algorithms, in order to produce high-performance reward-optimizing recommender systems.
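To make the MIPS building block concrete, here is a minimal brute-force top-k inner-product retrieval in NumPy; real systems would swap in an approximate MIPS index, and the embedding dimensions below are illustrative.

```python
import numpy as np

def top_k_mips(user_vec, item_matrix, k=5):
    """Return indices of the k items with the largest inner product with the user vector."""
    scores = item_matrix @ user_vec
    top = np.argpartition(-scores, k)[:k]          # unordered top-k candidates
    return top[np.argsort(-scores[top])]           # sort the top-k by score

rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 64))              # item embedding table
user = rng.normal(size=64)                         # user embedding from the deep model
print(top_k_mips(user, items, k=5))
```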
Non-IID data, i.e., data that is not independent and identically distributed, exhibits complex non-IIDness, e.g., couplings and interactions (non-independence) and heterogeneities (not being drawn identically from a given distribution). Non-IID learning emerges as a major challenge to both shallow and deep learning, including classic statistical learning, mathematical modeling, shallow machine learning, and deep neural learning. Here, we outline the problem, research map, main challenges, and topics of shallow and deep non-IID learning.
Recent research has shown that interpretable machine learning models can be just as accurate as blackbox learning methods on tabular datasets. In this tutorial we will walk you through leading open-source tools for glassbox learning, and show how intelligible machine learning helps practitioners uncover flaws in their datasets, discover new science, and build models that are more fair and robust. We'll begin with an introduction to the science behind glassbox modeling, and walk through a series of case studies that highlight the added value of interpretable methods in a variety of domains such as finance and healthcare without compromising accuracy. We'll also show how glassbox models can be used for state-of-the-art differentially private learning and bias detection/mitigation, and how these models can be edited to remove undesirable effects with GAMChanger. We'll also discuss how to train interpretable models with deep neural nets.
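As one example of the glassbox tooling covered in the tutorial, the sketch below trains an Explainable Boosting Machine, assuming the open-source interpret package (pip install interpret); the dataset is a stand-in.

```python
# Minimal glassbox sketch assuming the `interpret` package.
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ebm = ExplainableBoostingClassifier()   # additive model with learned per-feature shape functions
ebm.fit(X_train, y_train)
print("test accuracy:", ebm.score(X_test, y_test))

# Global explanation: per-feature shape functions and importances (opens an interactive view).
show(ebm.explain_global())
```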
Recent studies have revealed important properties that are unique to graph datasets, such as hierarchies and global structures. This has driven research into hyperbolic space due to its ability to effectively encode the inherent hierarchy present in graph datasets. However, a major bottleneck here is the obscurity of hyperbolic geometry and of its gyrovector operations. In this tutorial, we aim to introduce researchers and practitioners in the data mining community to the hyperbolic equivalents of the Euclidean operations that are necessary to apply hyperbolic geometry to neural networks. We describe the popular hyperbolic variants of GNN architectures and explain their implementation, in contrast to the Euclidean counterparts. We also motivate our tutorial through a critical analysis of existing applications in the areas of graph mining, knowledge graph reasoning, search, NLP, and computer vision.
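As a concrete taste of the gyrovector operations mentioned above, here is a minimal NumPy implementation of Möbius addition in the Poincaré ball model; the curvature value and input points are illustrative.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition of points x and y in the Poincare ball with curvature -c."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

x = np.array([0.1, 0.2])
y = np.array([-0.3, 0.05])
print(mobius_add(x, y))   # result stays inside the unit ball for valid inputs
```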
The increasing prevalence of multimodal data in our society has led to the increased need for machines to make sense of such data holistically. However, data scientists and machine learning engineers aspiring to work on such data face challenges fusing the knowledge from existing tutorials which often deal with each mode separately. Drawing on our experience in classifying multimodal municipal issue feedback in the Singapore government, we conduct a hands-on tutorial to help flatten the learning curve for practitioners who want to apply machine learning to multimodal data.
To model graph-structured data, graph learning, in particular deep graph learning with graph neural networks, has drawn much attention in both the academic and industrial communities lately. The effectiveness of prevailing graph learning methods usually relies on abundant labeled data for model training. However, graphs are commonly scarcely labeled, since data annotation and labeling on graphs is time- and resource-consuming. Therefore, it is imperative to investigate graph learning with minimal human supervision for low-resource settings where limited or even no labeled data is available. In this tutorial, we will focus on the state-of-the-art techniques of Graph Minimally-Supervised Learning, in particular a series of weakly-supervised learning, few-shot learning, and self-supervised learning methods on graph-structured data, as well as their real-world applications. The objectives of this tutorial are to: (1) formally categorize the problems in graph minimally-supervised learning and discuss the challenges under different learning scenarios; (2) comprehensively review the existing and recent advances of graph minimally-supervised learning; and (3) elucidate open questions and future research directions. This tutorial introduces major topics within minimally-supervised learning and offers a guide to a new frontier of graph learning.
Recommender systems are fundamental building blocks of modern consumer web applications that seek to predict user preferences to better serve relevant items. As such, high-quality user and item representations as inputs to recommender systems are crucial for personalized recommendation. To construct these user and item representations, self-supervised graph embedding has emerged as a principled approach to embed relational data such as user social graphs, user membership graphs, user-item engagements, and other heterogeneous graphs. In this tutorial we discuss different families of approaches to self-supervised graph embedding. Within each family, we outline a variety of techniques, their merits and disadvantages, and expound on latest works. Finally, we demonstrate how to effectively utilize the resultant large embedding tables to improve candidate retrieval and ranking in modern industry-scale deep-learning recommender systems.
Automated machine learning (AutoML) offers the promise of translating raw data into accurate predictions without the need for significant human effort, expertise, and manual experimentation. In this lecture-style tutorial, we demonstrate fundamental techniques that power up multimodal AutoML. Different from most AutoML systems, which focus on solving tabular tasks containing categorical and numerical features, we consider supervised learning tasks on various types of data including tabular features, text, and images, as well as their combinations. Rather than giving technical descriptions of how individual ML models work, we emphasize how to best use models within an overall ML pipeline that takes in raw training data and outputs predictions for test data. A major focus of our tutorial is on automatically building and training deep learning models, which are powerful yet cumbersome to manage manually; hardly any educational material describes their successful automation. Each topic covered in the tutorial is accompanied by a hands-on Jupyter notebook that implements best practices (available on GitHub before and after the tutorial). Most of the code is adopted from AutoGluon (https://auto.gluon.ai/), a recent open-source AutoML toolkit that is both state-of-the-art and easy to use.
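A minimal sketch of the AutoGluon tabular workflow referenced above; the CSV paths and the label column name are placeholders.

```python
# Minimal AutoGluon tabular example (pip install autogluon);
# file paths and the label column name below are placeholders.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")                    # any table with a label column
predictor = TabularPredictor(label="target").fit(train_data, time_limit=600)

test_data = TabularDataset("test.csv")
predictions = predictor.predict(test_data)
print(predictor.leaderboard(test_data))                     # compare the trained models
```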
Machine learning on graph data has become a common area of interest across academia and industry. However, due to the size of real-world industry graphs (hundreds of millions of vertices and billions of edges) and the special architecture of graph neural networks, it is still a challenge for practitioners and researchers to perform machine learning tasks on large-scale graph data. It typically takes a powerful and expensive GPU machine to train a graph neural network on a million-vertex-scale graph, let alone doing deep learning on real enterprise graphs. In this tutorial, we will cover how to develop and run performant graph algorithms and graph neural network models with TigerGraph [3], a massively parallel platform for graph analytics, and its Machine Learning Workbench with PyTorch Geometric [4] and DGL [8] support. Using an NFT transaction dataset [6], we will first investigate transactions using graph algorithms by themselves as methods of graph traversal, clustering, classification, and determining similarities between data. Secondly, we will show how to use graph-derived features such as PageRank and embeddings to empower traditional machine learning models. Finally, we will demonstrate how to train common graph neural networks with TigerGraph and how to implement novel graph neural network models. Participants will use the TigerGraph ML Workbench Cloud to perform graph feature engineering and train their machine learning algorithms during the session.
Misinformation is a pressing issue in modern society. It arouses a mixture of anger, distrust, confusion, and anxiety that damages our daily-life judgments and public policy decisions. While recent studies have explored various fake news detection and media bias detection techniques in attempts to tackle the problem, many ongoing challenges remain to be addressed, as can be witnessed from the plethora of untrue and harmful content present during the COVID-19 pandemic, which gave rise to the first social-media infodemic, and during the international crises of late. In this tutorial, we provide researchers and practitioners with a systematic overview of the frontier in fighting misinformation. Specifically, we dive into the important research questions of how to (i) develop a robust fake news detection system that not only fact-checks information pieces provable by background knowledge, but also reasons about the consistency and reliability of subtle details of emerging events; (ii) uncover the bias and the agenda of news sources to better characterize misinformation; and (iii) correct false information and mitigate news biases, while allowing diverse opinions to be expressed. Participants will learn about recent trends, representative deep neural network language and multimedia models, ready-to-use resources, remaining challenges, future research directions, and exciting opportunities to help make the world a better place, with safer and more harmonious information sharing.
In this digital age, people spend a significant portion of their lives online, and this has led to an explosion of personal user data generated by their activities. Typically, this data is private and nobody else, except the user, is allowed to look at it. To provide a better experience and assist users in their activities, it is critical to mine certain information from this data. This poses an interesting and complex challenge from a scalable information extraction point of view: building information extraction models where there is little data to learn from due to privacy constraints, while needing highly accurate models to run on a large amount of diverse data across different users. Anonymization of data is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees to be safe from potential negative consequences. Further, the anonymization process needs to ensure that it retains sufficient information for modeling purposes post anonymization. Other techniques involve building extraction models using a small amount of seen (eyes-on) data with no privacy restrictions (and which can hence be labeled) and a large amount of unseen (eyes-off) data which only a machine or a program can access. In this tutorial, we use emails as the canonical example of private data to explain in detail the challenges and solutions for scalable information extraction (IE) under privacy-aware constraints.
Lale is a sklearn-compatible library for automated machine learning (AutoML). It is open-source (https://github.com/ibm/lale) and addresses the need for gradual automation of machine learning as opposed to offering a black-box AutoML tool. Black-box AutoML tools are difficult to customize and thus restrict data scientists in leveraging their knowledge and intuition in the automation process. Lale is built on three principles: progressive disclosure, orthogonality, and least surprise. These enable a gradual approach offering a spectrum of usage patterns starting from total automation to controlling almost every aspect of AutoML. Lale provides compositional constructs that let data scientists control some aspects of their pipelines while leaving other aspects free to be searched automatically. This tutorial demonstrates the use of Lale for various machine-learning tasks, showing how to progressively exercise more customization. It also covers AutoML for advanced scenarios such as class imbalance correction, bias detection and mitigation, multi-objective optimization, and working with multi-table datasets. While Lale comes with hyperparameter specifications for 216 operators out-of-the-box, users can also add more operators of their own, and this tutorial covers how to do that. Overall, this tutorial teaches you how you can exercise fine-grained control over AutoML without having to be an AutoML expert.
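The sketch below illustrates the gradual-automation style described above, following Lale's documented combinator syntax (`>>` pipes operators, `|` leaves a choice to be searched); the particular operators, dataset, and search budget are illustrative.

```python
# A sketch of Lale's combinator style: `>>` composes a pipeline, `|` leaves
# the operator choice free to be searched automatically.
from lale.lib.sklearn import PCA, Nystroem, LogisticRegression, KNeighborsClassifier
from lale.lib.lale import Hyperopt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data scientist fixes the pipeline shape but leaves operator choices and
# hyperparameters to the optimizer.
planned = (PCA | Nystroem) >> (LogisticRegression | KNeighborsClassifier)
trained = planned.auto_configure(X_train, y_train, optimizer=Hyperopt, cv=3, max_evals=10)

print("accuracy:", (trained.predict(X_test) == y_test).mean())
```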
This tutorial is based on the recently released open-source library Dive into Graphs (DIG), along with hands-on code examples. DIG is a turnkey library that covers four frontiers in graph deep learning: self-supervised learning of GNNs, 3D GNNs, explainability of GNNs, and graph generation. It provides data interfaces, common algorithms, and evaluation metrics for each direction. It has gathered 255,000+ visitors, 11,000+ installations, and 1,100+ stars within a year and is becoming a robust and dominant ecosystem for graph neural network research. In this tutorial, we will review representative methodologies for these four directions and show hands-on code examples to demonstrate how to effortlessly implement benchmarks using DIG. This tutorial targets a broad audience working on or interested in various research themes. To encourage audience participation, we will promote our tutorial in advance on social media, in reading groups, and in the library's contributor community. We anticipate this tutorial will attract more researchers to these interesting and promising topics, leading to a more active community and eventually generating both scientific value and real-world impact.
Graphs are a ubiquitous type of data that appears in many real-world applications, including social network analysis, recommendation, and financial security. Important as they are, decades of research have developed plentiful computational models to mine graphs. Despite this prosperity, concerns with respect to potential algorithmic discrimination have grown recently. Algorithmic fairness on graphs, which aims to mitigate bias introduced or amplified during the graph mining process, is an attractive yet challenging research topic. The first challenge is theoretical: the non-IID nature of graph data may not only invalidate the basic assumption behind many existing studies in fair machine learning, but also call for new fairness definitions based on the inter-correlation between nodes rather than the existing definitions in fair machine learning. The second challenge is algorithmic: understanding how to balance the trade-off between model accuracy and fairness. This tutorial aims to (1) comprehensively review the state-of-the-art techniques for enforcing algorithmic fairness on graphs and (2) highlight the open challenges and future directions. We believe this tutorial could benefit researchers and practitioners from the areas of data mining, artificial intelligence, and social science.
Artificial Intelligence (AI) is increasingly playing an integral role in determining our day-to-day experiences. The applications of AI are no longer limited to search and recommendation systems, such as web search and movie and product recommendations; AI is also being used in decisions and processes that are critical for individuals, businesses, and society. With AI-based solutions in high-stakes domains such as hiring, lending, criminal justice, healthcare, and education, the resulting personal and professional implications of AI are far-reaching. Consequently, it becomes critical to ensure that these models are making accurate predictions, are robust to shifts in the data, are not relying on spurious features, and are not unduly discriminating against minority groups. To this end, several approaches spanning areas such as explainability, fairness, and robustness have been proposed in the recent literature, and many papers and tutorials on these topics have been presented at recent computer science conferences. However, there is relatively little attention on the need for monitoring machine learning (ML) models once they are deployed and on the associated research challenges.
In this tutorial, we first motivate the need for ML model monitoring [14], as part of a broader AI model governance [9] and responsible AI framework, from societal, legal, customer/end-user, and model developer perspectives, and provide a roadmap for thinking about model monitoring in practice. We then present findings and insights on model monitoring desiderata based on interviews with various ML practitioners spanning domains such as financial services, healthcare, hiring, online retail, computational advertising, and conversational assistants [15]. We then describe the technical considerations and challenges associated with realizing the above desiderata in practice, and provide an overview of techniques and tools for model monitoring (e.g., see [1, 2, 5, 6, 8, 10-13, 18-21]). Next, we focus on the real-world application of model monitoring methods and tools [3, 4, 7, 11, 13, 16, 17], present practical challenges and guidelines for using such techniques effectively, and share lessons learned from deploying model monitoring tools for several web-scale AI/ML applications. We present case studies across different companies, spanning application domains such as financial services, healthcare, hiring, conversational assistants, online retail, computational advertising, search and recommendation systems, and fraud detection. We hope that our tutorial will inform both researchers and practitioners, stimulate further research on model monitoring, and pave the way for building more reliable ML models and monitoring tools in the future.
As Internet users attach importance to their own privacy and a number of laws and regulations go into effect in most countries, Internet products need to provide users with privacy protection. As one of the feasible solutions for providing such privacy protection, federated learning has rapidly gained popularity in both academia and industry in recent years. In this tutorial, we will start off with some real-world tasks to illustrate the topic of federated learning and cover some basic concepts and important scenarios, including cross-device and cross-silo settings. Along with this, we will give several demonstrations with popular federated learning frameworks. We will also show how to perform automatic hyperparameter tuning with federated learning, which can significantly save practitioners' effort in practice. Then we dive into three parallel hot topics: Personalized Federated Learning, Federated Graph Learning, and Attacks in Federated Learning. For each of them, we will motivate it with real-world applications, illustrate the state-of-the-art methods, and discuss their pros and cons using concrete examples. As the last part, we will point out some future research directions.
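To ground the cross-device setting mentioned above, here is a minimal NumPy sketch of the FedAvg aggregation step, in which the server averages client updates weighted by local sample counts; the client parameters below are toy placeholders, and real frameworks wrap this step with communication and privacy machinery.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Federated averaging: weight each client's parameters by its local sample count."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Toy round: three clients hold locally trained parameter vectors of a linear model.
client_weights = [np.array([0.9, 1.1]), np.array([1.2, 0.8]), np.array([1.0, 1.0])]
client_sizes = [100, 300, 600]
print("aggregated global model:", fedavg(client_weights, client_sizes))
```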
Network traffic data is key in addressing several important cybersecurity problems, such as intrusion and malware detection, and network management problems, such as application and device identification. However, it poses several challenges to building machine learning models. Two main challenges are manual feature engineering and scarcity of training data due to privacy and security concerns. In this tutorial we provide a comprehensive review of recent advances to address these challenges through use of deep learning. Network traffic data can be cast as a multivariate time-series (sequential) data, attributed graph data, or image data to leverage representation learning architectures available in deep learning. To preserve data privacy, generative methods, such as GANs and autoregressive neural architectures can be used to synthesize realistic network traffic data. In particular, our tutorial is organized into three parts: 1) we describe network traffic data, applications to security and network management, and challenges; 2) we present different deep learning architectures used for representation learning instead of feature engineering of network traffic data; and, 3) we describe use of generative neural models for synthetic generation of network traffic data.
Pretrained text representations, evolving from context-free word embeddings to contextualized language models, have brought text mining into a new era: By pretraining neural models on large-scale text corpora and then adapting them to task-specific data, generic linguistic features and knowledge can be effectively transferred to the target applications and remarkable performance has been achieved on many text mining tasks. Unfortunately, a formidable challenge exists in such a prominent pretrain-finetune paradigm: Large pretrained language models (PLMs) usually require a massive amount of training data for stable fine-tuning on downstream tasks, while human annotations in abundance can be costly to acquire.
In this tutorial, we introduce recent advances in pretrained text representations, as well as their applications to a wide range of text mining tasks. We focus on minimally-supervised approaches that do not require massive human annotations, including (1) self-supervised text embeddings and pretrained language models that serve as the fundamentals for downstream tasks, (2) unsupervised and distantly-supervised methods for fundamental text mining applications, (3) unsupervised and seed-guided methods for topic discovery from massive text corpora and (4) weakly-supervised methods for text classification and advanced text mining tasks.
Online clustering algorithms play a critical role in data science, especially with their advantages in time, memory usage, and complexity, while maintaining high performance compared to traditional clustering methods. This tutorial serves, first, as a survey on online machine learning and, in particular, data stream clustering methods. During this tutorial, state-of-the-art algorithms and the associated core research threads will be presented, identifying different categories based on distance, density grids, and hidden statistical models. Clustering validity indices, an important part of the clustering process that is usually neglected or replaced with classification metrics, resulting in misleading interpretations of final results, will also be investigated in depth.
This introduction will then be put into context with River, a go-to Python library resulting from the merger of Creme and scikit-multiflow. It is also the first open-source project to include an online clustering module, which can facilitate reproducibility and allow direct further improvements. From this, we propose methods for clustering configuration, applications, and settings for benchmarking, using real-world problems and datasets.
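A minimal sketch of online clustering with River's incremental KMeans, assuming River's learn_one/predict_one streaming API; the toy two-dimensional stream below is illustrative.

```python
# Sketch of online clustering with River (pip install river); assumes the
# cluster.KMeans estimator and its learn_one/predict_one streaming API.
from river import cluster

stream = [
    {"x": 1.0, "y": 1.2}, {"x": 1.1, "y": 0.9}, {"x": 0.9, "y": 1.0},
    {"x": 5.0, "y": 5.2}, {"x": 5.1, "y": 4.9}, {"x": 4.8, "y": 5.0},
]

model = cluster.KMeans(n_clusters=2, seed=42)
for x in stream:
    model.learn_one(x)                       # update cluster centers one sample at a time
    print(x, "->", model.predict_one(x))     # assign the sample to its nearest center
```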
Machine learning techniques for developing industry-scale search engines have long been a prominent part of most domains and their online products. Search relevance algorithms are key components of products across different fields, including e-commerce, streaming services, and social networks. In this tutorial, we give an introduction to such large-scale search ranking systems, specifically focusing on deep learning techniques in this area. The topics we cover are the following: (1) an overview of search ranking systems in practice, including classical and machine learning techniques; (2) an introduction to sequential and language models in the context of search ranking; and (3) knowledge distillation approaches for this area. For each of the aforementioned sessions, we first give an introductory talk and then go over a hands-on tutorial to really home in on the concepts. We cover fundamental concepts using demos, case studies, and hands-on examples, including the latest deep learning methods that have achieved state-of-the-art results in generating the most relevant search results. Moreover, we show example implementations of these methods in Python, leveraging a variety of open-source machine-learning/deep-learning libraries as well as real industrial data or open-source data.
Time series anomaly detection is an interesting practical problem that mostly falls into the unsupervised learning segment. There has been a continuous stream of work published in top-tier data mining and machine learning conferences. We have developed many anomaly detection algorithms, procedures, and applications while working in real industrial application settings. This tutorial presents the design and implementation of a scikit-compatible system for detecting anomalies in time series data, with the purpose of offering a broad range of algorithms to the end user and a special focus on unsupervised/semi-supervised learning.
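As an illustration of the scikit-compatible style mentioned above (not the authors' system), here is a minimal unsupervised z-score detector that follows sklearn's fit/predict conventions, returning -1 for anomalies; the synthetic series is illustrative.

```python
import numpy as np
from sklearn.base import BaseEstimator

class ZScoreDetector(BaseEstimator):
    """Toy unsupervised detector: flag points far from the training mean."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float).ravel()
        self.mean_ = X.mean()
        self.std_ = X.std() + 1e-12
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float).ravel()
        z = np.abs(X - self.mean_) / self.std_
        return np.where(z > self.threshold, -1, 1)   # -1 = anomaly (sklearn convention)

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5], rng.normal(0, 1, 100)])
det = ZScoreDetector(threshold=4.0).fit(series[:500])
print("anomalous indices:", np.where(det.predict(series) == -1)[0])
```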
It is widely accepted that data preparation is one of the most time-consuming steps of the machine learning (ML) lifecycle. It is also one of the most important steps, as the quality of data directly influences the quality of a model. In this tutorial, we will discuss the importance and the role of exploratory data analysis (EDA) and data visualisation techniques for finding data quality issues and for data preparation, relevant to building ML pipelines. We will also discuss the latest advances in these fields and point out areas that need innovation. To make the tutorial actionable for practitioners, we will also discuss the most popular open-source packages that one can get started with, along with their strengths and weaknesses. Finally, we will discuss the challenges posed by industry workloads and the gaps to be addressed to make data-centric AI real in industry settings.
Recommender Systems (RecSys) are the engine of the modern internet and the catalyst for human decisions. The goal of a recommender system is to generate relevant recommendations for users from a collection of items or services that might interest them. Building a recommendation system is challenging because it requires multiple stages (item retrieval, filtering, ranking, ordering) to work together seamlessly and efficiently during training and inference. The biggest challenges faced by new practitioners are the lack of understanding around what RecSys look like in the real world and the difficulty in transitioning from the simple Matrix Factorization (MF) to more complex deep learning architectures with multiple input features, neural components and prediction heads.
To address these challenges on building recommender systems, NVIDIA developed an open source framework, called Merlin. Merlin consists of a set of libraries and tools to help RecSys practitioners build models and pipelines easily and more efficiently. Merlin Models provides modularized building blocks that can be easily connected to build classic and state-of-the-art models. It offers flexibility at each stage: multiple input processing/representation modules, different layers for designing the model's architecture, prediction heads, loss functions, negative sampling techniques, among others.
In this hands-on tutorial, participants will start with data preparation using NVTabular, an open-source feature engineering and preprocessing library designed to quickly and easily manipulate large-scale datasets. Participants will then work on modeling with the Merlin Models library, building fundamental recommendation models such as MF and then transitioning to more complex deep learning-based models for candidate retrieval. In each iteration, we will demonstrate the seamless integration between data preparation and model training. Over the span of this tutorial, participants will learn the fundamentals of recommender systems modeling and how to easily build a two-stage recommender system using open-source Merlin libraries.
The most intuitive way to model a transaction in the financial world is through a graph. Every transaction can be considered an edge between two vertices, one of which is the paying party and the other the receiving party. Properties of these nodes and edges map directly to business problems in the financial world. The problem of detecting a fraudulent transaction can be considered a property of an edge. The problem of money laundering can be considered path detection in the graph. The problem of a merchant going delinquent can be considered a property of a node. While there are many such examples, the above help in realising the direct mapping of graph properties to real-world financial problems. This tutorial is based on the potential of using Graph Neural Network based learning to solve business problems in the financial world.
Graph Neural Networks (GNNs) have gained the interest of industry, with Relational Graph Convolutional Networks (R-GCNs) showing promise for fraud detection. Taking existing workflows that leverage graph features to train a gradient boosted decision tree (GBDT) and replacing the graph features with GNN-produced embeddings achieves an increase in accuracy. However, recent work has shown that the combination of graph attributes with GNN embeddings provides the biggest lift in accuracy.
Whether to use a GNN is only half of the picture. Data loading, data cleaning and preparation (ETL), and graph processing are critical first steps before graph feature computation or GNN training can be performed. Moreover, the entire process is interactive, optimizing training and validation for shorter model delivery cycles. Quicker model updates are the key to staying ahead of evolving fraud techniques. McDonald and Deotte [1] published a blog post on the importance of being able to iterate quickly when finding a solution.
The RAPIDS [2] suite of open-source software libraries gives the data scientist the freedom to execute end-to-end analytics workflows on GPUs. The ETL and data loading portion is handled by RAPIDS cuDF, which offers a familiar DataFrame API. The GBDT process is handled by RAPIDS cuML, which provides implementations of XGBoost and random forests. The graph analytics portion is handled by RAPIDS cuGraph. Recently, cuGraph announced integration with the Deep Graph Library (DGL) [3]. For GNN training, graph sampling can consume up to 80% of the training time; RAPIDS cuGraph sampling algorithms execute 10x to 100x faster than comparable CPU versions and scale to support massive graphs. Join us as we dive into GNNs for fraud detection and demonstrate how RAPIDS + DGL drastically reduces training time. We will cover everything from accelerating data loading and data preparation to accelerated GNN training with cuGraph + DGL.
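To make the feature-combination idea from this section concrete, the sketch below concatenates stand-in graph attributes with stand-in GNN node embeddings and trains an XGBoost classifier; the random features and toy label are placeholders for the outputs of a real ETL/GNN pipeline.

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
graph_feats = rng.normal(size=(n, 8))        # stand-in for degree, PageRank, etc.
gnn_embeds = rng.normal(size=(n, 32))        # stand-in for R-GCN node embeddings
y = (graph_feats[:, 0] + gnn_embeds[:, 0] > 0).astype(int)   # toy fraud label

X = np.hstack([graph_feats, gnn_embeds])     # combine both feature families
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=4)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```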
The recent COVID-19 pandemic has reinforced the importance of epidemic forecasting to equip decision makers in multiple domains, ranging from public health to economics. However, forecasting the epidemic progression remains a non-trivial task as the spread of diseases is subject to multiple confounding factors spanning human behavior, pathogen dynamics and environmental conditions, etc. Research interest has been fueled by the increased availability of rich data sources capturing previously unseen facets of the epidemic spread and initiatives from government public health and funding agencies like forecasting challenges and funding calls. This has resulted in recent works covering many aspects of epidemic forecasting. Data-centered solutions have specifically shown potential by leveraging non-traditional data sources as well as recent innovations in AI and machine learning. This tutorial will explore various data-driven methodological and practical advancements. First, we will enumerate epidemiological datasets and novel data streams capturing various factors like symptomatic online surveys, retail and commerce, mobility and genomics data. Next, we discuss methods and modeling paradigms with a focus on the recent data-driven statistical and deep-learning based methods as well as novel class of hybrid models that combine domain knowledge of mechanistic models with the effectiveness and flexibility of statistical approaches. We also discuss experiences and challenges that arise in real-world deployment of these forecasting systems including decision-making informed by forecasts. Finally, we highlight some open problems found across the forecasting pipeline.
Counterfactual estimators enable the use of existing log data to estimate how some new target policy would have performed, if it had been used instead of the policy that logged the data. We say that those estimators work "off-policy", since the policy that logged the data is different from the target policy. In this way, counterfactual estimators enable Off-policy Evaluation (OPE) akin to an unbiased offline A/B test, as well as learning new decision-making policies through Off-policy Learning (OPL). The goal of this tutorial is to summarize Foundations, Implementations, and Recent Advances of OPE and OPL (OPE/OPL), with applications in recommendation, search, and an ever growing range of interactive systems. Specifically, we will introduce the fundamentals of OPE/OPL and provide theoretical and empirical comparisons of conventional methods. Then, we will cover emerging practical challenges such as how to handle large action spaces, distributional shift, and hyper-parameter tuning. We will then present Open Bandit Pipeline, an open-source Python software for OPE/OPL to better enable new research and applications. We will conclude the tutorial with future directions.
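To make the core OPE idea concrete, here is a minimal NumPy sketch of the inverse propensity scoring (IPS) estimator of a target policy's value from logged bandit data; the synthetic log below is illustrative and stands in for what a library such as Open Bandit Pipeline provides.

```python
import numpy as np

def ips_value(rewards, logging_probs, target_probs):
    """Inverse propensity scoring: reweight logged rewards by pi_target / pi_logging."""
    weights = target_probs / logging_probs
    return np.mean(weights * rewards)

rng = np.random.default_rng(0)
n = 10_000
actions = rng.integers(0, 3, size=n)                       # actions chosen by the logging policy
logging_probs = np.full(n, 1 / 3)                          # uniform logging policy
target_probs = np.where(actions == 0, 0.6, 0.2)            # target policy's prob. of the logged action
rewards = rng.binomial(1, 0.1 + 0.1 * (actions == 0))      # action 0 has a higher click rate

print("estimated target policy value:", ips_value(rewards, logging_probs, target_probs))
```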
Deep Reinforcement Learning uses the best of both Reinforcement Learning and Deep Learning to solve problems that cannot be addressed by either of them individually. Deep Reinforcement Learning has been used widely for games, robotics, etc., but limited work has been done on applying it to Conversational AI. Hence, in this tutorial we cover applications of Deep Reinforcement Learning for Conversational AI.
We give a conceptual introduction to Reinforcement Learning and Deep Reinforcement Learning. We then present various real-life approaches of increasing complexity in detail. The approaches include dialog generation, task-oriented dialog generation, modelling chitchat, natural language generation, as well as hierarchical, weakly supervised, multi-domain, and decision-transformer approaches.
We then walk through code implementing the core ideas and some of the real-life approaches.
In this tutorial, we will provide an in-depth, hands-on tutorial on automated machine learning and tuning with the fast Python library FLAML. We will start with an overview of the AutoML problem and the FLAML library. In the first half of the tutorial, we will give a hands-on walkthrough of how to use FLAML to automate typical machine learning tasks in an end-to-end manner with different customization options, and how to perform general tuning tasks on user-defined functions. In the second half of the tutorial, we will introduce several advanced functionalities of the library, for example zero-shot AutoML, fair AutoML, and online AutoML. We will close the tutorial with several open problems and challenges learned from AutoML practice.
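A minimal end-to-end FLAML example in the spirit of the first half of the tutorial; the dataset and time budget are illustrative.

```python
# Minimal FLAML example (pip install flaml); dataset and budget are illustrative.
from flaml import AutoML
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, task="classification", time_budget=60)

print("best model:", automl.best_estimator)
print("test accuracy:", (automl.predict(X_test) == y_test).mean())
```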
Although deep neural networks (DNNs) have been successfully deployed in various real-world application scenarios, recent studies have demonstrated that DNNs are extremely vulnerable to adversarial attacks. By introducing visually imperceptible perturbations into benign inputs, an attacker can manipulate a DNN model into providing wrong predictions. For practitioners applying DNNs to real-world problems, understanding the characteristics of different kinds of attacks will not only help them improve the robustness of their models, but also give them deeper insights into the working mechanisms of DNNs. In this tutorial, we provide a comprehensive overview of recent advances in adversarial learning, covering both attack methods and defense methods. Specifically, we first give a detailed introduction to various types of evasion attacks, followed by a series of representative defense methods against evasion attacks. We then discuss different poisoning attack methods, followed by several defense methods against poisoning attacks. In addition to attack methods working in the digital setting, we also introduce attack methods designed to threaten physical-world systems. Finally, we present DeepRobust, a PyTorch adversarial learning library which aims to build a comprehensive and easy-to-use platform to foster this research field. Through our tutorial, the audience can grasp the main ideas of adversarial attacks and defenses and gain a deep insight into the robustness of DNNs. The tutorial's official website is available at https://sites.google.com/view/kdd22-tutorial-adv-learn/.
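To make the notion of an evasion attack concrete, the sketch below implements the classic Fast Gradient Sign Method (FGSM) in PyTorch on a toy model. It is only an illustration of the idea; libraries such as DeepRobust provide full, tested attack and defense implementations, and the model and data here are placeholders.

```python
# Minimal sketch of the Fast Gradient Sign Method (FGSM), a classic evasion attack:
# perturb the input in the direction that increases the loss, bounded by epsilon.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

# toy usage with a small classifier on random "images"
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())  # perturbation magnitude is at most epsilon
```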
Exploring the vast amount of rapidly growing scientific text data is highly beneficial for real-world scientific discovery. However, scientific text mining is particularly challenging due to the lack of specialized domain knowledge in natural language context, complex sentence structures in scientific writing, and multi-modal representations of scientific knowledge. This tutorial presents a comprehensive overview of recent research and development on scientific text mining, focusing on the biomedical and chemistry domains. First, we introduce the motivation and unique challenges of scientific text mining. Then we discuss a set of methods that perform effective scientific information extraction, such as named entity recognition, relation extraction, and event extraction. We also introduce real-world applications such as textual evidence retrieval, scientific topic contrasting for drug discovery, and molecule representation learning for reaction prediction. Finally, we conclude our tutorial by demonstrating, on real-world datasets (COVID-19 and organic chemistry literature), how the information can be extracted and retrieved, and how it can assist further scientific discovery. We also discuss the emerging research problems and future directions for scientific text mining.
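As a small illustration of the information-extraction step, the sketch below runs named entity recognition with the Hugging Face pipeline API. The checkpoint shown is a general-domain NER model used only for demonstration; a biomedical or chemistry-specific model would be substituted in practice, and this is not the tutorial's own code.

```python
# Illustrative sketch: named entity recognition over text with the transformers pipeline.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/bert-base-NER",     # general-domain demo model; swap in a
                                                # biomedical/chemistry NER checkpoint
               aggregation_strategy="simple")   # merge sub-word tokens into entity spans

sentence = "Remdesivir was evaluated in patients hospitalized with COVID-19."
for entity in ner(sentence):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```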
Graphs (or networks) are a ubiquitous representation in life sciences and medicine, from molecular interaction maps and signal transduction pathways to graphs of scientific knowledge and patient-disease-intervention relationships derived from population studies and/or real-world data, such as electronic health records and insurance claims. Recent advances in graph machine learning (ML) approaches such as graph neural networks (GNNs) have transformed a diverse set of problems relying on biomedical networks that traditionally depended on descriptive topological data analyses. Small molecules and macromolecules that were not previously modeled as graphs have also seen a bloom in GNN-based algorithms improving the state-of-the-art performance for learning their properties. Compared to graph ML applications from other domains, the life sciences offer many unique problems and nuances, ranging from graph construction to graph-level and bi-graph-level supervision tasks.
The objective of this tutorial is twofold. First, it will provide a comprehensive overview of the types of biomedical graphs/networks, the underlying biological and medical problems, and the applications of graph ML algorithms for solving those problems. Second, it will showcase four concrete GNN solutions in the life sciences with hands-on experience for the attendees. These hands-on sessions will cover: 1) training and fine-tuning GNN models for small-molecule property prediction on atomic graphs, 2) macro-molecule property and function prediction on residue graphs, 3) bi-graph-based binding affinity prediction for protein-ligand pairs, and 4) organizing and generating new knowledge for drug discovery and repurposing with knowledge graphs. This tutorial will also instruct the attendees in developing with two extensions of the Deep Graph Library (DGL), namely DGL-LifeSci and DGL-KE, so that they can jumpstart their own graph ML journey to advance life science research and development.
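To give a flavor of the graph-construction step behind the first hands-on session, the sketch below turns a small molecule (given as a SMILES string) into an atomic graph. It assumes DGL-LifeSci's smiles_to_bigraph and CanonicalAtomFeaturizer utilities as we recall them; the exact signatures and feature field names should be checked against the DGL-LifeSci documentation.

```python
# Rough sketch: building an atomic graph from a SMILES string with DGL-LifeSci
# (entry points assumed from the library; verify against its documentation).
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer

featurizer = CanonicalAtomFeaturizer()                     # standard per-atom features
g = smiles_to_bigraph("CCO", node_featurizer=featurizer)   # ethanol as a DGLGraph

print(g.num_nodes(), g.num_edges())   # atoms and (bidirectional) bonds
print(g.ndata["h"].shape)             # node feature matrix (default field name 'h')
```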
Time series analysis is ubiquitous and important in various areas, such as Artificial Intelligence for IT Operations (AIOps) in cloud computing, AI-powered Business Intelligence (BI) in e-commerce, and the Artificial Intelligence of Things (AIoT). In real-world scenarios, time series data often exhibit complex patterns with trend, seasonality, outliers, and noise. In addition, as more time series data are collected and stored, handling this huge amount of data efficiently is crucial in many applications. These significant challenges exist across tasks such as forecasting, anomaly detection, and fault cause localization. Therefore, designing effective and efficient time series models for different tasks that are robust to the aforementioned challenging patterns and noise in real-world scenarios is of great theoretical and practical interest. In this tutorial, we provide a comprehensive and organized overview of the state-of-the-art algorithms for robust time series analysis, ranging from traditional statistical methods to the most recent deep learning based methods. We not only introduce the principles of time series algorithms, but also provide insights into how to apply them effectively in practical real-world industrial applications. Specifically, we organize the tutorial in a bottom-up framework. We first present preliminaries from different disciplines, including robust statistics, signal processing, optimization, and deep learning. Then, we identify and discuss the most frequently used processing blocks in robust time series analysis, including periodicity detection, trend filtering, seasonal-trend decomposition, and time series similarity. Lastly, we discuss recent advances in multiple time series tasks, including forecasting, anomaly detection, fault cause localization, and autoscaling, as well as practical lessons from large-scale time series applications from an industrial perspective.
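As an illustration of one of the processing blocks mentioned above, the sketch below performs a very simple robust seasonal-trend decomposition: a running median for the trend and per-period medians for seasonality, so that outliers concentrate in the residual. It is a minimal NumPy illustration under these assumptions, not one of the tutorial's methods.

```python
# Minimal sketch of a robust seasonal-trend decomposition using medians.
import numpy as np

def robust_decompose(y, period, trend_window=25):
    half = trend_window // 2
    padded = np.pad(y, half, mode="edge")
    trend = np.array([np.median(padded[i:i + trend_window]) for i in range(len(y))])
    detrended = y - trend
    seasonal = np.array([np.median(detrended[phase::period]) for phase in range(period)])
    seasonal = np.tile(seasonal, len(y) // period + 1)[:len(y)]
    residual = y - trend - seasonal
    return trend, seasonal, residual

t = np.arange(365)
y = 0.01 * t + np.sin(2 * np.pi * t / 7) + np.random.normal(0, 0.1, size=365)
y[100] += 8  # inject an outlier
trend, seasonal, residual = robust_decompose(y, period=7)
print(int(np.argmax(np.abs(residual))))  # the injected outlier dominates the residual
```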
Deep graph learning (DGL) has achieved remarkable progress in both business and scientific areas, ranging from finance and e-commerce to drug and advanced material discovery. Despite this progress, ensuring that DGL algorithms behave in a socially responsible manner and meet regulatory compliance requirements has become an emerging problem, especially in risk-sensitive domains. Trustworthy graph learning (TwGL) aims to solve these problems from a technical viewpoint. In contrast to conventional graph learning, which mainly cares about model performance, TwGL considers various reliability and safety aspects of DGL, including but not limited to adversarial robustness, explainability, and privacy protection. While several previous KDD tutorials have introduced DGL, few have focused specifically on its safety aspects, including reliability, explainability, and privacy protection. This tutorial covers the key achievements of trustworthy graph learning in recent years. Specifically, we will discuss three essential topics: the reliability of DGL against inherent noise, distribution shift, and adversarial attacks; explainability methods; and privacy protection for DGL. Meanwhile, we will introduce guidelines for applying DGL to risk-sensitive applications (e.g., AI drug discovery) to ensure GNN models behave in a trustworthy way. We hope our tutorial can offer a comprehensive review of recent advances in this area and provide useful suggestions to guide developers in choosing appropriate techniques for their applications.
The field of graph neural networks (GNNs) has made rapid and remarkable strides in recent years. Graph neural networks, also known as deep learning on graphs, graph representation learning, or geometric deep learning, have become one of the fastest-growing research topics in machine learning, especially deep learning. This wave of research at the intersection of graph theory and deep learning has also influenced other fields, including recommendation systems, computer vision, natural language processing, inductive logic programming, program synthesis, software mining, automated planning, cybersecurity, and intelligent transportation. However, as the field rapidly grows, it has been extremely challenging to gain a global perspective of the developments of GNNs. Therefore, we feel the urgency to bridge this gap with a comprehensive tutorial on this fast-growing yet challenging topic. This tutorial, Graph Neural Networks (GNNs): Foundation, Frontiers and Applications, will cover a broad range of topics in graph neural networks, by reviewing and introducing the fundamental concepts and algorithms of GNNs, new research frontiers of GNNs, and broad and emerging applications of GNNs. In addition, rich tutorial materials will be included and introduced to help the audience gain a systematic understanding, drawing on our recently published book, Graph Neural Networks (GNN): Foundation, Frontiers, and Applications [12], which can easily be accessed at https://graph-neural-networks.github.io/index.html.
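For readers new to the topic, the sketch below shows the canonical GNN building block, a graph convolutional (GCN) layer computing H' = ReLU(D^{-1/2}(A + I)D^{-1/2} H W). It is a minimal, dense-matrix illustration only; the tutorial materials and book cover the full family of architectures and efficient implementations.

```python
# Minimal sketch of a graph convolutional (GCN) layer on a dense adjacency matrix.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, h):
        a_hat = adj + torch.eye(adj.size(0))          # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
        return torch.relu(a_norm @ self.linear(h))    # aggregate, then transform

adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # 3-node path graph
h = torch.randn(3, 8)
print(GCNLayer(8, 4)(adj, h).shape)  # torch.Size([3, 4])
```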
In the past decade, deep learning has significantly reshaped the landscape of information retrieval (IR). The community has recently begun to notice the potential dangers of overusing less-understood mechanisms and over-simplified assumptions to learn patterns and make decisions. In particular, there are growing concerns about the interpretation, reliability, social impact, and long-term utility of real-world IR systems. Therefore, it has become a pressing issue to bring the IR community comprehensive and systematic tools to understand empirical domain solutions and to motivate principled design ideas. We focus on the three pillars of modern IR systems: pattern recognition with deep learning, causal inference analysis, and online decision making (with bandits and reinforcement learning). Our objectives are as follows.
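To ground the online decision-making pillar, the sketch below implements UCB1, a standard multi-armed bandit algorithm. It is a generic textbook illustration under toy Bernoulli-arm assumptions, not a method specific to this tutorial.

```python
# Minimal sketch of the UCB1 bandit algorithm: play each arm once, then pick the arm
# maximizing (empirical mean + exploration bonus).
import numpy as np

def ucb1(reward_fn, n_arms, horizon, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                                   # initialization round
        else:
            bonus = np.sqrt(2 * np.log(t) / counts)       # exploration bonus
            arm = int(np.argmax(means + bonus))
        r = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]      # incremental mean update
        total_reward += r
    return means, total_reward

# three Bernoulli arms with success probabilities 0.2, 0.5, 0.8 (toy example)
reward_fn = lambda arm, rng: float(rng.random() < [0.2, 0.5, 0.8][arm])
print(ucb1(reward_fn, n_arms=3, horizon=5000))
```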
Anomaly detection is becoming increasingly important in modern society as everything goes digital. Consumers are spending much more time online, and various digital sensors are placed in physical and chemical equipment for health monitoring. Such monitoring data is growing at an exponential rate, which enables automated anomaly detection in various high-impact domains. In this tutorial, we dive deep into several state-of-the-art methods for finding anomalies in spatiotemporal data (or, more generally, correlated multivariate data), on a few specific use cases such as telecom network performance and user behavior monitoring. We identify suitable methods for specific data representations (i.e., spatial, temporal, and categorical dimensionality) for each use case, and present methods to convert raw data into the formats the existing methods require. Such a survey and hands-on exercise is necessary, as each use case has its own special data representation and requirements in live applications, and certain methods fit better than others for a specific data format under a specific scenario. With the investigation outlined in this tutorial, we hope attendees will be able to more efficiently choose the most appropriate method for their use case. To reach as wide an audience as possible, we investigate anomaly detection applications in multiple domains: telecom network performance, user behavior monitoring, financial transactions, and the industrial internet of things. The types of datasets range from univariate and multivariate time series of single or multiple entities to transactional tabular data with timestamps or sequence indexes. The types of anomalies in these datasets include contextual anomalies for one entity or collective anomalies aggregated among multiple entities in time series, and suspicious records or users in transactional tabular data. During this three-hour hands-on tutorial, we examine the suitability and performance of different methods for the four introduced use cases. The available choices of anomaly detection modeling frameworks range from sequential or time series models to static and dynamic graph neural network models that learn coupled spatial, structural, and temporal information. Among these models, anomaly detection is formulated as a forecasting problem for some use cases, in which anomaly severity is derived from the deviation of the observed value from the predicted value. In other use cases, anomaly detection is solved as a classification problem (either node or edge classification in graphs, row-based classification in tabular data, or sequential classification in sequence data). The tutorial combines an introduction to fundamental anomaly detection techniques with hands-on exercises. For the hands-on exercises, we focus on the time series based HTM method and the graph-based methods introduced in the tutorial. Participants learn to identify suitable methods, apply data transformation techniques to convert raw data into the formats the different methods assume, and study these methods on the aforementioned real-world or synthetic datasets.
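The sketch below illustrates the "anomaly detection as forecasting" formulation mentioned above: each point is scored by its standardized deviation from a one-step-ahead forecast. The trivial moving-average forecaster and synthetic data are placeholder assumptions; in practice any of the tutorial's forecasting models would take its place.

```python
# Minimal sketch: score anomalies as standardized residuals against a simple forecast.
import numpy as np

def residual_anomaly_scores(y, window=24):
    scores = np.zeros(len(y))
    for t in range(window, len(y)):
        history = y[t - window:t]
        forecast = history.mean()                      # trivial forecaster (placeholder)
        scale = history.std() + 1e-8
        scores[t] = abs(y[t] - forecast) / scale       # standardized residual
    return scores

t = np.arange(500)
y = np.sin(2 * np.pi * t / 24) + np.random.normal(0, 0.1, size=500)
y[321] += 5                                            # inject a contextual anomaly
scores = residual_anomaly_scores(y)
print(int(np.argmax(scores)))                          # 321
```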
This tutorial will show you how to build visualizations and interactive dashboards using HoloViz, an open-source visualization ecosystem comprising seven packages. You will learn how to turn nearly any notebook into a deployable dashboard, how to build visualizations easily even for big, streaming, and multidimensional data, how to build interactive drill-down exploratory tools for your data and models without having to run a web-technology software development project, and finally how to deploy your dashboard.
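As a small taste of that workflow, the sketch below plots a DataFrame with hvPlot and wraps it in a Panel layout that can be served as a dashboard. The data and layout are illustrative; the HoloViz documentation covers streaming, big-data, and drill-down patterns in depth.

```python
# Minimal sketch of the HoloViz workflow: hvPlot for the plot, Panel for the dashboard.
# Save as app.py and run with `panel serve app.py`.
import numpy as np
import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on pandas objects
import panel as pn

df = pd.DataFrame({
    "x": np.arange(100),
    "y": np.random.randn(100).cumsum(),
})

plot = df.hvplot.line(x="x", y="y", title="Random walk")
dashboard = pn.Column("# My first HoloViz dashboard", plot)
dashboard.servable()   # exposes the layout when served with `panel serve`
```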
Different from traditional machine learning tasks and benchmarks, real-world problems are usually accompanied by enormous output spaces, from hundreds of thousands of diseases in medical diagnosis to millions of items and billions of websites in product and web search engines. Unfortunately, conventional machine learning tools and libraries are incapable of efficiently and accurately tackling large-scale output spaces. To address this issue, PECOS (Prediction for Enormous and Correlated Output Spaces) [11] is a state-of-the-art, open-sourced machine learning library which not only provides high-level and user-friendly interfaces for both linear and deep learning models, but also supplies considerable flexibility for solving diverse machine learning problems. Specifically, PECOS eases the complicated semantic indexing needed to organize enormous output spaces, thereby training models and deriving predictions over correlated output labels orders of magnitude more efficiently. As a powerful and useful framework, PECOS has already been adopted in various real-world large-scale products, such as semantic search in Amazon [1], and has achieved state-of-the-art results on public extreme multi-label classification (XMC) benchmarks [2, 11, 12] and various downstream applications [3, 7, 9].
In this tutorial, we will introduce several key functions and features of the PECOS library. By way of real-world examples, attendees will learn how to efficiently train large-scale machine learning models for enormous output spaces, and how to obtain predictions in less than 1 millisecond for a data input with millions of labels, in the context of product recommendation and natural language processing. We will also show the flexibility of dealing with diverse machine learning problems and data formats using assorted built-in utilities in PECOS. By the end of the tutorial, we believe that attendees will be able to adapt these concepts to their own projects and address different machine learning problems with enormous output spaces.
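The following rough sketch outlines training an XR-Linear model with PECOS on sparse features and a sparse label matrix. The entry points (LabelEmbeddingFactory, Indexer, XLinearModel) follow the project README as we recall it; treat the exact names, signatures, and defaults as assumptions and consult the PECOS documentation before use.

```python
# Rough sketch of PECOS XR-Linear training on synthetic sparse data
# (API names assumed from the project README; verify before use).
import numpy as np
import scipy.sparse as sp
from pecos.xmc import Indexer, LabelEmbeddingFactory
from pecos.xmc.xlinear.model import XLinearModel

n_samples, n_features, n_labels = 1000, 200, 5000
X = sp.random(n_samples, n_features, density=0.05, format="csr", dtype=np.float32)
Y = sp.random(n_samples, n_labels, density=0.001, format="csr", dtype=np.float32)

label_feat = LabelEmbeddingFactory.create(Y, X, method="pifa")  # label embeddings
cluster_chain = Indexer.gen(label_feat)                         # hierarchical label tree
model = XLinearModel.train(X, Y, C=cluster_chain)               # train the ranker chain

scores = model.predict(X[:5])   # sparse matrix of top-scoring labels per input
print(scores.shape)
```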
Similar to previous iterations, the epiDAMIK @ KDD workshop is a forum to promote data-driven approaches in epidemiology and public health research. Even after the devastating impact of the COVID-19 pandemic, data-driven approaches are not as widely studied in epidemiology as they are in other spaces. We aim to promote and raise the profile of the emerging research area of data-driven and computational epidemiology, and to create a venue for presenting state-of-the-art and in-progress results, in particular results that would otherwise be difficult to present at a major data mining conference, including lessons learnt in the 'trenches'. The current COVID-19 pandemic has only showcased the urgency and importance of this area. Our target audience consists of data mining and machine learning researchers from both academia and industry who are interested in epidemiological and public-health applications of their work, and practitioners from the areas of mathematical epidemiology and public health. Homepage: https://epidamik.github.io/.
An average consumer spends 8+ hours a day across all devices interacting with online content that is almost entirely sponsored by advertisements. At over $450B in global market size in 2022, and expected to pass $1T by 2027, online advertising has already surpassed traditional ads in global spend. Moreover, computational advertising in particular is perhaps the most visible and ubiquitous application of machine learning, and one that interacts directly with consumers. When done right, ads enrich our lives; when done badly, they creep us out. Looking at the published literature over the last few years, many researchers might consider computational advertising a mature field. Yet the opposite is true. The field is evolving from ads controlled by monolithic publishers and randomly rotating banner ads to highly personalized content experiences in news feeds on mobile devices and even on TV, all utilizing data amassed from petabytes of stored user data. Ads are far from done.
The MIS2-TrueFact workshop is geared towards bringing academic, industry, and government researchers and practitioners together to tackle the challenges of misinformation, misbehavior, and data quality on the web, with heterogeneous and multi-modal sources of information including texts, images, videos, relational data, social networks, and knowledge graphs.
Citation data, along with other bibliographic datasets, have long been adopted by the knowledge and data discovery community as an important resource for demonstrating the validity and effectiveness of proposed algorithms and strategies. Many top computer scientists are also excellent researchers in the science of science. The purpose of this workshop is to bring the two communities (i.e., the knowledge discovery community and the science of science community) together, as scholarly activities have become salient web and social activities that are starting to generate a ripple effect on the broader knowledge discovery community. This workshop will showcase current data-driven science of science research by highlighting several studies and by building a community of researchers to explore questions critical to the future of data-driven science of science, especially a community of data-driven science of science in Data Science, so as to facilitate collaboration and inspire innovation. Through discussion of emerging and critical topics in the science of science, this workshop aims to help generate effective solutions for addressing environmental, societal, and technological problems in the scientific community.
Adversarial learning methods and their applications such as generative adversarial network, adversarial robustness, and security and privacy, have prevailed and revolutionized the research in machine learning and data mining. Their importance has not only been emphasized by the research community but also been widely recognized by the industry and the general public. Continuing the synergies in previous years, this third annual workshop aims to advance this research field. The AdvML'22 workshop consists of four tracks: (i) open-call paper submissions; (ii) invited speakers; (iii) rising star awards and presentations; and (iv) panel discussion on AdvML. The full details about the workshop can be found at https://sites.google.com/view/advml.
Recently, deep learning-based approaches have been widely applied. In particular, some applications involve data that are high-dimensional, sparse, or imbalanced, which differ from the dense-data applications, such as image classification and speech recognition, where deep learning-based approaches have been extensively studied. One of the main applications is the user-centric platform, which consists of a great number of users, items, and user-generated tabular data that are quite high-dimensional. The characteristics of such data pose unique challenges to the adoption of deep learning in these applications, including modeling, training, and online serving. More and more communities from both academia and industry have initiated endeavors to solve these challenges. This workshop will provide a venue for both the research and engineering communities to discuss and formulate the challenges, seize opportunities, and propose new ideas in the practice and theory of deep learning on high-dimensional, sparse, and imbalanced data.
Recommender systems (RecSys) play important roles in helping users navigate, discover, and consume large amounts of highly dynamic information. Today, many RecSys solutions deployed in the real world rely on categorical user profiles and/or pre-calculated recommendation actions that stay static during a user session. However, recent trends suggest that RecSys need to model user intent in real time and constantly adapt to meet user needs in the moment or change user behavior in situ. There are three primary drivers for this emerging need for online adaptation. First, in order to meet the increasing demand for a better personalized experience, the personalization dimensions and space will grow larger and larger; it would not be feasible to pre-compute recommended actions for all personalization scenarios beyond a certain scale. Second, in many settings the system does not have user prior history to leverage, and estimating user intent in real time is the only feasible way to personalize. As various consumer privacy laws tighten, it is foreseeable that many businesses will reduce their reliance on static user profiles, which makes the modeling of user intent in real time an important research topic. Third, a user's intent often changes within a session and between sessions, and user behavior can shift significantly during dramatic events. A RecSys should adapt in real time to meet user needs and be robust against distribution shifts. The online and adaptive recommender systems (OARS) workshop offers a focused discussion of the study and application of OARS, and will bring together an interdisciplinary community of researchers and practitioners from both industry and academia. KDD, as the premier data science conference, is an ideal venue to gather leaders in the field to further research into OARS and promote its adoption. This workshop is complementary to several sessions of the main conference (e.g., recommendation, reinforcement learning, etc.) and brings them together through a practical and focused application.
Knowledge networks/graphs provide a powerful approach for data discovery, integration, and reuse. The NSF's new Convergence Accelerator program, which focuses on transitioning research to practice and translational research, announced Track A on the Open Knowledge Network (OKN). The program calls for multidisciplinary and multi-sector teams to work together to build a cooperative and shared open knowledge network infrastructure to drive innovation across science, engineering, and humanities. This workshop aims to invite researchers, practitioners, and the general public to brainstorm the ideas related to OKN, collaboratively build KGs for different domains or applications, develop AI algorithms to provide intelligent services based on OKN, and discuss the social and economic implications related to OKN.
The Fragile Earth Workshop is a recurring event that gathers the research community to find and explore how data science can measure and advance progress on climate and social issues, following the framework of the United Nations Sustainable Development Goals (SDGs).
Fragile Earth 2022: AI for Climate Mitigation, Adaptation, and Environmental Justice is a workshop taking place as part of the ACM's KDD 2022 Conference on research in knowledge discovery and data mining and their applications. The dates for the Conference are August 14-18, 2022.
The 17th International Workshop on Mining and Learning with Graphs (MLG) is held in Washington DC, USA on August 15, 2022 and is co-located with the Eighth International Workshop on Deep Learning on Graphs (DLG) as part of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. This workshop is a forum for exchanging ideas and methods for mining and learning with graphs, developing new common understandings of the problems at hand, sharing of data sets where applicable, and leveraging existing knowledge from different disciplines. In doing so, we aim to better understand the overarching principles and the limitations of our current methods, and to inspire research on new algorithms and techniques for mining and learning with graphs. Topics of interest include, but are not limited to, graph mining, statistical relational learning, social network analysis, and network science. The target audience spans researchers and practitioners across academia, government, and industry.
The 3rd IADSS Workshop on Data Science Standards follows a tradition of two prior KDD workshops and the initial workshop at ICDM-2018. The theme of the 2022 workshop is Hiring, Assessing and Upskilling Data Science Talent. Organized by the Initiative for Analytics and Data Science (IADSS.org) at KDD, the workshop provides a platform to discuss industry needs and practices around external and internal talent pipeline development in data science. We aim to provide an understanding of the data science job market and the critical role of collaboration between academic institutions and industry in meeting the increasing need for talent. IADSS conducts ongoing research in this domain, and in this workshop we share detailed findings and observations from this research. In addition to contributions from researchers and industry practitioners through an open call for papers, the workshop features several invited presentations and invited speakers. In order to achieve the intended aim, interactive panels will discuss topics of interest, and feedback from the workshop will be used to produce post-workshop learnings. This workshop is designed as a half-day working meeting with short talks, invited panels, and discussion sessions to plan future steps on the topic. After the conference, learnings from the workshop will be available on the workshop's home page: https://www.iadss.org/kdd2022
Human civilization faces existential threats in the forms of climate change, food insecurity, pandemics, international conflicts, forced displacements, and environmental injustice. These overarching humanitarian challenges disproportionately impact historically marginalized communities worldwide. UN OCHA estimates that 274 million people will need humanitarian support in 2022. Despite growing perils to human and environmental well-being, there remains a paucity of publicly-engaged computing research to inform the design of interventions. Data science efforts exist, but they remain isolated from socioeconomic, environmental, cultural, and policy contexts at local and international scales. Moreover, biases and privacy infringements in data-driven methods further amplify existing inequalities. The result is that the proclaimed benefits of data-driven innovations may remain inaccessible to the policymakers, practitioners, and underserved communities whose lives they intend to transform. To address these gaps in knowledge and improve the livelihood of marginalized populations, we have established the Data-driven Humanitarian Mapping and Policymaking initiative, an interdisciplinary effort.
Machine learning applications are being rapidly adopted by industry leaders in every field. The growth of investment in AI-driven solutions has created new challenges in managing data science and ML resources, people, and projects as a whole. The discipline of managing applied machine learning teams requires a healthy mix of an agile product-development toolset and a long-term, research-oriented mindset. The ability to invest in deep research while connecting the outcomes to significant business results has created a large body of knowledge on management methods and best practices in the field. The Workshop on Applied Machine Learning Management brings together applied research managers from various fields to share methodologies and case studies on the management of ML teams, products, and projects, and on achieving business impact with advanced AI methods.
At present, most machine learning research on customer optimization focuses on the short-term success of customers by addressing questions such as: which users have a higher propensity to click? Where should one ad or multiple pieces of content be placed on a web page? What is the most appropriate time to show content? Little thought has been put into building a coherent system for long-term, end-to-end customer optimization: from acquisition, by understanding a user's propensity to convert to a particular product at a certain time; to a user's ability to be successful long term on a platform, as measured by CLV (Customer Lifetime Value); to users' ability to buy more products (cross-sell) on the same platform; and finally to users' propensity to churn. Currently, such models and algorithms are built in isolation to serve a single purpose, which leads to inefficiencies in modeling and data pipelines. Also, most of the time the customer is not looked at as a single entity; each product or subgroup within an organization (marketing, sales, product growth, go-to-market, product) considers the customer independently. This workshop aims to connect academic researchers and industrial practitioners who are working on, or interested in, building holistic systems and solutions in the field of end-to-end customer journey optimization.
With the advancement of GPS and remote sensing technologies and the pervasiveness of smartphones and IoT devices, an enormous amount of spatiotemporal data are being collected from various domains. Knowledge discovery from spatiotemporal data is crucial in addressing many grand societal challenges, ranging from flood disaster management to monitoring coastal hazards, and from autonomous driving to disease forecasting. The recent success in deep learning technologies in computer vision and natural language processing provides new opportunities for spatiotemporal data mining, but existing deep learning techniques also face unique spatiotemporal challenges (e.g., autocorrelation, non-stationarity, physics awareness). This workshop provides a premium platform for researchers from both academia and industry to exchange ideas on the opportunities, challenges, and cutting-edge techniques related to deep learning for spatiotemporal data.
The Sixth International Workshop on Automation in Machine Learning aims to identify opportunities and challenges for automation in machine learning, to provide an opportunity for researchers to discuss best practices for automation in machine learning (potentially leading to the definition of standards), and to provide a forum for researchers to speak out and debate different ideas in automation in machine learning. The workshop agenda includes four invited keynote speakers and four accepted paper presentations chosen through a peer review process. The workshop seeks to drive an engaging and interactive exchange of thoughts and ideas on AutoML.
The finance industry is constantly faced with an ever-evolving set of challenges, including credit card fraud, identity theft, network intrusion, money laundering, human trafficking, and illegal sales of firearms. There is also the newly emerging threat of fake news in financial media, which can lead to distortions in trading strategies and investment decisions. In addition, traditional problems such as customer analytics, forecasting, and recommendations take on a unique flavor when applied to financial data. A number of new ideas are emerging to tackle these problems, including semi-supervised learning methods, deep learning algorithms, network/graph-based solutions, and linguistic approaches. These methods must often work in real time and be able to handle large volumes of data. The purpose of this workshop is to bring together researchers and practitioners to discuss both the problems faced by the financial industry and potential solutions. We plan to invite regular papers, position papers, and extended abstracts of work in progress. We will also encourage short papers from financial industry practitioners that introduce domain-specific problems and challenges to academic researchers.
Causal relationships have been utilized in almost all disciplines, and the research into causal discovery has attracted a lot of attention in the last few years. Traditionally, causal relationships are identified by making use of interventions or randomized controlled experiments. However, conducting such experiments is often expensive or even impossible due to cost or ethical concerns. Therefore, there has been an increasing interest in discovering causal relationships based on observational data, and in the past few decades, significant contributions have been made to this field by computer scientists.
Following the success of CD 2016 - CD 2021, CD 2022 continues to serve as a forum for researchers and practitioners in data mining and other disciplines to share their recent research in causal discovery in their respective fields and to explore the possibility of interdisciplinary collaborations in the study of causality. Based on the platform of KDD, this workshop is especially interested in attracting contributions that link data mining/machine learning research with causal discovery, and solutions to causal discovery in large scale datasets.
Urbanization's rapid progress has led to many big cities, which have modernized many people's lives but have also engendered big challenges, such as air pollution, increased energy consumption, and traffic congestion. Tackling these challenges was nearly impossible years ago given the complex and dynamic settings of cities. Nowadays, sensing technologies and large-scale computing infrastructures have produced a variety of big data in urban spaces, e.g., human mobility, air quality, traffic patterns, and geographical data. Motivated by the opportunity to build more intelligent cities, we came up with a vision of urban computing, which aims to unlock the power of knowledge from big and heterogeneous data collected in urban spaces and apply this powerful information to solve the major issues our cities face today. This is the eleventh time that we have organized this workshop. The previous ten workshops were hosted with SIGKDD and SIGSPATIAL, each attracting over 70 participants and 30 submissions on average.
The shopping experience on any e-commerce website is largely driven by the content customers interact with. The large volume of diverse content on e-commerce platforms, and the advances in machine learning, pose unique opportunities for gathering insights through content understanding and for applying these insights to generate content for a better shopper experience. The purpose of the first edition of this workshop was to bring together researchers from industry and academia on questions surrounding e-commerce content understanding and generation.
Business documents are central to the operation of all organizations, and they come in all shapes and sizes: project reports, planning documents, technical specifications, financial statements, meeting minutes, legal agreements, contracts, resumes, purchase orders, invoices, and many more. The ability to read, understand, and interpret these documents, referred to here as Document Intelligence (DI), is challenging due not only to the many domains of knowledge involved, but also to their complex formats and structures, the internal and external cross-references they deploy, and even the less-than-ideal quality of the scans and OCR often performed on them. This workshop aims to explore and advance the current state of research and practice in answering these challenges.
The detection of, explanation of, and accommodation to anomalies and novelties are active research areas in multiple communities, including data mining, machine learning, and computer vision. They are applied in various guises including anomaly detection, out-of-distribution example detection, adversarial example recognition and detection, curiosity-driven reinforcement learning, and open-set recognition and adaptation, all of which are of great interest to the SIGKDD community. The techniques developed have been applied in a wide range of domains including fraud detection and anti-money laundering in fintech, early disease detection, intrusion detection in large-scale computer networks and data centers, defending AI systems from adversarial attacks, and in improving the practicality of agents through overcoming the closed-world assumption.
This workshop is focused on Anomaly and Novelty Detection, Explanation, and Accommodation (ANDEA). It will gather researchers and practitioners from the data mining, machine learning, and computer vision communities and from diverse knowledge backgrounds to promote the development of fundamental theories, effective algorithms, and novel applications of anomaly and novelty detection, characterization, and adaptation. All materials of the keynote talks and accepted papers of the workshop are made available at https://sites.google.com/view/andea2022/.
Data science is the practice of deriving insight from data, enabled by modeling, computational methods, interactive visual analysis, and domain-driven problem solving. Data science draws from methodology developed in fields such as applied mathematics, statistics, machine learning, data mining, data management, visualization, and HCI. It drives discoveries in business, economics, biology, medicine, environmental science, the physical sciences, the humanities and social sciences, and beyond. Machine learning, data mining, and visualization are integral parts of data science, and essential to enable sophisticated analysis of data. Nevertheless, these research areas are still rather separate and are investigated largely independently by different communities. The goal of this workshop is to bring researchers from both communities together in order to discuss common interests, to talk about practical issues in application-related projects, and to identify open research problems. This summary gives a brief overview of the ACM KDD Workshop on Visualization in Data Science (VDS at ACM KDD and IEEE VIS), which will take place virtually on Aug 14-18, 2022 (held in conjunction with KDD'22). The workshop website is available at http://www.visualdatascience.org/2022/
Time series data are ubiquitous and are one of the fastest-growing and richest types of data. Recent advances in sensing technologies have resulted in rapid growth in the size and complexity of time series archives, which demands the development of new tools and solutions. The goals of this workshop are to: (1) highlight the significant challenges that underpin learning and mining from time series data (e.g., irregular sampling, spatiotemporal structure, uncertainty quantification), (2) discuss recent algorithmic, theoretical, statistical, or systems-based developments for tackling these problems, and (3) explore new frontiers in time series analysis and their connections with important topics such as knowledge representation, reasoning, control, and business intelligence. In summary, our workshop will focus on both the theoretical and practical aspects of time series data analysis and will provide a platform for researchers and practitioners from both academia and industry to discuss potential research directions, key technical issues, and solutions to issues arising in practical applications. We will invite researchers and practitioners from the related areas of AI, machine learning, data science, statistics, and many others to contribute to this workshop.
An online marketplace is a digital platform that connects buyers (demand) and sellers (supply) and provides exposure opportunities that individual participants would not otherwise have access to. Online marketplaces exist in a diverse set of domains and industries, for example, rideshare (Lyft, DiDi, Uber), house rental (Airbnb), real estate (Beke), online retail (Amazon