KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining


SESSION: Research Track Full Papers

Minimizing Hitting Time between Disparate Groups with Shortcut Edges

Structural bias or segregation of networks refers to situations where two or more disparate groups are present in the network, so that the groups are highly connected internally, but loosely connected to each other. Examples include polarized communities in social networks, antagonistic content in video-sharing or news-feed platforms, etc. In many cases it is of interest to increase the connectivity of disparate groups so as to, e.g., minimize social friction, or expose individuals to diverse viewpoints. A commonly-used mechanism for increasing the network connectivity is to add edge shortcuts between pairs of nodes. In many applications of interest, edge shortcuts typically translate to recommendations, e.g., what video to watch, or what news article to read next. The problem of reducing structural bias or segregation via edge shortcuts has recently been studied in the literature, and random walks have been an essential tool for modeling navigation and connectivity in the underlying networks. Existing methods, however, either do not offer approximation guarantees, or engineer the objective so that it satisfies certain desirable properties that simplify the optimization task.

In this paper we address the problem of adding a given number of shortcut edges to the network so as to directly minimize the average hitting time and the maximum hitting time between two disparate groups. The objectives we study are more natural than objectives considered earlier in the literature (e.g., maximizing hitting-time reduction), and the optimization task is significantly more challenging. Our algorithm for minimizing average hitting time is a greedy bicriteria algorithm that relies on supermodularity. In contrast, maximum hitting time is not supermodular. Despite this, we develop an approximation algorithm for that objective as well, by leveraging connections with average hitting time and the asymmetric k-center problem.
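To make the objective concrete, here is a minimal sketch (not the paper's bicriteria algorithm): expected hitting times to a target group B satisfy the first-step equations h(v) = 1 + (1/deg(v)) Σ_{u~v} h(u), with h(b) = 0 for b in B, which can be solved by fixed-point iteration. The toy graph, shortcut choice, and iteration count below are illustrative.

```python
# Sketch: average hitting time from group A to group B on a toy graph,
# before and after adding one shortcut edge between the groups.

def hitting_times(adj, targets, iters=10000):
    # fixed-point iteration on h(v) = 1 + mean of h over neighbors, h = 0 on targets
    h = {v: 0.0 for v in adj}
    for _ in range(iters):
        new = {}
        for v in adj:
            if v in targets:
                new[v] = 0.0
            else:
                new[v] = 1.0 + sum(h[u] for u in adj[v]) / len(adj[v])
        h = new
    return h

def add_edge(adj, u, v):
    adj[u].add(v); adj[v].add(u)

# two triangles joined by a single bridge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
before = hitting_times(adj, targets={3, 4, 5})
avg_before = sum(before[v] for v in (0, 1, 2)) / 3

add_edge(adj, 0, 5)   # candidate shortcut between the two groups
after = hitting_times(adj, targets={3, 4, 5})
avg_after = sum(after[v] for v in (0, 1, 2)) / 3
assert avg_after < avg_before
```

On this two-triangle graph the average hitting time from {0, 1, 2} to {3, 4, 5} drops from 25/3 to 13/3 after the single shortcut, which is exactly the quantity a greedy method would evaluate per candidate edge.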

Maximizing Neutrality in News Ordering

The detection of fake news has received increasing attention over the past few years, but there are more subtle ways of deceiving one's audience. In addition to the content of news stories, their presentation can also be made misleading or biased. In this work, we study the impact of the ordering of news stories on audience perception. We introduce the problems of detecting cherry-picked news orderings and maximizing neutrality in news orderings. We prove hardness results and present several algorithms for approximately solving these problems. Furthermore, we provide extensive experimental results and present evidence of potential cherry-picking in the real world.

Fair Allocation Over Time, with Applications to Content Moderation

In today's digital world, interaction with online platforms is ubiquitous, and thus content moderation is important for protecting users from content that does not comply with pre-established community guidelines. Given the vast volume of content generated online daily, having an efficient content moderation system throughout every stage of planning is particularly important. We study the short-term planning problem of allocating human content reviewers to different harmful content categories. We use tools from fair division and study the application of competitive equilibrium and leximin allocation rules for addressing this problem. On top of the traditional Fisher market setup, we additionally incorporate novel aspects that are of practical importance. The first aspect is the forecasted workload of different content categories, which constrains the allocation chosen by the planner. We show how a formulation inspired by the celebrated Eisenberg-Gale program allows us to find an allocation that not only satisfies the forecasted workload, but also fairly allocates the content reviewers' remaining working hours among all content categories. A fair allocation of oversupply provides a guardrail in cases where the actual workload deviates from the predicted workload. The second practical consideration is time-dependent allocation, motivated by the fact that partners need scheduling guidance for the reviewers across days to achieve efficiency. To address the time component, we introduce new extensions of the various fair allocation approaches for the single-time-period setting, and we show that many properties extend in essence, albeit with some modifications. Lastly, related to the time component, we additionally investigate how to satisfy markets' desire for a smooth allocation (i.e., an allocation that does not vary much over time) so that switches in staffing are minimized.
We demonstrate the performance of our proposed approaches through real-world data obtained from Meta.

LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias

Textual noise, such as typos or abbreviations, is a well-known issue that penalizes vanilla Transformers on most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g., matching, retrieval, or paraphrasing. Sentence similarity can be approached using cross-encoders, where the two sentences are concatenated in the input, allowing the model to exploit the inter-relations between them. Previous works addressing the noise issue mainly rely on data augmentation strategies, showing improved robustness when dealing with corrupted samples that are similar to the ones used for training. However, all these methods still suffer from the token distribution shift induced by typos. In this work, we propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in both sentences. By using raw text similarities, our approach avoids the tokenization shift problem and obtains improved robustness. We demonstrate that the attention bias introduced by LEA helps cross-encoders to tackle complex scenarios with textual noise, especially in domains with short-text descriptions and limited context. Experiments using three popular Transformer encoders on five e-commerce datasets for product matching show that LEA consistently boosts performance under the presence of noise, while remaining competitive on the original (clean) splits. We also evaluate our approach on two datasets for textual entailment and paraphrasing, showing that LEA is robust to typos in domains with longer sentences and more natural context. Additionally, we thoroughly analyze several design choices in our approach, providing insights about the impact of the decisions made and fostering future research in cross-encoders dealing with typos.
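A toy rendition of the idea (the actual LEA module, its similarity measure, and its parameterization differ): add a lexical-similarity term to the attention logits so that a typo'd surface form still attends to its counterpart. Character-trigram Jaccard and the bias strength `lam` are illustrative choices, not the paper's.

```python
# Sketch: biasing attention logits with a raw-text lexical similarity.
import math

def ngrams(word, n=3):
    # character n-grams with boundary markers
    w = f"#{word}#"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def lex_sim(a, b):
    A, B = ngrams(a), ngrams(b)
    return len(A & B) / len(A | B)

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    return [v / sum(e) for v in e]

query_tok = "sneaker"
key_toks = ["sneakr", "laptop", "shoe"]   # "sneakr" is a typo'd match
base_logits = [0.0, 0.0, 0.0]             # pretend content attention is uninformative
lam = 5.0                                 # bias strength (illustrative hyperparameter)
biased = [l + lam * lex_sim(query_tok, k) for l, k in zip(base_logits, key_toks)]
attn = softmax(biased)
assert attn[0] == max(attn)   # the typo'd variant receives the most attention
```

The point of the sketch: because the bias is computed on raw characters, it is unaffected by the subword tokenization shift that typos induce.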

Rank-heterogeneous Preference Models for School Choice

School choice mechanism designers use discrete choice models to understand and predict families' preferences. The most widely-used choice model, the multinomial logit (MNL), is linear in school and/or household attributes. While the model is simple and interpretable, it assumes the ranked preference lists arise from a choice process that is uniform throughout the ranking, from top to bottom. In this work, we introduce two strategies for rank-heterogeneous choice modeling tailored for school choice. First, we adapt a context-dependent random utility model (CDM), considering down-rank choices as occurring in the context of earlier up-rank choices. Second, we consider stratifying the choice modeling by rank, regularizing rank-adjacent models towards one another when appropriate. Using data on household preferences from the San Francisco Unified School District (SFUSD) across multiple years, we show that the contextual models considerably improve our out-of-sample evaluation metrics across all rank positions over the non-contextual models in the literature. Meanwhile, stratifying the model by rank can yield more accurate first-choice predictions while down-rank predictions are relatively unimproved. These models provide performance upgrades that school choice researchers can adopt to improve predictions and counterfactual analyses.
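For background, the baseline being improved on can be sketched as follows: an MNL assigns each school a choice probability proportional to exp(utility), and a rank-homogeneous model "explodes" a ranking into repeated MNL choices over the remaining schools. The utilities below are hypothetical; the paper's CDM and stratified variants relax exactly this rank-uniformity.

```python
# Sketch: MNL choice probabilities and the exploded-logit ranking likelihood.
import math

def mnl_probs(utilities, available):
    # softmax over the utilities of the still-available alternatives
    z = [math.exp(utilities[s]) for s in available]
    total = sum(z)
    return {s: zi / total for s, zi in zip(available, z)}

# hypothetical linear utilities beta . x for three schools
utilities = {"A": 1.2, "B": 0.4, "C": -0.3}

p_first = mnl_probs(utilities, ["A", "B", "C"])
assert max(p_first, key=p_first.get) == "A"

# probability of the full ranking A > B > C under the rank-homogeneous model:
# each rank is a fresh MNL choice among the schools not yet ranked
p_rank = p_first["A"] * mnl_probs(utilities, ["B", "C"])["B"]
```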

Knowledge Graph Reasoning over Entities and Numerical Values

A complex logic query in a knowledge graph refers to a query expressed in logic form that conveys a complex meaning, such as "where did the Canadian Turing Award winner graduate from?" Knowledge graph reasoning-based applications, such as dialogue systems and interactive search engines, rely on the ability to answer complex logic queries as a fundamental task. In most knowledge graphs, edges are typically used to describe either the relationships between entities or their associated attribute values. An attribute value can be in categorical or numerical format, such as dates, years, sizes, etc. However, existing complex query answering (CQA) methods simply treat numerical values in the same way as they treat entities. This can lead to difficulties in answering certain queries, such as "which Australian Pulitzer Award winner was born before 1927?" and "which drug is a pain reliever and has fewer side effects than Paracetamol?" In this work, inspired by recent advances in numerical encoding and knowledge graph reasoning, we propose numerical complex query answering. In this task, we introduce new numerical variables and operations to describe queries involving numerical attribute values. To address the difference between entities and numerical values, we also propose the Number Reasoning Network (NRN) framework for alternately encoding entities and numerical values into separate encoding structures. During the numerical encoding process, NRN employs a parameterized density function to encode the distribution of numerical values. During the entity encoding process, NRN uses established query encoding methods for the original CQA problem. Experimental results show that NRN consistently improves various query encoding methods on three different knowledge graphs and achieves state-of-the-art results.

Communication Efficient and Differentially Private Logistic Regression under the Distributed Setting

We study the classic machine learning problem of logistic regression with differential privacy (DP) under the distributed setting. While logistic regression with DP has been extensively studied in the literature, most of the research focuses on the centralized setting, where a centralized server is trusted with the entire private training dataset. However, in many real-world scenarios (e.g., federated learning), the data is distributed among multiple clients who may not trust other parties, including the other clients and the server. While the server tries to learn a model using the clients' private datasets, the clients should provide each individual record in their local datasets with a formal privacy guarantee.

Towards this end, we propose a general mechanism for logistic regression with DP under the distributed setting, based on output perturbation. We show that our solution satisfies differential privacy and enjoys privacy amplification by secure aggregation, a recent technique for DP under the distributed setting. In addition, our solution also incurs much lower communication costs (communication is considered a major overhead in federated learning) compared with existing ones. In particular, our solution requires the clients to communicate only once throughout the entire FL process. Finally, we provide experimental results on real-world datasets to demonstrate the effectiveness of our solution.
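A minimal single-party sketch of the output-perturbation idea (in the spirit of Chaudhuri and Monteleoni's result for strongly convex regularized ERM; the paper's distributed protocol and secure-aggregation amplification are not reproduced): train a regularized model, then release the weights plus Laplace noise scaled to the sensitivity bound 2/(n·λ). All constants below are illustrative.

```python
# Sketch: output perturbation for L2-regularized logistic regression.
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, lam=0.1, lr=0.5, steps=2000):
    # plain gradient descent on the regularized logistic loss
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / n + lam * w
        w -= lr * grad
    return w

def output_perturb(w, n, lam, epsilon):
    # L2-sensitivity bound of the minimizer for lam-strongly-convex ERM
    # with per-example feature norm <= 1 (illustrative constants)
    sensitivity = 2.0 / (n * lam)
    noise = rng.laplace(scale=sensitivity / epsilon, size=w.shape)
    return w + noise

# toy separable data, features clipped to norm <= 1 as the bound assumes
X = rng.normal(size=(200, 2))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = train_logreg(X, y)
w_private = output_perturb(w, n=len(X), lam=0.1, epsilon=1.0)
acc = np.mean((X @ w_private > 0) == (y == 1))
```

In the distributed version, each client would instead train locally and the noisy contributions would be combined through secure aggregation in a single communication round.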

Connecting the Dots -- Density-Connectivity Distance unifies DBSCAN, k-Center and Spectral Clustering

Despite the popularity of density-based clustering, its procedural definition makes it difficult to analyze compared to clustering methods that minimize a loss function. In this paper, we reformulate DBSCAN through a clean objective function by introducing the density-connectivity distance (dc-dist), which captures the essence of density-based clusters by endowing the minimax distance with the concept of density. This novel ultrametric allows us to show that DBSCAN, k-center, and spectral clustering are equivalent in the space given by the dc-dist, despite these algorithms being perceived as fundamentally different in their respective literatures. We also verify that finding the pairwise dc-dists gives DBSCAN clusterings across all epsilon-values, simplifying the problem of parameterizing density-based clustering. We conclude by thoroughly analyzing density-connectivity and its properties -- a task that has been elusive thus far in the literature due to the lack of formal tools. Our code recreates every experiment below: https://github.com/Andrew-Draganov/dc_dist
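A small illustration of a density-aware minimax distance in this spirit (the exact dc-dist definition may differ in details): replace pairwise distances by mutual-reachability distances max(core(x), core(y), d(x, y)), as in HDBSCAN, then take minimax path distances via a Floyd-Warshall-style closure, which yields an ultrametric.

```python
# Sketch: a density-aware minimax (ultrametric) distance on a toy point set.
import math

def dc_like_dist(points, k=2):
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    core = [sorted(row)[k] for row in d]   # distance to the k-th nearest neighbor
    # mutual reachability: max of the two core distances and the raw distance
    m = [[max(core[i], core[j], d[i][j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    # minimax closure: shrink each entry through intermediate nodes
    for t in range(n):
        for i in range(n):
            for j in range(n):
                m[i][j] = min(m[i][j], max(m[i][t], m[t][j]))
    return m

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
m = dc_like_dist(pts)
# ultrametric check: the strong triangle inequality holds for every triple
for i in range(6):
    for j in range(6):
        for t in range(6):
            assert m[i][j] <= max(m[i][t], m[t][j]) + 1e-9
```

Within-cluster distances stay small while every cross-cluster pair collapses to the same bottleneck value, which is the ultrametric structure that makes DBSCAN, k-center, and spectral clustering coincide in this space.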

Sketch-Based Anomaly Detection in Streaming Graphs

Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges and subgraphs in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? For example, in intrusion detection, existing work seeks to detect either anomalous edges or anomalous subgraphs, but not both. In this paper, we first extend the count-min sketch data structure to a higher-order sketch. This higher-order sketch has the useful property of preserving the dense subgraph structure (dense subgraphs in the input turn into dense submatrices in the data structure). We then propose 4 online algorithms that utilize this enhanced data structure, which (a) detect both edge and graph anomalies; (b) process each edge and graph in constant memory and constant update time per newly arriving edge, and; (c) outperform state-of-the-art baselines on 4 real-world datasets. Our method is the first streaming approach that incorporates dense subgraph search to detect graph anomalies in constant memory and time.
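The data-structure extension can be sketched as follows (hash functions and sizes here are illustrative, not the paper's): each count-min row becomes a small matrix indexed by hashes of both edge endpoints, so a dense subgraph in the stream concentrates into a dense submatrix.

```python
# Sketch: a "higher-order" count-min sketch keyed by both edge endpoints.
import random

class HigherOrderCMS:
    def __init__(self, rows=4, width=32, seed=0):
        rnd = random.Random(seed)
        self.seeds = [rnd.randrange(1 << 30) for _ in range(rows)]
        self.width = width
        # one width x width counter matrix per hash function
        self.tables = [[[0] * width for _ in range(width)] for _ in range(rows)]

    def _h(self, seed, x):
        return hash((seed, x)) % self.width

    def add(self, u, v, w=1):
        for seed, t in zip(self.seeds, self.tables):
            t[self._h(seed, u)][self._h(seed, v)] += w

    def query(self, u, v):
        # count-min estimate: minimum over rows, never an undercount
        return min(t[self._h(seed, u)][self._h(seed, v)]
                   for seed, t in zip(self.seeds, self.tables))

cms = HigherOrderCMS()
for _ in range(5):
    cms.add("a", "b")
cms.add("c", "d")
assert cms.query("a", "b") >= 5
```

Because each submatrix preserves coarse graph structure, both per-edge anomaly scores and dense-submatrix (subgraph) scores can be read off the same constant-memory structure.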

Preemptive Detection of Fake Accounts on Social Networks via Multi-Class Preferential Attachment Classifiers

In this paper, we describe a new algorithm called Preferential Attachment k-class Classifier (PreAttacK) for detecting fake accounts in a social network. Recently, several algorithms have obtained high accuracy on this problem. However, they have done so by relying on information about fake accounts' friendships or the content they share with others, the very things we seek to prevent.

PreAttacK represents a significant departure from these approaches. We provide some of the first detailed distributional analyses of how new fake (and real) accounts first attempt to make friends by strategically targeting their initial friend requests after joining a major social network (Facebook). We show that even before a new account has made friends or shared content, these initial friend request behaviors evoke a natural multi-class extension of the canonical Preferential Attachment model of social network growth.

We leverage this model to derive a new algorithm, PreAttacK. We prove that in relevant problem instances, PreAttacK near-optimally approximates the posterior probability that a new account is fake under this multi-class Preferential Attachment model of new accounts' (not-yet-answered) friend requests. These are the first provable guarantees for fake account detection that apply to new users, and that do not require strong homophily assumptions.

This principled approach also makes PreAttacK the only algorithm with provable guarantees that obtains state-of-the-art performance at scale on the global Facebook network, allowing it to detect fake accounts before standard methods apply and at lower computational cost. Specifically, PreAttacK converges to informative classifications (AUC ≈ 0.9) after new accounts send and receive a total of just 20 not-yet-answered friend requests. For comparison, state-of-the-art network-based algorithms do not obtain this performance even after observing additional data on new users' first 100 friend requests. Thus, unlike mainstream algorithms, PreAttacK converges before the median new fake account has made a single friendship (i.e., accepted friend request) with a human.

On Improving the Cohesiveness of Graphs by Merging Nodes: Formulation, Analysis, and Algorithms

Graphs are a powerful mathematical model, and they are used to represent real-world structures in various fields. In many applications, real-world structures with high connectivity and robustness are preferable. For enhancing the connectivity and robustness of graphs, two operations, adding edges and anchoring nodes, have been extensively studied. However, merging nodes, which is a realistic operation in many scenarios (e.g., bus station reorganization, multiple team formation), has been overlooked. In this work, we study the problem of improving graph cohesiveness by merging nodes. First, we formulate the problem mathematically, using the size of the k-truss, for a given k, as the objective. Then, we prove the NP-hardness and non-modularity of the problem. After that, we develop BATMAN, a fast and effective algorithm for choosing sets of nodes to be merged, based on our theoretical findings and empirical observations. Lastly, we demonstrate the superiority of BATMAN over several baselines, in terms of speed and effectiveness, through extensive experiments on fourteen real-world graphs.
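To pin down the objective (BATMAN itself is not reproduced here): the k-truss is the maximal subgraph in which every edge participates in at least k-2 triangles, and its size can be computed by standard support peeling, sketched below.

```python
# Sketch: k-truss computation by repeatedly peeling low-support edges.
def k_truss_edges(edges, k):
    adj = {}
    E = set()
    for u, v in edges:
        if u == v:
            continue
        E.add((min(u, v), max(u, v)))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u, v in list(E):
            support = len(adj[u] & adj[v])   # triangles through edge (u, v)
            if support < k - 2:
                E.discard((u, v))
                adj[u].discard(v); adj[v].discard(u)
                changed = True
    return E

# a 4-clique plus a pendant path; the 4-truss is exactly the clique's 6 edges
clique = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tail = [(3, 4), (4, 5)]
truss = k_truss_edges(clique + tail, k=4)
assert truss == set(clique)
```

Merging nodes can create new triangles and thereby grow this edge set, which is the quantity the paper's node-merging problem maximizes.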

MBrain: A Multi-channel Self-Supervised Learning Framework for Brain Signals

Brain signals are important quantitative data for understanding the physiological activities and diseases of the human brain. Meanwhile, rapidly developing deep learning methods offer a wide range of opportunities for better modeling brain signals, which has attracted considerable research effort recently. Most existing studies pay attention to supervised learning methods, which, however, require high-cost clinical labels. In addition, the huge difference in the clinical patterns of brain signals measured by invasive (e.g., SEEG) and non-invasive (e.g., EEG) methods leads to the lack of a unified method. To handle the above issues, in this paper, we propose to study a self-supervised learning (SSL) framework for brain signals that can be applied to pre-train either SEEG or EEG data. Intuitively, brain signals, generated by the firing of neurons, are transmitted among different connecting structures in the human brain. Inspired by this, we propose MBrain to learn implicit spatial and temporal correlations between different channels (i.e., contacts of the electrode, corresponding to different brain areas) as the cornerstone for uniformly modeling different types of brain signals. Specifically, we represent the spatial correlation by a graph structure, which is built with our proposed multi-channel CPC. We theoretically prove that optimizing the goal of multi-channel CPC can lead to a better predictive representation, and we apply the instantaneous-time-shift prediction task based on it. Then we capture the temporal correlation by designing the delayed-time-shift prediction task. Finally, a replace-discriminative-learning task is proposed to preserve the characteristics of each channel. Extensive experiments on seizure detection over both EEG and SEEG large-scale real-world datasets demonstrate that our model outperforms several state-of-the-art time series SSL and unsupervised models, and has the ability to be deployed in clinical practice.

When to Pre-Train Graph Neural Networks? From Data Generation Perspective!

In recent years, graph pre-training has gained significant attention, focusing on acquiring transferable knowledge from unlabeled graph data to improve downstream performance. Despite these recent endeavors, the problem of negative transfer remains a major concern when applying graph pre-trained models to downstream tasks. Previous studies made great efforts on the issues of what to pre-train and how to pre-train by designing a variety of graph pre-training and fine-tuning strategies. However, there are cases where even the most advanced "pre-train and fine-tune" paradigms fail to yield distinct benefits. This paper introduces a generic framework W2PGNN to answer the crucial question of when to pre-train (i.e., in what situations we could take advantage of graph pre-training) before performing effortful pre-training or fine-tuning. We start from a new perspective to explore the complex generative mechanisms connecting the pre-training data to the downstream data. In particular, W2PGNN first fits the pre-training data into graphon bases, where each element of the graphon basis (i.e., a graphon) identifies a fundamental transferable pattern shared by a collection of pre-training graphs. All convex combinations of the graphon bases give rise to a generator space, and the graphs generated from it form the solution space of downstream data that can benefit from pre-training. In this manner, the feasibility of pre-training can be quantified as the generation probability of the downstream data under some generator in the generator space. W2PGNN offers three broad applications: providing the application scope of graph pre-trained models, quantifying the feasibility of pre-training, and assisting in the selection of pre-training data to enhance downstream performance. We provide a theoretically sound solution for the first application and extensive empirical justifications for the latter two applications.

Privacy Matters: Vertical Federated Linear Contextual Bandits for Privacy Protected Recommendation

Recent awareness of privacy protection and compliance requirements has cast a controversial light on recommendation systems due to their use of personal data. Therefore, privacy-protected recommendation emerges as a novel research direction. In this paper, we first formulate this problem as a vertical federated learning problem, i.e., features are vertically distributed over different departments. We study a contextual bandit learning problem for recommendation in the vertical federated setting. To this end, we carefully design a customized encryption scheme named the orthogonal matrix-based mask mechanism (O3M). By carefully exploiting the shared structure of contextual bandits, the O3M mechanism ensures privacy protection while avoiding expensive conventional cryptographic techniques. We further apply the mechanism to two commonly-used bandit algorithms, LinUCB and LinTS, and instantiate two practical protocols for online recommendation. The proposed protocols can perfectly recover the service quality of centralized bandit algorithms while achieving satisfactory runtime efficiency, which is theoretically proved and analysed in this paper. By conducting extensive experiments on both synthetic and real-world datasets, we show the superiority of the proposed method in terms of privacy protection and recommendation performance.
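The algebraic fact that makes an orthogonal mask compatible with linear bandits can be checked in a few lines (the full O3M protocol is not reproduced here): for any orthogonal Q, (Qx)·(Qy) = x·y, so the Gram matrices and scores a linear bandit such as LinUCB builds from masked features coincide with the unmasked ones.

```python
# Sketch: an orthogonal mask hides raw features but preserves inner products.
import numpy as np

rng = np.random.default_rng(1)
d = 5
# random orthogonal matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))

x = rng.normal(size=d)        # a (hypothetical) private feature vector
theta = rng.normal(size=d)    # a (hypothetical) model parameter vector

assert np.allclose(Q.T @ Q, np.eye(d))
# scores computed from masked vectors match the unmasked scores
assert np.isclose((Q @ x) @ (Q @ theta), x @ theta)
masked = Q @ x                # what another party would observe
```

This is why the protocols can "perfectly recover" the centralized service quality: the masked statistics are exactly the statistics the centralized algorithm would have used.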

Efficient Coreset Selection with Cluster-based Methods

Coreset selection is a technique for efficient machine learning, which selects a subset of the training data to achieve model performance similar to that of using the full dataset. It can be performed with or without training machine learning models. Coreset selection with training, which iteratively trains the machine learning model and updates data items in the coreset, is time-consuming. Coreset selection without training can select the coreset before training. Gradient approximation is the typical method, but it can also be slow when dealing with large training datasets, as it requires multiple iterations and pairwise distance computations in each iteration. The state-of-the-art (SOTA) results w.r.t. effectiveness are achieved by the latter approach, i.e., gradient approximation.

In this paper, we aim to significantly improve the efficiency of coreset selection while ensuring good effectiveness, by improving the SOTA approaches that use gradient approximation without training machine learning models. Specifically, we present a highly efficient coreset selection framework that utilizes an approximation of the gradient. This is achieved by dividing the entire training set into multiple clusters, each of which contains items with similar features (measured by Euclidean distance). Our framework further demonstrates that the full gradient can be bounded based on the maximum feature distance between each item and each cluster, allowing for more efficient coreset selection by iterating through these clusters. Additionally, we propose an efficient method for estimating the maximum feature distance using the product quantization technique. Our experiments on multiple real-world datasets demonstrate that we can improve efficiency by 3-10 times compared with the SOTA, almost without sacrificing accuracy.
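An illustrative cluster-based selection loop (a generic sketch, not the paper's algorithm or its product-quantization estimates): cluster the training set, then keep the item nearest each centroid so each region of feature space is represented in the coreset.

```python
# Sketch: picking coreset representatives from k-means-style clusters.
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def coreset_indices(X, k):
    centers, labels = kmeans(X, k)
    picks = []
    for j in range(len(centers)):
        members = np.flatnonzero(labels == j)
        if len(members):
            dists = np.linalg.norm(X[members] - centers[j], axis=1)
            picks.append(members[np.argmin(dists)])  # nearest item to centroid
    return picks

# two well-separated blobs of 50 points each
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
picks = coreset_indices(X, k=2)
assert len(picks) == 2
```

The efficiency argument in the paper rests on bounding gradients per cluster rather than per pair, so selection can iterate over clusters instead of all item pairs.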

SURE: Robust, Explainable, and Fair Classification without Sensitive Attributes

A classifier that is accurate on average may still underperform for "sensitive" subsets of people. Such subsets could be based on race, gender, age, etc. The goal of a fair classifier is to perform well for all sensitive subsets. But often, the sensitive subsets are not known a priori. So we may want the classifier to perform well on all subsets that are likely to be sensitive. We propose an iterative algorithm called SURE for this problem. In each iteration, SURE identifies high-risk zones in feature space where the risk of unfair classification is statistically significant. By changing the loss function's weights for points from these zones, SURE builds a fair classifier. The emphasis on statistical significance makes SURE robust to noise. The high-risk zones are intuitive and interpretable. Every step of our method is explainable in terms of significance tests. Finally, SURE is fast and parameter-free. Experiments on both simulated and real-world datasets show that SURE is competitive with the state-of-the-art.

Data-Efficient and Interpretable Tabular Anomaly Detection

Anomaly detection (AD) plays an important role in numerous applications. In this paper, we focus on two understudied aspects of AD that are critical for integration into real-world applications. First, most AD methods cannot incorporate labeled data, which is often available in practice in small quantities and can be crucial for achieving high accuracy. Second, most AD methods are not interpretable, a bottleneck that prevents stakeholders from understanding the reasons behind anomalies. We propose a novel AD framework, DIAD, that adapts a white-box model class, Generalized Additive Models, to detect anomalies using a partial identification objective that naturally handles noisy or heterogeneous features. DIAD can incorporate a small amount of labeled data to further boost AD performance in semi-supervised settings. We demonstrate the superiority of DIAD compared to previous work in both unsupervised and semi-supervised settings on multiple datasets. We also demonstrate the explainability capabilities of DIAD, illustrating its rationale for predicting certain samples as anomalies.

IPOC: An Adaptive Interval Prediction Model based on Online Chasing and Conformal Inference for Large-Scale Systems

In large-scale systems, due to system complexity and demand volatility, diverse and dynamic workloads make accurate predictions difficult. In this work, we address an online interval prediction problem (OnPred-Int) and adopt ensemble learning to solve it. We show that ensemble learning for OnPred-Int is a dynamic deterministic Markov decision process (Dd-MDP) and convert it into a stateful online learning task. We then propose IPOC, a lightweight and flexible model able to produce effective confidence intervals, adapting to the dynamics of real-time workload streams. At each time step, IPOC selects a target model and chases it via a designed chasing oracle, producing accurate confidence intervals in the process. The effectiveness of IPOC is theoretically validated through sublinear regret analysis and satisfaction of confidence interval requirements. Besides, we conduct extensive experiments on 4 real-world datasets, comparing with 19 baselines. To the best of our knowledge, we are the first to apply the frontier theory of online learning to time series prediction tasks.
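For background, the split-conformal recipe that conformal-inference interval models build on can be sketched in a few lines (IPOC's online chasing and ensemble machinery are not reproduced): widen a point forecast by the appropriate quantile of held-out absolute residuals. The forecast numbers below are hypothetical.

```python
# Sketch: a split-conformal prediction interval around a point forecast.
import math

def conformal_interval(point_forecasts, actuals, new_forecast, alpha=0.1):
    residuals = sorted(abs(a - f) for a, f in zip(actuals, point_forecasts))
    n = len(residuals)
    # conformal quantile index ceil((n + 1) * (1 - alpha)), clipped to n
    q = residuals[min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)]
    return new_forecast - q, new_forecast + q

# hypothetical workload forecasts vs. observed values
forecasts = [100, 110, 95, 105, 120, 98, 102, 115, 99, 108]
actuals   = [103, 108, 97, 109, 118, 95, 104, 112, 101, 110]
lo, hi = conformal_interval(forecasts, actuals, new_forecast=107, alpha=0.1)
assert lo < 107 < hi
```

Under exchangeability this interval covers the next actual with probability at least 1 - alpha; an online method must additionally adapt the residual pool as the stream drifts.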

On Hierarchical Disentanglement of Interactive Behaviors for Multimodal Spatiotemporal Data with Incompleteness

Multimodal spatiotemporal data (MST) consists of multiple simultaneous spatiotemporal modalities that interact with each other in a dynamic manner. Due to the complexity of MST and the recent demand for explainability in artificial intelligence systems, disentangled representation learning for MST (DisentMST) has become a significant task. It aims to learn disentangled representations that expose the underlying spatial semantics, temporal dynamic patterns, and inter-modality interaction modes of complex MST. One limitation of existing approaches is that they might fail to tolerate real-world incomplete MST data, where missing information can break the cross-modal spatiotemporal dynamics and bring noise and ambiguity to the learning process. Another limitation is that no existing work systematically reveals the structure of the different types of disentangled information. To tackle these two limitations, we define a novel two-level hierarchically structured disentanglement task for MST, which yields informative and structured disentangled representations while handling real-world MST data with incompleteness. We propose a new framework, BiDisentMST, which leverages Gaussian Processes and Graph Factorization on the latent space to achieve this. The experimental results demonstrate the effectiveness of our proposed framework compared with baselines with respect to disentanglement and imputation results.

Open-Set Semi-Supervised Text Classification with Latent Outlier Softening

Semi-supervised text classification (STC) reduces human annotation effort and has been extensively researched. However, existing research unrealistically assumes that unlabeled data contains only in-distribution texts. This paper extends STC to a more practical Open-set Semi-supervised Text Classification (OSTC) setting, which assumes that the unlabeled data contains out-of-distribution (OOD) texts. The main challenge in OSTC is the false positive inference problem caused by inadvertently including OOD texts during training. To address the problem, we first develop baseline models using outlier detectors for hard OOD-data filtering in a pipeline procedure. Furthermore, we propose a Latent Outlier Softening (LOS) framework that integrates semi-supervised training and outlier detection within probabilistic latent variable modeling. LOS softens the OOD impacts via the Expectation-Maximization (EM) algorithm and weighted entropy maximization. Experiments on three constructed datasets show that LOS significantly outperforms the baselines.

Improving Expressivity of GNNs with Subgraph-specific Factor Embedded Normalization

Graph Neural Networks (GNNs) have emerged as a powerful category of learning architecture for handling graph-structured data. However, existing GNNs typically ignore crucial structural characteristics in node-induced subgraphs, which limits their expressiveness for various downstream tasks. In this paper, we strive to strengthen the representational capabilities of GNNs by devising a dedicated plug-and-play normalization scheme, termed SUbgraph-sPEcific FactoR Embedded Normalization (SuperNorm), that explicitly considers the intra-connection information within each node-induced subgraph. To this end, we embed the subgraph-specific factor at the beginning and the end of the standard BatchNorm, and incorporate graph instance-specific statistics to improve distinguishing capability. We also provide theoretical analysis showing that, equipped with the elaborated SuperNorm, an arbitrary GNN is at least as powerful as the 1-WL test in distinguishing non-isomorphic graphs. Furthermore, the proposed SuperNorm scheme is demonstrated to alleviate the over-smoothing phenomenon. Experimental results on predictions of graph, node, and link properties on eight popular datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/chenchkx/SuperNorm.

Approximation Algorithms for Size-Constrained Non-Monotone Submodular Maximization in Deterministic Linear Time

In this work, we study the problem of finding the maximum value of a non-negative submodular function subject to a limit on the number of items selected, a ubiquitous problem that appears in many applications, such as data summarization and nonlinear regression. We provide the first deterministic, linear-time approximation algorithms for this problem that do not assume the objective is monotone. We present three deterministic, linear-time algorithms: a single-pass streaming algorithm with a ratio of 23.313 + ε, which is the first linear-time streaming algorithm for this problem; a simpler deterministic linear-time algorithm with a ratio of 11.657; and a (4 + O(ε))-approximation algorithm. Finally, we present a deterministic algorithm that obtains a ratio of e + ε in O_ε(n log n) time, close to the best known expected ratio of e - 0.121 achieved in polynomial time.
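As background for the guarantees above, the classic greedy baseline for cardinality-constrained submodular maximization is easy to sketch. Below is a minimal illustration on a coverage function (a monotone example; the paper's contribution is the harder non-monotone case, solved deterministically in linear time). All names and data here are our own illustrative choices.

```python
def coverage(sets, chosen):
    """f(S) = number of elements covered by the chosen sets (a submodular function)."""
    covered = set()
    for i in chosen:
        covered |= sets[i]
    return len(covered)

def greedy_max(sets, k):
    """Pick up to k sets, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for i in range(len(sets)):
            if i in chosen:
                continue
            gain = coverage(sets, chosen + [i]) - coverage(sets, chosen)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining set has positive marginal gain
            break
        chosen.append(best)
    return chosen

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
picked = greedy_max(sets, k=2)
print(picked, coverage(sets, picked))  # two sets suffice to cover all 6 elements
```

Note that this plain greedy loop costs O(nk) function evaluations and carries no guarantee for non-monotone objectives, which is precisely the gap the paper's algorithms address.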

Accelerating Personalized PageRank Vector Computation

Personalized PageRank Vectors (PPVs) are widely used as fundamental graph-learning tools for detecting anomalous spammers, learning graph embeddings, and training graph neural networks. The well-known local FwdPush algorithm [5] approximates PPVs and has a sublinear rate of O(1/(αε)). A recent study [51] found that when high precision is required, FwdPush behaves like the power iteration method, and its running time is pessimistically bounded by O((m/α) log(1/ε)). This paper looks closely at calculating PPVs for both directed and undirected graphs. By leveraging the linear invariant property, we show that FwdPush is a variant of Gauss-Seidel, and we propose a Successive Over-Relaxation based method, FwdPushSOR, that speeds it up by slightly modifying FwdPush. Additionally, we prove that FwdPush has a local linear convergence rate of O((vol(S)/α) log(1/ε)), which enjoys the advantages of two existing bounds. We also design a new local heuristic push method that reduces the number of operations by 10-50% compared to FwdPush. For undirected graphs, we propose two momentum-based acceleration methods that can be expressed as one-line updates and that speed up non-accelerated methods by a factor of O(1/√α). Our experiments on six real-world graph datasets confirm the efficiency of FwdPushSOR and of the acceleration methods for directed and undirected graphs, respectively.
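For readers unfamiliar with the local-push view of PPV computation, the following is a minimal sketch of the textbook FwdPush loop that SOR and momentum variants accelerate; the graph, thresholds, and function names are illustrative, not the paper's code.

```python
from collections import deque

def fwd_push(adj, s, alpha=0.15, eps=1e-6):
    """adj: {node: [neighbors]}; s: source. Returns (estimate p, residual r)."""
    p = {u: 0.0 for u in adj}
    r = {u: 0.0 for u in adj}
    r[s] = 1.0
    queue = deque([s])
    while queue:
        u = queue.popleft()
        deg = len(adj[u])
        if deg == 0 or r[u] <= eps * deg:   # dangling nodes are skipped in this sketch
            continue
        p[u] += alpha * r[u]                # keep an alpha fraction locally
        push = (1.0 - alpha) * r[u] / deg   # spread the rest to neighbors
        r[u] = 0.0
        for v in adj[u]:
            old = r[v]
            r[v] += push
            if old <= eps * len(adj[v]) < r[v]:  # v just crossed the push threshold
                queue.append(v)
    return p, r

# Tiny undirected triangle graph; node 0 is the personalization source.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
p, r = fwd_push(adj, s=0)
# Mass conservation: estimate plus residual always sums to 1.
print(round(sum(p.values()) + sum(r.values()), 9))
```

The loop touches only nodes whose residual exceeds a degree-scaled threshold, which is what makes the algorithm local and sublinear for coarse precision.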

Neural-Hidden-CRF: A Robust Weakly-Supervised Sequence Labeler

We propose a neuralized undirected graphical model, Neural-Hidden-CRF, to solve the weakly-supervised sequence labeling problem. Under the umbrella of undirected graphical theory, Neural-Hidden-CRF, which embeds a hidden CRF layer, models the variables of the word sequence, the latent ground-truth sequence, and the weak label sequence with the global perspective that undirected graphical models particularly enjoy. In Neural-Hidden-CRF, we can capitalize on the powerful language model BERT or other deep models to provide rich contextual semantic knowledge to the latent ground-truth sequence, and use the hidden CRF layer to capture the internal label dependencies. Neural-Hidden-CRF is conceptually simple and empirically powerful. It obtains new state-of-the-art results on one crowdsourcing benchmark and three weak-supervision benchmarks, outperforming the recent advanced model CHMM by 2.80 and 2.23 F1 points in average generalization and inference performance, respectively.

Shilling Black-box Review-based Recommender Systems through Fake Review Generation

Review-Based Recommender Systems (RBRSs) have attracted increasing research interest due to their ability to alleviate well-known cold-start problems. RBRSs utilize reviews to construct user and item representations. However, in this paper we argue that such reliance on reviews may instead expose systems to the risk of being shilled. To explore this possibility, we propose the first generation-based model for shilling attacks against RBRSs. Specifically, we learn a fake review generator through reinforcement learning, which maliciously promotes items by forcing prediction shifts after the generated reviews are added to the system. By introducing auxiliary rewards that increase text fluency and diversity with the aid of pre-trained language models and aspect predictors, the generated reviews can be effective for shilling with high fidelity. Experimental results demonstrate that the proposed framework can successfully attack three different kinds of RBRSs on the Amazon corpus, spanning three domains, and on the Yelp corpus. Furthermore, human studies also show that the generated reviews are fluent and informative. Finally, equipped with Attack Review Generators (ARGs), RBRSs with adversarial training are much more robust to malicious reviews.

Classification of Edge-dependent Labels of Nodes in Hypergraphs

A hypergraph is a data structure composed of nodes and hyperedges, where each hyperedge is a subset of nodes of any size. Due to the flexibility in hyperedge size, hypergraphs represent group interactions (e.g., co-authorship by more than two authors) more naturally and accurately than ordinary graphs. Interestingly, many real-world systems modeled as hypergraphs contain edge-dependent node labels, i.e., node labels that vary depending on the hyperedge. For example, in co-authorship datasets, the same author (i.e., a node) can be the primary author of one paper (i.e., a hyperedge) but the corresponding author of another paper (i.e., another hyperedge).

In this work, we introduce the classification of edge-dependent node labels as a new problem. It can serve as a benchmark task for hypergraph neural networks, which have recently attracted great attention, and the usefulness of edge-dependent node labels has been verified in various applications. To tackle this problem, we propose WHATsNet, a novel hypergraph neural network that represents the same node differently depending on the hyperedges it participates in, reflecting its varying importance in those hyperedges. To this end, WHATsNet models the relations between nodes within each hyperedge, using their relative centrality as positional encodings. In our experiments, we demonstrate that WHATsNet significantly and consistently outperforms ten competitors on six real-world hypergraphs, and we also show successful applications of WHATsNet to (a) ranking aggregation, (b) node clustering, and (c) product return prediction.

Representation Learning on Hyper-Relational and Numeric Knowledge Graphs with Transformers

In a hyper-relational knowledge graph, a triplet can be associated with a set of qualifiers, where a qualifier is composed of a relation and an entity, providing auxiliary information for the triplet. While existing hyper-relational knowledge graph embedding methods assume that the entities are discrete objects, some information should be represented using numeric values, e.g., (J.R.R., was born in, 1892). Also, a triplet (J.R.R., educated at, Oxford Univ.) can be associated with a qualifier such as (start time, 1911). In this paper, we propose a unified framework named HyNT that learns representations of a hyper-relational knowledge graph containing numeric literals in either triplets or qualifiers. We define a context transformer and a prediction transformer to learn the representations based not only on the correlations between a triplet and its qualifiers but also on the numeric information. By learning compact representations of triplets and qualifiers and feeding them into the transformers, we reduce the computation cost of using transformers. Using HyNT, we can predict missing numeric values in addition to missing entities or relations in a hyper-relational knowledge graph. Experimental results show that HyNT significantly outperforms state-of-the-art methods on real-world datasets.

Reducing Exposure to Harmful Content via Graph Rewiring

Most media content consumed today is provided by digital platforms that aggregate input from diverse sources, where access to information is mediated by recommendation algorithms. One principal challenge in this context is dealing with content that is considered harmful. To strike a balance between competing stakeholder interests, rather than blocking harmful content altogether, one approach is to minimize the exposure to such content that is induced specifically by algorithmic recommendations. Hence, modeling media items and recommendations as a directed graph, we study the problem of reducing the exposure to harmful content via edge rewiring. We formalize this problem using absorbing random walks, and prove that it is NP-hard and NP-hard to approximate within an additive error, while under realistic assumptions, the greedy method yields a (1-1/e)-approximation. Thus, we introduce Gamine, a fast greedy algorithm that can reduce the exposure to harmful content with or without quality constraints on recommendations. By performing just 100 rewirings on YouTube graphs with several hundred thousand edges, Gamine reduces the initial exposure by 50%, while ensuring that its recommendations are at most 5% less relevant than the original recommendations. Through extensive experiments on synthetic data and real-world data from video recommendation and news feed applications, we confirm the effectiveness, robustness, and efficiency of Gamine in practice.
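The absorbing-random-walk formalization can be made concrete with a toy computation (the notation below is ours, not the paper's exact objective): if Q is the walk's transition matrix restricted to transient states, the fundamental matrix N = (I - Q)^{-1} gives expected visit counts before absorption, so the expected exposure to a set of harmful states h is N·h. Rewiring an edge amounts to changing a row of Q and recomputing this quantity.

```python
import numpy as np

# Transition matrix Q over 3 transient items: from each item the walk follows
# a recommendation with prob. 0.8 (rows sum to 0.8) and is absorbed otherwise.
Q = np.array([
    [0.0, 0.4, 0.4],
    [0.4, 0.0, 0.4],
    [0.4, 0.4, 0.0],
])
h = np.array([0.0, 0.0, 1.0])  # item 2 is flagged harmful

# Fundamental matrix: N[i, j] = expected visits to j before absorption from i.
N = np.linalg.inv(np.eye(3) - Q)
exposure = N @ h               # expected visits to harmful items, per start node
print(np.round(exposure, 3))
```

Starting at the harmful item itself naturally yields the highest expected exposure; a rewiring algorithm would pick edge changes that drive this vector down.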

MGNN: Graph Neural Networks Inspired by Distance Geometry Problem

Graph Neural Networks (GNNs) have emerged as a prominent research topic in the field of machine learning. Existing GNN models are commonly categorized into two types: spectral GNNs, which are designed based on polynomial graph filters, and spatial GNNs, which use a message-passing scheme as the foundation of the model. To improve the expressive power and universality of spectral GNNs, a natural approach is to design better basis functions with stronger approximation ability. As for spatial GNNs, models like Graph Isomorphism Networks (GIN) analyze their expressive power based on graph isomorphism tests. Recently, there have been attempts to establish connections between spatial GNNs and geometric concepts like curvature and cellular sheaves, as well as physical phenomena like oscillators. However, despite this recent progress, there is still a lack of comprehensive analysis of the universality of spatial GNNs from the perspectives of geometry and physics.

In this paper, we propose MetricGNN (MGNN), a spatial GNN model inspired by the congruent-insensitivity property of classifiers in the classification phase of GNNs. We demonstrate that a GNN model is universal in the spatial domain if it can generate embedding matrices that are congruent to any given embedding matrix. This property is closely related to the Distance Geometry Problem (DGP). Since DGP is an NP-hard combinatorial optimization problem, we propose optimizing an energy function derived from spring networks and the Multi-Dimensional Scaling (MDS) problem. This approach also allows our model to handle both homophilic and heterophilic graphs. Finally, we propose employing an iterative method to optimize our energy function. We extensively evaluate the effectiveness of our model through experiments conducted on both synthetic and real-world datasets. Our code is available at: https://github.com/GuanyuCui/MGNN.
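To make the spring-network/MDS connection tangible, here is a small self-contained sketch (our construction, not MGNN itself) that minimizes the classical MDS stress E(X) = Σ_{i<j} (‖x_i − x_j‖ − d_ij)² by plain gradient descent, recovering coordinates whose pairwise distances match a given target:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target pairwise distances: three points forming a unit equilateral triangle.
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])

X = rng.normal(size=(3, 2))  # random initial 2-D embedding
for _ in range(2000):
    diff = X[:, None, :] - X[None, :, :]   # pairwise difference vectors
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)            # avoid division by zero on the diagonal
    coef = (dist - D) / dist               # per-pair "spring" stretch
    np.fill_diagonal(coef, 0.0)
    grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
    X -= 0.05 * grad                       # plain gradient descent step

final = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(np.allclose(final[~np.eye(3, dtype=bool)], 1.0, atol=1e-2))
```

Any rotation, reflection, or translation of the recovered points is an equally valid solution, which is exactly the congruence-insensitivity the paper exploits.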

Below the Surface: Summarizing Event Sequences with Generalized Sequential Patterns

We study the problem of succinctly summarizing a database of event sequences in terms of generalized sequential patterns. That is, we are interested in patterns that are not exclusively defined over observed surface-level events, as is usual, but may additionally include generalized events that can match a set of events. To avoid spurious and redundant results, we define the problem in terms of the Minimum Description Length principle, by which we seek the set of patterns and generalizations that together best compress the data without loss. The resulting optimization problem does not lend itself to exact search, which is why we propose the heuristic Flock algorithm to efficiently find high-quality models in practice. Extensive experiments on synthetic and real-world data show that Flock results in compact and easily interpretable models that accurately recover the ground truth, including rare instances of generalized patterns. Additionally, Flock recovers how generalized events within patterns depend on each other, and overall provides clearer insight into the data-generating process than state-of-the-art algorithms that only consider surface-level patterns.

Deep Encoders with Auxiliary Parameters for Extreme Classification

The task of annotating a data point with the labels most relevant to it from a large universe of labels is referred to as Extreme Classification (XC). State-of-the-art XC methods have applications in ranking, recommendation, and tagging, and mostly employ a combination architecture comprising a deep encoder and a high-capacity classifier. These two components are often trained in a modular fashion to conserve compute. This paper shows that in XC settings, where data paucity and semantic-gap issues abound, this can lead to suboptimal encoder training, which negatively affects the performance of the overall architecture. The paper then proposes a lightweight alternative, DEXA, that augments encoder training with auxiliary parameters. Incorporating DEXA into existing XC architectures requires minimal modification, and the method can scale to datasets with 40 million labels, offering predictions that are up to 6% and 15% more accurate than those of existing deep XC methods on benchmark and proprietary datasets, respectively. The paper also analyzes DEXA theoretically and shows that it offers provably superior encoder training to existing Siamese training strategies in certain realizable settings. Code for DEXA is available at https://github.com/Extreme-classification/dexa.

A Unified Framework of Graph Information Bottleneck for Robustness and Membership Privacy

Graph Neural Networks (GNNs) have achieved great success in modeling graph-structured data. However, recent works show that GNNs are vulnerable to adversarial attacks that can fool the model into making the predictions desired by the attacker. In addition, the training data of GNNs can be leaked under membership inference attacks. This largely hinders the adoption of GNNs in high-stakes domains such as e-commerce, finance, and bioinformatics. Though investigations have been made into robust prediction and membership-privacy protection, existing methods generally fail to consider robustness and membership privacy simultaneously. Therefore, in this work, we study the novel problem of developing robust and membership-privacy-preserving GNNs. Our analysis shows that the Information Bottleneck (IB) can help filter out noisy information and regularize the predictions on labeled samples, which benefits both robustness and membership privacy. However, structural noise and the lack of labels in node classification challenge the deployment of IB on graph-structured data. To mitigate these issues, we propose a novel graph information bottleneck framework that alleviates structural noise with a neighbor bottleneck. Pseudo labels are also incorporated in the optimization to minimize the gap between the predictions on the labeled set and the unlabeled set for membership privacy. Extensive experiments on real-world datasets demonstrate that our method gives robust predictions while simultaneously preserving membership privacy.

Contrastive Learning for User Sequence Representation in Personalized Product Search

Providing personalization in product search has attracted increasing attention in both industry and the research community. Most existing personalized product search methods model users' individual search interests based on their historical search logs to generate personalized search results. However, search logs may be sparse or noisy in real-world scenarios, making it difficult for existing methods to learn accurate and robust user representations. To address this issue, we propose CoPPS, a contrastive learning framework that aims to learn high-quality user representations for personalized product search. Specifically, we design three data augmentation and contrastive learning strategies to construct self-supervision signals from the original search behaviours. The contrastive learning tasks utilize an external knowledge graph and exploit the correlations within and between user sequences, thereby facilitating the discovery of more meaningful search patterns and ultimately enhancing the quality of personalized search. Experimental results on the public Amazon datasets verify the effectiveness of our approach.

Generalized Matrix Local Low Rank Representation by Random Projection and Submatrix Propagation

Matrix low-rank approximation is an effective method to reduce or eliminate the statistical redundancy of a matrix's components. Compared with traditional global low-rank methods such as singular value decomposition (SVD), local low-rank approximation methods are more advantageous for uncovering interpretable data structures when a clear duality exists between the rows and columns of the matrix. Local low-rank approximation is equivalent to low-rank submatrix detection. Unfortunately, existing local low-rank approximation methods can detect only submatrices of a specific mean structure, which may miss a substantial amount of true and interesting patterns. In this work, we develop a novel matrix computational framework called RPSP (Random Probing based Submatrix Propagation) that provides an effective solution to the general matrix local low-rank representation problem. RPSP detects local low-rank patterns that grow from small submatrices of low-rank property, which are determined by a random projection approach. RPSP is supported by the theory of random projection. Experiments on synthetic data demonstrate that RPSP outperforms all state-of-the-art methods, with the capacity to robustly and correctly identify low-rank matrices when the pattern has a mean similar to the background, the background noise is heteroscedastic, or multiple patterns are present in the data. On real-world datasets, RPSP also demonstrates its effectiveness in identifying interpretable local low-rank matrices.
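The intuition behind random probing can be illustrated with a toy experiment (our construction, not the RPSP algorithm): a locally low-rank block concentrates its energy in few singular directions, and a small random projection of the block preserves that concentration, so cheap probes can flag candidate blocks inside a noisy matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# A 100 x 100 noise matrix with a rank-1 block planted in rows/columns 0..19.
M = rng.normal(size=(100, 100))
u, v = rng.normal(size=(20, 1)), rng.normal(size=(1, 20))
M[:20, :20] += 3.0 * u @ v

def rank1_energy(block, n_probes=5):
    """Fraction of the block's energy captured by its top singular direction,
    computed on a small random projection (the 'probe') of the block."""
    P = rng.normal(size=(block.shape[1], n_probes))
    s = np.linalg.svd(block @ P, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

low_rank_block = M[:20, :20]
noise_block = M[50:70, 50:70]
# The planted block concentrates its energy in one direction; pure noise does not.
print(rank1_energy(low_rank_block) > rank1_energy(noise_block))
```

The probe costs only an SVD of a 20×5 sketch rather than of the full block, which is what makes random probing attractive as a detection primitive.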

TWIN: Personalized Clinical Trial Digital Twin Generation

Clinical trial digital twins are virtual patients that reflect personal characteristics at a high degree of granularity and can be used to simulate various patient outcomes under different conditions. With the growth of clinical trial databases captured by Electronic Data Capture (EDC) systems, there is growing interest in using machine learning models to generate digital twins. This can benefit the drug development process by reducing the sample size required for participant recruitment, improving patient-outcome predictive modeling, and mitigating privacy risks when sharing synthetic clinical trial data. However, prior research has mainly focused on generating Electronic Healthcare Records (EHRs), which often assumes large training data and does not account for personalized synthetic patient-record generation. In this paper, we propose TWIN, a sample-efficient method for generating personalized clinical trial digital twins. TWIN can produce digital twins of patient-level clinical trial records with high fidelity to the target participant's record, preserving the temporal relations across visits and events. We compare our method with various baselines for generating real-world patient-level clinical trial data. The results show that TWIN generates synthetic trial data with high fidelity, facilitating patient-outcome prediction in low-data scenarios while providing strong privacy protection for the real trial participants.

Accelerating Dynamic Network Embedding with Billions of Parameter Updates to Milliseconds

Network embedding, a graph representation learning method that illustrates network topology by mapping nodes into low-dimensional vectors, struggles to accommodate the ever-changing dynamic graphs encountered in practice. Existing research is mainly based on node-by-node embedding modifications, which face a dilemma between efficiency and accuracy. Observing that the embedding dimension is usually much smaller than the number of nodes, we break this dilemma with a novel dynamic network embedding paradigm that rotates and scales the axes of the embedding space instead of performing node-by-node updates. Specifically, we propose the Dynamic Adjacency Matrix Factorization (DAMF) algorithm, which achieves efficient and accurate dynamic network embedding by rotating and scaling the coordinate system in which the embedding resides, changing no more node embeddings than the number of edge modifications. Moreover, dynamic Personalized PageRank is applied to the obtained embeddings to enhance node embeddings and capture higher-order neighbor information dynamically. Experiments on node classification, link prediction, and graph reconstruction on dynamic graphs of different sizes suggest that DAMF advances dynamic network embedding. Further, we are the first to extend dynamic network embedding experiments to billion-edge graphs, where DAMF updates billion-level parameters in less than 10 ms.
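The axes-rotation paradigm can be sketched in a few lines (an illustration of the idea, not DAMF itself): keep raw per-node coordinates fixed and maintain a small d×d basis transform, so a "rotate and scale the axes" update costs O(d²) regardless of the number of nodes, while the effective embedding of every node changes at once.

```python
import numpy as np

n, d = 100_000, 8
rng = np.random.default_rng(0)
Z = rng.normal(size=(n, d))   # raw per-node coordinates, never touched again
B = np.eye(d)                 # current axes of the embedding space

def rotate_scale(B, theta, scale, i=0, j=1):
    """Rotate axes i and j by theta, then scale axis i: O(d^2), independent of n."""
    d = B.shape[0]
    R = np.eye(d)
    R[i, i], R[i, j] = np.cos(theta), -np.sin(theta)
    R[j, i], R[j, j] = np.sin(theta), np.cos(theta)
    S = np.eye(d)
    S[i, i] = scale
    return B @ R @ S

B = rotate_scale(B, theta=0.1, scale=1.5)
emb = Z @ B  # effective embeddings of all 100k nodes changed via one 8x8 update
print(emb.shape)
```

Materializing Z @ B is only needed when embeddings are consumed downstream; the update itself never touches the n×d array, which is the source of the speedup.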

MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification

Prompting methods have shown impressive performance in a variety of text mining tasks and applications, especially few-shot ones. Despite the promising prospects, the performance of a prompting model largely depends on the design of the prompt template and the verbalizer. In this work, we propose MetricPrompt, which eases the difficulty of verbalizer design by reformulating the few-shot text classification task as a text-pair relevance estimation task. MetricPrompt adopts a prompting model as the relevance metric, further bridging the gap between the pre-trained language model's (PLM) pre-training objective and the text classification task, and enabling the PLM's smooth adaptation. Taking a training sample and a query sample simultaneously, MetricPrompt captures cross-sample relevance information for accurate relevance estimation. We conduct experiments on three widely used text classification datasets across four few-shot settings. Results show that MetricPrompt outperforms manual verbalizers and other automatic verbalizer design methods across all few-shot settings, achieving new state-of-the-art (SOTA) performance.

Investigating Trojan Attacks on Pre-trained Language Model-powered Database Middleware

The recent success of pre-trained language models (PLMs) such as BERT has resulted in the development of various beneficial database middlewares, including natural language query interfaces and entity matching, a shift greatly facilitated by the extensive external knowledge of PLMs. However, as PLMs are often provided by untrusted third parties, their lack of standardization and regulation poses significant security risks that have yet to be fully explored. This paper investigates the security threats posed by malicious PLMs to these emerging database middlewares. We specifically propose a novel type of Trojan attack, in which a maliciously designed PLM causes unexpected behavior in the database middleware. These Trojan attacks possess the following characteristics: (1) Triggerability: the Trojan-infected database middleware functions normally on normal input, but will likely malfunction when triggered by the attacker. (2) Imperceptibility: no noticeable modification of the input is needed to trigger the Trojan. (3) Generalizability: the Trojan is capable of targeting a variety of downstream tasks, not just one specific task. We thoroughly evaluate the impact of these Trojan attacks through experiments and analyze potential countermeasures and their limitations. Our findings could aid in the creation of stronger mechanisms for the deployment of PLMs in database middleware.

Localised Adaptive Spatial-Temporal Graph Neural Network

Spatial-temporal graph models are prevailing for abstracting and modelling spatial and temporal dependencies. In this work, we ask the following question: whether and to what extent can we localise spatial-temporal graph models? We limit our scope to adaptive spatial-temporal graph neural networks (ASTGNNs), the state-of-the-art model architecture. Our approach to localisation involves sparsifying the spatial graph adjacency matrices. To this end, we propose Adaptive Graph Sparsification (AGS), a graph sparsification algorithm that successfully enables the localisation of ASTGNNs to an extreme extent (full localisation). We apply AGS to two distinct ASTGNN architectures and nine spatial-temporal datasets. Intriguingly, we observe that the spatial graphs in ASTGNNs can be sparsified by over 99.5% without any decline in test accuracy. Furthermore, even when ASTGNNs are fully localised, becoming graph-less and purely temporal, we record no drop in accuracy on the majority of the tested datasets, with only minor accuracy deterioration observed on the remaining ones. However, when the partially or fully localised ASTGNNs are reinitialised and retrained on the same data, there is a considerable and consistent drop in accuracy. Based on these observations, we conclude that (i) in the tested data, the information provided by the spatial dependencies is primarily included in the information provided by the temporal dependencies and can thus be essentially ignored for inference; and (ii) although the spatial dependencies provide redundant information, they are vital for the effective training of ASTGNNs and thus cannot be ignored during training. Furthermore, the localisation of ASTGNNs holds the potential to reduce the heavy computation overhead required on large-scale spatial-temporal data and further enable the distributed deployment of ASTGNNs.

TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting

Transformers have gained popularity in time series forecasting for their ability to capture long-sequence interactions. However, their memory and compute-intensive requirements pose a critical bottleneck for long-term forecasting, despite numerous advancements in compute-aware self-attention modules. To address this, we propose TSMixer, a lightweight neural architecture exclusively composed of multi-layer perceptron (MLP) modules. TSMixer is designed for multivariate forecasting and representation learning on patched time series, providing an efficient alternative to Transformers. Our model draws inspiration from the success of MLP-Mixer models in computer vision. We demonstrate the challenges involved in adapting Vision MLP-Mixer for time series and introduce empirically validated components to enhance accuracy. This includes a novel design paradigm of attaching online reconciliation heads to the MLP-Mixer backbone, for explicitly modeling the time-series properties such as hierarchy and channel-correlations. We also propose a Hybrid channel modeling approach to effectively handle noisy channel interactions and generalization across diverse datasets, a common challenge in existing patch channel-mixing methods. Additionally, a simple gated attention mechanism is introduced in the backbone to prioritize important features. By incorporating these lightweight components, we significantly enhance the learning capability of simple MLP structures, outperforming complex Transformer models with minimal computing usage. Moreover, TSMixer's modular design enables compatibility with both supervised and masked self-supervised learning methods, making it a promising building block for time-series Foundation Models. TSMixer outperforms state-of-the-art MLP and Transformer models in forecasting by a considerable margin of 8-60%. It also outperforms the latest strong benchmarks of Patch-Transformer models (by 1-2%) with a significant reduction in memory and runtime (2-3X).
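For readers unfamiliar with the MLP-Mixer backbone that TSMixer adapts, the following toy numpy sketch shows the core alternation of time-mixing and channel-mixing MLPs (all sizes and names are our illustrative choices; TSMixer adds reconciliation heads, gated attention, and hybrid channel modeling on top of this skeleton):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H = 8, 4, 16           # time steps, channels, hidden width

x = rng.normal(size=(T, C))  # one multivariate time-series patch

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2   # Linear -> ReLU -> Linear

# Time-mixing MLP: applied along the time axis, shared across channels.
w1_t, w2_t = 0.1 * rng.normal(size=(T, H)), 0.1 * rng.normal(size=(H, T))
x = x + mlp(x.T, w1_t, w2_t).T            # residual connection

# Channel-mixing MLP: applied along the channel axis, shared across time steps.
w1_c, w2_c = 0.1 * rng.normal(size=(C, H)), 0.1 * rng.normal(size=(H, C))
x = x + mlp(x, w1_c, w2_c)                # residual connection

print(x.shape)  # a mixer layer preserves the (T, C) shape
```

Because each sub-block is just two dense layers, the cost per layer is linear in the sequence length, in contrast to the quadratic cost of self-attention.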

Dependence and Model Selection in LLP: The Problem of Variants

The problem of Learning from Label Proportions (LLP) has received considerable research attention and has numerous practical applications. In LLP, a hypothesis assigning labels to items is learned using knowledge of only the proportion of labels found in predefined groups, called bags. While a number of algorithmic approaches to learning in this context have been proposed, very little work has addressed the model selection problem for LLP. Moreover, it is not obvious how to extend straightforward model selection approaches to LLP, in part because of the lack of item labels. More fundamentally, we argue that a careful approach to model selection for LLP requires consideration of the dependence structure that exists between bags, items, and labels. In this paper we formalize this structure and show how it affects model selection. We show how this leads to improved methods of model selection, which we demonstrate outperform the state of the art over a wide range of datasets and LLP algorithms.
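The LLP setting itself is easy to demonstrate. The sketch below (our toy construction, unrelated to the paper's model-selection method) fits a 1-D logistic model using only bag-level label proportions, never seeing an item label:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two bags of 1-D items; the hidden true rule is label = 1 iff x > 0,
# but the learner only ever observes each bag's positive-label proportion.
bags = [rng.normal(loc=-1.0, size=50), rng.normal(loc=1.0, size=50)]
props = [float(np.mean(b > 0)) for b in bags]

w, b = 0.0, 0.0
for _ in range(2000):
    gw = gb = 0.0
    for x, p in zip(bags, props):
        pred = 1.0 / (1.0 + np.exp(-(w * x + b)))  # item-level predictions
        err = pred.mean() - p                      # bag-level residual
        g = pred * (1.0 - pred)                    # d sigmoid / d logit
        gw += 2.0 * err * np.mean(g * x)
        gb += 2.0 * err * np.mean(g)
    w -= 1.0 * gw
    b -= 1.0 * gb

# The bag-level objective still recovers a sensible item-level rule.
pred_pos = 1.0 / (1.0 + np.exp(-(w * 2.0 + b)))   # clearly positive item
pred_neg = 1.0 / (1.0 + np.exp(-(w * -2.0 + b)))  # clearly negative item
print(pred_pos > 0.5 > pred_neg)
```

Note that evaluating such a model without item labels is exactly the difficulty the paper highlights: only bag-level residuals are observable, and they depend on how bags, items, and labels were jointly generated.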

Multiplex Heterogeneous Graph Neural Network with Behavior Pattern Modeling

Heterogeneous graph neural networks have gained great popularity for tackling various network analysis tasks on heterogeneous network data. However, most existing works focus on general heterogeneous networks and assume that there is only one type of edge between two nodes, ignoring both the multiplex characteristics between multi-typed nodes in multiplex heterogeneous networks and the differing importance of multiplex structures among nodes for node embedding. In addition, the over-smoothing issue of graph neural networks limits existing models to capturing only local structural signals while hardly learning global, relevant information about the network. To tackle these challenges, this work proposes Behavior Pattern based Heterogeneous Graph Neural Network (BPHGNN), a model for multiplex heterogeneous network embedding. Specifically, BPHGNN collaboratively learns node representations across the different multiplex structures among nodes, with adaptive importance learning, from local and global perspectives in multiplex heterogeneous networks through depth behavior pattern aggregation and breadth behavior pattern aggregation. Extensive experiments on six real-world networks with various network analytical tasks demonstrate the significant superiority of BPHGNN over state-of-the-art approaches in terms of various evaluation metrics.

Delving into Global Dialogue Structures: Structure Planning Augmented Response Selection for Multi-turn Conversations

Retrieval-based dialogue systems are a crucial component of natural language processing, employing information retrieval techniques to select responses from a predefined pool of candidates. The advent of pre-trained language models (PLMs) has significantly advanced the field, with a prevailing paradigm that involves post-training PLMs on specific dialogue corpora, followed by fine-tuning for the response selection (RS) task. This post-training process aims to capture dialogue-specific features, as most PLMs are originally trained on plain text. However, prior approaches predominantly rely on self-supervised tasks or session-level graph neural networks during post-training, focusing on capturing underlying patterns of coherent dialogues without explicitly refining the global pattern across the entire dialogue corpus. Consequently, the learned knowledge for organizing coherent dialogues remains isolated, heavily reliant on specific contexts. Additionally, interpreting or visualizing the implicit knowledge acquired through self-supervised tasks proves challenging. In this study, we address these limitations by explicitly refining the knowledge required for response selection and structuring it into a coherent global flow, known as "dialogue structure." This structure captures the inter-dependency of utterances and topic shifts, thereby enhancing the response selection task. To achieve this, we propose a novel structure model comprising a state recognizer and a structure planner. This model effectively captures the flow within the utterance history and plans the trajectory of future utterances. Importantly, the structure model operates orthogonally to the retrieval model, enabling seamless integration with existing retrieval models and facilitating collaborative training. Extensive experiments conducted on three benchmark datasets demonstrate the superior performance of our method over a wide range of competitive baselines, establishing a new state-of-the-art in the field.

Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design

Antibodies are proteins that effectively protect the human body by binding to pathogens. Recently, deep learning-based computational antibody design has attracted considerable attention, since it automatically mines antibody patterns from data that can complement human experience. However, computational methods heavily rely on high-quality antibody structure data, which is quite limited. Besides, the complementarity-determining region (CDR), the key component of an antibody that determines specificity and binding affinity, is highly variable and hard to predict. The limited availability of high-quality antibody structure data therefore exacerbates the difficulty of CDR generation. Fortunately, there is a large amount of antibody sequence data that can help model the CDR and reduce reliance on structure data. Witnessing the success of pre-trained models for protein modeling, in this paper we develop an antibody pre-trained language model and incorporate it into the antigen-specific antibody design model in a systematic way. Specifically, we first pre-train a novel antibody language model on sequence data, then propose a one-shot approach to sequence and structure generation of the CDR to mitigate the high cost and error propagation associated with autoregressive methods, and finally leverage the pre-trained antibody model in the antigen-specific antibody generation model with some carefully designed modules. Our experiments demonstrate the superiority of our method over previous baselines in tasks such as sequence and structure generation, CDR-H3 design for antigen binding, and antibody optimization. The code is available at https://github.com/KyGao/ABGNN.

Pyramid Graph Neural Network: A Graph Sampling and Filtering Approach for Multi-scale Disentangled Representations

Spectral methods for graph neural networks (GNNs) have achieved great success. Despite this success, many works have shown that existing approaches mainly focus on low-frequency information, which may not be pertinent to the task at hand. Recent efforts have been made to design new graph filters for wider frequency profiles, but it remains an open problem how to learn multi-scale disentangled node embeddings in the graph Fourier domain. In this paper, we propose a graph (signal) sampling and filtering framework, entitled Pyramid Graph Neural Network (PyGNN), which follows a Downsampling-Filtering-Upsampling-Decoding scheme. Specifically, we develop an ω-bandlimited downsampling approach to split the input graph into subgraphs to reduce high-frequency components, then apply spectral graph filters on the subgraphs to obtain node embeddings in different frequency bands, and propose a Laplacian smoothing-based upsampling approach to extrapolate the node embeddings on the subgraphs to the full vertex set of the original graph. Finally, we add frequency-aware gated units to decode node embeddings of different frequencies for downstream tasks. Results on both homophilic and heterophilic graph datasets show its superiority over state-of-the-art methods.

GAL-VNE: Solving the VNE Problem with Global Reinforcement Learning and Local One-Shot Neural Prediction

The NP-hard combinatorial Virtual Network Embedding (VNE) problem refers to finding the node and edge mapping between a virtual network (request) and the physical network (resource). Learning-based methods have recently been devised to go beyond traditional heuristic solvers. However, limited efficiency and scalability hinder their applicability, as reinforcement learning (RL) is often adopted in an auto-regressive, node-by-node mapping manner to handle complex mapping constraints for each incoming request. Moreover, existing learning-based works often consider each online request independently, limiting long-term online service performance. In this paper, we present a synergistic Global-And-Local learning approach for the VNE problem (GAL-VNE). At the global level, across requests, RL is employed to capture cross-request relations for better global resource accommodation and improved overall performance. At the local level, within each request, we aim to replace the sequential decision-making procedure, whose cost depends heavily on the network size, with a more efficient one-shot solution-generation scheme. The main challenge for such a one-shot model is how to encode the constraints under an end-to-end learning and inference paradigm. Accordingly, within the "rank-then-search" paradigm, we propose to first pretrain a graph neural network (GNN)-based node ranker with imitation supervision from an off-the-shelf solver (moderately expensive yet high-quality), regularized by a neighboring-smoothness prior. RL is then used to finetune the GNN ranker, whose supervision directly refers to the final (non-differentiable) business objectives concerning revenue, cost, etc. Experiments on benchmarks show that our method outperforms classic and learning-based methods in both efficacy and efficiency.

Sparse Binary Transformers for Multivariate Time Series Modeling

Compressed Neural Networks have the potential to enable deep learning across new applications and smaller computational environments. However, understanding the range of learning tasks in which such models can succeed is not well studied. In this work, we apply sparse and binary-weighted Transformers to multivariate time series problems, showing that the lightweight models achieve accuracy comparable to that of dense floating-point Transformers of the same structure. Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting. Additionally, to reduce the computational complexity of the attention mechanism, we apply two modifications, which show little to no decline in model performance: 1) in the classification task, we apply a fixed mask to the query, key, and value activations, and 2) for forecasting and anomaly detection, which rely on predicting outputs at a single point in time, we propose an attention mask to allow computation only at the current time step. Together, each compression technique and attention modification substantially reduces the number of non-zero operations necessary in the Transformer. We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating point operation (FLOPs) count, showing up to a 53× reduction in storage size and up to 10.5× reduction in FLOPs.
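The current-step attention mask described for forecasting and anomaly detection can be sketched as follows. This is an illustrative NumPy reconstruction (function and variable names are ours, not the paper's): attention is computed only for the final time step, shrinking the score matrix from T×T to 1×T.

```python
import numpy as np

def last_step_attention(Q, K, V):
    """Attention restricted to the final (current) time step's query.

    For forecasting/anomaly detection, only the output at the current
    step is needed, so the T x T score matrix shrinks to 1 x T.
    Illustrative only; not the paper's implementation.
    """
    q_t = Q[-1:]                                  # (1, d): current-step query
    scores = q_t @ K.T / np.sqrt(K.shape[1])      # (1, T)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the T keys
    return weights @ V                            # (1, d)

T, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, T, d))
out = last_step_attention(Q, K, V)
print(out.shape)  # (1, 4)
```

The saving is straightforward: the score computation drops from O(T²d) to O(Td) non-zero operations, in line with the reductions the abstract reports.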

3D-Polishing for Triangular Mesh Compression of Point Cloud Data

Triangular meshes are commonly used to reconstruct the surfaces of 3-dimensional (3D) objects based on the point cloud data. With an increasing demand for high-quality approximation, the sizes of point cloud data and the generated triangular meshes continue to increase, resulting in high computational cost in data processing, visualization, analysis, transmission and storage. Motivated by the process of sculpture polishing, we develop a novel progressive mesh compression approach, called the greedy 3D-polishing algorithm, to sequentially remove redundant points and triangles in a greedy manner while maintaining the approximation quality of the surface. Based on the polishing algorithm, we propose the approximate curvature radius to evaluate the scale of features polished at each iteration. By reformulating the compression rate selection as a change-point detection problem, a rank-based procedure is developed to select the optimal compression rate so that most of the global features of the 3D object surface can be preserved with statistical guarantee. Experiments on both moderate- and large-scale 3D shape datasets show that the proposed method can substantially reduce the size of the point cloud data and the corresponding triangular mesh, so that most of the surface information can be retained by a much smaller number of points and triangles.

ESSA: Explanation Iterative Supervision via Saliency-guided Data Augmentation

Explanation supervision is a technique in which the model is guided by human-generated explanations during training. It aims to improve both the interpretability and predictive performance of the model by incorporating human understanding into the training process. Since explanation supervision requires large-scale training data, data augmentation must be applied to increase the size and diversity of the original dataset. However, data augmentation on sophisticated data such as medical images is particularly challenging due to: 1) scarcity of data for training the learning-based data augmenter, 2) difficulty in generating realistic and sophisticated images, and 3) difficulty in ensuring that the augmented data indeed boosts the performance of explanation-guided learning. To address these challenges, we propose an Explanation Iterative Supervision via Saliency-guided Data Augmentation (ESSA) framework that conducts explanation supervision and adversarially trained image data augmentation via a synergized iterative loop, handling the translation from annotations to sophisticated images and the generation of synthetic image-annotation pairs with an alternating training strategy. Extensive experiments on two datasets from the medical imaging domain demonstrate the effectiveness of our proposed framework in improving both the predictability and explainability of the model.

CounterNet: End-to-End Training of Prediction Aware Counterfactual Explanations

This work presents CounterNet, a novel end-to-end learning framework which integrates Machine Learning (ML) model training and the generation of corresponding counterfactual (CF) explanations into a single pipeline. CF explanations offer a contrastive case, i.e., they attempt to find the smallest modification to the feature values of an instance that changes the ML model's prediction on that instance to a predefined output. Prior techniques for generating CF explanations suffer from two major limitations: (i) all of them are post-hoc methods designed for use with proprietary ML models; as a result, their procedure for generating CF explanations is uninformed by the training of the ML model, which leads to misalignment between model predictions and explanations; and (ii) most of them rely on solving separate, time-intensive optimization problems to find CF explanations for each input data point, which negatively impacts their runtime. CounterNet thus makes a novel departure from the prevalent post-hoc paradigm: unlike post-hoc methods, it optimizes CF-explanation generation only once, together with the predictive model. We adopt a block-wise coordinate descent procedure which helps in effectively training CounterNet's network. Our extensive experiments on multiple real-world datasets show that CounterNet generates high-quality predictions, consistently achieves 100% CF validity and low proximity scores (thereby achieving a well-balanced cost-invalidity trade-off) for any new input instance, and runs 3X faster than existing state-of-the-art baselines.

SketchPolymer: Estimate Per-item Tail Quantile Using One Sketch

Estimating the quantile of a distribution, especially the tail distribution, is an interesting topic in data stream models and has attracted extensive interest from many researchers. In this paper, we propose a novel sketch, namely SketchPolymer, to accurately estimate per-item tail quantiles. SketchPolymer uses a technique called Early Filtration to filter infrequent items, and another technique called VSS to reduce error. Our experimental results show that the accuracy of SketchPolymer is on average 32.67 times better than state-of-the-art techniques. We also implement SketchPolymer on P4 and FPGA platforms to verify its deployment flexibility. All our code is available on GitHub.

On Manipulating Signals of User-Item Graph: A Jacobi Polynomial-based Graph Collaborative Filtering

Collaborative filtering (CF) is an important research direction in recommender systems that aims to make recommendations given the information on user-item interactions. Graph CF has attracted more and more attention in recent years due to its effectiveness in leveraging high-order information in the user-item bipartite graph for better recommendations. Specifically, recent studies attribute the success of graph neural networks (GNNs) for CF to their low-pass filtering effects. However, current research lacks a study of how different signal components contribute to recommendations and how to design strategies to use them properly. To this end, from the view of spectral transformation, we analyze the important factors that a graph filter should consider to achieve better performance. Based on these findings, we design JGCF, an efficient and effective method for CF based on Jacobi polynomial bases and frequency decomposition strategies. Extensive experiments on four widely used public datasets show the effectiveness and efficiency of the proposed method, which brings at most a 27.06% performance gain on Alibaba-iFashion. Besides, the experimental results also show that JGCF is better at handling sparse datasets, which shows its potential for making recommendations to cold-start users.
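The Jacobi polynomial bases underlying JGCF can be generated with the standard three-term recurrence. The sketch below is a scalar illustration (the function name and the default a = b = 1 are our assumptions); in an actual graph filter, x would be replaced by a function of the normalized Laplacian and the basis combined with learnable coefficients.

```python
import numpy as np

def jacobi_basis(x, K, a=1.0, b=1.0):
    """Jacobi polynomials P_0..P_K at points x via the standard
    three-term recurrence. Scalar sketch only: in a graph filter,
    x would become a propagation operator. Names/defaults are ours."""
    P = [np.ones_like(x)]                        # P_0 = 1
    if K >= 1:
        P.append((a - b) / 2 + (a + b + 2) / 2 * x)   # P_1
    for n in range(2, K + 1):
        # Standard recurrence coefficients for P_n^{(a,b)}
        c0 = 2 * n * (n + a + b) * (2 * n + a + b - 2)
        c1 = (2 * n + a + b - 1) * (2 * n + a + b) * (2 * n + a + b - 2)
        c2 = (2 * n + a + b - 1) * (a ** 2 - b ** 2)
        c3 = 2 * (n + a - 1) * (n + b - 1) * (2 * n + a + b)
        P.append(((c1 * x + c2) * P[-1] - c3 * P[-2]) / c0)
    return P
```

With a = b = 1 the first few polynomials come out as 1, 2x, (15x² − 3)/4, and 7x³ − 3x, which can be checked against any orthogonal-polynomial reference.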

Clenshaw Graph Neural Networks

Graph Convolutional Networks (GCNs), which use a message-passing paradigm with stacked convolution layers, are foundational spatial methods for learning graph representations. Polynomial filters, which have an advantage on heterophilous graphs, are motivated differently, from the spectral perspective of graph convolutions. Recent spatial GCN models use various residual-connection techniques to alleviate model degradation problems such as over-smoothing and gradient vanishing. However, current residual connections do not effectively harness the full potential of polynomial filters, which are commonly employed in the spectral domain of GNNs.

In this paper, we introduce ClenshawGCN, a GNN model that injects the characteristic of spectral models into spatial models by a simple residual connection submodule: the Clenshaw residual connection, which is essentially a second-order negative residual combined with an initial residual. We show that a ClenshawGCN implicitly simulates an arbitrary polynomial filter under the Chebyshev basis, since the iteration process of stacked (spatial) convolutions equipped with Clenshaw residuals can be interpreted by Clenshaw Summation Algorithm. In addition, we conduct comprehensive experiments to demonstrate the superiority of our model over spatial and spectral GNN models. Our implementation is at https://github.com/yuziGuo/KDDClenshawGNN.
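The Clenshaw Summation Algorithm referenced above evaluates a Chebyshev series without forming the polynomials explicitly. A minimal scalar sketch follows; in ClenshawGCN the scalar multiplication by x conceptually becomes a graph propagation step and the coefficients become learnable, so this is illustrative only.

```python
def clenshaw_chebyshev(coeffs, x):
    """Evaluate f(x) = sum_k coeffs[k] * T_k(x) (Chebyshev basis)
    via Clenshaw's recurrence: b_k = c_k + 2*x*b_{k+1} - b_{k+2},
    then f = c_0 + x*b_1 - b_2. Note the second-order term -b_{k+2},
    mirroring the "second-order negative residual" in the text.
    Scalar sketch, not the paper's model."""
    b1 = b2 = 0.0
    for c in reversed(coeffs[1:]):   # k = K, ..., 1
        b1, b2 = c + 2 * x * b1 - b2, b1
    return coeffs[0] + x * b1 - b2

# 1 + 2*T_1 + 3*T_2 + 4*T_3 at x = 0.5 evaluates to -3.5
print(clenshaw_chebyshev([1, 2, 3, 4], 0.5))
```

The recurrence touches each coefficient once, so a depth-K stack of such residual layers implicitly realizes an order-K Chebyshev filter.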

CampER: An Effective Framework for Privacy-Aware Deep Entity Resolution

Entity Resolution (ER) is a fundamental problem in data preparation. Standard deep ER methods have achieved state-of-the-art effectiveness under the assumption that relations from different organizations are centrally stored. However, due to privacy concerns, it can be difficult to centralize data in practice, rendering standard deep ER solutions inapplicable. Although rule-based privacy-preserving ER methods have been developed, they often neglect subtle matching mechanisms and consequently have poor effectiveness. To bridge effectiveness and privacy, in this paper we propose CampER, an effective framework for privacy-aware deep entity resolution. Specifically, we first design a training-pair self-generation strategy to overcome the absence of manually labeled data in privacy-aware scenarios. Based on the self-constructed training pairs, we present a collaborative fine-tuning approach to learn match-aware and uni-space individual tuple embeddings for accurate matching decisions. During the matching decision-making process, we first introduce a cryptographically secure approach to determine matches. Furthermore, we propose an order-preserving perturbation strategy to significantly accelerate the matching computation while guaranteeing the consistency of ER results. Extensive experiments on eight widely used benchmark datasets demonstrate that CampER is not only comparable with the state-of-the-art standard deep ER solutions in effectiveness, but also preserves privacy.

A Data-centric Framework to Endow Graph Neural Networks with Out-Of-Distribution Detection Ability

Out-of-distribution (OOD) detection, which aims to identify OOD samples from in-distribution (ID) ones at test time, has become an essential problem in machine learning. However, existing works mostly address Euclidean data, and the problem on graph-structured data remains under-explored. Several recent works have begun to study graph OOD detection, but they all need to train a graph neural network (GNN) from scratch, at high computational cost. In this work, we make the first attempt to endow a well-trained GNN with OOD detection ability without modifying its parameters. To this end, we design a post-hoc framework with an Adaptive Amplifier for Graph OOD Detection, named AAGOD, concentrating on data-centric manipulation. The insight of AAGOD is to superimpose a parameterized amplifier matrix on the adjacency matrix of each original input graph. The amplifier can be seen as a prompt and is expected to emphasize the key patterns helpful for graph OOD detection, thereby enlarging the gap between OOD and ID graphs. Well-trained GNNs can then be reused to encode the amplified graphs into vector representations, and pre-defined scoring functions can further convert the representations into detection scores. Specifically, we design a Learnable Amplifier Generator (LAG) to customize amplifiers for different graphs, and propose a Regularized Learning Strategy (RLS) to train the parameters with no OOD data required. Experimental results show that AAGOD can be applied to various GNNs to enable OOD detection. Compared with the state-of-the-art baseline in graph OOD detection, AAGOD achieves on average a 6.21% relative improvement in AUC and 34 times faster training. Code and data are available at https://github.com/BUPT-GAMMA/AAGOD.

Frigate: Frugal Spatio-temporal Forecasting on Road Networks

Modelling spatio-temporal processes on road networks is a task of growing importance. While significant progress has been made on developing spatio-temporal graph neural networks (GNNs), existing works are built upon three assumptions that are not practical on real-world road networks. First, they assume sensing on every node of a road network. In reality, due to budget constraints or sensor failures, not all locations (nodes) may be equipped with sensors. Second, they assume that sensing history is available at all installed sensors. This is unrealistic as well, due to sensor failures, loss of packets during communication, etc. Finally, they assume static road networks. Connectivity within networks changes due to road closures, construction of new roads, etc. In this work, we develop Frigate to address all these shortcomings. Frigate is powered by a spatio-temporal GNN that integrates positional, topological, and temporal information into rich inductive node representations. The joint fusion of this diverse information is made feasible through a novel combination of gated Lipschitz embeddings with LSTMs. We prove that the proposed GNN architecture is provably more expressive than the message-passing GNNs used in state-of-the-art algorithms. The higher expressivity of Frigate naturally translates to superior empirical performance on real-world network-constrained traffic data. In addition, Frigate is robust to frugal sensor deployment, changes in road network connectivity, and temporal irregularity in sensing.

Detecting Interference in Online Controlled Experiments with Increasing Allocation

In the past decade, the technology industry has adopted online controlled experiments (a.k.a. A/B testing) to guide business decisions. In practice, A/B tests are often implemented with increasing treatment allocation: the new treatment is gradually released to an increasing number of units through a sequence of randomized experiments. In scenarios such as experimenting in a social network or in a bipartite online marketplace, interference among units may exist, which can harm the validity of simple inference procedures. In this work, we introduce a widely applicable procedure to test for interference in A/B testing with increasing allocation. Our procedure can be implemented on top of an existing A/B testing platform with a separate flow and does not require specifying an interference mechanism a priori. In particular, we introduce two permutation tests that are valid under different assumptions. First, we introduce a general statistical test for interference that requires no additional assumptions. Second, we introduce a testing procedure that is valid under a time fixed-effect assumption. The testing procedure has very low computational complexity, is powerful, and formalizes a heuristic algorithm already implemented in industry. We demonstrate the performance of the proposed testing procedures through simulations on synthetic data. Finally, we discuss an application at LinkedIn, where a screening step based on the proposed methods is implemented to detect potential interference in all marketplace experiments.
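The permutation-testing idea can be illustrated with a generic two-sample test on the difference in means. This is a simplified stand-in (all names are ours), not the paper's interference-specific procedure, which permutes within the increasing-allocation design:

```python
import numpy as np

def permutation_test(y_treat, y_ctrl, n_perm=2000, seed=0):
    """Two-sample permutation test on the difference in means.

    Returns a p-value for the null of no treatment effect: the
    observed statistic is compared against statistics from random
    relabelings of the pooled outcomes. Simplified illustration.
    """
    rng = np.random.default_rng(seed)
    obs = y_treat.mean() - y_ctrl.mean()
    pooled = np.concatenate([y_treat, y_ctrl])
    n, count = len(y_treat), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                          # random relabeling
        stat = pooled[:n].mean() - pooled[n:].mean()
        count += abs(stat) >= abs(obs)
    return (1 + count) / (1 + n_perm)                # add-one-smoothed p-value
```

A clear treatment effect yields a small p-value, while identical samples yield p = 1; the computational cost is linear in the number of permutations, consistent with the low complexity the abstract emphasizes.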

Mitigating Action Hysteresis in Traffic Signal Control with Traffic Predictive Reinforcement Learning

Traffic signal control plays a pivotal role in the management of urban traffic flow. With the rapid advancement of reinforcement learning, the development of signal control methods has seen a significant boost. However, a major challenge in implementing these methods is ensuring that signal lights do not change abruptly, as this can lead to traffic accidents. To mitigate this risk, a time delay is introduced in the implementation of control actions, but this delay usually has a negative impact on the overall efficacy of the control policy. To address this challenge, this paper presents a novel Traffic Signal Control Framework (PRLight), which leverages an On-policy Traffic Control Model (OTCM) and an Online Traffic Prediction Model (OTPM) to achieve efficient, real-time control of traffic signals. The framework collects multi-source traffic information from a local-view graph in real time and employs a novel fast attention mechanism to extract relevant traffic features. Specifically, OTCM utilizes the predicted traffic state as input, eliminating the need for communication with other agents and maximizing computational efficiency while ensuring that the most relevant information is used for signal control. The proposed framework was evaluated on both simulated and real-world road networks and compared to various state-of-the-art methods, demonstrating its effectiveness in preventing traffic congestion and accidents.

GAT-MF: Graph Attention Mean Field for Very Large Scale Multi-Agent Reinforcement Learning

Recent advances in reinforcement learning have led to remarkable achievements by intelligent agents, ranging from game playing to industrial applications. Of particular interest is multi-agent reinforcement learning (MARL), which holds significant potential for real-world scenarios. However, typical MARL methods are limited in their ability to handle tens of agents, leaving scenarios with hundreds or even thousands of agents almost unexplored. Scaling up the number of agents presents two primary challenges: (1) agent-agent interactions are crucial in multi-agent systems, yet the number of interactions grows quadratically with the number of agents, resulting in substantial computational complexity and difficulty in strategy learning; (2) the strengths of interactions among agents vary both across agents and over time, making it difficult to model such interactions precisely. In this paper, we propose a novel approach named Graph Attention Mean Field (GAT-MF). By converting agent-agent interactions into interactions between each agent and a weighted mean field, we achieve a substantial reduction in computational complexity. The proposed method offers precise modeling of interaction dynamics, with mathematical proofs of its correctness. Additionally, we design a graph attention mechanism to automatically capture the diverse and time-varying strengths of interactions, ensuring an accurate representation of agent interactions. Through extensive experimentation in both manual and real-world scenarios involving over 3000 agents, we validate the efficacy of our method. The results demonstrate that our method outperforms the best baseline with a remarkable improvement of 42.7%, while saving 86.4% training time and 19.2% GPU memory compared to that baseline. For reproducibility, our source code and data are available at https://github.com/tsinghua-fib-lab/Large-Scale-MARL-GATMF.
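The reduction from pairwise interactions to an attention-weighted mean field can be sketched as follows. This NumPy toy (all names, shapes, and the projection matrices are illustrative assumptions, not the authors' code) computes one mean-field summary per agent from its neighbors' states:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_mean_field(states, adj, Wq, Wk):
    """Replace O(N^2) pairwise interaction terms with one
    attention-weighted mean-field summary per agent.

    states: (N, d) agent states; adj: (N, N) 0/1 neighborhood mask;
    Wq, Wk: (d, d) query/key projections. Illustrative sketch only.
    """
    Q, Km = states @ Wq, states @ Wk
    scores = Q @ Km.T / np.sqrt(Km.shape[1])
    scores = np.where(adj > 0, scores, -1e9)     # attend to neighbors only
    w = softmax(scores, axis=1)                  # each row sums to 1
    return w @ states                            # (N, d): mean field per agent
```

Each agent then conditions its policy on its own state plus this single summary vector, so with a sparse neighborhood mask the interaction cost scales with the number of edges rather than N².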

CLUR: Uncertainty Estimation for Few-Shot Text Classification with Contrastive Learning

Few-shot text classification has extensive applications where sample collection is expensive or complicated. When the penalty for classification errors is high, such as in early threat-event detection with scarce data, we want to know "whether we should trust the classification results or reexamine them." This paper investigates Uncertainty Estimation for Few-shot Text Classification (UEFTC), an unexplored research area. Given limited samples, a UEFTC model predicts an uncertainty score for a classification result, which is the likelihood that the classification result is false. However, many traditional uncertainty estimation models in text classification are unsuitable for implementing a UEFTC model: these models require numerous training samples, whereas the few-shot setting in UEFTC provides only a few, or just one, support sample for each class in an episode. We propose Contrastive Learning from Uncertainty Relations (CLUR) to address UEFTC. CLUR can be trained with only one support sample per class with the help of pseudo uncertainty scores. Unlike previous works that manually set the pseudo uncertainty scores, CLUR self-adaptively learns them using our proposed uncertainty relations. Specifically, we explore four model structures in CLUR to investigate the performance of three commonly used contrastive learning components in UEFTC and find that two of the components are effective. Experimental results show that CLUR outperforms six baselines on four datasets, including an improvement of 4.52% AUPR on an RCV1 dataset in a 5-way 1-shot setting. Our code and data splits for UEFTC are available at https://github.com/he159ok/CLUR_UncertaintyEst_FewShot_TextCls.

Prescriptive PCA: Dimensionality Reduction for Two-stage Stochastic Optimization

In this paper, we consider the alignment between an upstream dimensionality reduction task of learning a low-dimensional representation of a set of high-dimensional data and a downstream optimization task of solving a stochastic program parameterized by said representation. In this case, standard dimensionality reduction methods (e.g., principal component analysis) may not perform well, as they aim to maximize the amount of information retained in the representation and do not generally reflect the importance of such information in the downstream optimization problem. To address this problem, we develop a prescriptive dimensionality reduction framework that aims to minimize the degree of suboptimality in the optimization phase. For the case where the downstream stochastic optimization problem has an expected value objective, we show that prescriptive dimensionality reduction can be performed via solving a distributionally-robust optimization problem, which admits a semidefinite programming relaxation. Computational experiments based on a warehouse transshipment problem and a vehicle repositioning problem show that our approach significantly outperforms principal component analysis with real and synthetic data sets.

Partial-label Learning with Mixed Closed-set and Open-set Out-of-candidate Examples

Partial-label learning (PLL) relies on a key assumption that the true label of each training example must be in the candidate label set. This restrictive assumption may be violated in complex real-world scenarios, and thus the true label of some collected examples could be unexpectedly outside the assigned candidate label set. In this paper, we term the examples whose true label is outside the candidate label set OOC (Out-Of-Candidate) examples, and pioneer a new PLL study to learn with OOC examples. We consider two types of OOC examples in reality, i.e., the closed-set/open-set OOC examples whose true label is inside/outside the known label space. To solve this new PLL problem, we first calculate the wooden cross-entropy loss from candidate and non-candidate labels respectively, and dynamically differentiate the two types of OOC examples based on specially designed criteria. Then, for closed-set OOC examples, we conduct reversed label disambiguation in the non-candidate label set; for open-set OOC examples, we leverage them for training by utilizing an effective regularization strategy that dynamically assigns random candidate labels from the candidate label set. In this way, the two types of OOC examples can be differentiated and further leveraged for model training. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art PLL methods.

Planning to Fairly Allocate: Probabilistic Fairness in the Restless Bandit Setting

Restless and collapsing bandits are often used to model budget-constrained resource allocation in settings where arms have action-dependent transition probabilities, such as the allocation of health interventions among patients. However, SOTA Whittle-index-based approaches to this planning problem either do not consider fairness among arms, or incentivize fairness without guaranteeing it. We thus introduce ProbFair, a probabilistically fair policy that maximizes total expected reward and satisfies the budget constraint while ensuring a strictly positive lower bound on the probability of being pulled at each timestep. We evaluate our algorithm on a real-world application, where interventions support continuous positive airway pressure (CPAP) therapy adherence among patients, as well as on a broader class of synthetic transition matrices. We find that ProbFair preserves utility while providing fairness guarantees.

Unbiased Locally Private Estimator for Polynomials of Laplacian Variables

This work presents a mechanism to debias polynomial functions computed from locally differentially private data. Local differential privacy is a widely used privacy notion in which users add Laplacian noise to their information before submitting it to a central server. This, however, introduces bias when non-linear functions are computed from the noisy information. Our proposed recursive algorithm debiases such functions, with a computation time of O(r n log n), where r is the polynomial degree and n is the number of users. We evaluate our method on the problems of k-star counting and variance estimation, comparing results with state-of-the-art algorithms. The results show that our method not only eliminates bias, but is also at least 100 times more accurate than previous works.
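The degree-2 case of the debiasing idea is easy to state: for y = x + L with L ~ Laplace(0, b), E[y²] = x² + 2b², so subtracting the known second moment of the noise yields an unbiased estimate. A hedged sketch (the function name is ours; the paper's recursion handles general degree-r polynomials):

```python
import numpy as np

def debias_square(y, b):
    """Unbiased estimator of x**2 from y = x + Laplace(0, b).

    E[y^2] = x^2 + 2*b**2 (the Laplace second moment), so subtracting
    the known noise moment removes the bias. Degree-2 special case of
    the general polynomial debiasing; illustrative only.
    """
    return y ** 2 - 2 * b ** 2

# Monte Carlo check: the naive mean of y**2 is biased upward by 2*b**2.
rng = np.random.default_rng(0)
x, b = 3.0, 1.0
y = x + rng.laplace(0.0, b, size=200_000)
print((y ** 2).mean(), debias_square(y, b).mean())  # ~11 vs ~9
```

Higher degrees work the same way in principle, recursively subtracting combinations of the known even moments of the Laplace distribution (the odd moments vanish).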

Graph Neural Processes for Spatio-Temporal Extrapolation

We study the task of spatio-temporal extrapolation that generates data at target locations from surrounding contexts in a graph. This task is crucial as sensors that collect data are sparsely deployed, resulting in a lack of fine-grained information due to high deployment and maintenance costs. Existing methods either use learning-based models like Neural Networks or statistical approaches like Gaussian Processes for this task. However, the former lacks uncertainty estimates and the latter fails to capture complex spatial and temporal correlations effectively. To address these issues, we propose Spatio-Temporal Graph Neural Processes (STGNP), a neural latent variable model that provides both capabilities simultaneously. Specifically, we first learn deterministic spatio-temporal representations by stacking layers of causal convolutions and cross-set graph neural networks. Then, we learn latent variables for target locations through vertical latent state transitions along layers and obtain extrapolations. Importantly, during the transitions we propose Graph Bayesian Aggregation (GBA), a Bayesian graph aggregator that aggregates contexts considering uncertainties in context data and graph structure. Extensive experiments show that STGNP has desirable properties such as uncertainty estimates and strong learning capabilities, and achieves state-of-the-art results by a clear margin.

ST-iFGSM: Enhancing Robustness of Human Mobility Signature Identification Model via Spatial-Temporal Iterative FGSM

The Human Mobility Signature Identification (HuMID) problem aims at determining, from the historical movement trajectories of a set of individual human agents such as pedestrians and taxi drivers, whether incoming trajectories were generated by a claimed agent. The HuMID problem is significant, and its solutions have a wide range of real-world applications, such as criminal identification for police departments, risk assessment for auto insurance providers, driver verification in ride-sharing services, and so on. Although deep neural network (DNN)-based HuMID models built on spatial-temporal mobility fingerprint similarity demonstrate remarkable performance in identifying human agents' mobility signatures, they are vulnerable to adversarial attacks, like other DNN-based models. Therefore, in this paper, we propose a Spatial-Temporal iterative Fast Gradient Sign Method with L0 regularization - ST-iFGSM - to detect the vulnerability and enhance the robustness of HuMID models. Extensive experiments with real-world taxi trajectory data demonstrate the efficiency and effectiveness of our ST-iFGSM algorithm. We tested our method on both the ST-SiameseNet and an LSTM-based HuMID classification model. The results show that ST-iFGSM can generate successful attacks that fool the HuMID models within only a few attack steps on a small portion of the trajectories. The generated attacks can be used as augmented data to update the HuMID model, improving its accuracy significantly from 47.36% to 76.18% on post-attack testing samples (86.25% on the original testing samples).
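For context, the single-step FGSM building block that iterative variants like ST-iFGSM extend can be sketched as follows. The logistic toy model, weights, and epsilon here are hypothetical; the paper's method additionally iterates and applies L0 regularization over spatio-temporal trajectories.

```python
import numpy as np

def fgsm_step(x, grad_wrt_x, eps):
    # Perturb the input in the loss-increasing direction,
    # bounded by eps in the L-infinity norm.
    return x + eps * np.sign(grad_wrt_x)

# Toy target: logistic model p = sigmoid(w @ x), loss = -log(p) for label 1,
# whose gradient with respect to x is (p - 1) * w.
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, 0.4])
p = 1.0 / (1.0 + np.exp(-w @ x))
grad = (p - 1.0) * w
x_adv = fgsm_step(x, grad, eps=0.05)   # each coordinate moves by exactly 0.05
```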

Leveraging Relational Graph Neural Network for Transductive Model Ensemble

Traditional methods of pre-training, fine-tuning, and ensembling often overlook essential relational data and task interconnections. To address this gap, our study presents a novel approach to harnessing this relational information via a relational graph-based model. We introduce the Relational grAph Model ensemBLE, abbreviated as RAMBLE. This model distinguishes itself by performing class label inference simultaneously across all data nodes and task nodes, employing the relational graph in a transductive manner. This fine-grained approach allows us to better comprehend and model the intricate interplay between data and tasks. Furthermore, we incorporate a novel variational information bottleneck-guided scheme for embedding fusion and aggregation. This innovative technique facilitates the creation of an informative fusion embedding, homing in on embeddings beneficial for the intended task while simultaneously filtering out potential noise-laden embeddings. Our theoretical analysis, grounded in information theory, confirms that the use of relational information for embedding fusion allows us to achieve higher upper and lower bounds on our target task's accuracy. We thoroughly assess our proposed model across eight diverse datasets, and the experimental results demonstrate the model's effective utilization of relational knowledge derived from all pre-trained models, thereby enhancing its performance on our target tasks.

One for All: Unified Workload Prediction for Dynamic Multi-tenant Edge Cloud Platforms

Workload prediction in multi-tenant edge cloud platforms (MT-ECP) is vital for efficient application deployment and resource provisioning. However, the heterogeneous application patterns, variable infrastructure performance, and frequent deployments in MT-ECP pose significant challenges for accurate and efficient workload prediction. Clustering-based methods for dynamic MT-ECP modeling incur excessive costs because they must maintain numerous data clusters and models. Existing end-to-end time series prediction methods struggle to provide consistent prediction performance in dynamic MT-ECP. In this paper, we propose an end-to-end framework with global pooling and static content awareness, DynEformer, to provide a unified workload prediction scheme for dynamic MT-ECP. Meticulously designed global pooling and information merging mechanisms can effectively identify and utilize global application patterns to drive local workload predictions. The integration of static content-aware mechanisms enhances model robustness in real-world scenarios. In experiments on five real-world datasets, DynEformer achieves state-of-the-art performance in dynamic MT-ECP scenarios and provides a unified end-to-end prediction scheme for MT-ECP.

Generalizing Graph ODE for Learning Complex System Dynamics across Environments

Learning multi-agent system dynamics has been extensively studied for various real-world applications, such as molecular dynamics in biology, multi-body system prediction in physics, and particle dynamics in material science. Most existing models are built to learn the dynamics of a single system from observed historical data and to predict its future trajectory. In practice, however, we might observe multiple systems generated across different environments, which differ in latent exogenous factors such as temperature and gravity. One simple solution is to learn multiple environment-specific models, but this fails to exploit the potential commonalities among the dynamics across environments and yields poor predictions when per-environment data is sparse or limited. Here, we present GG-ODE (Generalized Graph Ordinary Differential Equations), a machine learning framework for learning continuous multi-agent system dynamics across environments. Our model learns system dynamics using neural ordinary differential equations (ODE) parameterized by Graph Neural Networks (GNNs) to capture the continuous interaction among agents. We achieve model generalization by assuming that the dynamics across different environments are governed by common physics laws that can be captured via learning a shared ODE function. The distinct latent exogenous factors learned for each environment are incorporated into the ODE function to account for their differences. To improve model performance, we additionally design two regularization losses to (1) enforce the orthogonality between the learned initial states and exogenous factors via mutual information minimization; and (2) reduce the temporal variance of learned exogenous factors within the same system via contrastive learning.
Experiments over various physical simulations show that our model can accurately predict system dynamics, especially in the long range, and can generalize well to new systems with few observations.
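The shared-dynamics idea can be illustrated with the simplest possible ODE integrator: a fixed-step Euler sketch with a scalar state. Neural-ODE models use learned vector fields and adaptive solvers, and the decay-rate "environment factor" below is purely illustrative.

```python
def euler_trajectory(f, z0, dt, n_steps):
    # Fixed-step Euler integration of dz/dt = f(z).
    traj = [z0]
    z = z0
    for _ in range(n_steps):
        z = z + dt * f(z)
        traj.append(z)
    return traj

# One shared dynamics family, modulated by a per-environment factor g
# (standing in for a latent exogenous factor such as gravity or temperature).
make_f = lambda g: (lambda z: -g * z)   # exponential decay at rate g
env_a = euler_trajectory(make_f(1.0), 1.0, 0.01, 100)
env_b = euler_trajectory(make_f(2.0), 1.0, 0.01, 100)
# env_b decays faster than env_a even though f's functional form is shared
```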

The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles

Transformers use the dense self-attention mechanism which gives a lot of flexibility for long-range connectivity. Over multiple layers of a deep transformer, the number of possible connectivity patterns increases exponentially. However, very few of these contribute to the performance of the network, and even fewer are essential. We hypothesize that there are sparsely connected sub-networks within a transformer, called information pathways, which can be trained independently. However, the dynamic (i.e., input-dependent) nature of these pathways makes it difficult to prune dense self-attention during training. But the overall distribution of these pathways is often predictable. We take advantage of this fact to propose Stochastically Subsampled self-Attention (SSA) - a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training while also serving as a regularization method - improving generalization over dense training. We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, which can achieve better performance than its densely attended counterpart. We perform experiments on a variety of NLP, computer vision and graph learning tasks in both generative and discriminative settings to provide empirical evidence for our claims and show the effectiveness of the proposed method.

Sequential Learning Algorithms for Contextual Model-Free Influence Maximization

The online influence maximization (OIM) problem aims to sequentially learn an optimal policy for selecting seed nodes that maximize the cumulative spread of information (influence) in a diffusion medium throughout a multi-round diffusion campaign. We consider the sub-class of OIM problems where (i) the reward of a given round of the ongoing campaign consists of only the new activations (not observed at previous rounds), and (ii) the round's context and the historical data from previous rounds can be exploited to learn the best policy. This problem is directly motivated by the real-world scenarios of information diffusion in influencer marketing, where (i) only a target user's first / unique activation is of interest (and this activation will persist as an acquired, latent one throughout the campaign), and (ii) valuable side-information is available to the learning agent. We call this OIM formulation Episodic Contextual Influence Maximization with Persistence (in short, ECIMP). We propose the algorithm LSVI-GT-UCB, which implements the optimism in the face of uncertainty principle for episodic reinforcement learning with linear approximation. The learning agent estimates for each seed node its remaining potential with a Good-Turing estimator, modified by an estimated Q-function. The algorithm is empirically shown to perform better than state-of-the-art methods on two real-world datasets and a synthetically generated one.
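The Good-Turing idea referenced above can be sketched in its classic form, the missing-mass estimator: the probability of seeing a previously unseen outcome is estimated by the fraction of observations that occurred exactly once. The paper adapts this to per-seed remaining activation potential and combines it with a Q-function; the toy data below are hypothetical.

```python
from collections import Counter

def good_turing_missing_mass(observations):
    # Good-Turing estimate of the probability mass of unseen outcomes:
    # (# outcomes observed exactly once) / (total # observations).
    counts = Counter(observations)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(observations)

# e.g. user activations attributed to a seed node across past rounds
obs = ["u1", "u2", "u2", "u3", "u4", "u4", "u4", "u5"]
# singletons u1, u3, u5 -> 3 of 8 observations
estimate = good_turing_missing_mass(obs)   # 3/8 = 0.375
```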

COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search

The sparse Mixture-of-Experts (Sparse-MoE) framework efficiently scales up model capacity in various domains, such as natural language processing and vision. Sparse-MoEs select a subset of the "experts" (thus, only a portion of the overall network) for each input sample using a sparse, trainable gate. Existing sparse gates are prone to convergence and performance issues when training with first-order optimization methods. In this paper, we introduce two improvements to current MoE approaches. First, we propose a new sparse gate: COMET, which relies on a novel tree-based mechanism. COMET is differentiable, can exploit sparsity to speed up computation, and outperforms state-of-the-art gates. Second, due to the challenging combinatorial nature of sparse expert selection, first-order methods are typically prone to low-quality solutions. To deal with this challenge, we propose a novel, permutation-based local search method that can complement first-order methods in training any sparse gate, e.g., Hash routing, Top-k, DSelect-k, and COMET. We show that local search can help networks escape bad initializations or solutions. We performed large-scale experiments on various domains, including recommender systems, vision, and natural language processing. On standard vision and recommender systems benchmarks, COMET+ (COMET with local search) achieves up to 13% improvement in ROC AUC over popular gates, e.g., Hash routing and Top-k, and up to 9% over prior differentiable gates, e.g., DSelect-k. When Top-k and Hash gates are combined with local search, we see up to 100X reduction in the budget needed for hyperparameter tuning. Moreover, for language modeling, our approach improves over the state-of-the-art MoEBERT model for distilling BERT on 5/7 GLUE benchmarks as well as SQuAD dataset.
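For reference, the Top-k routing baseline that COMET is compared against can be sketched in a few lines. This is a simplified stand-in: real MoE layers also handle batching, load balancing, and differentiability concerns.

```python
import numpy as np

def top_k_gate(logits, k):
    # Keep the k largest gate logits, softmax over them, zero out the rest.
    idx = np.argsort(logits)[-k:]
    gates = np.zeros_like(logits, dtype=float)
    shifted = np.exp(logits[idx] - logits[idx].max())  # numerically stable softmax
    gates[idx] = shifted / shifted.sum()
    return gates

logits = np.array([2.0, -1.0, 0.5, 1.0])
g = top_k_gate(logits, k=2)   # only experts 0 and 3 receive nonzero weight
```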

Generative Perturbation Analysis for Probabilistic Black-Box Anomaly Attribution

We address the task of probabilistic anomaly attribution in the black-box regression setting, where the goal is to compute the probability distribution of the attribution score of each input variable, given an observed anomaly. The training dataset is assumed to be unavailable. This task differs from the standard XAI (explainable AI) scenario, since we wish to explain the anomalous deviation from a black-box prediction rather than the black-box model itself.

We begin by showing that mainstream model-agnostic explanation methods, such as the Shapley values, are not suitable for this task because of their ''deviation-agnostic property.'' We then propose a novel framework for probabilistic anomaly attribution that allows us to not only compute attribution scores as the predictive mean but also quantify the uncertainty of those scores. This is done by considering a generative process for perturbations that counter-factually bring the observed anomalous observation back to normalcy. We introduce a variational Bayes algorithm for deriving the distributions of per-variable attribution scores. To the best of our knowledge, this is the first probabilistic anomaly attribution framework that is free from being deviation-agnostic.

Parameter-free Spikelet: Discovering Different Length and Warped Time Series Motifs using an Adaptive Time Series Representation

Over the last two decades, time series motif discovery has emerged as a useful primitive for many downstream analytical tasks, including clustering, classification, rule discovery, segmentation, and summarization. In parallel, it has long been known that Dynamic Time Warping (DTW) is superior to other similarity measures such as Euclidean Distance under most settings. Recently an algorithm to allow scalable DTW motif discovery was proposed; however, it is limited to finding pairs of subsequences whose subsequence lengths are the same. Moreover, that length must be provided by the user ahead of time. In this work, we propose a novel method to discover "warped" motifs whose lengths may differ. Moreover, our method allows input parameters that are not fixed lengths but rather just bounds on the maximum length of motifs to find. This allows us to quickly find different-length motifs without the burdensome trial-and-error of conventional methods. With extensive empirical work, we show that our method is scalable enough for real-world datasets and enables us to find variable-length and "warped" motifs that would otherwise escape the attention of conventional algorithms.
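For readers unfamiliar with the similarity measure involved, a minimal DTW implementation illustrates why it can match motifs of different lengths where Euclidean distance cannot. This is the textbook O(nm) dynamic-programming sketch, without the scalability machinery motif discovery requires.

```python
def dtw_distance(a, b):
    # Classic DTW with |x - y| local cost; aligns sequences of different lengths.
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A time-stretched copy of a pattern is close under DTW despite the length mismatch
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 1, 2, 1, 0]    # same shape, different length
d = dtw_distance(x, y)       # 0.0: a perfect warped alignment exists
```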

Similarity Preserving Adversarial Graph Contrastive Learning

Recent works demonstrate that GNN models are vulnerable to adversarial attacks, which refer to imperceptible perturbation on the graph structure and node features. Among various GNN models, graph contrastive learning (GCL) based methods specifically suffer from adversarial attacks due to their inherent design that highly depends on the self-supervision signals derived from the original graph, which however already contains noise when the graph is attacked. To achieve adversarial robustness against such attacks, existing methods apply adversarial training (AT) to the GCL framework, which considers the attacked graph as an augmentation under the GCL framework. However, we find that existing adversarially trained GCL methods achieve robustness at the expense of not being able to preserve the node feature similarity. In this paper, we propose a similarity-preserving adversarial graph contrastive learning (SP-AGCL) framework that contrasts the clean graph with two auxiliary views of different properties (i.e., the node similarity-preserving view and the adversarial view). Extensive experiments demonstrate that SP-AGCL achieves a competitive performance on several downstream tasks, and shows its effectiveness in various scenarios, e.g., a network with adversarial attacks, noisy labels, and heterophilous neighbors. Our code is available at https://github.com/yeonjun-in/torch-SP-AGCL.

Fast and Accurate Dual-Way Streaming PARAFAC2 for Irregular Tensors - Algorithm and Application

How can we efficiently and accurately analyze an irregular tensor in a dual-way streaming setting where the sizes of two dimensions of the tensor increase over time? What types of anomalies are there in the dual-way streaming setting? An irregular tensor is a collection of matrices whose column lengths are the same while their row lengths are different. In a dual-way streaming setting, both new rows of existing matrices and new matrices arrive over time. PARAFAC2 decomposition is a crucial tool for analyzing irregular tensors. Although real-time analysis is necessary in the dual-way streaming setting, static PARAFAC2 decomposition methods fail to work efficiently in this setting since they perform PARAFAC2 decomposition on the accumulated tensor whenever new data arrive. Existing streaming PARAFAC2 decomposition methods work in a limited setting and fail to handle new rows of matrices efficiently.

In this paper, we propose Dash, an efficient and accurate PARAFAC2 decomposition method working in the dual-way streaming setting. When new data are given, Dash efficiently performs PARAFAC2 decomposition by carefully dividing the terms related to old and new data and avoiding naive computations involved with old data. Furthermore, applying a forgetting factor makes Dash follow recent movements. Extensive experiments show that Dash runs up to 14.0x faster than existing PARAFAC2 decomposition methods on newly arrived data. We also provide discoveries for detecting anomalies in real-world datasets, including the subprime mortgage crisis and COVID-19.

Hierarchical Proxy Modeling for Improved HPO in Time Series Forecasting

Selecting the right set of hyperparameters is crucial in time series forecasting. The classical temporal cross-validation framework for hyperparameter optimization (HPO) often leads to poor test performance because of a possible mismatch between validation and test periods. To address this test-validation mismatch, we propose a novel technique, H-Pro, to drive HPO via test proxies by exploiting data hierarchies often associated with time series datasets. Since higher-level aggregated time series often show less irregularity and better predictability as compared to the lowest-level time series, which can be sparse and intermittent, we optimize the hyperparameters of the lowest-level base-forecaster by leveraging the proxy forecasts for the test period generated from the forecasters at higher levels. H-Pro can be applied to any off-the-shelf machine learning model to perform HPO. We validate the efficacy of our technique with extensive empirical evaluation on five publicly available hierarchical forecasting datasets. Our approach outperforms existing state-of-the-art methods on the Tourism, Wiki, and Traffic datasets, and achieves competitive results on the Tourism-L dataset, without any model-specific enhancements. Moreover, our method outperforms the winning method of the M5 forecast accuracy competition.

Precursor-of-Anomaly Detection for Irregular Time Series

Anomaly detection is an important field that aims to identify unexpected patterns or data points, and it is closely related to many real-world problems, particularly to applications in finance, manufacturing, cyber security, and so on. While anomaly detection has been studied extensively in various fields, detecting future anomalies before they occur remains largely unexplored. In this paper, we present a novel type of anomaly detection, called Precursor-of-Anomaly (PoA) detection. Unlike conventional anomaly detection, which focuses on determining whether a given time series observation is an anomaly or not, PoA detection aims to detect future anomalies before they happen. To solve both problems at the same time, we present a neural controlled differential equation-based neural network and its multi-task learning algorithm. We conduct experiments using 17 baselines and 3 datasets, including regular and irregular time series, and demonstrate that our presented method outperforms the baselines in almost all cases. Our ablation studies also indicate that the multi-task training method significantly enhances the overall performance for both anomaly and PoA detection.

Community-based Dynamic Graph Learning for Popularity Prediction

Popularity prediction, which aims to forecast how many users would like to interact with a target item or online content in the future, can help online shopping or social media platforms to identify popular items or digital contents. Many efforts have been made to study how the multi-faceted factors, such as item features, user preferences, and social influence, affect user-item interactions, but little work has focused on the evolutionary dynamics of these factors for individuals or groups. In that light, this paper develops a community-based dynamic graph learning method for popularity prediction. First, a dynamic graph learning framework is proposed to maintain a dynamic representation for each item or user entity and update the representations according to the newly observed user-item interactions. Second, a community detection module is designed to capture the evolving community structures and identify the most influential nodes. More importantly, our framework leverages a community-level message passing during the learning process to balance local and global information propagation. Finally, we predict the popularity of the target item or online content based on the learned representations. Our experimental results based on three real-world datasets demonstrate that the proposed method achieves better performance than the baselines. Our method could not only model the changes in a user's preferences, but also capture how the communities evolve over time.

GetPt: Graph-enhanced General Table Pre-training with Alternate Attention Network

Tables are widely used for data storage and presentation due to their high flexibility in layout. The importance of tables as information carriers and the complexity of tabular data understanding attract a great deal of research on large-scale pre-training for tabular data. However, most of the works design models for specific types of tables, such as relational tables and tables with well-structured headers, neglecting tables with complex layouts. In real-world scenarios, many tables fall outside their target scope and cannot be well supported. In this paper, we propose GetPt, a unified pre-training architecture for general table representation applicable even to tables with complex structures and layouts. First, we convert a table to a heterogeneous graph with multiple types of edges to represent the layout of the table. Based on the graph, a specially designed transformer is applied to jointly model the semantics and structure of the table. Second, we devise the Alternate Attention Network (AAN) to better model the contextual information across multiple granularities of a table including tokens, cells, and the table. To better support a wide range of downstream tasks, we further employ three pre-training objectives and pre-train the model on a large table dataset. We fine-tune and evaluate the GetPt model on two representative tasks: table type classification and table structure recognition. Experiments show that GetPt outperforms existing state-of-the-art methods on these tasks.

Enhancing Node-Level Adversarial Defenses by Lipschitz Regularization of Graph Neural Networks

Graph neural networks (GNNs) have shown considerable promise for graph-structured data. However, they are also known to be unstable and vulnerable to perturbations and attacks. Recently, the Lipschitz constant has been adopted as a control on the stability of Euclidean neural networks, but calculating the exact constant is also known to be difficult even for very shallow networks. In this paper, we extend the Lipschitz analysis to graphs by providing a systematic scheme for estimating upper bounds of the Lipschitz constants of GNNs. We also derive concrete bounds for widely used GNN architectures including GCN, GraphSAGE and GAT. We then use these Lipschitz bounds for regularized GNN training for improved stability. Our numerical results on Lipschitz regularization of GNNs not only illustrate enhanced test accuracy under random noise, but also show consistent improvement for state-of-the-art defense methods against adversarial attacks.
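The basic composition bound underlying such Lipschitz analyses can be sketched for a plain feed-forward network: the product of per-layer spectral norms upper-bounds the Lipschitz constant of a stack of linear layers interleaved with 1-Lipschitz activations such as ReLU. The paper derives tighter, GNN-specific bounds; the toy weights below are illustrative.

```python
import numpy as np

def lipschitz_upper_bound(weight_matrices):
    # Product of layer spectral norms: an upper bound on the Lipschitz
    # constant of W_L o sigma o ... o sigma o W_1 for 1-Lipschitz sigma.
    bound = 1.0
    for W in weight_matrices:
        bound *= np.linalg.norm(W, 2)   # largest singular value
    return bound

# Two-layer toy network; for linear activations the bound 2 * 3 = 6 is attained
W1 = np.array([[2.0, 0.0], [0.0, 1.0]])
W2 = np.array([[0.0, 3.0], [1.0, 0.0]])
bound = lipschitz_upper_bound([W1, W2])   # 6.0
```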

Semantic Dissimilarity Guided Locality Preserving Projections for Partial Label Dimensionality Reduction

Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels among which only one is the ground-truth label. Existing works on partial label dimensionality reduction only exploit the disambiguated labels, but overlook the available semantic dissimilarity relationship hidden in the disambiguated labeling confidence, i.e., the smaller the inner product of the labeling confidences of two instances, the less likely they have the same ground-truth label. By combining such global dissimilarity relationship with local neighborhood information, we propose a novel partial label dimensionality reduction method named SDLPP, which employs an alternating procedure including candidate label disambiguation, semantic dissimilarity generation and dimensionality reduction. The labeling confidences of candidate labels and semantic dissimilarity relationship are constantly updated through the alternating procedure, where the processes in each iteration are based on the low-dimensional data obtained in the previous iteration. After the alternating procedure, SDLPP maps the original data to a pre-specified low-dimensional feature space. Comprehensive experiments on both synthetic and real-world data sets validate that SDLPP can improve the generalization performance of different PLL algorithms, and outperform state-of-the-art partial label dimensionality reduction methods. The code is publicly available at https://github.com/jhjiangSEU/SDLPP.
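The dissimilarity cue described above can be sketched directly; the 1-minus-inner-product form and the toy confidence vectors here are illustrative choices, not SDLPP's exact formulation.

```python
def semantic_dissimilarity(conf_a, conf_b):
    # A small inner product between two instances' labeling-confidence
    # vectors suggests they have different ground-truth labels.
    return 1.0 - sum(a * b for a, b in zip(conf_a, conf_b))

# Confidences over 3 candidate labels (each vector sums to 1)
x = [0.8, 0.2, 0.0]
y = [0.1, 0.1, 0.8]
z = [0.7, 0.3, 0.0]
d_xy = semantic_dissimilarity(x, y)   # high: likely different true labels
d_xz = semantic_dissimilarity(x, z)   # lower: likely the same true label
```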

Complementary Classifier Induced Partial Label Learning

In partial label learning (PLL), each training sample is associated with a set of candidate labels, among which only one is valid. The core of PLL is to disambiguate the candidate labels to get the ground-truth one. In disambiguation, the existing works usually do not fully investigate the effectiveness of the non-candidate label set (a.k.a. complementary labels), which accurately indicates a set of labels that do not belong to a sample. In this paper, we use the non-candidate labels to induce a complementary classifier, which naturally forms an adversarial relationship against the traditional PLL classifier, to eliminate the false-positive labels in the candidate label set. Besides, we assume the feature space and the label space share the same local topological structure captured by a dynamic graph, and use it to assist disambiguation. Extensive experimental results validate the superiority of the proposed approach against state-of-the-art PLL methods on 4 controlled UCI data sets and 6 real-world data sets and reveal the usefulness of complementary learning in PLL. The code is available at https://github.com/Chongjie-Si/PL-CL.

Anomaly Detection with Score Distribution Discrimination

Recent studies pay increasing attention to anomaly detection (AD) methods that can leverage a handful of labeled anomalies along with abundant unlabeled data. These anomaly-informed AD methods rely on manually predefined score target(s), e.g., prior constant or margin hyperparameter(s), to realize discrimination in anomaly scores between normal and abnormal data. However, such methods are vulnerable to anomaly contamination in the unlabeled data and lack adaptation to different data scenarios.

In this paper, we propose to optimize the anomaly scoring function from the view of score distribution, thus better retaining the diversity and more fine-grained information of input data, especially when the unlabeled data contains anomaly noises in more practical AD scenarios. We design a novel loss function called Overlap loss that minimizes the overlap area between the score distributions of normal and abnormal samples, which no longer depends on prior anomaly score targets and thus acquires adaptability to various datasets. Overlap loss consists of Score Distribution Estimator and Overlap Area Calculation, which are introduced to overcome challenges when estimating arbitrary score distributions, and to ensure the boundedness of the training loss. As a general loss component, Overlap loss can be effectively integrated into multiple network architectures for constructing AD models. Extensive experimental results indicate that Overlap loss based AD models significantly outperform their state-of-the-art counterparts, and achieve better performance on different types of anomalies.
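The overlap quantity being minimized can be sketched numerically, with simple Gaussian fits standing in for the paper's Score Distribution Estimator; the sample values and grid settings below are illustrative.

```python
import math

def normal_pdf(v, mu, sigma):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def overlap_area(scores_a, scores_b, grid=2000):
    # Overlap coefficient of two score distributions, each summarized by a
    # Gaussian fit: integral of min(p_a, p_b) on a shared grid.
    # 1 = identical distributions, 0 = fully separated.
    def fit(s):
        mu = sum(s) / len(s)
        var = sum((v - mu) ** 2 for v in s) / len(s)
        return mu, max(math.sqrt(var), 1e-6)
    (ma, sa), (mb, sb) = fit(scores_a), fit(scores_b)
    lo = min(ma - 5 * sa, mb - 5 * sb)
    hi = max(ma + 5 * sa, mb + 5 * sb)
    dx = (hi - lo) / grid
    return sum(min(normal_pdf(lo + (i + 0.5) * dx, ma, sa),
                   normal_pdf(lo + (i + 0.5) * dx, mb, sb))
               for i in range(grid)) * dx

normal_scores   = [0.1, 0.2, 0.15, 0.25, 0.18]
abnormal_scores = [0.9, 1.1, 1.0, 0.95, 1.05]
ovl = overlap_area(normal_scores, abnormal_scores)   # near 0: well separated
```

Driving this quantity toward zero pushes the two score distributions apart without fixing any prior score target, which is the intuition behind the Overlap loss.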

CF-GODE: Continuous-Time Causal Inference for Multi-Agent Dynamical Systems

Multi-agent dynamical systems refer to scenarios where multiple units (aka agents) interact with each other and evolve collectively over time. For instance, people's health conditions are mutually influenced. Receiving vaccinations not only strengthens the long-term health status of one unit but also provides protection for those in their immediate surroundings. To make informed decisions in multi-agent dynamical systems, such as determining the optimal vaccine distribution plan, it is essential for decision-makers to estimate the continuous-time counterfactual outcomes. However, existing studies of causal inference over time rely on the assumption that units are mutually independent, which is not valid for multi-agent dynamical systems. In this paper, we aim to bridge this gap and study how to estimate counterfactual outcomes in multi-agent dynamical systems. Causal inference in a multi-agent dynamical system has unique challenges: 1) Confounders are time-varying and are present in both individual unit covariates and those of other units; 2) Units are affected by not only their own but also others' treatments; 3) The treatments are naturally dynamic, such as receiving vaccines and boosters in a seasonal manner. To this end, we model a multi-agent dynamical system as a graph and propose a novel model called CF-GODE (CounterFactual Graph Ordinary Differential Equations). CF-GODE is a causal model that estimates continuous-time counterfactual outcomes in the presence of inter-dependencies between units. To facilitate continuous-time estimation, we propose Treatment-Induced GraphODE, a novel ordinary differential equation based on graph neural networks (GNNs), which can incorporate dynamical treatments as additional inputs to predict potential outcomes over time. To remove confounding bias, we propose two domain adversarial learning based objectives that learn balanced continuous representation trajectories, which are not predictive of treatments and interference.
We further provide theoretical justification to prove their effectiveness. Experiments on two semi-synthetic datasets confirm that CF-GODE outperforms baselines on counterfactual estimation. We also provide extensive analyses to understand how our model works.

FedSkill: Privacy Preserved Interpretable Skill Learning via Imitation

Imitation learning that replicates experts' skills via their demonstrations has shown significant success in various decision-making tasks. However, two critical challenges still hinder the deployment of imitation learning techniques in real-world application scenarios. First, existing methods lack the intrinsic interpretability to explicitly explain the underlying rationale of the learned skill, thus making the learned policy untrustworthy. Second, due to the scarcity of expert demonstrations from each end user (client), learning a policy based on different data silos is necessary but challenging in privacy-sensitive applications such as finance and healthcare. To this end, we present a privacy-preserved interpretable skill learning framework (FedSkill) that enables global policy learning to incorporate data from different sources and provides explainable interpretations to each local user without violating privacy and data sovereignty. Specifically, our proposed interpretable skill learning model can capture the varying patterns in the trajectories of expert demonstrations, and extract prototypical information as skills that provide implicit guidance for policy learning and explicit explanations in the reasoning process. Moreover, we design a novel aggregation mechanism coupled with the base skill learning model to preserve global information utilization and maintain local interpretability under the federated framework. Thorough experiments on three datasets and empirical studies demonstrate that our proposed FedSkill framework not only outperforms state-of-the-art imitation learning methods but also exhibits good interpretability under a federated setting. Our proposed FedSkill framework is the first attempt to bridge the gaps among federated learning, interpretable machine learning, and imitation learning.

Heterformer: Transformer-based Deep Node Representation Learning on Heterogeneous Text-Rich Networks

Representation learning on networks aims to derive a meaningful vector representation for each node, thereby facilitating downstream tasks such as link prediction, node classification, and node clustering. In heterogeneous text-rich networks, this task is more challenging due to (1) presence or absence of text: Some nodes are associated with rich textual information, while others are not; (2) diversity of types: Nodes and edges of multiple types form a heterogeneous network structure. As pretrained language models (PLMs) have demonstrated their effectiveness in obtaining widely generalizable text representations, a substantial amount of effort has been made to incorporate PLMs into representation learning on text-rich networks. However, few of them can jointly consider heterogeneous structure (network) information as well as rich textual semantic information of each node effectively. In this paper, we propose Heterformer, a Heterogeneous Network-Empowered Transformer that performs contextualized text encoding and heterogeneous structure encoding in a unified model. Specifically, we inject heterogeneous structure information into each Transformer layer when encoding node texts. Meanwhile, Heterformer is capable of characterizing node/edge type heterogeneity and encoding nodes with or without texts. We conduct comprehensive experiments on three tasks (i.e., link prediction, node classification, and node clustering) on three large-scale datasets from different domains, where Heterformer outperforms competitive baselines significantly and consistently. The code can be found at https://github.com/PeterGriffinJin/Heterformer.

Transferable Graph Structure Learning for Graph-based Traffic Forecasting Across Cities

Graph-based deep learning models are powerful in modeling spatio-temporal graphs for traffic forecasting. In practice, accurate forecasting models rely on sufficient traffic data, which may not be accessible in real-world applications. To address this problem, transfer learning methods are designed to transfer knowledge from the source graph with abundant data to the target graph with limited data. However, existing methods adopt pre-defined graph structures for knowledge extraction and transfer, which may be noisy or biased and negatively impact the performance of knowledge transfer. To address the problem, we propose TransGTR, a transferable structure learning framework for traffic forecasting that jointly learns and transfers the graph structures and forecasting models across cities. TransGTR consists of a node feature network, a structure generator, and a forecasting model. We train the node feature network with knowledge distillation to extract city-agnostic node features, such that the structure generator, taking the node features as inputs, can be transferred across both cities. Furthermore, we train the structure generator via a temporal decoupled regularization, such that the spatial features learned with the generated graphs share similar distributions across cities and thus facilitate knowledge transfer for the forecasting model. We evaluate TransGTR on real-world traffic speed datasets, where under a fair comparison, TransGTR outperforms state-of-the-art baselines by up to 5.4%.

Predicting Information Pathways Across Online Communities

The problem of community-level information pathway prediction (CLIPP) aims at predicting the transmission trajectory of content across online communities. A successful solution to CLIPP holds significance as it facilitates the distribution of valuable information to a larger audience and prevents the proliferation of misinformation. Notably, solving CLIPP is non-trivial as inter-community relationships and influence are unknown, information spread is multi-modal, and new content and new communities appear over time. In this work, we address CLIPP by collecting large-scale, multi-modal datasets to examine the diffusion of online YouTube videos on Reddit. We analyze these datasets to construct community influence graphs (CIGs) and develop a novel dynamic graph framework, INPAC (Information Pathway Across Online Communities), which incorporates CIGs to capture the temporal variability and multi-modal nature of video propagation across communities. Experimental results in both warm-start and cold-start scenarios show that INPAC outperforms seven baselines in CLIPP. Our code and datasets are available at https://github.com/claws-lab/INPAC.

When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting

Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have hierarchical relations. Previous works assume rigid consistency over the given hierarchies and do not adapt well to real-world data that deviate from this assumption. Moreover, recent state-of-the-art neural probabilistic methods impose hierarchical relations only on point predictions and samples of the predictive distribution. This does not ensure that the full forecast distributions are consistent with the hierarchy, leading to poorly calibrated forecasts. We close both these gaps and propose PROFHiT, a probabilistic hierarchical forecasting model that jointly models forecast distributions over the entire hierarchy. PROFHiT (1) uses a flexible probabilistic Bayesian approach and (2) introduces soft distributional consistency regularization that enables end-to-end learning of the entire forecast distribution by leveraging information from the underlying hierarchy. This enables calibrated forecasts as well as adaptation to real-life data with varied hierarchical consistency. PROFHiT provides 41-88% better performance in accuracy and significantly better calibration over a wide range of dataset consistency. Furthermore, PROFHiT adapts to missing data and can provide reliable forecasts even if up to 10% of input time-series data is missing, whereas other methods' performance severely degrades by over 70%.
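
The soft-consistency idea can be illustrated with Gaussian forecasts: rather than forcing the parent series to equal the sum of its children exactly, a divergence between the parent's forecast distribution and the distribution implied by aggregating the (assumed independent) children is added to the training loss with a weight. A minimal sketch, using a KL divergence as the penalty for illustration (not PROFHiT's exact regularizer):

```python
import numpy as np

def soft_consistency_penalty(mu_p, var_p, mu_c, var_c):
    """KL(N(sum mu_c, sum var_c) || N(mu_p, var_p)): how far the parent's
    Gaussian forecast is from the one implied by summing its children.
    Zero iff the hierarchy is perfectly consistent in distribution."""
    mu_a, var_a = np.sum(mu_c), np.sum(var_c)
    return 0.5 * (np.log(var_p / var_a) + (var_a + (mu_a - mu_p) ** 2) / var_p - 1.0)

# Perfectly consistent hierarchy -> zero penalty.
print(soft_consistency_penalty(5.0, 2.0, [2.0, 3.0], [1.0, 1.0]))  # ~0.0
# Parent deviates from the children's aggregate -> positive penalty.
print(soft_consistency_penalty(7.0, 2.0, [2.0, 3.0], [1.0, 1.0]))  # > 0
```

Because the penalty is a weighted loss term rather than a hard constraint, the model can trade off consistency against fit on data whose hierarchy only approximately holds.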

R-Mixup: Riemannian Mixup for Biological Networks

Biological networks are commonly used in biomedical and healthcare domains to effectively model the structure of complex biological systems with interactions linking biological entities. However, due to their high dimensionality and low sample size, directly applying deep learning models on biological networks usually faces severe overfitting. In this work, we propose R-MIXUP, a Mixup-based data augmentation technique that suits the symmetric positive definite (SPD) property of adjacency matrices from biological networks with optimized training efficiency. The interpolation process in R-MIXUP leverages the log-Euclidean distance metrics from the Riemannian manifold, effectively addressing the swelling effect and arbitrarily incorrect label issues of vanilla Mixup. We demonstrate the effectiveness of R-MIXUP with five real-world biological network datasets on both regression and classification tasks. In addition, we derive a commonly ignored necessary condition for identifying the SPD matrices of biological networks and empirically study its influence on the model performance. The code implementation can be found in Appendix D.
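
The log-Euclidean interpolation at the heart of this kind of Mixup can be sketched in a few lines: map both SPD matrices to the tangent space with the matrix logarithm, mix linearly, and map back with the matrix exponential. A minimal sketch using eigendecomposition-based log/exp (an illustration, not the authors' implementation):

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.T

def spd_exp(S):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

def r_mixup(A, B, lam):
    """Log-Euclidean Mixup: expm((1 - lam) * logm(A) + lam * logm(B))."""
    return spd_exp((1.0 - lam) * spd_log(A) + lam * spd_log(B))

# Two toy SPD matrices standing in for correlation-style biological networks.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5)); A = X @ X.T + np.eye(3)
Y = rng.standard_normal((3, 5)); B = Y @ Y.T + np.eye(3)

M = r_mixup(A, B, lam=0.3)
assert np.all(np.linalg.eigvalsh(M) > 0)  # the mix stays SPD
```

A nice property of this geodesic mix is that determinants interpolate geometrically, det(M) = det(A)^(1-lam) * det(B)^lam, which is exactly what avoids the "swelling effect" of entrywise interpolation.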

Exploiting Relation-aware Attribute Representation Learning in Knowledge Graph Embedding for Numerical Reasoning

Numerical reasoning is an essential task for supporting machine learning applications, such as recommendation and information retrieval. The reasoning task aims to compare two items and infer new facts (e.g., is taller than) by leveraging existing relational information and numerical attributes (e.g., the height of an entity) in knowledge graphs. However, most existing methods rely on leveraging attribute encoders or additional loss functions to predict numerical relations. Therefore, the prediction performance is often not robust in cases when attributes are sparsely observed. In this paper, we propose a Relation-Aware attribute representation learning-based Knowledge Graph Embedding method for numerical reasoning tasks, which we call RAKGE. RAKGE incorporates a newly proposed attribute representation learning mechanism, which can leverage the association between relations and their corresponding numerical attributes. In addition, we introduce a robust self-supervised learning method to generate unseen positive and negative examples, thereby making our approach more reliable when numerical attributes are sparsely available. In evaluations on three real-world datasets, our proposed model outperformed state-of-the-art methods, achieving an improvement of up to 65.1% in Hits@1 and up to 52.6% in MRR compared to the best competitor. Our implementation code is available at https://github.com/learndatalab/RAKGE.

Efficient Distributed Approximate k-Nearest Neighbor Graph Construction by Multiway Random Division Forest

k-nearest neighbor graphs, or k-NN graphs for short, are widely used in many data mining applications such as recommendation, information retrieval, and similarity search. Approximate k-NN graph construction has been receiving a lot of attention, and most research focuses on developing algorithms that operate efficiently and quickly on a single machine. A few pioneering studies propose distributed algorithms to scale the size of data that can be processed to billions. However, we observe that these distributed algorithms do not perform well enough due to the problems of graph fragmentation and massive data exchange. In this paper, we propose MRDF (Multiway Random Division Forest), a scalable distributed algorithm that quickly constructs a highly accurate k-NN graph from numerous high-dimensional vectors. MRDF resolves the problems that existing distributed algorithms suffer from through coarse-grained partitioning based on tree path annotation. Experimental results on real-world datasets show that MRDF outperforms the state-of-the-art distributed algorithms, with up to 7.6 times faster speed and up to 56 percentage points better accuracy than the second-best results.
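
For context, the accuracy of an approximate k-NN graph is typically measured as recall against the exact graph. The exact construction — the O(n²) brute-force baseline that distributed methods such as MRDF approximate at scale — can be sketched as:

```python
import numpy as np

def knn_graph(X, k):
    """Exact k-NN graph by brute force: for each point, the indices of its
    k nearest neighbors (excluding itself) under Euclidean distance."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                  # exclude self-edges
    return np.argsort(d2, axis=1)[:, :k]          # one row of neighbors per point

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))   # 100 points in 8 dimensions
G = knn_graph(X, k=5)
assert G.shape == (100, 5)
```

The quadratic time and memory of this baseline is exactly what makes approximate, partition-based construction necessary for billion-scale data.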

Task Relation-aware Continual User Representation Learning

User modeling, which learns to map users into a low-dimensional representation space based on their past behaviors, has attracted a surge of interest from industry for providing personalized services to users. Previous efforts in user modeling mainly focus on learning a task-specific user representation that is designed for a single task. However, since learning task-specific user representations for every task is infeasible, recent studies introduce the concept of universal user representation, a more generalized representation of a user that is relevant to a variety of tasks. Despite their effectiveness, existing approaches for learning universal user representations are impractical in real-world applications due to their data requirements, catastrophic forgetting, and limited learning capability for continually added tasks. In this paper, we propose a novel continual user representation learning method, called TERACON, whose learning capability is not limited as the number of learned tasks increases, while capturing the relationships between tasks. The main idea is to introduce an embedding for each task, i.e., a task embedding, which is utilized to generate task-specific soft masks that not only allow the entire set of model parameters to be updated until the end of the training sequence, but also facilitate capturing the relationships between tasks. Moreover, we introduce a novel knowledge retention module with a pseudo-labeling strategy that successfully alleviates the long-standing problem of continual learning, i.e., catastrophic forgetting. Extensive experiments on public and proprietary real-world datasets demonstrate the superiority and practicality of TERACON. Our code is available at https://github.com/Sein-Kim/TERACON.

Task-Equivariant Graph Few-shot Learning

Although Graph Neural Networks (GNNs) have been successful in node classification tasks, their performance heavily relies on the availability of a sufficient number of labeled nodes per class. In real-world situations, not all classes have many labeled nodes and there may be instances where the model needs to classify new classes, making manual labeling difficult. To solve this problem, it is important for GNNs to be able to classify nodes with a limited number of labeled nodes, known as few-shot node classification. Previous episodic meta-learning based methods have demonstrated success in few-shot node classification, but our findings suggest that optimal performance can only be achieved with a substantial amount of diverse training meta-tasks. To address this challenge of meta-learning based few-shot learning (FSL), we propose a new approach, the Task-Equivariant Graph few-shot learning (TEG) framework. Our TEG framework enables the model to learn transferable task-adaptation strategies using a limited number of training meta-tasks, allowing it to acquire meta-knowledge for a wide range of meta-tasks. By incorporating equivariant neural networks, TEG can utilize their strong generalization abilities to learn highly adaptable task-specific strategies. As a result, TEG achieves state-of-the-art performance with limited training meta-tasks. Our experiments on various benchmark datasets demonstrate TEG's superiority in terms of accuracy and generalization ability, even when using minimal meta-training data, highlighting the effectiveness of our proposed approach in addressing the challenges of meta-learning based few-shot node classification. Our code is available at the following link: https://github.com/sung-won-kim/TEG

How Transitive Are Real-World Group Interactions? - Measurement and Reproduction

Many real-world interactions (e.g., researcher collaborations and email communication) occur among multiple entities. These group interactions are naturally modeled as hypergraphs. In graphs, transitivity is helpful to understand the connections between node pairs sharing a neighbor, and it has extensive applications in various domains. Hypergraphs, an extension of graphs, are designed to represent group relations. However, to the best of our knowledge, there has been no examination regarding the transitivity of real-world group interactions. In this work, we investigate the transitivity of group interactions in real-world hypergraphs. We first suggest intuitive axioms as necessary characteristics of hypergraph transitivity measures. Then, we propose a principled hypergraph transitivity measure HyperTrans, which satisfies all the proposed axioms, with a fast computation algorithm Fast-HyperTrans. After that, we analyze the transitivity patterns in real-world hypergraphs distinguished from those in random hypergraphs. Lastly, we propose a scalable hypergraph generator THera. It reproduces the observed transitivity patterns by leveraging community structures, which are pervasive in real-world hypergraphs. Our code and datasets are available at https://github.com/kswoo97/hypertrans.

LATTE: A Framework for Learning Item-Features to Make a Domain-Expert for Effective Conversational Recommendation

For high-quality conversational recommender systems (CRS), it is important to recommend suitable items by capturing the items' features mentioned in the dialog and to explain the appropriate ones among the various features of the recommended item. We argue that the CRS model should be a domain expert who is (1) knowledgeable about the relationships between items and their various features and (2) able to explain the recommended item with its features relevant to the dialog context. To this end, we propose a novel framework, named LATTE, to pre-train each core module in CRS (i.e., the recommendation and the conversation module) through abundant external data. We pre-train the recommendation module to comprehensively understand the relationships between items and their various features by leveraging both multi-reviews and a knowledge graph. For pre-training the conversation module, we create synthetic dialogs, which contain responses providing explanations relevant to the dialog context, by using all the items' features and dialog templates. Through extensive experiments on two public CRS datasets, we demonstrate (1) the effectiveness of each module in LATTE, (2) its superiority over seven state-of-the-art methods, and (3) its interpretations based on visualization.

Off-Policy Evaluation of Ranking Policies under Diverse User Behavior

Ranking interfaces are everywhere in online platforms. There is thus an ever-growing interest in their Off-Policy Evaluation (OPE), which aims at accurate performance evaluation of ranking policies using logged data. A de facto approach for OPE is Inverse Propensity Scoring (IPS), which provides an unbiased and consistent value estimate. However, it becomes extremely inaccurate in the ranking setup due to its high variance under large action spaces. To deal with this problem, previous studies assume either independent or cascade user behavior, resulting in ranking versions of IPS. While these estimators are somewhat effective in reducing the variance, all existing estimators apply a single universal assumption to every user, causing excessive bias and variance. Therefore, this work explores a far more general formulation where user behavior is diverse and can vary depending on the user context. We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior. Moreover, AIPS achieves the minimum variance among all unbiased estimators based on IPS. We further develop a procedure to identify the appropriate user behavior model that minimizes the mean squared error (MSE) of AIPS in a data-driven fashion. Extensive experiments demonstrate that the empirical accuracy improvement can be significant, enabling effective OPE of ranking systems even under diverse user behavior.
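
The variance problem that motivates behavior-model-based estimators can be seen by comparing the joint importance weight of vanilla IPS (a product over all slots of a ranking) with the per-slot weights used under an "independent user behavior" assumption. A minimal sketch with synthetic per-position propensities (AIPS itself adaptively chooses the behavior model per user context, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 5000, 3                             # n logged rankings of length L
p0 = rng.uniform(0.2, 0.8, size=(n, L))    # pi_0(a_k | x): logging propensities
pe = rng.uniform(0.2, 0.8, size=(n, L))    # pi_e(a_k | x): evaluation propensities
r = rng.binomial(1, 0.3, size=(n, L))      # per-position rewards (e.g., clicks)

# Vanilla IPS: one joint weight per ranking, a product over slots.
# These products blow up quickly as L grows, hence the high variance.
w_joint = np.prod(pe / p0, axis=1)
v_ips = np.mean(w_joint * r.sum(axis=1))

# "Independent user behavior" IPS: one weight per slot, much lower variance,
# but biased if users actually examine positions in a dependent way.
v_iips = np.mean(np.sum((pe / p0) * r, axis=1))

print(v_ips, v_iips)  # two value estimates of the same evaluation policy
```

The trade-off shown here — joint weights are unbiased under any behavior but high-variance, per-slot weights are low-variance but rest on an assumption — is exactly the gap a context-adaptive choice of behavior model tries to close.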

Deception by Omission: Using Adversarial Missingness to Poison Causal Structure Learning

Causality-informed machine learning has been proposed as an avenue for achieving many of the goals of modern machine learning, from ensuring generalization under domain shifts to attaining fairness, robustness, and interpretability. A key component of causal machine learning is the inference of causal structures from observational data; in practice, this data may be incompletely observed. Prior work has demonstrated that adversarial perturbations of completely observed training data may be used to force the learning of inaccurate structural causal models (SCMs). However, when the data can be audited for correctness (e.g., it is cryptographically signed by its source), this adversarial mechanism is invalidated. This work introduces a novel attack methodology wherein the adversary deceptively omits a portion of the true training data to bias the learned causal structures in a desired manner (under strong signed sample input validation, this behavior seems to be the only strategy available to the adversary). Under this model, theoretically sound attack mechanisms are derived for the case of arbitrary SCMs, and a sample-efficient learning-based heuristic is given. Experimental validation of these approaches on real and synthetic data sets demonstrates the effectiveness of adversarial missingness attacks at deceiving popular causal structure learning algorithms.

Optimizing Traffic Control with Model-Based Learning: A Pessimistic Approach to Data-Efficient Policy Inference

Traffic signal control is an important problem in urban mobility with a significant potential for economic and environmental impact. While there is a growing interest in Reinforcement Learning (RL) for traffic signal control, the work so far has focused on learning through simulations, which could lead to inaccuracies due to simplifying assumptions. Instead, real experience data on traffic is available and could be exploited at minimal cost. Recent progress in offline or batch RL has enabled just that. Model-based offline RL methods, in particular, have been shown to generalize from the experience data much better than others.

We build a model-based learning framework that infers a Markov Decision Process (MDP) from a dataset collected using a cyclic traffic signal control policy that is both commonplace and easy to gather. The MDP is built with pessimistic costs to manage out-of-distribution scenarios using an adaptive shaping of rewards which is shown to provide better regularization compared to the prior related work in addition to being PAC-optimal. Our model is evaluated on a complex signalized roundabout and a large multi-intersection environment, demonstrating that highly performant traffic control policies can be built in a data-efficient manner.

MM-DAG: Multi-task DAG Learning for Multi-modal Data - with Application for Traffic Congestion Analysis

This paper proposes to learn Multi-task, Multi-modal Directed Acyclic Graphs (MM-DAGs), which are commonly observed in complex systems, e.g., traffic, manufacturing, and weather systems, whose variables are multi-modal, comprising scalars, vectors, and functions. This paper takes traffic congestion analysis as a concrete case, where a traffic intersection is usually regarded as a DAG. In a road network of multiple intersections, different intersections may observe only partially overlapping sets of variables. For example, a signalized intersection has traffic-light-related variables, whereas unsignalized ones do not. This encourages the multi-task design: with each DAG as a task, MM-DAG tries to learn the multiple DAGs jointly so that their consensus and consistency are maximized. To this end, we innovatively propose a multi-modal regression for describing linear causal relationships among different variables. Then we develop a novel Causality Difference (CD) measure and its differentiable approximator. Compared with existing SOTA measures, CD can penalize the causal structural difference among DAGs with distinct nodes and can better account for the uncertainty of causal orders. We rigorously prove our design's topological interpretation and consistency properties. We conduct thorough simulations and one case study to show the effectiveness of our MM-DAG. The code is available at https://github.com/Lantian72/MM-DAG.

Shift-Robust Molecular Relational Learning with Causal Substructure

Recently, molecular relational learning, whose goal is to predict the interaction behavior between molecular pairs, has attracted a surge of interest in the molecular sciences due to its wide range of applications. In this work, we propose CMRL that is robust to the distributional shift in molecular relational learning by detecting the core substructure that is causally related to chemical reactions. To do so, we first assume a causal relationship based on the domain knowledge of molecular sciences and construct a structural causal model (SCM) that reveals the relationship between variables. Based on the SCM, we introduce a novel conditional intervention framework whose intervention is conditioned on the paired molecule. With the conditional intervention framework, our model successfully learns from the causal substructure and alleviates the confounding effect of shortcut substructures that are spuriously correlated to chemical reactions. Extensive experiments on various tasks with real-world and synthetic datasets demonstrate the superiority of CMRL over state-of-the-art baseline models.

Boosting Multitask Learning on Graphs through Higher-Order Task Affinities

Predicting node labels on a given graph is a widely studied problem with many applications, including community detection and molecular graph prediction. This paper considers predicting multiple node labeling functions on graphs simultaneously and revisits this problem from a multitask learning perspective. For a concrete example, consider overlapping community detection: each community membership is a binary node classification task. Due to complex overlapping patterns, we find that negative transfer is prevalent when we apply naive multitask learning to multiple community detection, as task relationships are highly nonlinear across different node labelings. To address the challenge, we develop an algorithm to cluster tasks into groups based on a higher-order task affinity measure. We then fit a multitask model on each task group, resulting in a boosting procedure on top of the baseline model. We estimate the higher-order task affinity measure between two tasks as the prediction loss of one task in the presence of another task and a random subset of other tasks. Then, we use spectral clustering on the affinity score matrix to identify task grouping. We design several speedup techniques to compute the higher-order affinity scores efficiently and show that they can predict negative transfers more accurately than pairwise task affinities. We validate our procedure using various community detection and molecular graph prediction data sets, showing favorable results compared with existing methods. Lastly, we provide a theoretical analysis to show that under a planted block model of tasks on graphs, our affinity scores can provably separate tasks into groups.
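
The grouping step — spectral clustering on a task-affinity matrix — can be sketched as follows. The affinity matrix here is a planted toy example, not the paper's higher-order loss-based estimates, and the embedding/k-means details are one common recipe rather than the authors' exact pipeline:

```python
import numpy as np

def spectral_task_groups(S, n_groups, n_iter=50):
    """Group tasks by spectral clustering on a symmetric affinity matrix S:
    embed tasks with the top eigenvectors of the normalized affinity, then
    run a tiny k-means (farthest-first init + Lloyd's updates) in that space."""
    d = S.sum(axis=1)
    L = S / np.sqrt(np.outer(d, d))          # normalized affinity D^-1/2 S D^-1/2
    _, V = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = V[:, -n_groups:]                     # top eigenvectors as embedding
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    idx = [0]                                # farthest-first center initialization
    for _ in range(n_groups - 1):
        dist = np.min(((U[:, None] - U[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(dist)))
    C = U[idx]
    for _ in range(n_iter):                  # Lloyd's k-means iterations
        labels = np.argmin(((U[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([U[labels == g].mean(axis=0) for g in range(n_groups)])
    return labels

# Planted example: tasks 0-2 and 3-5 form two groups with high within-group
# affinity and low cross-group affinity.
S = np.full((6, 6), 0.1)
S[:3, :3] = S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)
labels = spectral_task_groups(S, n_groups=2)
print(labels)  # tasks 0-2 share one label, tasks 3-5 the other
```

Once the groups are identified, a separate multitask model is fit per group, which is what turns the clustering into a boosting-style procedure over the baseline.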

Interpretable Sparsification of Brain Graphs: Better Practices and Effective Designs for Graph Neural Networks

Brain graphs, which model the structural and functional relationships between brain regions, are crucial in neuroscientific and clinical applications that can be formulated as graph classification tasks. However, dense brain graphs pose computational challenges such as large time and memory consumption and poor model interpretability. In this paper, we investigate effective designs in Graph Neural Networks (GNNs) to sparsify brain graphs by eliminating noisy edges. Many prior works select noisy edges based on explainability or task-irrelevant properties, but this does not guarantee performance improvement when using the sparsified graphs. Additionally, the selection of noisy edges is often tailored to each individual graph, making it challenging to sparsify multiple graphs collectively using the same approach.

To address the issues above, we first introduce an iterative framework to analyze the effectiveness of different sparsification models. By utilizing this framework, we find that (i) methods that prioritize interpretability may not be suitable for graph sparsification, as the sparsified graphs may degrade the performance of GNN models; (ii) it is beneficial to learn the edge selection during the training of the GNN, rather than after the GNN has converged; (iii) learning a joint edge selection shared across all graphs achieves higher performance than generating separate edge selections for each graph; and (iv) gradient information, which is task-relevant, helps with edge selection. Based on these insights, we propose a new model, Interpretable Graph Sparsification (IGS), which improves the graph classification performance by up to 5.1% with 55.0% fewer edges than the original graphs. The retained edges identified by IGS provide neuroscientific interpretations and are supported by well-established literature.

Who Should Be Given Incentives? Counterfactual Optimal Treatment Regimes Learning for Recommendation

Effective personalized incentives can improve user experience and increase platform revenue, resulting in a win-win situation between users and e-commerce companies. Previous studies have used uplift modeling methods to estimate the conditional average treatment effects of users' incentives, and then placed the incentives by maximizing the sum of estimated treatment effects under a limited budget. However, some users will always buy whether incentives are given or not, and will actively collect and use incentives if provided; such users are called "Always Buyers". Identifying and predicting these "Always Buyers" and reducing incentive delivery to them can lead to a more rational incentive allocation. In this paper, we first divide users into five strata from an individual counterfactual perspective, and reveal the failure of previous uplift modeling methods to identify and predict the "Always Buyers". Then, we propose principled counterfactual identification and estimation methods and prove their unbiasedness. We further propose a counterfactual entire-space multi-task learning approach to accurately perform personalized incentive policy learning with a limited budget. We also theoretically derive a lower bound on the reward of the learned policy. Extensive experiments are conducted on three real-world datasets with two common incentive scenarios, and the results demonstrate the effectiveness of the proposed approaches.

UCEpic: Unifying Aspect Planning and Lexical Constraints for Generating Explanations in Recommendation

Personalized natural language generation for explainable recommendations plays a key role in justifying why a recommendation might match a user's interests. Existing models usually control the generation process by aspect planning. While promising, these aspect-planning methods struggle to generate specific information correctly, which prevents generated explanations from being convincing. In this paper, we claim that introducing lexical constraints can alleviate the above issues. We propose a model, UCEpic, that generates high-quality personalized explanations for recommendation results by unifying aspect planning and lexical constraints in an insertion-based generation manner.

Methodologically, to ensure text generation quality and robustness to various lexical constraints, we pre-train a non-personalized text generator via our proposed robust insertion process. Then, to obtain personalized explanations under this framework of insertion-based generation, we design a method of incorporating aspect planning and personalized references into the insertion process. Hence, UCEpic unifies aspect planning and lexical constraints into one framework and generates explanations for recommendations under different settings. Compared to previous recommendation explanation generators controlled by only aspects, UCEpic incorporates specific information from keyphrases and then largely improves the diversity and informativeness of generated explanations for recommendations on datasets such as RateBeer and Yelp.

Text Is All You Need: Learning Language Representations for Sequential Recommendation

Sequential recommendation aims to model dynamic user behavior from historical interactions. Existing methods rely on either explicit item IDs or general textual features for sequence modeling to understand user preferences. While promising, these approaches still struggle to model cold-start items or transfer knowledge to new datasets. In this paper, we propose to model user preferences and item features as language representations that can be generalized to new items and datasets. To this end, we present a novel framework, named Recformer, which effectively learns language representations for sequential recommendation. Specifically, we propose to formulate an item as a "sentence" (word sequence) by flattening item key-value attributes described by text so that an item sequence for a user becomes a sequence of sentences. For recommendation, Recformer is trained to understand the "sentence" sequence and retrieve the next "sentence". To encode item sequences, we design a bi-directional Transformer similar to the model Longformer but with different embedding layers for sequential recommendation. For effective representation learning, we propose novel pretraining and finetuning methods which combine language understanding and recommendation tasks. Therefore, Recformer can effectively recommend the next item based on language representations. Extensive experiments conducted on six datasets demonstrate the effectiveness of Recformer for sequential recommendation, especially in low-resource and cold-start settings.
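The item-flattening step above can be sketched in a few lines. The attribute names and the exact key-value flattening format below are illustrative assumptions, not Recformer's actual tokenizer:

```python
def item_to_sentence(attrs):
    # Flatten item key-value attributes (described by text) into a flat
    # word sequence -- an item "sentence".
    words = []
    for key, value in attrs.items():
        words.extend(str(key).split())
        words.extend(str(value).split())
    return words

item = {"title": "wireless noise cancelling headphones",
        "brand": "Acme", "color": "black"}
sentence = item_to_sentence(item)
# A user's interaction history then becomes a sequence of such
# "sentences", which the bi-directional Transformer encodes.
history = [item_to_sentence(it) for it in [item, item]]
```

Because the model only ever sees text, a brand-new item with textual attributes can be encoded the same way, which is what enables the cold-start and transfer settings described above.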

What's Behind the Mask: Understanding Masked Graph Modeling for Graph Autoencoders

Recent years have witnessed the emergence of a promising self-supervised learning strategy referred to as masked autoencoding. However, there is a lack of theoretical understanding of how masking matters for graph autoencoders (GAEs). In this work, we present the masked graph autoencoder (MaskGAE), a self-supervised learning framework for graph-structured data. Different from standard GAEs, MaskGAE adopts masked graph modeling (MGM) as a principled pretext task: masking a portion of edges and attempting to reconstruct the missing part from the partially visible, unmasked graph structure. To understand whether MGM can help GAEs learn better representations, we provide both theoretical and empirical evidence to comprehensively justify the benefits of this pretext task. Theoretically, we establish close connections between GAEs and contrastive learning, showing that MGM significantly improves the self-supervised learning scheme of GAEs. Empirically, we conduct extensive experiments on a variety of graph benchmarks, demonstrating the superiority of MaskGAE over several state-of-the-art methods on both link prediction and node classification tasks.
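The MGM edge-masking split can be illustrated with a minimal numpy sketch. The mask ratio and uniform edge sampling below are illustrative simplifications (the framework also admits other masking strategies, such as path-wise masking):

```python
import numpy as np

def mask_edges(edge_index, mask_ratio, rng):
    # Split the edge set into reconstruction targets (masked) and the
    # partially visible structure fed to the encoder (visible).
    num_edges = edge_index.shape[1]
    num_masked = int(mask_ratio * num_edges)
    idx = rng.permutation(num_edges)
    masked = edge_index[:, idx[:num_masked]]   # edges to reconstruct
    visible = edge_index[:, idx[num_masked:]]  # encoder input graph
    return visible, masked

rng = np.random.default_rng(0)
# A toy directed graph as a 2 x E edge index: 6 edges over 5 nodes.
edges = np.array([[0, 1, 2, 3, 4, 0],
                  [1, 2, 3, 4, 0, 2]])
visible, masked = mask_edges(edges, mask_ratio=0.5, rng=rng)
```

The encoder sees only `visible`; the decoder is trained to predict the edges in `masked`, which is what makes MGM a self-supervised pretext task needing no external labels.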

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a constraint-aware and ranking-distilled token pruning method ToP, which selectively removes unnecessary tokens as the input sequence passes through the layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. Then, ToP introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning decisions within these layers through improved L0 regularization. Extensive experiments on GLUE benchmark and SQuAD tasks demonstrate that ToP outperforms state-of-the-art token pruning and model compression methods with improved accuracy and speedups. ToP reduces the average FLOPs of BERT by 8.1X while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4X on an Intel CPU. Code is available at https://github.com/microsoft/Moonlit/tree/main/ToP
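The core pruning operation, keeping only the highest-ranked tokens at a layer while preserving their order, can be sketched as follows. The hard-coded scores stand in for the distilled token rankings; in ToP itself they come from the final layer of the unpruned teacher, and the keep ratio is optimized rather than fixed:

```python
import numpy as np

def prune_tokens(hidden, scores, keep_ratio):
    # Keep the top-ranked tokens by importance score, in their original
    # positions, so later layers process a shorter sequence.
    seq_len = hidden.shape[0]
    keep = max(1, int(keep_ratio * seq_len))
    top = np.sort(np.argsort(-scores)[:keep])  # top-k indices, order preserved
    return hidden[top], top

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))  # 8 tokens, hidden dim 4
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.3])
pruned, kept = prune_tokens(hidden, scores, keep_ratio=0.5)
```

Since self-attention cost is quadratic in sequence length, halving the tokens at an early layer roughly quarters that layer's attention FLOPs, which is the source of the speedups reported above.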

Learning-Based Ad Auction Design with Externalities: The Framework and A Matching-Based Approach

Learning-based ad auctions have increasingly been adopted in online advertising. However, existing approaches neglect externalities, such as the interaction between ads and organic items. In this paper, we propose a general framework, namely Score-Weighted VCG, for designing learning-based ad auctions that account for externalities. The framework decomposes the optimal auction design into two parts: designing a monotone score function and an allocation algorithm, which facilitates data-driven implementation. Theoretical results demonstrate that this framework produces the optimal incentive-compatible and individually rational ad auction under various externality-aware CTR models while being data-efficient and robust. Moreover, we present an approach to implement the proposed framework with a matching-based allocation algorithm. Experiment results on both real-world and synthetic data illustrate the effectiveness of the proposed approach.

OPORP: One Permutation + One Random Projection

OPORP is a variant of the count-sketch data structure that uses a fixed-length binning scheme and a normalization step for the estimation. In our experience, engineers like the name "one permutation + one random projection" because it tells the exact steps. Consider two vectors (e.g., embeddings) u, v ∈ R^D with ρ = cos(u, v). In embedding-based applications (e.g., EBR), D = 256 to 4096 are common. With OPORP, we first apply a permutation on the data vectors. A vector r ∈ R^D is generated with i.i.d. entries satisfying E(r_i) = 0, E(r_i^2) = 1, E(r_i^3) = 0, E(r_i^4) = s, where s ≥ 1. We multiply r (as a Hadamard product) with all permuted data vectors. Then we break the D columns into k equal-length bins and aggregate (i.e., sum) the values in each bin to obtain k samples from each data vector. One crucial step is to normalize the k samples to the unit l2 norm. We show that the estimation variance equals (s − 1)A + (D − k)/(D − 1) · (1/k) · [(1 − ρ²)² − 2A], where A ≥ 0 and s ≥ 1, which reveals several key properties of the proposed scheme:

  • We need s = 1; otherwise the variance has a term that does not decrease with increasing sample size k.
  • The factor (D − k)/(D − 1) is beneficial in reducing variances, especially for short vectors, which are common in embeddings.
  • The term (1 − ρ²)² is a drastic variance reduction compared to (1 + ρ²), which is the variance term without normalization.

Moreover, the technique in our work also substantially improves the "very sparse random projections" (VSRP) in KDD'06. Another major use of OPORP will be in differential privacy (DP).
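As a concrete illustration, here is a minimal numpy sketch of the steps above: one shared permutation, multiplication by an i.i.d. vector r (here Rademacher, so s = E(r_i^4) = 1), fixed-length binning, and the l2 normalization. Variable names and the choice of a Rademacher r are our own assumptions:

```python
import numpy as np

def oporp_sketch(x, perm, r, k):
    # Permute, multiply by the shared vector r (Hadamard product),
    # sum within k equal-length bins, then normalize to unit l2 norm.
    D = x.shape[0]  # assumes k divides D
    z = (x[perm] * r).reshape(k, D // k).sum(axis=1)
    return z / np.linalg.norm(z)  # the crucial normalization step

rng = np.random.default_rng(0)
D, k = 1024, 64
perm = rng.permutation(D)              # the "one permutation", shared across vectors
r = rng.choice([-1.0, 1.0], size=D)    # E(r_i)=0, E(r_i^2)=1, E(r_i^3)=0, E(r_i^4)=s=1
u = rng.normal(size=D)
v = u + 0.2 * rng.normal(size=D)       # a similar vector
rho = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
est = oporp_sketch(u, perm, r, k) @ oporp_sketch(v, perm, r, k)  # estimates cos(u, v)
```

Note that the permutation and r must be generated once and shared by all data vectors; only then does the inner product of two normalized sketches estimate the cosine similarity.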

Multi-Temporal Relationship Inference in Urban Areas

Finding multiple temporal relationships among locations can benefit a range of urban applications, such as dynamic offline advertising and smart public transport planning. While some efforts have been made on finding static relationships among locations, little attention has been paid to time-aware location relationships. Indeed, abundant location-based human activities are time-varying, and the availability of these data enables a new paradigm for understanding the dynamic relationships among connected locations over a period of time. To this end, we propose to study a new problem, namely multi-Temporal relationship inference among locations (Trial for short), where the major challenge is how to integrate dynamic and geographical influence under the relationship-sparsity constraint. Specifically, we propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet) with two collaborative components: a spatially evolving graph convolution module (SEConv) and a spatially evolving self-supervised learning strategy (SE-SSL). SEConv performs intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing. In addition, SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with an additional evolving constraint to enhance location representation learning and further handle the relationship sparsity. Finally, experiments on four real-world datasets demonstrate the superiority of our method over several state-of-the-art approaches.

GraphSHA: Synthesizing Harder Samples for Class-Imbalanced Node Classification

Class imbalance is the phenomenon that some classes have much fewer instances than others, which is ubiquitous in real-world graph-structured scenarios. Recent studies find that off-the-shelf Graph Neural Networks (GNNs) would under-represent minor class samples. We investigate this phenomenon and discover that the subspaces of minor classes being squeezed by those of the major ones in the latent space is the main cause of this failure. This naturally inspires us to enlarge the decision boundaries of minor classes, and we propose a general framework, GraphSHA, that Synthesizes HArder minor samples. Furthermore, to avoid the enlarged minor boundaries violating the subspaces of neighbor classes, we also propose a module called SemiMixup to transmit enlarged boundary information to the interior of the minor classes while blocking information propagation from minor classes to neighbor classes. Empirically, GraphSHA shows its effectiveness in enlarging the decision boundaries of minor classes, as it outperforms various baseline methods in class-imbalanced node classification with different GNN backbone encoders over seven public benchmark datasets. Code is available at https://github.com/wenzhilics/GraphSHA.

HomoGCL: Rethinking Homophily in Graph Contrastive Learning

Contrastive learning (CL) has become the de-facto learning paradigm in self-supervised learning on graphs, which generally follows the "augmenting-contrasting'' learning scheme. However, we observe that unlike CL in the computer vision domain, CL in the graph domain performs decently even without augmentation. We conduct a systematic analysis of this phenomenon and argue that homophily, i.e., the principle that "like attracts like'', plays a key role in the success of graph CL. Inspired to leverage this property explicitly, we propose HomoGCL, a model-agnostic framework to expand the positive set using neighbor nodes with neighbor-specific significances. Theoretically, HomoGCL introduces a stricter lower bound of the mutual information between raw node features and node embeddings in augmented views. Furthermore, HomoGCL can be combined with existing graph CL models in a plug-and-play way with light extra computational overhead. Extensive experiments demonstrate that HomoGCL yields multiple state-of-the-art results across six public datasets and consistently brings notable performance improvements when applied to various graph CL methods. Code is available at https://github.com/wenzhilics/HomoGCL.

Learning Balanced Tree Indexes for Large-Scale Vector Retrieval

Vector retrieval focuses on finding the k-nearest neighbors of a query among a large set of data points, and is widely used in diverse areas such as information retrieval and recommender systems. Current state-of-the-art methods, represented by HNSW, usually generate indexes with a large memory footprint, restricting the scale of data they can handle unless they resort to a hybrid index with external storage. Space-partitioning learned indexes, which occupy only a small amount of memory, have made great breakthroughs in recent years. However, these methods rely on a large amount of labeled data for supervised learning, so model complexity limits their generalization.

To this end, we propose a lightweight learnable hierarchical space-partitioning index based on a balanced K-ary tree, called BAlanced Tree Learner (BATL), where each bucket of data points is represented by a path from the root to the corresponding leaf. Instead of mapping each query into a bucket, BATL classifies it into a sequence of branches (i.e., a path), which drastically reduces the number of classes and potentially improves generalization. BATL updates the classifier and the balanced tree in an alternating way. When updating the classifier, we innovatively leverage the sequence-to-sequence learning paradigm for learning to route each query to the ground-truth leaf of the balanced tree. Retrieval then boils down to a sequence (i.e., path) generation task, which can be achieved simply by beam search on the encoder-decoder. When updating the balanced tree, we apply the classifier to navigate each data point into the tree nodes layer by layer under the balance constraints. We finally evaluate BATL on several large-scale vector datasets, where the experimental results show the superiority of the proposed method over the SOTA baselines in the tradeoff among latency, accuracy, and memory cost.
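The path-as-label idea can be illustrated with a fixed base-K encoding of bucket indices. In BATL the tree and the routing are learned, so this stand-in only shows why paths shrink the output space:

```python
def bucket_to_path(bucket, K, depth):
    # Encode a leaf (bucket) index as a root-to-leaf sequence of branch
    # choices on a balanced K-ary tree: base-K digits, most significant first.
    path = []
    for _ in range(depth):
        path.append(bucket % K)
        bucket //= K
    return path[::-1]

def path_to_bucket(path, K):
    # Inverse mapping: follow the branch choices from the root to a leaf.
    bucket = 0
    for branch in path:
        bucket = bucket * K + branch
    return bucket

# With K = 16 branches and depth 3, the 4096 leaves are addressed with
# only 16 classes per decoding step instead of 4096 flat classes.
p = bucket_to_path(1234, K=16, depth=3)
```

A sequence-to-sequence model trained on such paths predicts one of K branches per step, and beam search over these per-step distributions recovers the most promising leaves at retrieval time.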

Urban Region Representation Learning with OpenStreetMap Building Footprints

The prosperity of crowdsourcing geospatial data provides increasing opportunities to understand our cities. In particular, OpenStreetMap (OSM) has become a prominent vault of geospatial data on the Web. In this context, learning urban region representations from OSM data, which is unexplored in previous work, could be profitable for various downstream tasks. In this work, we utilize OSM buildings (footprints) complemented with points of interest (POIs) to learn region representations, as buildings' shapes, spatial distributions, and properties have tight linkages to different urban functions. However, appealing as it seems, urban buildings often exhibit complex patterns to form dense or sparse areas, which brings significant challenges for unsupervised feature extraction. To address the challenges, we propose RegionDCL, an unsupervised framework to deeply mine urban buildings. In a nutshell, we leverage random points generated by Poisson Disk Sampling to tackle data-sparse areas and utilize triplet loss with a novel adaptive margin to preserve inter-region correlations. Furthermore, we train our model with group-level and region-level contrastive learning, making it adaptive to varying region partitions. Extensive experiments in two global cities demonstrate that RegionDCL consistently outperforms the state-of-the-art counterparts across different region partitions, and outputs effective representations for inferring urban land use and population density.

Machine Unlearning in Gradient Boosting Decision Trees

Various machine learning applications take users' data to train the models. Recently enforced legislation requires companies to remove users' data upon request, i.e., the right to be forgotten. In the context of machine learning, the trained model potentially memorizes the training data, so machine learning algorithms have to be able to unlearn the user data whose deletion is requested in order to meet this requirement. Gradient Boosting Decision Trees (GBDT) are widely deployed in many machine learning applications. However, few studies investigate unlearning in GBDT. This paper proposes a novel unlearning framework for GBDT. To the best of our knowledge, this is the first work that considers machine unlearning on GBDT. It is not straightforward to transfer the unlearning methods of DNNs to GBDT settings. We formalize the machine unlearning problem and its relaxed version. We propose an unlearning framework that efficiently and effectively unlearns a given collection of data without retraining the model from scratch. We introduce a collection of techniques, including random split point selection and random partitioning layers training, into the training process of the original tree models to ensure that the trained model requires few subtree retrainings during unlearning. We investigate which intermediate data and statistics to store as an auxiliary data structure during training so that we can immediately determine whether a subtree needs to be retrained without touching the original training dataset. Furthermore, a lazy update technique is proposed as a trade-off between unlearning time and model functionality. We experimentally evaluate our proposed methods on public datasets. The empirical results confirm the effectiveness of our framework.

MAP: A Model-agnostic Pretraining Framework for Click-through Rate Prediction

With the widespread application of online advertising systems, click-through rate (CTR) prediction has received more and more attention and research. The most prominent features of CTR prediction are its multi-field categorical data format and vast, daily-growing data volume (e.g., billions of user click logs). The large capacity of neural models helps digest such massive amounts of data under the supervised learning paradigm, yet they fail to utilize the substantial data to its full potential, since click signals alone are not sufficient for the model to learn capable representations of features and instances. The self-supervised learning paradigm provides a more promising pretrain-finetune solution to better exploit the large amount of user click logs and learn more robust and effective representations. However, current work along this line is still preliminary and rudimentary, leaving self-supervised learning for CTR prediction an open question. To this end, we propose a Model-agnostic Pretraining (MAP) framework that applies feature corruption and recovery on multi-field categorical data; more specifically, we derive two practical algorithms: masked feature prediction (MFP) and replaced feature detection (RFD). MFP digs into feature interactions within each instance by masking and predicting a small portion of input features, and we also introduce Noise Contrastive Estimation (NCE) to handle large feature spaces. RFD further turns MFP into a binary classification task by replacing and detecting changes in input features, making it even simpler and more effective for CTR pretraining. Our extensive experiments on two real-world million-level datasets (i.e., Avazu, Criteo) demonstrate the advantages of these two methods over several strong baselines, achieving a new state of the art in both performance and efficiency for CTR prediction.
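The MFP-style corruption step can be sketched on a single instance of multi-field categorical data. The field names, mask ratio, and mask token below are illustrative assumptions, not MAP's actual configuration:

```python
import numpy as np

def mask_fields(instance, mask_ratio, mask_token, rng):
    # Randomly mask a small portion of the categorical fields and return
    # the corrupted instance plus the recovery targets the model must predict.
    fields = list(instance.keys())
    num_masked = max(1, int(mask_ratio * len(fields)))
    chosen = rng.choice(len(fields), size=num_masked, replace=False)
    corrupted = dict(instance)
    targets = {}
    for i in chosen:
        targets[fields[i]] = corrupted[fields[i]]  # original value to recover
        corrupted[fields[i]] = mask_token
    return corrupted, targets

rng = np.random.default_rng(0)
x = {"user_id": 17, "item_id": 42, "hour": 9, "device": "ios"}
corrupted, targets = mask_fields(x, mask_ratio=0.25, mask_token="[MASK]", rng=rng)
```

RFD replaces the masked value with a sampled one instead of a mask token, and the pretraining task becomes detecting which fields were replaced, a binary decision per field rather than a prediction over the full feature vocabulary.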

Fire: An Optimization Approach for Fast Interpretable Rule Extraction

We present FIRE, Fast Interpretable Rule Extraction, an optimization-based framework to extract a small but useful collection of decision rules from tree ensembles. FIRE selects sparse representative subsets of rules from tree ensembles that are easy for a practitioner to examine. To further enhance the interpretability of the extracted model, FIRE encourages fusing rules during selection, so that many of the selected decision rules share common antecedents. The optimization framework accomplishes this with a fusion regularization penalty, along with a non-convex sparsity-inducing penalty to aggressively select rules. The optimization problems in FIRE pose a challenge to off-the-shelf solvers due to problem scale and the non-convexity of the penalties. To address this, we exploit problem structure to develop a specialized solver based on block coordinate descent principles; our solver performs up to 40x faster than existing solvers. We show in our experiments that FIRE outperforms state-of-the-art rule ensemble algorithms at building sparse rule sets, and can deliver more interpretable models than existing methods.

Communication Efficient Distributed Newton Method with Fast Convergence Rates

We propose a communication- and computation-efficient second-order method for distributed optimization. Each iteration of our method requires only O(d) communication complexity, where d is the problem dimension. We also provide theoretical analysis to show that the proposed method has a convergence rate similar to that of classical second-order optimization algorithms. Concretely, our method can find (ε, √(dLε))-second-order stationary points for nonconvex problems in O(√(dL) ε^(−3/2)) iterations, where L is the Lipschitz constant of the Hessian. Moreover, it enjoys local superlinear convergence under the strongly convex assumption. Experiments on both convex and nonconvex problems show that our proposed method performs significantly better than the baselines.

Robust Spatiotemporal Traffic Forecasting with Reinforced Dynamic Adversarial Training

Machine learning-based forecasting models are commonly used in Intelligent Transportation Systems (ITS) to predict traffic patterns and provide city-wide services. However, most of the existing models are susceptible to adversarial attacks, which can lead to inaccurate predictions and negative consequences such as congestion and delays. Therefore, improving the adversarial robustness of these models is crucial for ITS. In this paper, we propose a novel framework for incorporating adversarial training into spatiotemporal traffic forecasting tasks. We demonstrate that traditional adversarial training methods designed for static domains cannot be directly applied to traffic forecasting tasks, as they fail to effectively defend against dynamic adversarial attacks. Then, we propose a reinforcement learning-based method to learn the optimal node selection strategy for adversarial examples, which simultaneously strengthens the dynamic attack defense capability and reduces model overfitting. Additionally, we introduce a self-knowledge distillation regularization module to overcome the "forgetting issue" caused by continuously changing adversarial nodes during training. We evaluate our approach on two real-world traffic datasets and demonstrate its superiority over other baselines. Our method effectively enhances the adversarial robustness of spatiotemporal traffic forecasting models. The source code for our framework is available at https://github.com/usail-hkust/RDAT.

Discovering Dynamic Causal Space for DAG Structure Learning

Discovering causal structure from purely observational data (i.e., causal discovery), which aims to identify causal relationships among variables, is a fundamental task in machine learning. The recent invention of differentiable score-based DAG learners is a crucial enabler, reframing the combinatorial optimization problem as a differentiable optimization with a DAG constraint over the directed-graph space. Despite their great success, these cutting-edge DAG learners use DAG-ness-independent score functions to evaluate the directed graph candidates, failing to take the graph structure into account. As a result, measuring data fitness alone, regardless of DAG-ness, inevitably leads to discovering suboptimal DAGs and model vulnerabilities.

Towards this end, we propose a dynamic causal space for DAG structure learning, coined CASPER, which integrates the graph structure into the score function as a new measure in the causal space to faithfully reflect the causal distance between the estimated and ground-truth DAGs. CASPER revises the learning process and enhances DAG structure learning via adaptive attention to DAG-ness. Grounded by empirical visualization, CASPER, as a space, satisfies a series of desired properties, such as structure awareness and noise robustness. Extensive experiments on both synthetic and real-world datasets clearly validate the superiority of CASPER over state-of-the-art causal discovery methods in terms of accuracy and robustness.

Meta Multi-agent Exercise Recommendation: A Game Application Perspective

Exercise recommendation is a fundamental and important task in E-learning systems, facilitating students' personalized learning. Most existing exercise recommendation algorithms design a scoring criterion (e.g., weakest mastery, lowest historical correctness) based on experience, and then recommend exercises for the selected knowledge concepts (KCs). These algorithms rely entirely on the scoring criteria, treating exercise recommendation as a centralized system. However, it is a complex problem for a centralized system to choose a limited number of exercises in a period of time so as to consolidate and learn the KCs efficiently. Moreover, different groups of students (e.g., different countries, schools, or classes) have different solutions for the same group of KCs according to their own situations, in the spirit of competency-based instruction. Therefore, we propose Meta Multi-Agent Exercise Recommendation (MMER). Specifically, we design a multi-agent exercise recommendation module, in which the KCs involved in exercises are treated as agents with competition and cooperation among them, and a meta-training stage that learns a robust recommendation module for new student groups. Extensive experiments on real-world datasets validate the satisfactory performance of the proposed model. Furthermore, the effectiveness of the multi-agent and meta-training components is demonstrated in recommendation applications.

Semi-Supervised Graph Imbalanced Regression

Data imbalance is easily found in annotated data when the observations of certain continuous label values are difficult to collect for regression tasks. For molecule and polymer property prediction, the annotated graph datasets are often small because labeling them requires expensive equipment and effort. To address the lack of examples of rare label values in graph regression tasks, we propose a semi-supervised framework to progressively balance the training data and reduce model bias via self-training. The training-data balance is achieved by (1) pseudo-labeling more graphs for under-represented labels with a novel regression confidence measurement and (2) augmenting graph examples in latent space for the labels that remain rare after balancing with pseudo-labels. The former identifies quality examples from unlabeled data whose labels are confidently predicted, and samples a subset of them with a distribution that reverses that of the imbalanced annotated data. The latter collaborates with the former to target a perfect balance using a novel label-anchored mixup algorithm. We perform experiments on seven regression tasks on graph datasets. Results demonstrate that the proposed framework significantly reduces the error of predicted graph properties, especially in under-represented label areas.
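The reverse-distribution sampling step can be sketched as follows. The binned label counts and the simple inverse-count weighting are illustrative simplifications of the sampling scheme described above:

```python
import numpy as np

def reverse_sampling_weights(label_counts):
    # Build a sampling distribution over label bins that is the reverse of
    # the labeled-data distribution: rare label bins get the highest weight,
    # so confidently pseudo-labeled graphs from those bins are sampled most.
    counts = np.asarray(label_counts, dtype=float)
    weights = 1.0 / counts
    return weights / weights.sum()

# Labeled counts per label bin: bin 2 is rare, so it is sampled most often.
w = reverse_sampling_weights([100, 50, 5])
```

Drawing pseudo-labeled examples with these weights pushes the combined training set toward a uniform label distribution, which is the balance the framework targets before applying label-anchored mixup to any bins that remain rare.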

Enhancing Graph Representations Learning with Decorrelated Propagation

In recent years, graph neural networks (GNNs) have been widely used in many domains due to their powerful capability in representation learning on graph-structured data. While a majority of extant studies focus on mitigating the over-smoothing problem, recent works also reveal a limitation of GNNs from a new over-correlation perspective, which states that the learned representations become highly correlated after feature transformation and propagation in GNNs. In this paper, we thoroughly re-examine the issue of over-correlation in deep GNNs, both empirically and theoretically. We demonstrate that the propagation operator in GNNs exacerbates the feature correlation. In addition, we discover through empirical studies that existing decorrelation solutions fall short of maintaining low feature correlation, potentially encoding redundant information. Thus, to more effectively address the over-correlation problem, we propose a decorrelated propagation scheme (DeProp) as a fundamental component to decorrelate the feature learning in GNN models, which achieves feature decorrelation at the propagation step. Comprehensive experiments on multiple real-world datasets demonstrate that DeProp can be easily integrated into prevalent GNNs, leading to significant performance enhancements. Furthermore, we find that it can be used to solve the over-smoothing and over-correlation problems simultaneously and significantly outperforms state-of-the-art methods in missing-feature settings. The code is available at https://github.com/hualiu829/DeProp.

Guiding Mathematical Reasoning via Mastering Commonsense Formula Knowledge

Math formulas (e.g., "distance = speed × time'') serve as fundamental commonsense knowledge in human cognition: humans naturally acquire and manipulate them in logical thinking for mathematical reasoning problems. However, existing reasoning models mainly focus on learning heuristic linguistic cues or patterns to generate answers, and do not pay enough attention to learning with such formula knowledge. Thus, they are not transparent (and hence uninterpretable) in terms of understanding and grasping basic mathematical logic. In this paper, to promote a step forward in the domain, we first construct two datasets (Math23K-F and MAWPS-F) with precise annotations of formula usage in each reasoning step for math word problems. Our datasets are refined from the benchmark datasets, ensuring generality and comparability for relevant research. Then, we propose a novel Formula-mastered Solver (FOMAS) guided by mastered formula knowledge to solve the problems. Specifically, drawing insight from dual-process theory, we establish FOMAS with two systems, a Knowledge System and a Reasoning System, to learn and apply formula knowledge, respectively. The Knowledge System accumulates math formulas, where we propose a novel pretraining manner to mimic how humans grasp the mathematical logic behind them. Then, in the Reasoning System, we develop elaborate formula-guided symbol prediction and goal generation methods that retrieve the necessary formula knowledge from the Knowledge System to improve both reasoning accuracy and interpretability. It organically simulates how humans conduct complex reasoning under the explicit instruction of math formulas. Experimental results show that FOMAS has stronger reasoning ability and achieves a more interpretable reasoning process, verifying the necessity of introducing formula knowledge transparently.

B2-Sampling: Fusing Balanced and Biased Sampling for Graph Contrastive Learning

Graph contrastive learning (GCL), aiming for an embedding space where semantically similar nodes are closer, has been widely applied in graph-structured data. Researchers have proposed many approaches to define positive and negative pairs (i.e., semantically similar and dissimilar pairs) on the graph, serving as labels to learn their embedding distances. Despite the effectiveness, those approaches usually suffer from two typical learning challenges. First, the number of candidate negative pairs is enormous. Thus, it is non-trivial to select representative ones to train the model in a more effective way. Second, the heuristics (e.g., graph views or meta-path patterns) to define positive and negative pairs are sometimes less reliable, causing considerable noise for both "labelled'' positive and negative pairs. In this work, we propose a novel sampling approach B2-Sampling to address the above challenges in a unified way. On the one hand, we use balanced sampling to select the most representative negative pairs regarding both the topological and embedding diversities. On the other hand, we use biased sampling to learn and correct the labels of the most error-prone negative pairs during the training. The balanced and biased samplings can be applied iteratively for discriminating and correcting training pairs, boosting the performance of GCL models. B2-Sampling is designed as a framework to support many known GCL models. Our extensive experiments on node classification, node clustering, and graph classification tasks show that B2-Sampling significantly improves the performance of GCL models with acceptable runtime overhead. Our website https://sites.google.com/view/b2-sampling/home provides access to our codes and additional experiment results.

Using Motif Transitions for Temporal Graph Generation

Graph generative models are highly important for sharing surrogate data and for benchmarking purposes. Real-world complex systems often exhibit a dynamic nature, where the interactions among nodes change over time in the form of a temporal network. Most temporal network generation models extend static graph generation models by incorporating temporality into the generation process. More recently, temporal motifs have been used to generate temporal networks with better success. However, existing models are often restricted to a small set of predefined motif patterns due to the high computational cost of counting temporal motifs. In this work, we develop a practical temporal graph generator, the Motif Transition Model (MTM), to generate synthetic temporal networks with realistic global and local features. Our key idea is to model the arrival of new events as temporal motif transition processes. We first calculate the transition properties from the input graph and then simulate the motif transition processes based on the transition probabilities and transition rates. We demonstrate that our model consistently outperforms the baselines in preserving various global and local temporal graph statistics, as well as in runtime performance.

Fairness-Aware Continuous Predictions of Multiple Analytics Targets in Dynamic Networks

We study a novel problem of continuously predicting a number of user-subscribed continuous analytics targets (CATs) in dynamic networks. Our architecture includes any dynamic graph neural network model as the back end applied over the network data, and per-CAT front-end models that return results with their confidence to users. We devise a data filtering algorithm that feeds a provably optimal subset of data in the embedding space from the back-end model to the front-end models. In addition, to ensure fairness in terms of query result accuracy for different CATs and users, we propose a fairness metric and a fairness-aware training scheduling algorithm, along with accuracy guarantees on fairness estimation. Our experiments over five real-world datasets show that our proposed solution is effective, efficient, fair, extensible, and adaptive.

Generative Flow Network for Listwise Recommendation

Personalized recommender systems fulfill the daily demands of customers and boost online businesses. The goal is to learn a policy that can generate a list of items matching the user's demand or interest. While most existing methods learn a pointwise scoring model that predicts the ranking score of each individual item, recent research shows that the listwise approach can further improve recommendation quality by modeling the intra-list correlations of items that are exposed together. This has motivated recent list reranking and generative recommendation approaches that optimize the overall utility of the entire list. However, it is challenging to explore the combinatorial space of list actions, and existing methods that use cross-entropy loss may suffer from low diversity. In this work, we aim to learn a policy that can generate sufficiently diverse item lists for users while maintaining high recommendation quality. The proposed solution, GFN4Rec, is a generative method that draws on the insight of flow networks to ensure alignment between the list generation probability and its reward. The key advantages of our solution are a log-scale reward matching loss that intrinsically improves generation diversity and an autoregressive item selection model that captures mutual influences among items while accounting for the future reward of the list. To validate our method's effectiveness and its superior diversity during active exploration, we conduct experiments on simulated online environments as well as in an offline evaluation framework on two real-world datasets.

Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint

A self-explaining rationalization model is generally constructed via a cooperative game in which a generator selects the most human-intelligible pieces from the input text as rationales, followed by a predictor that makes predictions based on the selected rationales. However, such a cooperative game may incur the degeneration problem, where the predictor overfits to uninformative pieces generated by a not-yet-well-trained generator and, in turn, leads the generator to converge to a sub-optimal model that tends to select senseless pieces. In this paper, we theoretically link degeneration to the predictor's Lipschitz continuity. We then propose a simple but effective method, named DR, which can naturally and flexibly restrain the Lipschitz constant of the predictor to address the degeneration problem. The main idea of DR is to decouple the generator and predictor and allocate them asymmetric learning rates. A series of experiments conducted on two widely used benchmarks verify the effectiveness of the proposed method. Code: https://github.com/jugechengzi/Rationalization-DR.

FLOOD: A Flexible Invariant Learning Framework for Out-of-Distribution Generalization on Graphs

Graph Neural Networks (GNNs) have achieved remarkable success in various domains, but most are developed under the in-distribution assumption. Under out-of-distribution (OOD) settings, they suffer from the distribution shift between the training set and the test set and may not generalize well to the test distribution. Several methods have applied the invariance principle to improve the generalization of GNNs in OOD settings. However, in previous solutions, the graph encoder is immutable after invariant learning and cannot be flexibly adapted to the target distribution. Confronted with distribution shift, a flexible encoder refined toward the target distribution can generalize better on the test set than a fixed invariant encoder. To remedy these weaknesses, we propose a Flexible invariant Learning framework for Out-Of-Distribution generalization on graphs (FLOOD), which comprises two key components: invariant learning and bootstrapped learning. The invariant learning component constructs multiple environments from graph data augmentation and learns invariant representations under risk extrapolation. In addition, the bootstrapped learning component is trained in a self-supervised way with a graph encoder shared with the invariant learning part. During the test phase, the shared encoder can be flexibly refined with bootstrapped learning on the test set. Extensive experiments are conducted for both transductive and inductive node classification tasks. The results demonstrate that FLOOD consistently outperforms other graph OOD generalization methods and effectively improves generalization ability.

Learning Strong Graph Neural Networks with Weak Information

Graph Neural Networks (GNNs) have exhibited impressive performance in many graph learning tasks. Nevertheless, the performance of GNNs can deteriorate when the input graph data suffer from weak information, i.e., incomplete structure, incomplete features, and insufficient labels. Most prior studies, which attempt to learn from the graph data with a specific type of weak information, are far from effective in dealing with the scenario where diverse data deficiencies exist and mutually affect each other. To fill the gap, in this paper, we aim to develop an effective and principled approach to the problem of graph learning with weak information (GLWI). Based on the findings from our empirical analysis, we derive two design focal points for solving the problem of GLWI, i.e., enabling long-range propagation in GNNs and allowing information propagation to those stray nodes isolated from the largest connected component. Accordingly, we propose D2PT, a dual-channel GNN framework that performs long-range information propagation not only on the input graph with incomplete structure, but also on a global graph that encodes global semantic similarities. We further develop a prototype contrastive alignment algorithm that aligns the class-level prototypes learned from two channels, such that the two different information propagation processes can mutually benefit from each other and the finally learned model can well handle the GLWI problem. Extensive experiments on eight real-world benchmark datasets demonstrate the effectiveness and efficiency of our proposed methods in various GLWI scenarios.

QTIAH-GNN: Quantity and Topology Imbalance-aware Heterogeneous Graph Neural Network for Bankruptcy Prediction

The timely prediction of bankruptcy is highly desirable to guarantee an upward spiral for overall societal well-being. By extracting multifaceted information from business interaction networks, Graph Neural Networks (GNNs) may be able to automatically make more informed bankruptcy predictions than methods that rely heavily on abundant manpower. Yet in real applications, bankruptcy prediction faces the key issue of quantity imbalance: data usually follow a long-tailed distribution in which bankrupt corporates account for the smallest proportion of the data yet are precisely the targets to be identified. Apart from that, the topology-imbalance issue behind graph-structured data exacerbates prediction deterioration: feature propagation is dominated by non-bankrupt nodes through message passing between nodes; thus, bankrupt nodes receive highly confusing information and can easily be assimilated by nearby non-bankrupt nodes. Unfortunately, existing GNN methods are not immune to these two imbalance issues. To tackle this challenging but practically useful scenario, we propose a novel bankruptcy prediction model, the Quantity and Topology Imbalance-Aware Heterogeneous Graph Neural Network (QTIAH-GNN), to boost the final performance. Specifically, QTIAH-GNN employs multi-hierarchy label-aware neighbor selection to conquer the topology-imbalance issue, using a class-semantic representation and a learnable parameterized similarity metric, and employs an imbalance-oriented loss to obtain the optimal tradeoff between the accuracies of the majority and minority classes. In experiments, we evaluate the proposed QTIAH-GNN on two large-scale, real-world datasets. The results show that QTIAH-GNN outperforms other state-of-the-art baselines in prediction accuracy with superior efficiency and generalization ability, has stronger robustness to data imbalance, and provides meaningful model interpretation.

Multi-Grained Multimodal Interaction Network for Entity Linking

Multimodal entity linking (MEL), which aims at resolving ambiguous mentions to a multimodal knowledge graph, has attracted wide attention in recent years. Although large efforts have been made to explore the complementary effect among multiple modalities, existing methods may fail to fully absorb the comprehensive expression of abbreviated textual context and implicit visual indication. Even worse, inevitably noisy data may cause inconsistency across modalities during the learning process, which severely degrades performance. To address the above issues, in this paper we propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task. Specifically, the unified inputs of mentions and entities are first encoded by textual/visual encoders separately, to extract global descriptive features and local detailed features. Then, to derive the similarity matching score for each mention-entity pair, we devise three interaction units to comprehensively explore the intra-modal interaction and inter-modal fusion among features of entities and mentions. In particular, three modules, namely the Text-based Global-Local interaction Unit (TGLU), Vision-based DuaL interaction Unit (VDLU), and Cross-Modal Fusion-based interaction Unit (CMFU), are designed to capture and integrate the fine-grained representations lying in abbreviated text and implicit visual cues. Afterwards, we introduce a unit-consistency objective function via contrastive learning to avoid inconsistency and model degradation. Experimental results on three public benchmark datasets demonstrate that our solution outperforms various state-of-the-art baselines, and ablation studies verify the effectiveness of the designed modules.

Physics-Guided Discovery of Highly Nonlinear Parametric Partial Differential Equations

Partial differential equations (PDEs) that fit scientific data can represent physical laws with explainable mechanisms for various mathematically-oriented subjects, such as physics and finance. The data-driven discovery of PDEs from scientific data thrives as a new attempt to model complex phenomena in nature, but the effectiveness of current practice is typically limited by the scarcity of data and the complexity of the phenomena. In particular, the discovery of PDEs with highly nonlinear coefficients from low-quality data remains largely under-addressed. To deal with this challenge, we propose a novel physics-guided learning method, which can not only encode observational knowledge such as initial and boundary conditions but also incorporate basic physical principles and laws to guide model optimization. We theoretically show that our proposed method strictly reduces the coefficient estimation error of existing baselines and is also robust against noise. Extensive experiments show that the proposed method is more robust against data noise and can reduce the estimation error by a large margin. Moreover, all the PDEs in the experiments are correctly discovered, and for the first time we are able to discover three-dimensional PDEs with highly nonlinear coefficients.

Augmenting Recurrent Graph Neural Networks with a Cache

While graph neural networks (GNNs) provide a powerful way to learn structured representations, it remains challenging to learn long-range dependencies in graphs. Recurrent GNNs only partly address this problem. In this paper, we propose a general approach for augmenting recurrent GNNs with a cache memory to improve their expressivity, especially for modeling long-range dependencies. Specifically, we first introduce a method of augmenting recurrent GNNs with a cache of previous hidden states. We then propose a general Cache-GNN framework that adds further modules, including an attention mechanism and positional/structural encoders, to improve the expressivity. We show that Cache-GNNs outperform other models on synthetic datasets as well as on tasks over real-world datasets that require long-range information.

Learning for Counterfactual Fairness from Observational Data

Fairness-aware machine learning has attracted a surge of attention in many domains, such as online advertising, personalized recommendation, and social media analysis in web applications. Fairness-aware machine learning aims to eliminate biases of learning models against certain subgroups described by protected (sensitive) attributes such as race, gender, and age. Among the many existing fairness notions, counterfactual fairness is a popular notion defined from a causal perspective. It measures the fairness of a predictor by comparing each individual's prediction in the original world with that in counterfactual worlds in which the value of the sensitive attribute is modified. A prerequisite for existing methods to achieve counterfactual fairness is prior human knowledge of the causal model for the data. However, in real-world scenarios, the underlying causal model is often unknown, and acquiring such knowledge can be very difficult. In these scenarios, it is risky to directly trust causal models obtained from information sources of unknown reliability, or even from causal discovery methods, as incorrect causal models can bring biases into the predictor and lead to unfair predictions. In this work, we address the problem of counterfactually fair prediction from observational data without a given causal model by proposing a novel framework, CLAIRE. Specifically, under certain general assumptions, CLAIRE effectively mitigates biases from the sensitive attribute with a representation learning framework based on counterfactual data augmentation and an invariant penalty. Experiments conducted on both synthetic and real-world datasets validate the superiority of CLAIRE in both counterfactual fairness and prediction performance.

Towards Graph-level Anomaly Detection via Deep Evolutionary Mapping

Graph-level anomaly detection aims at capturing anomalous individual graphs in a graph set. Due to its significance in various real-world application fields, e.g., identifying rare molecules in chemistry and detecting potential frauds in online social networks, graph-level anomaly detection has received great attention recently. In contrast to node- and edge-level anomaly detection, which is devoted to identifying anomalies on a single graph, graph-level anomaly detection faces more significant challenges because both intra- and inter-graph structural and attribute patterns need to be taken into account to distinguish anomalies that exhibit deviating structures, rare attributes, or both. Although deep graph representation learning is effective at fusing high-level representations and capturing the characteristics of individual graphs, most existing works fall short in graph-level anomaly detection because of their limited capability in exploring information across graphs, the imbalanced data distribution of anomalies, and the low interpretability of black-box graph neural networks (GNNs). To overcome these limitations, we propose a novel deep evolutionary graph mapping framework named GmapAD, which can adaptively map each graph into a new feature space based on its similarity to a set of representative nodes chosen from the graph set. By automatically adjusting the candidate nodes using a specially designed evolutionary algorithm, anomalies and normal graphs are mapped to separate areas of the new feature space, where a clear boundary between them can be learned. The selected candidate nodes can therefore be regarded as a benchmark for explaining anomalies, because anomalies are more dissimilar/similar to the benchmark than normal graphs are.
Through our extensive experiments on nine real-world datasets, we demonstrate that exploring both intra- and inter- graph structural and attribute information is critical to spot anomalous graphs, and our method has achieved statistically significant improvements compared to the state of the art in terms of precision, recall, F1 score, and AUC.

Context-aware Event Forecasting via Graph Disentanglement

Event forecasting has been a demanding and challenging task throughout human history. It plays a pivotal role in crisis alarming and disaster prevention across many aspects of society. The task of event forecasting aims to model the relational and temporal patterns of historical events and to forecast what will happen in the future. Most existing studies on event forecasting formulate it as a problem of link prediction on temporal event graphs. However, such a purely structured formulation suffers from two main limitations: 1) most events fall into general and high-level types in the event ontology, and therefore tend to be coarse-grained and offer little utility, which inevitably harms forecasting accuracy; and 2) the events defined by a fixed ontology are unable to retain out-of-ontology contextual information.

To address these limitations, we propose a novel task of context-aware event forecasting, which incorporates auxiliary contextual information. First, the categorical context provides supplementary fine-grained information for the coarse-grained events. Second, and more importantly, the context provides additional information about the specific situation and conditions, which is crucial, or even decisive, for what will happen next. However, it is challenging to properly integrate context into the event forecasting framework, given the complex patterns in the multi-context scenario. To this end, we design a novel framework named Separation and Collaboration Graph Disentanglement (SeCoGD for short) for context-aware event forecasting. In the separation stage, we leverage the context as prior guidance to disentangle the event graph into multiple sub-graphs, followed by a context-specific module that models the relational-temporal patterns within each context. In the collaboration stage, we design a cross-context module to retain the collaborative associations among multiple contexts. Since there is no available dataset for this novel task, we construct three large-scale datasets based on GDELT. Experimental results demonstrate that our model outperforms a list of SOTA methods. The dataset and code are released via https://github.com/yecchen/SeCoGD.

Querywise Fair Learning to Rank through Multi-Objective Optimization

In Learning-to-Rank (LTR) problems, the task of delivering relevant search results can conflict with that of allocating fair exposure to items of a protected group. Previous works in fair LTR have attempted to resolve this by combining the objectives of relevant ranking and fair ranking into a single linear combination, but this approach is limited by the nonconvexity of the objective functions and can result in suboptimal relevance of the ranking outputs. To address this, we propose a solution using Multi-Objective Optimization (MOO) algorithms. We extend these algorithms to querywise MOO to reduce the exposure disparity not only on average but also at the query level. Interestingly, for moderate fairness requirements, this improves ranking relevance instead of degrading it. We attribute this improvement to the benefits of multi-task learning and study the effect of the fair ranking task on the relevant ranking task. Moreover, we significantly improve computational efficiency compared to previous methods by using the Gumbel-max trick to sample from the Plackett-Luce distribution. We evaluate our proposed methods on three real-world datasets and show their improvement in relevance ranking over state-of-the-art solutions.
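The Gumbel-max trick mentioned in this abstract is a standard device: perturbing each item's log-score with i.i.d. Gumbel(0, 1) noise and sorting in descending order yields an exact sample from the Plackett-Luce distribution over rankings. A minimal sketch of that trick (the function name is ours, not from the paper):

```python
import math
import random

def sample_plackett_luce(log_scores):
    """Draw one ranking from a Plackett-Luce distribution via the
    Gumbel-max trick: perturb each log-score with i.i.d. Gumbel(0, 1)
    noise and sort descending. The result is distributed exactly as
    sequential sampling without replacement with selection probabilities
    proportional to exp(log_score)."""
    # -log(-log(U)) with U ~ Uniform(0, 1) is a standard Gumbel sample.
    perturbed = [s - math.log(-math.log(random.random())) for s in log_scores]
    return sorted(range(len(log_scores)), key=lambda i: perturbed[i], reverse=True)

ranking = sample_plackett_luce([2.0, 0.5, 1.0, -1.0])
print(ranking)  # a permutation of [0, 1, 2, 3]; item 0 is most often first
```

The attraction inside a training loop is that one sample costs a single sort, instead of the sequential without-replacement draws that the Plackett-Luce definition suggests.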

Online Fairness Auditing through Iterative Refinement

A sizable proportion of deployed machine learning models make their decisions in a black-box manner. Such decision-making procedures are susceptible to intrinsic biases, which has led to a call for accountability in deployed decision systems. In this work, we investigate mechanisms that help audit claimed mathematical guarantees of the fairness of such systems. We construct AVOIR, a system that reduces the number of observations required for the runtime monitoring of probabilistic assertions over fairness metrics specified on decision functions associated with black-box AI models. AVOIR provides an adaptive process that automates the inference of probabilistic guarantees associated with estimating a wide range of fairness metrics. In addition, AVOIR enables the exploration of fairness violations aligned with governance and regulatory requirements. We conduct case studies with fairness metrics on three different datasets and demonstrate how AVOIR can help detect and localize fairness violations and ameliorate the issues with faulty fairness metric design.

End-to-End Inventory Prediction and Contract Allocation for Guaranteed Delivery Advertising

Guaranteed Delivery (GD) advertising plays an essential part in e-commerce marketing, where the ad publisher signs contracts with advertisers in advance by promising delivery of advertising impressions to fulfill targeting requirements for advertisers. Previous research on GD advertising mainly focused on online serving yet overlooked the importance of contract allocation at the GD selling stage. Traditional GD selling approaches consider impression inventory prediction and contract allocation as two separate stages. However, such a two-stage optimization often leads to inferior contract allocation performance. In this paper, our goal is to reduce this performance gap with a novel end-to-end approach. Specifically, we propose the Neural Lagrangian Selling (NLS) model to jointly predict the impression inventory and optimize the contract allocation of advertising impressions with a unified learning objective. To this end, we first develop a differentiable Lagrangian layer to backpropagate the allocation problem through the neural network and allow direct optimization of the allocation regret. Then, for effective optimization with various allocation targets and constraints, we design a graph convolutional neural network to extract predictive features from the bipartite allocation graph. Extensive experiments show that our approach can improve GD selling performance compared with existing two-stage approaches. Particularly, our optimization layer can outperform the baseline solvers in both computational efficiency and solution quality. To the best of our knowledge, this is the first study to apply the end-to-end prediction and optimization approach for industrial GD selling problems. Our work has implications for general prediction and allocation problems as well.

Impatient Bandits: Optimizing Recommendations for the Long-Term Without Delay

Recommender systems are a ubiquitous feature of online platforms. Increasingly, they are explicitly tasked with increasing users' long-term satisfaction. In this context, we study a content exploration task, which we formalize as a multi-armed bandit problem with delayed rewards. We observe that there is an apparent trade-off in choosing the learning signal: Waiting for the full reward to become available might take several weeks, hurting the rate at which learning happens, whereas measuring short-term proxy rewards reflects the actual long-term goal only imperfectly. We address this challenge in two steps. First, we develop a predictive model of delayed rewards that incorporates all information obtained to date. Full observations as well as partial (short or medium-term) outcomes are combined through a Bayesian filter to obtain a probabilistic belief. Second, we devise a bandit algorithm that takes advantage of this new predictive model. The algorithm quickly learns to identify content aligned with long-term success by carefully balancing exploration and exploitation. We apply our approach to a podcast recommendation problem, where we seek to identify shows that users engage with repeatedly over two months. We empirically validate that our approach results in substantially better performance compared to approaches that either optimize for short-term proxies, or wait for the long-term outcome to be fully realized.
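The abstract does not spell out the filter, so the following is only an illustrative sketch of the general idea, not the paper's model: maintain a conjugate Beta-Bernoulli belief over a show's per-period engagement probability, refine it as partial outcomes arrive, and let a bandit policy (here Thompson sampling, one common choice) act on the current belief. All names and the binary-engagement assumption are ours:

```python
import random

def update(belief, successes, trials):
    """Conjugate Beta update: fold in however many periods have been
    observed so far, without waiting for the full reward horizon."""
    a, b = belief
    return (a + successes, b + (trials - successes))

def thompson_pick(beliefs):
    """Thompson sampling: draw one success probability from each
    arm's Beta posterior and recommend the argmax."""
    samples = [random.betavariate(a, b) for a, b in beliefs]
    return max(range(len(beliefs)), key=lambda i: samples[i])

beliefs = [(1, 1), (1, 1)]             # uniform priors for two shows
beliefs[0] = update(beliefs[0], 3, 4)  # show 0: engaged 3 of 4 observed weeks
beliefs[1] = update(beliefs[1], 0, 2)  # show 1: engaged 0 of 2 observed weeks
arm = thompson_pick(beliefs)           # usually show 0, but still explores
```

The point of the sketch is the interface: partial observations tighten the posterior early, so the bandit need not wait two months for the full outcome before learning.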

Hyper-USS: Answering Subset Query Over Multi-Attribute Data Stream

Sketching algorithms are promising solutions for answering approximate queries over massive data streams. In real scenarios, a large number of problems can be abstracted as subset queries over multiple attributes. Existing sketches are designed for queries on single attributes, and are therefore inefficient for queries on multiple attributes. In this work, we propose Hyper-USS, an innovative sketching algorithm that supports subset queries over multiple attributes accurately and efficiently. To the best of our knowledge, this work is the first sketching algorithm designed to answer approximate queries over multi-attribute data streams. We utilize a key technique, Joint Variance Optimization, to guarantee high estimation accuracy on all attributes. Experimental results show that, compared with state-of-the-art (SOTA) sketches that support subset queries on single attributes, Hyper-USS improves accuracy by 16.67× and throughput by 8.54×. The code is open-sourced on GitHub.

Densest Diverse Subgraphs: How to Plan a Successful Cocktail Party with Diversity

Dense subgraph discovery methods are routinely used in a variety of applications including the identification of a team of skilled individuals for collaboration from a social network. However, when the network's node set is associated with a sensitive attribute such as race, gender, religion, or political opinion, the lack of diversity can lead to lawsuits.

In this work, we focus on the problem of finding a densest diverse subgraph in a graph whose nodes have different attribute values/types, which we refer to as colors. We propose two novel formulations motivated by different realistic scenarios. Our first formulation, called the densest diverse subgraph problem (DDSP), guarantees that no color represents more than some fraction of the nodes in the output subgraph, which generalizes the state-of-the-art due to Anagnostopoulos et al. (CIKM 2020). By varying this fraction we can tune the diversity constraint and interpolate between a diverse dense subgraph, where all colors have to be equally represented, and an unconstrained dense subgraph. We design a scalable Ω(1/√n)-approximation algorithm, where n is the number of nodes. Our second formulation is motivated by the setting where no specified color should be overlooked. We propose the densest at-least-k⃗-subgraph problem (Dalk⃗S), a novel generalization of the classic DalkS problem, where instead of a single value k, we have a vector k⃗ of cardinality demands with one coordinate per color class. We design a 1/3-approximation algorithm using linear programming together with an acceleration technique. Computational experiments using synthetic and real-world datasets demonstrate that our proposed algorithms are effective in extracting dense diverse clusters.

Learning to Relate to Previous Turns in Conversational Search

Conversational search allows a user to interact with a search system over multiple turns. A query is strongly dependent on the conversation context. An effective way to improve retrieval effectiveness is to expand the current query with historical queries. However, not all previous queries are related to, or useful for expanding, the current query. In this paper, we propose a new method to select relevant historical queries that are useful for the current query. To cope with the lack of labeled training data, we use a pseudo-labeling approach to annotate useful historical queries based on their impact on the retrieval results. The pseudo-labeled data are used to train a selection model. We further propose a multi-task learning framework to jointly train the selector and the retriever during fine-tuning, allowing us to mitigate the possible inconsistency between the pseudo labels and the changed retriever. Extensive experiments on four conversational search datasets demonstrate the effectiveness and broad applicability of our method compared with several strong baselines.

Online Level-wise Hierarchical Clustering

Online hierarchical clustering algorithms, compared to their scalable batch-setting counterparts, typically provide more limited accuracy and efficiency. Yet, when data arrives incrementally, a crucial setting in many clustering applications (e.g., entity resolution and concept discovery), these batch-setting algorithms do not apply. This paper presents a family of new algorithms for online hierarchical clustering that combine high-quality trees with fast per-point insertion time, made possible through a limited number of parallel non-greedy tree rearrangements. We analyze our methods under assumptions about the data and the separability of clusters. Empirically, we find that our proposed algorithms yield state-of-the-art results in hierarchical clustering dendrogram purity and in building compressed prototypes for a k-nearest-representative classifier.

Causal Inference via Style Transfer for Out-of-distribution Generalisation

Out-of-distribution (OOD) generalisation aims to build a model that can generalise well to an unseen target domain using knowledge from multiple source domains. To this end, the model should seek the causal dependence between inputs and labels, which may be determined by the semantics of the inputs and remain invariant across domains. However, statistical or non-causal methods often cannot capture this dependence and perform poorly because they do not account for spurious correlations, learnt during model training via unobserved confounders. A well-known causal inference method, back-door adjustment, cannot be applied to remove spurious correlations, as it requires the confounders to be observed. In this paper, we propose a novel method that effectively deals with hidden confounders by successfully implementing front-door adjustment (FA). FA requires the choice of a mediator, which we take to be the semantic information of images, giving access to the causal mechanism without the need to observe confounders. Further, we propose to estimate the combination of the mediator with other observed images in the front-door formula via style transfer algorithms. Our use of style transfer to estimate FA is novel and well suited to OOD generalisation, which we justify by extensive experimental results on widely used benchmark datasets.

DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication

Metrics for set similarity are a core aspect of several data mining tasks. To remove duplicate results in a Web search, for example, a common approach looks at the Jaccard index between all pairs of pages. In social network analysis, a much-celebrated metric is the Adamic-Adar index, widely used to compare node neighborhood sets in the important problem of predicting links. However, with the increasing amount of data to be processed, calculating the exact similarity between all pairs can be intractable. The challenge of working at this scale has motivated research into efficient estimators for set similarity metrics. The two most popular estimators, MinHash and SimHash, are indeed used in applications such as document deduplication and recommender systems where large volumes of data need to be processed. Given the importance of these tasks, the demand for better estimators is evident. We propose DotHash, an unbiased estimator for the intersection size of two sets. DotHash can be used to estimate the Jaccard index and, to the best of our knowledge, is the first method that can also estimate the Adamic-Adar index and a family of related metrics. We formally define this family of metrics, provide theoretical bounds on the probability of estimation errors, and analyze DotHash's empirical performance. Our experimental results indicate that DotHash is more accurate than the other estimators in link prediction and document deduplication, with the same complexity and similar comparison time.
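
The two set-similarity metrics named in the abstract can be computed exactly as follows; this sketch only illustrates the quantities that DotHash estimates (the estimator's own construction is not given in the abstract), using hypothetical toy neighborhoods:

```python
import math

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, used e.g. to compare pairs of web pages."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def adamic_adar(neigh_u: set, neigh_v: set, degree: dict) -> float:
    """Sum of 1 / log(deg(w)) over common neighbors w, used for link prediction."""
    return sum(1.0 / math.log(degree[w])
               for w in neigh_u & neigh_v if degree[w] > 1)

# Toy neighborhoods and degrees in a small hypothetical graph.
neigh = {"u": {"a", "b", "c"}, "v": {"b", "c", "d"}}
deg = {"a": 2, "b": 3, "c": 4, "d": 2}

print(jaccard(neigh["u"], neigh["v"]))        # 2 common / 4 total = 0.5
print(adamic_adar(neigh["u"], neigh["v"], deg))
```

Computing these exactly for all pairs is quadratic in the number of sets, which is exactly the cost that sketch-based estimators such as MinHash, SimHash, and DotHash aim to avoid.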

A Higher-Order Temporal H-Index for Evolving Networks

The H-index of a node in a static network is the maximum value h such that at least h of its neighbors have a degree of at least h. Recently, a generalized version, the n-th order H-index, was introduced, allowing one to relate degree centrality, the H-index, and the k-core of a node. We extend the n-th order H-index to temporal networks and define corresponding temporal centrality measures and temporal core decompositions. Our n-th order temporal H-index respects reachability in temporal networks, leading to node rankings that reflect the importance of nodes in spreading processes. We derive natural decompositions of temporal networks into subgraphs with strong temporal coherence. We analyze a recursive computation scheme and develop a highly scalable streaming algorithm. Our experimental evaluation demonstrates the efficiency of our algorithms and the conceptual validity of our approach. Specifically, we show that the n-th order temporal H-index is a strong heuristic for identifying possible super-spreaders in evolving social networks and detects temporally well-connected components.
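
The static H-index definition in the abstract is concrete enough to sketch directly; the temporal, n-th order generalization studied in the paper is not reproduced here:

```python
def node_h_index(neighbor_degrees: list) -> int:
    """Largest h such that at least h neighbors have degree >= h."""
    degs = sorted(neighbor_degrees, reverse=True)
    h = 0
    # After sorting descending, degs[h] is the (h+1)-th largest neighbor degree.
    while h < len(degs) and degs[h] >= h + 1:
        h += 1
    return h

# A node whose neighbors have degrees 5, 4, 4, 2, 1: three neighbors have
# degree >= 3, but there are not four neighbors with degree >= 4, so h = 3.
print(node_h_index([5, 4, 4, 2, 1]))  # 3
```

With h = 0 this reduces to a trivial bound, with h = 1 the degree itself drives the value, and iterating the construction yields the n-th order indices that the paper relates to the k-core.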

Cracking White-box DNN Watermarks via Invariant Neuron Transforms

How to protect the Intellectual Property (IP) of deep neural networks (DNNs) has recently become a major concern for the AI industry. To combat potential model piracy, recent works explore various watermarking strategies to embed secret identity messages into the prediction behaviors or the internals (e.g., weights and neuron activations) of the target model. Because it sacrifices less functionality and exploits more knowledge about the target model, the latter branch of watermarking schemes (i.e., white-box model watermarking) is claimed to be accurate, credible, and secure against most known watermark removal attacks, with emerging research efforts and applications in industry.

In this paper, we present the first effective removal attack that cracks almost all existing white-box watermarking schemes with provably no performance overhead and no prior knowledge required. By analyzing these IP protection mechanisms at the granularity of neurons, we discover for the first time their common dependence on a set of fragile features of a local neuron group, all of which can be arbitrarily tampered with by our proposed chain of invariant neuron transforms. On nine state-of-the-art white-box watermarking schemes and a broad set of industry-level DNN architectures, our attack for the first time reduces the embedded identity message in the protected models to near random. Meanwhile, unlike known removal attacks, our attack requires no prior knowledge of the training data distribution or the adopted watermark algorithms, and leaves model functionality intact.

Deep Weakly-supervised Anomaly Detection

Recent semi-supervised anomaly detection methods that are trained using small labeled anomaly examples and large unlabeled data (mostly normal data) have shown largely improved performance over unsupervised methods. However, these methods often focus on fitting only the abnormalities illustrated by the given anomaly examples (i.e., seen anomalies), and consequently they fail to generalize to anomalies that are not, i.e., new types/classes of anomaly unseen during training. To detect both seen and unseen anomalies, we introduce a novel deep weakly-supervised approach, namely Pairwise Relation prediction Network (PReNet), that learns pairwise relation features and anomaly scores by predicting the relation of any two randomly sampled training instances, in which the pairwise relation can be anomaly-anomaly, anomaly-unlabeled, or unlabeled-unlabeled. Since unlabeled instances are mostly normal, the relation prediction enforces joint learning of anomaly-anomaly, anomaly-normal, and normal-normal pairwise discriminative patterns. PReNet can then detect any seen/unseen abnormalities that fit the learned pairwise abnormal patterns, or deviate from the normal patterns. Further, this pairwise approach also seamlessly and significantly augments the training anomaly data. Empirical results on 12 real-world datasets show that PReNet significantly outperforms nine competing methods in detecting seen and unseen anomalies. We also theoretically and empirically justify the robustness of our model w.r.t. anomaly contamination in the unlabeled data. The code is available at https://github.com/mala-lab/PReNet.

Criteria Tell You More than Ratings: Criteria Preference-Aware Light Graph Convolution for Effective Multi-Criteria Recommendation

The multi-criteria (MC) recommender system, which leverages MC rating information in a wide range of e-commerce areas, is ubiquitous nowadays. Surprisingly, although graph neural networks (GNNs) have been widely applied to develop various recommender systems thanks to their high expressive capability in learning graph representations, it remains unexplored how to design MC recommender systems with GNNs. In light of this, we make the first attempt at designing a GNN-aided MC recommender system. Specifically, rather than straightforwardly adopting existing GNN-based recommendation methods, we devise a novel criteria preference-aware light graph convolution (CPA-LGC) method, which is capable of precisely capturing the criteria preferences of users as well as the collaborative signal in complex high-order connectivities. To this end, we first construct an MC expansion graph that transforms user-item MC ratings into an expanded bipartite graph, so as to learn from the collaborative signal in MC ratings. Next, to strengthen the capability of criteria preference awareness, CPA-LGC incorporates newly characterized embeddings, including user-specific criteria-preference embeddings and item-specific criterion embeddings, into our graph convolution model. Through comprehensive evaluations using four real-world datasets, we demonstrate (a) superiority over benchmark MC recommendation methods and benchmark GNN-based recommendation methods with tremendous gains, (b) the effectiveness of core components in CPA-LGC, and (c) the computational efficiency.

Domain-Guided Spatio-Temporal Self-Attention for Egocentric 3D Pose Estimation

Vision-based egocentric 3D human pose estimation (ego-HPE) is essential to support critical applications of xR technologies. However, severe self-occlusions and strong distortion introduced by the fish-eye view of the head-mounted camera make ego-HPE extremely challenging. To address these challenges, we propose a domain-guided spatio-temporal transformer model that leverages information specific to ego-views. Powered by this domain-guided transformer, we build the Egocentric Spatio-Temporal Self-Attention Network (Ego-STAN), which uses 2D image representations and spatio-temporal attention to address both distortions and self-occlusions in ego-HPE. Additionally, we introduce a spatial concept called feature map tokens (FMT), which endows Ego-STAN with the ability to draw on complex spatio-temporal information encoded in egocentric videos. Our quantitative evaluation on the contemporary xR-EgoPose dataset achieves a 38.2% improvement on the highest-error joints over the SOTA ego-HPE model, while accomplishing a 22% decrease in the number of parameters. Finally, we also demonstrate the generalization capabilities of our model to real-world HPE tasks beyond ego-views, achieving a 7.7% improvement on 2D human pose estimation with the Human3.6M dataset. Our code is also made available at: https://github.com/jmpark0808/Ego-STAN

FedDefender: Client-Side Attack-Tolerant Federated Learning

Federated learning enables learning from decentralized data sources without compromising privacy, which makes it a crucial technique. However, it is vulnerable to model poisoning attacks, where malicious clients interfere with the training process. Previous defense mechanisms have focused on the server side, using careful model aggregation, but this may not be effective when the data is not identically distributed or when attackers can access the information of benign clients. In this paper, we propose a new defense mechanism that focuses on the client side, called FedDefender, to help benign clients train robust local models and avoid the adverse impact of malicious model updates from attackers, even when a server-side defense cannot identify or remove adversaries. Our method consists of two main components: (1) attack-tolerant local meta update and (2) attack-tolerant global knowledge distillation. These components are used to find noise-resilient model parameters while accurately extracting knowledge from a potentially corrupted global model. Our client-side defense strategy has a flexible structure and can work in conjunction with any existing server-side strategies. Evaluations of real-world scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.

Few-shot Low-resource Knowledge Graph Completion with Multi-view Task Representation Generation

Despite their capacity to convey knowledge, most existing knowledge graphs (KGs) are created for specific domains using low-resource data sources, especially those in non-global languages, and thus unavoidably suffer from the incompleteness problem. The automatic discovery of missing triples for KG completion is thus hindered by the challenging long-tail relations problem in low-resource KGs. Few-shot learning models trained on rich-resource KGs are unable to tackle this challenge due to a lack of generalization. To alleviate the impact of the intractable long-tail problem on low-resource KG completion, in this paper, we propose a novel few-shot learning framework empowered by multi-view task representation generation. The framework consists of four components, i.e., few-shot learner, perturbed few-shot learner, relation knowledge distiller, and pairwise contrastive distiller. The key idea is to utilize the different views of each few-shot task to improve and regulate the training of the few-shot learner. For each few-shot task, instead of augmenting it by complicated task designs, we generate its representation of different views using the relation knowledge distiller and perturbed few-shot learner, which are obtained by distilling knowledge from a KG encoder and perturbing the few-shot learner. Then, the generated representation of different views is utilized by the pairwise contrastive distiller based on a teacher-student framework to distill the knowledge of how to represent relations from different views into the few-shot learner and facilitate few-shot learning. Extensive experiments conducted on several real-world low-resource KGs validate the effectiveness of our proposed method.

Efficient Centrality Maximization with Rademacher Averages

The identification of the set of k most central nodes of a graph, or centrality maximization, is a key task in network analysis, with various applications ranging from finding communities in social and biological networks to understanding which seed nodes are important to diffuse information in a graph. As the exact computation of centrality measures does not scale to modern-sized networks, the most practical solution is to resort to rigorous, but efficiently computable, randomized approximations. In this work we present CentRA, the first algorithm based on progressive sampling to compute high-quality approximations of the set of k most central nodes. CentRA is based on a novel approach to efficiently estimate Monte Carlo Rademacher Averages, a powerful tool from statistical learning theory to compute sharp data-dependent approximation bounds. Then, we study the sample complexity of centrality maximization using the VC-dimension, a key concept from statistical learning theory. We show that the number of random samples required to compute high-quality approximations scales with finer characteristics of the graph, such as its vertex diameter, or of the centrality of interest, significantly improving looser bounds derived from standard techniques. We apply CentRA to analyze large real-world networks, showing that it significantly outperforms the state-of-the-art approximation algorithm in terms of number of samples, running times, and accuracy.

Locality Sensitive Hashing for Optimizing Subgraph Query Processing in Parallel Computing Systems

This paper explores parallel computing systems for efficient subgraph query processing in large graphs. We investigate how to take advantage of the inherent parallelism of parallel computing systems for both intraquery and interquery optimization during subgraph query processing. Rather than relying on widely-used hash-based methods, we utilize and extend locality sensitive hashing methods. For intraquery optimization, we use the structures of both the data graph and subgraph query to design a query-constraint locality sensitive hashing method named QCMH, which can be used to merge multiple tasks during a single subgraph query processing. For interquery optimization, we propose a query locality sensitive hashing method named QMH, which can be used to detect common subgraphs among different subgraph queries, thereby merging multiple subgraph queries. Our proposed methods can reduce the redundant computation among multiple tasks during a single subgraph query processing or multiple queries. Extensive experimental studies on large real and synthetic graphs show that our proposed methods can improve query performance compared to state-of-the-art methods by 10% to 50%.

Learning from Positive and Unlabeled Multi-Instance Bags in Anomaly Detection

In the multi-instance learning (MIL) setting, instances are grouped together into bags. Labels are provided only for the bags and not on the level of individual instances. A positive bag label means that at least one instance inside the bag is positive, while a negative bag label restricts all the instances in the bag to be negative. MIL data naturally arises in many contexts, such as anomaly detection, where labels are rare and costly, and one often ends up annotating the label for sets of instances. Moreover, in many real-world anomaly detection problems, only positive labels are collected because they usually represent critical events. Such a setting, where only positive labels are provided along with unlabeled data, is called Positive and Unlabeled (PU) learning. Despite its usefulness for several use cases, no existing work is dedicated to learning from positive and unlabeled data in a multi-instance setting for anomaly detection. Therefore, we propose the first method that learns from PU bags in anomaly detection. Our method uses an autoencoder as an underlying anomaly detector. We alter the autoencoder's objective function and propose a new loss that allows it to learn from positive and unlabeled bags of instances. We theoretically analyze this method. Experimentally, we evaluate our method on 30 datasets and show that it performs better than multiple baselines adapted to work in our setting.
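
The bag-label semantics the abstract describes can be made concrete with a small sketch; the bag names and instance labels below are hypothetical, and the paper's autoencoder-based loss is not reproduced:

```python
def bag_label(instance_labels: list) -> int:
    """A bag is positive (1) iff at least one instance in it is positive."""
    return int(any(y == 1 for y in instance_labels))

# Hypothetical bags with hidden instance-level labels (0 = normal, 1 = anomaly).
bags = {"b1": [0, 0, 1], "b2": [0, 0, 0], "b3": [1, 1, 0]}

true_bag_labels = {name: bag_label(ys) for name, ys in bags.items()}
print(true_bag_labels)  # {'b1': 1, 'b2': 0, 'b3': 1}

# In the PU setting only positive bag labels are observed; b2 stays
# unlabeled even though all of its instances happen to be negative.
observed = {name for name, y in true_bag_labels.items() if y == 1}
print(observed)  # {'b1', 'b3'}
```

This asymmetry, where negative evidence is never explicitly labeled, is what distinguishes PU MIL from the standard positive/negative bag setting.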

Deep Pipeline Embeddings for AutoML

Automated Machine Learning (AutoML) is a promising direction for democratizing AI by automatically deploying Machine Learning systems with minimal human expertise. The core technical challenge behind AutoML is optimizing the pipelines of Machine Learning systems (e.g. the choice of preprocessing, augmentations, models, optimizers, etc.). Existing Pipeline Optimization techniques fail to explore deep interactions between pipeline stages/components. As a remedy, this paper proposes a novel neural architecture that captures the deep interaction between the components of a Machine Learning pipeline. We propose embedding pipelines into a latent representation through a novel per-component encoder mechanism. To search for optimal pipelines, such pipeline embeddings are used within deep-kernel Gaussian Process surrogates inside a Bayesian Optimization setup. Furthermore, we meta-learn the parameters of the pipeline embedding network using existing evaluations of pipelines on diverse collections of related datasets (a.k.a. meta-datasets). Through extensive experiments on three large-scale meta-datasets, we demonstrate that pipeline embeddings yield state-of-the-art results in Pipeline Optimization.

Graph Neural Bandits

Contextual bandit algorithms aim to choose the optimal arm, the one with the highest reward, out of a set of candidates based on contextual information. Various bandit algorithms have been applied to real-world applications due to their ability to tackle the exploitation-exploration dilemma. Motivated by online recommendation scenarios, in this paper, we propose a framework named Graph Neural Bandits (GNB) to leverage the collaborative nature among users empowered by graph neural networks (GNNs). Instead of estimating rigid user clusters as in existing works, we model the "fine-grained" collaborative effects through estimated user graphs in terms of exploitation and exploration respectively. Then, to refine the recommendation strategy, we utilize separate GNN-based models on estimated user graphs for exploitation and adaptive exploration. Theoretical analysis and experimental results on multiple real datasets in comparison with state-of-the-art baselines are provided to demonstrate the effectiveness of our proposed framework.

Towards Understanding and Enhancing Robustness of Deep Learning Models against Malicious Unlearning Attacks

Given the availability of abundant data, deep learning models have been advanced and become ubiquitous in the past decade. In practice, for many different reasons (e.g., privacy, usability, and fidelity), individuals also want the trained deep models to forget some specific data. Motivated by this, machine unlearning (also known as selective data forgetting) has been intensively studied, which aims at removing the influence that any particular training sample had on the trained model during the unlearning process. However, people usually employ machine unlearning methods as trusted basic tools and rarely have any doubt about their reliability. In fact, the increasingly critical role of machine unlearning makes deep learning models susceptible to the risk of being maliciously attacked. To properly understand the performance of deep learning models in malicious environments, we believe it is critical to study the robustness of deep learning models to malicious unlearning attacks, which occur during the unlearning process. To bridge this gap, in this paper, we first demonstrate that malicious unlearning attacks pose immense threats to the security of deep learning systems. Specifically, we present a broad class of malicious unlearning attacks wherein maliciously crafted unlearning requests trigger deep learning models to misbehave on target samples in a highly controllable and predictable manner. In addition, to improve the robustness of deep learning models, we also present a general defense mechanism, which aims to identify and unlearn effective malicious unlearning requests based on their gradient influence on the unlearned models. Further, theoretical analyses are conducted to analyze the proposed methods. Extensive experiments on real-world datasets validate the vulnerabilities of deep learning models to malicious unlearning attacks and the effectiveness of the introduced defense mechanism.

Generalizable Low-Resource Activity Recognition with Diverse and Discriminative Representation Learning

Human activity recognition (HAR) is a time series classification task that focuses on identifying the motion patterns from human sensor readings. Adequate data is essential but a major bottleneck for training a generalizable HAR model, which assists customization and optimization of online web applications. However, it is costly in both time and money to collect large-scale labeled data in reality, i.e., the low-resource challenge. Meanwhile, data collected from different persons have distribution shifts due to different living habits, body shapes, age groups, etc. The low-resource and distribution shift challenges are detrimental to HAR when applying the trained model to new unseen subjects. In this paper, we propose a novel approach called Diverse and Discriminative representation Learning (DDLearn) for generalizable low-resource HAR. DDLearn simultaneously considers diversity and discrimination learning. With the constructed self-supervised learning task, DDLearn enlarges the data diversity and explores the latent activity properties. Then, we propose a diversity preservation module to preserve the diversity of learned features by enlarging the distribution divergence between the original and augmented domains. Meanwhile, DDLearn also enhances semantic discrimination by learning discriminative representations with supervised contrastive learning. Extensive experiments on three public HAR datasets demonstrate that our method significantly outperforms state-of-the-art methods by an average accuracy improvement of 9.5% under the low-resource distribution shift scenarios, while being a generic, explainable, and flexible framework. Code is available at: https://github.com/microsoft/robustlearn.

FedAPEN: Personalized Cross-silo Federated Learning with Adaptability to Statistical Heterogeneity

In cross-silo federated learning (FL), the data among clients are usually statistically heterogeneous (aka not independent and identically distributed, non-IID) due to diversified data sources, lowering the accuracy of FL. Although many personalized FL (PFL) approaches have been proposed to address this issue, they are only suitable for data with specific degrees of statistical heterogeneity. In the real world, the heterogeneity of data among clients is often immeasurable due to privacy concerns, making the targeted selection of PFL approaches difficult. Besides, in cross-silo FL, clients are usually from different organizations and tend to hold architecturally different private models. In this work, we propose a novel FL framework, FedAPEN, which combines mutual learning and ensemble learning to take advantage of private and shared global models while allowing heterogeneous models. Within FedAPEN, we propose two mechanisms to coordinate and promote model ensemble such that FedAPEN achieves excellent accuracy on various data distributions without prior knowledge of data heterogeneity, and thus obtains adaptability to data heterogeneity. We conduct extensive experiments on four real-world datasets, including: 1) Fashion MNIST, CIFAR-10, and CIFAR-100, each with ten different types and degrees of label distribution skew; and 2) eICU with feature distribution skew. The experiments demonstrate that FedAPEN obtains superior accuracy compared with baselines on data with varying types and degrees of heterogeneity in almost all cases.

3D-IDS: Doubly Disentangled Dynamic Intrusion Detection

Network-based intrusion detection systems (NIDS) monitor network traffic for malicious activities, forming the frontline defense against increasing attacks over information infrastructures. Although promising, our quantitative analysis shows that existing methods perform inconsistently in declaring various unknown attacks (e.g., 9% and 35% F1 respectively for two distinct unknown threats for an SVM-based method) or detecting diverse known attacks (e.g., 31% F1 for the Backdoor and 93% F1 for DDoS for a GCN-based state-of-the-art method), and reveals that the underlying cause is entangled distributions of flow features. This motivates us to propose 3D-IDS, a novel method that aims to tackle the above issues through two-step feature disentanglement and a dynamic graph diffusion scheme. Specifically, we first disentangle traffic features by a non-parameterized optimization based on mutual information, automatically differentiating tens to hundreds of complex features of various attacks. Such differentiated features will be fed into a memory model to generate representations, which are further disentangled to highlight the attack-specific features. Finally, we use a novel graph diffusion method that dynamically fuses the network topology for spatial-temporal aggregation in evolving data streams. By doing so, we can effectively identify various attacks in encrypted traffic, including unknown threats and known ones that are not easily detected. Experiments show the superiority of our 3D-IDS. We also demonstrate that our two-step feature disentanglement benefits the explainability of NIDS.

Reconstructing Graph Diffusion History from a Single Snapshot

Diffusion on graphs is ubiquitous with numerous high-impact applications, ranging from the study of residential segregation in socioeconomics and activation cascading in neuroscience, to the modeling of disease contagion in epidemiology and malware spreading in cybersecurity. In these applications, complete diffusion histories play an essential role in terms of identifying dynamical patterns, reflecting on precaution actions, and forecasting intervention effects. Despite their importance, complete diffusion histories are rarely available and are highly challenging to reconstruct due to ill-posedness, explosive search space, and scarcity of training data. To date, few methods exist for diffusion history reconstruction. They are exclusively based on the maximum likelihood estimation (MLE) formulation and require the true diffusion parameters to be known. In this paper, we study an even harder problem, namely reconstructing Diffusion history from A single SnapsHot (DASH), where we seek to reconstruct the history from only the final snapshot without knowing the true diffusion parameters. We start with theoretical analyses that reveal a fundamental limitation of the MLE formulation. We prove: (a) estimation error of diffusion parameters is unavoidable due to the NP-hardness of diffusion parameter estimation, and (b) the MLE formulation is sensitive to estimation error of diffusion parameters. To overcome the inherent limitation of the MLE formulation, we propose a novel barycenter formulation: finding the barycenter of the posterior distribution of histories, which is provably stable against the estimation error of diffusion parameters.
We further develop an effective solver named DIffusion hiTting Times with Optimal proposal (DITTO) by reducing the problem to estimating posterior expected hitting times via the Metropolis-Hastings Markov chain Monte Carlo method (M-H MCMC) and employing an unsupervised graph neural network to learn an optimal proposal to accelerate the convergence of M-H MCMC. We conduct extensive experiments to demonstrate the efficacy of the proposed method. Our code is available at https://github.com/q-rz/KDD23-DITTO. The appendix can be found at https://arxiv.org/abs/2306.00488.
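
The sampler the abstract builds on is standard Metropolis-Hastings; a generic acceptance step is sketched below on a hypothetical unnormalized target over the integers. The paper's learned graph-specific proposal (DITTO) is not reproduced here:

```python
import math
import random

def unnorm_log_density(x: int) -> float:
    """Hypothetical target: p(x) proportional to exp(-|x|)."""
    return -abs(x)

def mh_step(x: int, rng: random.Random) -> int:
    """One Metropolis-Hastings step with a symmetric random-walk proposal."""
    proposal = x + rng.choice([-1, 1])
    log_alpha = unnorm_log_density(proposal) - unnorm_log_density(x)
    # Accept with probability min(1, p(proposal) / p(x)).
    accept_prob = math.exp(min(0.0, log_alpha))
    return proposal if rng.random() < accept_prob else x

rng = random.Random(0)
chain = [0]
for _ in range(10000):
    chain.append(mh_step(chain[-1], rng))
# The chain concentrates near 0, where the target density is highest.
```

In the paper's setting the chain states are diffusion histories rather than integers, and the quality of the proposal distribution dominates how quickly such a chain converges, which is what motivates learning the proposal with a graph neural network.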

Source-Free Domain Adaptation with Temporal Imputation for Time Series Data

Source-free domain adaptation (SFDA) aims to adapt a pretrained model from a labeled source domain to an unlabeled target domain without access to the source domain data, preserving source domain privacy. Despite its prevalence in visual applications, SFDA is largely unexplored in time series applications. The existing SFDA methods that are mainly designed for visual applications may fail to handle the temporal dynamics in time series, leading to impaired adaptation performance. To address this challenge, this paper presents a simple yet effective approach for source-free domain adaptation on time series data, namely MAsk and imPUte (MAPU). First, to capture temporal information of the source domain, our method performs random masking on the time series signals while leveraging a novel temporal imputer to recover the original signal from a masked version in the embedding space. Second, in the adaptation step, the imputer network is leveraged to guide the target model to produce target features that are temporally consistent with the source features. To this end, our MAPU can explicitly account for temporal dependency during the adaptation while avoiding the imputation in the noisy input space. Our method is the first to handle temporal consistency in SFDA for time series data and can be seamlessly equipped with other existing SFDA methods. Extensive experiments conducted on three real-world time series datasets demonstrate that our MAPU achieves significant performance gain over existing methods. Our code is available at: https://github.com/mohamedr002/MAPU_SFDA_TS.

FedPseudo: Privacy-Preserving Pseudo Value-Based Deep Learning Models for Federated Survival Analysis

Survival analysis, aka time-to-event analysis, has a wide-ranging impact on patient care. Federated Survival Analysis (FSA) is an emerging Federated Learning (FL) paradigm for performing survival analysis on distributed decentralized data available at multiple medical institutions. FSA enables individual medical institutions, referred to as clients, to improve their survival predictions while ensuring privacy. However, FSA faces challenges due to non-linear and non-IID data distributions among clients, as well as bias caused by censoring. Although recent studies have adapted Cox Proportional Hazards (CoxPH) survival models for FSA, a systematic exploration of these challenges is currently lacking. In this paper, we address these critical challenges by introducing FedPseudo, a pseudo value-based deep learning framework for FSA. FedPseudo uses deep learning models to learn robust representations from non-linear survival data, leverages the power of pseudo values to handle non-uniform censoring, and employs FL algorithms such as FedAvg to learn model parameters. We propose a novel and simple approach for estimating pseudo values for FSA. We provide theoretical proof that the estimated pseudo values, referred to as Federated Pseudo Values, are consistent. Moreover, our empirical results demonstrate that they can be computed faster than traditional methods of deriving pseudo values. To ensure and enhance the privacy of both the estimated pseudo values and the shared model parameters, we systematically investigate the application of differential privacy (DP) on both the federated pseudo values and local model updates. Furthermore, we adapt the V-Usable Information metric to quantify the informativeness of a client's data for training a survival model and utilize this metric to show the advantages of participating in FSA.
We conducted extensive experiments on synthetic and real-world survival datasets to demonstrate that our FedPseudo framework achieves better performance than other FSA approaches and performs similarly to the best centrally trained deep survival model. Moreover, FedPseudo consistently achieves superior results across different censoring settings.
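
For intuition, the classical jackknife construction of pseudo values for the survival function, p_i = n * S(t) - (n - 1) * S_{-i}(t) with S the Kaplan-Meier estimate, can be sketched as follows (a naive centralized illustration, not the paper's faster federated estimator):

```python
import numpy as np

def km_survival(time, event, t):
    """Kaplan-Meier estimate of S(t) from observed (time, event) arrays,
    where event == 1 marks an observed failure and 0 marks censoring."""
    s = 1.0
    for u in np.unique(time[event == 1]):      # sorted distinct event times
        if u > t:
            break
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & (event == 1))
        s *= 1.0 - deaths / at_risk
    return s

def jackknife_pseudo_values(time, event, t):
    """Leave-one-out pseudo values p_i = n*S(t) - (n-1)*S_{-i}(t)."""
    n = len(time)
    full = km_survival(time, event, t)
    idx = np.arange(n)
    return np.array([n * full - (n - 1) *
                     km_survival(time[idx != i], event[idx != i], t)
                     for i in range(n)])
```

With no censoring, the pseudo value of subject i reduces to the indicator that i survives past t, which makes the construction easy to sanity-check.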

Robustness Certification for Structured Prediction with General Inputs via Safe Region Modeling in the Semimetric Output Space

Many real-world machine learning problems involve structured prediction beyond categorical labels. However, most existing robustness certification works are devoted to the classification case. It remains open for robustness certification for more general outputs. In this paper, we propose a novel framework of robustness certification for structured prediction problems, where the output space is modeled as a semimetric space with a distance function that satisfies non-negativity and symmetry but not necessarily the triangle inequality. We further develop our tailored certification methods for binary, numerical, and hybrid inputs in structured prediction. Experimental results show that our method achieves tighter robustness guarantees than the SOTA structured certification baseline (which supports only numerical inputs) under ℓ2 norm perturbation when outputs are measured by intersection over union (IoU) similarity, total variation distance, and perceptual distance. Moreover, we achieve good robustness certification for binary inputs with ℓ0 norm perturbation and hybrid inputs with corresponding perturbation when outputs are measured by Manhattan distance.

Temporal Dynamics-Aware Adversarial Attacks on Discrete-Time Dynamic Graph Models

Real-world graphs such as social networks, communication networks, and rating networks are constantly evolving over time. Many deep learning architectures have been developed to learn effective node representations using both graph structure and dynamics. While being crucial for practical applications, the robustness of these representation learners for dynamic graphs in the presence of adversarial attacks is highly understudied. In this work, we design a novel adversarial attack on discrete-time dynamic graph models where we desire to perturb the input graph sequence in a manner that preserves the temporal dynamics of the graph while dropping the performance of representation learners. To this end, we motivate a novel Temporal Dynamics-Aware Perturbation (TDAP) constraint, which ensures that perturbations introduced at each time step are restricted to only a small fraction of the number of changes in the graph since the previous time step. We present a theoretically-motivated Projected Gradient Descent approach for dynamic graphs to find effective perturbations under the TDAP constraint. Experiments on two tasks, dynamic link prediction and node classification, show that our approach is up to 4x more effective than the baseline methods for attacking these models. We extend our approach to a more practical online setting where graphs become available in real-time and show up to 5x superior performance over baselines. We also show that our approach successfully evades state-of-the-art neural approaches for anomaly detection, thereby promoting the need to study robustness as a part of representation-learning approaches for dynamic graphs.
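
A minimal sketch of the per-step budget implied by the TDAP constraint (hypothetical names; the paper's attack selects perturbations via a projected-gradient procedure, not this simple top-k filter):

```python
import numpy as np

def project_tdap(scores, n_changes, eps=0.1):
    """Keep only the top-k candidate edge flips at one time step, where the
    budget k is a fraction `eps` of the graph's changes since the previous
    step (the TDAP constraint). `scores` holds one attack score per
    candidate flip; returns the sorted indices of the flips kept."""
    k = int(eps * n_changes)              # per-step perturbation budget
    if k <= 0:
        return np.array([], dtype=int)
    order = np.argsort(scores)[::-1]      # highest-scoring flips first
    return np.sort(order[:k])
```

With 20 changes since the last snapshot and eps = 0.1, only the two strongest candidate flips survive the projection.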

CARL-G: Clustering-Accelerated Representation Learning on Graphs

Self-supervised learning on graphs has made large strides in achieving great performance in various downstream tasks. However, many state-of-the-art methods suffer from a number of impediments, which prevent them from realizing their full potential. For instance, contrastive methods typically require negative sampling, which is often computationally costly. While non-contrastive methods avoid this expensive step, most existing methods either rely on overly complex architectures or dataset-specific augmentations. In this paper, we ask: Can we borrow from classical unsupervised machine learning literature in order to overcome those obstacles? Our key insight is that the goal of distance-based clustering closely resembles that of contrastive learning: both attempt to pull representations of similar items together and push representations of dissimilar items apart. Guided by this insight, we propose CARL-G, a novel clustering-based framework for graph representation learning that uses a loss inspired by Cluster Validation Indices (CVIs), i.e., internal measures of cluster quality (no ground truth required). CARL-G is adaptable to different clustering methods and CVIs, and we show that with the right choice of clustering method and CVI, CARL-G outperforms node classification baselines on 4/5 datasets with up to a 79× training speedup compared to the best-performing baseline. CARL-G also performs at par or better than baselines in node clustering and similarity search tasks, training up to 1,500× faster than the best-performing baseline. Finally, we also provide theoretical foundations for the use of CVI-inspired losses in graph representation learning.
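
As an illustration of a CVI, the Calinski-Harabasz index can be computed as below; its negative could serve as a clustering-quality loss (a generic sketch of one internal index, not necessarily the CVI that CARL-G uses):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: ratio of between-cluster to within-cluster
    dispersion. Higher values indicate better-separated clusters, so the
    negative index can act as a CVI-style training loss."""
    mu = X.mean(axis=0)
    between = within = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        between += len(Xk) * np.sum((mk - mu) ** 2)   # cluster-center spread
        within += np.sum((Xk - mk) ** 2)              # intra-cluster spread
    n, K = len(X), len(np.unique(labels))
    return (between / (K - 1)) / (within / (n - K))
```

Two tight, well-separated clusters give a large index, matching the contrastive-learning intuition of pulling similar items together and pushing dissimilar items apart.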

One-shot Joint Extraction, Registration and Segmentation of Neuroimaging Data

Brain extraction, registration and segmentation are indispensable preprocessing steps in neuroimaging studies. The aim is to extract the brain from raw imaging scans (i.e., extraction step), align it with a target brain image (i.e., registration step) and label the anatomical brain regions (i.e., segmentation step). Conventional studies typically focus on developing separate methods for the extraction, registration and segmentation tasks in a supervised setting. The performance of these methods is largely contingent on the quantity of training samples and the extent of visual inspections carried out by experts for error correction. Nevertheless, collecting voxel-level labels and performing manual quality control on high-dimensional neuroimages (e.g., 3D MRI) are expensive and time-consuming in many medical studies. In this paper, we study the problem of one-shot joint extraction, registration and segmentation in neuroimaging data, which exploits only one labeled template image (a.k.a. atlas) and a few unlabeled raw images for training. We propose a unified end-to-end framework, called JERS, to jointly optimize the extraction, registration and segmentation tasks, allowing feedback among them. Specifically, we use a group of extraction, registration and segmentation modules to learn the extraction mask, transformation and segmentation mask, where modules are interconnected and mutually reinforced by self-supervision. Empirical results on real-world datasets demonstrate that our proposed method performs exceptionally well on the extraction, registration and segmentation tasks.

Learning Autoregressive Model in LSM-Tree based Store

Database-native machine learning operators are highly desired for efficient I/O and computation costs. While most existing machine learning algorithms assume that the time series data are fully available and readily ordered by timestamps, this is often not the case in practice. Commodity time series databases store the data in pages with possibly overlapping time ranges, known as LSM-Tree based storage. Data points in a page could be incomplete, owing to either missing values or out-of-order arrivals; the missing or delayed points may then be inserted in subsequent pages. Likewise, data points in a page could be updated by points in another page, for dirty-data repair or re-transmission. A straightforward idea is thus to first merge and order the data points by timestamps, and then apply the existing learning algorithms. This is not only costly in I/O but also prevents pre-computation of model learning. In this paper, we propose to learn autoregressive (AR) models offline, locally in each page over incomplete data, and to aggregate the stored models of different pages online, taking into account the aforesaid inserted and updated data points. Remarkably, the proposed method has been deployed and included as a function in an open source time series database, Apache IoTDB. Extensive experiments in the system demonstrate that our proposal, LSMAR, achieves up to an order-of-magnitude improvement in learning time, needing only tens of milliseconds to learn over 1 million data points.
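
The page-local learning and online aggregation can be illustrated for the fully observed case by merging the normal-equation statistics of per-page AR fits (a simplified sketch with hypothetical names; the paper additionally handles inserted and updated points):

```python
import numpy as np

def ar_stats(page, p=2):
    """Sufficient statistics (X'X, X'y) of an AR(p) least-squares fit,
    computed locally on one page of points ordered by timestamp."""
    X = np.column_stack([page[i:len(page) - p + i] for i in range(p)])
    y = page[p:]
    return X.T @ X, X.T @ y

def merge_and_solve(stats):
    """Aggregate per-page statistics and solve the combined normal
    equations; coefficients are ordered from lag p down to lag 1."""
    XtX = sum(s[0] for s in stats)
    Xty = sum(s[1] for s in stats)
    return np.linalg.solve(XtX, Xty)
```

Because X'X and X'y are additive, each page's statistics can be pre-computed at write time and summed at query time, avoiding a global merge-sort of the raw points.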

PSLOG: Pretraining with Search Logs for Document Ranking

Recently, pretrained models have achieved remarkable performance not only in natural language processing but also in information retrieval (IR). Previous studies show that IR-oriented pretraining tasks can achieve better performance than only finetuning pretrained language models in IR datasets. Besides, the massive search log data obtained from mainstream search engines can be used in IR pretraining, for it contains users' implicit judgments of document relevance under a concrete query. However, existing methods mainly use direct query-document click signals to pretrain models. The potential supervision signals from search logs are far from being well explored. In this paper, we propose to comprehensively leverage four query-document relevance relations, including co-interaction and multi-hop relations, to pretrain ranking models in IR. Specifically, we focus on the user's click behavior and construct an Interaction Graph to represent the global relevance relations between queries and documents from all search logs. With the graph, we can consider the co-interaction and multi-hop q-d relationships through their neighbor nodes. Based on the relations extracted from the interaction graph, we propose four strategies to generate contrastive positive and negative q-d pairs and use these data to pretrain ranking models. Experimental results on both industrial and academic datasets demonstrate the effectiveness of our method.

Enhance Diffusion to Improve Robust Generalization

Deep neural networks are susceptible to human imperceptible adversarial perturbations. One of the strongest defense mechanisms is Adversarial Training (AT). In this paper, we aim to address two predominant problems in AT. First, there is still little consensus on how to set hyperparameters with a performance guarantee for AT, and customized settings impede fair comparison between different model designs. Second, the robustly trained neural networks struggle to generalize well and suffer from tremendous overfitting. This paper focuses on the primary AT framework - Projected Gradient Descent Adversarial Training (PGD-AT). We approximate the dynamics of PGD-AT by a continuous-time Stochastic Differential Equation (SDE), and show that the diffusion term of this SDE determines the robust generalization. An immediate implication of this theoretical finding is that robust generalization is positively correlated with the ratio between learning rate and batch size. We further propose a novel approach, Diffusion Enhanced Adversarial Training (DEAT), to manipulate the diffusion term to improve robust generalization with virtually no extra computational burden. We theoretically show that DEAT obtains a tighter generalization bound than PGD-AT. Our empirical investigation is extensive and firmly attests that DEAT universally outperforms PGD-AT by a significant margin.

ShapleyFL: Robust Federated Learning Based on Shapley Value

Federated Learning (FL) allows clients to form a consortium to train a global model under the orchestration of a central server while keeping data on the local client without sharing it, thus mitigating data privacy issues. However, training a robust global model is challenging since the local data is invisible to the server. The local data of clients are naturally heterogeneous, while some clients may use corrupted data or send malicious updates to artificially interfere with the training process. Meanwhile, communication and computation costs are inevitable challenges in designing a practical FL algorithm. In this paper, to improve the robustness of FL, we propose a Shapley value-inspired adaptive weighting mechanism, which regards the FL training as sequential cooperative games and adjusts clients' weights according to their contributions. We also develop a client sampling strategy based on importance sampling, which can reduce the communication cost by optimizing the variance of the global updates according to the weights of clients. Furthermore, to diminish the computation cost of the server, we propose a weight calculation method by estimating differences between the Shapley values of clients. Our experimental results on several real data sets demonstrate the effectiveness of our approaches.
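
For small consortia, the Shapley value underlying such a contribution-based weighting can be computed exactly by averaging marginal contributions over permutations (an illustrative sketch with a generic utility function; the paper relies on cheaper estimates):

```python
import itertools
import math
import numpy as np

def shapley_values(n, utility):
    """Exact Shapley value of each of n clients under a coalition utility
    function, averaging marginal contributions over all n! join orders.
    Feasible only for small n; otherwise one samples permutations or
    approximates the contributions, as contribution-weighted FL schemes do."""
    phi = np.zeros(n)
    for perm in itertools.permutations(range(n)):
        coalition = set()
        prev = utility(frozenset())
        for c in perm:                      # clients join one by one
            coalition.add(c)
            cur = utility(frozenset(coalition))
            phi[c] += cur - prev            # marginal contribution of c
            prev = cur
    return phi / math.factorial(n)
```

For an additive utility (each client contributes a fixed amount regardless of coalition), the Shapley value recovers exactly each client's individual contribution, which is a useful correctness check.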

Mastering Stock Markets with Efficient Mixture of Diversified Trading Experts

Quantitative stock investment is a fundamental financial task that highly relies on accurate prediction of market status and profitable investment decision making. Although recent advances in deep learning (DL) have shown stellar performance on capturing trading opportunities in the stochastic stock market, the performance of existing DL methods is unstable with sensitivity to network initialization and hyperparameter selection. One major limitation of existing works is that investment decisions are made based on one individual neural network predictor with high uncertainty, which is inconsistent with the workflow in real-world trading firms. To tackle this limitation, we propose AlphaMix, a novel three-stage mixture-of-experts (MoE) framework for quantitative investment to mimic the efficient bottom-up hierarchical trading strategy design workflow of successful trading companies. In Stage one, we introduce an efficient ensemble learning method, whose computational and memory costs are significantly lower compared to traditional ensemble methods, to train multiple groups of trading experts with personalised market understanding and trading styles. In Stage two, we collect diversified investment suggestions through building a pool of trading experts utilizing hyperparameter level and initialization level diversity of neural networks for post hoc ensemble construction. In Stage three, we design three different mechanisms, namely as-needed router, with-replacement selection and integrated expert soup, to dynamically pick experts from the expert pool, which takes the responsibility of a portfolio manager. Through extensive experiments on US and Chinese stock markets, we demonstrate that AlphaMix significantly outperforms many state-of-the-art baselines in terms of 7 popular financial criteria.

All in One: Multi-Task Prompting for Graph Neural Networks

Recently, "pre-training and fine-tuning" has been adopted as a standard workflow for many graph tasks since it can leverage general graph knowledge to relieve the lack of graph annotations in each application. However, graph tasks at the node, edge, and graph levels are highly diverse, making the pre-training pretext often incompatible with these multiple tasks. This gap may even cause a "negative transfer" to the specific application, leading to poor results. Inspired by prompt learning in natural language processing (NLP), which has proven highly effective in leveraging prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-trained models and various graph tasks. In this paper, we propose a novel multi-task prompting method for graph models. Specifically, we first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern. In this way, the prompting idea from NLP can be seamlessly introduced to the graph area. Then, to further narrow the gap between various graph tasks and state-of-the-art pre-training strategies, we further study the task space of various graph applications and reformulate downstream problems to the graph-level task. Afterward, we introduce meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that our prompting framework can be more reliable and general for different tasks. We conduct extensive experiments, results from which demonstrate the superiority of our method.

Joint Pre-training and Local Re-training: Transferable Representation Learning on Multi-source Knowledge Graphs

In this paper, we present the "joint pre-training and local re-training" framework for learning and applying multi-source knowledge graph (KG) embeddings. We are motivated by the fact that different KGs contain complementary information to improve KG embeddings and downstream tasks. We pre-train a large teacher KG embedding model over linked multi-source KGs and distill knowledge to train a student model for a task-specific KG. To enable knowledge transfer across different KGs, we use entity alignment to build a linked subgraph for connecting the pre-trained KGs and the target KG. The linked subgraph is re-trained for three-level knowledge distillation from the teacher to the student, i.e., feature knowledge distillation, network knowledge distillation, and prediction knowledge distillation, to generate more expressive embeddings. The teacher model can be reused for different target KGs and tasks without having to train from scratch. We conduct extensive experiments to demonstrate the effectiveness and efficiency of our framework.

Causal Effect Estimation on Hierarchical Spatial Graph Data

Estimating individual treatment effects from observational data is a fundamental problem in causal inference. To accurately estimate treatment effects in the spatial domain, we need to address certain aspects such as how to use the spatial coordinates of covariates and treatments and how the covariates and the treatments interact spatially. We introduce a new problem of predicting treatment effects on time series outcomes from spatial graph data with a hierarchical structure. To address this problem, we propose a spatial intervention neural network (SINet) that leverages the hierarchical structure of spatial graphs to learn a rich representation of the covariates and the treatments and exploits this representation to predict a time series of treatment outcomes. Using a multi-agent simulator, we synthesized a crowd movement guidance dataset and conducted experiments to estimate the conditional average treatment effect, where we considered the initial locations of the crowds as covariates, route guidance as a treatment, and the number of agents reaching a goal at each time stamp as the outcome. We employed state-of-the-art spatio-temporal graph neural networks and neural network-based causal inference methods as baselines, and showed that our proposed method outperformed the baselines both quantitatively and qualitatively.

PERT-GNN: Latency Prediction for Microservice-based Cloud-Native Applications via Graph Neural Networks

Cloud-native applications using microservice architectures are rapidly replacing traditional monolithic applications. To meet end-to-end QoS guarantees and enhance user experience, each component microservice must be provisioned with sufficient resources to handle incoming API calls. Accurately predicting the latency of microservices-based applications is critical for optimizing resource allocation, which turns out to be extremely challenging due to the complex dependencies between microservices and the inherent stochasticity. To tackle this problem, various predictors have been designed based on the Microservice Call Graph. However, Microservice Call Graphs do not take into account the API-specific information, cannot capture important temporal dependencies, and cannot scale to large-scale applications.

In this paper, we propose PERT-GNN, a generic graph neural network based framework to predict the end-to-end latency for microservice applications. PERT-GNN characterizes the interactions or dependency of component microservices observed from prior execution traces of the application using the Program Evaluation and Review Technique (PERT). We then construct a graph neural network based on the generated PERT Graphs, and formulate the latency prediction task as a supervised graph regression problem using the graph transformer method. PERT-GNN can capture the complex temporal causality of different microservice traces, thereby producing more accurate latency predictions for various applications. Evaluations based on datasets generated from common benchmarks and large-scale Alibaba microservice traces show that PERT-GNN can outperform other models by a large margin. In particular, PERT-GNN is able to predict the latency of microservice applications with less than 12% mean absolute percentage error.

ExplainableFold: Understanding AlphaFold Prediction with Explainable AI

This paper presents ExplainableFold (xFold), which is an Explainable AI framework for protein structure prediction. Despite the success of AI-based methods such as AlphaFold (αFold) in this field, the underlying reasons for their predictions remain unclear due to the black-box nature of deep learning models. To address this, we propose a counterfactual learning framework inspired by biological principles to generate counterfactual explanations for protein structure prediction, enabling a dry-lab experimentation approach. Our experimental results demonstrate the ability of ExplainableFold to generate high-quality explanations for AlphaFold's predictions, providing near-experimental understanding of the effects of amino acids on 3D protein structure. This framework has the potential to facilitate a deeper understanding of protein structures. Source code and data of the ExplainableFold project are available at https://github.com/rutgerswiselab/ExplainableFold.

Virtual Node Tuning for Few-shot Node Classification

Few-shot Node Classification (FSNC) is a challenge in graph representation learning where only a few labeled nodes per class are available for training. To tackle this issue, meta-learning has been proposed to transfer structural knowledge from base classes with abundant labels to target novel classes. However, existing solutions become ineffective or inapplicable when base classes have no or limited labeled nodes. To address this challenge, we propose an innovative method dubbed Virtual Node Tuning (VNT). Our approach utilizes a pretrained graph transformer as the encoder and injects virtual nodes as soft prompts in the embedding space, which can be optimized with few-shot labels in novel classes to modulate node embeddings for each specific FSNC task. A unique feature of VNT is that, by incorporating a Graph-based Pseudo Prompt Evolution (GPPE) module, VNT-GPPE can handle scenarios with sparse labels in base classes. Experimental results on four datasets demonstrate the superiority of the proposed approach in addressing FSNC with unlabeled or sparsely labeled base classes, outperforming existing state-of-the-art methods and even fully supervised baselines.

Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization

Domain generalization (DG) is a prevalent problem in real-world applications, which aims to train well-generalized models for unseen target domains by utilizing several source domains. Since domain labels, i.e., which domain each data point is sampled from, naturally exist, most DG algorithms treat them as a kind of supervision information to improve the generalization performance. However, the original domain labels may not be the optimal supervision signal due to the lack of domain heterogeneity, i.e., the diversity among domains. For example, a sample in one domain may be closer to another domain, so its original domain label can act as noise that disturbs generalization learning. Although some methods try to solve this by re-dividing domains and applying the newly generated dividing pattern, the pattern they choose may not be the most heterogeneous due to the lack of a metric for heterogeneity. In this paper, we point out that domain heterogeneity mainly lies in variant features under the invariant learning framework. With contrastive learning, we propose a learning potential-guided metric for domain heterogeneity by promoting learning variant features. Then we notice the differences between seeking variance-based heterogeneity and training an invariance-based generalizable model. We thus propose a novel method called Heterogeneity-based Two-stage Contrastive Learning (HTCL) for the DG task. In the first stage, we generate the most heterogeneous dividing pattern with our contrastive metric. In the second stage, we employ an invariance-aimed contrastive learning by re-building pairs with the stable relation hinted by domains and classes, which better utilizes generated domain labels for generalization learning. Extensive experiments show that HTCL better exploits heterogeneity and yields strong generalization performance.

Adversaries with Limited Information in the Friedkin-Johnsen Model

In recent years, online social networks have been the target of adversaries who seek to introduce discord into societies, to undermine democracies and to destabilize communities. Often the goal is not to favor a certain side of a conflict but to increase disagreement and polarization. To get a mathematical understanding of such attacks, researchers use opinion-formation models from sociology, such as the Friedkin--Johnsen model, and formally study how much discord the adversary can produce when altering the opinions for only a small set of users. In this line of work, it is commonly assumed that the adversary has full knowledge about the network topology and the opinions of all users. However, the latter assumption is often unrealistic in practice, where user opinions are not available or simply difficult to estimate accurately.

To address this concern, we raise the following question: Can an attacker sow discord in a social network, even when only the network topology is known? We answer this question affirmatively. We present approximation algorithms for detecting a small set of users who are highly influential for the disagreement and polarization in the network. We show that when the adversary radicalizes these users and if the initial disagreement/polarization in the network is not very high, then our method achieves a constant-factor approximation relative to the setting in which the user opinions are known. To find the set of influential users, we provide a novel approximation algorithm for a variant of MaxCut in graphs with positive and negative edge weights. We experimentally evaluate our methods, which have access only to the network topology, and we find that they perform similarly to methods that have access to the network topology and all user opinions. We further present an NP-hardness proof, which was left as an open question by Chen and Racz [IEEE Transactions on Network Science and Engineering, 2021].
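
For context, the Friedkin-Johnsen equilibrium opinions and the induced disagreement admit closed forms, z = (I + L)^{-1} s and z' L z, which can be computed directly (a standard textbook sketch of the opinion model, not the paper's attack algorithm):

```python
import numpy as np

def fj_equilibrium(A, s):
    """Equilibrium expressed opinions of the Friedkin-Johnsen model with
    innate opinions s on a graph with adjacency matrix A:
    z = (I + L)^{-1} s, where L is the graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.solve(np.eye(len(s)) + L, s)

def disagreement(A, z):
    """Sum of squared opinion differences across edges: z' L z."""
    L = np.diag(A.sum(axis=1)) - A
    return z @ L @ z
```

On a single edge with innate opinions 0 and 1, the equilibrium pulls both users toward each other (1/3 and 2/3), and an adversary's goal is to choose innate opinions that push such disagreement terms up.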

Feature-based Learning for Diverse and Privacy-Preserving Counterfactual Explanations

Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for their lack of explainability. One flourishing approach is through counterfactual explanations, which provide suggestions on what a user can do to alter an outcome. Not only must a counterfactual example counter the original prediction from the black-box classifier but it should also satisfy various constraints for practical applications. Diversity is one of the critical constraints that however remains less discussed. While diverse counterfactuals are ideal, it is computationally challenging to simultaneously address some other constraints. Furthermore, there is a growing privacy concern over the released counterfactual data. To this end, we propose a feature-based learning framework that effectively handles the counterfactual constraints and adds to the limited pool of privacy-preserving explanation models. We demonstrate the flexibility and effectiveness of our method in generating diverse counterfactuals that are actionable and plausible. Our counterfactual engine is more efficient than counterparts of the same capacity while yielding the lowest re-identification risks.

Pattern Expansion and Consolidation on Evolving Graphs for Continual Traffic Prediction

Recently, spatiotemporal graph convolutional networks are becoming popular in the field of traffic flow prediction and significantly improve prediction accuracy. However, the majority of existing traffic flow prediction models are tailored to static traffic networks and fail to model the continuous evolution and expansion of traffic networks. In this work, we investigate the challenge of traffic flow prediction on an expanding traffic network and propose an efficient and effective continual learning framework, namely Pattern Expansion and Consolidation based on Pattern Matching (PECPM), that achieves continuous traffic flow prediction without access to historical graph data. Specifically, we first design a pattern bank based on pattern matching to store representative patterns of the road network. With the expansion of the road network, the model configured with such a bank module can achieve continuous traffic prediction by effectively managing patterns stored in the bank. The core idea is to continuously update new patterns while consolidating learned ones. In particular, we design a pattern expansion mechanism that can detect evolved and new patterns from the updated network; these unknown patterns are then expanded into the pattern bank to adapt to the updated road network. Additionally, we propose a pattern consolidation mechanism that includes both a bank preservation mechanism and a pattern traceability mechanism. This can effectively consolidate the learned patterns in the bank without requiring access to detailed historical graph data. We conduct experiments on real-world traffic datasets to demonstrate the competitive performance, superior efficiency, and strong generalization ability of PECPM.

Financial Default Prediction via Motif-preserving Graph Neural Network with Curriculum Learning

User financial default prediction plays a critical role in credit risk forecasting and management. It aims at predicting the probability that the user will fail to make the repayments in the future. Previous methods mainly extract a set of user individual features regarding his own profiles and behaviors and build a binary-classification model to make default predictions. However, these methods cannot achieve satisfactory results, especially for users with limited information. Although recent efforts suggest that default prediction can be improved by social relations, they fail to capture the higher-order topology structure at the level of small subgraph patterns. In this paper, we fill in this gap by proposing a motif-preserving Graph Neural Network with curriculum learning (MotifGNN) to jointly learn the lower-order structures from the original graph and higher-order structures from multi-view motif-based graphs for financial default prediction. Specifically, to solve the problem of weak connectivity in motif-based graphs, we design the motif-based gating mechanism. It utilizes the information learned from the original graph with good connectivity to strengthen the learning of the higher-order structure. Considering that the motif patterns of different samples are highly unbalanced, we propose a curriculum learning mechanism over the whole learning process to focus more on samples with uncommon motif distributions. Extensive experiments on one public dataset and two industrial datasets all demonstrate the effectiveness of our proposed method.

Accelerating Antimicrobial Peptide Discovery with Latent Structure

Antimicrobial peptides (AMPs) are promising therapeutic approaches against drug-resistant pathogens. Recently, deep generative models are used to discover new AMPs. However, previous studies mainly focus on peptide sequence attributes and do not consider crucial structure information. In this paper, we propose a latent sequence-structure model for designing AMPs (LSSAMP). LSSAMP exploits multi-scale vector quantization in the latent space to represent secondary structures (e.g. alpha helix and beta sheet). By sampling in the latent space, LSSAMP can simultaneously generate peptides with ideal sequence attributes and secondary structures. Experimental results show that the peptides generated by LSSAMP have a high probability of antimicrobial activity. Our wet laboratory experiments verified that two of the 21 candidates exhibit strong antimicrobial activity. The code is released at https://github.com/dqwang122/LSSAMP.

Networked Time Series Imputation via Position-aware Graph Enhanced Variational Autoencoders

Multivariate time series (MTS) imputation is a widely studied problem in recent years. Existing methods can be divided into two main groups, including (1) deep recurrent or generative models that primarily focus on time series features, and (2) graph neural networks (GNNs) based models that utilize the topological information from the inherent graph structure of MTS as relational inductive bias for imputation. Nevertheless, these methods either neglect topological information or assume the graph structure is fixed and accurately known. Thus, they fail to fully utilize the graph dynamics for precise imputation in more challenging MTS data such as networked time series (NTS), where the underlying graph is constantly changing and might have missing edges. In this paper, we propose a novel approach to overcome these limitations. First, we define the problem of imputation over NTS which contains missing values in both node time series features and graph structures. Then, we design a new model named PoGeVon which leverages variational autoencoder (VAE) to predict missing values over both node time series features and graph structures. In particular, we propose a new node position embedding based on random walk with restart (RWR) in the encoder with provably higher expressive power compared with message-passing based graph neural networks (GNNs). We further design a decoder with 3-stage predictions from the perspective of multi-task learning to impute missing values in both time series and graph structures reciprocally. Experimental results demonstrate the effectiveness of our model over baselines.
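The RWR-based position embedding can be illustrated with a minimal sketch: each node's position feature is its vector of random-walk-with-restart landing probabilities. The function name, restart probability, and power-iteration scheme below are illustrative assumptions, not PoGeVon's implementation.

```python
import numpy as np

def rwr_embedding(A, restart=0.15, iters=100):
    """Random-walk-with-restart scores from each node, usable as position features."""
    n = A.shape[0]
    deg = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix (rows of isolated nodes stay zero)
    P = np.divide(A, deg, out=np.zeros_like(A, dtype=float), where=deg > 0)
    R = np.eye(n)                 # restart distribution: one-hot per start node
    X = np.full((n, n), 1.0 / n)  # initial distribution for every start node
    for _ in range(iters):
        X = (1 - restart) * X @ P + restart * R
    return X  # row i = RWR landing probabilities when starting from node i

# tiny path graph 0-1-2
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
emb = rwr_embedding(A)
```

Each row is a probability distribution, and the mass concentrates around the start node, which is what makes the scores position-aware.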

Incremental Causal Graph Learning for Online Root Cause Analysis

The task of root cause analysis (RCA) is to identify the root causes of system faults/failures by analyzing system monitoring data. Efficient RCA can greatly accelerate system failure recovery and mitigate system damages or financial losses. However, previous research has mostly focused on developing offline RCA algorithms, which often require manually initiating the RCA process, demand a significant amount of time and data to train a robust model, and must be retrained from scratch for each new system fault.

In this paper, we propose CORAL, a novel online RCA framework that can automatically trigger the RCA process and incrementally update the RCA model. CORAL consists of Trigger Point Detection, Incremental Disentangled Causal Graph Learning, and Network Propagation-based Root Cause Localization. The Trigger Point Detection component aims to detect system state transitions automatically and in near-real-time. To achieve this, we develop an online trigger point detection approach based on multivariate singular spectrum analysis and cumulative sum statistics. To efficiently update the RCA model, we propose an incremental disentangled causal graph learning approach to decouple the state-invariant and state-dependent information. After that, CORAL applies a random walk with restarts to the updated causal graph to accurately identify root causes. The online RCA process terminates when the causal graph and the generated root cause list converge. Extensive experiments on three real-world datasets demonstrate the effectiveness and superiority of the proposed framework.
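The cumulative-sum side of the trigger detector can be sketched as a classic one-sided CUSUM test: a statistic accumulates deviations above the nominal mean and fires when it crosses a threshold. This is a simplification of CORAL's detector (which combines CUSUM with multivariate singular spectrum analysis), and the threshold and drift values are illustrative.

```python
def cusum_trigger(stream, mean0, threshold, drift=0.0):
    """One-sided CUSUM: return the first index at which the cumulative
    upward deviation from mean0 exceeds `threshold`, else None."""
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - mean0 - drift))  # reset to 0 on downward drift
        if s > threshold:
            return t
    return None

# level shift at index 5: the statistic accumulates and fires shortly after
data = [0.1, -0.2, 0.0, 0.1, -0.1, 2.0, 2.1, 1.9, 2.2, 2.0]
alarm = cusum_trigger(data, mean0=0.0, threshold=3.0)
```

Note that the alarm fires one step after the shift begins, since the statistic needs to accumulate past the threshold; this detection delay is the usual trade-off against false alarms.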

GMOCAT: A Graph-Enhanced Multi-Objective Method for Computerized Adaptive Testing

Computerized Adaptive Testing (CAT) refers to an online system that adaptively selects the best-suited question for students with various abilities based on their historical response records. Compared with traditional CAT methods based on heuristic rules, recent data-driven CAT methods obtain higher performance by learning from large-scale datasets. However, most CAT methods only focus on the quality objective of predicting the student ability accurately, but neglect concept diversity or question exposure control, which are important considerations in ensuring the performance and validity of CAT. Besides, the students' response records contain valuable relational information between questions and knowledge concepts. Previous methods ignore this relational information, resulting in the selection of sub-optimal test questions. To address these challenges, we propose a Graph-Enhanced Multi-Objective method for CAT (GMOCAT). Firstly, three objectives, namely quality, diversity and novelty, are introduced into the Scalarized Multi-Objective Reinforcement Learning framework of CAT, which respectively correspond to improving the prediction accuracy, increasing the concept diversity and reducing the question exposure. We use an Actor-Critic Recommender to select questions and optimize the three objectives simultaneously via a scalarization function. Secondly, we utilize a graph neural network to learn relation-aware embeddings of questions and concepts. These embeddings aggregate neighborhood information in the relation graphs between questions and concepts. We conduct experiments on three real-world educational datasets. The experimental results show that GMOCAT not only outperforms state-of-the-art methods in ability prediction, but also achieves superior performance in improving concept diversity and alleviating question exposure.
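The scalarization step is conceptually simple: the per-objective rewards are collapsed into a single scalar for the actor-critic update. A minimal linear-scalarization sketch follows; the weights are illustrative assumptions, and the paper's scalarization function may differ.

```python
def scalarize(rewards, weights):
    """Linear scalarization: collapse multi-objective rewards into one RL signal."""
    assert len(rewards) == len(weights)
    return sum(w * r for w, r in zip(weights, rewards))

# quality, diversity, novelty rewards for one candidate question,
# with hypothetical preference weights summing to 1
score = scalarize([0.8, 0.5, 0.2], [0.6, 0.3, 0.1])
```

Changing the weight vector shifts the policy's trade-off among accuracy, concept coverage, and exposure control without altering the underlying RL machinery.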

Treatment Effect Estimation with Adjustment Feature Selection

In causal inference, it is common to select a subset of observed covariates, named the adjustment features, to adjust for when estimating the treatment effect. In real-world applications, abundant covariates are usually observed, which contain extra variables partially correlated with the treatment (treatment-only variables, e.g., instrumental variables) or the outcome (outcome-only variables, e.g., precision variables) besides the confounders (variables that affect both the treatment and the outcome). In principle, unbiased treatment effect estimation is achieved once the adjustment features contain all the confounders. However, the performance of empirical estimation varies considerably with different extra variables. To address this issue, variable separation/selection for treatment effect estimation has received growing attention when the extra variables contain instrumental variables and precision variables.

In this paper, assuming no mediator variables exist, we consider a more general setting by allowing for the existence of post-treatment and post-outcome variables rather than instrumental and precision variables in observed covariates. Our target is to separate the treatment-only variables from the adjustment features. To this end, we establish a metric named Optimal Adjustment Features (OAF), which empirically measures the asymptotic variance of the estimation. Theoretically, we show that our OAF metric is minimized if and only if the adjustment features consist of the confounders and outcome-only variables, i.e., the treatment-only variables are perfectly separated. As optimizing the OAF metric is a combinatorial optimization problem, we introduce Reinforcement Learning (RL) and adopt the policy gradient to search for the optimal adjustment set. Empirical results on both synthetic and real-world datasets demonstrate that (a) our method successfully searches the optimal adjustment features and (b) the searched adjustment features achieve a more precise estimation of the treatment effect.

LightToken: A Task and Model-agnostic Lightweight Token Embedding Framework for Pre-trained Language Models

Pre-trained language models (PLMs) such as BERT, RoBERTa, and DeBERTa have achieved state-of-the-art performance on various downstream tasks. The enormous sizes of PLMs hinder their deployment in resource-constrained scenarios, e.g., on edge and mobile devices. To address this issue, many model compression approaches have been proposed to reduce the number of model parameters. This paper focuses on compressing the token embedding matrices of PLMs, which typically make up a large proportion (around 20-30%) of the entire model parameters. Existing efforts to compress token embedding usually require the introduction of customized compression architectures or the optimization of model compression processes for individual downstream tasks, limiting their applicability in both model and task dimensions. To overcome these limitations and adhere to the principle of "one-for-all", we propose a lightweight token embedding framework named LightToken, which is able to produce compressed token embedding in a task and model-agnostic fashion. LightToken is generally compatible with different architectures and applicable to any downstream task. Specifically, through an integration of low-rank approximation, a novel residual binary autoencoder, and a new compression loss function, LightToken can significantly improve the model compression ratio. To demonstrate the effectiveness of LightToken, we conduct comprehensive experiments on natural language understanding and question answering tasks. In particular, LightToken improves the state-of-the-art token embedding compression ratio from 5 to 25 and outperforms the existing token embedding compression approaches by 11% and 5% on GLUE and SQuAD v1.1 benchmarks, respectively.
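The low-rank component of such a compression pipeline can be sketched with a truncated SVD of the embedding matrix. This shows only the low-rank step under illustrative shapes, not LightToken's residual binary autoencoder or compression loss.

```python
import numpy as np

def low_rank_compress(E, rank):
    """Factor a token embedding matrix E (V x d) into U_r (V x r) and W_r (r x d)."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]  # absorb singular values into the left factor
    W_r = Vt[:rank]
    return U_r, W_r

# toy "vocabulary" of 1000 tokens with 64-dim embeddings of true rank 8
rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 64))
U_r, W_r = low_rank_compress(E, rank=8)
err = np.linalg.norm(E - U_r @ W_r) / np.linalg.norm(E)
```

Here the factored form stores 1000*8 + 8*64 numbers instead of 1000*64, and since the toy matrix is exactly rank 8 the reconstruction error is at numerical-precision level; real embedding matrices are only approximately low-rank, which is why additional residual compression helps.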

Adversarial Constrained Bidding via Minimax Regret Optimization with Causality-Aware Reinforcement Learning

The proliferation of the Internet has led to the emergence of online advertising, driven by the mechanics of online auctions. In these repeated auctions, software agents participate on behalf of aggregated advertisers to optimize for their long-term utility. To fulfill the diverse demands, bidding strategies are employed to optimize advertising objectives subject to different spending constraints. Existing approaches to constrained bidding typically rely on i.i.d. train and test conditions, which contradicts the adversarial nature of online ad markets where different parties possess potentially conflicting objectives. In this regard, we explore the problem of constrained bidding in adversarial bidding environments, which assumes no knowledge about the adversarial factors. Instead of relying on the i.i.d. assumption, our insight is to align the train distribution of environments with the potential test distribution while minimizing policy regret. Based on this insight, we propose a practical Minimax Regret Optimization (MiRO) approach that interleaves between a teacher finding adversarial environments for tutoring and a learner meta-learning its policy over the given distribution of environments. In addition, we are, to our knowledge, the first to incorporate expert demonstrations for learning bidding strategies. Through a causality-aware policy design, we improve upon MiRO by distilling knowledge from the experts. Extensive experiments on both industrial data and synthetic data show that our method, MiRO with Causality-aware reinforcement Learning (MiROCL), outperforms prior methods by over 30%.

Efficient and Effective Edge-wise Graph Representation Learning

Graph representation learning (GRL) is a powerful tool for graph analysis, which has gained massive attention from both academia and industry due to its superior performance in various real-world applications. However, the majority of existing works for GRL are dedicated to node-based tasks and thus focus on producing node representations. Although such methods can be used to derive edge representations by regarding edges as nodes, they suffer from sub-par utility in practical edge-wise applications, such as financial fraud detection and review spam combating, because they neglect the unique properties of edges and carry inherent drawbacks. Moreover, to our knowledge, research devoted to edge representation learning remains scarce, and existing methods either require high computational costs for sampling random walks or yield severely compromised representation quality because they fall short of capturing high-order information between edges. To address these challenges, we present TER and AER, which generate high-quality edge representation vectors based on the graph structure surrounding edges and edge attributes, respectively. In particular, TER can accurately encode high-order proximities of edges into low-dimensional vectors in a practically efficient and theoretically sound way, while AER augments edge attributes through a carefully-designed feature aggregation scheme. Our extensive experimental study demonstrates that the combined edge representations of TER and AER achieve significantly superior performance in terms of edge classification on 8 real-life datasets, while being up to one order of magnitude faster than 16 baselines on large graphs.

PROSE: Graph Structure Learning via Progressive Strategy

Graph Neural Networks (GNNs) have been a powerful tool to acquire high-quality node representations from graphs, and their performance strongly depends on a reliable graph structure. In real-world scenarios, noise in the graph topology is inevitable. To protect GNNs from the disturbance of irrelevant or missing edges, graph structure learning has been proposed and has attracted considerable attention in recent years. In this paper, we argue that current graph structure learning methods pay no regard to the status of nodes and judge all of their connections simultaneously using a uniform standard, which leads to indeterminacy and instability in the optimization process. We designate these methods as status-unaware models. To demonstrate the rationality of our point of view, we conduct exploratory experiments on publicly available datasets and make some interesting observations. Motivated by these observations, we propose a new model named Graph Structure Learning via Progressive Strategy (PROSE), which uses a progressive strategy to acquire an ideal graph structure in a status-aware way. Concretely, PROSE consists of a progressive structure splitting module (PSS) and a progressive structure refining module (PSR) to modify node connections according to their global potency, and we also introduce horizontal position encoding and vertical position encoding in order to capture fruitful graph topology information ignored by previous methods. On several widely-used graph datasets, we conduct extensive experiments to demonstrate the effectiveness of our model; the source code is available at https://github.com/tigerbunny2023/PROSE.

Empower Post-hoc Graph Explanations with Information Bottleneck: A Pre-training and Fine-tuning Perspective

Recent research on explaining Graph Neural Networks (GNNs) typically relies on access to a task-specific GNN, which may hinder wide application in practice. Specifically, task-specific explanation methods are incapable of explaining pretrained GNNs whose downstream tasks are usually inaccessible, not to mention giving explanations for the transferable knowledge in pretrained GNNs. Additionally, task-specific methods only consider target models' output in the label space, which is coarse-grained and insufficient to reflect the model's internal logic. To address these limitations, we consider a two-stage explanation strategy, i.e., explainers are first pretrained in a task-agnostic fashion in the representation space and then further fine-tuned in the task-specific label space and representation space jointly if downstream tasks are accessible. The two-stage explanation strategy endows post-hoc graph explanations with the applicability to pretrained GNNs where downstream tasks are inaccessible and the capacity to explain the transferable knowledge in the pretrained GNNs. Moreover, as the two-stage explanation strategy explains the GNNs in the representation space, the fine-grained information in the representation space also empowers the explanations. Furthermore, to achieve a trade-off between the fidelity and intelligibility of explanations, we propose an explanation framework based on the Information Bottleneck principle, named Explainable Graph Information Bottleneck (EGIB). EGIB subsumes the task-specific explanation and task-agnostic explanation into a unified framework. To optimize the EGIB objective, we derive a tractable bound and adopt a simple yet effective explanation generation architecture. Based on the unified framework, we further theoretically prove that task-agnostic explanation is a relaxed sufficient condition of task-specific explanation, which indicates the transferability of task-agnostic explanations. Extensive experimental results demonstrate the effectiveness of our proposed explanation method.

WHEN: A Wavelet-DTW Hybrid Attention Network for Heterogeneous Time Series Analysis

Given its broad applications, time series analysis has gained substantial research attention but remains a very challenging task. Recent years have witnessed the great success of deep learning methods, e.g., CNN and RNN, in time series classification and forecasting, but heterogeneity, the very nature of time series, has not yet been addressed adequately and remains a key performance bottleneck. In this light, we argue that intra-sequence non-stationarity and inter-sequence asynchronism are two types of heterogeneity that widely exist in multiple time series, and propose a hybrid attention network called WHEN as a deep learning solution. WHEN features two attention mechanisms in two different modules. In the WaveAtt module, we propose a novel data-dependent wavelet function and integrate it into the BiLSTM network as the wavelet attention, for the purpose of analyzing dynamic frequency components in nonstationary time series. In the DTWAtt module, we transform the dynamic time warping (DTW) technique into a DTW attention, where all input sequences are synchronized with a universal parameter sequence to overcome the time distortion problem in multiple time series. WHEN with the hybrid attentions is then formulated as a task-dependent neural network for either classification or forecasting tasks. Extensive experiments on 30 UEA datasets and 3 real-world datasets with rich competitive baselines demonstrate the excellent performance of our model. The ability of WHEN in dealing with time series heterogeneities is also elaborately explored via specially designed analysis.
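The dynamic time warping technique that the DTWAtt module builds on can be sketched as the classic dynamic-programming recurrence. This is plain DTW, not the differentiable attention form used in WHEN.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # each cell extends the cheapest of match / insertion / deletion
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

d_same = dtw_distance([0, 1, 2, 3], [0, 1, 2, 3])
d_shift = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])  # time-shifted copy
```

The second call illustrates why DTW matters for asynchronous series: the warping path aligns the duplicated leading sample at zero cost, whereas a pointwise (Euclidean) comparison would penalize the shift.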

Federated Few-shot Learning

Federated Learning (FL) enables multiple clients to collaboratively learn a machine learning model without exchanging their own local data. In this way, the server can exploit the computational power of all clients and train the model on a larger set of data samples among all clients. Although such a mechanism is proven to be effective in various fields, existing works generally assume that each client preserves sufficient data for training. In practice, however, certain clients can only contain a limited number of samples (i.e., few-shot samples). For example, the available photo data taken by a specific user with a new mobile device is relatively rare. In this scenario, existing FL efforts typically encounter a significant performance drop on these clients. Therefore, it is urgent to develop a few-shot model that can generalize to clients with limited data under the FL scenario. In this paper, we refer to this novel problem as federated few-shot learning. Nevertheless, the problem remains challenging due to two major reasons: the global data variance among clients (i.e., the difference in data distributions among clients) and the local data insufficiency in each client (i.e., the lack of adequate local data for training). To overcome these two challenges, we propose a novel federated few-shot learning framework with two separately updated models and dedicated training strategies to reduce the adverse impact of global data variance and local data insufficiency. Extensive experiments on four prevalent datasets that cover news articles and images validate the effectiveness of our framework compared with the state-of-the-art baselines.
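For context, the server-side aggregation that federated frameworks build on can be sketched as FedAvg-style weighted averaging of client models. This is a generic sketch of standard FL aggregation, not the paper's dedicated two-model training strategy.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server-side FedAvg: average client parameter vectors weighted by
    each client's number of local samples."""
    total = sum(client_sizes)
    agg = np.zeros_like(client_weights[0], dtype=float)
    for w, n in zip(client_weights, client_sizes):
        agg += (n / total) * w
    return agg

# two clients; the second holds 3x more data, so it dominates the average
w_global = fedavg([np.array([1.0, 2.0]), np.array([3.0, 4.0])], [10, 30])
```

The sample-size weighting is exactly what makes few-shot clients problematic: a client with very few samples contributes almost nothing to the global model, motivating the dedicated handling the paper proposes.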

Contrastive Meta-Learning for Few-shot Node Classification

Few-shot node classification, which aims to predict labels for nodes on graphs with only limited labeled nodes as references, is of great significance in real-world graph mining tasks. To tackle such a label shortage issue, existing works generally leverage the meta-learning framework, which utilizes a number of episodes to extract transferable knowledge from classes with abundant labeled nodes and generalizes the knowledge to other classes with limited labeled nodes. In essence, the primary aim of few-shot node classification is to learn node embeddings that are generalizable across different classes. To accomplish this, the GNN encoder must be able to distinguish node embeddings between different classes, while also aligning embeddings for nodes in the same class. Thus, in this work, we propose to consider both the intra-class and inter-class generalizability of the model. We create a novel contrastive meta-learning framework on graphs, named COSMIC, with two key designs. First, we propose to enhance the intra-class generalizability by involving a contrastive two-step optimization in each episode to explicitly align node embeddings in the same classes. Second, we strengthen the inter-class generalizability by generating hard node classes for classification via a novel similarity-sensitive mix-up strategy. Extensive experiments on prevalent few-shot node classification datasets verify the effectiveness of our framework and demonstrate its superiority over other state-of-the-art baselines.
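The mix-up idea behind generating hard classes can be sketched as interpolation between class-prototype embeddings. This is a generic embedding mix-up; COSMIC's similarity-sensitive variant additionally adapts the interpolation to class similarity, which is omitted here.

```python
import numpy as np

def embedding_mixup(z_a, z_b, alpha=0.5, seed=0):
    """Interpolate two class-prototype embeddings with a Beta-sampled
    coefficient to synthesize a harder, in-between class."""
    lam = np.random.default_rng(seed).beta(alpha, alpha)
    return lam * z_a + (1 - lam) * z_b, lam

# two one-hot class prototypes; the mixture lies between them
z_mix, lam = embedding_mixup(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the synthesized prototype sits between two real classes, a classifier must sharpen its decision boundary to separate them, which is the intended pressure toward inter-class generalizability.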

Improving Conversational Recommendation Systems via Counterfactual Data Simulation

Conversational recommender systems (CRSs) aim to provide recommendation services via natural language conversations. Although a number of approaches have been proposed for developing capable CRSs, they typically require large amounts of training data. Since it is difficult to annotate recommendation-oriented dialogue datasets, existing CRS approaches often suffer from insufficient training due to the scarcity of training data.

To address this issue, in this paper, we propose a CounterFactual data simulation approach for CRS, named CFCRS, to alleviate the issue of data scarcity in CRSs. Our approach is developed based on the framework of counterfactual data augmentation, which gradually incorporates rewriting of the user preference from a real dialogue without interfering with the entire conversation flow. To develop our approach, we characterize user preference and organize the conversation flow by the entities involved in the dialogue, and design a multi-stage recommendation dialogue simulator based on a conversation flow language model. Under the guidance of the learned user preference and dialogue schema, the flow language model can produce reasonable, coherent conversation flows, which can be further realized into complete dialogues. Based on the simulator, we perform the intervention at the representations of the interacted entities of target users, and design an adversarial training method with a curriculum schedule that can gradually optimize the data augmentation strategy. Extensive experiments show that our approach can consistently boost the performance of several competitive CRSs, and outperform other data augmentation methods, especially when the training data is limited. Our code is publicly available at https://github.com/RUCAIBox/CFCRS.

An Observed Value Consistent Diffusion Model for Imputing Missing Values in Multivariate Time Series

Missing values, which are common in multivariate time series, are a major obstacle to the utilization and interpretation of such data. Great effort has been devoted to accurately imputing missing values in multivariate time series: existing works either use deep learning networks to achieve deterministic imputations or aim at generating different plausible imputations by sampling multiple noises from the same distribution and then denoising them. However, these models either fall short of modeling the uncertainties of imputations due to their deterministic nature, or perform poorly in terms of interpretability and imputation accuracy because they ignore the correlations between the latent representations of observed and missing values, which are parts of samples from the same distribution. To this end, in this paper, we explicitly take the correlations between observed and missing values into account, and theoretically re-derive the Evidence Lower BOund (ELBO) of the conditional diffusion model in the scenario of multivariate time series imputation. Based on the newly derived ELBO, we further propose a novel multivariate imputation diffusion model (MIDM) which is equipped with novel noise sampling, adding, and denoising mechanisms for multivariate time series imputation; these newly designed mechanisms jointly enforce consistency between observed and missing values. Extensive experiments on both multivariate time series imputation and forecasting tasks demonstrate the superiority of our proposed MIDM model in generating conditional estimations.

Automated 3D Pre-Training for Molecular Property Prediction

Molecular property prediction is an important problem in drug discovery and materials science. As geometric structures have been demonstrated necessary for molecular property prediction, 3D information has been combined with various graph learning methods to boost prediction performance. However, obtaining the geometric structure of molecules is not feasible in many real-world applications due to the high computational cost. In this work, we propose a novel 3D pre-training framework (dubbed 3D PGT), which pre-trains a model on 3D molecular graphs and then fine-tunes it on molecular graphs without 3D structures. Based on the fact that bond length, bond angle, and dihedral angle are three basic geometric descriptors corresponding to a complete molecular 3D conformer, we first develop a multi-task generative pre-training framework based on these three attributes. Next, to automatically fuse these three generative tasks, we design a surrogate metric based on the total energy to search for the weight distribution of the three pretext tasks, since the total energy corresponds to the quality of the 3D conformer. Extensive experiments on 2D molecular graphs are conducted to demonstrate the accuracy, efficiency, and generalization ability of the proposed 3D PGT compared to various pre-training baselines.

Efficient Sparse Linear Bandits under High Dimensional Data

We propose a computationally efficient Lasso Random Projection Bandit (LRP-Bandit) algorithm for sparse linear bandit problems under high-dimensional settings with limited samples. LRP-Bandit bridges Lasso and Random Projection as feature selection and dimension reduction techniques to alleviate the computational complexity and improve the regret performance. We demonstrate that for the total feature dimension d, the significant feature dimension s, and the sample size T, the expected cumulative regret under LRP-Bandit is upper bounded by Õ(T^{2/3} s^{3/2} log^{7/6} d), where Õ suppresses the logarithmic dependence on T. Further, we show that when available samples are larger than a problem-dependent threshold, the regret upper bound for LRP-Bandit can be further improved to Õ(s√T log d). These regret upper bounds on T for both data-poor and data-rich regimes match the theoretical minimax lower bounds up to logarithmic factors. Through experiments, we show that LRP-Bandit is computationally efficient and outperforms other benchmarks on the expected cumulative regret.
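The random-projection half of such a pipeline can be illustrated with a Johnson-Lindenstrauss-style Gaussian projection, which reduces dimension while approximately preserving geometry. This is a generic sketch of the dimension-reduction step only; the Lasso feature-selection step and the bandit loop are omitted.

```python
import numpy as np

def random_project(X, k, seed=0):
    """Gaussian random projection of d-dim features down to k dims."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    P = rng.standard_normal((d, k)) / np.sqrt(k)  # scaling preserves norms in expectation
    return X @ P

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 1000))   # 50 contexts in 1000 dimensions
Z = random_project(X, k=200)

# pairwise distances are approximately preserved after projection
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(Z[0] - Z[1])
```

With k around 200, the distance distortion is typically on the order of a few percent, which is why the bandit can run its linear computations in the much cheaper projected space.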

Theoretical Convergence Guaranteed Resource-Adaptive Federated Learning with Mixed Heterogeneity

In this paper, we propose an adaptive learning paradigm for resource-constrained cross-device federated learning, in which heterogeneous local submodels with varying resources can be jointly trained to produce a global model. Different from existing studies, the submodel structures of different clients are formed by arbitrarily assigned neurons according to their local resources. Along this line, we first design a general resource-adaptive federated learning algorithm, namely RA-Fed, and rigorously prove its convergence with the asymptotically optimal rate O(1/√(Γ*TQ)) under loose assumptions. Furthermore, to address both submodel heterogeneity and data heterogeneity under non-uniform training, we propose a new server aggregation mechanism, RAM-Fed, with the same theoretically proven convergence rate. Moreover, we shed light on several key factors impacting convergence, such as the minimum coverage rate, the data heterogeneity level, and submodel-induced noise. Finally, we conduct extensive experiments on two types of tasks with three widely used datasets under different experimental settings. Compared with the state-of-the-art methods, our methods improve accuracy by up to 10% on average. Particularly, when submodels are jointly trained with 50% of the parameters, RAM-Fed achieves accuracy comparable to FedAvg trained with the full model.

Unbiased Delayed Feedback Label Correction for Conversion Rate Prediction

Conversion rate prediction is critical to many online applications such as digital display advertising. To capture dynamic data distribution, industrial systems often require retraining models on recent data daily or weekly. However, the delay of conversion behavior usually leads to incorrect labeling, which is called the delayed feedback problem. Existing work may fail to introduce the correct information about false negative samples due to data sparsity and dynamic data distribution. To directly introduce the correct feedback label information, we propose an Unbiased delayed feedback Label Correction framework (ULC), which uses an auxiliary model to correct labels for observed negative feedback samples. Firstly, we theoretically prove that the label-corrected loss is an unbiased estimate of the oracle loss using true labels. Then, as no ready-made training data exist for label correction, counterfactual labeling is used to construct artificial training data. Furthermore, since counterfactual labeling utilizes only partial training data, we design an embedding-based alternative training method to enhance performance. Comparative experiments on both public and private datasets and detailed analyses show that our proposed approach effectively alleviates the delayed feedback problem and consistently outperforms the previous state-of-the-art methods.
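The label-correction idea can be sketched as a soft-label cross-entropy in which an auxiliary model supplies, for each observed negative, the probability that it is actually a delayed (false-negative) conversion. This is a generic soft-correction sketch; ULC's exact unbiased estimator differs in detail.

```python
import math

def corrected_loss(pred, observed_label, p_pos_given_neg):
    """Cross-entropy where observed negatives are treated as a mixture of
    true negatives and not-yet-converted positives, weighted by an
    auxiliary model's estimate p_pos_given_neg."""
    eps = 1e-12
    if observed_label == 1:
        return -math.log(pred + eps)
    w = p_pos_given_neg
    return -(w * math.log(pred + eps) + (1 - w) * math.log(1 - pred + eps))

# same prediction, but the auxiliary model flags the second sample as a
# likely delayed conversion, so the "negative" label is largely overridden
l_confident_neg = corrected_loss(0.1, 0, p_pos_given_neg=0.0)
l_suspect_neg = corrected_loss(0.1, 0, p_pos_given_neg=0.9)
```

When p_pos_given_neg is 0 this reduces to the ordinary negative-label loss; as it grows, the loss increasingly penalizes low predicted conversion probability, steering the main model away from trusting the stale negative label.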

Rapid Image Labeling via Neuro-Symbolic Learning

The success of Computer Vision (CV) relies heavily on manually annotated data. However, it is prohibitively expensive to annotate images in key domains such as healthcare, where data labeling requires significant domain expertise and cannot be easily delegated to crowd workers. To address this challenge, we propose a neuro-symbolic approach called RAPID, which infers image labeling rules from a small amount of labeled data provided by domain experts and automatically labels unannotated data using the rules. Specifically, RAPID combines pre-trained CV models and inductive logic learning to infer the logic-based labeling rules. RAPID achieves a labeling accuracy of 83.33% to 88.33% on four image labeling tasks with only 12 to 39 labeled samples. In particular, RAPID significantly outperforms finetuned CV models in two highly specialized tasks. These results demonstrate the effectiveness of RAPID in learning from small data and its capability to generalize among different tasks. Code and our dataset are publicly available at https://github.com/Neural-Symbolic-Image-Labeling/Rapid/

Learning to Schedule in Diffusion Probabilistic Models

Recently, the field of generative models has seen a significant advancement with the introduction of Diffusion Probabilistic Models (DPMs). The Denoising Diffusion Implicit Model (DDIM) was designed to reduce computational time by skipping a number of steps in the inference process of DPMs. However, the hand-crafted sampling schedule in DDIM, which relies on human expertise, has its limitations in considering all relevant factors in the sampling process. Additionally, the assumption that all instances should have the same schedule is not always valid. To address these problems, this paper proposes a method that leverages reinforcement learning to automatically search for an optimal sampling schedule for DPMs. This is achieved by a policy network that predicts the next step to visit based on the current state of the noisy image. The optimization of the policy network is accomplished using an episodic actor-critic framework, which incorporates reinforcement learning. Empirical results demonstrate the superiority of our approach over various datasets with different timesteps. We also observe that the trained sampling schedule has a strong generalization ability across different DPM baselines.
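The hand-crafted baseline that the learned policy replaces is easy to state: pick a fixed, evenly spaced subset of the T diffusion timesteps. The sketch below shows this uniform DDIM-style schedule (illustrative only, not the paper's learned scheduler, which selects steps adaptively per instance).

```python
def uniform_schedule(T, steps):
    """Hand-crafted DDIM-style schedule: `steps` evenly spaced timesteps
    out of T diffusion steps, the same subset for every instance."""
    stride = T // steps
    return list(range(0, T, stride))[:steps]

# sample only 10 of 1000 timesteps during inference
sched = uniform_schedule(T=1000, steps=10)
```

The rigidity of this rule, one stride for all images and all noise levels, is precisely the limitation that motivates learning a state-dependent schedule with a policy network.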

A Message Passing Neural Network Space for Better Capturing Data-dependent Receptive Fields

Recently, the message passing neural network (MPNN), which learns node representations based on the receptive field of a given node, has attracted a lot of attention. Despite its success in many graph-related tasks, recent studies find that conventional MPNNs are incapable of handling the variant receptive fields required by different graphs, and thereby some upgraded MPNNs have been developed. However, these methods are limited to designing a common solution for different graphs, which fails to capture the impact of different graph properties on the receptive fields. To alleviate such issues, we propose a novel MPNN space for data-dependent receptive fields (MpnnDRF), which enables us to dynamically design suitable MPNNs to capture the receptive field for a given graph. More concretely, we systematically investigate the capability of existing designs and propose several key design dimensions to improve them. Then, to fully explore the proposed designs and useful designs in existing works, we propose a novel search space to incorporate them and formulate a search framework. In the empirical study, the proposed MpnnDRF shows very strong robustness against increased receptive fields, which allows MpnnDRF to learn node representations based on a larger receptive field. Therefore, MpnnDRF consistently achieves outstanding performance on benchmark node and graph classification tasks.

Efficient Bi-Level Optimization for Recommendation Denoising

The acquisition of explicit user feedback (e.g., ratings) in real-world recommender systems is often hindered by the need for active user involvement. To mitigate this issue, implicit feedback (e.g., clicks) generated during user browsing is exploited as a viable substitute. However, implicit feedback possesses a high degree of noise, which significantly undermines recommendation quality. While many methods have been proposed to address this issue by assigning varying weights to implicit feedback, two shortcomings persist: (1) the weight calculation in these methods is iteration-independent, without considering the influence of weights in previous iterations, and (2) the weight calculation often relies on prior knowledge, which may not always be readily available or universally applicable.

To overcome these two limitations, we model recommendation denoising as a bi-level optimization problem. The inner optimization aims to derive an effective model for recommendation, as well as to guide the weight determination, thereby eliminating the need for prior knowledge. The outer optimization leverages gradients of the inner optimization and adjusts the weights in a manner that accounts for the impact of previous weights. To efficiently solve this bi-level optimization problem, we employ a weight generator to avoid storing the weights and a one-step gradient-matching-based loss to significantly reduce computational time. The experimental results on three benchmark datasets demonstrate that our proposed approach outperforms both state-of-the-art general and denoising recommendation models. The code is available at https://github.com/CoderWZW/BOD.
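To make the bi-level structure concrete, here is a minimal, hypothetical sketch on a scalar linear model (not the paper's algorithm or its weight generator): the inner step updates the model on a weighted training loss, and the outer step back-propagates a clean validation loss through that one inner step to adjust the per-sample weights, so each weight update depends on the weights from the previous iteration:

```python
import numpy as np

def bilevel_step(theta, w, x, y, x_val, y_val, lr_in=0.1, lr_out=0.1):
    """One iteration of a toy bi-level denoising loop (scalar model y = theta*x).

    Inner problem: one gradient step on the w-weighted training loss.
    Outer problem: adjust w so the *updated* model does well on clean
    validation data (one-step gradient approximation).
    """
    # inner: weighted squared loss, one gradient step on theta
    resid = theta * x - y
    grad_theta = np.sum(w * 2 * resid * x)
    theta_new = theta - lr_in * grad_theta

    # outer: d L_val(theta_new) / d w via the chain rule
    resid_val = theta_new * x_val - y_val
    dLval_dtheta = np.sum(2 * resid_val * x_val)
    dtheta_dw = -lr_in * 2 * resid * x           # per-sample sensitivity of theta_new
    grad_w = dLval_dtheta * dtheta_dw
    w = np.clip(w - lr_out * grad_w, 0.0, 1.0)   # keep weights in [0, 1]
    return theta_new, w
```

Iterating this on a two-sample problem where one label is corrupted drives the corrupted sample's weight below the clean one's, which is the qualitative behavior a denoising weight scheme is after.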

Meta Graph Learning for Long-tail Recommendation

A highly skewed, long-tail item distribution commonly hurts model performance on tail items in recommendation systems, especially for graph-based recommendation models. We propose a novel idea to learn relations among items as an auxiliary graph to enhance graph-based representation learning and make recommendations collectively in a coupled framework. This raises two challenges: 1) the long-tail downstream information may also bias the auxiliary graph learning, and 2) the learned auxiliary graph may cause negative transfer to the original user-item bipartite graph. We propose a novel Meta Graph Learning framework for long-tail recommendation (MGL) that solves both challenges. The meta-learning strategy is introduced to the learning of an edge generator, which is first tuned to reconstruct a debiased item co-occurrence matrix, and then virtually evaluated on generating item relations for recommendation. Moreover, we propose a popularity-aware contrastive learning strategy to prevent negative transfer by aligning the confident head item representations with those of the learned auxiliary graph. Experiments on public datasets demonstrate that our proposed model significantly outperforms strong baselines for tail items without compromising the overall performance.

To Aggregate or Not? Learning with Separate Noisy Labels

Raw training data often comes with separate noisy labels collected from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into one and then apply standard training methods. The literature has also extensively studied effective aggregation approaches. This paper revisits this choice and aims to answer the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.
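The two choices the paper compares can be stated very compactly on the data-preparation side. Below is a minimal, hypothetical sketch of each option (majority-vote aggregation versus keeping every annotator label as its own training sample); the learning method on top is unchanged:

```python
import numpy as np

def aggregate_majority(labels):
    """Aggregation: majority-vote each row of binary annotator labels
    (shape: n_examples x n_annotators) into a single label."""
    return (labels.mean(axis=1) >= 0.5).astype(int)

def separate_pairs(X, labels):
    """Separation: keep labels as given, so every (example, annotator-label)
    pair becomes its own training sample."""
    n, m = labels.shape
    X_rep = np.repeat(X, m, axis=0)
    y_flat = labels.reshape(-1)
    return X_rep, y_flat
```

With aggregation the dataset size stays n; with separation it grows to n x m, and the paper's analysis is about when the extra (noisier) samples help more than a single cleaned-up label.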

Granger Causal Chain Discovery for Sepsis-Associated Derangements via Continuous-Time Hawkes Processes

Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Clinically relevant and interpretable results require a framework that can (i) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (ii) identify patterns that precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with a ReLU link function, to recover a Granger Causal (GC) graph with both exciting and inhibiting effects. We develop a scalable two-phase gradient-based method to obtain a maximum surrogate-likelihood estimator, which is shown to be effective via extensive numerical simulation. Our method is subsequently applied to a data set of patients admitted to the Grady hospital system in Atlanta, GA, USA, where the estimated GC graph identifies several highly interpretable GC chains that precede sepsis. The code is available at https://github.com/SongWei-GT/two-phase-MHP.
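The key modeling ingredient here, a linear multivariate Hawkes intensity passed through a ReLU so that negative interaction weights can encode inhibition as well as excitation, can be sketched in a few lines. This is an illustrative toy with an exponential kernel, not the authors' estimator, and all names are hypothetical:

```python
import numpy as np

def relu_hawkes_intensity(t, events, mu, A, beta):
    """Conditional intensity of a multivariate Hawkes process with a ReLU link:
    lambda_i(t) = max(0, mu_i + sum over past events (t_k, j) of
                         A[i, j] * exp(-beta * (t - t_k))).

    A negative A[i, j] encodes an inhibiting effect of dimension j on
    dimension i; the ReLU keeps the intensity non-negative, which is what
    lets the model express both exciting and inhibiting GC edges.
    """
    lam = mu.astype(float).copy()
    for (t_k, j) in events:          # events: list of (time, dimension) pairs
        if t_k < t:
            lam += A[:, j] * np.exp(-beta * (t - t_k))
    return np.maximum(lam, 0.0)
```

A single past event in dimension 1 with A[0, 1] < 0 pushes dimension 0's intensity below zero, and the ReLU clips it to zero, which an ordinary (link-free) linear Hawkes model cannot represent.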

Efficient and Joint Hyperparameter and Architecture Search for Collaborative Filtering

Automated Machine Learning (AutoML) techniques have recently been introduced to design Collaborative Filtering (CF) models in a data-specific manner. However, existing works either search architectures or hyperparameters, ignoring the fact that they are intrinsically related and should be considered together. This motivates us to consider a joint hyperparameter and architecture search method to design CF models. However, this is not easy because of the large search space and high evaluation cost. To address these challenges, we reduce the space by screening out useless hyperparameter choices through a comprehensive understanding of individual hyperparameters. Next, we propose a two-stage search algorithm to find proper configurations from the reduced space. In the first stage, we leverage knowledge from subsampled datasets to reduce evaluation costs; in the second stage, we efficiently fine-tune the top candidate models on the whole dataset. Extensive experiments on real-world datasets show that better performance can be achieved compared with both hand-designed and previously searched models. Besides, ablation and case studies demonstrate the effectiveness of our search framework.

Deep Bayesian Active Learning for Accelerating Stochastic Simulation

Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. While deep surrogate models can speed up the simulations, doing so for stochastic simulations and with active learning approaches is an underexplored area. We propose Interactive Neural Process (INP), a deep Bayesian active learning framework for learning deep surrogate models to accelerate stochastic simulations. INP consists of two components: a spatiotemporal surrogate model built upon the Neural Process (NP) family and an acquisition function for active learning. For surrogate modeling, we develop Spatiotemporal Neural Process (STNP) to mimic the simulator dynamics. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. We perform a theoretical analysis and demonstrate that LIG reduces sample complexity compared with random sampling in high dimensions. We also conduct empirical studies on three complex spatiotemporal simulators for reaction diffusion, heat flow, and infectious disease. The results demonstrate that STNP outperforms the baselines in the offline learning setting and LIG achieves state-of-the-art performance for Bayesian active learning.

Self-Adaptive Perturbation Radii for Adversarial Training

Adversarial training has been shown to be the most popular and effective technique to protect models from imperceptible adversarial samples. Despite its success, it is also accompanied by significant performance degradation on clean data. To achieve good performance on both clean and adversarial samples, the main line of effort searches for an adaptive perturbation radius for each training sample. However, this approach suffers from a conflict between exact searching and computational overhead. To address this conflict, in this paper, we first show the superiority of adaptive perturbation radii on accuracy and robustness, respectively. Then we propose a novel self-adaptive adjustment framework for perturbation radii that avoids tedious searching. We also discuss this framework for both deep neural networks (DNNs) and kernel support vector machines (SVMs). Finally, extensive experimental results show that our framework can improve adversarial robustness without compromising natural generalization. It is also competitive with existing searching strategies in terms of running time.

DECOR: Degree-Corrected Social Graph Refinement for Fake News Detection

Recent efforts in fake news detection have witnessed a surge of interest in using graph neural networks (GNNs) to exploit rich social context. Existing studies generally leverage fixed graph structures, assuming that the graphs accurately represent the related social engagements. However, edge noise remains a critical challenge in real-world graphs, as training on suboptimal structures can severely limit the expressiveness of GNNs. Despite initial efforts in graph structure learning (GSL), prior works often leverage node features to update edge weights, resulting in heavy computational costs that hinder the methods' applicability to large-scale social graphs. In this work, we approach the fake news detection problem from the novel angle of social graph refinement. We find that the degrees of news article nodes exhibit distinctive patterns, which are indicative of news veracity. Guided by this, we propose DECOR, a novel application of Degree-Corrected Stochastic Blockmodels to the fake news detection problem. Specifically, we encapsulate our empirical observations into a lightweight social graph refinement component that iteratively updates the edge weights via a learnable degree correction mask, which allows for joint optimization with a GNN-based detector. Extensive experiments on two real-world benchmarks validate the effectiveness and efficiency of DECOR.

Personalized Federated Learning with Parameter Propagation

With decentralized data collected from diverse clients, a personalized federated learning paradigm has been proposed for training machine learning models without exchanging raw data from local clients. We dive into personalized federated learning from the perspective of privacy-preserving transfer learning, and identify the limitations of previous personalized federated learning algorithms. First, previous works suffer from negative knowledge transferability for some clients, when focusing more on the overall performance of all clients. Second, high communication costs are required to explicitly learn statistical task relatedness among clients. Third, it is computationally expensive to generalize the learned knowledge from experienced clients to new clients.

To solve these problems, in this paper, we propose a novel federated parameter propagation (FEDORA) framework for personalized federated learning. Specifically, we reformulate the standard personalized federated learning as a privacy-preserving transfer learning problem, with the goal of improving the generalization performance for every client. The crucial idea behind FEDORA is to learn how to transfer and whether to transfer simultaneously, including (1) adaptive parameter propagation: one client is enforced to adaptively propagate its parameters to others based on their task relatedness (e.g., explicitly measured by distribution similarity), and (2) selective regularization: each client would regularize its local personalized model with received parameters, only when those parameters are positively correlated with the generalization performance of its local model. The experiments on a variety of federated learning benchmarks demonstrate the effectiveness of the proposed FEDORA framework over state-of-the-art personalized federated learning baselines.

Certified Edge Unlearning for Graph Neural Networks

The emergence of evolving data privacy policies and regulations has sparked a growing interest in the concept of "machine unlearning", which involves enabling machine learning models to forget specific data instances. In this paper, we specifically focus on edge unlearning in Graph Neural Networks (GNNs), which entails training a new GNN model as if certain specified edges never existed in the original training graph. Unlike conventional unlearning scenarios where data samples are treated as independent entities, edges in graphs exhibit correlation. Failing to carefully account for this data dependency would result in the incomplete removal of the requested data from the model. While retraining the model from scratch by excluding the specific edges can eliminate their influence, this approach incurs a high computational cost. To overcome this challenge, we introduce CEU, a Certified Edge Unlearning framework. CEU expedites the unlearning process by updating the parameters of the pre-trained GNN model in a single step, ensuring that the update removes the influence of the removed edges from the model. We formally prove that CEU offers a rigorous theoretical guarantee under the assumption of convexity on the loss function. Our empirical analysis further demonstrates the effectiveness and efficiency of CEU for both linear and deep GNNs - it achieves significant speedup gains compared to retraining and existing unlearning methods while maintaining comparable model accuracy to retraining from scratch.

Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation

Zero-Shot Learning (ZSL), which aims at automatically recognizing unseen objects, is a promising learning paradigm for machines to continuously acquire new real-world knowledge. Recently, the Knowledge Graph (KG) has proven an effective scheme for handling the zero-shot task with large-scale and non-attribute data. Prior studies typically embed relationships of seen and unseen objects into visual information from existing knowledge graphs to promote recognition of unseen data. In fact, real-world knowledge is naturally formed by multimodal facts. Compared with ordinary structural knowledge from a graph perspective, a multimodal KG can provide cognitive systems with fine-grained knowledge. For example, the text description and visual content can depict more critical details of a fact than knowledge triplets alone. Unfortunately, this multimodal fine-grained knowledge is largely unexploited due to the bottleneck of feature alignment between different modalities. To that end, we propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings via a designed dense attention module and self-calibration loss. This makes the semantic transfer process of our ZSL framework learn more differentiated knowledge between entities. Our model also avoids the performance limitation of relying only on coarse global features. We conduct extensive experiments and evaluate our model on large-scale real-world data. The experimental results clearly demonstrate the effectiveness of the proposed model in standard zero-shot classification tasks.

Towards Reliable Rare Category Analysis on Graphs via Individual Calibration

Rare categories abound in a number of real-world networks and play a pivotal role in a variety of high-stakes applications, including financial fraud detection, network intrusion detection, and rare disease diagnosis. Rare category analysis (RCA) refers to the task of detecting, characterizing, and comprehending the behaviors of minority classes in a highly-imbalanced data distribution. While the vast majority of existing work on RCA has focused on improving prediction performance, a few fundamental research questions have heretofore received little attention: How confident or uncertain is a prediction model in rare category analysis? How can we quantify the uncertainty in the learning process and enable reliable rare category analysis?

To answer these questions, we start by investigating miscalibration in existing RCA methods. Empirical results reveal that state-of-the-art RCA methods are mainly over-confident in predicting minority classes and under-confident in predicting majority classes. Motivated by the observation, we propose a novel individual calibration framework, named CALIRARE, for alleviating the unique challenges of RCA, thus enabling reliable rare category analysis. In particular, to quantify the uncertainties in RCA, we develop a node-level uncertainty quantification algorithm to model the overlapping support regions with high uncertainty; to handle the rarity of minority classes in miscalibration calculation, we generalize the distribution-based calibration metric to the instance level and propose the first individual calibration measurement on graphs named Expected Individual Calibration Error (EICE). We perform extensive experimental evaluations on real-world datasets, including rare category characterization and model calibration tasks, which demonstrate the significance of our proposed framework.

TransformerLight: A Novel Sequence Modeling Based Traffic Signaling Mechanism via Gated Transformer

Traffic signal control (TSC) is still one of the most significant and challenging research problems in the transportation field. Reinforcement learning (RL) has achieved great success in TSC but suffers from critically high learning costs in practical applications due to the excessive trial-and-error learning process. Offline RL is a promising method to reduce learning costs, whereas the data distribution shift issue is still up in the air. To this end, in this paper, we formulate TSC as a sequence modeling problem over a Markov decision process described by sequences of states, actions, and rewards from the traffic environment. A novel framework, namely TransformerLight, is introduced, which does not aim to fit value functions by averaging all possible returns, but produces the best possible actions using a gated Transformer. Additionally, motivated by a dynamical-systems perspective, TransformerLight replaces residual connections with gated transformer blocks, which makes the learning process much more stable. Through numerical experiments on offline datasets, we demonstrate that the TransformerLight model: (1) can build a high-performance adaptive TSC model without dynamic programming; (2) achieves a new state-of-the-art compared to most published offline RL methods so far; and (3) shows a more stable learning process than offline RL and recent Transformer-based methods. The relevant dataset and code are available on GitHub.

Serverless Federated AUPRC Optimization for Multi-Party Collaborative Imbalanced Data Mining

To address big data challenges, serverless multi-party collaborative training has recently attracted attention in the data mining community, since it cuts down communication costs by avoiding the server-node bottleneck. However, traditional serverless multi-party collaborative training algorithms were mainly designed for balanced data mining tasks and are intended to optimize accuracy (e.g., cross-entropy). The data distribution in many real-world applications is skewed, and classifiers trained to improve accuracy perform poorly when applied to imbalanced data tasks, since models can be significantly biased toward the primary class. Therefore, the Area Under the Precision-Recall Curve (AUPRC) was introduced as an effective metric. Although multiple single-machine methods have been designed to train models for AUPRC maximization, algorithms for multi-party collaborative training have never been studied. The change from the single-machine to the multi-party setting poses critical challenges. For example, existing single-machine AUPRC maximization algorithms maintain an inner state for each local data point; thus these methods are not applicable to large-scale multi-party collaborative training due to the dependence on each local data point.

To address the above challenge, in this paper, we reformulate the serverless multi-party collaborative AUPRC maximization problem as a conditional stochastic optimization problem in a serverless multi-party collaborative learning setting and propose a new ServerLess biAsed sTochastic gradiEnt (SLATE) algorithm to directly optimize the AUPRC. After that, we use the variance reduction technique and propose ServerLess biAsed sTochastic gradiEnt with Momentum-based variance reduction (SLATE-M) algorithm to improve the convergence rate, which matches the best theoretical convergence result reached by the single-machine online method. To the best of our knowledge, this is the first work to solve the multi-party collaborative AUPRC maximization problem. Finally, extensive experiments show the advantages of directly optimizing the AUPRC with distributed learning methods and also verify the efficiency of our new algorithms (i.e., SLATE and SLATE-M).

MicroscopeSketch: Accurate Sliding Estimation Using Adaptive Zooming

High-accuracy real-time data stream estimations are critical for various applications, and sliding-window-based techniques have attracted wide attention. However, existing solutions struggle to achieve high accuracy, generality, and low memory usage simultaneously. To overcome these limitations, we present MicroscopeSketch, a high-accuracy sketch framework. Our key technique, called adaptive zooming, dynamically adjusts the granularity of counters to maximize accuracy while minimizing memory usage. By applying MicroscopeSketch to three specific tasks---frequency estimation, top-k frequent items discovery, and top-k heavy changes identification---we demonstrate substantial improvements over existing methods, reducing errors by roughly 4 times for frequency estimation and 3 times for identifying top-k items. The relevant source code is available in a GitHub repository.
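While the paper's data structure is more involved, the core idea of adaptive zooming, trading counter granularity for range when a fixed-width counter overflows, can be sketched as follows. This is a rough, hypothetical illustration, not the MicroscopeSketch design itself:

```python
import random

class ZoomingCounter:
    """A counter stored in a fixed number of bits whose granularity
    ("zoom level") coarsens on overflow: the stored value counts in units
    of 2**zoom, so small flows stay exact while large flows trade
    precision for range.
    """
    def __init__(self, bits=4):
        self.max_stored = (1 << bits) - 1
        self.stored = 0      # value in units of 2**zoom
        self.zoom = 0        # current granularity exponent

    def increment(self):
        # probabilistic increment: add one stored unit with prob 2**-zoom,
        # so the estimate stays unbiased at coarser granularities
        if self.zoom == 0 or random.random() < 2.0 ** -self.zoom:
            self.stored += 1
        if self.stored > self.max_stored:    # overflow -> zoom out
            self.stored //= 2
            self.zoom += 1

    def estimate(self):
        return self.stored * (2 ** self.zoom)
```

With 4-bit storage the counter is exact up to 15; the 16th increment triggers a zoom-out, after which the counter tracks larger counts at halved resolution instead of requiring wider counters.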

MedLink: De-Identified Patient Health Record Linkage

A comprehensive patient health history is essential for patient care and healthcare research. However, due to the distributed nature of healthcare services, patient health records are often scattered across multiple systems. Existing record linkage approaches primarily rely on patient identifiers, which have inherent limitations such as privacy invasion and identifier discrepancies. To tackle this problem, we propose linking de-identified patient health records by matching health patterns without strictly relying on sensitive patient identifiers. Our model MedLink solves two challenges faced in the patient linkage task: (1) the challenge of identifying the same patients based on data collected in different timelines, as disease progression makes record matching difficult, and (2) the challenge of identifying distinct health patterns, as common medical codes dominate health records and overshadow the more informative low-prevalence codes. To address these challenges, MedLink utilizes bi-directional health prediction to predict future codes forwardly and past codes backwardly, thus accounting for health progression. MedLink also has a prevalence-aware retrieval design to focus more on the low-prevalence but informative codes during learning. MedLink can be trained end-to-end and is lightweight for efficient inference on large patient databases. We evaluate MedLink against leading baselines on real-world patient datasets, including the critical care dataset MIMIC-III and a large health claims dataset. Results show that MedLink outperforms the best baseline by 4% in top-1 accuracy with only 8% memory cost. Additionally, when combined with existing identifier-based linkage approaches, MedLink can improve their performance by up to 15%.

A Sequence-to-Sequence Approach with Mixed Pointers to Topic Segmentation and Segment Labeling

Topic segmentation is the process of dividing a text into semantically coherent segments, and segment labeling involves assigning a topic label to each of these segments. Previous work on this task has included the use of sequence labeling, segment-extraction, and generative models. While these methods have yielded impressive results, existing generative models have struggled to accurately generate strings of segment boundaries, limiting their competitiveness in this area. In this paper, we present a novel Sequence-to-Sequence approach with Mixed Pointers (Seq2Seq-MP). Seq2Seq-MP employs an encoder-decoder architecture with the pointer mechanism to generate both segment boundaries and topics, which allows for a more robust performance than string-generation models and can handle long-range dependencies better than sequence labeling and segment-extraction models. Additionally, we introduce the pairwise type encoding and type-aware relative position encoding to improve the fusion of type and position information, enhancing the interactions between sentences and topics in the encoder and decoder. Our experiments on public datasets show that Seq2Seq-MP outperforms the current state-of-the-art, with up to 2.9% and 4.0% improvements in Pk and F1, respectively.

User-Regulation Deconfounded Conversational Recommender System with Bandit Feedback

Recent conversational recommender systems (CRSs) have achieved considerable success in addressing the cold-start problem. While they utilize conversational key-terms to efficiently elicit user preferences, most of them neglect that key-terms can also introduce biases. Systems learning key-term-level user preferences may make a biased item recommendation based on an overrated key-term instead of the item itself. As key-term conversation is a crucial part of CRSs, it is important to properly handle such bias resulting from the item-key-term relationship. While many debiasing methods have been proposed for traditional recommender systems, most of them focus on item- or item-group-level re-ranking or re-weighting strategies, such as calibration and propensity scores, which are not designed to model the relation between item-level and key-term-level user preferences. There is also no effective way for traditional debiasing methods to measure potentially useful biases through conversational key-terms to enhance recommendation performance.

In this paper, we develop a deconfounded CRS, which enables the user to provide both item and key-term feedback in each round, so that we can capture more accurate relations between key-term-level and item-level user preferences to alleviate the bias. To better model these relations and understand such bias, we view CRSs from a causal perspective and introduce a novel structural causal model (SCM) that identifies the confounding effect of key-term-level user preference. Inspired by our causal view, we devise an online backdoor adjustment approximation to alleviate the confounding effect when making item recommendations. Considering that not all biases are harmful, we utilize the useful bias and propose DecUCB, which leverages conversational key-term feedback to adaptively regulate the influence of the backdoor adjustment in a personalized fashion. Extensive experiments on real-world datasets demonstrate the advantages of our proposed method in both recommendation performance and bias mitigation.

Graph Contrastive Learning with Generative Adversarial Network

Graph Neural Networks (GNNs) have demonstrated promising results on exploiting node representations for many downstream tasks through supervised end-to-end training. To deal with the widespread label scarcity issue in real-world applications, Graph Contrastive Learning (GCL) is leveraged to train GNNs with limited or even no labels by maximizing the mutual information between nodes in augmented views generated from the original graph. However, most existing literature leaves the distribution of graphs unconsidered in view generation and thus ignores unseen edges, even though our experiments empirically show that accounting for them can improve GCL's performance. To this end, we propose to incorporate graph generative adversarial networks (GANs) to learn the distribution of views for GCL, in order to i) automatically capture the characteristics of graphs for augmentation, and ii) jointly train the graph GAN model and the GCL model. Specifically, we present GACN, a novel Generative Adversarial Contrastive learning Network for graph representation learning. GACN develops a view generator and a view discriminator to generate augmented views automatically in an adversarial style. Then, GACN leverages these views to train a GNN encoder with two carefully designed self-supervised learning losses: the graph contrastive loss and the Bayesian personalized ranking loss. Furthermore, we design an optimization framework to train all GACN modules jointly. Extensive experiments on seven real-world datasets show that GACN is able to generate high-quality augmented views for GCL and is superior to twelve state-of-the-art baseline methods. Notably, our proposed GACN surprisingly discovers that the generated views in data augmentation ultimately conform to the well-known preferential attachment rule in online networks.

A Causality Inspired Framework for Model Interpretation

This paper introduces a unified causal lens for understanding representative model interpretation methods. We show that their explanation scores align with the concept of average treatment effect in causal inference, which allows us to evaluate their relative strengths and limitations from a unified causal perspective. Based on our observations, we outline the major challenges in applying causal inference to model interpretation, including identifying common causes that can be generalized across instances and ensuring that explanations provide a complete causal explanation of model predictions. We then present CIMI, a Causality-Inspired Model Interpreter, which addresses these challenges. Our experiments show that CIMI provides more faithful and generalizable explanations with improved sampling efficiency, making it particularly suitable for larger pretrained models.

Imputation-based Time-Series Anomaly Detection with Conditional Weight-Incremental Diffusion Models

Existing anomaly detection models for time series are primarily trained with normal-point-dominant data and become ineffective when anomalous points occur intensively in certain episodes. To solve this problem, we propose a new approach, called DiffAD, from the perspective of time series imputation. Unlike previous prediction- and reconstruction-based methods that adopt either partial or complete data as observed values for estimation, DiffAD uses a density ratio-based strategy to flexibly select normal observations, which easily adapts to anomaly-concentration scenarios. To alleviate the model bias problem in the presence of anomaly concentration, we design a new denoising diffusion-based imputation method that enhances the imputation of missing values with conditional weight-incremental diffusion, which preserves the information of observed values and substantially improves data generation quality for stable anomaly detection. Besides, we customize a multi-scale state space model to capture the long-term dependencies across episodes with different anomaly patterns. Extensive experimental results on real-world datasets show that DiffAD outperforms state-of-the-art benchmarks.

Spatial Heterophily Aware Graph Neural Networks

Graph Neural Networks (GNNs) have been broadly applied in many urban applications by formulating a city as an urban graph whose nodes are urban objects such as regions or points of interest. Recently, a few enhanced GNN architectures have been developed to tackle heterophily graphs, where connected nodes are dissimilar. However, urban graphs can usually be observed to possess a unique spatial heterophily property; that is, the dissimilarity of neighbors at different spatial distances can exhibit great diversity. Although this property commonly exists, it has not been explored. To this end, in this paper, we propose a metric, named Spatial Diversity Score, to quantitatively measure spatial heterophily and show how it can influence the performance of GNNs. Indeed, our experimental investigation clearly shows that existing heterophilic GNNs are still deficient in handling urban graphs with high spatial diversity scores, which in turn may degrade their effectiveness in urban applications. Along this line, we propose a Spatial Heterophily Aware Graph Neural Network (SHGNN) to tackle the spatial diversity of heterophily in urban graphs. Based on the key observation that spatially close neighbors on the urban graph present a more similar mode of difference to the central node, we first design a rotation-scaling spatial aggregation module, whose core idea is to properly group the spatially close neighbors and separately process each group with less diversity inside. Then, a heterophily-sensitive spatial interaction module is designed to adaptively capture the commonality and diverse dissimilarity in different spatial groups. Extensive experiments on three real-world urban datasets demonstrate the superiority of SHGNN over its competitors.

Reconsidering Learning Objectives in Unbiased Recommendation: A Distribution Shift Perspective

This work studies the problem of learning unbiased algorithms from biased feedback for recommendation, which we address from a novel distribution shift perspective. Recent works in unbiased recommendation have advanced the state of the art with various techniques such as re-weighting, multi-task learning, and meta-learning. Despite their empirical successes, most of them lack theoretical guarantees, leaving non-negligible gaps between theory and recent algorithms. In this paper, we propose a theoretical understanding of why existing unbiased learning objectives work for unbiased recommendation. We establish a close connection between unbiased recommendation and distribution shift, which shows that existing unbiased learning objectives implicitly align the biased training and unbiased test distributions. Built upon this connection, we develop two generalization bounds for existing unbiased learning methods and analyze their learning behavior. Guided by the distribution shift perspective, we further propose a principled framework, Adversarial Self-Training (AST), for unbiased recommendation. Extensive experiments on real-world and semi-synthetic datasets demonstrate the effectiveness of AST.

A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning

Public cloud GPU clusters are becoming emerging platforms for training distributed deep learning jobs. Under this training paradigm, the job scheduler is a crucial component for improving user experience, i.e., reducing training fees and job completion time, which can also save power costs for service providers. However, the scheduling problem is known to be NP-hard. Most existing work divides it into two easier sub-tasks, i.e., an ordering task and a placement task, which are responsible for deciding the scheduling order of jobs and the placement order of GPU machines, respectively. Owing to their superior adaptation ability, learning-based policies generally perform better than traditional heuristic-based methods. Nevertheless, two main challenges remain unsolved. First, most learning-based methods focus on the ordering or placement policy independently, ignoring their cooperation. Second, unbalanced machine performance and resource contention impose huge overhead and uncertainty on job duration but are rarely considered in existing work. To tackle these issues, this paper presents a dual-agent scheduler framework, abstracted from the two sub-tasks, that jointly learns the ordering and placement policies and makes better-informed scheduling decisions. Specifically, we design an ordering agent with a scalable squeeze-and-communicate strategy for better cooperation; for the placement agent, we propose a novel Random Walk Gaussian Process to learn the performance similarities of GPU machines while being aware of uncertain performance fluctuations. Finally, the two agents are jointly optimized with multi-agent reinforcement learning. Extensive experiments conducted on a real-world production cluster trace demonstrate the superiority of our model.

Learning Behavior-oriented Knowledge Tracing

Exploring how learners' knowledge states evolve during learning activities is a critical task in online learning systems, as it can facilitate downstream personalized services such as course recommendation. Most existing methods have devoted great effort to analyzing learners' knowledge states according to their responses (i.e., right or wrong) to different questions. However, they omit the significant effect of learners' learning behaviors (e.g., answering speed, number of attempts), which reflect knowledge acquisition more deeply and indicate the reliability of a response. In this paper, we propose a Learning Behavior-oriented Knowledge Tracing (LBKT) model, with the goal of explicitly exploring the effects of learning behaviors on learners' knowledge states. Specifically, we first analyze and summarize several dominant learning behaviors in the learning process, including Speed, Attempts, and Hints. As the characteristics of different learning behaviors vary greatly, we separately estimate their effects on learners' knowledge acquisition in a quantitative manner. Then, considering that different learning behaviors are closely dependent on one another, we assess the fused effect of multiple learning behaviors by capturing their complex dependency patterns. Finally, we integrate a forgetting factor with learners' knowledge acquisition to comprehensively update their changing knowledge states during learning. Extensive experimental results on several public datasets demonstrate that our model predicts learner performance better than existing methods. Moreover, LBKT shows good interpretability in tracking learners' knowledge states by incorporating learning behavior effects. Our codes are available at https://github.com/xbh0720/LBKT.

How does the Memorization of Neural Networks Impact Adversarial Robust Models?

Recent studies suggest that "memorization" is one necessary factor for overparameterized deep neural networks (DNNs) to achieve optimal performance: perfectly fitted DNNs can memorize the labels of many atypical samples, generalize this memorization to correctly classify test atypical samples, and enjoy better test performance. Meanwhile, DNNs optimized via adversarial training algorithms can also achieve perfect training performance by memorizing the labels of atypical samples, as well as the adversarially perturbed atypical samples. However, adversarially trained models always suffer from poor generalization, with both relatively low clean accuracy and low robustness on the test set. In this work, we study the effect of memorization in adversarially trained DNNs and disclose two important findings: (a) memorizing atypical samples only improves the DNN's accuracy on clean atypical samples and hardly improves its adversarial robustness, and (b) memorizing certain atypical samples can even hurt the DNN's performance on typical samples. Based on these two findings, we propose Benign Adversarial Training (BAT), which steers adversarial training away from fitting "harmful" atypical samples and toward fitting as many "benign" atypical samples as possible. In our experiments, we validate the effectiveness of BAT and show that it achieves a better clean accuracy vs. robustness trade-off than baseline methods on benchmark datasets for image classification.

Grace: Graph Self-Distillation and Completion to Mitigate Degree-Related Biases

Due to the universality of graph data, node classification is of great importance in a wide range of real-world applications. Despite the successes of Graph Neural Networks (GNNs), GNN-based methods rely heavily on rich connections and perform poorly on low-degree nodes. Since many real-world graphs follow a long-tailed degree distribution, these methods suffer from a substantial performance bottleneck, as a significant fraction of nodes is of low degree. In this paper, we point out that under-represented self-representations and the low neighborhood homophily ratio of low-degree nodes are the two main culprits. Based on that, we propose a novel method, Grace, which improves node representations by self-distillation and increases the neighborhood homophily ratio of low-degree nodes by graph completion. To avoid error propagation from graph completion, label propagation is further leveraged. Experimental evidence shows that our method performs well on real-world graphs and is superior in balancing degree-related bias against overall performance on node classification tasks.
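The label propagation step Grace leverages is a standard graph smoothing scheme; a minimal generic sketch (not the paper's exact formulation; names and hyperparameters are illustrative), assuming a symmetric adjacency matrix:

```python
import numpy as np

def label_propagation(adj, labels, mask, alpha=0.9, iters=50):
    """Propagate known labels over a graph by iterative smoothing.

    adj:    (N, N) symmetric adjacency matrix.
    labels: (N, C) one-hot rows for labeled nodes, zeros elsewhere.
    mask:   boolean (N,), True where the node's label is known.
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nz = deg > 0
    d_inv_sqrt[nz] = deg[nz] ** -0.5
    s = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]  # symmetric normalization
    y = np.where(mask[:, None], labels, 0.0)
    f = y.copy()
    for _ in range(iters):
        f = alpha * (s @ f) + (1 - alpha) * y            # smooth, then re-anchor
    return f.argmax(axis=1)
```

Labels of the few annotated nodes diffuse along edges, so a low-degree node inherits the majority label of its (completed) neighborhood.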

Internal Logical Induction for Pixel-Symbolic Reinforcement Learning

Reinforcement Learning (RL) has experienced rapid advancements in recent years. Widely studied RL algorithms mainly focus on a single input form, such as pixel-based image input or symbolic vector input. These two forms have different characteristics and, in many scenarios, appear together, yet few RL algorithms have studied problems with mixed input types. Specifically, when both pixel and symbolic inputs are available, symbolic input usually offers abstract features with specific semantics, which helps the agent focus, whereas pixel input provides more comprehensive information, enabling the agent to make well-informed decisions. Tailoring the processing approach to the properties of these two input types can help solve the problem more effectively. To tackle the above issue, we propose an Internal Logical Induction (ILI) framework that integrates deep RL and rule learning into one system. ILI utilizes a deep RL algorithm to process the pixel input and a rule learning algorithm to induce propositional logic knowledge from the symbolic input. To efficiently combine these two mechanisms, we further adopt a reward shaping technique that treats valuable knowledge as intrinsic rewards for the RL procedure. Experimental results demonstrate that the ILI framework outperforms baseline approaches in RL problems with pixel-symbolic input, and its induced knowledge exhibits transferability advantages when the semantics of the pixel input change.

MimoSketch: A Framework to Mine Item Frequency on Multiple Nodes with Sketches

We abstract a MIMO scenario in distributed data stream mining, where a stream of multiple items is mined by multiple nodes. We design a framework named MimoSketch for the MIMO-specific scenario, which improves the fundamental mining task of item frequency estimation. MimoSketch consists of an algorithm design and a policy to schedule items to nodes. MimoSketch's algorithm applies random counting to preserve a mathematically proven unbiasedness property, which makes it friendly to the aggregate query on multiple nodes; its memory layout is dynamically adaptive to the runtime item size distribution, which maximizes the estimation accuracy by storing more items. MimoSketch's scheduling policy balances items among nodes, avoiding nodes being overloaded or underloaded, which improves the overall mining accuracy. Our prototype and evaluation show that our algorithm can improve the item frequency estimation accuracy by an order of magnitude compared with the state-of-the-art solutions, and the scheduling policy further promotes the performance in MIMO scenarios.

Kernel Ridge Regression-Based Graph Dataset Distillation

The huge volume of emerging graph datasets has become a double-edged sword for graph machine learning. On the one hand, it empowers the success of a myriad of graph neural networks (GNNs) with strong empirical performance. On the other hand, training modern graph neural networks on huge graph data is computationally expensive. How to distill a given graph dataset while retaining most of the trained models' performance is a challenging problem. Existing efforts approach this problem by solving meta-learning-based bilevel optimization objectives. A major hurdle is that the exact solutions of these methods are computationally intensive; thus, most, if not all, of them resort to approximate strategies, which in turn hurt distillation performance. In this paper, inspired by recent advances in neural network kernel methods, we adopt a kernel ridge regression-based meta-learning objective that has a feasible exact solution. However, computing the graph neural tangent kernel is very expensive, especially in the context of dataset distillation. In response, we design a graph kernel, named LiteGNTK, tailored for the dataset distillation problem and closely related to the classic random walk graph kernel. An effective model named Kernel rIdge regression-based graph Dataset Distillation (KIDD) and its variants are proposed. KIDD is efficient in both the forward and backward propagation processes. At the same time, KIDD shows strong empirical performance on 7 real-world datasets compared with state-of-the-art distillation methods. Thanks to its ability to find the exact solution of the distillation objective, the training graphs learned by KIDD can sometimes even outperform the original whole training set with as few as 1.65% of the training graphs.
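The appeal of the kernel ridge regression objective is that it admits a closed-form exact solution, avoiding the approximate bilevel optimization the paper criticizes. A minimal generic KRR sketch (kernel entries are assumed precomputed; the simple linear kernel below stands in for, but is not, LiteGNTK):

```python
import numpy as np

def krr_fit_predict(K_train, y_train, K_test, lam=1e-6):
    """Kernel ridge regression via its closed-form exact solution.

    K_train: (n, n) kernel matrix among training examples.
    K_test:  (m, n) kernel values between test and training examples.
    The ridge objective has the exact minimizer
        alpha = (K_train + lam * I)^{-1} y_train,
    so no inner optimization loop (and no approximation) is needed.
    """
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + lam * np.eye(n), y_train)
    return K_test @ alpha

X = np.array([[1.0], [2.0], [3.0]])
K = X @ X.T                                  # toy linear kernel
y = np.array([2.0, 4.0, 6.0])                # y = 2x
print(krr_fit_predict(K, y, np.array([[4.0]]) @ X.T))  # close to [8.]
```

In the distillation setting, the synthetic graphs enter through the kernel matrices, and this closed form makes the inner loop of the bilevel objective exact.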

Node Classification Beyond Homophily: Towards a General Solution

Graph neural networks (GNNs) have become core building blocks behind a myriad of graph learning tasks. The vast majority of existing GNNs are built upon, either implicitly or explicitly, the homophily assumption, which is not always true and can heavily degrade the performance of learning tasks. In response, GNNs tailored for heterophilic graphs have been developed. However, most existing works design specific GNN models to address heterophily, which lack generality. In this paper, we study the problem from the structure learning perspective and propose a family of general solutions named ALT. It can work hand in hand with most existing GNNs to handle graphs with either low or high homophily. At the core of our method is learning to (1) decompose a given graph into two components, (2) extract complementary graph signals from these two components, and (3) adaptively integrate the graph signals for node classification. Moreover, analysis based on graph signal processing shows that our framework can empower a broad range of existing GNNs with adaptive filter characteristics and further modulate the input graph signals, which is critical for handling complex homophilic/heterophilic patterns. The proposed ALT brings significant and consistent performance improvements in node classification for a wide range of GNNs over a variety of real-world datasets.

PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement

Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely considered a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL heavily relies on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate this problem, we propose a novel paradigm, recommender systems with human preferences, or preference-based recommender systems (PrefRec), which allows RL recommender systems to learn from preferences over users' historical behaviors rather than explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner, which is then used to generate learning signals for training the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression, and reward model pre-training to improve performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods in all the tasks.
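Training a reward function from pairwise preferences is commonly formulated with a Bradley-Terry-style logistic loss; a toy sketch under that assumption (PrefRec's actual reward model is neural and trained end-to-end with pre-training, so the linear form and all names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_reward(pairs, dim, lr=0.1, epochs=200):
    """Fit a linear reward model from pairwise preferences.

    pairs: list of (feats_a, feats_b), each a (T, dim) array of per-step
    features; the first trajectory of each pair is the preferred one.
    P(a preferred over b) is modeled as sigmoid(R_a - R_b), Bradley-Terry style.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for fa, fb in pairs:
            ra, rb = fa.sum(axis=0) @ w, fb.sum(axis=0) @ w   # trajectory returns
            p = sigmoid(ra - rb)                              # model's preference prob.
            # gradient of -log p w.r.t. w
            grad = (p - 1.0) * (fa.sum(axis=0) - fb.sum(axis=0))
            w -= lr * grad
    return w
```

After fitting, the learned reward assigns higher returns to preferred trajectories and can supply the learning signal for a recommendation policy, with no hand-designed reward.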

E-commerce Search via Content Collaborative Graph Neural Network

Recently, many E-commerce search models are based on Graph Neural Networks (GNNs). Despite their promising performances, they are (1) lacking proper semantic representation of product contents; (2) less efficient for industry-scale graphs; and (3) less accurate on long-tail queries and cold-start products. To address these problems simultaneously, this paper proposes CC-GNN, a novel Content Collaborative Graph Neural Network. Firstly, CC-GNN enables content phrases to participate explicitly in graph propagation to capture the proper meaning of phrases and semantic drifts. Secondly, CC-GNN presents several efforts towards a more scalable graph learning framework, including efficient graph construction, MetaPath-guided Message Passing, and Difficulty-aware Representation Perturbation for graph contrastive learning. Furthermore, CC-GNN adopts Counterfactual Data Supplement at both supervised and contrastive learning to resolve the long-tail/cold-start problems. Extensive experiments on a real E-commerce dataset of 100-million-scale nodes show that CC-GNN produces significant improvements over existing methods (i.e., more than 10% improvements in terms of several key evaluation metrics for overall, long-tail queries and cold-start products) while reducing computational complexity. The proposed components of CC-GNN can be applied to other models for search and recommendation tasks. Experiments on a public dataset show that applying the proposed components can improve the performance of different recommendation models.

CriticalFL: A Critical Learning Periods Augmented Client Selection Framework for Efficient Federated Learning

Federated learning (FL) is a distributed optimization paradigm that learns from data samples distributed across a number of clients. Adaptive client selection that is cognizant of clients' training progress has become a major trend for improving FL efficiency, but it is not yet well understood. Most existing FL methods, such as FedAvg and its state-of-the-art variants, implicitly assume that all learning phases during FL training are equally important. Unfortunately, this assumption has been shown to be invalid by recent findings on critical learning periods (CLP), in which small gradient errors may lead to an irrecoverable deficiency in final test accuracy. In this paper, we develop CriticalFL, a CLP-augmented FL framework, and show that when existing FL methods are adaptively augmented with CLP so that client selection is guided by the discovered CLP, performance improves significantly. Experiments based on various machine learning models and datasets validate that the proposed CriticalFL framework consistently achieves improved model accuracy while maintaining better communication efficiency compared to state-of-the-art methods, demonstrating a promising and easily adopted method for tackling the heterogeneity of FL training.

Empowering General-purpose User Representation with Full-life Cycle Behavior Modeling

User modeling plays an essential role in industry. In this field, task-agnostic approaches, which generate general-purpose representations applicable to diverse downstream user cognition tasks, are a promising direction, being more valuable and economical than task-specific representation learning. With the rapid development of Internet service platforms, user behaviors accumulate continuously. However, existing general-purpose user representation research has little ability to model the full life cycle of extremely long behavior sequences dating back to user registration. In this study, we propose a novel framework called the full-Life cycle User Representation Model (LURM) to tackle this challenge. Specifically, LURM consists of two cascaded sub-models: (i) Bag-of-Interests (BoI), which encodes user behaviors in any time period into a sparse vector of super-high dimension (e.g., 10^5); and (ii) Self-supervised Multi-anchor Encoder Network (SMEN), which maps sequences of BoI features to multiple low-dimensional user representations. Notably, SMEN achieves almost lossless dimensionality reduction, benefiting from a novel multi-anchor module that can learn different aspects of user interests. Experiments on several benchmark datasets show that our approach outperforms state-of-the-art general-purpose representation methods.

Fragility Index: A New Approach for Binary Classification

In binary classification problems, many performance metrics evaluate the probability that some error exceeds a threshold. Nevertheless, they focus on the probability alone and fail to capture the magnitude of the error, i.e., how far the error exceeds the threshold. Capturing the magnitude of error is desirable in many applications. For example, in detecting disease and predicting credit default, the magnitude of error reflects the confidence with which a wrong prediction is made. We propose a novel metric, the Fragility Index (FI), to evaluate the performance of binary classifiers by capturing the magnitude of the error. FI alleviates the risk of misclassification by penalizing large errors heavily, which is seldom considered by standard metrics. Moreover, to strengthen generalization ability and handle unseen samples, we adopt the framework of distributionally robust optimization and robust satisficing, which allows us to derive and control the maximum degree of fragility of the classifier when the distribution of samples shifts. We show that FI can be easily calculated and optimized for common probabilistic distance measures. Experiments with real datasets demonstrate the new insights brought by FI and the advantages of classifiers selected under FI, which consistently improve robustness and reduce the risk of large errors compared to classifiers selected by alternative metrics.
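The contrast the abstract draws, between the probability that an error exceeds a threshold and the magnitude by which it does, can be illustrated with a toy comparison; the exact FI formula is not given here, and these helper names are hypothetical:

```python
import numpy as np

def exceed_prob(errors, tau):
    """Fraction of samples whose error exceeds the threshold tau."""
    return float(np.mean(errors > tau))

def exceed_magnitude(errors, tau):
    """Average amount by which errors exceed tau (zero when none do)."""
    return float(np.maximum(errors - tau, 0.0).mean())

tau = 0.5
err_a = np.array([0.0, 0.0, 0.6, 0.6])  # mild threshold violations
err_b = np.array([0.0, 0.0, 3.0, 3.0])  # severe threshold violations
# Both classifiers exceed the threshold with the same probability (0.5) ...
print(exceed_prob(err_a, tau), exceed_prob(err_b, tau))
# ... but differ sharply in the magnitude of the exceedance.
print(exceed_magnitude(err_a, tau), exceed_magnitude(err_b, tau))
```

A probability-only metric cannot distinguish these two classifiers, whereas a magnitude-aware metric in the spirit of FI penalizes the second far more.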

IDToolkit: A Toolkit for Benchmarking and Developing Inverse Design Algorithms in Nanophotonics

Aiding humans with scientific designs is one of the most exciting applications of artificial intelligence (AI) and machine learning (ML), due to their potential for the discovery of new drugs, the design of new materials and chemical compounds, etc. However, scientific design typically requires complex domain knowledge that is not familiar to AI researchers. Further, scientific studies involve professional skills for performing experiments and evaluations. These obstacles prevent AI researchers from developing specialized methods for scientific design. To take a step towards easy-to-understand and reproducible research on scientific design, we propose a benchmark for the inverse design of nanophotonic devices, which can be verified computationally and accurately. Specifically, we implement three different nanophotonic design problems, namely a radiative cooler, a selective emitter for thermophotovoltaics, and structural color filters, which differ in design parameter spaces, complexity, and design targets. The benchmark environments are implemented with an open-source simulator. We further implement 10 different inverse design algorithms and compare them in a reproducible and fair framework. The results reveal the strengths and weaknesses of existing methods, shedding light on several future directions for developing more efficient inverse design algorithms. Our benchmark can also serve as a starting point for more challenging scientific design problems. The code of IDToolkit is available at https://github.com/ThyrixYang/IDToolkit.

MAPLE: Semi-Supervised Learning with Multi-Alignment and Pseudo-Learning

Data augmentation has undoubtedly enabled a significant leap forward in training high-accuracy deep networks. Besides the commonly used augmentations of target data, e.g., random cropping, flipping, and rotation, recent works have been dedicated to mining generalized knowledge from multiple sources. However, along with plentiful data comes a huge distribution gap between the target and the different sources (hybrid shift). To mitigate this problem, existing methods tend to manually annotate more data. Unlike previous methods, this paper focuses on learning deep models by gathering knowledge from multiple sources in a labor-free fashion and proposes the "Multi-Alignment and Pseudo-Learning" method, dubbed MAPLE. MAPLE constructs a multi-alignment module, which consists of multiple discriminators that align different data distributions via an adversarial process. In addition, a novel semi-supervised learning (SSL) scheme is introduced to further facilitate the utility of MAPLE. Extensive evaluations conducted on four benchmarks show the effectiveness of the proposed MAPLE, which achieves state-of-the-art performance, outperforming existing methods by a clear margin.

EXTRACT and REFINE: Finding a Support Subgraph Set for Graph Representation

Subgraph learning has received considerable attention for its capacity to interpret important structural information for predictions. Existing subgraph learning usually exploits statistics on predefined structures, e.g., node degrees or occurrence frequency, to extract subgraphs, or refines the contents by capturing only label-relevant information with node-level sampling. Given the diversity of subgraph patterns, and their mutual independence alongside local correlations on graphs, current solutions still have two limitations, one in the extraction stage and one in the refinement stage: 1) universal principles for extracting substructure patterns across domains are still lacking; 2) node-level sampling in refinement distorts the original local topology, and the lack of explicit guidance for eliminating redundant information leads to inefficiency. In this paper, we propose a unified subgraph learning scheme, Poly-Pivot Graph Neural Network (P2GNN), where we designate the centric node of each subgraph as the pivot. In the extraction stage, we present a general subgraph extraction principle, i.e., local asymmetry between the centric and affiliated nodes. To this end, we asymmetrically model the similarity between each pair of nodes with random walks and quantify mutual affiliations in an Affinity Propagation architecture to extract subgraph structures. In the refinement stage, we devise a subgraph-level exclusion regularization to squash target-independent information by considering mutual relations across subgraphs, cooperatively preserving a support set of subgraphs and facilitating the refinement process for graph representation. Empirical experiments on diverse web and biological graphs reveal 1.1%-7.3% improvements over the best baselines, and visualized case studies demonstrate the universality and interpretability of our P2GNN.

κHGCN: Tree-likeness Modeling via Continuous and Discrete Curvature Learning

Tree-like structures, encompassing hierarchical structures and power-law distributions, exist extensively in real-world applications, including recommendation systems, ecosystems, financial networks, social networks, etc. Recently, the exploitation of hyperbolic space for tree-likeness modeling has garnered considerable attention owing to its exponentially growing volume. Compared to flat Euclidean space, curved hyperbolic space provides a more amenable embedding space, especially for datasets exhibiting implicit tree-like architectures. However, the intricate nature of real-world tree-like data presents a considerable challenge, as it frequently displays a heterogeneous composition of tree-like, flat, and circular regions. Directly embedding such heterogeneous structures into a homogeneous embedding space (i.e., hyperbolic space) inevitably leads to heavy distortion. To mitigate this shortcoming, this study explores the curvature between the discrete structure and the continuous learning space, aiming to encode the message conveyed by the network topology in the learning process and thereby improve tree-likeness modeling. To this end, we propose a curvature-aware hyperbolic graph convolutional neural network, κHGCN, which utilizes curvature to guide message passing and improve long-range propagation. Extensive experiments on node classification and link prediction tasks verify the superiority of our proposal, as it consistently outperforms various competitive models by a large margin.

Specify Robust Causal Representation from Mixed Observations

Learning representations purely from observations concerns the problem of learning a low-dimensional, compact representation that is beneficial to prediction models. Under the hypothesis that the intrinsic latent factors follow some causal generative model, we argue that by learning a causal representation, i.e., the minimal sufficient causes of the whole system, we can improve the robustness and generalization performance of machine learning models. In this paper, we develop a method to learn such a representation from observational data by regularizing the learning procedure with mutual information measures, according to the hypothetical factored causal graph. We theoretically and empirically show that models trained with the learned causal representations are more robust under adversarial attacks and distribution shifts compared with baselines.

Counterfactual Learning on Heterogeneous Graphs with Greedy Perturbation

Due to the growing importance of using graph neural networks in high-stakes applications, there is a pressing need to interpret the predicted results of these models. Existing methods for explanation have mainly focused on generating sub-graphs comprising important edges for a specific prediction. However, these methods face two issues. Firstly, they lack counterfactual validity as removing the subgraph may not affect the prediction, and generating plausible counterfactual examples has not been adequately explored. Secondly, they cannot be extended to heterogeneous graphs as the complex information involved in such graphs increases the difficulty of generating interpretations. This paper proposes a novel counterfactual learning method, named CF-HGExplainer, for heterogeneous graphs. The method incorporates a semantic-aware attentive pooling strategy for the heterogeneous graph classifier and designs a heterogeneous decision boundaries extraction module to find the common logic for similar graphs based on the extracted graph embeddings from the classifier. Additionally, we propose to greedily perturb nodes and edges based on the distribution of node features and edge plausibility to train a neural network for heterogeneous edge weight learning. Extensive experiments on two public academic datasets demonstrate the effectiveness of CF-HGExplainer compared to state-of-the-art methods on the graph classification task and graph interpretation task.

LightPath: Lightweight and Scalable Path Representation Learning

Movement paths are used widely in intelligent transportation and smart city applications. To serve such applications, path representation learning aims to provide compact representations of paths that enable efficient and accurate operations when used for different downstream tasks such as path ranking and travel cost estimation. In many cases, it is attractive that the path representation learning is lightweight and scalable; in resource-limited environments and under green computing limitations, it is essential. Yet, existing path representation learning studies focus on accuracy and pay at most secondary attention to resource consumption and scalability. We propose a lightweight and scalable path representation learning framework, termed LightPath, that aims to reduce resource consumption and achieve scalability without affecting accuracy, thus enabling broader applicability. More specifically, we first propose a sparse auto-encoder that ensures that the framework achieves good scalability with respect to path length. Next, we propose a relational reasoning framework to enable faster training of more robust sparse path encoders. We also propose global-local knowledge distillation to further reduce the size and improve the performance of sparse path encoders. Finally, we report extensive experiments on two real-world datasets to offer insight into the efficiency, scalability, and effectiveness of the proposed framework.

Test Accuracy vs. Generalization Gap: Model Selection in NLP without Accessing Training or Testing Data

Selecting suitable architecture parameters and training hyperparameters is essential for enhancing machine learning (ML) model performance. Several recent empirical studies conduct large-scale correlational analysis on neural networks (NNs) to search for effective generalization metrics that can guide this type of model selection. Effective metrics are typically expected to correlate strongly with test performance. In this paper, we expand on prior analyses by examining generalization-metric-based model selection with the following objectives: (i) focusing on natural language processing (NLP) tasks, as prior work primarily concentrates on computer vision (CV) tasks; (ii) considering metrics that directly predict test error instead of the generalization gap; (iii) exploring metrics that do not need access to data to compute. From these objectives, we are able to provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics. Our analyses consider (I) hundreds of Transformers trained in different settings, in which we systematically vary the amount of data, the model size, and the optimization hyperparameters, (II) a total of 51 pretrained Transformers from eight families of Huggingface NLP models, including GPT2, BERT, etc., and (III) a total of 28 existing and novel generalization metrics. We find that, despite their niche status, metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks, exhibiting stronger correlations than other, more popular metrics. To further examine these metrics, we extend prior formulations relying on power law (PL) spectral distributions to exponential (EXP) and exponentially-truncated power law (E-TPL) families.
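To illustrate the heavy-tail perspective mentioned above, the spectral tail of a trained layer's weight matrix can be summarized by a power-law-type exponent. The sketch below applies a generic Hill-type tail estimator to the eigenvalues of W^T W; it is a toy illustration of the general idea under stated assumptions, not the exact metrics evaluated in the paper, and all names and parameter choices are illustrative.

```python
import numpy as np

def hill_alpha(eigenvalues, k=50):
    """Estimate a heavy-tail exponent from the largest k eigenvalues of an
    empirical spectral density, using a simple Hill-type estimator."""
    lam = np.sort(np.asarray(eigenvalues))[::-1][:k]  # top-k eigenvalues, descending
    # Hill estimator: alpha = 1 + k / sum(log(lam_i / lam_k))
    return 1.0 + k / np.sum(np.log(lam / lam[-1]))

# Example on a random (untrained) weight matrix; trained layers typically
# exhibit heavier tails, i.e., smaller alpha.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))
eigs = np.linalg.eigvalsh(W.T @ W)  # eigenvalues of the correlation matrix
alpha = hill_alpha(eigs, k=50)
```

In heavy-tail analyses of this kind, the fitted exponent serves as a data-free generalization metric: it is computed from the weights alone, without access to training or test data.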

MSSRNet: Manipulating Sequential Style Representation for Unsupervised Text Style Transfer

The unsupervised text style transfer task aims to rewrite a text into a target style while preserving its main content. Traditional methods rely on a fixed-size vector to regulate text style, which makes it difficult to accurately convey the style strength of each individual token. In fact, each token of a text carries a different style intensity and makes a different contribution to the overall style. Our proposed method addresses this issue by assigning an individual style vector to each token in a text, allowing for fine-grained control and manipulation of the style strength. Additionally, an adversarial training framework integrated with teacher-student learning is introduced to enhance training stability and reduce the complexity of high-dimensional optimization. Our experimental results demonstrate the efficacy of our method in terms of clearly improved style transfer accuracy and content preservation in both two-style and multi-style transfer settings.

DCdetector: Dual Attention Contrastive Representation Learning for Time Series Anomaly Detection

Time series anomaly detection is critical for a wide range of applications. It aims to identify deviant samples from the normal sample distribution in time series. The most fundamental challenge for this task is to learn a representation map that enables effective discrimination of anomalies. Reconstruction-based methods still dominate, but learning representations in the presence of anomalies may hurt performance due to large abnormal losses. On the other hand, contrastive learning aims to find a representation that can clearly distinguish any instance from the others, which can bring a more natural and promising representation for time series anomaly detection. In this paper, we propose DCdetector, a multi-scale dual attention contrastive representation learning model. DCdetector utilizes a novel dual attention asymmetric design to create the permutated environment and pure contrastive loss to guide the learning process, thus learning a permutation-invariant representation with superior discrimination abilities. Extensive experiments show that DCdetector achieves state-of-the-art results on multiple time series anomaly detection benchmark datasets. Code is publicly available at https://github.com/DAMO-DI-ML/KDD2023-DCdetector.

Knowledge Graph Self-Supervised Rationalization for Recommendation

In this paper, we introduce a new self-supervised rationalization method, called KGRec, for knowledge-aware recommender systems. To effectively identify informative knowledge connections, we propose an attentive knowledge rationalization mechanism that generates rational scores for knowledge triplets. With these scores, KGRec integrates generative and contrastive self-supervised tasks for recommendation through rational masking. To highlight rationales in the knowledge graph, we design a novel generative task in the form of masking-reconstructing. By masking important knowledge with high rational scores, KGRec is trained to rebuild and highlight useful knowledge connections that serve as rationales. To further rationalize the effect of collaborative interactions on knowledge graph learning, we introduce a contrastive learning task that aligns signals from knowledge and user-item interaction views. To ensure noise-resistant contrasting, potential noisy edges in both graphs judged by the rational scores are masked. Extensive experiments on three real-world datasets demonstrate that KGRec outperforms state-of-the-art methods. We also provide the implementation codes for our approach at https://github.com/HKUDS/KGRec.

BatchSampler: Sampling Mini-Batches for Contrastive Learning in Vision, Language, and Graphs

In-batch contrastive learning is a state-of-the-art self-supervised method that brings semantically similar instances close while pushing dissimilar instances apart within a mini-batch. The key to its success is the negative sharing strategy, in which every instance serves as a negative for the others within the mini-batch. Recent studies aim to improve performance by sampling hard negatives within the current mini-batch, whose quality is bounded by the mini-batch itself. In this work, we propose to improve contrastive learning by sampling mini-batches from the input data. We present BatchSampler to sample mini-batches of hard-to-distinguish instances (i.e., instances that are hard and true negatives of each other). To make each mini-batch contain fewer false negatives, we design a proximity graph of randomly selected instances. To form the mini-batch, we leverage random walk with restart on the proximity graph to sample hard-to-distinguish instances. BatchSampler is a simple and general technique that can be directly plugged into existing contrastive learning models in vision, language, and graphs. Extensive experiments on datasets of three modalities show that BatchSampler consistently improves the performance of powerful contrastive models, as shown by significant improvements of SimCLR on ImageNet-100, SimCSE on STS (language), and GraphCL and MVGRL on graph datasets.
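The mini-batch formation step described above can be sketched as a random walk with restart over the proximity graph. The snippet below is a minimal illustration under stated assumptions (the graph is an adjacency dict; the restart probability and de-duplication policy are illustrative choices), not the paper's exact implementation.

```python
import random

def rwr_sample(neighbors, start, batch_size, restart_p=0.2, seed=0):
    """Sample a mini-batch of instance indices via random walk with restart
    on a proximity graph given as {node: [neighbor, ...]}."""
    rng = random.Random(seed)
    batch, current = [start], start
    while len(batch) < batch_size:
        if rng.random() < restart_p or not neighbors[current]:
            current = start  # restart at the anchor instance
        else:
            current = rng.choice(neighbors[current])
        if current not in batch:
            batch.append(current)  # keep only distinct instances
    return batch

# Toy proximity graph over 4 instances; edges connect nearby instances,
# so the walk tends to collect hard-to-distinguish neighbors of the anchor.
proximity = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
batch = rwr_sample(proximity, start=0, batch_size=3)
```

The restart keeps the walk anchored near the starting instance, so sampled batch members stay in the anchor's neighborhood of the proximity graph; note the loop assumes the anchor's connected component has at least `batch_size` nodes.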

Improving the Expressiveness of K-hop Message-Passing GNNs by Injecting Contextualized Substructure Information

Graph neural networks (GNNs) have become the de facto standard for representation learning on graphs and have achieved state-of-the-art performance in many graph-related tasks; however, it has been shown that the expressive power of standard GNNs is at most equivalent to the 1-dimensional Weisfeiler-Lehman (1-WL) test. Recently, a line of work has aimed to enhance the expressive power of graph neural networks. One such line develops K-hop message-passing GNNs, where a node's representation is updated by aggregating information not only from its direct neighbors but from all neighbors within K hops. Another line leverages subgraph information to enhance expressive power and is proven to be strictly more powerful than the 1-WL test. In this work, we discuss the limitations of K-hop message-passing GNNs and propose a substructure encoding function to uplift the expressive power of any K-hop message-passing GNN. We further inject contextualized substructure information to enhance the expressiveness of K-hop message-passing GNNs. Our method is provably more powerful than previous works on K-hop graph neural networks and 1-WL subgraph GNNs, a specific type of subgraph-based GNN model, and is no less powerful than 3-WL. Empirically, our proposed method sets new state-of-the-art performance or achieves comparable performance on a variety of datasets. Our code is available at https://github.com/tianyao-aka/Expresive_K_hop_GNNs.

Web-based Long-term Spine Treatment Outcome Forecasting

The aging of the global population is accompanied by an increasing prevalence of spinal disorders. According to the latest statistics, nearly five percent of the global population suffers from spinal disorders. To relieve the pain, many spine patients choose surgery. However, recent evidence reveals that some spine patients can self-heal over time with nonoperative treatment, while surgery may not ease the pain for others, which raises a critical question regarding the appropriateness of such surgeries. Furthermore, the complex and time-consuming diagnostic process places a great burden on both clinicians and patients. Thanks to the development of web technology, it is now possible for spine patients to obtain decision-making suggestions on the Internet. The uniqueness of web technology, including its popularity, convenience, and immediacy, makes intelligent healthcare techniques, especially Treatment Outcome Forecasting (TOF), able to support clinical decision-making for doctors and healthcare providers. Although a few machine-learning-based methods have been proposed for TOF, their performance and feasibility are mostly unsatisfactory due to the neglect of several practical challenges arising from deployment on the Internet, including biased data selection, noisy supervision, and patient noncompliance. In light of this, we propose DeepTOF, a novel end-to-end deep learning model that copes with the unique challenges of web-based long-term continuous spine TOF. In particular, we combine different patient groups and train a unified predictive model to eliminate data selection bias. Toward robust learning, we further take advantage of indirect but fine-grained supervision signals to mutually calibrate with the noisy training labels. Additionally, a feature selector is co-trained with DeepTOF to select the most important features (i.e., answers/indicators that need to be collected) for inference, thus easing the use of DeepTOF in web-based real-world applications. The proposed DeepTOF could bring great benefits to the rehabilitation of spine patients. Comprehensive experiments and analysis show that DeepTOF outperforms conventional solutions by a large margin.

PAT: Geometry-Aware Hard-Label Black-Box Adversarial Attacks on Text

Despite a plethora of prior explorations, conducting text adversarial attacks in practical settings is still challenging under the following constraints: black box -- the inner structure of the victim model is unknown; hard label -- the attacker only has access to the top-1 prediction results; and semantic preservation -- the perturbation needs to preserve the original semantics. In this paper, we present PAT, a novel adversarial attack method that operates under all these constraints. Specifically, PAT explicitly models the adversarial and non-adversarial prototypes and incorporates them to measure semantic changes for replacement selection in the hard-label black-box setting to generate high-quality samples. In each iteration, PAT finds original words that can be replaced back and selects better candidate words for perturbed positions in a geometry-aware manner guided by this estimation, which maximally improves the perturbation construction and minimally impacts the original semantics. Extensive evaluation with benchmark datasets and state-of-the-art models shows that PAT outperforms existing text adversarial attacks in terms of both attack effectiveness and semantic preservation. Moreover, we validate the efficacy of PAT against industry-leading natural language processing platforms in real-world settings.

VQNE: Variational Quantum Network Embedding with Application to Network Alignment

Learning network embeddings with vector-based node representations has attracted wide attention over the past decade. It differs from the general setting of graph node embedding, whereby node attributes are also considered and yet may incur privacy issues. In this paper, we depart from the classic CPU/GPU architecture to consider the well-established network alignment problem based on network embedding, and develop a quantum machine learning approach with a low qubit cost for near-future applicability on Noisy Intermediate-Scale Quantum (NISQ) devices. Specifically, our model adopts the discrete-time quantum walk (QW) and conducts the QW on a tailored merged network to extract structural information from the two networks being aligned, without the need for quantum state preparation, which would otherwise require a high quantum gate cost. The quantum states from the QW are then fed to a quantum embedding ansatz (i.e., a parameterized circuit) to learn the latent representation of each node. The key part of our approach is connecting these two quantum modules to achieve a pure quantum paradigm without involving classical modules. To the best of our knowledge, there has not been any classical-quantum hybrid approach to network embedding, let alone a pure quantum paradigm free from the bottleneck of communication between classical and quantum devices, which is still an open problem. Experimental results on two real-world datasets show the effectiveness of our quantum embedding approach in comparison with classical embedding approaches. Our model is readily and efficiently implemented in Python with a full-amplitude simulation of the QW and the quantum circuit. Therefore, our model can be readily deployed on an existing NISQ device with all the circuits provided, and only 13 qubits are needed in the experiments, which is rarely attained in existing quantum graph learning works.

Optimal Dynamic Subset Sampling: Theory and Applications

We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set of n distinct events S = {x1, ..., xn}, in which each event xi has an associated probability p(xi). The subset sampling problem aims to sample a subset T ⊆ S such that every xi is independently included in T with probability p(xi). A naive solution is to flip a coin for each event, which takes O(n) time. However, an ideal solution is a data structure that allows drawing a subset sample in time proportional to the expected output size μ = ∑_{i=1}^{n} p(xi), which can be significantly smaller than n in many applications. The subset sampling problem serves as an important building block in many tasks and has been the subject of extensive research for more than a decade.

However, the majority of existing subset sampling methods are designed for a static setting, where the events in S and their associated probabilities remain unchanged over time. In a dynamic setting, these algorithms incur either large query time or large update time, despite the ubiquity of time-evolving events with varying probabilities in real life. Therefore, designing efficient dynamic subset sampling algorithms is a pressing need but remains an open problem.

In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: Influence Maximization. We empirically show that our ODSS can improve the complexities of existing Influence Maximization algorithms on large real-world evolving social networks.
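For reference, the naive O(n) baseline described in the problem statement can be written in a few lines; the expected output size μ (the sum of the probabilities) is the quantity that optimal methods aim to match in expected query time. This sketch is only the coin-flipping baseline, not ODSS itself.

```python
import random

def naive_subset_sample(probs, seed=None):
    """Naive O(n) subset sampling: flip one independent coin per event.
    Event i is included with probability probs[i]; the expected output
    size is mu = sum(probs), which can be far smaller than n."""
    rng = random.Random(seed)
    return [i for i, p in enumerate(probs) if rng.random() < p]

probs = [0.9, 0.1, 0.5, 0.0, 1.0]
T = naive_subset_sample(probs, seed=42)
mu = sum(probs)  # expected |T| = 2.5, vs. n = 5 coin flips per query
```

The gap between the O(n) per-query cost above and the O(μ) ideal is exactly what makes output-sensitive data structures attractive when most probabilities are small.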

Less is More: SlimG for Accurate, Robust, and Interpretable Graph Mining

How can we solve semi-supervised node classification in various graphs possibly with noisy features and structures? Graph neural networks (GNNs) have succeeded in many graph mining tasks, but their generalizability to various graph scenarios is limited due to the difficulty of training, hyperparameter tuning, and the selection of a model itself. Einstein said that we should "make everything as simple as possible, but not simpler." We rephrase it into the careful simplicity principle: a carefully-designed simple model can surpass sophisticated ones in real-world graphs. Based on the principle, we propose SlimG for semi-supervised node classification, which exhibits four desirable properties: It is (a) accurate, winning or tying on 10 out of 13 real-world datasets; (b) robust, being the only one that handles all scenarios of graph data (homophily, heterophily, random structure, noisy features, etc.); (c) fast and scalable, showing up to 18 times faster training in million-scale graphs; and (d) interpretable, thanks to the linearity and sparsity. We explain the success of SlimG through a systematic study of the designs of existing GNNs, sanity checks, and comprehensive ablation studies.

FLAMES2Graph: An Interpretable Federated Multivariate Time Series Classification Framework

Increasing privacy concerns have led to decentralized and federated machine learning techniques that allow individual clients to consult and train models collaboratively without sharing private information. Some of these applications, such as medical and healthcare ones, require the final decisions to be interpretable. One common form of data in these applications is multivariate time series, where deep neural networks, especially convolutional neural network-based approaches, have established excellent performance in classification tasks. However, despite their promising results and performance, deep learning models are black boxes, and their decisions cannot always be guaranteed and trusted. While several approaches address the interpretability of deep learning models for multivariate time series data in a centralized environment, less effort has been made in a federated setting. In this work, we introduce FLAMES2Graph, a new horizontal federated learning framework designed to interpret the deep learning decisions of each client. FLAMES2Graph extracts and visualizes those input subsequences that are highly activated by a convolutional neural network. Besides, an evolution graph is created to capture the temporal dependencies between the extracted distinct subsequences. The federated learning clients only share this temporal evolution graph with the centralized server, instead of trained model weights, to create a global evolution graph. Our extensive experiments on various datasets from well-known multivariate benchmarks indicate that the FLAMES2Graph framework significantly outperforms other state-of-the-art federated methods while keeping privacy and augmenting network decision interpretation.

Cognitive Evolutionary Search to Select Feature Interactions for Click-Through Rate Prediction

Click-Through Rate (CTR) prediction in intelligent marketing systems is of great importance, and feature interaction selection plays a key role in it. Most approaches model interactions of features by the same pre-defined operation under expert guidance, among which improper interactions may bring unnecessary noise and complicate the training process. To that end, in this paper, we aim to adaptively evolve the model to select proper operations to apply to feature pairs under task guidance. Inspired by natural evolution, we propose a general Cognitive EvoLutionary Search (CELS) framework, where cognitive ability refers to the malleability of organisms in orienting themselves to their environment. Specifically, we conceptualize interactions as genomes, models as organisms, and tasks as natural environments. Mirroring how genetic malleability develops environmental adaptability, we diagnose the fitness of models to simulate the survival rates of organisms under natural selection, so that an evolution path can be planned and visualized, offering an intuitive interpretation of the mechanisms underlying interaction modeling and selection. Based on the CELS framework, we develop four instantiations, including individual-based search and population-based search. We demonstrate how individual mutation and population crossover enable CELS to evolve into diverse models suitable for various tasks and data, providing ready-to-use models. Extensive experiments on real-world datasets demonstrate that CELS significantly outperforms state-of-the-art approaches.

Towards Variance Reduction for Reinforcement Learning of Industrial Decision-making Tasks: A Bi-Critic based Demand-Constraint Decoupling Approach

Learning to plan and schedule is receiving increasing attention due to its efficiency in problem-solving and its potential to outperform heuristics. In particular, actor-critic-based reinforcement learning (RL) has been widely adopted for uncertain environments. Yet one outstanding challenge in applying RL to real-world industrial decision-making problems is the high variance during training. Existing efforts design novel value functions to alleviate the issue but still suffer from it. In this paper, we address this issue from the perspective of adjusting the actor-critic paradigm. We start from an observation ignored in many industrial problems: the environmental dynamics for an agent consist of two parts that are physically independent of each other, the exogenous task demand over time and the hard constraints on actions. We theoretically show that decoupling these two effects in the actor-critic technique reduces variance. Accordingly, we propose to decouple and model them separately in the state transition of the Markov decision process (MDP). In the demand-encoding process, the temporal task demand, e.g., passenger arrivals in elevator scheduling, is encoded and then scored by a critic. In the constraint-encoding process, an actor-critic module is adopted for action embedding, and the two critics are then used to compute a revised advantage function. Experimental results show that our method can adaptively handle different dynamic planning and scheduling tasks and outperforms recent learning-based models and traditional heuristic algorithms.

Spatio-temporal Diffusion Point Processes

A spatio-temporal point process (STPP) is a stochastic collection of events accompanied by time and location. Due to computational complexity, existing solutions for STPPs compromise by assuming conditional independence between time and space, considering the temporal and spatial distributions separately. The failure to model the joint distribution leads to limited capacity for characterizing the entangled spatio-temporal interactions given past events. In this work, we propose a novel parameterization framework for STPPs, which leverages diffusion models to learn complex spatio-temporal joint distributions. We decompose the learning of the target joint distribution into multiple steps, where each step can be faithfully described by a Gaussian distribution. To enhance the learning of each step, an elaborate spatio-temporal co-attention module is proposed to adaptively capture the interdependence between event time and space. For the first time, we break the restrictions on spatio-temporal dependencies in existing solutions and enable a flexible and accurate modeling paradigm for STPPs. Extensive experiments in a wide range of fields, such as epidemiology, seismology, crime, and urban mobility, demonstrate that our framework remarkably outperforms state-of-the-art baselines. Further in-depth analyses validate its ability to capture spatio-temporal interactions, adapting to different scenarios. The datasets and source code are available online: https://github.com/tsinghua-fib-lab/Spatio-temporal-Diffusion-Point-Processes.

Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term

The generalization of Deep Neural Networks (DNNs) is known to be closely related to the flatness of minima, leading to the development of Sharpness-Aware Minimization (SAM) for seeking flatter minima and better generalization. In this paper, we revisit the loss of SAM and propose a more general method, called WSAM, by incorporating sharpness as a regularization term. We prove its generalization bound through a combination of PAC and Bayes-PAC techniques, and evaluate its performance on various public datasets. The results demonstrate that WSAM achieves improved generalization, or is at least highly competitive, compared to the vanilla optimizer, SAM, and its variants. The code is available at https://github.com/intelligent-machine-learning/dlrover/tree/master/atorch/atorch/optimizers.

Doubly Robust AUC Optimization against Noisy and Adversarial Samples

The area under the ROC curve (AUC) is an important and widely used metric in machine learning, especially for imbalanced datasets. In current practical learning problems, not only adversarial samples but also noisy samples seriously threaten the performance of learning models. Many research works have been proposed to defend against adversarial samples and noisy samples separately. Unfortunately, to the best of our knowledge, no AUC optimization method can defend against the two kinds of harmful samples simultaneously. To fill this gap and address this challenge, in this paper we propose a novel doubly robust AUC optimization (DRAUC) algorithm. Specifically, we first exploit the deep integration of self-paced learning and adversarial training under the framework of AUC optimization, and provide a statistical upper bound on the AUC adversarial risk. Inspired by this statistical upper bound, we propose our optimization objective, followed by an efficient alternating stochastic descent algorithm, which can effectively improve the performance of learning models by guarding against both adversarial and noisy samples. Experimental results on several standard datasets demonstrate that our DRAUC algorithm has better noise robustness and adversarial robustness than state-of-the-art algorithms.
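As background for the objective being optimized, AUC admits a pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample is scored higher, with ties counting half. The snippet below is a minimal reference computation of this quantity, not the DRAUC training algorithm.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    with ties counted as half-correct. AUC optimization methods maximize
    differentiable surrogates of this pairwise objective."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0
                  for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

# 3 of the 4 positive-negative pairs are ranked correctly -> AUC = 0.75
auc = pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

The pairwise form makes clear why AUC suits imbalanced data: it depends only on the relative ranking of positives over negatives, not on the class proportions.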

Hyperbolic Graph Topic Modeling Network with Continuously Updated Topic Tree

Connectivity across documents often exhibits a hierarchical network structure. Hyperbolic Graph Neural Networks (HGNNs) have shown promise in preserving network hierarchy. However, they do not model the notion of topics, thus document representations lack semantic interpretability. On the other hand, a corpus of documents usually has high variability in degrees of topic specificity. For example, some documents contain general content (e.g., sports), while others focus on specific themes (e.g., basketball and swimming). Topic models indeed model latent topics for semantic interpretability, but most assume a flat topic structure and ignore such semantic hierarchy. Given these two challenges, we propose a Hyperbolic Graph Topic Modeling Network to integrate both network hierarchy across linked documents and semantic hierarchy within texts into a unified HGNN framework. Specifically, we construct a two-layer document graph. Intra- and cross-layer encoding captures network hierarchy. We design a topic tree for text decoding to preserve semantic hierarchy and learn interpretable topics. Supervised and unsupervised experiments verify the effectiveness of our model.

Quantifying Node Importance over Network Structural Stability

Quantifying node importance with respect to engagement dynamics is critical to supporting network stability. We can motivate or retain users in a social platform according to their importance so that the network is more sustainable. Existing studies validate that the coreness of a node is the "best practice" on network topology to estimate the engagement of the node. In this paper, the importance of a node is the effect on the engagement of other nodes when its own engagement is strengthened or weakened. Specifically, the importance of a node is quantified via two novel concepts: the anchor power, measuring the engagement effect of strengthening the node (i.e., the overall coreness gain), and the collapse power, measuring the engagement effect of weakening the node (i.e., the overall coreness loss). We find that the computation of the two concepts can be naturally integrated into a shell component-based framework, and we propose a unified static algorithm to compute both the anchored and collapsed followers. For evolving networks, efficient maintenance techniques are designed to update the follower sets of each node, which is around three orders of magnitude faster than rerunning the static algorithm. Extensive experiments on real-life data demonstrate the effectiveness of our model and the efficiency of our algorithms.

Finding Favourite Tuples on Data Streams with Provably Few Comparisons

One of the most fundamental tasks in data science is to assist a user with unknown preferences in finding high-utility tuples within a large database. To accurately elicit the unknown user preferences, a widely-adopted way is by asking the user to compare pairs of tuples. In this paper, we study the problem of identifying one or more high-utility tuples by adaptively receiving user input on a minimum number of pairwise comparisons. We devise a single-pass streaming algorithm, which processes each tuple in the stream at most once, while ensuring that the memory size and the number of requested comparisons are in the worst case logarithmic in n, where n is the number of all tuples. An important variant of the problem, which can help to reduce human error in comparisons, is to allow users to declare ties when confronted with pairs of tuples of nearly equal utility. We show that the theoretical guarantees of our method can be maintained for this important problem variant. In addition, we show how to enhance existing pruning techniques in the literature by leveraging powerful tools from mathematical programming. Finally, we systematically evaluate all proposed algorithms over both synthetic and real-life datasets, examine their scalability, and demonstrate their superior performance over existing methods.

HiMacMic: Hierarchical Multi-Agent Deep Reinforcement Learning with Dynamic Asynchronous Macro Strategy

Multi-agent deep reinforcement learning (MADRL) has been widely used in many scenarios such as robotics and game AI. However, existing methods mainly focus on optimizing agents' micro policies without considering the macro strategy. As a result, they cannot perform well in complex or sparse-reward scenarios like the StarCraft Multi-Agent Challenge (SMAC) and Google Research Football (GRF). To this end, we propose a hierarchical MADRL framework called "HiMacMic" with a dynamic asynchronous macro strategy. Spatially, HiMacMic determines a critical position using a positional heat map. Temporally, the macro strategy dynamically decides its deadline and is updated asynchronously among agents. We validate HiMacMic on four widely used benchmarks, namely Overcooked, GRF, SMAC, and SMAC-v2, with nine chosen scenarios. Results show that HiMacMic not only converges faster and achieves better results than ten existing approaches, but also adapts to different environment settings.

FedCP: Separating Feature Information for Personalized Federated Learning via Conditional Policy

Recently, personalized federated learning (pFL) has attracted increasing attention for privacy protection, collaborative learning, and tackling statistical heterogeneity among clients, e.g., hospitals, mobile smartphones, etc. Most existing pFL methods focus on exploiting the global information and personalized information in the client-level model parameters, while neglecting that data is the source of these two kinds of information. To address this, we propose the Federated Conditional Policy (FedCP) method, which generates a conditional policy for each sample to separate the global information and personalized information in its features, and then processes them with a global head and a personalized head, respectively. By considering personalization in a sample-specific manner, FedCP is more fine-grained than existing pFL methods. Extensive experiments in computer vision and natural language processing domains show that FedCP outperforms eleven state-of-the-art methods by up to 6.69%. Furthermore, FedCP maintains its superiority when some clients accidentally drop out, which happens frequently in mobile settings. Our code is public at https://github.com/TsingZ0/FedCP.

A Study of Situational Reasoning for Traffic Understanding

Intelligent Traffic Monitoring (ITMo) technologies hold the potential for improving road safety/security and for enabling smart city infrastructure. Understanding traffic situations requires a complex fusion of perceptual information with domain-specific and causal commonsense knowledge. Whereas prior work has provided benchmarks and methods for traffic monitoring, it remains unclear whether models can effectively align these information sources and reason in novel scenarios. To address this assessment gap, we devise three novel text-based tasks for situational reasoning in the traffic domain: i) BDD-QA, which evaluates the ability of Language Models (LMs) to perform situational decision-making, ii) TV-QA, which assesses LMs' abilities to reason about complex event causality, and iii) HDT-QA, which evaluates the ability of models to solve human driving exams. We adopt four knowledge-enhanced methods that have shown generalization capability across language reasoning tasks in prior work, based on natural language inference, commonsense knowledge-graph self-supervision, multi-QA joint training, and dense retrieval of domain information. We associate each method with a relevant knowledge source, including knowledge graphs, relevant benchmarks, and driving manuals. In extensive experiments, we benchmark various knowledge-aware methods against the three datasets, under zero-shot evaluation; we provide in-depth analyses of model performance on data partitions and examine model predictions categorically, to yield useful insights on traffic understanding, given different background knowledge and reasoning strategies.

Warpformer: A Multi-scale Modeling Approach for Irregular Clinical Time Series

Irregularly sampled multivariate time series are ubiquitous in various fields, particularly in healthcare, and exhibit two key characteristics: intra-series irregularity and inter-series discrepancy. Intra-series irregularity refers to the fact that time-series signals are often recorded at irregular intervals, while inter-series discrepancy refers to the significant variability in sampling rates among diverse series. However, recent advances in irregular time series have primarily focused on addressing intra-series irregularity, overlooking the issue of inter-series discrepancy. To bridge this gap, we present Warpformer, a novel approach that fully considers these two characteristics. In a nutshell, Warpformer has several crucial designs, including a specific input representation that explicitly characterizes both intra-series irregularity and inter-series discrepancy, a warping module that adaptively unifies irregular time series in a given scale, and a customized attention module for representation learning. Additionally, we stack multiple warping and attention modules to learn at different scales, producing multi-scale representations that balance coarse-grained and fine-grained signals for downstream tasks. We conduct extensive experiments on widely used datasets and a new large-scale benchmark built from clinical databases. The results demonstrate the superiority of Warpformer over existing state-of-the-art approaches.

MixupExplainer: Generalizing Explanations for Graph Neural Networks with Data Augmentation

Graph Neural Networks (GNNs) have received increasing attention due to their ability to learn from graph-structured data. However, their predictions are often not interpretable. Post-hoc instance-level explanation methods have been proposed to understand GNN predictions. These methods seek to discover substructures that explain the prediction behavior of a trained GNN. In this paper, we shed light on the existence of the distribution shifting issue in existing methods, which affects explanation quality, particularly in applications on real-life datasets with tight decision boundaries. To address this issue, we introduce a generalized Graph Information Bottleneck (GIB) form that includes a label-independent graph variable, which is equivalent to the vanilla GIB. Driven by the generalized GIB, we propose a graph mixup method, MixupExplainer, with a theoretical guarantee to resolve the distribution shifting issue. We conduct extensive experiments on both synthetic and real-world datasets to validate the effectiveness of our proposed mixup approach over existing approaches. We also provide a detailed analysis of how our proposed approach alleviates the distribution shifting issue.

Navigating Alignment for Non-identical Client Class Sets: A Label Name-Anchored Federated Learning Framework

Traditional federated classification methods, even those designed for non-IID clients, assume that each client annotates its local data with respect to the same universal class set. In this paper, we focus on a more general yet practical setting, non-identical client class sets, where clients focus on their own (different or even non-overlapping) class sets and seek a global model that works for the union of these classes. If one views classification as finding the best match between representations produced by the data and label encoders, such heterogeneity in client class sets poses a new significant challenge: local encoders at different clients may operate in different and even independent latent spaces, making them hard to aggregate at the server. We propose a novel framework, FedAlign, to align the latent spaces across clients from both label and data perspectives. From the label perspective, we leverage expressive natural-language class names as a common ground for label encoders to anchor class representations and to guide data encoder learning across clients. From the data perspective, during local training, we regard the global class representations as anchors and leverage the data points that are close/far enough to the anchors of locally-unaware classes to align the data encoders across clients. Our theoretical analysis of the generalization performance and extensive experiments on four real-world datasets of different tasks confirm that FedAlign outperforms various state-of-the-art (non-IID) federated classification methods.

DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph

Unsupervised representation learning for dynamic graphs has attracted a lot of research attention in recent years. Compared with static graphs, dynamic graphs comprehensively embody both the intrinsic stable characteristics of nodes and their time-related dynamic preferences. However, existing methods generally mix these two types of information into a single representation space, which may lead to poor interpretability, less robustness, and limited ability when applied to different downstream tasks. To solve the above problems, in this paper, we propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed. We specially design a temporal-clips contrastive learning task together with a structure contrastive learning task to effectively identify the time-invariant and time-varying representations, respectively. To further enhance the disentanglement of these two types of representations, we propose a disentanglement-aware discriminator under an adversarial learning framework from the perspective of information theory. Extensive experiments on Tencent and five commonly used public datasets demonstrate that DyTed, as a general framework that can be applied to existing methods, achieves state-of-the-art performance on various downstream tasks and is more robust against noise.

Rumor Detection with Diverse Counterfactual Evidence

The growth in social media has exacerbated the threat of fake news to individuals and communities. This draws increasing attention to developing efficient and timely rumor detection methods. The prevailing approaches resort to graph neural networks (GNNs) to exploit the post-propagation patterns of the rumor-spreading process. However, these methods lack inherent interpretation of rumor detection due to the black-box nature of GNNs. Moreover, these methods suffer from less robust results as they employ all the propagation patterns for rumor detection. In this paper, we address the above issues with the proposed Diverse Counterfactual Evidence framework for Rumor Detection (DCE-RD). Our intuition is to exploit the diverse counterfactual evidence of an event graph to serve as multi-view interpretations, which are further aggregated for robust rumor detection results. Specifically, our method first designs a subgraph generation strategy to efficiently generate different subgraphs of the event graph. We constrain the removal of these subgraphs to cause the change in rumor detection results. Thus, these subgraphs naturally serve as counterfactual evidence for rumor detection. To achieve multi-view interpretation, we design a diversity loss inspired by Determinantal Point Processes (DPP) to encourage diversity among the counterfactual evidence. A GNN-based rumor detection model further aggregates the diverse counterfactual evidence discovered by the proposed DCE-RD to achieve interpretable and robust rumor detection results. Extensive experiments on two real-world datasets show the superior performance of our method. Our code is available at https://github.com/Vicinity111/DCE-RD.

CFGL-LCR: A Counterfactual Graph Learning Framework for Legal Case Retrieval

Legal case retrieval, which aims to find relevant cases based on a short case description, serves as an important part of modern legal systems. Despite the success of existing retrieval methods based on pretrained language models, two issues in legal case retrieval have not been well considered before. First, existing methods underestimate the semantic associations among legal elements, e.g., law articles and crimes, which play an essential role in legal case retrieval. These methods adopt a pretrained language model to encode the whole legal case, instead of distinguishing the different legal elements within it; they randomly split a legal case into segments, which may break the completeness of each legal element. Second, due to the difficulty of annotating relevance labels for similar cases, legal case retrieval inevitably faces a lack of training data. In this paper, we propose a counterfactual graph learning framework for legal case retrieval. Concretely, to overcome the above challenges, we transform the legal case document into a graph and model the semantics of the legal elements through a graph neural network. To alleviate the low-resource problem and learn the causal relationship between the semantics of legal elements and relevance, a counterfactual data generator is designed to augment counterfactual data and enhance legal case representation. Extensive experiments on two publicly available legal benchmarks demonstrate that our CFGL-LCR significantly outperforms previous state-of-the-art methods in legal case retrieval.

Efficient Single-Source SimRank Query by Path Aggregation

Single-source SimRank query calculates the similarity between a query node and every node in a graph by traversing the paths starting from the query node. However, the number of such paths grows exponentially with path length, which degrades computational efficiency. Sampling-based algorithms reduce the computational cost by path sampling, but they need to sample sufficiently many paths to ensure accuracy, and their performance can still suffer from the large scale of paths. In this paper, we propose VecSim for efficient single-source SimRank query by path aggregation. VecSim first aggregates the paths starting from the query node that arrive at common nodes, step by step, to obtain the hitting probabilities, and then aggregates the paths starting from the arrived nodes in reverse to obtain the first-meeting probabilities in a similar way, maintaining only a few vectors. The extra-meeting probabilities are excluded from each step, and an efficient sampling-based algorithm is designed to estimate them by sampling paths within a specified length. To further speed up query processing, we propose a threshold-sieved algorithm, which prunes entries with small values that contribute little to the final similarity scores. Extensive experiments on four small and four large graphs demonstrate that VecSim outperforms its competitors in terms of time and space costs at comparable accuracy. In particular, VecSim achieves an empirical error on the order of 10^-4 in under 0.1 seconds on all of these graphs.
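
For context, the quantity that VecSim approximates is standard SimRank. A naive all-pairs fixed-point iteration, the expensive baseline that path-aggregation methods are designed to avoid, can be sketched as follows; the in-neighbor dict and node names are illustrative, not from the paper:

```python
def simrank(adj_in, c=0.8, iters=10):
    """Naive all-pairs SimRank: s(a,a)=1, and for a != b,
    s(a,b) = c / (|I(a)||I(b)|) * sum over in-neighbor pairs of s."""
    nodes = list(adj_in)
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        new_s = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_s[(a, b)] = 1.0
                elif adj_in[a] and adj_in[b]:
                    total = sum(s[(x, y)] for x in adj_in[a] for y in adj_in[b])
                    new_s[(a, b)] = c * total / (len(adj_in[a]) * len(adj_in[b]))
                else:
                    new_s[(a, b)] = 0.0  # a node without in-neighbors
        s = new_s
    return s
```

For instance, two nodes whose only in-neighbor is the same node have similarity exactly c. A single-source query needs only one row of this table, which is what path aggregation computes directly.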

Debiasing Recommendation by Learning Identifiable Latent Confounders

Recommendation systems aim to predict users' feedback on items not exposed to them yet. Confounding bias arises due to the presence of unmeasured variables (e.g., the socio-economic status of a user) that can affect both a user's exposure and feedback. Existing methods either (1) make untenable assumptions about these unmeasured variables or (2) directly infer latent confounders from users' exposure. However, they cannot guarantee the identification of counterfactual feedback, which can lead to biased predictions. In this work, we propose a novel method, i.e., identifiable deconfounder (iDCF), which leverages a set of proxy variables (e.g., observed user features) to resolve the aforementioned non-identification issue. The proposed iDCF is a general deconfounded recommendation framework that applies proximal causal inference to infer the unmeasured confounders and identify the counterfactual feedback with theoretical guarantees. Extensive experiments on various real-world and synthetic datasets verify the proposed method's effectiveness and robustness.

Local Boosting for Weakly-Supervised Learning

Boosting is a commonly used technique to enhance the performance of a set of base models by combining them into a strong ensemble model. Though widely adopted, boosting is typically used in supervised learning where the data is labeled accurately. However, in weakly supervised learning, where most of the data is labeled through weak and noisy sources, it remains nontrivial to design effective boosting approaches. In this work, we show that the standard implementation of the convex combination of base learners hardly works in the presence of noisy labels. Instead, we propose LocalBoost, a novel framework for weakly-supervised boosting. LocalBoost iteratively boosts the ensemble model along two dimensions, i.e., intra-source and inter-source. The intra-source boosting introduces locality to the base learners and enables each base learner to focus on a particular feature regime by training new base learners on granularity-varying error regions. For the inter-source boosting, we leverage a conditional function to indicate the weak source where the sample is more likely to appear. To account for the weak labels, we further design an estimate-then-modify approach to compute the model weights. Experiments on seven datasets show that our method significantly outperforms vanilla boosting methods and other weakly-supervised methods.

Capacity Constrained Influence Maximization in Social Networks

Influence maximization (IM) aims to identify a small number of influential individuals to maximize the information spread and finds applications in various fields. It was first introduced in the context of viral marketing, where a company pays a few influencers to promote the product. However, apart from the cost factor, the capacity of individuals to consume content poses challenges for implementing IM in real-world scenarios. For example, players on online gaming platforms can only interact with a limited number of friends. In addition, we observe that in these scenarios, (i) the initial adopters of promotion are likely to be the friends of influencers rather than the influencers themselves, and (ii) existing IM solutions produce sub-par results with high computational demands. Motivated by these observations, we propose a new IM variant called capacity constrained influence maximization (CIM), which aims to select a limited number of influential friends for each initial adopter such that the promotion can reach more users. To solve CIM effectively, we design two greedy algorithms, MG-Greedy and RR-Greedy, ensuring the 1/2-approximation ratio. To improve the efficiency, we devise the scalable implementation named RR-OPIM+ with (1/2-ε)-approximation and near-linear running time. We extensively evaluate the performance of 9 approaches on 6 real-world networks, and our solutions outperform all competitors in terms of result quality and running time. Additionally, we deploy RR-OPIM+ to online game scenarios, which improves the baseline considerably.
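
The abstract does not specify MG-Greedy or RR-Greedy, but the generic machinery they build on, marginal-gain greedy selection with Monte Carlo spread estimation under the independent cascade model, can be sketched as background; the graph data, edge probability, and simulation counts below are all illustrative:

```python
import random

def simulate_ic(adj, seeds, p=0.1, rng=random):
    """One independent-cascade run: each newly activated node tries to
    activate each inactive out-neighbor once with probability p.
    Returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(adj, k, sims=200, p=0.1, seed=0):
    """Plain greedy: repeatedly add the node with the largest estimated
    marginal gain in expected spread (Monte Carlo average)."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        base = (sum(simulate_ic(adj, chosen, p, rng) for _ in range(sims)) / sims
                if chosen else 0.0)
        best, best_gain = None, -1.0
        for v in adj:
            if v in chosen:
                continue
            est = sum(simulate_ic(adj, chosen + [v], p, rng) for _ in range(sims)) / sims
            if est - base > best_gain:
                best, best_gain = v, est - base
        chosen.append(best)
    return chosen
```

On a star graph, for example, the greedy first pick is the hub, whose expected spread dominates any leaf's. CIM additionally restricts each initial adopter to a limited number of influential friends, which this unconstrained sketch does not model.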

Efficient Approximation Algorithms for Spanning Centrality

Given a graph G, the spanning centrality (SC) of an edge e measures the importance of e for G to remain connected. In practice, SC has seen extensive applications in computational biology, electrical networks, and combinatorial optimization. However, computing the SC of all edges (AESC) on large graphs is highly challenging. Existing techniques fail to handle such graphs, as they either suffer from expensive matrix operations or require sampling numerous long random walks. To circumvent these issues, this paper proposes TGT and its enhanced version TGT+, two algorithms for AESC computation that offer rigorous theoretical approximation guarantees. In particular, TGT remedies the deficiencies of previous solutions by conducting deterministic graph traversals with carefully crafted truncated lengths. TGT+ further advances TGT in both empirical efficiency and asymptotic performance while retaining result quality, by combining TGT with random walks and several additional heuristic optimizations. We experimentally evaluate TGT+ against recent competitors for AESC on a variety of real datasets. The results confirm that TGT+ often outperforms the state of the art by over an order of magnitude in speed without degrading accuracy.
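
As a point of reference for the quantity TGT/TGT+ approximate at scale: for an unweighted connected graph, the spanning centrality of an edge equals its effective resistance (the fraction of spanning trees containing the edge), which an exact but cubic-time baseline can read off the Laplacian pseudoinverse. This is not the paper's algorithm, only the exact definition it approximates:

```python
import numpy as np

def spanning_centrality(n, edges):
    """Exact SC of each edge (u, v) as the effective resistance
    R(u, v) = L+[u,u] + L+[v,v] - 2 L+[u,v], where L+ is the
    pseudoinverse of the graph Laplacian. O(n^3): a baseline only."""
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    Lp = np.linalg.pinv(L)
    return {(u, v): Lp[u, u] + Lp[v, v] - 2.0 * Lp[u, v] for u, v in edges}
```

Sanity checks: every edge of a tree is a bridge and gets SC exactly 1, while each edge of a triangle gets SC 2/3 (two of the three spanning trees contain it).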

DM-PFL: Hitchhiking Generic Federated Learning for Efficient Shift-Robust Personalization

Personalized federated learning collaboratively trains client-specific models, which holds potential for various mobile and IoT applications with heterogeneous data. However, existing solutions are vulnerable to distribution shifts between training and test data, and involve high training workloads on local devices. These two shortcomings hinder the practical usage of personalized federated learning in real-world mobile applications. To overcome these drawbacks, we explore efficient shift-robust personalization for federated learning. The principle is to hitchhike the global model to improve the shift-robustness of personalized models with minimal extra training overhead. To this end, we present DM-PFL, a novel framework that utilizes a dual masking mechanism to train both global and personalized models with weight-level parameter sharing and end-to-end sparse training. Evaluations on various datasets show that our methods not only improve the test accuracy in the presence of test-time distribution shifts but also save communication and computation costs compared to state-of-the-art personalized federated learning schemes.

Domain-Specific Risk Minimization for Domain Generalization

Domain generalization (DG) approaches typically use the hypothesis learned on source domains for inference on the unseen target domain. However, such a hypothesis can be arbitrarily far from the optimal one for the target domain, a discrepancy we term the ''adaptivity gap.'' Without exploiting the domain information of the unseen test samples, estimating and minimizing the adaptivity gap is intractable, which prevents us from robustifying a model to any unknown distribution. In this paper, we first establish a generalization bound that explicitly considers the adaptivity gap. Our bound motivates two strategies to reduce the gap: the first ensembles multiple classifiers to enrich the hypothesis space and uses effective gap-estimation methods to guide the selection of a better hypothesis for the target; the second minimizes the gap directly by adapting model parameters using online target samples. We thus propose Domain-specific Risk Minimization (DRM). During training, DRM models the distributions of different source domains separately; for inference, DRM performs online model steering using the source hypothesis for each arriving target sample. Extensive experiments demonstrate the effectiveness of the proposed DRM for domain generalization. Code is available at: https://github.com/yfzhang114/AdaNPC.

Contrastive Cross-scale Graph Knowledge Synergy

Graph representation learning via Contrastive Learning (GCL) has drawn considerable attention recently. Efforts have mainly focused on gathering more global information via contrasting on a single high-level graph view, which, however, underestimates the inherently complex and hierarchical properties of many real-world networks, leading to sub-optimal embeddings. To incorporate these properties of a complex graph, we propose Cross-Scale Contrastive Graph Knowledge Synergy (CGKS), a generic feature-learning framework, to advance graph contrastive learning with enhanced generalization ability and awareness of latent anatomies. Specifically, to maintain the hierarchical information, we create a so-called graph pyramid (GP) consisting of coarse-grained graph views. Each graph view is obtained via a carefully designed topology-aware graph coarsening layer that extends Laplacian Eigenmaps with negative sampling. To promote cross-scale information sharing and knowledge interactions among the GP, we propose a novel joint optimization formula that contains a pairwise contrastive loss between any two coarse-grained graph views. This synergy loss not only promotes knowledge sharing that yields informative representations, but also stabilizes the training process. Experiments on various downstream tasks demonstrate the substantial improvements of the proposed method over its counterparts.

Adaptive Disentangled Transformer for Sequential Recommendation

Sequential recommendation aims at mining time-aware user interests through modeling sequential behaviors. Transformer, as an effective architecture designed to process sequential input data, has shown its superiority in capturing sequential relations for recommendation. Nevertheless, existing Transformer architectures lack explicit regularization for layer-wise disentanglement, which fails to take advantage of disentangled representation in recommendation and leads to suboptimal performance. In this paper, we study the problem of layer-wise disentanglement for Transformer architectures and propose the Adaptive Disentangled Transformer (ADT) framework, which is able to adaptively determine the optimal degree of disentanglement of attention heads within different layers. Concretely, we propose to encourage disentanglement by requiring the independence constraint via mutual information estimation over attention heads and employing auxiliary objectives to prevent the information from collapsing into useless noise. We further propose a progressive scheduler to adaptively adjust the weights controlling the degree of disentanglement via an evolutionary process. Extensive experiments on various real-world datasets demonstrate the effectiveness of our proposed ADT framework.

AdaProp: Learning Adaptive Propagation for Graph Neural Network based Knowledge Graph Reasoning

Due to the popularity of Graph Neural Networks (GNNs), various GNN-based methods have been designed to reason on knowledge graphs (KGs). An important design component of GNN-based KG reasoning methods is called the propagation path, which contains a set of involved entities in each propagation step. Existing methods use hand-designed propagation paths, ignoring the correlation between the entities and the query relation. In addition, the number of involved entities will explosively grow at larger propagation steps. In this work, we are motivated to learn an adaptive propagation path in order to filter out irrelevant entities while preserving promising targets. First, we design an incremental sampling mechanism where the nearby targets and layer-wise connections can be preserved with linear complexity. Second, we design a learning-based sampling distribution to identify the semantically related entities. Extensive experiments show that our method is powerful, efficient and semantic-aware. The code is available at https://github.com/LARS-research/AdaProp.

Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers

Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers only using category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification are less concerned with two challenges: (1) Papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit the structural information such as citation links across papers and the hierarchy of sections and paragraphs in each paper. To tackle these challenges, in this study, we propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FUTEX significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples.

Hierarchical Invariant Learning for Domain Generalization Recommendation

Most cross-domain recommenders require samples from target domains, or source-target overlaps, to carry out domain adaptation. However, in many real-world situations, target domains lack such knowledge. Few works discuss this problem, whose essence is domain generalization recommendation. In this paper, we formalize domain generalization recommendation with a clear symbolic definition and propose corresponding models. Moreover, we illustrate its strong connection with zero-shot recommendation, pretrained recommendation, and cold-start recommendation, distinguishing it from content-based recommendation. By analyzing its properties, we propose HIRL^+ and a series of heuristic methods to solve this problem. We propose hierarchical invariant learning to expel specific patterns at both the domain level and the environment level, and to find the common patterns in the generalization space. To make the division of environments flexible, fine-grained, and balanced, we put forward a learnable environment assignment method. To improve robustness against distribution shifts within domain generalization, we present an adversarial environment refinement method. In addition, we conduct experiments on real-world datasets to verify the effectiveness of our models, and carry out further studies on domain distance and domain diversity. To benefit the research community and promote this direction, we discuss the future of this field.

Towards Fair Disentangled Online Learning for Changing Environments

In online learning for changing environments, data are received sequentially over time, and their distribution assumptions may vary frequently. Although existing methods demonstrate the effectiveness of their learning algorithms by providing a tight bound on either dynamic regret or adaptive regret, most of them completely ignore learning with model fairness, defined as statistical parity across different sub-populations (e.g., race and gender). Another drawback is that when adapting to a new environment, an online learner needs to update model parameters with a global change, which is costly and inefficient. Inspired by the sparse mechanism shift hypothesis [22], we claim that changing environments in online learning can be attributed to partial changes in learned parameters that are specific to environments, while the rest remain invariant to changing environments. To this end, in this paper, we propose a novel algorithm under the assumption that data collected at each time step can be disentangled into two representations: an environment-invariant semantic factor and an environment-specific variation factor. The semantic factor is further used for fair prediction under a group fairness constraint. To evaluate the sequence of model parameters generated by the learner, a novel regret is proposed, which takes a mixed form of dynamic and static regret metrics followed by a fairness-aware long-term constraint. The detailed analysis provides theoretical guarantees for the loss regret and the violation of cumulative fairness constraints. Empirical evaluations on real-world datasets demonstrate that our proposed method consistently outperforms baseline methods in model accuracy and fairness.

DoubleAdapt: A Meta-learning Approach to Incremental Learning for Stock Trend Forecasting

Stock trend forecasting is a fundamental task of quantitative investment where precise predictions of price trends are indispensable. As an online service, stock data continuously arrive over time. It is practical and efficient to incrementally update the forecast model with the latest data, which may reveal new patterns recurring in the future stock market. However, incremental learning for stock trend forecasting remains under-explored due to the challenge of distribution shifts (a.k.a. concept drifts). With the stock market dynamically evolving, the distribution of future data can slightly or significantly differ from incremental data, hindering the effectiveness of incremental updates. To address this challenge, we propose DoubleAdapt, an end-to-end framework with two adapters, which can effectively adapt the data and the model to mitigate the effects of distribution shifts. Our key insight is to automatically learn how to adapt stock data into a locally stationary distribution in favor of profitable updates. Complemented by data adaptation, we can confidently adapt the model parameters under mitigated distribution shifts. We cast each incremental learning task as a meta-learning task and automatically optimize the adapters for desirable data adaptation and parameter initialization. Experiments on real-world stock datasets demonstrate that DoubleAdapt achieves state-of-the-art predictive performance and shows considerable efficiency.

Spatial Clustering Regression of Count Value Data via Bayesian Mixture of Finite Mixtures

Investigating relationships between response variables and covariates in areas such as environmental science, geoscience, and public health is an important endeavor. Based on a Bayesian mixture of finite mixtures model, we present a novel spatially clustered coefficients regression model for count value data. The proposed method detects the spatial homogeneity of the Poisson regression coefficients. A Markov random field constrained mixture of finite mixtures prior provides a regularized estimator of the number of clusters of regression coefficients with geographical neighborhood information. As a by-product, we also provide the theoretical properties of our proposed method when the Markov random field is exchangeable. An efficient Markov chain Monte Carlo algorithm is developed by using the multivariate log gamma distribution as a base distribution. Simulation studies are carried out to examine the empirical performance of the proposed method. Additionally, we analyze Georgia's premature death data as an illustration of the effectiveness of our approach. The supplementary materials are provided on GitHub at https://github.com/pengzhaostat/MLG_MFM.

Skill Disentanglement for Imitation Learning from Suboptimal Demonstrations

Imitation learning has achieved great success in many sequential decision-making tasks, in which a neural agent is learned by imitating collected human demonstrations. However, existing algorithms typically require a large number of high-quality demonstrations that are difficult and expensive to collect. Usually, a trade-off needs to be made between demonstration quality and quantity in practice. Targeting this problem, in this work we consider the imitation of sub-optimal demonstrations, with both a small clean demonstration set and a large noisy set. Some pioneering works have been proposed, but they suffer from many limitations, e.g., assuming a demonstration to be of the same optimality throughout time steps and failing to provide any interpretation w.r.t. knowledge learned from the noisy set. Addressing these problems, we propose SDIL by evaluating and imitating at the sub-demonstration level, encoding action primitives of varying quality into different skills. Concretely, SDIL consists of a high-level controller to discover skills and a skill-conditioned module to capture action-taking policies, and is trained following a two-phase pipeline by first discovering skills with all demonstrations and then adapting the controller to only the clean set. A mutual-information-based regularization and a dynamic sub-demonstration optimality estimator are designed to promote disentanglement in the skill space. Extensive experiments are conducted over two gym environments and a real-world healthcare dataset to demonstrate the superiority of SDIL in learning from sub-optimal demonstrations and its improved interpretability by examining learned skills.

GraphGLOW: Universal and Generalizable Structure Learning for Graph Neural Networks

Graph structure learning is a well-established problem that aims at optimizing graph structures adaptive to specific graph datasets to help message passing neural networks (i.e., GNNs) yield effective and robust node embeddings. However, the common limitation of existing models lies in the underlying closed-world assumption: the testing graph is the same as the training graph. This premise requires independently training the structure learning model from scratch for each graph dataset, which leads to prohibitive computation costs and a potential risk of serious over-fitting. To mitigate these issues, this paper explores a new direction that moves forward to learn a universal structure learning model that can generalize across graph datasets in an open world. We first introduce the mathematical definition of this novel problem setting, and describe the model formulation from a probabilistic data-generative aspect. Then we devise a general framework that coordinates a single graph-shared structure learner and multiple graph-specific GNNs to capture the generalizable patterns of optimal message-passing topology across datasets. The well-trained structure learner can directly produce adaptive structures for unseen target graphs without any fine-tuning. Across diverse datasets and various challenging cross-graph generalization protocols, our experiments show that even without training on target graphs, the proposed model i) significantly outperforms expressive GNNs trained on input (non-optimized) topology, and ii) surprisingly performs on par with state-of-the-art models that independently optimize adaptive structures for specific target graphs, with orders-of-magnitude acceleration in training on the target graph.

Generative Causal Interpretation Model for Spatio-Temporal Representation Learning

Learning, interpreting, and predicting from complex and high-dimensional spatio-temporal data is a natural ability of humans and other intelligent agents, and one of the most important and difficult challenges of AI. Although objects may present different observed phenomena under different situations, their causal mechanism and generation rules are stable and invariant. Different from most existing studies that focus on dynamic correlation, we explore the latent causal structure and mechanism of causal descriptors in the spatio-temporal dimension at the microscopic level, thus revealing the generation principle of observations. In this paper, we regard the causal mechanism as a spatio-temporal causal process modulated by non-stationary exogenous variables. To this end, we propose a theoretically-grounded Generative Causal Interpretation Model (GCIM), which infers explanatory-capable microscopic causal descriptors from observational data via spatio-temporal causal representations. The core of GCIM is to estimate the prior distribution of causal descriptors by using the spatio-temporal causal structure and transition process under the constraints of identifiable conditions, thus extending the Variational Auto-Encoder (VAE). Furthermore, our method is able to automatically capture domain information from observations to model non-stationarity. We further analyze the model identifiability, showing that the proposed model learned from observations recovers the true one up to a certain degree. Experiments on synthetic and real-world datasets show that GCIM can successfully identify latent causal descriptors and structures, and accurately predict future data.

Improving Search Clarification with Structured Information Extracted from Search Results

Search clarification in conversational search systems exhibits a clarification pane composed of several candidate aspect items and a clarifying question. To generate a pane, existing studies usually rely on unstructured document texts. However, important structured information in search results is not effectively considered, making the generated panes inaccurate in some cases. In this paper, we emphasize the importance of structured information in search results for improving search clarification. We propose enhancing unstructured documents with two kinds of structured information: one is the "In-List" relation obtained from HTML list structures, which helps extract groups of high-quality items with abundant parallel information; the other is the "Is-A" relation extracted from knowledge bases, which is helpful for generating good questions with explicit prompts. To avoid introducing excessive noise, we design a relation selection process to filter out ineffective relations. We further design a BART-based model for generating clarification panes. The experimental results show that the structured information is a good supplement for generating high-quality clarification panes.

Dense Representation Learning and Retrieval for Tabular Data Prediction

Data science is concerned with mining data patterns from a database, which is typically assembled from tabular data. Following the routine of machine learning, most previous work mines tabular data patterns based on a single instance, neglecting the similar tabular data instances that could help predict the label of the target data instance. Recently, some retrieval-based methods for tabular data label prediction have been proposed; however, they treat the data as sparse vectors to perform the retrieval, failing to make use of the semantic information of the tabular data. To address this problem, in this paper we propose a novel framework of dense retrieval on tabular data (DERT) to support flexible data representation learning and effective label prediction on tabular data. DERT consists of two major components: (i) the encoder, which maps tabular data to embeddings and can be trained with flexible neural networks and auxiliary loss functions; and (ii) the retrieval and prediction component, which makes use of similar rows in the table to predict the label of the target row. We test DERT on two tasks based on five real-world datasets, and experimental results show that DERT achieves consistent improvements over the state-of-the-art and various baselines.
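
The retrieve-and-predict idea behind such dense-retrieval methods can be illustrated with a minimal sketch: retrieve the rows most similar to a target row in an embedding space and vote over their labels. The embeddings and aggregation below are hypothetical stand-ins, not DERT's actual components:

```python
import numpy as np

def knn_label_predict(query_emb, table_embs, table_labels, k=3):
    """Predict a label for a query row by majority vote over its k most
    similar rows under cosine similarity."""
    norms = np.linalg.norm(table_embs, axis=1) * np.linalg.norm(query_emb)
    sims = table_embs @ query_emb / np.maximum(norms, 1e-12)
    top_k = np.argsort(-sims)[:k]          # indices of the k nearest rows
    return np.bincount(table_labels[top_k]).argmax()

# Toy "row embeddings" forming two clusters with labels 0 and 1.
embs = np.array([[1.0, 0.1], [0.9, 0.0], [1.1, -0.1],
                 [0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_label_predict(np.array([1.0, 0.0]), embs, labels, k=3))  # -> 0
```

In a trained system, the embeddings would come from a learned encoder and the aggregation could weight neighbors by similarity rather than voting.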

Automatic Temporal Relation in Multi-Task Learning

Multi-task learning with temporal relation is a common prediction method for modelling the evolution of a wide range of systems. Considering the inherent relations between multiple time points, many works apply multi-task learning to jointly analyse all time points, with each time point corresponding to a prediction task. The most difficult challenge is determining how to fully explore, and thus exploit, the shared valuable temporal information between tasks to improve the generalization performance and robustness of the model. Existing works fall into two categories: temporal smoothness and mean temporal relations. Both approaches, however, utilize a predefined and symmetric task relation structure that is too rigid and insufficient to adequately capture the intricate temporal relations between tasks. Instead, we propose a novel mechanism named Automatic Temporal Relation (AutoTR) for directly and automatically learning the temporal relation from any given dataset. To solve the biconvex objective function, we adopt alternating optimization and show that the two related sub-optimization problems are amenable to closed-form computation of the proximal operator. To solve the two problems efficiently, the accelerated proximal gradient method is used, which has the fastest convergence rate of any first-order method. We have preprocessed six public real-life datasets and conducted extensive experiments to fully demonstrate the superiority of AutoTR. The results show that AutoTR outperforms several baseline methods on almost all datasets with different training ratios, in terms of both overall model performance and every individual task's performance. Furthermore, our findings verify that the temporal relation between tasks is asymmetric, which has not been considered in previous works. The source code can be found at https://github.com/menghui-zhou/AutoTR.

Narrow the Input Mismatch in Deep Graph Neural Network Distillation

Graph neural networks (GNNs) have been widely studied for modeling graph-structured data. Thanks to the over-parameterization and large receptive field of deep GNNs, "deep" is a promising direction for developing GNNs further and has shown some superior performance. However, the over-stacked structures of deep architectures incur high inference cost in deployment. To compress deep GNNs, we can use knowledge distillation (KD) to make shallow student GNNs mimic teacher GNNs. Existing KD methods in the graph domain focus on constructing diverse supervision on embeddings or predictions produced by student GNNs, but overlook the gap in receptive field (i.e., input information) between student and teacher, which brings difficulties to KD. We call this gap "input mismatch". To alleviate this problem, we propose a lightweight stochastic extended module to provide an estimate of the missing input information for student GNNs. The estimator models the distribution of missing information. Specifically, we model the missing information as an independent distribution at the graph level and a conditional distribution at the node level (given the condition of observable input). These two estimates are optimized using a Bayesian methodology and combined into a balanced estimate as additional input to student GNNs. To the best of our knowledge, we are the first to address the "input mismatch" problem in deep GNN distillation. Experiments on extensive benchmarks demonstrate that our method outperforms existing KD methods for GNNs in distillation performance, which confirms that the estimates are reasonable and effective.

A Sublinear Time Algorithm for Opinion Optimization in Directed Social Networks via Edge Recommendation

In this paper, we study the opinion maximization problem for the leader-follower DeGroot model of opinion dynamics in a social network modelled by a directed graph with n nodes, where a small number of nodes are competing leader nodes with binary opposing opinions 0 or 1, and the rest are follower nodes. We address the problem of maximizing the overall opinion by adding k ≪ n new edges, where each edge is incident to a 1-leader and a follower. We prove that the objective function is monotone and submodular, and then propose a deterministic greedy algorithm with an approximation ratio of (1 - 1/e) and O(n^3) running time. We then develop a fast sampling algorithm based on ℓ-truncated absorbing random walks and sample-materialization techniques, which has sublinear time complexity O(k n^{1/2} log^{3/2} n / ε^3) for any error parameter ε > 0. We provide extensive experiments on real networks to evaluate the performance of our algorithms. The results show that for undirected graphs our fast sampling algorithm outperforms the state-of-the-art method in terms of efficiency and effectiveness, while for directed graphs our fast sampling algorithm is as effective as our deterministic greedy algorithm, and both are much better than the baseline strategies. Moreover, our fast algorithm is scalable to large directed graphs with over 41 million nodes.
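
The deterministic greedy algorithm follows the classic template for maximizing a monotone submodular set function under a cardinality constraint, which is what yields the (1 - 1/e) guarantee. A minimal sketch with a toy coverage objective (ours, not the paper's opinion function):

```python
def greedy_max(candidates, f, k):
    """Repeatedly add the candidate with the largest marginal gain f(S+c)-f(S).
    For monotone submodular f this achieves at least (1 - 1/e) of the optimum."""
    chosen = set()
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: f(chosen | {c}) - f(chosen))
        chosen.add(best)
    return chosen

# Toy monotone submodular objective: number of elements covered by chosen sets.
sets = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {1, 4}}

def coverage(S):
    return len(set().union(*(sets[s] for s in S))) if S else 0

# Greedy picks {"a", "b"} (covers 3 elements); the optimum {"b", "d"} covers 4,
# illustrating the approximation rather than exactness of the greedy scheme.
print(greedy_max(sets.keys(), coverage, k=2))
```

The paper's contribution lies in evaluating such marginal gains efficiently (exactly for the O(n^3) greedy, approximately via truncated absorbing random walks for the sublinear variant), not in the greedy template itself.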

Maintaining the Status Quo: Capturing Invariant Relations for OOD Spatiotemporal Learning

Spatiotemporal (ST) learning has become a crucial technique for urban digitalization. Due to the expansion and dynamics of cities, current spatiotemporal models are inclined to suffer distribution shifts between training and testing sets, leading to the OOD dilemma. However, few studies focus on this OOD problem in temporal regression, let alone spatiotemporal learning. Spatiotemporal data usually reveals segment-level heterogeneity within periodicity and complex spatial dependencies, posing challenges to invariance extraction. In this paper, we find that invariant ST relations are key to generalization and devise a causal ST learning framework, CauSTG, which enables invariant relations to be transferred to OOD scenarios. Specifically, we take temporal steps as environments, and transform spatial-temporal relations into learnable parameters. To tackle heterogeneity in periodicity, we partition temporal steps into sub-environments by identifying distinctive trend patterns, enabling re-organized samples to be trained separately. To extract invariance within ST observations, we propose a spatiotemporal consistency learner and a hierarchical invariance explorer that jointly filter out stable relations. Our spatiotemporal learner quantifies bi-directional spatial consistency and extracts disentangled seasonal-trend patterns via trainable parameters. Further, the hierarchical invariance explorer constructs a variation-based filter to achieve both local and global invariance. Experiments reveal that CauSTG improves performance by up to 10.26% over the best baselines, and the visualized invariant relations can well interpret the physical rationales. The appendix and code are available in our GitHub repository.

Dual-view Molecular Pre-training

Molecular pre-training, which aims to learn effective representations for molecules from large amounts of data, has attracted substantial attention in cheminformatics and bioinformatics. A molecule can be viewed either as a graph (where atoms are connected by bonds) or as a SMILES sequence (where depth-first search is applied to the molecular graph with specific rules). The Transformer and graph neural networks (GNNs) are two representative methods for handling sequential data and graph data, which model molecules globally and locally, respectively, and are thus supposed to be complementary. In this work, we propose to leverage both representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DVMP), that can effectively combine the strengths of both types of molecule representations. DVMP has a Transformer branch and a GNN branch, and the two branches are pre-trained to maintain the semantic consistency of molecules. After pre-training, we can use either the Transformer branch (recommended according to empirical results), the GNN branch, or both for downstream tasks. DVMP is tested on 11 molecular property prediction tasks and outperforms strong baselines. Furthermore, we test DVMP on three retrosynthesis tasks and it achieves state-of-the-art results. Our code is released at https://github.com/microsoft/DVMP.

On Structural Expressive Power of Graph Transformers

The graph Transformer has recently received wide attention in the research community for its outstanding performance, yet its structural expressive power has not been well analyzed. Inspired by the connections between the Weisfeiler-Lehman (WL) graph isomorphism test and graph neural networks (GNNs), we introduce the SEG-WL test (Structural Encoding enhanced Global Weisfeiler-Lehman test), a generalized graph isomorphism test algorithm, as a powerful theoretical tool for exploring the structural discriminative power of graph Transformers. We theoretically prove that the SEG-WL test is an expressivity upper bound on a wide range of graph Transformers, and that the representational power of the SEG-WL test can be approximated arbitrarily well by a simple Transformer network under certain conditions. With the SEG-WL test, we show how graph Transformers' expressive power is determined by the design of structural encodings, and present conditions that make the expressivity of graph Transformers go beyond the WL test and GNNs. Moreover, motivated by the popular shortest path distance encoding, we follow theory-oriented principles and develop a provably stronger structural encoding method, Shortest Path Induced Subgraph (SPIS) encoding. Our theoretical findings provide a novel and practical paradigm for investigating the expressive power of graph Transformers, and extensive synthetic and real-world experiments empirically verify the strengths of our proposed methods.
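
For context, the classic 1-WL color-refinement procedure that tests like SEG-WL generalize can be sketched in a few lines; structural encodings would enter through the initial coloring. This is the standard textbook algorithm, not the paper's SEG-WL test itself:

```python
def wl_colors(adj, rounds=3):
    """1-WL color refinement. adj: {node: [neighbors]}.
    Returns the sorted multiset of final node colors; two graphs with
    different multisets are certainly non-isomorphic (equal is inconclusive)."""
    colors = {v: 0 for v in adj}  # uniform initial coloring
    for _ in range(rounds):
        # New signature = own color + sorted multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return sorted(colors.values())

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
print(wl_colors(triangle) != wl_colors(path3))  # True: WL separates them
```

Transformer-style tests differ in that every node aggregates over all nodes (global attention) rather than only neighbors, with structure injected via the encodings.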

Path-Specific Counterfactual Fairness for Recommender Systems

Recommender systems (RSs) have become an indispensable part of online platforms. With the growing concerns of algorithmic fairness, RSs are not only expected to deliver high-quality personalized content, but are also demanded not to discriminate against users based on their demographic information. However, existing RSs could capture undesirable correlations between sensitive features and observed user behaviors, leading to biased recommendations. Most fair RSs tackle this problem by completely blocking the influences of sensitive features on recommendations. But since sensitive features may also affect user interests in a fair manner (e.g., race on culture-based preferences), indiscriminately eliminating all the influences of sensitive features inevitably degrades recommendation quality and necessary diversity. To address this challenge, we propose a path-specific fair RS (PSF-RS) for recommendations. Specifically, we summarize all fair and unfair correlations between sensitive features and observed ratings into two latent proxy mediators, where the concept of path-specific bias (PS-Bias) is defined based on path-specific counterfactual inference. Inspired by Pearl's minimal change principle, we address the PS-Bias by minimally transforming the biased factual world into a hypothetically fair world, where a fair RS model can be learned accordingly by solving a constrained optimization problem. For the technical part, we propose a feasible implementation of PSF-RS, i.e., PSF-VAE, with weakly-supervised variational inference, which robustly infers the latent mediators such that unfairness can be mitigated while necessary recommendation diversity is maximally preserved. Experiments conducted on semi-simulated and real-world datasets demonstrate the effectiveness of PSF-RS.

WinGNN: Dynamic Graph Neural Networks with Random Gradient Aggregation Window

Modeling dynamics in graph neural networks (GNNs) contributes to the understanding of evolution in dynamic graphs, which helps optimize temporal-spatial representations for real-world dynamic network problems. Empirically, dynamic GNN embedding requires additional temporal encoders, which inevitably introduce additional learning parameters that make dynamic GNNs oversized and inefficient. Furthermore, previous dynamic GNN models operate under the same fixed temporal term, which leads to a short-term temporal optimum. To address these issues, we propose the WinGNN framework for modeling dynamic graphs, which is realized by a simple GNN model with a meta-learning strategy and a novel mechanism of random gradient aggregation. WinGNN calculates the frame-wise loss of the current snapshot and passes the loss gradient to the next snapshot to model graph dynamics without temporal encoders. It then introduces a randomized sliding window to acquire the window-aware gradient on consecutive snapshots, and the two types of gradients are aggregated to update the GNN, thereby reducing the parameter size and improving robustness. Experiments on six public datasets show the advantage of WinGNN over existing baselines, where it achieves the best results on twenty-two out of twenty-four performance metrics.

Robust Positive-Unlabeled Learning via Noise Negative Sample Self-correction

Learning from positive and unlabeled data is known as positive-unlabeled (PU) learning in the literature and has attracted much attention in recent years. One common approach in PU learning is to sample a set of pseudo-negatives from the unlabeled data using ad-hoc thresholds so that conventional supervised methods can be applied with both positive and negative samples. Owing to the label uncertainty among the unlabeled data, errors from misclassifying unlabeled positive samples as negative samples inevitably appear and may even accumulate during training. Those errors often lead to performance degradation and model instability. To mitigate the impact of label uncertainty and improve the robustness of learning with positive and unlabeled data, we propose a new robust PU learning method with a training strategy motivated by the nature of human learning: easy cases should be learned first. Similar intuition has been utilized in curriculum learning to use only easier cases in the early stage of training before introducing more complex cases. Specifically, we utilize a novel "hardness" measure to distinguish unlabeled samples with a high chance of being negative from unlabeled samples with large label noise. An iterative training strategy is then implemented to fine-tune the selection of negative samples during training, including more "easy" samples in the early stages. Extensive experimental validation over a wide range of learning tasks shows that this approach can effectively improve the accuracy and stability of learning with positive and unlabeled data. Our code is available at https://github.com/woriazzc/Robust-PU.
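
The easy-cases-first idea can be sketched minimally as a rising threshold on an estimated "hardness" score; this is our own illustration with hypothetical numbers, not the paper's exact measure or schedule:

```python
def select_pseudo_negatives(scores, epoch, max_epochs, base=0.2, final=0.5):
    """scores: model-estimated probability of being positive for each
    unlabeled sample. Samples below a gradually rising threshold are
    treated as pseudo-negatives for this epoch, so only confidently
    negative ("easy") samples are used early in training."""
    threshold = base + (final - base) * epoch / max_epochs
    return [i for i, s in enumerate(scores) if s < threshold]

scores = [0.05, 0.15, 0.35, 0.45, 0.8]
print(select_pseudo_negatives(scores, epoch=0, max_epochs=10))   # [0, 1]
print(select_pseudo_negatives(scores, epoch=10, max_epochs=10))  # [0, 1, 2, 3]
```

In the actual method the selection is refined iteratively as the model's hardness estimates improve, rather than following a fixed schedule.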

DyGen: Learning from Noisy Labels via Dynamics-Enhanced Generative Modeling

Learning from noisy labels is a challenge that arises in many real-world applications where training data can contain incorrect or corrupted labels. When fine-tuning language models with noisy labels, models can easily overfit the label noise, leading to decreased performance. Most existing methods for learning from noisy labels use static input features for denoising, but these methods are limited by the information they can provide on true label distributions and can result in biased or incorrect predictions. In this work, we propose the Dynamics-Enhanced Generative Model (DyGen), which uses dynamic patterns in the embedding space during the fine-tuning process of language models to improve noisy label predictions. DyGen uses the variational auto-encoding framework to infer the posterior distributions of true labels from noisy labels and training dynamics. Additionally, a co-regularization mechanism is used to minimize the impact of potentially noisy labels and priors. DyGen demonstrates an average accuracy improvement of 3.10% on two synthetic noise datasets and 1.48% on three real-world noise datasets compared to the previous state-of-the-art. Extensive experiments and analyses show the effectiveness of each component in DyGen. Our code is available for reproducibility on GitHub.

SESSION: Applied Data Track Full Papers

CADENCE: Offline Category Constrained and Diverse Query Generation for E-commerce Autosuggest

Query AutoComplete (QAC) or AutoSuggest is the first point of user interaction with an e-commerce search engine. It is critical for the QAC system to suggest relevant and well-formed queries for multiple possible user intents. Suggesting only historical user queries fails in the case of infrequent or new prefixes. Much recent work generates synthetic candidates using models trained on user queries and thus has these issues: a) a cold-start problem, as new products in the catalogue fail to get visibility due to lack of representation in user queries; b) poor quality of generated candidates due to concept drift; and c) low diversity/coverage of attributes such as brand, color, and other facets in generated candidates.

In this paper, we propose an offline neural query generation framework - CADENCE - to address these challenges by a) using both user queries and noisy product titles to train two separate neural language models using self-attention memory networks, b) adding category constraints during the training and query generation process to prevent concept drift c) implementing customized dynamic beam search to generate more diverse candidates for a given prefix. Besides solving for cold start and rare/unseen prefix coverage, CADENCE also increases the coverage of the existing query prefixes through a higher number of relevant and diverse query suggestions. We generated ~700K new offline queries, which have resulted in significant improvement in recall, reduction in product cold start, and increased coverage of attributes. Online A/B tests also show a significant impact on QAC usage, downstream search click-through rates, and product conversion.

Learning to Solve Grouped 2D Bin Packing Problems in the Manufacturing Industry

The two-dimensional bin packing problem (2DBP) is a critical optimization problem in the furniture production and glass cutting industries, where the objective is to cut smaller-sized items from a minimum number of large standard-sized raw materials. In practice, factories manufacture hundreds of customer orders (sets of items) every day, and to relieve pressure in management, a common practice is to group the orders into batches for production, ensuring that items from one order are in the same batch instead of scattered across the production line. In this work, we formulate this problem as the grouped 2D bin packing problem, a bi-level problem where the upper level partitions orders into groups and the lower level solves 2DBP for items in each group. The main challenges are (1) the coupled optimization of upper and lower levels and (2) the high computational efficiency required for practical application. To tackle these challenges, we propose an iteration-based hierarchical reinforcement learning framework, which can learn to solve the optimization problem in a data-driven way and provide fast online performance after offline training. Extensive experiments demonstrate that our method not only achieves the best performance compared to all baselines but is also robust to changes in dataset distribution and problem constraints. Finally, we deployed our method in the ARROW Home factory in China, resulting in a 4.1% reduction in raw material costs. We have released the source code and datasets to facilitate future research.

Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive Text Summarization (TL;DR) of Scientific Contents

The realm of scientific text summarization has experienced remarkable progress due to the availability of annotated brief summaries and ample data. However, the utilization of multiple input modalities, such as videos and audio, has yet to be thoroughly explored. At present, scientific multimodal-input-based text summarization systems tend to employ longer target summaries, such as abstracts, leading to underwhelming performance on the text summarization task.

In this paper, we deal with a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities. To this end, we introduce mTLDR, a first-of-its-kind dataset for the aforementioned task, comprising videos, audio, and text, along with both author-composed summaries and expert-annotated summaries. The mTLDR dataset contains a total of 4,182 instances collected from various academic conference proceedings, such as ICLR, ACL, and CVPR. Subsequently, we present mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused hyper-complex Transformer combined with a Wasserstein Riemannian Encoder Transformer, to dexterously capture the intricacies between different modalities in a hyper-complex latent geometric space. The hyper-complex Transformer captures the intrinsic properties between the modalities, while the Wasserstein Riemannian Encoder Transformer captures the latent structure of the modalities in the latent space geometry, thereby enabling the model to produce diverse sentences. mTLDRgen outperforms 20 baselines on mTLDR as well as another non-scientific dataset (How2) across three Rouge-based evaluation measures. Furthermore, based on the qualitative metrics, BERTScore and FEQA, and human evaluations, we demonstrate that the summaries generated by mTLDRgen are fluent and congruent to the original source material.

SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding

We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.

Augmenting Rule-based DNS Censorship Detection at Scale with Machine Learning

The proliferation of global censorship has led to the development of a plethora of measurement platforms to monitor and expose it. Censorship of the domain name system (DNS) is a key mechanism used across different countries. It is currently detected by applying heuristics to samples of DNS queries and responses (probes) for specific destinations. These heuristics, however, are platform-specific and have proven brittle when censors change their blocking behavior, necessitating a more reliable automated process for detecting censorship.

In this paper, we explore how machine learning (ML) models can (1) help streamline the detection process, (2) improve the potential of using large-scale datasets for censorship detection, and (3) discover new censorship instances and blocking signatures missed by existing heuristic methods. Our study shows that supervised models, trained using expert-derived labels on instances of known anomalies and possible censorship, can learn the detection heuristics employed by different measurement platforms. More crucially, we find that unsupervised models, trained solely on uncensored instances, can identify new instances and variations of censorship missed by existing heuristics. Moreover, both methods demonstrate the capability to uncover a substantial number of new DNS blocking signatures, i.e., injected fake IP addresses overlooked by existing heuristics. These results are underpinned by an important methodological finding: comparing the outputs of models trained using the same probes but with labels arising from independent processes allows us to more reliably detect cases of censorship in the absence of ground-truth labels of censorship.

RankFormer: Listwise Learning-to-Rank Using Listwide Labels

Web applications where users are presented with a limited selection of items have long employed ranking models to put the most relevant results first. Any feedback received from users is typically assumed to reflect a relative judgement on the utility of items, e.g. a user clicking on an item only implies it is better than items not clicked in the same ranked list. Hence, the objectives optimized in Learning-to-Rank (LTR) tend to be pairwise or listwise.

Yet, by only viewing feedback as relative, we neglect the user's absolute feedback on the list's overall quality, e.g. when no items in the selection are clicked. We thus reconsider the standard LTR paradigm and argue the benefits of learning from this listwide signal. To this end, we propose the RankFormer as an architecture that, with a Transformer at its core, can jointly optimize a novel listwide assessment objective and a traditional listwise LTR objective.

We simulate implicit feedback on public datasets and observe that the RankFormer succeeds in benefitting from listwide signals. Additionally, we conduct experiments in e-commerce on Amazon Search data and find the RankFormer to be superior to all baselines offline. An online experiment shows that knowledge distillation can be used to find immediate practical use for the RankFormer.

Capturing Conversion Rate Fluctuation during Sales Promotions: A Novel Historical Data Reuse Approach

Conversion rate (CVR) prediction is one of the core components in online recommender systems, and various approaches have been proposed to obtain accurate and well-calibrated CVR estimation. However, we observe that a well-trained CVR prediction model often performs sub-optimally during sales promotions. This can be largely ascribed to the problem of the data distribution shift, in which the conventional methods no longer work. To this end, we seek to develop alternative modeling techniques for CVR prediction. Observing similar purchase patterns across different promotions, we propose reusing the historical promotion data to capture the promotional conversion patterns. Herein, we propose a novel Historical Data Reuse (HDR) approach that first retrieves historically similar promotion data and then fine-tunes the CVR prediction model with the acquired data for better adaptation to the promotion mode. HDR consists of three components: an automated data retrieval module that seeks similar data from historical promotions, a distribution shift correction module that re-weights the retrieved data for better aligning with the target promotion, and a TransBlock module that quickly fine-tunes the original model for better adaptation to the promotion mode. Experiments conducted with real-world data demonstrate the effectiveness of HDR, as it improves both ranking and calibration metrics to a large extent. HDR has also been deployed on the display advertising system in Alibaba, bringing a lift of 9% RPM and 16% CVR during Double 11 Sales in 2022.

TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou

Life-long user behavior modeling, i.e., extracting a user's hidden interests from rich historical behaviors in months or even years, plays a central role in modern CTR prediction systems. Conventional algorithms mostly follow two cascading stages: a simple General Search Unit (GSU) for fast and coarse search over tens of thousands of long-term behaviors and an Exact Search Unit (ESU) for effective Target Attention (TA) over the small number of finalists from GSU. Although efficient, existing algorithms mostly suffer from a crucial limitation: the inconsistent target-behavior relevance metrics between GSU and ESU. As a result, their GSU usually misses highly relevant behaviors but retrieves ones considered irrelevant by ESU. In such cases, the TA in ESU, no matter how attention is allocated, mostly deviates from the real user interests and thus degrades the overall CTR prediction accuracy. To address such inconsistency, we propose TWo-stage Interest Network (TWIN), where our Consistency-Preserved GSU (CP-GSU) adopts the identical target-behavior relevance metric as the TA in ESU, making the two stages twins. Specifically, to break TA's computational bottleneck and extend it from ESU to GSU, namely from behavior length 10^2 to length 10^4-10^5, we build a novel attention mechanism by behavior feature splitting. For the video inherent features of a behavior, we calculate their linear projection by efficient pre-computing and caching strategies. For the user-item cross features, we compress each into a one-dimensional bias term in the attention score calculation to save computational cost. The consistency between the two stages, together with the effective TA-based relevance metric in CP-GSU, contributes to significant performance gains in CTR prediction. Offline experiments on a 46-billion-scale real production dataset from Kuaishou and an online A/B test show that TWIN outperforms all compared SOTA algorithms.
With optimized online infrastructure, we reduce the computational bottleneck by 99.3%, which contributes to the successful deployment of TWIN on Kuaishou, serving the main traffic of hundreds of millions of active users every day.

PEPNet: Parameter and Embedding Personalized Network for Infusing with Personalized Prior Information

With the increase of content pages and interactive buttons in online services such as online-shopping and video-watching websites, industrial-scale recommender systems face challenges in multi-domain and multi-task recommendations. The core of multi-task and multi-domain recommendation is to accurately capture user interests in multiple scenarios given multiple user behaviors. In this paper, we propose a plug-and-play Parameter and Embedding Personalized Network (PEPNet) for multi-domain and multi-task recommendation. PEPNet takes personalized prior information as input and dynamically scales the bottom-level Embedding and top-level DNN hidden units through gate mechanisms. Embedding Personalized Network (EPNet) performs personalized selection on Embedding to fuse features with different importance for different users in multiple domains. Parameter Personalized Network (PPNet) executes personalized modification on DNN parameters to balance targets with different sparsity for different users in multiple tasks. We have made a series of special engineering optimizations combining the Kuaishou training framework and the online deployment environment. By infusing personalized selection of Embedding and personalized modification of DNN parameters, PEPNet tailored to the interests of each individual obtains significant performance gains, with online improvements exceeding 1% in multiple task metrics across multiple domains. We have deployed PEPNet in Kuaishou apps, serving over 300 million users every day.

Taming the Domain Shift in Multi-source Learning for Energy Disaggregation

Non-intrusive load monitoring (NILM) is a cost-effective energy disaggregation means to estimate the energy consumption of individual appliances from a central load reading. Learning-based methods are the new trends in NILM implementations but require large labeled data to work properly at end-user premises. We first formulate an unsupervised multi-source domain adaptation problem to address this challenge by leveraging rich public datasets for building the NILM model. Then, we prove a new generalization bound for the target domain under multi-source settings. A hybrid loss-driven multi-source domain adversarial network (HLD-MDAN) is developed by approximating and optimizing the bound to tackle the domain shift between source and target domains. We conduct extensive experiments on three real-world residential energy datasets to evaluate the effectiveness of HLD-MDAN, showing that it is superior to other methods in single-source and multi-source learning scenarios.

Web-Scale Academic Name Disambiguation: The WhoIsWho Benchmark, Leaderboard, and Toolkit

Name disambiguation, a fundamental problem in online academic systems, is now facing greater challenges with the increasing growth of research papers. For example, on AMiner, an online academic search platform, about 10% of names are shared by more than 100 authors. Such real-world challenging cases have not been effectively addressed by existing research due to the small-scale or low-quality datasets used. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of diverse datasets. To this end, we present WhoIsWho, a large-scale benchmark with over 1,000,000 papers built using an interactive annotation process, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit encapsulating the entire pipeline as well as the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed online in the AMiner system to enable daily arXiv paper assignments.

FS-REAL: Towards Real-World Cross-Device Federated Learning

Federated Learning (FL) aims to train high-quality models in collaboration with distributed clients while not uploading their local data, which attracts increasing attention in both academia and industry. However, there is still a considerable gap between the flourishing FL research and real-world scenarios, mainly caused by the characteristics of heterogeneous devices and their scales. Most existing works conduct evaluations with homogeneous devices, which are mismatched with the diversity and variability of heterogeneous devices in real-world scenarios. Moreover, it is challenging to conduct research and development at scale with heterogeneous devices due to limited resources and complex software stacks. These two key factors are important yet underexplored in FL research as they directly impact the FL training dynamics and final performance, making the effectiveness and usability of FL algorithms unclear. To bridge the gap, in this paper, we propose an efficient and scalable prototyping system for real-world cross-device FL, FS-REAL. It supports heterogeneous device runtimes, contains a parallelism- and robustness-enhanced FL server, and provides implementations and extensibility for advanced FL utility features such as personalization, communication compression, and asynchronous aggregation. To demonstrate the usability and efficiency of FS-REAL, we conduct extensive experiments with various device distributions, quantify and analyze the effects of device heterogeneity and scale, and further provide insights and open discussions about real-world FL scenarios. Our system is released to help pave the way for further real-world FL research and broad applications involving diverse devices and scales.

A Data-driven Region Generation Framework for Spatiotemporal Transportation Service Management

MAUP (modifiable areal unit problem) is a fundamental problem for spatial data management and analysis. As an instantiation of MAUP in online transportation platforms, region generation (i.e., specifying the areal unit for service operations) is the first and vital step for supporting spatiotemporal transportation services such as ride-sharing and freight transport. Most existing region generation methods are manually specified (e.g., fixed-size grids), suffering from poor spatial semantic meaning and inflexibility to meet service operation requirements. In this paper, we propose RegionGen, a data-driven region generation framework that can specify regions with key characteristics (e.g., good spatial semantic meaning and predictability) by modeling region generation as a multi-objective optimization problem. First, to obtain good spatial semantic meaning, RegionGen segments the whole city into atomic spatial elements based on road networks and obstacles (e.g., rivers). Then, it clusters the atomic spatial elements into regions by maximizing various operation characteristics, which is formulated as a multi-objective optimization problem. For this optimization problem, we propose a multi-objective co-optimization algorithm. Extensive experiments verify that RegionGen can generate more suitable regions than traditional methods for spatiotemporal service management.

Controllable Multi-Objective Re-ranking with Policy Hypernetworks

Multi-stage ranking pipelines have become widely used strategies in modern recommender systems, where the final stage aims to return a ranked list of items that balances a number of requirements such as user preference, diversity, novelty, etc. Linear scalarization is arguably the most widely used technique to merge multiple requirements into one optimization objective, by summing up the requirements with certain preference weights. Existing final-stage ranking methods often adopt a static model where the preference weights are determined during offline training and kept unchanged during online serving. Whenever a modification of the preference weights is needed, the model has to be re-trained, which is inefficient in both time and resources. Meanwhile, the most appropriate weights may vary greatly for different groups of targeted users or at different time periods (e.g., during holiday promotions). In this paper, we propose a framework called controllable multi-objective re-ranking (CMR) which incorporates a hypernetwork to generate parameters for a re-ranking model according to different preference weights. In this way, CMR is enabled to adapt the preference weights according to environment changes in an online manner, without retraining the models. Moreover, we classify practical business-oriented tasks into four main categories and seamlessly incorporate them into a newly proposed re-ranking model based on an Actor-Evaluator framework, which serves as a reliable real-world testbed for CMR. Offline experiments based on the dataset collected from the Taobao App showed that CMR improved several popular re-ranking models by using them as underlying models. Online A/B tests also demonstrated the effectiveness and trustworthiness of CMR.

Graph-Based Model-Agnostic Data Subsampling for Recommendation Systems

Data subsampling is widely used to speed up the training of large-scale recommendation systems. Most subsampling methods are model-based and often require a pre-trained pilot model to measure data importance via e.g. sample hardness. However, when the pilot model is misspecified, model-based subsampling methods deteriorate. Since model misspecification is persistent in real recommendation systems, we instead propose model-agnostic data subsampling methods by only exploring the input data structure represented by graphs. Specifically, we study the topology of the user-item graph to estimate the importance of each user-item interaction (an edge in the user-item graph) via graph conductance, followed by a propagation step on the network to smooth out the estimated importance values. Since our proposed method is model-agnostic, we can marry the merits of both model-agnostic and model-based subsampling methods. Empirically, we show that combining the two consistently improves over either method alone. Experimental results on the KuaiRec and MIND datasets demonstrate that our proposed methods achieve superior results compared to baseline approaches.
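The score-then-smooth idea can be sketched in a few lines. This is a simplified proxy, assuming a degree-based conductance stand-in and a single averaging pass over edges sharing an endpoint, not the paper's exact estimator:

```python
from collections import defaultdict

def edge_importance(edges, alpha=0.5):
    """Score user-item edges by an inverse-degree, conductance-style
    proxy, then smooth each score toward the mean score of edges that
    share an endpoint. `edges` is a list of unique (user, item) pairs;
    `alpha` blends the raw score with the neighborhood mean."""
    u_deg, i_deg = defaultdict(int), defaultdict(int)
    for u, i in edges:
        u_deg[u] += 1
        i_deg[i] += 1
    raw = {(u, i): 1.0 / u_deg[u] + 1.0 / i_deg[i] for u, i in edges}
    # index edges by endpoint for the propagation step
    by_u, by_i = defaultdict(list), defaultdict(list)
    for e in raw:
        by_u[e[0]].append(e)
        by_i[e[1]].append(e)
    smooth = {}
    for e, s in raw.items():
        nbrs = set(by_u[e[0]]) | set(by_i[e[1]])  # includes e itself
        nbr_mean = sum(raw[n] for n in nbrs) / len(nbrs)
        smooth[e] = alpha * s + (1 - alpha) * nbr_mean
    return smooth
```

Edges touching low-degree (hard-to-reach) nodes score higher, so a subsampler keeping the top-scoring edges would preferentially retain interactions from sparse users and items.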

Binary Classifier Evaluation on Unlabeled Segments using Inverse Distance Weighting with Distance Learning

Binary classification models are ubiquitous, and reliably measuring their performance is critical for their proper usage. Ideally, the performance of supervised models is measured using high-quality labeled datasets that are sufficiently large and representative of the population. However, obtaining labels for all segments of the population can be difficult, and model performance typically varies across different segments of the population (e.g., in different countries). In this work, we present a novel methodology to estimate the performance of a binary classifier in segments of the population where labels are unavailable. The main idea is that if two segments are "similar", then the performance of the classifier in these two segments would also be "similar". Specifically, we define a way to measure similarity between segments, and propose a statistical model that describes the performance of the model in unlabeled segments as a function of the performance in labeled segments. With extensive numerical experiments on synthetic and real-world datasets, we demonstrate that the proposed method substantially improves over existing methods in both estimation accuracy and computational efficiency. We also showcase the application of our method on the Instagram Adult Classifier to improve the geographic coverage and usability of the model.
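The core interpolation step is plain inverse distance weighting. A minimal sketch, assuming a fixed Euclidean distance over hypothetical segment feature vectors (the paper learns the distance, which this omits):

```python
def idw_estimate(unlabeled_feat, labeled_segments, power=2.0, eps=1e-9):
    """Estimate a classifier metric (e.g., accuracy) for an unlabeled
    segment as the inverse-distance-weighted average of metrics
    measured on labeled segments. `labeled_segments` is a list of
    (feature_vector, metric) pairs; closer segments get more weight."""
    num = den = 0.0
    for feat, metric in labeled_segments:
        d = sum((a - b) ** 2 for a, b in zip(unlabeled_feat, feat)) ** 0.5
        w = 1.0 / (d ** power + eps)  # eps guards against zero distance
        num += w * metric
        den += w
    return num / den
```

An unlabeled segment that coincides with a labeled one recovers that segment's metric exactly; otherwise the estimate shifts toward the nearest labeled segments.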

Evolve Path Tracer: Early Detection of Malicious Addresses in Cryptocurrency

With the boom of cryptocurrency and its concomitant financial risk concerns, detecting fraudulent behaviors and associated malicious addresses has been drawing significant research effort. Most existing studies, however, rely on full history features or full-fledged address transaction networks, both of which are unavailable for early malicious address detection, rendering them ineffective for the task. To detect fraudulent behaviors of malicious addresses in the early stage, we present Evolve Path Tracer, which consists of Evolve Path Encoder LSTM, Evolve Path Graph GCN, and Hierarchical Survival Predictor. Specifically, in addition to the general address features, we propose Asset Transfer Paths and corresponding path graphs to characterize early transaction patterns. Furthermore, since transaction patterns change rapidly in the early stage, we propose Evolve Path Encoder LSTM and Evolve Path Graph GCN to encode asset transfer paths and path graphs under an evolving structure setting. Hierarchical Survival Predictor then predicts addresses' labels with high scalability and efficiency. We investigate the effectiveness and generalizability of Evolve Path Tracer on three real-world malicious address datasets. Our experimental results demonstrate that Evolve Path Tracer outperforms the state-of-the-art methods. Extensive scalability experiments demonstrate the model's adaptivity under a dynamic prediction setting.

CT4Rec: Simple yet Effective Consistency Training for Sequential Recommendation

Sequential recommendation methods are increasingly important in cutting-edge recommender systems. Through leveraging historical records, the systems can capture user interests and perform recommendations accordingly. State-of-the-art sequential recommendation models proposed very recently combine contrastive learning techniques for obtaining high-quality user representations. Though effective and performing well, the models based on contrastive learning require careful selection of data augmentation methods and pretext tasks, efficient negative sampling strategies, and massive hyper-parameter validation. In this paper, we propose an ultra-simple alternative for obtaining better user representations and improving sequential recommendation performance. Specifically, we present a simple yet effective Consistency Training method for sequential Recommendation (CT4Rec) in which only two extra training objectives are utilized without any structural modifications and data augmentation. Experiments on three benchmark datasets and one large newly crawled industrial corpus demonstrate that our proposed method outperforms SOTA models by a large margin and with much less training time than those based on contrastive learning. Online evaluation on a real-world content recommendation system also achieves a 2.717% improvement on the click-through rate and a 3.679% increase on the average click number per capita. Further exploration reveals that such a simple method has great potential for CTR prediction. Our code is available at https://github.com/ct4rec/CT4Rec.git.

Conditional Neural ODE Processes for Individual Disease Progression Forecasting: A Case Study on COVID-19

Time series forecasting, as one of the fundamental machine learning areas, has attracted tremendous attention over recent years. The solutions have evolved from statistical machine learning (ML) methods to deep learning techniques. One emerging sub-field of time series forecasting is individual disease progression forecasting, e.g., predicting individuals' disease development over a few days (e.g., deteriorating trends, recovery speed) based on a few past observations. Despite the promise of existing ML techniques, a variety of unique challenges emerge for disease progression forecasting, such as irregularly-sampled time series, data sparsity, and individual heterogeneity in disease progression. To tackle these challenges, we propose novel Conditional Neural Ordinary Differential Equations Processes (CNDPs), and validate them in a COVID-19 disease progression forecasting task using audio data. CNDPs allow for irregularly-sampled time series modelling, enable accurate forecasting with sparse past observations, and achieve individual-level progression forecasting. CNDPs show strong performance with an Unweighted Average Recall (UAR) of 78.1%, outperforming a variety of commonly used Recurrent Neural Network based models. With the proposed label-enhancing mechanism (i.e., including the initial health status as input) and the customised individual-level loss, CNDPs further boost the performance, reaching a UAR of 93.6%. Additional analysis also reveals the model's capability in tracking individual-specific recovery trends, implying the potential usage of the model for remote disease progression monitoring. In general, CNDPs pave new pathways for time series forecasting, and provide considerable advantages for disease progression monitoring.

Modelling Delayed Redemption with Importance Sampling and Pre-Redemption Engagement

Rewards-based programs are popular within e-commerce online stores, with the goal of providing serendipitous incentives to delight customers. These rewards (or incentives) could be in the form of cashback, free-shipping or discount coupons on purchases within specific categories. The success of such programs relies on their ability to identify relevant rewards for customers, from a wide variety of incentives available on the online store. Estimating the likelihood of a customer redeeming an incentive is challenging due to 1) data sparsity: relatively rare occurrence of coupon redemptions as compared to issuances, and 2) delayed feedback: customers taking time to redeem, resulting in inaccurate model refresh, compounded by data drift due to new customers and coupons.

To overcome these challenges, we present a novel framework, DRESS (Delayed Redemption Entire Space Sampling), that jointly models the effect of data sparsity and delayed feedback on redemptions. Our solution entails an architecture based on the recently proposed Entire Space Model ([12]), where we leverage pre-redemption engagement of customers (e.g. clipping of coupon) to overcome the sparsity challenge. The effect of delayed feedback is mitigated via a novel importance sampling mechanism, whose efficacy we formally analyze via a novel application of Influence Function ([10]). Experimental evaluation suggests that DRESS achieves significant lift in offline metric in comparison to state-of-the-art alternatives. Additionally, a live A/B test with DRESS resulted in a lift of 10 basis points in the redemption rate.
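The delayed-feedback correction rests on a standard importance-weighting idea: recent positives are up-weighted by the inverse probability that a redemption would already have been observed. A generic sketch, assuming an exponential delay model (the assumption is ours; DRESS's sampler and its Influence Function analysis are more involved):

```python
import math

def delay_corrected_weights(elapsed_days, mean_delay=7.0):
    """Importance weights for observed positives under delayed feedback.

    If redemption delays are exponential with the given mean, a
    conversion is observed within elapsed time t with probability
    p(t) = 1 - exp(-t / mean_delay). Weighting each observed positive
    by 1 / p(t) compensates for conversions not yet observed at
    training time."""
    return [1.0 / (1.0 - math.exp(-t / mean_delay)) for t in elapsed_days]
```

Freshly issued coupons (small t) receive large weights because most of their eventual redemptions are still unobserved; old cohorts converge to weight 1.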

Variance Reduction Using In-Experiment Data: Efficient and Targeted Online Measurement for Sparse and Delayed Outcomes

Improving statistical power is a common challenge for online experimentation platforms so that more hypotheses can be tested and lower effect sizes can be detected. To increase the power without increasing the sample size, it is necessary to consider the variance of experimental outcome metrics. Variance reduction was previously applied to online experimentation based on the idea of using pre-experiment covariate data to account for noise in the final metrics. Since this method relies on correlations between pre-experiment covariates and experiment outcomes, its effectiveness can be limited when testing features for specific product surfaces. We were also motivated by the challenge of attributing sparse, delayed binary outcomes to individual user-product interactions. We present two novel methods for variance reduction that rely exclusively on in-experiment data. The first method is a framework for a model-based leading indicator metric which continually estimates progress toward a delayed binary outcome. The second method is a counterfactual treatment exposure index that quantifies the amount that a user is impacted by the treatment. We applied these methods to past experiments and found that both can achieve variance reduction of 50% or more compared to the delayed outcome metric. The substantial reduction in variance afforded by the two methods presented in this paper has enabled Airbnb's experimentation platform to become more agile and innovative.
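The leading-indicator idea reduces variance because a model's probability estimate of a delayed binary outcome is less noisy than the realized 0/1 draw (law of total variance). A schematic sketch with a hypothetical `model` callable, not Airbnb's estimator:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def leading_indicator_metric(early_signals, model):
    """Replace a sparse, delayed binary outcome with the model's
    running estimate of its probability, computed purely from
    in-experiment signals. `model` maps a user's early signals to
    P(outcome); any calibrated estimator would do."""
    return [model(s) for s in early_signals]
```

For any calibrated model, the per-user probabilities have strictly lower variance than the Bernoulli outcomes they predict, which is the source of the power gain.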

From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams

A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve. We demonstrate that large language models pass machine learning finals at a human level on finals available online and automatically generate new human-quality final exam questions in seconds. Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses. In this work, we develop and compare methods that solve final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We curate a dataset and benchmark of questions from machine learning final exams available online and code for answering these questions and generating new questions. We show how to generate new questions from other questions and course notes. For reproducibility and future research on this final exam benchmark, we use automatic checkers for multiple-choice, numeric, and questions with expression answers. A student survey comparing the quality, appropriateness, and difficulty of machine-generated questions with human-written questions shows that across multiple aspects, machine-generated questions are indistinguishable from human-generated questions and are suitable for final exams. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics and find that few-shot learning methods perform best. We highlight the transformative potential of language models to streamline the writing and solution of large-scale assessments, significantly reducing the workload from human days to mere machine seconds. 
Our results suggest that rather than banning large language models such as ChatGPT in class, instructors should teach students to harness them by asking students meta-questions about correctness, completeness, and originality of the responses generated, encouraging critical thinking in academic studies.

Time-to-Event Modeling with Hypernetwork based Hawkes Process

Many real-world applications are associated with collections of events with timestamps, known as time-to-event data. Earthquake occurrences, social networks, and user activity logs can be represented as a sequence of discrete events observed in continuous time. Temporal point processes serve as an essential tool for modeling such time-to-event data in continuous time space. Despite massive amounts of event sequence data from various domains like social media, healthcare, etc., real-world applications of temporal point processes face two major challenges: 1) they do not generalize to predict events from unseen event sequences in dynamic environments, and 2) they cannot thrive in continually evolving environments with minimal supervision while retaining previously learnt knowledge. To tackle these issues, we propose HyperHawkes, a hypernetwork based temporal point process framework which is capable of modeling time of event occurrence for unseen sequences and, consequently, zero-shot learning for time-to-event modeling. We also develop a hypernetwork based continually learning temporal point process for continuous modeling of time-to-event sequences with minimal forgetting. HyperHawkes augments the temporal point process with zero-shot modeling and continual learning capabilities. We demonstrate the application of the proposed framework through our experiments on real-world datasets. Our results show the efficacy of the proposed approach in terms of predicting future events under a zero-shot regime for unseen event sequences. We also show that the proposed model is able to learn the time-to-event sequences continually while retaining information from previous event sequences, mitigating catastrophic forgetting in neural temporal point processes.
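At the core of any Hawkes-based model is the self-exciting conditional intensity. A minimal sketch with an exponential kernel and fixed constants; in HyperHawkes the kernel parameters would instead be produced by a hypernetwork per sequence:

```python
import math

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Conditional intensity of a univariate Hawkes process with an
    exponential kernel:
        lambda(t) = mu + sum_{t_i < t} alpha * beta * exp(-beta * (t - t_i))
    mu is the base rate; each past event t_i excites the intensity by
    alpha * beta, decaying at rate beta."""
    return mu + sum(alpha * beta * math.exp(-beta * (t - ti))
                    for ti in history if ti < t)
```

The excitation decays with time since the last event, so the intensity spikes right after each event and relaxes back toward the base rate mu.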

Discovering Novel Biological Traits From Images Using Phylogeny-Guided Neural Networks

Discovering evolutionary traits that are heritable across species on the tree of life (also referred to as a phylogenetic tree) is of great interest to biologists to understand how organisms diversify and evolve. However, the measurement of traits is often a subjective and labor-intensive process, making trait discovery a highly label-scarce problem. We present a novel approach for discovering evolutionary traits directly from images without relying on trait labels. Our proposed approach, Phylo-NN, encodes the image of an organism into a sequence of quantized feature vectors, or codes, where different segments of the sequence capture evolutionary signals at varying ancestry levels in the phylogeny. We demonstrate the effectiveness of our approach in producing biologically meaningful results in a number of downstream tasks, including species image generation and species-to-species image translation, using fish species as a target example.

Demystifying Fraudulent Transactions and Illicit Nodes in the Bitcoin Network for Financial Forensics

Blockchain provides a unique and accountable channel for financial forensics by mining its open and immutable transaction data. A recent surge of interest has been witnessed in training machine learning models on cryptocurrency transaction data for anomaly detection, such as money laundering and other fraudulent activities. This paper presents a holistic applied data science approach to fraud detection in the Bitcoin network with two original contributions. First, we contribute the Elliptic++ dataset, which extends the Elliptic transaction dataset to include over 822k Bitcoin wallet addresses (nodes), each with 56 features, and 1.27M temporal interactions. This enables both the detection of fraudulent transactions and the detection of illicit addresses (actors) in the Bitcoin network by leveraging four types of graph data: (i) the transaction-to-transaction graph, representing the money flow in the Bitcoin network, (ii) the address-to-address interaction graph, capturing the types of transaction flows between Bitcoin addresses, (iii) the address-transaction graph, representing the bi-directional money flow between addresses and transactions (BTC flow from input address to one or more transactions and BTC flow from a transaction to one or more output addresses), and (iv) the user entity graph, capturing clusters of Bitcoin addresses representing unique Bitcoin users. Second, we perform fraud detection tasks on all four graphs by using diverse machine learning algorithms. We show that adding enhanced features from the address-to-address and the address-transaction graphs not only assists in effectively detecting both illicit transactions and illicit addresses, but also assists in gaining in-depth understanding of the root cause of money laundering vulnerabilities in cryptocurrency transactions and the strategies for fraud detection and prevention. The Elliptic++ dataset is released at https://www.github.com/git-disl/EllipticPlusPlus.

RecruitPro: A Pretrained Language Model with Skill-Aware Prompt Learning for Intelligent Recruitment

Recent years have witnessed the rapid development of machine-learning-based intelligent recruitment services. Along this line, a large number of emerging models have been proposed, achieving remarkable performance in various tasks, such as person-job fit, job classification and salary prediction. However, existing studies are usually domain/task-specific, which significantly hinders the adaptation of models to different industries/tasks with limited training data. To this end, in this paper, we propose a novel skill-aware prompt-based pretraining framework, namely RecruitPro, which is capable of learning unified representations on recruitment data and adapting to various downstream tasks of intelligent recruitment services. To be specific, we first present a contextualized embedding model that is pretrained on a large-scale recruitment dataset. Then, we construct 13 downstream benchmark tasks that are representative of the recruitment process. Along this line, we propose a skill-aware prompt learning module to enhance the adaptability of the pretrained model on downstream tasks. This module includes a skill-related prompt, which is designed to explore key semantic information (i.e., skills) from recruitment text, and a task-related prompt, which is designed to bridge the gap between the pretrained model and different downstream tasks. Moreover, we propose a strategy for extracting potential skills to further improve the performance of our skill-aware prompt learning module. Finally, extensive experiments have clearly demonstrated the effectiveness of RecruitPro. In addition, a case study is presented to discuss the privacy-preservation issues of RecruitPro.

Learning Multi-Agent Intention-Aware Communication for Optimal Multi-Order Execution in Finance

Order execution is a fundamental task in quantitative finance, aiming at finishing acquisition or liquidation for a number of trading orders of specific assets. Recent advances in model-free reinforcement learning (RL) provide a data-driven solution to the order execution problem. However, existing works optimize execution for an individual order, overlooking the practice that multiple orders are specified to execute simultaneously, resulting in suboptimality and bias. In this paper, we first present a multi-agent RL (MARL) method for multi-order execution considering practical constraints. Specifically, we treat every agent as an individual operator trading one specific order, while communicating with the other agents and collaborating to maximize the overall profits. Nevertheless, existing MARL algorithms often incorporate communication among agents by exchanging only the information of their partial observations, which is inefficient in complicated financial markets. To improve collaboration, we then propose a learnable multi-round communication protocol that allows the agents to communicate their intended actions with each other and refine them accordingly. It is optimized through a novel action value attribution method which is provably consistent with the original learning objective yet more efficient. Experiments on data from two real-world markets illustrate the superior performance and significantly better collaboration effectiveness achieved by our method.

A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification

Traffic classification, i.e., the identification of the type of applications flowing in a network, is a strategic task for numerous activities (e.g., intrusion detection, routing). This task faces some critical challenges that current deep learning approaches do not address. The design of current approaches does not take into consideration the fact that networking hardware (e.g., routers) often runs with limited computational resources. Further, they do not meet the need for faithful explainability highlighted by regulatory bodies. Finally, these traffic classifiers are evaluated on small datasets which fail to reflect the diversity of applications in real-world settings.

Therefore, this paper introduces a new Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification, which relies on a new residual block (for lightweight and efficiency purposes) and a prototype layer (for explainability). Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds in maintaining the same accuracy as the best-performing state-of-the-art neural network, while providing the additional features previously mentioned. Moreover, we illustrate the explainability feature of our approach, which stems from the communication of detected application prototypes to the end-user, and we highlight the faithfulness of LEXNet explanations through a comparison with post hoc methods.

ILRoute: A Graph-based Imitation Learning Method to Unveil Riders' Routing Strategies in Food Delivery Service

Pick-up and delivery (PD) services such as online food ordering are playing an increasingly important role in serving people's daily demands. Accurate PD route prediction (PDRP) is important for service providers to efficiently schedule riders to improve service quality. It is crucial to model the decision-making process behind the route choice of riders for PDRP. Recent years have witnessed the success of utilizing imitation learning (IL) to model user decision-making processes. Therefore, we propose to deploy an IL framework to solve the PDRP problem. However, there still exist three main challenges: (1) the rider's route decision is affected by multi-source and heterogeneous features, and the complex relationships among these features make it hard to explore how they influence the rider's route decision-making; (2) the large route decision-making space makes it easy to explore and predict unreasonable routes; (3) the rider's personalized preferences are important in modeling the route decision-making process but cannot be fully explored. To tackle the above challenges, we propose ILRoute, a graph-based imitation learning method for PDRP. ILRoute utilizes a multi-graph neural network (multi-GNN) to extract the multi-source and heterogeneous features and model their complex relationships. To address the large route decision-making space, ILRoute introduces a mobility regularity-aware constraint as prior route choice knowledge to reduce the explored route decision-making space. To model the personalized preferences of the rider, ILRoute utilizes a personalized constraint mechanism to enhance the personalization of the rider's route decision-making process. Offline experiments conducted on three real-world datasets and online comparisons demonstrate the superiority of our proposed model.

FedMultimodal: A Benchmark for Multimodal Federated Learning

Over the past few years, Federated Learning (FL) has become an emerging machine learning technique to tackle data privacy challenges through collaborative training. In a typical FL algorithm, clients submit locally trained models, and the server aggregates their parameters until convergence. Despite significant efforts to apply FL in fields like computer vision, audio, and natural language processing, FL applications utilizing multimodal data streams remain largely unexplored. It is known that multimodal learning has broad real-world applications in emotion recognition, healthcare, multimedia, and social media, while user privacy persists as a critical concern. Specifically, there are no existing FL benchmarks targeting multimodal applications or related tasks. In order to facilitate research in multimodal FL, we introduce FedMultimodal, the first FL benchmark for multimodal learning, covering five representative multimodal applications from ten commonly used datasets with a total of eight unique modalities. FedMultimodal offers a systematic FL pipeline, enabling an end-to-end modeling framework ranging from data partition and feature extraction to FL benchmark algorithms and model evaluation. Unlike existing FL benchmarks, FedMultimodal provides a standardized approach to assess the robustness of FL against three common data corruptions in real-life multimodal applications: missing modalities, missing labels, and erroneous labels. We hope that FedMultimodal can accelerate numerous future research directions, including designing multimodal FL algorithms toward extreme data heterogeneity, robust multimodal FL, and efficient multimodal FL. The datasets and benchmark results can be accessed at: https://github.com/usc-sail/fed-multimodal.
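The client-submit/server-aggregate loop described above can be illustrated with a minimal FedAvg-style weighted average. This is a generic sketch of federated aggregation, not FedMultimodal's actual pipeline or API.

```python
import numpy as np

# Minimal FedAvg-style aggregation sketch: each client submits locally
# trained parameters, and the server averages them weighted by local data
# size. Names and shapes here are illustrative only.
def fed_avg(client_params, client_sizes):
    """Return the size-weighted average of the clients' parameter vectors."""
    total = sum(client_sizes)
    return sum((n / total) * p for p, n in zip(client_params, client_sizes))

# Three clients with unequal amounts of local (possibly multimodal) data.
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [10, 30, 60]
global_params = fed_avg(params, sizes)
# 0.1*[1,2] + 0.3*[3,4] + 0.6*[5,6] = [4.0, 5.0]
assert np.allclose(global_params, [4.0, 5.0])
```

Weighting by local dataset size keeps the aggregate consistent with training on the pooled data; real FL systems repeat this round many times until convergence.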

Influence Maximization with Fairness at Scale

In this paper, we revisit the problem of influence maximization with fairness, which aims to select k influential nodes to maximise the spread of information in a network, while ensuring that selected sensitive user attributes (e.g., gender, location, origin, race, etc.) are fairly affected, i.e., are proportionally similar between the original network and the affected users. Recent studies on this problem have focused only on extremely small networks, so the challenge remains of how to achieve a scalable solution, applicable to networks with millions or billions of nodes. We propose an approach that is based on learning node representations (embeddings) for fair spread from diffusion cascades, instead of the social connectivity, and in this way we can deal with very large graphs. We propose two data-driven approaches: (a) fairness-based participant sampling (FPS), and (b) fairness as context (FAC). Spread-related user features, such as the probability of diffusing information to others, are derived from the historical information cascades, using a deep neural network. The extracted features are then used in selecting influencers that maximize the influence spread, while also being fair with respect to the chosen sensitive attributes. In FPS, fairness and cascade length information are considered independently in the decision-making process, while FAC considers these information facets jointly and takes into account correlations between them. The proposed algorithms are generic and represent the first policy-driven solutions that can be applied to arbitrary sets of sensitive attributes at scale. We evaluate the performance of our solutions on a real-world public dataset (Sina Weibo) and on a hybrid real-synthetic dataset (Digg), which exhibit all the facets that we exploit, namely diffusion network, diffusion traces, and user profiles. These experiments show that our methods outperform the state-of-the-art solutions in terms of spread, fairness, and scalability.

Binary Embedding-based Retrieval at Tencent

Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, an EBR system aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. Storage and computation become expensive and inefficient with massive document collections and highly concurrent queries, making it difficult to scale up further.

To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, formulated as float vectors in general, into a composition of multiple binary vectors using a lightweight transformation model with residual multi-layer perceptron (MLP) blocks. The bits of the transformed binary vectors are jointly determined by the output dimension of the MLP blocks (termed m) and the number of residual blocks (termed u), i.e., m × (u + 1). We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy; e.g., only 2 V100 GPU-hours are required to train on millions of vectors. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. The technique exploits Single Instruction Multiple Data (SIMD) units widely available in current CPUs.
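The residual binarization idea, where a float vector is approximated by a sum of scaled binary vectors so that m output dimensions and u residual stages cost m × (u + 1) bits, can be sketched as follows. The per-stage scalar scale used below is an illustrative choice, not the paper's trained MLP transformation model.

```python
import numpy as np

# Sketch of recurrent (residual) binarization: approximate a float vector
# by a sum of scaled binary vectors. With output dimension m and u residual
# stages, the code costs m * (u + 1) bits.
def recurrent_binarize(x, u):
    residual = x.astype(float)
    scales, codes = [], []
    for _ in range(u + 1):
        code = np.where(residual >= 0, 1.0, -1.0)   # binary vector in {-1, +1}
        scale = np.abs(residual).mean()             # illustrative scalar scale
        scales.append(scale)
        codes.append(code)
        residual = residual - scale * code          # binarize what is left over
    return scales, codes

def reconstruct(scales, codes):
    return sum(s * c for s, c in zip(scales, codes))

x = np.array([0.9, -0.4, 0.1, -0.7])
errs = [
    np.linalg.norm(x - reconstruct(*recurrent_binarize(x, u)))
    for u in (0, 1, 3)
]
# More residual stages -> more bits -> lower reconstruction error.
assert errs[0] > errs[1] > errs[2]
```

The trade-off is visible directly: each extra residual stage adds m bits per vector but shrinks the reconstruction error, which is the accuracy-versus-cost dial the abstract describes.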

We successfully employed the introduced BEBR in web search and copyright detection for Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities, for instance, natural language processing (NLP) and computer vision (CV). Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, significantly saving 30%~50% of index costs with almost no loss of accuracy at the system level. The code is publicly available at https://github.com/ganyk/BEBR.

Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library

Yggdrasil Decision Forests is a library for the training, serving and interpretation of decision forest models, targeted both at research and production work, implemented in C++, and available in C++, command line interface, Python (under the name TensorFlow Decision Forests), JavaScript, Go, and Google Sheets (under the name Simple ML for Sheets). The library has been developed organically since 2018 following a set of four design principles applicable to machine learning libraries and frameworks: simplicity of use, safety of use, modularity and high-level abstraction, and integration with other machine learning libraries. In this paper, we describe those principles in detail and present how they have been used to guide the design of the library. We then showcase the use of our library on a set of classical machine learning problems. Finally, we report a benchmark comparing our library to related solutions.

Towards Equitable Assignment: Data-Driven Delivery Zone Partition at Last-mile Logistics

The popularity of online e-commerce has promoted the rapid development of last-mile logistics in recent years. In last-mile services, to ensure delivery efficiency and enhance user experience, delivery zones are used to perform delivery task assignment, which is a fundamental part of last-mile delivery. Each courier is responsible for one delivery zone. Couriers collect orders belonging to their delivery zones from the delivery station and deliver the orders to customers. Existing delivery zone partition practices in last-mile logistics consist of manual experience-based and static optimization-based methods, which perform order-amount balancing among different zones but suffer from dissatisfaction and inefficiency because of two limitations: (i) order amount is not always a good balancing metric, considering deliveries' various difficulties (e.g., residences or industrial parks, with or without elevators); (ii) couriers' familiarity and preference behaviors are insufficiently considered. To generate delivery zone partitions with equitable workload assignment, in this paper, we propose E-partition, a data-driven delivery zone partition framework for achieving equitable workload assignment in last-mile logistics. We first design a learning-based workload prediction model to estimate service time given a partition plan that consists of unseen courier-zone matching scenarios. Then, a delivery zone partition algorithm is proposed to iteratively optimize couriers' core-AOI (i.e., area of interest) generation and the AOI assignment process. Extensive offline experimental results show that our model outperforms baselines in working-time prediction and workload balancing performance. Real-world deployment results at JD Logistics also verify the effectiveness of equitable-assignment-aware delivery zone partition, with a 2.2% increase in service on-time rate compared to state-of-practice partition solutions.

An Interpretable, Flexible, and Interactive Probabilistic Framework for Melody Generation

The fast-growing demand for algorithmic music generation is found throughout entertainment, art, education, etc. Unfortunately, most recent models are practically impossible to interpret or musically fine-tune, as they use deep neural networks with thousands of parameters. We introduce an interpretable, flexible, and interactive model, SchenkComposer, for melody generation that empowers users to be creative in all aspects of the music generation pipeline and allows them to learn from the process. We divide the task of melody generation into steps based on the process that a human composer using music-theoretical domain knowledge might use. First, the model determines phrase structure based on form analysis and identifies an appropriate number of measures. Using concepts from Schenkerian analysis, the model then finds a fitting harmonic rhythm, middleground harmonic progression, foreground rhythm, and melody in a hierarchical, scaffolded approach using a probabilistic context-free grammar based on musical contours. By incorporating theories of musical form and harmonic structure, our model produces music with long-term structural coherence. In extensive human experiments, we find that music generated with our approach successfully passes a Turing test while current state-of-the-art approaches fail, and we further demonstrate superior performance and preference for our melodies compared to existing melody generation methods. Additionally, we developed and deployed a public website for SchenkComposer and conducted preliminary user surveys. Through analysis, we show the strong viability and enjoyability of SchenkComposer.

iETA: A Robust and Scalable Incremental Learning Framework for Time-of-Arrival Estimation

Time-of-arrival estimation or Estimated Time of Arrival (ETA) has become an indispensable building block of modern intelligent transportation systems. While many efforts have been made toward time-of-arrival estimation, most of them have scalability and robustness issues when dealing with real-world large-scale ETA scenarios, where billions of vehicle trajectories and ETA requests are continuously generated every day. To this end, in this paper, we propose a robust and scalable incremental ETA learning framework, iETA, to continuously exploit spatio-temporal traffic patterns from massive floating-car data and thus achieve better estimation performance. Specifically, we first build an incremental travel time predictor that can be incrementally updated based on newly generated traffic data. The incremental travel time predictor not only reduces the overall learning overhead but also improves the model's robustness toward urban traffic distribution shifts. Then, we propose a historical traffic knowledge consolidation module to preserve critical spatio-temporal knowledge from previous ETA predictors under the incremental learning setting. Moreover, to reduce interference induced by low-quality traffic data, we propose an adversarial training module to improve learning robustness by proactively mitigating and resisting traffic noise perturbations. Finally, extensive experiments demonstrate the effectiveness and efficiency of the proposed system against state-of-the-art baselines in large-scale ETA scenarios. Most importantly, iETA has been deployed on the Didi Chuxing platform, handling billions of real-time ETA queries every day, and substantially improves prediction accuracy.

Efficient Continuous Space Policy Optimization for High-frequency Trading

High-frequency trading is an extraordinarily intricate financial task, which is normally treated as a near real-time sequential decision problem. Compared with the traditional two-phase approach of forecasting equities' trends and then weighting them by combinatorial optimization, deep reinforcement learning (DRL) methods have shown advances in reward chasing with optimal policies. However, existing DRL-based methods either apply portfolio optimization to low-frequency scenarios or only support a very limited number of assets with a discrete action space, facing significant computing-efficiency challenges. Therefore, we propose an efficient DRL-based policy optimization (DRPO) method for high-frequency trading. In particular, we model the portfolio management task as a Markov Decision Process, directly inferring the equity weights in the action space guided by maximum accumulated returns. To reduce agents' interaction complexity without sacrificing interpretability, we decompose the environment into "static'' market states and "dynamic'' portfolio weight states. Then, we design an efficient reward-expectation calculation algorithm via probabilistic dynamic programming, which enables our agents to collect feedback directly, avoiding the morass of trajectory sampling. To the best of our knowledge, this is the first work that solves the high-frequency portfolio optimization problem by devising an efficient continuous-space policy optimization algorithm in the DRL framework. Through extensive experiments on real-world data from the Dow Jones, Coinbase and SSE exchanges, we show that our proposed DRPO significantly outperforms state-of-the-art benchmark methods. The results demonstrate the practical applicability and effectiveness of the proposed method.

Select and Trade: Towards Unified Pair Trading with Hierarchical Reinforcement Learning

Pair trading is one of the most effective statistical arbitrage strategies, which seeks a neutral profit by hedging a pair of selected assets. Existing methods generally decompose the task into two separate steps: pair selection and trading. However, the decoupling of two closely related sub-tasks can block information propagation and lead to limited overall performance. For pair selection, ignoring the trading performance results in the wrong assets being selected with irrelevant price movements, while the agent trained for trading can overfit to the selected assets without any historical information of other assets. To address this, in this paper, we propose a paradigm that treats automatic pair trading as a unified task rather than a two-step pipeline. We design a hierarchical reinforcement learning framework to jointly learn and optimize the two sub-tasks. A high-level policy selects two assets from all possible combinations, and a low-level policy then performs a series of trading actions. Experimental results on real-world stock data demonstrate the effectiveness of our method on pair trading compared with both existing pair selection and trading methods.

Identifying Complicated Contagion Scenarios from Cascade Data

We consider the setting of cascades that result from contagion dynamics on large realistic contact networks. We address the question of whether the structural properties of a (partially) observed cascade can characterize the contagion scenario and identify the interventions that might be in effect. Using epidemic spread as a concrete example, we study how social interventions such as compliance in social distancing, extent (and efficacy) of vaccination, and the transmissibility of disease can be inferred. The techniques developed are more generally applicable to other contagions as well.

Our approach involves the use of large realistic social contact networks of certain regions of the USA and an agent-based model (ABM) to simulate spread under two interventions, namely vaccination and generic social distancing (GSD). Through a machine learning approach, coupled with parameter significance analysis, our experimental results show that subgraph counts of the graph induced by the cascade can be used effectively to characterize the contagion scenario even during the initial stages of the epidemic, when traditional information, such as case counts alone, is not adequate for this task. Further, we show that our approach performs well even for partially observed cascades. These results demonstrate that cascade data collected from digital tracing applications under poor digital penetration and privacy constraints can provide valuable information about the contagion scenario.
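The use of subgraph counts as features of a cascade-induced graph can be illustrated with a toy example. The feature set and counting methods used in the study are richer than this; the function below is a hypothetical sketch.

```python
from itertools import combinations

# Toy subgraph-count features: count edges, wedges (2-paths, including
# those closed into triangles), and triangles in the graph induced by an
# observed cascade. Illustrative only.
def subgraph_counts(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n_edges = len({frozenset(e) for e in edges})
    wedges = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    return {"edges": n_edges, "wedges": wedges, "triangles": triangles}

# Cascade-induced subgraph: one triangle plus a pendant node.
feats = subgraph_counts([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
assert feats == {"edges": 4, "wedges": 5, "triangles": 1}
```

A feature vector of such counts, computed on the (possibly partial) observed cascade, is what a downstream classifier would use to distinguish contagion scenarios.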

BOSS: A Bilateral Occupational-Suitability-Aware Recommender System for Online Recruitment

With the rapid development of online recruitment platforms, a variety of emerging recommendation services have been witnessed for benefiting both job seekers and recruiters. While many researchers have studied the problem of reciprocal recommendation in two-sided markets (e.g., the marriage market and the real estate market), there is still a lack of in-depth understanding of the bilateral occupational preferences of different participants in the online recruitment market. To this end, in this paper, we propose a Bilateral Occupational-Suitability-aware recommender System (BOSS) for online recruitment, in consideration of the reciprocal, bilateral, and sequential properties of realistic recruitment scenarios simultaneously. To be specific, in BOSS, we first propose a multi-group-based mixture-of-experts (MoE) module to independently learn the preference representations of job seekers and recruiters. Then, with a specially-designed multi-task learning module, BOSS can progressively model the action sequence of the recruitment process in a bilateral probabilistic manner. As a result, reciprocal recommendations can be efficiently implemented by leveraging the product of the different action probabilities of job seekers and recruiters. Finally, we have conducted extensive experiments on 5 real-world large-scale datasets as well as the online environment. Both online A/B tests and offline experimental results clearly validate that our recommender system BOSS can outperform other state-of-the-art baselines by a significant margin.

Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering

To contribute to automating the medical vision-language model, we propose a novel Chest X-ray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with radiologists' diagnostic practice of comparing the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure priors and semantic and spatial knowledge to construct a multi-relationship graph representing the differences between the two images for the image-difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work will further push forward the medical vision-language model.

Graph Learning in Physical-informed Mesh-reduced Space for Real-world Dynamic Systems

Recent machine learning approaches have demonstrated their ability to extract information from data that can be translated into knowledge about the underlying dynamic systems. However, these learning-based models suffer from scalability issues when trained on the high-dimensional, high-resolution simulation data generated for real-world applications. In this work, we aim to tackle this challenge by deliberately prioritizing certain aspects of dynamic systems, while allocating relatively less attention and computational resources to others. Specifically, we concentrate on improving the predictive accuracy of crucial properties or regions that significantly impact these dynamic systems, while comparatively reducing emphasis on the remaining aspects. By employing graph learning schemes and custom-designed modules, we have developed a two-stage prediction model that incorporates prior knowledge of the systems. This approach enables us to place a heightened emphasis on the region of interest (ROI) during the learning process, where the model operates in a reduced-dimensional mesh space, resulting in reduced computational costs while preserving crucial physical properties. To test and evaluate our method, we utilized two simulation datasets: lid-driven cavity and cylinder flow. The results show that, even in the reduced operational space, our method still achieves desirable accuracy and generalizability in both prediction and physical consistency over the region of interest.

SAMD: An Industrial Framework for Heterogeneous Multi-Scenario Recommendation

Industrial recommender systems usually need to serve multiple scenarios at the same time. In practice, there are various heterogeneous scenarios, since users frequently engage in scenarios with varying intentions and the items within each scenario typically belong to diverse categories. Existing works on multi-scenario recommendation mainly focus on modeling homogeneous scenarios which have similar data distributions. They transfer knowledge equally to each scenario without considering the diversity of heterogeneous scenarios. In this paper, we argue that the heterogeneity in multi-scenario recommendation is a key problem that needs to be solved. To this end, we propose an industrial framework named Scenario-Aware Model-Agnostic Meta Distillation (SAMD) for multi-scenario recommendation. SAMD aims to provide scenario-aware and model-agnostic knowledge sharing across heterogeneous scenarios by modeling the scenarios' relationships and conducting heterogeneous knowledge distillation. Specifically, SAMD first measures a comprehensive representation of each scenario and then proposes a novel meta distillation paradigm to conduct scenario-aware knowledge sharing. The meta network first establishes the potential scenarios' relationships and generates the knowledge-sharing strategies for each scenario. Then the heterogeneous knowledge distillation uses these scenario-aware strategies to share knowledge across heterogeneous scenarios through intermediate-feature distillation, without restrictions on the model architecture. In this way, SAMD shares knowledge across heterogeneous scenarios in a scenario-aware and model-agnostic manner, which addresses the problem of heterogeneity. Extensive offline experiments and online A/B testing demonstrate the superior performance of the proposed SAMD framework compared with other state-of-the-art methods, especially in heterogeneous scenarios.

Learning Discrete Document Representations in Web Search

Product quantization (PQ) has usually been applied to dense retrieval (DR) of documents thanks to its competitive time and memory efficiency and its compatibility with other approximate nearest neighbor (ANN) search methods. Originally, PQ was learned to minimize the reconstruction loss, i.e., the distortion between the original dense embeddings and the reconstructed embeddings after quantization. Unfortunately, such an objective is inconsistent with the goal of selecting ground-truth documents for the input query, which may cause a severe loss of retrieval quality. Recent research has primarily concentrated on jointly training the bi-encoders and PQ to ensure consistency for improved performance. However, it is still difficult to design an approach that can cope with challenges like discrete representation collapse, mining informative negatives, and deploying effective embedding-based retrieval (EBR) systems in a real search engine.

In this paper, we propose a Two-stage Multi-task Joint training technique (TMJ) to learn discrete document representations, which is simple and effective for real-world practical applications. In the first stage, the PQ centroid embeddings are regularized by the dense retrieval loss, which ensures the distinguishability of the quantized vectors and preserves the retrieval quality of dense embeddings. In the second stage, a PQ-oriented sample mining strategy is introduced to explore more informative negatives and further improve performance. Offline evaluations are performed on a public benchmark (MS MARCO) and two private real-world web search datasets, where our method notably outperforms SOTA PQ methods in both Recall and Mean Reciprocal Rank (MRR). Besides, online experiments are conducted to validate that our technique provides significantly higher-quality vector quantization. Moreover, our joint training framework has been successfully applied to a billion-scale web search system.
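The reconstruction (distortion) objective that classical PQ minimizes, described in the abstract above, can be illustrated with a minimal pure-Python sketch; the two-subspace split, the toy codebooks, and the input vector are illustrative assumptions, not values from the paper:

```python
def nearest(sub, centroids):
    """Return the centroid closest to a subvector (squared Euclidean distance)."""
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, c)))

def pq_reconstruct(x, codebooks):
    """Quantize each subvector to its nearest centroid and concatenate."""
    m = len(codebooks)
    d = len(x) // m
    recon = []
    for i, cb in enumerate(codebooks):
        recon.extend(nearest(x[i * d:(i + 1) * d], cb))
    return recon

def distortion(x, x_hat):
    """The reconstruction loss classical PQ minimizes."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# Toy setup: a 4-d vector split into 2 subspaces, 2 centroids per subspace.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],  # subspace 1
    [(0.0, 1.0), (1.0, 0.0)],  # subspace 2
]
x = [0.9, 1.1, 0.1, 0.8]
x_hat = pq_reconstruct(x, codebooks)  # -> [1.0, 1.0, 0.0, 1.0]
loss = distortion(x, x_hat)           # -> 0.07
```

Joint training, as TMJ proposes, replaces this purely geometric objective with a retrieval-aware loss over the same centroid embeddings.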

Large-scale Urban Cellular Traffic Generation via Knowledge-Enhanced GANs with Multi-Periodic Patterns

With the rapid development of cellular networks, network planning is increasingly important. Generating large-scale urban cellular traffic contributes to network planning by simulating the behaviors of the planned network. Existing methods fail to simulate the long-term temporal behaviors of cellular traffic and cannot model the influences of the urban environment on cellular networks. We propose a knowledge-enhanced GAN with multi-periodic patterns to generate large-scale cellular traffic based on the urban environment. First, we design a GAN model to simulate the multi-periodic patterns and long-term aperiodic temporal dynamics of cellular traffic by learning, step by step, the daily patterns, the weekly patterns, and the residual traffic between long-term traffic and periodic patterns. Then, we leverage urban knowledge to enhance traffic generation by constructing a knowledge graph containing multiple factors in the surrounding urban environment that affect cellular traffic. Finally, we evaluate our model on a real cellular traffic dataset. Our proposed model outperforms three state-of-the-art generation models by over 32.77%, and the urban knowledge enhancement improves the performance of our model by 4.71%. Moreover, our model achieves good generalization and robustness in generating traffic for urban cellular networks without training data from the surrounding areas.
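The daily/weekly/residual split described above can be illustrated with a simple phase-averaging sketch; the period lengths and toy signal are illustrative assumptions, and the paper learns these components step by step with a GAN rather than by averaging:

```python
def periodic_mean(series, period):
    """Average of the series at each phase of a repeating period."""
    totals, counts = [0.0] * period, [0] * period
    for t, v in enumerate(series):
        totals[t % period] += v
        counts[t % period] += 1
    return [s / c for s, c in zip(totals, counts)]

def decompose(series, day, week):
    """Peel off the daily pattern, then the weekly pattern; keep the residual."""
    daily = periodic_mean(series, day)
    after_daily = [v - daily[t % day] for t, v in enumerate(series)]
    weekly = periodic_mean(after_daily, week)
    residual = [v - weekly[t % week] for t, v in enumerate(after_daily)]
    return daily, weekly, residual

# Toy signal: a length-2 "day" pattern plus a length-4 "week" pattern, two weeks long.
series = [1.0, 2.0, 2.0, 3.0, 1.0, 2.0, 2.0, 3.0]
daily, weekly, residual = decompose(series, day=2, week=4)
# A purely periodic toy signal leaves a (near-)zero residual.
```

In the paper's setting the residual carries the long-term aperiodic dynamics that the GAN must model on top of the periodic patterns.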

SentiGOLD: A Large Bangla Gold Standard Multi-Domain Sentiment Analysis Dataset and Its Evaluation

In this study, we present a Bangla multi-domain sentiment analysis dataset, named SentiGOLD, developed using 70,000 samples compiled from a variety of sources and annotated by a gender-balanced team of linguists. This dataset was created in accordance with a standard set of linguistic conventions established after multiple meetings between the Government of Bangladesh and a nationally recognized Bangla linguistics committee. Although standard sentiment analysis datasets are available for English and other resource-rich languages, no such datasets exist for Bangla, especially because there was no standard linguistics framework agreed upon by national stakeholders. SentiGOLD derives its raw data from online video comments, social media posts and comments, blog posts and comments, news, and numerous other sources. Throughout the development of this dataset, domain distribution and class distribution were rigorously maintained. SentiGOLD was created using data from a total of 30 domains (e.g., politics, entertainment, sports) and was labeled using 5 classes (strongly negative, weakly negative, neutral, weakly positive, and strongly positive). In order to maintain annotation quality, the national linguistics committee approved an annotation scheme to ensure rigorous Inter-Annotator Agreement (IAA) in a multi-annotator scenario. This procedure yielded an IAA score of 0.88 using Fleiss' kappa, which is elaborated upon in the paper. A protocol for intra- and cross-dataset evaluation was utilized in our efforts to develop a standard classification system. The cross-dataset evaluation was performed on the SentNoB dataset, which contains noisy Bangla text samples, thereby establishing a demanding test scenario. We also performed cross-dataset testing through zero-shot experiments, and our best model produced competitive performance, which exemplifies our dataset's generalizability.
Our top model attained a macro F1 of 0.62 (intra-dataset) for 5 classes, establishing the benchmark for SentiGOLD, and 0.61 (cross-dataset on SentNoB) for 3 classes, which stands comparable to the current state of the art. Our fine-tuned sentiment analysis model can be accessed online at https://sentiment.bangla.gov.bd.
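The Fleiss' kappa statistic used above for inter-annotator agreement is standard; a minimal sketch of its computation follows, with a toy annotation table (4 items, 3 annotators, 2 categories) that is purely illustrative and unrelated to the SentiGOLD data:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for table[i][j] = # annotators placing item i in category j."""
    N = len(table)       # number of items
    n = sum(table[0])    # annotators per item (assumed constant across items)
    k = len(table[0])    # number of categories
    # Observed agreement: fraction of agreeing annotator pairs per item, averaged.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in table) / N
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Toy table: two unanimous items and two 2-vs-1 items give kappa = 1/3.
kappa = fleiss_kappa([[3, 0], [0, 3], [2, 1], [1, 2]])
```

Values near 1 indicate near-perfect agreement; the paper's reported 0.88 sits in the "almost perfect" band of common interpretation scales.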

Off-Policy Learning-to-Bid with AuctionGym

Online advertising opportunities are sold through auctions, billions of times every day across the web. Advertisers who participate in those auctions need to decide on a bidding strategy: how much they are willing to bid for a given impression opportunity. Deciding on such a strategy is not a straightforward task, because of the interactive and reactive nature of the repeated auction mechanism. Indeed, an advertiser does not observe counterfactual outcomes of bid amounts that were not submitted, and successful advertisers will adapt their own strategies based on bids placed by competitors. These characteristics complicate effective learning and evaluation of bidding strategies based on logged data alone.

The interactive and reactive nature of the bidding problem lends itself to a bandit or reinforcement learning formulation, where a bidding strategy can be optimised to maximise cumulative rewards. Several design choices then need to be made regarding parameterisation, model-based or model-free approaches, and the formulation of the objective function. This work provides a unified framework for such "learning to bid'' methods, showing how many existing approaches fall under the value-based paradigm. We then introduce novel policy-based and doubly robust formulations of the bidding problem. To allow for reliable and reproducible offline validation of such methods without relying on sensitive proprietary data, we introduce AuctionGym: a simulation environment that enables the use of bandit learning for bidding strategies in online advertising auctions. We present results from a suite of experiments under varying environmental conditions, unveiling insights that can guide practitioners who need to decide on a model class. Empirical observations highlight the effectiveness of our newly proposed methods. AuctionGym is released under an open-source license, and we expect the research community to benefit from this tool.

FairCod: A Fairness-aware Concurrent Dispatch System for Large-scale Instant Delivery Services

In recent years, we have been witnessing a rapid prevalence of instant delivery services (e.g., UberEats, Instacart, and Eleme) due to their convenience and timeliness. A unique characteristic of instant delivery services is the concurrent dispatch mode, where (i) one courier usually delivers multiple orders simultaneously, especially during rush hours, and (ii) couriers can receive new orders while delivering existing ones. Most existing concurrent dispatch systems are efficiency-oriented, which means they usually dispatch a group of orders that share a similar delivery route to a courier. Although this strategy may achieve high overall efficiency, it also potentially causes a huge disparity in earnings between different couriers. To address this problem, we design a Fairness-aware Concurrent dispatch system called FairCod, which aims to optimize overall operational efficiency and individual fairness at the same time. Specifically, in FairCod, we design a Dynamic Advantage Actor-Critic algorithm with a Fairness constraint (DA2CF). The basic idea is that it includes an Actor network to make dispatch decisions based on a dynamic action space and a Critic network to evaluate the dispatch decisions from a fairness perspective. More importantly, we extensively evaluate our FairCod system based on one month of real-world data consisting of 36.38 million orders from 42,000 couriers, collected by one of the largest instant delivery companies in China. Experimental results show that FairCod improves courier fairness by 30.3% without sacrificing the overall system benefit compared to state-of-the-art baselines.

AdSEE: Investigating the Impact of Image Style Editing on Advertisement Attractiveness

Online advertisements are important elements of e-commerce sites, social media platforms, and search engines. With the increasing popularity of mobile browsing, many online ads are displayed with visual information in the form of a cover image in addition to text descriptions to grab the attention of users. Various recent studies have focused on predicting the click rates of online advertisements using visual features or on composing optimal advertisement elements to enhance visibility. In this paper, we propose Advertisement Style Editing and Attractiveness Enhancement (AdSEE), which explores whether semantic editing of ad images can affect or alter the popularity of online advertisements. We introduce StyleGAN-based facial semantic editing and inversion to ad images and train a click-rate predictor that maps GAN-based face latent representations, in addition to traditional visual and textual features, to click rates. Through a large collected dataset named QQ-AD, containing 20,527 online ads, we perform extensive offline tests to study how different semantic directions and their edit coefficients may impact click rates. We further design a Genetic Advertisement Editor to efficiently search for the optimal edit directions and intensity for an input ad cover image to enhance its projected click rate. Online A/B tests performed over a period of 5 days verified the increased click-through rates of AdSEE-edited samples compared to a control group of original ads, confirming the relation between image styles and ad popularity. We open-source the code for AdSEE research at https://github.com/LiyaoJiang1998/adsee.

Adaptive Graph Contrastive Learning for Recommendation

Graph neural networks (GNNs) have recently emerged as an effective collaborative filtering (CF) approach for recommender systems. The key idea of GNN-based recommender systems is to recursively perform message passing along user-item interaction edges to refine encoded embeddings, relying on sufficient and high-quality training data. However, user behavior data in practical recommendation scenarios is often noisy and exhibits a skewed distribution. To address these issues, some recommendation approaches, such as SGL, leverage self-supervised learning to improve user representations. These approaches conduct self-supervised learning by creating contrastive views, but they depend on tedious trial-and-error selection of augmentation methods. In this paper, we propose a novel Adaptive Graph Contrastive Learning (AdaGCL) framework that conducts data augmentation with two adaptive contrastive view generators to better empower the CF paradigm. Specifically, we use two trainable view generators - a graph generative model and a graph denoising model - to create adaptive contrastive views. With two adaptive contrastive views, AdaGCL introduces additional high-quality training signals into the CF paradigm, helping to alleviate data sparsity and noise issues. Extensive experiments on three real-world datasets demonstrate the superiority of our model over various state-of-the-art recommendation methods. Our model implementation code is available at https://github.com/HKUDS/AdaGCL.

PGLBox: Multi-GPU Graph Learning Framework for Web-Scale Recommendation

While having been used widely for large-scale recommendation and online advertising, the Graph Neural Network (GNN) has demonstrated its representation learning capacity to extract embeddings of nodes and edges through passing, transforming, and aggregating information over the graph. In this work, we propose PGLBox - a multi-GPU graph learning framework based on PaddlePaddle [24], incorporating optimized storage, computation, and communication strategies to train deep GNNs on web-scale graphs for recommendation. Specifically, PGLBox adopts a three-layer hierarchical storage system to facilitate I/O, where graphs and embeddings are stored in HBM and on SSDs, respectively, with main memory (MEM) as the cache. To fully utilize multiple GPUs and I/O bandwidth, PGLBox uses an asynchronous pipeline with three stages: it first samples subgraphs from the input graph, then pulls & updates embeddings, and finally trains GNNs on the subgraphs, with parameter updates queued at the end of the pipeline. Thanks to PGLBox's capacity to handle web-scale graphs, it becomes feasible to unify the view of GNN-based recommendation tasks across multiple advertising verticals and fuse all these graphs into a unified yet huge one. We evaluate PGLBox on a suite of realistic GNN training tasks for recommendation, comparing PGLBox on a multi-GPU server (8× Tesla A100) against the legacy training system based on a 40-node MPI cluster at Baidu. The overall comparisons show that PGLBox can save up to 55% of the monetary cost of training GNN models and achieve up to 14× training speedup with the same accuracy as the legacy trainer. The open-source implementation of PGLBox is available at https://github.com/PaddlePaddle/PGL/tree/main/apps/PGLBox.
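The three-stage asynchronous pipeline described above can be sketched as a generic producer-consumer chain of queues; the stage functions and wiring below are illustrative stand-ins (PGLBox's real stages perform graph sampling, embedding pull/update on the storage hierarchy, and GPU training):

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run one pipeline stage: consume from inbox until sentinel, emit to outbox."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: propagate shutdown and stop
            outbox.put(None)
            return
        outbox.put(fn(item))

# Three stages, loosely mirroring: sample subgraph -> pull embeddings -> train step.
sample = lambda batch: f"subgraph({batch})"
pull = lambda sg: f"embeds({sg})"
train = lambda emb: f"grads({emb})"

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
threads = [threading.Thread(target=stage, args=(f, qi, qo))
           for f, qi, qo in ((sample, q0, q1), (pull, q1, q2), (train, q2, q3))]
for t in threads:
    t.start()
for batch in range(3):           # feed batches; stages overlap their work
    q0.put(batch)
q0.put(None)                     # signal end of input
results = []
while (out := q3.get()) is not None:
    results.append(out)
for t in threads:
    t.join()
```

Because each stage runs in its own thread, batch i+1 can be sampled while batch i is still pulling embeddings, which is the overlap the pipeline exploits to hide I/O latency.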

Multimodal Indoor Localisation in Parkinson's Disease for Detecting Medication Use: Observational Pilot Study in a Free-Living Setting

Parkinson's disease (PD) is a slowly progressive, debilitating neurodegenerative disease which causes motor symptoms including gait dysfunction. Motor fluctuations are alterations between periods with a positive response to levodopa therapy ("on") and periods marked by re-emergence of PD symptoms ("off") as the response to medication wears off. These fluctuations often affect gait speed, and their disabling impact increases as PD progresses. To improve the effectiveness of current indoor localisation methods, we propose a transformer-based approach utilising two modalities that provide complementary views of movement: Received Signal Strength Indicator (RSSI) and accelerometer data from wearable devices. A sub-objective is to evaluate whether indoor localisation, including its in-home gait speed features (i.e., the time taken to walk between rooms), could be used to evaluate motor fluctuations by detecting whether the person with PD is taking or withholding levodopa medications. To properly evaluate our proposed method, we use a free-living dataset where movements and mobility are greatly varied and unstructured, as expected in real-world conditions. 24 participants lived in pairs (one person with PD, one control) for five days in a smart home with various sensors. Our evaluation on the resulting dataset demonstrates that our proposed network outperforms other methods for indoor localisation. The sub-objective evaluation shows that precise room-level localisation predictions, transformed into in-home gait speed features, produce accurate predictions of whether the PD participant is taking or withholding their medications.

IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research

Graph neural networks (GNNs) have shown high potential for a variety of real-world, challenging applications, but one of the major obstacles in GNN research is the lack of large-scale flexible datasets. Most existing public datasets for GNNs are relatively small, which limits the ability of GNNs to generalize to unseen data. The few existing large-scale graph datasets provide very limited labeled data. This makes it difficult to determine if the GNN model's low accuracy for unseen data is inherently due to insufficient training data or if the model failed to generalize. Additionally, datasets used to train GNNs need to offer flexibility to enable a thorough study of the impact of various factors while training GNN models.

In this work, we introduce the Illinois Graph Benchmark (IGB), a research dataset tool that developers can use to train, scrutinize, and systematically evaluate GNN models with high fidelity. IGB includes both homogeneous and heterogeneous academic graphs of enormous size, with more than 40% of their nodes labeled. Compared to the largest publicly available graph datasets, IGB provides over 162× more labeled data for deep learning practitioners and developers to create and evaluate models with higher accuracy. The IGB dataset is a collection of academic graphs designed to be flexible, enabling the study of various GNN architectures and embedding generation techniques, and the analysis of system performance issues in node classification tasks. IGB is open-sourced, supports the DGL and PyG frameworks, and comes with releases of the raw text, which we believe will foster emerging language-model and GNN research projects. An early public version of IGB is available at https://github.com/IllinoisGraphBenchmark/IGB-Datasets.

Ball Trajectory Inference from Multi-Agent Sports Contexts Using Set Transformer and Hierarchical Bi-LSTM

As artificial intelligence spreads out to numerous fields, the application of AI to sports analytics is also in the spotlight. However, one of the major challenges is the difficulty of automated acquisition of continuous movement data during sports matches. In particular, it is a conundrum to reliably track a tiny ball on a wide soccer pitch with obstacles such as occlusion and imitations. Tackling the problem, this paper proposes an inference framework of ball trajectory from player trajectories as a cost-efficient alternative to ball tracking. We combine Set Transformers to get permutation-invariant and equivariant representations of the multi-agent contexts with a hierarchical architecture that intermediately predicts the player ball possession to support the final trajectory inference. Also, we introduce the reality loss term and postprocessing to secure the estimated trajectories to be physically realistic. The experimental results show that our model provides natural and accurate trajectories as well as admissible player ball possession at the same time. Lastly, we suggest several practical applications of our framework including missing trajectory imputation, semi-automated pass annotation, automated zoom-in for match broadcasting, and calculating possession-wise running performance metrics.

Real Time Index and Search Across Large Quantities of GNN Experts for Low Latency Online Learning

Online learning is a powerful technique that allows models to adjust to concept drift in dynamically changing graphs. This approach is crucial for large mobility-based companies like Grab, where batch-learning methods fail to keep up with the large amount of training data. Our work focuses on scaling graph neural network mixture-of-experts (MoE) models for real-time traffic speed prediction on road networks, while meeting high accuracy and low latency requirements. Conventional spatio-temporal and incremental MoE frameworks struggle with poor inference accuracy and with linear time complexity when scaling experts, the latter leading to prohibitively high latency in model updates. To address this issue, we introduce the Indexed Router, a novel method that organizes experts into a structured hierarchy called the indexed tree. This approach reduces the time to scale to and search across N experts from O(N) to O(log N), making it ideal for online learning under tight service level agreements. Our experiments show that these time savings do not compromise inference accuracy, and our Indexed Router outperforms state-of-the-art spatio-temporal and incremental MoE models in terms of traffic speed prediction accuracy on real-life GPS traces from Grab's database and publicly available records. In summary, the Indexed Router enables MoE models to scale across large numbers of experts with low latency, while accurately identifying the relevant experts for inference.
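A generic way to see the O(N) → O(log N) reduction claimed above is to key experts in a sorted index and route queries by binary search instead of a linear scan; the scalar routing key and nearest-key rule below are illustrative assumptions, since the paper's indexed tree is its own structure:

```python
import bisect

class IndexedExperts:
    """Toy hierarchical routing: experts keyed by a scalar routing score.
    A sorted index gives O(log N) lookup, versus an O(N) scan over all experts."""

    def __init__(self):
        self.keys = []      # sorted routing keys
        self.experts = []   # expert ids, aligned with keys

    def add(self, key, expert_id):
        i = bisect.bisect_left(self.keys, key)   # O(log N) position search
        self.keys.insert(i, key)
        self.experts.insert(i, expert_id)

    def route(self, query_key):
        """Pick the expert whose key is nearest the query's key."""
        i = bisect.bisect_left(self.keys, query_key)
        cands = [j for j in (i - 1, i) if 0 <= j < len(self.keys)]
        best = min(cands, key=lambda j: abs(self.keys[j] - query_key))
        return self.experts[best]

idx = IndexedExperts()
for key, expert in [(0.5, "b"), (0.1, "a"), (0.9, "c")]:
    idx.add(key, expert)
```

Note that `list.insert` keeps lookups logarithmic but makes insertion linear; a balanced tree, as a hierarchical index like the indexed tree implies, makes growing the expert set logarithmic as well.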

Neural Insights for Digital Marketing Content Design

In digital marketing, experimenting with new website content is one of the key levers to improve customer engagement. However, creating successful marketing content is a manual and time-consuming process that lacks clear guiding principles. This paper seeks to close the loop between content creation and online experimentation by offering marketers AI-driven actionable insights based on historical data to improve their creative process. We present a neural-network-based system that scores and extracts insights from a marketing content design. Namely, a multimodal neural network predicts the attractiveness of marketing contents, and a post-hoc attribution method generates actionable insights for marketers to improve their content in specific marketing locations. Our insights not only point out the advantages and drawbacks of a given current content, but also provide design recommendations based on historical data. We show that our scoring model and insights work well both quantitatively and qualitatively.

Revisiting Hate Speech Benchmarks: From Data Curation to System Deployment

Social media is awash with hateful content, much of which is often veiled with linguistic and topical diversity. The benchmark datasets used for hate speech detection do not account for such divagation as they are predominantly compiled using hate lexicons. However, capturing hate signals becomes challenging in neutrally-seeded malicious content. Thus, designing models and datasets that mimic the real-world variability of hate warrants further investigation.

To this end, we present GOTHate, a large-scale code-mixed crowdsourced dataset of around 51k posts for hate speech detection from Twitter. GOTHate is neutrally seeded, encompassing different languages and topics. We conduct detailed comparisons of GOTHate with existing hate speech datasets, highlighting its novelty, and benchmark it with 10 recent baselines. Our extensive empirical and benchmarking experiments suggest that GOTHate is hard to classify in a text-only setup. Thus, we investigate how adding endogenous signals enhances the hate speech detection task. We augment GOTHate with the user's timeline information and ego network, bringing the overall data source closer to the real-world setup for understanding hateful content. Our proposed solution, HEN-mBERT, is a modular, multilingual, mixture-of-experts model that enriches the linguistic subspace with latent endogenous signals from history, topology, and exemplars. HEN-mBERT outperforms the best baseline by 2.5% and 5% in overall macro-F1 and hate-class F1, respectively. Inspired by our experiments, in partnership with Wipro AI, we are developing a semi-automated pipeline to detect hateful content as part of their mission to tackle online harm.

A Preference-aware Meta-optimization Framework for Personalized Vehicle Energy Consumption Estimation

Vehicle Energy Consumption (VEC) estimation aims to predict the total energy required for a given trip before it starts, which is of great importance to trip planning and transportation sustainability. Existing approaches mainly focus on extracting statistically significant factors from typical trips to improve the VEC estimation. However, the energy consumption of each vehicle may diverge widely due to personalized driving behavior under varying travel contexts. To this end, this paper proposes a preference-aware meta-optimization framework, Meta-Pec, for personalized vehicle energy consumption estimation. Specifically, we first propose a spatiotemporal behavior learning module to capture the latent driver preference hidden in historical trips. Moreover, based on the memorization of driver preference, we devise a selection-based driving behavior prediction module to infer driver-specific driving patterns on a given route, which provides additional basis and supervision signals for VEC estimation. Besides, a driver-specific meta-optimization scheme is proposed to enable fast model adaptation by learning and sharing transferable knowledge globally. Extensive experiments on two real-world datasets show the superiority of our proposed framework against ten numerical and data-driven machine learning baselines. The source code is available at https://github.com/usail-hkust/Meta-Pec.

Towards Suicide Prevention from Bipolar Disorder with Temporal Symptom-Aware Multitask Learning

Bipolar disorder (BD) is closely associated with an increased risk of suicide. However, while prior work has revealed valuable insights into the behavior of BD patients on social media, little attention has been paid to developing a model that can predict the future suicidality of a BD patient. Therefore, this study proposes a multi-task learning model for predicting the future suicidality of BD patients by jointly learning current symptoms. We build a novel BD dataset clinically validated by psychiatrists, including 14 years of posts on bipolar-related subreddits written by 818 BD patients, along with annotations of future suicidality and BD symptoms. We also suggest a temporal symptom-aware attention mechanism to determine which symptoms are most influential for predicting future suicidality over time through a sequence of BD posts. Our experiments demonstrate that the proposed model outperforms state-of-the-art models in both BD symptom identification and future suicidality prediction tasks. In addition, the proposed temporal symptom-aware attention provides interpretable attention weights, helping clinicians to apprehend BD patients more comprehensively and to provide timely intervention by tracking mental state progression.

AdaTT: Adaptive Task-to-Task Fusion Network for Multitask Learning in Recommendations

Multi-task learning (MTL) aims to enhance the performance and efficiency of machine learning models by simultaneously training them on multiple tasks. However, MTL research faces two challenges: 1) effectively modeling the relationships between tasks to enable knowledge sharing, and 2) jointly learning task-specific and shared knowledge. In this paper, we present a novel model called Adaptive Task-to-Task Fusion Network (AdaTT) to address both challenges. AdaTT is a deep fusion network built with task-specific and optional shared fusion units at multiple levels. By leveraging a residual mechanism and a gating mechanism for task-to-task fusion, these units adaptively learn both shared knowledge and task-specific knowledge. To evaluate AdaTT's performance, we conduct experiments on a public benchmark and an industrial recommendation dataset using various task groups. Results demonstrate AdaTT significantly outperforms existing state-of-the-art baselines. Furthermore, our end-to-end experiments reveal that the model exhibits better performance compared to alternatives.

Learning Slow and Fast System Dynamics via Automatic Separation of Time Scales

Learning the underlying slow and fast dynamics of a system is instrumental for many practical applications related to the system. However, existing approaches are limited in discovering the appropriate time scale to separate the slow and fast variables and in effectively learning their dynamics based on correct-dimensional representation vectors. In this paper, we introduce a framework that effectively learns slow and fast system dynamics in an integrated manner. We propose a novel intrinsic dimensionality (ID) driven learning method based on a time-lagged autoencoder framework to identify appropriate time scales for separating slow and fast variables and their IDs simultaneously. Further, we propose an integrated framework to concurrently learn the system's slow and fast dynamics, which is able to integrate prior knowledge of time scales and IDs and to model the complex coupled slow and fast variables. Extensive experimental results on two representative dynamical systems show that our proposed framework efficiently learns slow and fast system dynamics. Specifically, long-term prediction performance improves by 36% on average over four representative baselines under our proposed framework. Furthermore, our proposed system is able to extract interpretable slow and fast dynamics highly correlated with the known slow and fast variables in the dynamical systems. Our codes and datasets are open-sourced at: https://github.com/tsinghua-fib-lab/SlowFastSeparation.

DisasterNet: Causal Bayesian Networks with Normalizing Flows for Cascading Hazards Estimation from Satellite Imagery

Sudden-onset hazards like earthquakes often induce cascading secondary hazards (e.g., landslides, liquefaction, debris flows, etc.) and subsequent impacts (e.g., building and infrastructure damage) that cause catastrophic human and economic losses. Rapid and accurate estimates of these hazards and impacts are critical for timely and effective post-disaster responses. Emerging remote sensing techniques provide pre- and post-event satellite images for rapid hazard estimation. However, hazards and damage often co-occur or colocate with underlying complex cascading geophysical processes, making it challenging to directly differentiate multiple hazards and impacts from satellite imagery using existing single-hazard models. We introduce DisasterNet, a novel family of causal Bayesian networks that models the processes by which a major hazard triggers cascading hazards and impacts and jointly induces signal changes in remotely sensed observations. We integrate normalizing flows to effectively model the highly complex causal dependencies in this cascading process. A triplet loss is further designed to leverage prior geophysical knowledge to enhance the identifiability of our highly expressive Bayesian networks. Moreover, a novel stochastic variational inference with normalizing flows is derived to jointly approximate the posteriors of multiple unobserved hazards and impacts from noisy remote sensing observations. Integrated with the USGS Prompt Assessment of Global Earthquakes for Response (PAGER) system, our framework is evaluated on recent global earthquake events. Evaluation results show that DisasterNet significantly improves multi-hazard and impact estimation compared to existing USGS products.

Diga: Guided Diffusion Model for Graph Recovery in Anti-Money Laundering

With the upsurge of online banking, mobile payment, and virtual currency, new money-laundering crimes are easily concealed within the enormous transaction volume. Traditional rule-based methods with large numbers of alerting thresholds are no longer capable of handling fast-changing transaction networks. Recently, DL models, represented by graph neural networks (GNNs), have shown the potential to capture money-laundering patterns with high accuracy. However, most related works are still far from practical deployment in industry. Based on our practice at WeBank, there are three major challenges. First, supervised learning is infeasible on such extraordinarily large-scale but imbalanced data, with hundreds of millions of active accounts but only thousands of anomalies. Second, real-world transactions form a sparse network with millions of isolated user groups, which exceeds the expressive ability of current node-level GNNs. Third, an explanation for each suspicious account is mandated by the government for double-checking, which conflicts with the black-box nature of most DL models. Therefore, we propose Diga, the first work to apply the diffusion probabilistic model to a graph anomaly detection problem, with three novel techniques: a biased K-hop PageRank, semi-supervised guided diffusion, and a novel weight-sharing GNN layer. The effectiveness and efficiency of Diga are verified via intensive experiments on both industrial and public datasets.
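The abstract does not specify the biased K-hop PageRank further; as a rough illustration of the general idea, a seed-biased PageRank truncated after K hops can be sketched as below. The adjacency-list format, the restart parameter, and the dangling-node handling are all assumptions, not Diga's formulation:

```python
def k_hop_pagerank(adj, seeds, k=3, alpha=0.15):
    """Seed-biased PageRank computed by power iteration truncated at k
    hops, so scores only reflect nodes within distance k of the seeds."""
    n = len(adj)  # adj[u] is the list of out-neighbors of node u
    restart = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    rank = restart[:]
    for _ in range(k):                       # truncate at k hops
        nxt = [alpha * r for r in restart]   # restart mass biased to seeds
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = (1.0 - alpha) * rank[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:  # dangling node: return its mass to the seeds
                for i in range(n):
                    nxt[i] += (1.0 - alpha) * rank[u] * restart[i]
        rank = nxt
    return rank

# toy directed cycle 0 -> 1 -> 2 -> 0, biased toward seed node 0
scores = k_hop_pagerank([[1], [2], [0]], seeds={0}, k=3)
```

The truncation keeps the computation local to each isolated user group, which is one plausible reason a K-hop variant suits a sparse network with millions of small components.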

HardSATGEN: Understanding the Difficulty of Hard SAT Formula Generation and A Strong Structure-Hardness-Aware Baseline

Industrial SAT formula generation is a critical yet challenging task. Existing SAT generation approaches can hardly capture the global structural properties and maintain plausible computational hardness simultaneously. We first present an in-depth analysis of the limitations of previous learning methods in reproducing the computational hardness of original instances, which may stem from the inherent homogeneity of their adopted split-merge procedure. Building on the observations that industrial formulae exhibit clear community structure and that over-split substructures hinder the semantic formation of logical structures, we propose HardSATGEN, which introduces a fine-grained control mechanism into the neural split-merge paradigm for SAT formula generation to better recover the structural and computational properties of industrial benchmarks. Experiments, including evaluations on a private and practical corporate testbed, show the superiority of HardSATGEN as the only method that successfully augments formulae while maintaining similar computational hardness and capturing global structural properties simultaneously. Compared to the best previous methods, the average performance gains reach 38.5% in structural statistics, 88.4% in computational metrics, and over 140.7% in the effectiveness of our generated instances in guiding solver tuning. Source code is available at https://github.com/Thinklab-SJTU/HardSATGEN.

Learning Joint Relational Co-evolution in Spatial-Temporal Knowledge Graph for SMEs Supply Chain Prediction

To effectively explore supply chain relationships among Small and Medium-sized Enterprises (SMEs), remarkable progress on this relation-modeling problem, especially with knowledge graph-based methods, has been made in recent years. As a typical link prediction task, supply chain prediction forecasts unknown future relationship facts between SMEs by utilizing the historical semantic connections between entities in knowledge graphs (KGs). However, it remains a great challenge for existing models, as few of them consider both the temporal dependency and the cooperative correlation of connectivity patterns along the timeline synergistically. Accordingly, we propose a novel framework to learn joint relational co-evolution in Spatial-Temporal Knowledge Graphs (STKG). Specifically, on the basis of the constructed large-scale financial STKG, a multi-view relational sequence mining method is proposed to reveal the semantic information in ontological concepts. Furthermore, a relational co-evolution learning module is developed to capture the regularity of evolving connectivity patterns from the spatial-temporal view. Meanwhile, a multiple random subspace representation learning layer is designed to improve both compatibility and complementarity during knowledge aggregation. Experimental results on large-scale SMEs supply chain prediction tasks from four real-world industries in China illustrate the effectiveness of the proposed model.

S2phere: Semi-Supervised Pre-training for Web Search over Heterogeneous Learning to Rank Data

While Learning to Rank (LTR) models built on top of transformers have been widely adopted to achieve decent performance, it is still challenging to train such models with sufficient data, as only an extremely small number of query-webpage pairs can be annotated versus the trillions of webpages available online and billions of web search queries issued every day. Meanwhile, industry research communities have released a number of open-source LTR datasets with high-quality annotations but different designs of LTR features and labels (i.e., heterogeneous domains). In this work, inspired by the recent progress in pre-training transformers for performance advantages, we study the problem of pre-training LTR models using both labeled and unlabeled samples; in particular, we focus on using well-annotated samples from heterogeneous open-source LTR datasets to boost pre-training performance. We propose S2phere, a Semi-Supervised Pre-training strategy with Heterogeneous LTR data for LTR models, using both unlabeled and labeled query-webpage pairs across heterogeneous LTR datasets. S2phere consists of a three-step approach: (1) Semi-supervised Feature Extraction Pre-training via Perturbed Contrastive Loss, (2) Cross-domain Ranker Pre-training over Heterogeneous LTR Datasets, and (3) End-to-end LTR Fine-tuning via Modular Network Composition.
Specifically, given an LTR model composed of a backbone (the feature extractor), a neck (the module that reasons about orders), and a head (the predictor of ranking scores), S2phere uses unlabeled/labeled data from the search engine to pre-train the backbone in Step (1) via semi-supervised learning; Step (2) then incorporates multiple open-source heterogeneous LTR datasets to improve pre-training of the neck module as shared parameters for cross-domain learning; and finally, in Step (3), S2phere composes the backbone and neck with a randomly initialized head into a whole LTR model and fine-tunes it using search engine data with various learning strategies. Extensive offline experiments and online A/B tests have been conducted on top of the Baidu search engine. Comparisons against a number of baseline algorithms confirm the advantages of S2phere in producing high-performance LTR models for web-scale search.

CBLab: Supporting the Training of Large-scale Traffic Control Policies with Scalable Traffic Simulation

Traffic simulation provides interactive data for the optimization of traffic control policies. However, existing traffic simulators are limited by their lack of scalability and shortage of input data, which prevents them from generating interactive data for scenarios involving the road networks of real large-scale cities.

In this paper, we present City Brain Lab, a toolkit for scalable traffic simulation. CBLab consists of three components: CBEngine, CBData, and CBScenario. CBEngine is a highly efficient simulator supporting large-scale traffic simulation. CBData includes a traffic dataset with road network data from 100 cities around the world. We also develop a pipeline to perform a one-click transformation from raw road networks to the input data of our traffic simulation. Combining CBEngine and CBData allows researchers to run scalable traffic simulations on the road networks of real large-scale cities. On this basis, CBScenario implements an interactive environment and a benchmark for two scenarios of traffic control policies, with which traffic control policies adaptable to large-scale urban traffic can be trained and tuned. To the best of our knowledge, CBLab is the first infrastructure supporting traffic control policy optimization in large-scale urban scenarios. CBLab has supported the City Brain Challenge @ KDD CUP 2021. The project is available on GitHub: https://github.com/CityBrainLab/CityBrainLab.git.

MUSER: A MUlti-Step Evidence Retrieval Enhancement Framework for Fake News Detection

The ease of spreading false information online enables individuals with malicious intent to manipulate public opinion and undermine social stability. Recently, fake news detection based on evidence retrieval has gained popularity in an effort to identify fake news reliably and reduce its impact. Evidence retrieval-based methods can improve the reliability of fake news detection by computing the textual consistency between the evidence and the claim in the news. In this paper, we propose a framework for fake news detection based on MUlti-Step Evidence Retrieval enhancement (MUSER), which simulates the steps human beings take when reading news, summarizing, consulting materials, and inferring whether the news is true or fake. Our model can explicitly model dependencies among multiple pieces of evidence and perform multi-step associations over the evidence required for news verification through multi-step retrieval. In addition, our model is able to automatically collect existing evidence through paragraph retrieval and key evidence selection, which avoids the tedious process of manual evidence collection. We conducted extensive experiments on real-world datasets in different languages, and the results demonstrate that our proposed model outperforms state-of-the-art baseline methods for detecting fake news by at least 3% in F1-Macro and 4% in F1-Micro. Furthermore, it provides interpretable evidence for end users.

Analysis of COVID-19 Offensive Tweets and Their Targets

During the global COVID-19 pandemic, people utilized social media platforms, especially Twitter, to spread and express opinions about the pandemic. Such discussions also drove the rise in COVID-related offensive speech. In this work, focusing on Twitter, we present a comprehensive analysis of COVID-related offensive tweets and their targets. We collected a COVID-19 dataset with over 747 million tweets spanning 30 months and fine-tuned a BERT classifier to detect offensive tweets. Our analysis shows that the ebb and flow of COVID-related offensive tweets potentially reflect events in the physical world. We then studied the targets of these offensive tweets. A large number of offensive tweets contained abusive words, which could negatively affect the targeted groups or individuals. We also conducted a user network analysis and found that offensive users interact more with other offensive users and that the pandemic had a lasting impact on some offensive users. Our study offers novel insights into the persistence and evolution of COVID-related offensive tweets during the pandemic.

Balancing Approach for Causal Inference at Scale

With modern software and online platforms collecting massive amounts of data, there is an increasing demand for applying causal inference methods at large scale when randomized experimentation is not viable. Weighting methods that directly incorporate covariate balancing have recently gained popularity for estimating causal effects in observational studies. These methods reduce the manual effort required by researchers to iterate between propensity score modeling and balance checking until a satisfactory covariate balance is achieved. However, conventional solvers for determining weights lack the scalability to apply such methods to large-scale datasets at companies like Snap Inc. To address these limitations and improve computational efficiency, in this paper we present scalable algorithms, DistEB and DistMS, for two balancing approaches: entropy balancing [14] and MicroSynth [33]. The solvers have linear time complexity and can be conveniently implemented in distributed computing frameworks such as Spark, Hive, etc. We study the properties of balancing approaches at different scales, up to 1 million treated units and 487 covariates. We find that with larger sample sizes, both bias and variance in the causal effect estimation are significantly reduced. The results emphasize the importance of applying balancing approaches to large-scale datasets. We combine the balancing approach with a synthetic control framework and deploy an end-to-end system for causal impact estimation at Snap Inc.
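As context for the first balancing approach, entropy balancing finds weights over control units that exactly match the treated-group covariate means while staying as close as possible (in KL divergence) to uniform weights. A minimal single-machine sketch of the dual problem, solved by plain gradient descent, is below; the distributed DistEB solver is far more elaborate, and the learning rate and toy data are illustrative:

```python
import math

def entropy_balance(controls, target_means, lr=0.1, iters=2000):
    """Entropy balancing via its dual: weights take the exponential-tilting
    form w_i proportional to exp(lambda . x_i); gradient descent drives the
    weighted covariate means toward the treated-group targets."""
    d = len(target_means)
    lam = [0.0] * d
    ws = []
    for _ in range(iters):
        scores = [sum(l * x[j] for j, l in enumerate(lam)) for x in controls]
        m = max(scores)                       # subtract max for stability
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        ws = [w / z for w in ws]
        # dual gradient: current weighted means minus the target means
        grad = [sum(w * x[j] for w, x in zip(ws, controls)) - target_means[j]
                for j in range(d)]
        lam = [l - lr * g for l, g in zip(lam, grad)]
    return ws

# four control units with one covariate; treated-group mean is 2.5
controls = [(0.0,), (1.0,), (2.0,), (3.0,)]
weights = entropy_balance(controls, target_means=(2.5,))
balanced = sum(w * x[0] for w, x in zip(weights, controls))
```

Because the dual is convex, the gradient vanishes exactly when the weighted means match the targets, which is why no manual propensity-score iteration is needed.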

Tree based Progressive Regression Model for Watch-Time Prediction in Short-video Recommendation

An accurate prediction of watch time is of vital importance for enhancing user engagement in video recommender systems. To achieve this, there are four properties that a watch time prediction framework should satisfy. First, despite its continuous value, watch time is also an ordinal variable, and the relative ordering between its values reflects differences in user preferences. Therefore, these ordinal relations should be reflected in watch time predictions. Second, the conditional dependence between video-watching behaviors should be captured by the model. For instance, one has to watch half of a video before finishing the whole video. Third, modeling watch time with a point estimate ignores the fact that models might give results with high uncertainty, which could cause bad cases in recommender systems. Therefore, the framework should be aware of prediction uncertainty. Fourth, real-life recommender systems suffer from severe bias amplification; thus, an estimation without bias amplification is expected.

How to design a framework that solves these four issues simultaneously remains unexplored. We therefore propose TPM (Tree-based Progressive regression Model) for watch time prediction. Specifically, ordinal ranks of watch time are introduced into TPM, and the problem is decomposed into a series of conditionally dependent classification tasks organized into a tree structure. The expectation of watch time can be generated by traversing the tree, and the variance of watch time predictions is explicitly introduced into the objective function as a measure of uncertainty. Moreover, we show that backdoor adjustment can be seamlessly incorporated into TPM, which alleviates bias amplification.

Extensive offline evaluations have been conducted on public datasets, and TPM has been deployed in a real-world video app, Kuaishou, with over 300 million DAUs. The results indicate that TPM outperforms state-of-the-art approaches and indeed improves video consumption significantly.
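The expectation-by-traversal idea can be illustrated on a toy tree: each internal node carries a classifier's probability of descending to one child, leaves carry ordinal watch-time buckets, and the leaf distribution yields both the mean and a variance usable as an uncertainty measure. The dict layout, bucket values, and probabilities below are illustrative assumptions, not TPM's implementation:

```python
def expected_watch_time(tree):
    """Traverse the conditional-classification tree, accumulating the
    probability of each leaf, and return the mean and variance of watch
    time under the induced leaf distribution."""
    def walk(node, prob):
        if "value" in node:                 # leaf: an ordinal watch-time bucket
            return prob * node["value"], prob * (node["value"] ** 2)
        p_right = node["p_right"]           # classifier output at this node
        le, le2 = walk(node["left"], prob * (1.0 - p_right))
        re, re2 = walk(node["right"], prob * p_right)
        return le + re, le2 + re2

    mean, second_moment = walk(tree, 1.0)
    variance = second_moment - mean ** 2    # uncertainty of the prediction
    return mean, variance

# hypothetical two-level tree over four watch-time buckets (in seconds)
tree = {
    "p_right": 0.6,
    "left":  {"p_right": 0.5,  "left": {"value": 5.0},  "right": {"value": 15.0}},
    "right": {"p_right": 0.25, "left": {"value": 30.0}, "right": {"value": 60.0}},
}
mean, var = expected_watch_time(tree)
```

Because each edge probability conditions on having reached its parent, the tree structure directly encodes the conditional dependence property the abstract describes (e.g., finishing a video implies having watched half of it).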

Explicit Feature Interaction-aware Uplift Network for Online Marketing

As a key component of online marketing, uplift modeling aims to accurately capture the degree to which different treatments, such as coupons or discounts, motivate different users, also known as the estimation of the individual treatment effect (ITE). In an actual business scenario, the options for treatment may be numerous and complex, and there may be correlations between different treatments. In addition, each marketing instance may also have rich user and contextual features. However, existing methods still fall short in both fully exploiting treatment information and mining features that are sensitive to a particular treatment. In this paper, we propose an explicit feature interaction-aware uplift network (EFIN) to address these two problems. Our EFIN includes four customized modules: 1) a feature encoding module encodes not only the user and contextual features, but also the treatment features; 2) a self-interaction module aims to accurately model the user's natural response using all but the treatment features; 3) a treatment-aware interaction module accurately models the degree to which a particular treatment motivates a user through interactions between the treatment features and other features, i.e., the ITE; and 4) an intervention constraint module balances the ITE distribution of users between the control and treatment groups so that the model still achieves an accurate uplift ranking on data collected from non-random intervention marketing scenarios. We conduct extensive experiments on two public datasets and one product dataset to verify the effectiveness of our EFIN. In addition, our EFIN has been deployed in a credit card bill payment scenario of a large online financial platform with significant improvement.

Uncertainty-Aware Probabilistic Travel Time Prediction for On-Demand Ride-Hailing at DiDi

Travel Time Estimation (TTE) aims to accurately forecast the expected trip duration from an origin to a destination. As one of the world's largest ride-hailing platforms, DiDi answers billions of TTE queries per day. The quality of TTE directly determines the customer's experience and the effectiveness of passenger-to-driver matching. However, existing studies mainly regard TTE as a deterministic regression problem and focus on improving the prediction accuracy of a single label, which overlooks the travel time uncertainty induced by various dynamic contextual factors. To this end, in this paper, we propose a probabilistic framework, ProbTTE, for uncertainty-aware travel time prediction. Specifically, the framework first transforms the single-label regression task into a multi-class classification problem to estimate the implicit travel time distribution. Moreover, we propose an adaptive local label-smoothing scheme to capture the ordinal inter-class relationship among soft travel time labels. Furthermore, we construct a route-wise log-normal distribution regularizer to absorb prior knowledge from large-scale historical trip data. By explicitly considering travel uncertainty, the proposed approach not only improves TTE accuracy but also provides additional travel time information to benefit downstream tasks in ride-hailing. Extensive experiments on real-world datasets demonstrate the superiority of the proposed framework over state-of-the-art travel time prediction algorithms. In addition, ProbTTE has been deployed in production at DiDi since late 2022 to empower various order dispatching services, significantly improving passenger and driver experiences.
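To make the ordinal inter-class relationship among soft labels concrete, one simple (non-adaptive) way to soften a travel-time bucket is a distance-decaying kernel over neighboring buckets, so that mass is shared with ordinally close classes. The paper's scheme is adaptive and local, so treat the fixed Gaussian kernel and bandwidth below as an illustration only:

```python
import math

def ordinal_soft_labels(true_class, num_classes, sigma=1.0):
    """Turn a hard travel-time bucket into a soft label distribution that
    decays with ordinal distance from the true bucket, then normalize."""
    raw = [math.exp(-((k - true_class) ** 2) / (2.0 * sigma ** 2))
           for k in range(num_classes)]
    z = sum(raw)
    return [r / z for r in raw]

# eight travel-time buckets, true bucket index 3
labels = ordinal_soft_labels(true_class=3, num_classes=8, sigma=1.0)
```

Training the classifier against such soft targets penalizes predictions in far-away buckets more than near misses, which is what a one-hot target cannot express.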

Stationary Algorithmic Balancing For Dynamic Email Re-Ranking Problem

Email platforms need to generate personalized rankings of emails that satisfy user preferences, which may vary over time. We approach this as a recommendation problem based on three criteria: closeness (how relevant the sender and topic are to the user), timeliness (how recent the email is), and conciseness (how brief the email is). We propose MOSR (Multi-Objective Stationary Recommender), a novel online algorithm that uses an adaptive control model to dynamically balance these criteria and adapt to preference changes. We evaluate MOSR on the Enron Email Dataset, a large collection of real emails, and compare it with other baselines. The results show that MOSR achieves better performance, especially under non-stationary preferences, where users value different criteria more or less over time. We also test MOSR's robustness on a smaller down-sampled dataset that exhibits high variance in email characteristics, and show that it maintains stable rankings across different samples. Our work offers novel insights into how to design email re-ranking systems that account for multiple objectives impacting user satisfaction.

PrivateRec: Differentially Private Model Training and Online Serving for Federated News Recommendation

Federated recommendation can potentially alleviate the privacy concerns in collecting sensitive and personal data for training personalized recommendation systems. However, it suffers from low recommendation quality when local serving is inapplicable due to local resource limitations, and the data privacy of querying clients must also be protected during online serving. Furthermore, a theoretically private solution for both the training and serving of federated recommendation is essential but still lacking. Naively applying differential privacy (DP) to the two stages of federated recommendation would fail to achieve a satisfactory trade-off between privacy and utility due to the high-dimensional characteristics of model gradients and hidden representations. In this work, we propose a federated news recommendation method that achieves better utility in model training and online serving under a DP guarantee. We first clarify the DP definition over behavior data for each round in the pipeline of federated recommendation systems. Next, we propose a privacy-preserving online serving mechanism under this definition based on the idea of decomposing user embeddings with public basic vectors and perturbing the lower-dimensional combination coefficients. We apply a random behavior padding mechanism to reduce the required noise intensity for better utility. Besides, we design a federated recommendation model training method, which can generate effective public basic vectors for serving while providing DP for training participants. We avoid dimension-dependent noise for large models via label permutation and differentially private attention modules. Experiments on real-world news recommendation datasets validate that our method achieves superior utility under a DP guarantee in both the training and serving of federated news recommendation.
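The serving idea of decomposing user embeddings over public basic vectors and perturbing only the low-dimensional coefficients can be sketched as follows. The orthonormal basis, the Laplace mechanism, and the sensitivity value are all assumptions for illustration; the paper's mechanism additionally uses behavior padding to reduce the noise needed:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse-CDF (the random module has no
    built-in Laplace sampler)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_serve_embedding(user_vec, basis, epsilon, sensitivity=1.0):
    """Project the user embedding onto public basis vectors, add Laplace
    noise only to the low-dimensional coefficients, and reconstruct the
    perturbed embedding to send for online serving."""
    # coefficients of the embedding in the (assumed orthonormal) basis
    coeffs = [sum(u * b for u, b in zip(user_vec, bvec)) for bvec in basis]
    scale = sensitivity / epsilon
    noisy = [c + laplace_noise(scale) for c in coeffs]
    dim = len(user_vec)
    return [sum(nc * bvec[i] for nc, bvec in zip(noisy, basis))
            for i in range(dim)]

basis = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # public, orthonormal (toy)
served = dp_serve_embedding((0.3, -0.7, 0.2), basis, epsilon=1.0)
```

Because noise is added in the coefficient space, its magnitude scales with the number of basis vectors rather than the embedding dimension, which is the trade-off the abstract highlights for high-dimensional representations.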

WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences

We present WebGLM, a web-enhanced question-answering system based on the General Language Model (GLM). Its goal is to augment a pre-trained large language model (LLM) with web search and retrieval capabilities while remaining efficient for real-world deployment. To achieve this, we develop WebGLM with strategies for an LLM-augmented retriever, a bootstrapped generator, and a human preference-aware scorer. Specifically, we identify and address the limitations of WebGPT (OpenAI), through which WebGLM gains advantages in accuracy, efficiency, and cost-effectiveness. In addition, we propose systematic criteria for evaluating web-enhanced QA systems. We conduct multi-dimensional human evaluation and quantitative ablation studies, which suggest that the proposed WebGLM designs outperform existing systems. WebGLM with the 10-billion-parameter GLM (10B) is shown to perform better than the similar-sized WebGPT (13B) and even comparably to WebGPT (175B) in human evaluation. The code, demo, and data are at https://github.com/THUDM/WebGLM.

Practical Synthetic Human Trajectories Generation Based on Variational Point Processes

Human trajectories, reflecting people's travel patterns and the range of their activities, are crucial for applications like urban planning and epidemic control. However, real-world human trajectory data tends to be limited by user privacy or device acquisition issues, leaving its quality insufficient to support these applications. Hence, generating human trajectory data is a crucial but challenging task, which faces the following two critical challenges: 1) how to capture the user distribution in human trajectories (group view), and 2) how to model the complex mobility patterns of each user trajectory (individual view). In this paper, we propose a novel human trajectory generator (named VOLUNTEER), consisting of a user VAE and a trajectory VAE, to address the above challenges. Specifically, in the user VAE, we propose to learn the user distribution from all human trajectories from a group view. In the trajectory VAE, from the individual view, we model the complex mobility patterns by decoupling travel time and dwell time to accurately simulate individual trajectories. Extensive experiments on two real-world datasets show the superiority of our model over state-of-the-art baselines. Further application analysis in an industrial system also demonstrates the effectiveness of our model.

Impact-Oriented Contextual Scholar Profiling using Self-Citation Graphs

Quantitatively profiling a scholar's scientific impact is important to modern research society. Current practices with bibliometric indicators (e.g., h-index), lists, and networks perform well at scholar ranking, but do not provide structured context for scholar-centric, analytical tasks such as profile reasoning and understanding. This work presents GeneticFlow (GF), a suite of novel graph-based scholar profiles that fulfill three essential requirements: structured-context, scholar-centric, and evolution-rich. We propose a framework to compute GF over large-scale academic data sources with millions of scholars. The framework encompasses a new unsupervised advisor-advisee detection algorithm, a well-engineered citation type classifier using interpretable features, and a fine-tuned graph neural network (GNN) model. Evaluations are conducted on the real-world task of scientific award inference. Experiment outcomes show that the F1 score of the best GF profile significantly outperforms alternative methods based on impact indicators and bibliometric networks in all 6 computer science fields considered. Moreover, the core GF profiles, with 63.6%-66.5% of the nodes and 12.5%-29.9% of the edges of the full profile, still significantly outrun existing methods in 5 out of the 6 fields studied. Visualization of the GF profiling results also reveals human-explainable patterns for high-impact scholars.

A Look into Causal Effects under Entangled Treatment in Graphs: Investigating the Impact of Contact on MRSA Infection

Methicillin-resistant Staphylococcus aureus (MRSA) is a type of bacteria resistant to certain antibiotics, making it difficult to prevent MRSA infections. Over decades of effort to conquer infectious diseases caused by MRSA, many studies have been proposed to estimate the causal effects of close contact (treatment) on MRSA infection (outcome) from observational data. In this problem, the treatment assignment mechanism plays a key role, as it determines the patterns of missing counterfactuals --- the fundamental challenge of causal effect estimation. Most existing observational studies for causal effect learning assume that the treatment is assigned individually for each unit. However, on many occasions, treatments are pairwisely assigned for units that are connected in graphs, i.e., the treatments of different units are entangled. Neglecting entangled treatments can impede causal effect estimation. In this paper, we study the problem of causal effect estimation with treatment entangled in a graph. Despite a few explorations of entangled treatments, this problem remains difficult due to the following challenges: (1) the entanglement brings difficulties in modeling and leveraging the unknown treatment assignment mechanism; (2) there may exist hidden confounders which lead to confounding biases in causal effect estimation; (3) the observational data is often time-varying. To tackle these challenges, we propose a novel method, NEAT, which explicitly leverages the graph structure to model the treatment assignment mechanism, and mitigates confounding biases based on the treatment assignment modeling. We also extend our method to a dynamic setting to handle time-varying observational data. Experiments on both synthetic datasets and a real-world MRSA dataset validate the effectiveness of the proposed method and provide insights for future applications.

SAInf: Stay Area Inference of Vehicles using Surveillance Camera Records

Stay area detection is one of the most important applications in trajectory data mining and is helpful for understanding human behavioral intentions. Traditional stay area detection methods are based on GPS data with relatively high sampling rates. However, because of privacy issues, accessing GPS data can be difficult in most real-world applications. Fortunately, traffic surveillance cameras have been widely deployed in urban areas, and they provide a novel way of acquiring vehicle trajectories: all vehicles that pass by can be recognized and recorded passively. However, the trajectory data collected in this way is extremely coarse, because surveillance cameras are only deployed at important locations, such as crossroads. This coarseness introduces two challenges for the stay area detection problem, i.e., whether and where a stay event occurs. In this paper, we design a two-stage method to solve the stay area detection problem with coarse trajectories. It first detects the stay event between a pair of surveillance camera records, then uses a layer-by-layer stay area identification algorithm to infer the exact stay area. Extensive experiments based on real-world data were used to evaluate the performance of the proposed framework. Results demonstrate that the proposed framework, SAInf, achieves a 58% performance improvement over SOTA methods.

Multi-Label Learning to Rank through Multi-Objective Optimization

The Learning to Rank (LTR) technique is ubiquitous in Information Retrieval systems, especially in search ranking applications. The relevance labels used to train ranking models are often noisy measurements of human behavior, such as product ratings in product searches. This results in non-unique ground-truth rankings and ambiguity. To address this, Multi-Label LTR (MLLTR) trains models using multiple relevance criteria, capturing conflicting but important goals, such as product quality and purchase likelihood, for improved revenue in product searches. This research leverages Multi-Objective Optimization (MOO) in MLLTR and employs modern MOO algorithms to solve the problem. We propose a general framework that combines label information to characterize trade-offs among goals and allows the use of gradient-based MOO algorithms. We test the proposed framework on four publicly available LTR datasets and one e-commerce dataset to show its efficacy.
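One concrete instance of combining label information across goals is linear scalarization of per-goal ranking losses; the weights trace out trade-offs among the criteria. The framework in the paper is more general and supports gradient-based MOO algorithms, so the pairwise hinge loss and fixed weights below are illustrative assumptions:

```python
def scalarized_loss(scores, labels_by_goal, weights):
    """Weighted sum of per-goal pairwise hinge losses: for each goal, every
    document pair ranked in the wrong order (or with too small a margin)
    contributes to that goal's loss; goals are then mixed by the weights."""
    total = 0.0
    for w, labels in zip(weights, labels_by_goal):
        loss = 0.0
        for i in range(len(scores)):
            for j in range(len(scores)):
                if labels[i] > labels[j]:   # i should rank above j
                    loss += max(0.0, 1.0 - (scores[i] - scores[j]))
        total += w * loss
    return total

# two documents, two conflicting goals (e.g., quality vs. purchase likelihood)
scores = [2.0, 0.0]
goals = [[1, 0],   # goal 1 prefers document 0
         [0, 1]]   # goal 2 prefers document 1
loss = scalarized_loss(scores, goals, weights=(0.5, 0.5))
```

Sweeping the weight vector over the simplex recovers different points on the trade-off surface; gradient-based MOO methods instead adapt the mixing dynamically from the per-goal gradients.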

Detecting Vulnerable Nodes in Urban Infrastructure Interdependent Network

Urban infrastructures, the engineering facilities essential for the regular running of cities, naturally take the form of networks; understanding and characterizing their vulnerability is of great value. Potential applications include protecting fragile facilities and designing robust topologies. Because different topological characteristics correlate strongly with infrastructure vulnerability and evolve through complicated mechanisms, heuristic and machine-assisted analyses fall short in this scenario. In this paper, we model the interdependent network as a heterogeneous graph and propose a system based on graph neural networks with reinforcement learning, which can be trained on real-world data, to accurately characterize the vulnerability of the city system. The presented system leverages deep learning techniques to understand and analyze the heterogeneous graph, enabling us to capture the risk of cascading failure and discover vulnerable infrastructures of cities. Extensive experiments with various requests demonstrate not only the expressive power of our system but also its transfer ability and the necessity of its specific components. All source code and models, including those that reproduce all figures analyzed in this work, are publicly available at this link: https://github.com/tsinghua-fib-lab/KDD2023-ID546-UrbanInfra.

DRL4Route: A Deep Reinforcement Learning Framework for Pick-up and Delivery Route Prediction

Pick-up and Delivery Route Prediction (PDRP), which aims to estimate the future service route of a worker given their current task pool, has received increasing attention in recent years. Deep neural networks based on supervised learning have emerged as the dominant models for the task because of their powerful ability to capture workers' behavior patterns from massive historical data. Though promising, they fail to introduce the non-differentiable test criteria into the training process, leading to a mismatch between training and test criteria that considerably degrades their performance when applied in practical systems. To tackle this issue, we present the first attempt to generalize Reinforcement Learning (RL) to the route prediction task, leading to a novel RL-based framework called DRL4Route. It combines the behavior-learning abilities of previous deep learning models with reinforcement learning's ability to optimize non-differentiable objectives. DRL4Route can serve as a plug-and-play component to boost existing deep learning models. Based on the framework, we further implement a model named DRL4Route-GAE for PDRP in logistics service. It follows the actor-critic architecture and is equipped with a Generalized Advantage Estimator that balances the bias and variance of the policy-gradient estimates, thus achieving a better policy. Extensive offline experiments and online deployment show that DRL4Route-GAE improves Location Square Deviation (LSD) by 0.9%-2.7% and Accuracy@3 (ACC@3) by 2.4%-3.2% over existing methods on the real-world dataset.
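The Generalized Advantage Estimator mentioned above is a standard actor-critic component; the following is a minimal sketch of textbook GAE (not the paper's exact implementation, and all variable names are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    with one-step TD errors delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    lam trades off bias (lam=0, pure TD errors) against variance (lam=1,
    Monte Carlo returns minus the value baseline)."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# values has length T+1: V(s_0..s_T), with the terminal value set to 0
adv = gae_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.8, 0.0])
```

Tuning `lam` is how such a model balances the bias and variance of its policy-gradient estimates.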

HUGE: Huge Unsupervised Graph Embeddings with TPUs

Graphs are a representation of structured data that captures the relationships between sets of objects. With the ubiquity of available network data, there is an increasing industrial and academic need to quickly analyze graphs with billions of nodes and trillions of edges. A common first step in network understanding is graph embedding, the process of creating a continuous representation of the nodes in a graph. A continuous representation is often more amenable, especially at scale, to solving downstream machine learning tasks such as classification, link prediction, and clustering. We present a high-performance graph embedding architecture that leverages Tensor Processing Units (TPUs) with configurable amounts of high-bandwidth memory, simplifies the graph embedding problem, and scales to graphs with billions of nodes and trillions of edges. We verify the quality of the embedding space on real and synthetic large-scale datasets.

Hierarchical Projection Enhanced Multi-behavior Recommendation

Various types of user behaviors are recorded in most real-world recommendation scenarios. To fully utilize the multi-behavior information, exploring the multiplex interactions among them is essential. Many multi-task-learning-based multi-behavior methods have been proposed recently to use multiple types of supervision signals and transfer information among them. Despite their great successes, these methods fail to design prediction tasks comprehensively, leading to insufficient utilization of multi-behavior correlative information. Moreover, these methods either weight expert information extracted from the coupled input or model information transfer between multiple behavior levels through task-specific extractors, and are usually accompanied by a negative transfer phenomenon. To address these problems, we propose a multi-behavior recommendation framework called Hierarchical Projection Enhanced Multi-behavior Recommendation (HPMR). The key module, the Projection-based Transfer Network (PTN), uses a projection mechanism to "explicitly" model the correlations between upstream and downstream behaviors, refines the upstream behavior representations, and fully uses the refined representations to enhance the learning of downstream tasks. Offline experiments on public and industrial datasets and an online A/B test further verify the effectiveness of HPMR in modeling the associations from upstream to downstream behaviors and alleviating negative transfer. The source code and datasets are available at https://github.com/MC-CV/HPMR.

Scenario-Adaptive Feature Interaction for Click-Through Rate Prediction

Traditional Click-Through Rate (CTR) prediction models are usually trained and deployed in a single scenario. However, large-scale commercial platforms usually contain multiple recommendation scenarios, whose traffic characteristics may be significantly different. Recent studies have shown that learning a unified model to serve multiple scenarios is effective in improving overall performance. However, most existing approaches suffer from various limitations, such as insufficient distinction modeling, inefficiency as the number of scenarios grows, and lack of interpretability. More importantly, as far as we know, none of the existing multi-scenario modeling approaches takes explicit feature interaction into consideration when modeling scenario distinctions, which limits the expressive power of the network and thus impairs performance. In this paper, we propose a novel Scenario-Adaptive Feature Interaction framework named SATrans, which models scenario discrepancy as distinct patterns in feature correlations. Specifically, SATrans is built on a Transformer architecture to learn high-order feature interactions and involves the scenario information in the modeling of self-attention to capture distribution shifts across scenarios. We provide various implementations of our framework to boost performance, and experiments on both public and industrial datasets show that SATrans 1) significantly outperforms existing state-of-the-art approaches for prediction, 2) is parameter-efficient, as the space complexity grows only marginally with the number of scenarios, and 3) offers good interpretability at both the instance level and the scenario level. We have deployed the model on the WeChat Official Account Platform and have observed an average online CTR increase of more than 2.84% in three major scenarios.

Deep Offline Reinforcement Learning for Real-world Treatment Optimization Applications

There is increasing interest in data-driven approaches for recommending optimal treatment strategies in many chronic disease management and critical care applications. Reinforcement learning (RL) methods are well suited to this sequential decision-making problem, but must be trained and evaluated exclusively on retrospective medical record datasets, as direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of treatment optimization studies use off-policy RL methods (e.g., Double Deep Q Networks (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative, but challenges remain in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints must be satisfied. In this work, we introduce a practical and theoretically grounded transition sampling approach to address action imbalance during offline RL training. We perform extensive experiments on two real-world tasks for diabetes and sepsis treatment optimization to compare the performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes and in consistency with relevant practice and safety guidelines.
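The action-imbalance problem can be illustrated with a simple inverse-frequency transition sampler (an illustrative stand-in only; the paper's theoretically grounded sampling scheme is not reproduced here):

```python
import random
from collections import Counter

def make_balanced_sampler(transitions, seed=0):
    """Sample transitions with probability inversely proportional to how
    often their action occurs in the dataset, so that rare (but possibly
    important) treatment actions are seen as often as dominant ones
    during offline RL training."""
    counts = Counter(t["action"] for t in transitions)
    weights = [1.0 / counts[t["action"]] for t in transitions]
    rng = random.Random(seed)

    def sample(batch_size):
        return rng.choices(transitions, weights=weights, k=batch_size)

    return sample

# toy dataset: action 0 dominates action 1 by 9:1
data = [{"action": 0}] * 90 + [{"action": 1}] * 10
sampler = make_balanced_sampler(data)
batch = sampler(1000)
```

Under uniform sampling the rare action would appear in roughly 10% of a batch; inverse-frequency weighting gives each action class equal total probability mass.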

Deep Landscape Forecasting in Multi-Slot Real-Time Bidding

Real-Time Bidding (RTB) has shown remarkable success in display advertising and has been employed in other advertising scenarios, e.g., sponsored search advertising with multiple ad slots. Many current RTB techniques built for single-slot display advertising are thus no longer applicable, especially for bid landscape forecasting. Landscape forecasting predicts market competition, including the highest bid price and the winning probability, and is a preliminary and crucial step for the subsequent design of bidding strategies. In multi-slot advertising, predicting the winning price for each position requires more precisely differentiating bids among top advertisers. Furthermore, defining the winning probability and addressing censorship issues are not as straightforward as in the single-slot case. In view of these challenges, how to forecast the bidding landscape in a multi-slot environment remains an open problem.

In this work, we are the first to study the landscape forecasting problem in multi-slot RTB, considering the correlation between ad slots in the same pageview. Specifically, we formulate the research topic as two subproblems: predicting the distribution of the winning price and predicting the winning probability of a bid price for each position. Based on observations from production data and survival analysis techniques, we propose a deep recurrent model to predict the distribution of the winning price as well as the winning probability for each position. A comprehensive loss function is proposed to learn from censored data. Experiments on two public semi-synthetic datasets and one private industrial dataset demonstrate the effectiveness of our method.
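In the survival-analysis view of censored auction data, winning probabilities follow from per-price-level hazards. The following is a textbook discrete-time formulation (not necessarily the paper's exact parameterization):

```python
def winning_probability(bid, price_grid, hazards):
    """Discrete-time survival view of the bid landscape.
    hazards[j] = P(winning price == price_grid[j] | winning price >= price_grid[j]).
    The survival function S(b) = prod over price levels below b of (1 - h_j)
    is the probability the market price reaches b, so
    P(win with bid b) = 1 - S(b)."""
    surv = 1.0
    for p, h in zip(price_grid, hazards):
        if p >= bid:
            break
        surv *= 1.0 - h
    return 1.0 - surv

grid = [1, 2, 3, 4]       # hypothetical discretized price levels
haz = [0.3, 0.2, 0.4, 0.1]  # hypothetical estimated hazards
```

Hazards can be estimated from censored logs because each lost auction still reveals that the winning price exceeded the losing bid.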

Rewiring Police Officer Training Networks to Reduce Forecasted Use of Force

Research has shown that police officer-involved shootings, misconduct, and excessive use of force complaints exhibit network effects: officers are at greater risk of being involved in these incidents when they socialize with officers who have a history of use of force and misconduct. In this work, we first construct a network survival model for the time to use-of-force incidents involving new police trainees. The model includes network effects of the diffusion of risk from field training officer (FTO) to trainee. We then introduce a network rewiring algorithm to maximize the expected time to use-of-force events upon completion of field training. We study several versions of the algorithm, including constraints that encourage demographic diversity of FTOs. Using data from Indianapolis, we show that rewiring the network can increase the expected time (in days) to a recruit's first use-of-force incident by 8%. We then discuss the potential benefits and challenges of implementing such an algorithm in practice.

Extreme Multi-Label Classification for Ad Targeting using Factorization Machines

Applications involving Extreme Multi-Label Classification (XMLC) face several practical challenges with respect to scale, model size, and prediction latency, while needing to maintain satisfactory predictive accuracy. In this paper, we propose a Multi-Label Factorization Machine (MLFM) model that addresses some of these challenges. We use behavioral ad targeting as a case study to illustrate the benefits of the MLFM model. Predicting user qualifications for targeting segments plays a major role in both personalization and real-time bidding. Given the large number of segments and the prediction-time requirements of real-world production systems, building scalable models is often difficult and computationally burdensome. To cope with these challenges, we (1) reformulate the problem of assigning users to segments as an extreme multi-label classification (XMLC) problem, and (2) leverage the benefits of the conventional FM model and generalize its capacity to joint prediction across a large number of targeting segments. We show that the MLFM model is both effective and computationally efficient compared to several baseline models on publicly available datasets in addition to the targeting use case.
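The conventional FM score that MLFM builds on can be sketched as follows, using the standard O(kn) computation trick; the multi-label generalization itself is only hinted at in the comment, since its exact parameterization is not given here:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order factorization machine:
    y = w0 + w.x + sum_{i<j} <V_i, V_j> x_i x_j,
    computed in O(k*n) via
    0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ].
    A multi-label variant in the spirit of MLFM could share the embedding
    matrix V across labels and keep one inexpensive head per segment."""
    interactions = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return w0 + x @ w + interactions

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 2.0, 1.0])
V = rng.normal(size=(4, 3))  # n=4 features, k=3 factors
w = rng.normal(size=4)
score = fm_score(x, 0.1, w, V)
```

The O(kn) identity is what keeps prediction latency manageable when scoring a user against many segments.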

Entity-aware Multi-task Learning for Query Understanding at Walmart

Query Understanding (QU) is a fundamental process in e-commerce search engines that extracts the shopping intents of customers. It usually includes a set of different tasks such as named entity recognition and query classification. Traditional approaches often tackle each task separately with its own network, which leads to excessive development and maintenance workload as well as increased latency and resource usage in large-scale e-commerce platforms. To tackle these challenges, this paper presents a multi-task learning approach to query understanding at Walmart. We experimented with several state-of-the-art multi-task learning architectures, including MTDNN, MMoE, and PLE. Furthermore, we propose a novel large-scale entity-aware multi-task learning model (EAMT) that retrieves entities from engagement data as query context to augment the query representation. To the best of our knowledge, there is no prior work on multi-task learning for e-commerce query understanding. Comprehensive offline experiments on industry-scale datasets (up to 965M queries) illustrate the effectiveness of our approach. Results from online experiments show substantial gains in key accuracy and latency metrics. https://github.com/zhiyuanpeng/KDD2023-EAMT

Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks

In this work, besides improving prediction accuracy, we study whether personalization could bring robustness benefits against backdoor attacks. We conduct the first study of backdoor attacks in the personalized federated learning (pFL) framework, testing 4 widely used backdoor attacks against 6 pFL methods on the benchmark datasets FEMNIST and CIFAR-10, for a total of 600 experiments. The study shows that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks, whereas pFL methods with full model-sharing do not. To analyze the reasons for the varying robustness, we provide comprehensive ablation studies on the different pFL methods. Based on our findings, we further propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks. We believe our work provides guidance for pFL applications in terms of robustness and offers valuable insights for designing more robust FL methods in the future. We open-source our code to establish the first benchmark for black-box backdoor attacks in pFL: https://github.com/alibaba/FederatedScope/tree/backdoor-bench.

NFT-Based Data Marketplace with Digital Watermarking

In today's digital world, enterprises and individuals generate massive amounts of data that is potentially useful to data consumers with data-driven applications. The emergence of data marketplaces is a step toward helping data owners monetize their digital assets and connect with potential buyers. Current data marketplaces cannot handle challenges related to data ownership claims, illegal redistribution, and data ownership traceability. To overcome these problems in a general-purpose market, we propose a marketplace based on watermarking and Non-Fungible Token (NFT) technologies. In the proposed NFT-based marketplace, the owner's data is stored as an NFT whose underlying content holds the watermarked data. The watermarked data is obtained by embedding information about the owner and the buyer into the original data; this embedded information can later be extracted to identify the owner and the buyer of the traded data. Furthermore, the transactions corresponding to the NFT provide verifiable ownership proof and a traceable ownership history. A Proof-of-Concept (POC) implementation of the proposed marketplace for trading image data, to be integrated within the AI-Gallery Data Marketplace service in Huawei Cloud, is presented. An extensive set of experiments measures the gas consumption on the blockchain and evaluates the robustness of the watermarked assets against 51 attacks. Finally, a method based on error-correcting codes is proposed to improve the watermarking robustness in the implemented marketplace. The link to the code and the POC demo is provided in the appendix.

un-xPass: Measuring Soccer Player's Creativity

Creativity is highly valued in soccer players. It contributes to exciting and unpredictable play, which can help teams overcome defensive strategies and create scoring opportunities. Consequently, evaluating the creative abilities of players is an important aspect of the player recruitment process. However, there is currently no clear way to measure creativity in soccer. It is not captured by typical result-based performance indicators, as being creative entails going beyond just doing something useful to accomplishing something useful in a unique or atypical way. Therefore, in this paper, we define a novel metric to quantify the level of creativity involved in a player's passes. Our Creative Decision Rating (CDR) uses machine learning techniques to assess two important factors: the originality of a pass and its value in terms of increasing the team's chances of scoring a goal. We validated our metric on StatsBomb 360 contextual event stream data from the 2021/22 English Premier League season and show through a number of use cases that it provides another angle on a player's skill, complementing existing player evaluation metrics. Overall, our metric provides a concise method for capturing and quantifying the creativity of soccer players and could have important implications for player recruitment and talent development in the sport.

End-to-End Query Term Weighting

Bag-of-words-based lexical retrieval systems are still the most commonly used methods in real-world search applications. Recently, deep learning methods have shown promising results for improving retrieval performance, but they are expensive to run online, non-trivial to integrate into existing production systems, and may not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict a weight for each n-gram (e.g., unigram and bigram) query input term. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs, we can compute a ranking loss over these matching scores, optimizing the learned query term weights end-to-end. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep-learning-based search methods would require further infrastructure optimization and additional hardware. The learned weights can easily be used by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines on MSMARCO and in out-of-domain retrieval on TREC datasets.
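A minimal sketch of how per-term weights plug into BM25 scoring (the weights here are hard-coded stand-ins for what a TW-BERT-style model would predict; all names are illustrative):

```python
def weighted_bm25(term_weights, doc_tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """BM25 with a learned weight per query term: each term's standard BM25
    contribution is scaled by its weight. Setting every weight to 1.0
    recovers plain BM25, so a lexical retriever needs no other changes."""
    score = 0.0
    for term, weight in term_weights.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # standard BM25 term-frequency saturation and length normalization
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += weight * idf.get(term, 0.0) * norm
    return score

idf = {"wireless": 2.1, "mouse": 1.3}          # hypothetical corpus statistics
doc = {"wireless": 2, "mouse": 1}               # term frequencies in one document
uniform = weighted_bm25({"wireless": 1.0, "mouse": 1.0}, doc, 12, 10, idf)
upweighted = weighted_bm25({"wireless": 2.0, "mouse": 1.0}, doc, 12, 10, idf)
```

Because the weights enter the scorer linearly, a ranking loss over query-document scores can be backpropagated to the weight-predicting model end-to-end.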

UnifieR: A Unified Retriever for Large-Scale Retrieval

Large-scale retrieval aims to recall relevant documents from a huge collection given a query. It relies on representation learning to embed documents and queries into a common semantic encoding space. According to the encoding space, recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into dense-vector and lexicon-based paradigms. These two paradigms unveil PLMs' representation capability at different granularities, i.e., global sequence-level compression and local word-level contexts, respectively. Inspired by their complementary global-local contextualization and distinct representational views, we propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability. Experiments on passage retrieval benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme is further presented with even better retrieval quality. We lastly evaluate the model on the BEIR benchmark to verify its transferability.

Rover: An Online Spark SQL Tuning Service via Generalized Transfer Learning

Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL highly depends on the choice of configurations, and the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from a re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) history tasks should be carefully utilized, since using dissimilar ones leads to deteriorated performance in production.

In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks within 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%.

Joint Optimization of Ranking and Calibration with Contextualized Hybrid Model

Despite the development of ranking optimization techniques, pointwise loss remains the dominant approach for click-through rate (CTR) prediction. This can be attributed to the calibration ability of pointwise loss, since the prediction can be viewed as the click probability. In practice, a CTR prediction model is also commonly assessed by its ranking ability, and ranking losses (e.g., pairwise or listwise loss) can be adopted to optimize it, as they usually achieve better rankings than pointwise loss. Previous studies have experimented with a direct combination of the two losses to obtain the benefits of both and observed improved performance. However, this direct combination breaks the interpretation of the output logit as the click-through rate, which may lead to sub-optimal solutions. To address this issue, we propose an approach that Jointly optimizes the Ranking and Calibration abilities (JRC for short). JRC improves the ranking ability by contrasting the logit values of samples with different labels, and constrains the predicted probability to be a function of the logit subtraction. We further show that JRC consolidates the interpretation of the logits, which model the joint distribution. With this interpretation, we prove that JRC approximately optimizes a contextualized hybrid discriminative-generative objective. Experiments on public and industrial datasets and online A/B testing show that our approach improves both ranking and calibration abilities. Since May 2022, JRC has been deployed on the display advertising platform of Alibaba and has obtained significant performance improvements.
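The logit-subtraction idea can be sketched numerically. The snippet below is an illustrative simplification, not JRC's exact listwise formulation: each sample gets a click logit and a non-click logit, the calibrated probability is a function of their difference, and a separate pairwise term contrasts logits across samples with different labels:

```python
import numpy as np

def jrc_style_losses(z_click, z_nonclick, labels):
    """Calibration term: p = sigmoid(z_click - z_nonclick), scored with
    pointwise log loss, so the prediction stays a click probability.
    Ranking term: pairwise logistic loss contrasting z_click between
    clicked and unclicked samples (a stand-in for JRC's listwise term)."""
    p = 1.0 / (1.0 + np.exp(-(z_click - z_nonclick)))
    eps = 1e-12
    calibration_loss = -np.mean(
        labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)
    )
    diffs = z_click[labels == 1][:, None] - z_click[labels == 0][None, :]
    ranking_loss = np.mean(np.log1p(np.exp(-diffs)))
    return calibration_loss, ranking_loss, p

labels = np.array([1.0, 0.0, 1.0, 0.0])
z_click = np.array([3.0, -2.0, 2.5, -1.0])
z_nonclick = np.array([-3.0, 2.0, -2.5, 1.0])
calib, rank, p = jrc_style_losses(z_click, z_nonclick, labels)
```

Because the probability depends only on the logit subtraction, the ranking term can reshape the logits without destroying the probabilistic reading of the output.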

PIER: Permutation-Level Interest-Based End-to-End Re-ranking Framework in E-commerce

Re-ranking has drawn increasing attention from both academia and industry; it rearranges the ranking list by modeling the mutual influence among items to better meet users' demands. Many existing re-ranking methods directly take the initial ranking list as input and generate the optimal permutation through a well-designed context-wise model, which brings the evaluation-before-reranking problem. Meanwhile, evaluating all candidate permutations incurs unacceptable computational costs in practice. Thus, to better balance efficiency and effectiveness, online systems usually use a two-stage architecture: heuristic methods such as beam search first generate a suitable number of candidate permutations, which are then fed into an evaluation model to obtain the optimal permutation. However, existing methods in both stages can be improved in the following respects. In the generation stage, heuristic methods use only point-wise prediction scores and lack effective judgment. In the evaluation stage, most existing context-wise evaluation models consider only the item context and lack finer-grained feature context modeling.

This paper presents a novel end-to-end re-ranking framework named PIER to tackle the above challenges. It follows the two-stage architecture and contains two main modules, FPSM and OCPM. Inspired by long-term user behavior modeling methods, FPSM applies SimHash to efficiently select top-K candidates from the full permutation space based on the user's permutation-level interest. We then design a novel omnidirectional attention mechanism in OCPM to better capture the context information in the permutation. Finally, we jointly train these two modules end-to-end by introducing a comparative learning loss, which uses the predicted value of OCPM to guide FPSM to generate better permutations. Offline experiment results demonstrate that PIER outperforms baseline models on both public and industrial datasets, and we have successfully deployed PIER on the Meituan food delivery platform.
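The SimHash-based shortlisting can be illustrated with the classic fingerprinting scheme (a generic SimHash sketch; how FPSM actually builds its permutation-level feature sets is not shown here):

```python
import hashlib

def simhash(features, bits=64):
    """Classic SimHash: hash each feature, accumulate +1/-1 per bit
    position, and keep the sign. Similar feature sets get fingerprints
    with small Hamming distance, so top-K candidates can be shortlisted
    with cheap bit operations instead of scoring every permutation."""
    acc = [0] * bits
    for f in features:
        h = int(hashlib.md5(f.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if acc[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

base = [f"item_{i}" for i in range(30)]          # hypothetical feature set
similar = base[:28] + ["item_x", "item_y"]        # near-duplicate set
disjoint = [f"other_{i}" for i in range(30)]      # unrelated set
```

Ranking candidates by Hamming distance to a query fingerprint is what makes this kind of filtering efficient at serving time.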

QTNet: Theory-based Queue Length Prediction for Urban Traffic

Smart traffic management is the cornerstone of Intelligent Transport Systems (ITS). To achieve smooth travel in urban road networks, ITS provide software-based traffic management based on traffic forecasts. Recently, spatial-temporal graph neural networks (STGNNs) have achieved significant improvements in traffic forecasting by taking into account spatial and temporal dependencies in traffic data. However, despite being an indispensable statistic for traffic management in urban areas, the length of congestion queues has not been a prediction target. In addition, existing methods have not considered the use of multimodal traffic data for forecasting. Moreover, given the significant real-world impact of ITS, black-box predictions with little explainability are unreliable. In this paper, we propose a Queueing-theory-based Neural Network (QTNet), which combines data-driven STGNN methods with queueing-theory-based domain knowledge from traffic engineering to achieve accurate and explainable predictions. In our queue length prediction experiments using a real-world dataset collected in urban areas of Tokyo, QTNet outperformed baseline methods, including state-of-the-art STGNNs, by 12.6% in RMSE and 9.9% in MAE, and for severe congestion in particular, by 8.1% and 8.4%, respectively.
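For a flavor of the queueing-theory quantities such a model can embed as domain knowledge, here is a textbook M/M/1 result (illustrative only; QTNet's actual traffic-engineering formulation is not given here):

```python
def mm1_mean_queue(arrival_rate, service_rate):
    """M/M/1 steady state: with utilization rho = lambda/mu < 1, the
    expected number of vehicles in the system is L = rho / (1 - rho).
    Little's law then gives the mean time in system W = L / lambda."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return rho / (1.0 - rho)
```

The sharp blow-up of L as rho approaches 1 is exactly the severe-congestion regime where purely data-driven models tend to struggle and theory-based structure helps.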

Deep Transfer Learning for City-scale Cellular Traffic Generation through Urban Knowledge Graph

The problem of cellular traffic generation in cities without historical traffic data is critical and urgently needs to be solved to assist 5G base station deployment in mobile networks. In this paper, we propose ADAPTIVE, a deep transfer learning framework for city-scale cellular traffic generation through an urban knowledge graph. ADAPTIVE leverages historical data from other cities that have deployed 5G networks to assist cities newly deploying 5G networks through deep transfer learning. Specifically, ADAPTIVE aligns the representations of base stations in the target and source cities while considering the environmental factors of cities, the spatial and environmental contextual relations between base stations, and the temporal traffic patterns at base stations. We then design a feature-enhanced generative adversarial network, trained on the historical traffic data and base station representations of the source city. By feeding the aligned base station representations of the target city into the trained model, we obtain the generated traffic data for the target city. Extensive experiments on real-world cellular traffic datasets show that ADAPTIVE generally outperforms state-of-the-art baselines by more than 40% in terms of Jensen-Shannon divergence and root-mean-square error, and exhibits strong robustness across various cross-city experiments. ADAPTIVE has been successfully deployed on the 'Jiutian' Artificial Intelligence Platform of China Mobile to support cellular traffic generation and assist in the construction and operation of mobile networks.

Hierarchical Reinforcement Learning for Dynamic Autonomous Vehicle Navigation at Intelligent Intersections

Recent years have witnessed the rapid development of the Cooperative Vehicle Infrastructure System (CVIS), in which road infrastructures such as traffic lights (TLs) and autonomous vehicles (AVs) share information and work collaboratively to provide a safer and more comfortable transportation experience. While many efforts have been made to develop efficient and sustainable CVIS solutions, existing approaches for urban intersections rely heavily on domain knowledge and physical assumptions, preventing their practical application. To this end, this paper proposes NavTL, a learning-based framework that jointly controls traffic signal plans and autonomous vehicle rerouting in mixed traffic scenarios where human-driven vehicles and AVs co-exist. The objective is to improve travel efficiency and reduce total travel time by minimizing congestion at intersections while guiding AVs to avoid temporarily congested roads. Specifically, we design a graph-enhanced, multi-agent, decentralized, bi-directional hierarchical reinforcement learning framework that treats TLs as manager agents and AVs as worker agents. At lower-temporal-resolution timesteps, each manager sets a goal for the workers within its controlled region while learning to take signal actions based on its observations of the environment and intention information extracted from its workers. At higher-temporal-resolution timesteps, each worker makes rerouting decisions along its way to its destination based on its observations of the environment, an intention-enhanced manager state representation, and a goal from its current manager. Finally, extensive experiments on one synthetic and two real-world network-level datasets demonstrate the effectiveness of the proposed framework in improving travel efficiency.

TrustGeo: Uncertainty-Aware Dynamic Graph Learning for Trustworthy IP Geolocation

The rising popularity of online social network services has attracted a lot of research focusing on mining various user patterns. Among them, accurate IP geolocation is essential for a plethora of location-aware applications. However, despite extensive research efforts and significant advances, the "accurate and reliable" desideratum is yet to be achieved at a higher quality level. This work presents a graph neural network (GNN)-based model, called TrustGeo, for trustworthy street-level IP geolocation. A distinct and important aspect of TrustGeo is the incorporation of sources of uncertainty in the learning process. The results of our extensive experimental evaluations on three real-world datasets demonstrate the superiority of our framework in significantly improving the accuracy and trustworthiness of street-level IP geolocation. Our code and datasets are available at https://github.com/ICDM-UESTC/TrustGeo.

Optimizing Airbnb Search Journey with Multi-task Learning

At Airbnb, an online marketplace for stays and experiences, guests often spend weeks exploring and comparing multiple items before making a final reservation request. Each reservation request may then potentially be rejected or cancelled by the host prior to check-in. The long and exploratory nature of the search journey, as well as the need to balance both guest and host preferences, present unique challenges for Airbnb search ranking. In this paper, we present Journey Ranker, a new multi-task deep learning model architecture that addresses these challenges. Journey Ranker leverages intermediate guest actions as milestones, both positive and negative, to better progress the guest towards a successful booking. It also uses contextual information such as guest state and search query to balance guest and host preferences. Its modular and extensible design, consisting of four modules with clear separation of concerns, allows for easy application to use cases beyond the Airbnb search ranking context. We conducted offline and online testing of the Journey Ranker and successfully deployed it in production to four different Airbnb products with significant business metrics improvements.

Improving Training Stability for Multitask Ranking Models in Recommender Systems

Recommender systems play an important role in many content platforms. While most recommendation research is dedicated to designing better models to improve user experience, we found that research on stabilizing the training of such models is severely under-explored. As recommendation models become larger and more sophisticated, they are more susceptible to training instability issues, i.e., loss divergence, which can make a model unusable, waste significant resources, and block model development. In this paper, we share the findings and best practices we learned while improving the training stability of a real-world multitask ranking model for YouTube recommendations. We show some properties of the model that lead to unstable training and conjecture on the causes. Furthermore, based on our observations of training dynamics near the point of training instability, we hypothesize why existing solutions would fail, and propose a new algorithm to mitigate the limitations of existing solutions. Our experiments on a YouTube production dataset show the proposed algorithm can significantly improve training stability while not compromising convergence, compared with several commonly used baseline methods.
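
The abstract does not spell out the proposed algorithm, but a minimal sketch of one standard mitigation it would be compared against, global gradient-norm clipping, illustrates the kind of intervention involved (all names and values here are illustrative, not the paper's method):

```python
def clipped_update(params, grads, lr=0.1, max_norm=1.0):
    # Global gradient-norm clipping: if the gradient norm exceeds max_norm,
    # rescale the whole gradient before applying the update. A common,
    # generic defense against loss divergence (not the paper's algorithm,
    # which the abstract does not specify).
    norm = sum(g * g for g in grads) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [p - lr * scale * g for p, g in zip(params, grads)]

# a spiking gradient (norm 50) is rescaled to norm 1 before the update
new_params = clipped_update([0.0, 0.0], [30.0, 40.0])
```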

Counterfactual Video Recommendation for Duration Debiasing

Duration bias widely exists in video recommendation, where models tend to recommend short videos due to their higher ratio of finished plays, and thus may fail to capture users' true interests. In this paper, we eliminate duration bias from both the data and the model. First, based on extensive data analysis, we observe that the play completion rate of videos with the same duration presents a bimodal distribution. Hence, we propose to perform threshold division to construct binary training labels, alleviating the drawback of finish-playing labels being overly biased towards short videos. Algorithmically, we resort to causal inference, which enables us to inspect the causal relationships of video recommendation with a causal graph. We identify that duration has two kinds of effect on prediction: direct and indirect. Duration bias lies in the direct effect, while the indirect effect benefits prediction. To this end, we design a model-agnostic Counterfactual Video Recommendation for Duration Debiasing (CVRDD) framework, which incorporates multi-task learning to estimate the different causal effects during training. In the inference phase, we perform counterfactual inference to remove the direct effect of duration for unbiased prediction. We conduct experiments on two industrial datasets, and in addition to achieving highly promising results on traditional top-k recommendation metrics, CVRDD also improves user watch time.
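
The counterfactual inference step can be sketched abstractly: subtract the estimated direct effect of duration from the total prediction, keeping only the indirect (interest-driven) effect. The predictors and coefficients below are hypothetical stand-ins, not CVRDD's actual models:

```python
def debiased_score(model_full, model_duration, features, duration):
    # Counterfactual duration debiasing, sketched: total prediction minus
    # the direct (bias) effect that duration alone contributes.
    # model_full and model_duration are hypothetical trained predictors.
    total = model_full(features, duration)    # total effect
    direct = model_duration(duration)         # direct effect of duration
    return total - direct                     # indirect (interest) effect

# toy linear predictors: interest signal 0.5*f, duration bias 0.3*d
model_full = lambda f, d: 0.5 * f + 0.3 * d
model_duration = lambda d: 0.3 * d
score = debiased_score(model_full, model_duration, features=2.0, duration=60.0)
```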

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

Recently, a new paradigm called Differentiable Search Index (DSI) has been proposed for document retrieval, wherein a sequence-to-sequence model is learned to directly map queries to relevant document identifiers. The key idea behind DSI is to fully parameterize traditional "index-retrieve" pipelines within a single neural model, by encoding all documents in the corpus into the model parameters. In essence, DSI needs to resolve two major questions: (1) how to assign an identifier to each document, and (2) how to learn the associations between a document and its identifier. In this work, we propose a Semantic-Enhanced DSI model (SE-DSI) motivated by Learning Strategies in the area of Cognitive Psychology. Our approach advances the original DSI in two ways: (1) For the document identifier, we take inspiration from Elaboration Strategies in human learning. Specifically, we assign each document an Elaborative Description based on the query generation technique, which is more meaningful than a string of integers in the original DSI; and (2) For the associations between a document and its identifier, we take inspiration from Rehearsal Strategies in human learning. Specifically, we select fine-grained semantic features from a document as Rehearsal Contents to improve document memorization. Both offline and online experiments show improved retrieval performance over prevailing baselines.

Online Quality Prediction in Windshield Manufacturing using Data-Efficient Machine Learning

The digitization of manufacturing processes opens up the possibility of using machine learning methods on process data to predict future product quality. Based on the model predictions, quality improvement actions can be taken at an early stage. However, significant challenges must be overcome to successfully implement the predictions. Production lines are subject to hardware and memory limitations and are characterized by constant changes in quality influencing factors. In this paper, we address these challenges and present an online prediction approach for real-world manufacturing processes. On the one hand, it includes methods for feature extraction and selection from multimodal process and sensor data. On the other hand, a continual learning method based on memory-aware synapses is developed to efficiently train an artificial neural network over process changes. We deploy and evaluate the approach in a windshield production process. Our experimental evaluation shows that the model can accurately predict windshield quality and achieve significant process improvement. By comparing with other learning strategies such as transfer learning, we also show that the continual learning method both prevents catastrophic forgetting of the model and maintains its data efficiency.
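
The continual-learning idea based on memory-aware synapses can be sketched as a regularized loss: parameter changes are penalized in proportion to how important each parameter was under earlier process conditions. This is a sketch of the loss shape only; the importance weights and values are illustrative, and in practice the weights would be estimated from gradient magnitudes:

```python
def mas_regularized_loss(task_loss, params, old_params, importance, lam=1.0):
    # Memory-aware-synapses-style penalty: discourage changes to parameters
    # that mattered for earlier process conditions, mitigating catastrophic
    # forgetting while the network keeps training on new data.
    penalty = sum(w * (p - q) ** 2
                  for w, p, q in zip(importance, params, old_params))
    return task_loss + lam * penalty

# second parameter is "important" (weight 2.0), so drifting it costs more
loss = mas_regularized_loss(1.0, params=[1.0, 2.0],
                            old_params=[1.0, 1.0], importance=[0.5, 2.0])
```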

PASS: Personalized Advertiser-aware Sponsored Search

The nucleus of online sponsored search systems lies in measuring the relevance between the search intents of users and the advertising purposes of advertisers. Existing conventional doublet-based (query-keyword) relevance models solely rely on short queries and keywords to uncover such intents, which ignore the diverse and personalized preferences of participants (i.e., users and advertisers), resulting in undesirable advertising performance. In this paper, we investigate the novel problem of Personalized Advertiser-aware Sponsored Search (PASS). Our motivation lies in incorporating the portraits of users and advertisers into relevance models to facilitate the modeling of intrinsic search intents and advertising purposes, leading to a quadruple-based (i.e., user-query-keyword-advertiser) task. Various types of historical behaviors are explored in the format of hypergraphs to provide abundant signals on identifying the preferences of participants. A novel heterogeneous textual hypergraph transformer is further proposed to deeply fuse the textual semantics and the high-order hypergraph topology. Our proposal is extensively evaluated over real industry datasets, and experimental results demonstrate its superiority.

Towards Fairness in Personalized Ads Using Impression Variance Aware Reinforcement Learning

Variances in ad impression outcomes across demographic groups are increasingly considered to be potentially indicative of algorithmic bias in personalized ads systems. While there are many definitions of fairness that could be applicable in the context of personalized systems, we present a framework which we call the Variance Reduction System (VRS) for achieving more equitable outcomes in Meta's ads systems. VRS seeks to achieve a distribution of impressions with respect to selected protected class (PC) attributes that more closely aligns the demographics of an ad's eligible audience (a function of advertiser targeting criteria) with the audience who sees that ad, in a privacy-preserving manner. We first define metrics to quantify fairness gaps in terms of ad impression variances with respect to PC attributes including gender and estimated race. We then present the VRS for re-ranking ads in an impression variance-aware manner. We evaluate VRS via extensive simulations over different parameter choices and study the effect of the VRS on the chosen fairness metric. We finally present online A/B testing results from applying VRS to Meta's ads systems, concluding with a discussion of future work. We have deployed the VRS to all users in the US for housing ads, resulting in significant improvement in our fairness metric. VRS is the first large-scale deployed framework for pursuing fairness for multiple PC attributes in online advertising.
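
As a rough illustration of the kind of fairness gap VRS measures, one could compare the demographic mix of an ad's eligible audience with that of the users who actually saw it. The metric below (total-variation distance) and the group names are assumptions for illustration, not Meta's exact formulation:

```python
def impression_variance_gap(eligible, delivered):
    # Toy fairness gap: total-variation distance between the demographic
    # mix of the eligible audience and of users who actually saw the ad.
    # Counts per group are passed as dicts; 0.0 means perfect alignment.
    e, d = sum(eligible.values()), sum(delivered.values())
    return 0.5 * sum(abs(eligible[g] / e - delivered.get(g, 0) / d)
                     for g in eligible)

# eligible audience is 50/50, but delivery skews 80/20 toward group_a
gap = impression_variance_gap({"group_a": 50, "group_b": 50},
                              {"group_a": 80, "group_b": 20})
```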

Automatic Music Playlist Generation via Simulation-based Reinforcement Learning

Personalization of playlists is a common feature in music streaming services, but conventional techniques, such as collaborative filtering, rely on explicit assumptions regarding content quality to learn how to make recommendations. Such assumptions often result in misalignment between offline model objectives and online user satisfaction metrics. In this paper, we present a reinforcement learning framework that addresses these limitations by directly optimizing for user satisfaction metrics via the use of a simulated playlist-generation environment. Using this simulator we develop and train a modified Deep Q-Network, the action head DQN (AH-DQN), in a manner that addresses the challenges imposed by the large state and action space of our RL formulation. The resulting policy is capable of making recommendations from large and dynamic sets of candidate items with the expectation of maximizing consumption metrics. We analyze and evaluate agents offline via simulations that use environment models trained on both public and proprietary streaming datasets. We show how these agents lead to better user satisfaction metrics than baseline methods during online A/B tests. Finally, we demonstrate that performance assessments produced by our simulator are strongly correlated with observed online metric results.

Workplace Recommendation with Temporal Network Objectives

Workplace communication software such as Microsoft Teams, Slack, and Google Workspace have become integral to workplace collaboration, especially due to the rise of remote work. By making it easier to access relevant or useful information, recommender systems for these platforms have the potential to improve efficient cross-team information flow through a company's communication network. While there has been some recent work on recommendation approaches that optimize network objectives, these have focused on static graphs. In this work, we focus on optimizing information flow, which is highly temporal and presents a number of novel algorithmic challenges. To overcome these, we develop tractable measures of temporal information flow and design efficient online recommendation algorithms that jointly optimize for relevance and cross-team information flow. We demonstrate the potential for impact of these approaches on a rich multi-modal dataset capturing one month of communication between 180k Microsoft employees through email, chats and posts on Microsoft Teams, and file sharing on SharePoint. We design an offline model-based evaluation pipeline to estimate the effects of recommendations on the temporal communication network. We show that our recommendation algorithms can significantly improve cross-team information flow with only a small decrease in traditional relevance metrics.

Stabilising Job Survival Analysis for Disability Employment Services in Unseen Environments

In Disability Employment Services (DES), an emerging problem is to make job survival analysis stable in unseen environments without prior knowledge of these environments. Existing survival analysis methods cannot adequately solve this problem since they assume that the distribution of unseen data is similar to that observed during training. However, this assumption can be violated in practice, where unanticipated events such as COVID-19 and inflation can change the work and life patterns of people with disability. Models trained before the COVID-19 pandemic may make unreliable job survival predictions in COVID-19 or inflation situations. It is also costly and time-consuming to frequently re-train and deploy the models. This paper proposes a stable survival analysis method for the DES sector that does not require prior knowledge of deployment environments. Latent representations are learned to capture non-linear relationships between relevant features and job survival time. Two reweighting stages are developed to remove censoring and conditional spurious correlations between irrelevant features and the survival outcome. A case study of Australian workers with disability shows that our method can make stable risk predictions. It can also help workers with disability determine the most effective skills to improve in order to increase their job survival time. Further evaluations on public datasets show the promising stable performance of our method in other applications.

Fair Multilingual Vandalism Detection System for Wikipedia

This paper presents the design of a system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset covering 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling, to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production on Wikipedia, known as ORES. Our work results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient for a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.

Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines

Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings to keep data updated, so that ML models can be re-trained regularly and BI dashboards refreshed frequently. However, data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time. As modern enterprises operate thousands of recurring pipelines, data engineers today have to spend substantial effort manually monitoring and resolving DQ issues as part of their DataOps and MLOps practices.

Given the high human cost of managing large-scale pipeline operations, it is imperative that we can automate as much as possible. In this work, we propose Auto-Validate-by-History (AVH) that can automatically detect DQ issues in recurring pipelines, leveraging rich statistics from historical executions. We formalize this as an optimization problem, and develop constant-factor approximation algorithms with provable precision guarantees. Extensive evaluations using 2000 production data pipelines at Microsoft demonstrate the effectiveness and efficiency of AVH.
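
AVH's constant-factor approximation algorithms go well beyond an abstract-level summary, but the core intuition, flagging a run whose statistics deviate from historical executions, can be sketched with a simple z-score test (the metric and threshold below are illustrative, not AVH's actual constraints):

```python
from statistics import mean, stdev

def dq_check(history, current, k=3.0):
    # Flag a pipeline run whose metric (e.g. daily row count) deviates more
    # than k standard deviations from its historical executions. A toy
    # stand-in for history-based data-quality validation; the threshold k
    # is an illustrative choice.
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) > k * sigma

daily_row_counts = [1000, 1010, 990, 1005, 995]
ok_run = dq_check(daily_row_counts, 1002)    # within normal variation
bad_run = dq_check(daily_row_counts, 400)    # likely schema/data drift
```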

The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing, such as missing value imputation, to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus hurt test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves on MIM for high-dimensional data. Lastly, to demonstrate the utility of MIM on real-world data science tasks, we show the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
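
MIM itself is simple enough to sketch: mean-impute each feature, then append one binary indicator column per feature marking where values were missing. The pure-Python version below is a minimal illustration; scikit-learn exposes the same idea via `SimpleImputer(add_indicator=True)`:

```python
def mim_augment(rows):
    # Missing Indicator Method, sketched: mean-impute each feature and
    # append a 0/1 indicator column recording where values were missing.
    # Rows are lists of floats, with None marking a missing entry.
    n = len(rows[0])
    means = []
    for j in range(n):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    augmented = []
    for r in rows:
        imputed = [means[j] if r[j] is None else r[j] for j in range(n)]
        indicators = [1.0 if r[j] is None else 0.0 for j in range(n)]
        augmented.append(imputed + indicators)
    return augmented

X = [[1.0, None], [3.0, 4.0], [None, 8.0]]
X_mim = mim_augment(X)   # each row: 2 imputed features + 2 indicators
```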

Experimentation Platforms Meet Reinforcement Learning: Bayesian Sequential Decision-Making for Continuous Monitoring

With the growing needs of online A/B testing to support innovation in industry, the opportunity cost of running an experiment becomes non-negligible. Therefore, there is an increasing demand for an efficient continuous monitoring service that allows early stopping when appropriate. Classic statistical methods focus on hypothesis testing and are mostly developed for traditional high-stakes problems such as clinical trials, while experiments at online service companies typically have very different features and focuses. Motivated by these real needs, in this paper we introduce a novel framework that we developed at Amazon to maximize customer experience and control opportunity cost. We formulate the problem as a Bayesian optimal sequential decision-making problem with a unified utility function. We extensively discuss practical design choices and considerations. We further introduce how to solve for the optimal decision rule via reinforcement learning and scale the solution. We show the effectiveness of this novel approach compared with existing methods via a large-scale meta-analysis of experiments at Amazon.

A Multi-stage Framework for Online Bonus Allocation Based on Constrained User Intent Detection

With the explosive development of e-commerce services, tens of millions of orders are generated every day on the Meituan platform. By allocating bonuses to new customers when they pay, the Meituan platform encourages them to use its own payment service for a better experience in the future. This can be formulated as a multi-choice knapsack problem (MCKP), and the mainstream solution is usually a two-stage method. The first stage is user intent detection, predicting the effect of each bonus treatment. This prediction then serves as the objective of the MCKP, which is solved in the second stage to obtain the optimal allocation strategy. However, this solution usually faces the following challenges: (1) In the user intent detection stage, due to the sparsity of interaction and noise, traditional multi-treatment effect estimation methods lack interpretability and may violate the domain knowledge from economic theory that the marginal gain is non-negative as the bonus amount increases. (2) There is an optimality gap between the two stages, which limits the upper bound of the optimal value obtained in the second stage. (3) Due to changes in the distribution of orders online, the actual cost consumption often violates the given budget limit. To address these challenges, we propose a framework that consists of three modules: a User Intent Detection Module, an Online Allocation Module, and a Feedback Control Module. In the User Intent Detection Module, we implicitly model the treatment increment based on deep representation learning and constrain it to be non-negative to enforce the monotonicity constraint. Then, to reduce the optimality gap, we further propose a convex constrained model to increase the upper bound of the optimal value. For the third challenge, to cope with the fluctuation of online bonus consumption, we leverage a feedback control strategy to make the actual cost approach the given budget limit more accurately. Finally, we conduct extensive offline and online experiments demonstrating the superiority of our proposed framework, which reduced customer acquisition costs by 5.07% and is still running online.
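
The second-stage MCKP allocation can be illustrated with the classic greedy heuristic that picks at most one bonus option per user by predicted-uplift-per-cost ratio until the budget runs out. This is a textbook baseline, not Meituan's convex-constrained solver; all names are illustrative:

```python
def greedy_mckp(options, budget):
    # Greedy sketch of the multi-choice knapsack (MCKP) allocation stage:
    # for each user, pick at most one bonus option, favouring the highest
    # uplift-per-cost ratio, without exceeding the total budget.
    # options: {user: [(cost, predicted_uplift), ...]}
    picks, spent = {}, 0.0
    candidates = [(uplift / cost, user, cost, uplift)
                  for user, opts in options.items()
                  for cost, uplift in opts]
    for ratio, user, cost, uplift in sorted(candidates, reverse=True):
        if user not in picks and spent + cost <= budget:
            picks[user] = (cost, uplift)
            spent += cost
    return picks, spent

picks, spent = greedy_mckp({"u1": [(1.0, 3.0), (2.0, 4.0)],
                            "u2": [(1.0, 1.0)]}, budget=2.0)
```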

BERT4CTR: An Efficient Framework to Combine Pre-trained Language Model with Non-textual Features for CTR Prediction

Although deep pre-trained language models have shown promising benefits in a large set of industrial scenarios, including Click-Through-Rate (CTR) prediction, how to integrate pre-trained language models that handle only textual signals into a prediction pipeline with non-textual features is challenging.

Up to now, two directions have been explored to integrate multi-modal inputs into the fine-tuning of pre-trained language models. The first consists of fusing the output of language models and non-textual features through an aggregation layer, resulting in an ensemble framework, where the cross-information between textual and non-textual inputs is learned only in the aggregation layer. The second consists of splitting and transforming non-textual features into fine-grained tokens that are fed, along with textual tokens, directly into the transformer layers of language models. However, by adding additional tokens, this approach increases the complexity of learning and inference.

In this paper we propose a novel framework, BERT4CTR, that addresses these limitations. The new framework leverages a Uni-Attention mechanism to benefit from the interactions between non-textual and textual features, while maintaining low training and inference time costs through dimensionality reduction. We demonstrate through comprehensive experiments on both public and commercial data that BERT4CTR significantly outperforms the state-of-the-art approaches for handling multi-modal inputs and is applicable to CTR prediction. Compared with the ensemble framework, BERT4CTR brings more than 0.4% AUC gain on both tested data sets with only a 7% increase in latency.

Interdependent Causal Networks for Root Cause Localization

The goal of root cause analysis is to identify the underlying causes of system problems by discovering and analyzing the causal structure from system monitoring data. It is indispensable for maintaining the stability and robustness of large-scale complex systems. Existing methods mainly focus on the construction of a single effective isolated causal network, whereas many real-world systems are complex and exhibit interdependent structures (i.e., multiple networks of a system are interconnected by cross-network links). In interdependent networks, the malfunctioning effects of problematic system entities can propagate to other networks or different levels of system entities. Consequently, ignoring the interdependency results in suboptimal root cause analysis outcomes.

In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery (TCD) and Individual Causal Discovery (ICD). The TCD component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walk with restarts to model the network propagation of a system fault. The ICD component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets validate the effectiveness of the proposed framework.
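
The random-walk-with-restart step used to model fault propagation can be sketched on a toy graph. The walk below is a generic RWR over forward edges scoring entities by visit probability, whereas REASON traces back along learned causal edges over interdependent networks; graph, restart node, and parameters are illustrative:

```python
def rwr(adj, restart, alpha=0.15, iters=100):
    # Random walk with restart: the walker follows edges with probability
    # (1 - alpha) and teleports back to the anomalous entity with
    # probability alpha. The stationary visit probabilities rank entities
    # as root-cause candidates. adj[i] lists node i's out-neighbours.
    n = len(adj)
    p = [1.0 / n] * n
    for _ in range(iters):
        q = [alpha if i == restart else 0.0 for i in range(n)]
        for i, mass in enumerate(p):
            nbrs = adj[i] or [i]                 # dangling node: self-loop
            share = (1.0 - alpha) * mass / len(nbrs)
            for j in nbrs:
                q[j] += share
        p = q
    return p

# toy 3-entity propagation cycle 0 -> 1 -> 2 -> 0, fault observed at entity 0
scores = rwr([[1], [2], [0]], restart=0)
```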

Macular: A Multi-Task Adversarial Framework for Cross-Lingual Natural Language Understanding

Cross-lingual natural language understanding (NLU) aims to train NLU models on a source language and apply the models to NLU tasks in target languages, and is a fundamental task for many cross-language applications. Most of the existing cross-lingual NLU models assume the existence of parallel corpora so that words and sentences in source and target languages could be aligned. However, the construction of such parallel corpora is expensive and sometimes infeasible. Motivated by this challenge, recent works propose data augmentation or adversarial training methods to reduce the reliance on external parallel corpora. In this paper, we propose an orthogonal and novel perspective to tackle this challenging cross-lingual NLU task (i.e., when parallel corpora are unavailable). We propose to conduct multi-task learning across different tasks for mutual performance improvement on both source and target languages. The proposed multi-task learning framework is complementary to existing studies and could be integrated with existing methods to further improve their performance on challenging cross-lingual NLU tasks.

Towards this end, we propose a multi-task adversarial framework for cross-lingual NLU, namely Macular. The proposed Macular includes a multi-task module and a task-specific module to infer both the common knowledge across tasks and unique task characteristics. More specifically, in the multi-task module, we incorporate a task adversarial loss into training to ensure the derivation of task-shared knowledge only by the representations. In the task-specific fine-tuning module, we extract task-specific knowledge which is not captured by the multi-task module. A task-level consistency loss is added to the training loss so that consistent predictions across a target task and an auxiliary task (i.e., the task that is the most similar to the target task) are achieved. A language adversarial loss is also incorporated so that knowledge can be transferred from source languages to target ones. To validate the effectiveness of the proposed Macular, we conduct extensive experiments on four public datasets including paraphrase identification, natural language understanding, question answering matching, and query advertisement matching. The experimental results show that the proposed Macular can outperform state-of-the-art cross-lingual NLU approaches.

ECGGAN: A Framework for Effective and Interpretable Electrocardiogram Anomaly Detection

The heart is the most important organ of the human body, and the electrocardiogram (ECG) is an essential tool for clinical monitoring of heart health and detecting cardiovascular diseases. Automatic detection of ECG anomalies is of great significance and clinical value in healthcare. However, performing automatic anomaly detection on ECG data is challenging because we not only need to accurately detect the anomalies but also need to provide clinically meaningful interpretation of the results. Existing works on automatic ECG anomaly detection either rely on hand-crafted feature extraction algorithms, which are typically too simple to deliver good performance, or on deep learning to extract features automatically, which is not interpretable.

In this paper, we propose ECGGAN, a novel reconstruction-based ECG anomaly detection framework. The key idea of ECGGAN is to make full use of the periodic structure of ECG, namely the beat, to learn the universal pattern of ECG from representative normal data. We establish a reconstruction model that takes leads as constraints to capture the unique characteristics of different leads in ECG data, and achieve accurate ECG-level anomaly detection by combining multiple leads. Experimental results on two real-world datasets and their mixture confirm that our method outperforms baselines in terms of precision, recall, F1-score, and AUC. In addition, ECGGAN can provide clinically meaningful interpretation of results by revealing the extent to which abnormal sites deviate from the normal pattern.

Fresh Content Needs More Attention: Multi-funnel Fresh Content Recommendation

A recommendation system serves as a conduit connecting users to an incredibly large, diverse and ever-growing collection of content. In practice, missing information on fresh (and tail) content needs to be filled in order for it to be exposed and discovered by its audience. Here we share our success stories in building a dedicated fresh content recommendation stack on a large commercial platform. To nominate fresh content, we built a multi-funnel nomination system that combines (i) a two-tower model with strong generalization power for coverage, and (ii) a sequence model with near real-time updates on user feedback for relevance. The multi-funnel setup effectively balances coverage and relevance. An in-depth study uncovers the relationship between users' activity levels and their proximity toward fresh content, which further motivates a contextual multi-funnel setup. Nominated fresh candidates are then scored and ranked by systems that account for prediction uncertainty, to further bootstrap content with little exposure. We evaluate the benefits of the dedicated fresh content recommendation stack, and the multi-funnel nomination system in particular, through user-corpus co-diverted live experiments. We conduct multiple rounds of live experiments on a commercial platform serving billions of users, demonstrating the efficacy of our proposed methods.

Learning to Discover Various Simpson's Paradoxes

Simpson's paradox is a well-known statistical phenomenon that has captured the attention of statisticians, mathematicians, and philosophers for more than a century. The paradox often confuses people when it appears in data, and ignoring it may lead to incorrect decisions. Recent studies have found many examples of Simpson's paradox in social data and proposed a few methods to detect the paradox automatically. However, these methods suffer from many limitations, such as being suitable only for categorical variables or for one specific paradox. To address these problems, we develop a learning-based approach to discover various Simpson's paradoxes. First, we propose a framework from a statistical perspective that unifies the currently known variants of Simpson's paradox. Second, we present a novel loss function, the Multi-group Pearson Correlation Coefficient (MPCC), to calculate the association strength between two variables across multiple subgroups. Then, we design a neural network model, coined SimNet, to automatically disaggregate data into multiple subgroups by optimizing the MPCC loss. Experiments on various datasets demonstrate that SimNet can discover various Simpson's paradoxes caused by discrete and continuous variables, and even by hidden variables. The code is available at https://github.com/ant-research/Learning-to-Discover-Various-Simpson-Paradoxes.
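The intuition behind an MPCC-style objective can be shown with a hedged sketch that simply averages Pearson correlations within subgroups; the paper's actual loss (e.g. its weighting and differentiable subgroup assignment) may differ:

```python
import numpy as np

def mpcc(x, y, groups):
    """Sketch of a Multi-group Pearson Correlation Coefficient: the
    average Pearson correlation of (x, y) computed within each subgroup."""
    corrs = []
    for g in np.unique(groups):
        xg, yg = x[groups == g], y[groups == g]
        if len(xg) < 2 or xg.std() == 0 or yg.std() == 0:
            continue
        corrs.append(np.corrcoef(xg, yg)[0, 1])
    return float(np.mean(corrs))

# A classic Simpson's setup: a positive trend inside each subgroup,
# but a negative trend when the subgroups are pooled.
x = np.array([1., 2., 3., 11., 12., 13.])
y = np.array([10., 11., 12., 1., 2., 3.])
groups = np.array([0, 0, 0, 1, 1, 1])

pooled = np.corrcoef(x, y)[0, 1]  # negative: pooled association
within = mpcc(x, y, groups)       # positive: within-group association
```

The sign flip between `pooled` and `within` is exactly the kind of disagreement a model could be trained to surface by searching for subgroupings that maximize it.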

Removing Camouflage and Revealing Collusion: Leveraging Gang-crime Pattern in Fraudster Detection

As one of the major threats to the healthy development of various online platforms, fraud is increasingly committed by gangs, since collusive fraudulent activities obtain illicit benefits more easily and with lower exposure risk. To detect fraudsters in a gang, spatio-temporal graph neural network models have been widely applied to detect both temporal and spatial collusive patterns. However, a closer look at real-world records of fraudsters reveals that fraud gangs usually conduct community-level camouflage of two types, i.e., temporal and spatial camouflage. Such camouflage can disguise gangs as benign communities by concealing collusive patterns, thus deceiving many existing graph neural network models. In the meantime, many existing graph neural network models suffer from the challenge of extreme sample imbalance caused by rare fraudsters hidden among massive numbers of users. To handle all these challenges, in this paper, we propose a generative adversarial network framework, named Adversarial Camouflage Detector (ACD), to detect fraudsters. Concretely, the ACD framework consists of four modules, in charge of community division, camouflage identification, fraudster detection, and camouflage generation, respectively. The first three modules form a discriminator that uses spatio-temporal graph neural networks as the foundation model and enhances fraudster detection by amplifying the gangs' collusive patterns through automatically identifying and removing camouflage. Meanwhile, the camouflage generation module plays the generator role, generating fraudster samples by competing against the discriminator to alleviate the challenge of sample imbalance and increase model robustness. Experimental results show that our proposed method outperforms other methods on real-world datasets.

Root Cause Analysis for Microservice Systems via Hierarchical Reinforcement Learning from Human Feedback

In microservice systems, the identification of root causes of anomalies is imperative for service reliability and business impact. This process is typically divided into two phases: (i) constructing a service dependency graph that outlines the sequence and structure of system components that are invoked, and (ii) localizing the root cause components using the graph, traces, logs, and Key Performance Indicators (KPIs) such as latency. However, neither phase is straightforward due to the highly dynamic and complex nature of the system, particularly in large-scale commercial architectures like Microsoft Exchange.

In this paper, we propose a new framework that employs Hierarchical Reinforcement Learning from Human Feedback (HRLHF) to address these challenges. Our framework leverages the static topology of the microservice system and efficiently employs the feedback of engineers to reduce uncertainty in the discovery of the service dependency graph. The framework utilizes reinforcement learning to reduce the number of queries required from O(N²) to O(1), enabling the construction of the dependency graph with high accuracy and minimal human effort. Additionally, we extend the discovered dependency graphs to window causal graphs that capture the characteristics of time series over a specified time period, resulting in improved root cause analysis accuracy and robustness. Evaluations on both real datasets from Microsoft Exchange and synthetic datasets with injected anomalies demonstrate superior performance on various metrics compared to state-of-the-art methods. It is worth mentioning that our framework has been integrated as a crucial component in the Microsoft M365 Exchange service.

ShuttleSet: A Human-Annotated Stroke-Level Singles Dataset for Badminton Tactical Analysis

With the recent progress in sports analytics, deep learning approaches have demonstrated the effectiveness of mining insights into players' tactics for improving performance quality and fan engagement. This is attributed to the availability of public ground-truth datasets. While a few datasets for turn-based sports are available for action detection, they severely lack structured source data and stroke-level records, since these require high-cost labeling efforts from domain experts and are hard to detect using automatic techniques. Consequently, the development of artificial intelligence approaches is significantly hindered when existing models are applied to more challenging structured turn-based sequences. In this paper, we present ShuttleSet, the largest publicly available badminton singles dataset with annotated stroke-level records. It contains 104 sets, 3,685 rallies, and 36,492 strokes in 44 matches played between 2018 and 2021 by 27 top-ranking men's singles and women's singles players. ShuttleSet is manually annotated with a computer-aided labeling tool to increase labeling efficiency and effectiveness; annotations include the shot type (chosen from 18 distinct classes), the corresponding hitting locations, and the locations of both players at each stroke. In the experiments, we provide multiple benchmarks (i.e., stroke influence, stroke forecasting, and movement forecasting) with baselines to illustrate the practicability of using ShuttleSet for turn-based analytics, which is expected to stimulate both the academic and sports communities. Over the past two years, a visualization platform has been deployed to illustrate the variability of analysis cases from ShuttleSet, enabling coaches to delve into players' tactical preferences with human-interactive interfaces; it was also used by national badminton teams during multiple international high-ranking matches.
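A stroke-level record of the kind ShuttleSet annotates might be represented as follows; the field names here are illustrative assumptions, not the dataset's actual released schema:

```python
from dataclasses import dataclass

@dataclass
class Stroke:
    """Hypothetical stroke-level record matching the annotations the
    abstract describes: shot type, hitting location, player locations."""
    match_id: str
    set_no: int       # 104 sets in total across 44 matches
    rally_no: int     # 3,685 rallies in total
    stroke_no: int    # position of this stroke within the rally
    player: str
    shot_type: str    # one of 18 annotated shot classes
    hit_x: float      # hitting location on court
    hit_y: float
    player_x: float   # location of the hitting player
    player_y: float
    opponent_x: float # location of the opposing player
    opponent_y: float

s = Stroke("2021_final", 1, 1, 1, "A", "serve", 0.5, 0.1, 0.5, 0.1, 0.5, 0.9)
```

Sequences of such records, grouped by rally, are what the stroke forecasting and movement forecasting benchmarks would consume.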

Contrastive Learning of Stress-specific Word Embedding for Social Media based Stress Detection

Detecting stress via users' social media posts has attracted increasing research interest in recent years. The majority of methods leverage word embeddings to represent each posted word as a vector, and then perform classification on the sequence of word vectors. To enhance the ability to distinguish words/phrases related to stressors and stressful emotions from others, in this study we present a stress-specific word embedding learning framework built upon the pre-trained language model BERT. Specifically, we formulate three self-supervised contrastive learning tasks with a joint learning objective: (1) the stressor discrimination task, designed to make the framework sensitive to words/phrases about stressors; (2) the stressor cluster discrimination task, designed to allow the framework to distinguish stressors into different categories; and (3) the stressful emotion discrimination task, designed to allow the framework to grasp words/phrases about stressful emotions. Our performance study shows that the learned stress-specific word embedding can significantly benefit social media based stress detection tasks, especially in the more practical scenarios with insufficient labeled data. In addition, we build two user-level social media based stress detection datasets that can help train machine learning models to facilitate human well-being.
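The three discrimination tasks are contrastive objectives; an InfoNCE-style loss is one common way to realize such a task, sketched below with random vectors. Whether the paper uses this exact loss is an assumption of the sketch:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: pull the anchor embedding (e.g. a
    stressor phrase) toward its positive and away from the negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

rng = np.random.default_rng(0)
anchor = rng.normal(size=8)
close = anchor + 0.01 * rng.normal(size=8)   # near-duplicate positive
far = [rng.normal(size=8) for _ in range(5)]  # unrelated phrases

loss_good = info_nce(anchor, close, far)             # well-matched pair
loss_bad = info_nce(anchor, far[0], far[1:] + [close])  # mismatched pair
```

Minimizing such a loss over (stressor, stressor) positives and (stressor, other) negatives is what pushes the fine-tuned BERT embeddings to separate stress-related phrases from the rest.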

Doctor Specific Tag Recommendation for Online Medical Record Management

With the rapid growth of online medical platforms, more and more doctors are willing to manage and communicate with patients via online services. Given the large volume and variety of patient conditions, identifying and classifying patients' medical records has become a crucial problem. To efficiently index these records, a common practice is to annotate them with semantically meaningful tags. However, manually labeling tags is impractical for doctors due to the thousands of possible tag candidates, which necessitates a tag recommender system. Due to the long-tail distribution of tags, the dominance of low-activity doctors, and the uniqueness of uploaded medical records, this task is rather challenging. This paper proposes an efficient doctor-specific tag recommendation framework for improved medical record management without side information. Specifically, we first utilize effective language models to learn the text representation. Then, we construct a doctor embedding learning module that enhances recommendation quality by integrating implicit information within text representations and considering latent tag correlations to make more accurate predictions. Extensive experimental results demonstrate the effectiveness of our framework for both all doctors (20% improvement) and low-activity doctors (10% improvement).

Exploiting Intent Evolution in E-commercial Query Recommendation

To better understand the search goals in user search sessions, recent query recommender systems explicitly model query reformulations, hoping to estimate the intents behind these reformulations and thus benefit next-query recommendation. However, in real-world e-commercial search scenarios, user intents are much more complicated and may evolve dynamically. Existing methods merely consider trivial reformulation intents from semantic aspects and fail to model dynamic reformulation intent flows in search sessions, leading to a sub-optimal capacity to recommend desired queries. To deal with these limitations, we first explicitly define six types of query reformulation intents according to the desired products of two consecutive queries. We then apply two self-attentive encoders on top of two pre-trained large language models to learn the transition dynamics from semantic query and intent reformulation sequences, respectively. We develop an intent-aware query decoder that utilizes the predicted intents to suggest the next queries. We instantiate such a framework as an Intent-aware Variational AutoEncoder (IVAE), under deployment at Amazon. We conduct comprehensive experiments on two real-world e-commercial datasets from Amazon and one public dataset from BestBuy. Specifically, IVAE improves Recall@15 by 25.44% and 60.47% on the two Amazon datasets and by 13.91% on BestBuy.

An Empirical Study of Selection Bias in Pinterest Ads Retrieval

Data selection bias has been a long-lasting challenge in the machine learning domain, especially in multi-stage recommendation systems, where the distribution of labeled items for model training is very different from that of the actual candidates at inference time. This distribution shift is even more prominent in the context of online advertising, where the user base is diverse and the platform contains a wide range of content. In this paper, we first investigate the data selection bias in the upper funnel (Ads Retrieval) of Pinterest's multi-cascade ads ranking system. We then conduct comprehensive experiments to assess the performance of various state-of-the-art methods, including transfer learning, adversarial learning, and unsupervised domain adaptation. Moreover, we introduce some modifications into the unsupervised domain adaptation approach and evaluate the performance of different variants of this modified method. Our online A/B experiments show that the modified version of unsupervised domain adaptation (MUDA) provides the largest improvements to the performance of Pinterest's advertisement ranking system, compared with both the other methods and the one used in current production.

VRDU: A Benchmark for Visually-rich Document Understanding

Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention in both academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as hierarchical entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line-items in an invoice. We plan to open-source the benchmark and the evaluation toolkit. We hope this helps the community make progress on these challenging tasks in extracting structured data from visually rich documents.

Sequence As Genes: An User Behavior Modeling Framework for Fraud Transaction Detection in E-commerce

With the explosive growth of e-commerce, detecting fraudulent transactions in real-world scenarios is becoming increasingly important for e-commerce platforms. Recently, several supervised approaches have been proposed that use user behavior sequences, which record the user's track on a platform and contain rich information for fraud transaction detection. Nevertheless, these methods suffer from the scarcity of labeled data in real-world scenarios. The remarkable recent pre-training methods in the Natural Language Processing (NLP) and Computer Vision (CV) domains offer a promising direction. However, user behavior sequences differ intrinsically from text, images, and videos. In this paper, we propose a novel and general user behavior pre-training framework, named Sequence As GEnes (SAGE), which provides a new perspective on user behavior modeling. Following the inspiration of treating sequences as genes, we carefully design the user behavior data organization paradigm and pre-training scheme. Specifically, we propose an efficient data organization paradigm inspired by the nature of DNA expression, which decouples the length of behavior sequences from the corresponding time spans. Also inspired by natural mechanisms in genetics, we propose two pre-training tasks, namely sequential mutation and sequential recombination, to improve the robustness and consistency of user behavior representations in complicated real-world scenarios. Extensive experiments on four distinct real-world fraud transaction detection scenarios demonstrate the effectiveness of our proposed framework.
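The two genetics-inspired pre-training tasks can be pictured as data augmentations over behavior-token sequences. The operators below are a plausible sketch; the paper's exact mutation and recombination schemes, and the behavior vocabulary, are assumptions here:

```python
import random

rng = random.Random(0)

def mutate(seq, vocab, rate=0.15):
    """'Sequential mutation' sketch: randomly replace behavior tokens."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in seq]

def recombine(seq_a, seq_b):
    """'Sequential recombination' sketch: splice two sequences at a
    crossover point, as in genetic recombination."""
    cut = rng.randrange(1, min(len(seq_a), len(seq_b)))
    return seq_a[:cut] + seq_b[cut:]

vocab = ["view", "click", "cart", "pay", "search"]
seq = ["view", "view", "click", "cart", "pay"]

aug = mutate(seq, vocab)
mix = recombine(seq, ["search", "search", "view", "click", "pay"])
```

A pre-training objective could then ask the encoder to detect which positions were mutated, or whether a sequence was recombined, pushing representations to be robust to such perturbations.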

RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in Preloaded Ads

To increase brand awareness, many advertisers sign contracts with advertising platforms to purchase traffic and deliver advertisements to target audiences. Over a whole delivery period, advertisers desire a certain impression count for their ads and expect delivery performance to be as good as possible. Advertising platforms employ real-time pacing algorithms to satisfy these demands. However, the delivery procedure is also affected by publishers. Preloading is a widely used strategy for many types of ads (e.g., video ads) to ensure that the response time for displaying is acceptable, which results in a delayed-impression phenomenon. In this paper, we focus on a new research problem of impression pacing for preloaded ads, and propose a Reinforcement Learning To Pace framework, RLTP. It learns a pacing agent that sequentially produces selection probabilities over the whole delivery period. To jointly optimize the objectives of impression count and delivery performance, RLTP employs a tailored reward estimator to satisfy the guaranteed impression count, penalize over-delivery, and maximize traffic value. Experiments on large-scale datasets verify that RLTP outperforms baselines by a large margin. We have deployed it online on our advertising platform, where it achieves significant uplift in delivery completion rate and click-through rate.
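The shape of such a tailored reward can be sketched as a scalar function trading off traffic value, the impression guarantee, and an over-delivery penalty; the weights and functional form below are illustrative assumptions, not RLTP's actual estimator:

```python
def pacing_reward(impressions, target, value, over_delivery_penalty=2.0):
    """Sketch of a pacing reward: keep traffic value, penalize falling
    short of the guaranteed impression target, and penalize over-delivery
    more heavily (penalty weight is a hypothetical choice)."""
    shortfall = max(target - impressions, 0)
    overshoot = max(impressions - target, 0)
    return value - shortfall / target - over_delivery_penalty * overshoot / target

# Delivering exactly at the target keeps the full traffic value.
r_on_target = pacing_reward(impressions=1000, target=1000, value=1.0)
r_under = pacing_reward(impressions=800, target=1000, value=1.0)
r_over = pacing_reward(impressions=1200, target=1000, value=1.0)
```

With the asymmetric penalty, the agent learns that overshooting the contracted count is worse than the same amount of undershoot, which is the usual contract structure for guaranteed delivery.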

DNet: Distributional Network for Distributional Individualized Treatment Effects

There is a growing interest in developing methods to estimate individualized treatment effects (ITEs) for various real-world applications, such as e-commerce and public health. This paper presents a novel architecture, called DNet, to infer distributional ITEs. DNet can learn the entire outcome distribution for each treatment, whereas most existing methods primarily focus on the conditional average treatment effect and ignore the conditional variance around its expectation. Additionally, our method excels in settings with heavy-tailed outcomes and outperforms state-of-the-art methods in extensive experiments on benchmark and real-world datasets. DNet has also been successfully deployed in a widely used mobile app with millions of daily active users.
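Learning an entire outcome distribution, rather than only its mean, is commonly done with quantile (pinball) losses; whether DNet uses this exact loss is an assumption of the sketch below:

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) loss at level tau: minimized when q_pred is the
    tau-quantile of y, so a grid of taus recovers the whole distribution."""
    diff = y - q_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))

# Fitting quantiles of outcomes under treatment and under control would
# yield a distributional ITE rather than only an average effect.
y = np.array([1.0, 2.0, 3.0, 4.0])
loss_median = pinball_loss(y, q_pred=np.full(4, 2.5), tau=0.5)  # tau=0.5: median
loss_p90 = pinball_loss(y, q_pred=np.full(4, 3.7), tau=0.9)     # tau=0.9: upper tail
```

For heavy-tailed outcomes, the upper-tail quantiles carry exactly the variance information that a conditional-mean model throws away.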

On-device Integrated Re-ranking with Heterogeneous Behavior Modeling

As an emerging field driven by industrial applications, integrated re-ranking combines lists from upstream sources into a single list and presents it to the user. The quality of integrated re-ranking is especially sensitive to real-time user behaviors and preferences. However, existing methods are all built on a cloud-to-edge framework, where mixed lists are generated by the cloud model and then sent to the devices. Despite its effectiveness, such a framework fails to capture users' real-time preferences due to network bandwidth and latency constraints. Hence, we propose to place the integrated re-ranking model on devices, allowing for the full exploitation of real-time behaviors. To achieve this, we need to address two key issues: first, how to extract users' preferences for different sources from heterogeneous and imbalanced user behaviors; second, how to explore the correlation between the extracted personalized preferences and the candidate items. In this work, we present the first on-Device Integrated Re-ranking framework, DIR, to avoid delays in processing real-time user behaviors. DIR includes a multi-sequence behavior modeling module to extract the user's source-level preferences, and a preference-adaptive re-ranking module to incorporate personalized source-level preferences into the re-ranking of candidate items. In addition, we design an exposure loss and a utility loss to jointly optimize exposure fairness and overall utility. Extensive experiments on three datasets show that DIR significantly outperforms the state-of-the-art baselines on utility-based and fairness-based metrics.
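The joint objective of exposure fairness and utility can be pictured with a toy two-source example; the softmax parameterization, squared exposure penalty, and weight alpha below are illustrative assumptions, not DIR's actual losses:

```python
import numpy as np

def joint_loss(source_scores, source_utilities, target_exposure, alpha=0.5):
    """Sketch of a utility + exposure-fairness objective: softmax over
    source scores gives expected exposure shares; the utility term rewards
    high-value sources and the exposure term pulls shares toward a target."""
    p = np.exp(source_scores - source_scores.max())
    p /= p.sum()
    utility_loss = -float(p @ source_utilities)             # maximize utility
    exposure_loss = float(np.sum((p - target_exposure) ** 2))  # fairness penalty
    return utility_loss + alpha * exposure_loss

target = np.array([0.5, 0.5])           # desired exposure split
util = np.array([1.0, 1.0])             # equally valuable sources

l_fair = joint_loss(np.zeros(2), util, target)        # balanced exposure
l_skew = joint_loss(np.array([5., 0.]), util, target)  # skewed exposure
```

When utilities are equal, the fairness term alone decides the ranking: the balanced allocation achieves a strictly lower joint loss than the skewed one.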

A Predict-Then-Optimize Couriers Allocation Framework for Emergency Last-mile Logistics

In recent years, emergency last-mile logistics (ELML) has played an essential role in urban emergencies. The efficient allocation of couriers in ELML is of practical significance for ensuring the supply of essential materials, especially during public health emergencies (PHEs). However, couriers allocation becomes challenging due to the instability of demand, dynamic supply comprehension, and the evolving delivery environment for ELML caused by PHEs. While existing work has delved into couriers allocation, the impact of PHEs on demand, supply, and delivery has yet to be considered. In this work, we design PTOCA, a Predict-Then-Optimize Couriers Allocation framework. Specifically, in the prediction stage, we design a resource-aware prediction module that performs spatio-temporal modeling of unstable demand characteristics using a variational graph GRU encoder and builds a task-resource regressor to predict demand accurately. In the optimization stage, the priority ranking module first solves the matching of delivery resources under demand-supply imbalance. Then the multi-factor task allocation module is used to model the dynamically evolving environment and reasonably assign couriers' delivery tasks. We evaluate PTOCA using real-world data covering 170 delivery zones, more than 10,000 couriers, and 100 million delivery tasks. The data is collected from JD Logistics, one of the largest logistics service companies. Extensive experimental results show that our method outperforms baselines in task delivery rate and on-time delivery rate.

TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest

Sequential models that encode user activity for next action prediction have become a popular design choice for building web-scale personalized recommendation systems. Traditional methods of sequential recommendation either utilize end-to-end learning on realtime user actions, or learn user representations separately in an offline batch-generated manner. This paper (1) presents Pinterest's ranking architecture for Homefeed, our personalized recommendation product and the largest engagement surface; (2) proposes TransAct, a sequential model that extracts users' short-term preferences from their realtime activities; (3) describes our hybrid approach to ranking, which combines end-to-end sequential modeling via TransAct with batch-generated user embeddings. The hybrid approach allows us to combine the advantages of responsiveness from learning directly on realtime user activity with the cost-effectiveness of batch user representations learned over a longer time period. We describe the results of ablation studies, the challenges we faced during productionization, and the outcome of an online A/B experiment, which validates the effectiveness of our hybrid ranking model. We further demonstrate the effectiveness of TransAct on other surfaces such as contextual recommendations and search. Our model has been deployed to production in Homefeed, Related Pins, Notifications, and Search at Pinterest.

Knowledge Based Prohibited Item Detection on Heterogeneous Risk Graphs

With the popularity of online shopping in recent years, various prohibited items continually appear on e-commerce portals. Searching for and deleting such risk items online plays a fundamental role in protecting the health of e-commerce trade. To mitigate the negative impact of limited supervision and the adversarial behaviors of malicious sellers, current state-of-the-art work mainly introduces heterogeneous graph neural networks with further improvements such as graph structure learning, pairwise training mechanisms, etc. However, the performance of these models is highly limited, since domain knowledge is indispensable for identifying prohibited items but is ignored by these methods. In this paper, we propose a novel Knowledge Based Prohibited item Detection system (named KBPD) to break through this limitation. To make full use of rich risk knowledge, the proposed method introduces the Risk-Domain Knowledge Graph (named RDKG), which is encoded by a path-based graph neural network method. Furthermore, to utilize information from both the RDKG and the Heterogeneous Risk Graph (named HRG), an interactive fusion framework is proposed that further improves detection performance. We collect real-world datasets from the largest Chinese second-hand commodity trading platform, Xianyu. Both offline and online experimental results consistently demonstrate that KBPD outperforms the state-of-the-art baselines. The improvement over the second-best method is up to 22.67% in the AP metric.

Graph-Aware Language Model Pre-Training on a Large Graph Corpus Can Help Multiple Graph Applications

Model pre-training on large text corpora has been demonstrated to be effective for various downstream applications in the NLP domain. In the graph mining domain, a similar analogy can be drawn for pre-training graph models on large graphs in the hope of benefiting downstream graph applications, which has also been explored by several recent studies. However, no existing study has investigated the pre-training of text-plus-graph models on large heterogeneous graphs with abundant textual information (a.k.a. large graph corpora) and then fine-tuning the model on different related downstream applications with different graph schemas. To address this problem, we propose a framework of graph-aware language model pre-training (GaLM) on a large graph corpus, which incorporates large language models and graph neural networks, along with a variety of fine-tuning methods for downstream applications. We conduct extensive experiments on Amazon's real internal datasets and large public datasets. Comprehensive empirical results and in-depth analysis demonstrate the effectiveness of our proposed methods, along with lessons learned.

QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search

In light of the success of pre-trained language models (PLMs), continual pre-training of generic PLMs has become the paradigm for domain adaptation. In this paper, we propose QUERT, A Continual Pre-trained Language Model for QUERy Understanding in Travel Domain Search. QUERT is jointly trained on four pre-training tasks tailored to the characteristics of queries in travel domain search: Geography-aware Mask Prediction, Geohash Code Prediction, User Click Behavior Learning, and Phrase and Token Order Prediction. Performance improvements on downstream tasks and ablation experiments demonstrate the effectiveness of our proposed pre-training tasks. Specifically, the average performance of downstream tasks increases by 2.02% and 30.93% in the supervised and unsupervised settings, respectively. To verify the improvement QUERT brings to online business, we deploy QUERT and perform A/B testing on the Fliggy app. The feedback results show that QUERT increases the Unique Click-Through Rate and Page Click-Through Rate by 0.89% and 1.03%, respectively, when applied as the encoder. Resources are available at https://github.com/hsaest/QUERT.

NEON: Living Needs Prediction System in Meituan

Living needs refer to the various needs in humans' daily lives for survival and well-being, including food, housing, entertainment, etc. At life service platforms that connect users to service providers, such as Meituan, the problem of living needs prediction is fundamental, as it helps understand users and boost various downstream applications such as personalized recommendation. However, the problem has not been well explored and faces two critical challenges. First, the needs are naturally connected to specific locations and times, suffering from complex impacts from the spatiotemporal context. Second, there is a significant gap between users' actual living needs and their historical records on the platform. To address these two challenges, we design a system of living NEeds predictiON named NEON, consisting of three phases: feature mining, feature fusion, and multi-task prediction. In the feature mining phase, we carefully extract individual-level user features for spatiotemporal modeling, and aggregated-level behavioral features for enriching data, which serve as the basis for addressing the two challenges, respectively. Further, in the feature fusion phase, we propose a neural network that effectively fuses the two parts of features into the user representation. Moreover, we design a multi-task prediction phase, where the auxiliary task of needs-meeting-way prediction can enhance the modeling of the spatiotemporal context. Extensive offline evaluations verify that our NEON system can effectively predict users' living needs. Furthermore, we deploy NEON into Meituan's algorithm engine and evaluate how it enhances three downstream prediction applications via large-scale online A/B testing. As a representative result, deploying our system leads to a 1.886% increase w.r.t. CTCVR in Meituan homepage recommendation. The results demonstrate NEON's effectiveness in predicting fine-grained user needs, needs-meeting ways, and potential needs, highlighting the immense application value of NEON.

A Data-Driven Decision Support Framework for Player Churn Analysis in Online Games

Faced with a saturated market and fierce competition among online games, analyzing the causes of player churn is of great value for improving the game product and maintaining player retention. A large number of research efforts on churn analysis have gone into churn prediction, which can achieve sound accuracy thanks to the boom in AI technologies. However, game publishers are usually unable to apply high-accuracy prediction methods in practice to prevent or relieve churn, due to the lack of specific decision support (e.g., why players leave and what to do next). In this study, we fully exploit expertise in online games and propose a comprehensive data-driven decision support framework for addressing game player churn. We first define churn analysis in online games from a commercial perspective and elaborate the core demands of game publishers for churn analysis. Then we employ and improve cutting-edge eXplainable AI (XAI) methods to predict player churn and analyze potential churn causes. The possible churn causes can finally guide game publishers to make specific decisions on revision or intervention within our designed procedure. We demonstrate the effectiveness and high practical value of the framework by conducting extensive experiments on a real-world large-scale online game, Justice PC. The whole decision support framework, yielding interesting and valuable insights, has also received quite positive reviews from the game product and operation teams. Notably, the whole pipeline is readily transplanted to other online systems for decision support on similar issues.
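One building block commonly used in XAI-driven cause analysis is permutation importance: shuffling a feature the model relies on degrades accuracy, flagging it as a candidate churn driver. The toy model and features below are hypothetical; the paper's actual XAI methods are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
playtime = rng.normal(size=n)
churn = (playtime < -0.2).astype(int)  # synthetic: churn driven by low playtime

def predict(pt):
    """A stand-in 'model' that thresholds playtime."""
    return (pt < -0.2).astype(int)

base = float(np.mean(predict(playtime) == churn))  # accuracy on intact data
permuted = float(np.mean(predict(rng.permutation(playtime)) == churn))
importance = base - permuted  # large drop => playtime matters for churn
```

Applied to a real churn model, a ranking of such importance scores is one interpretable signal a publisher could use to decide what to revise or whom to target with interventions.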

PlanRanker: Towards Personalized Ranking of Train Transfer Plans

Train transfer plan ranking has become a core business of online travel platforms (OTPs), owing to the flourishing development of high-speed rail technology and the convenience of booking trains online. Currently, mainstream OTPs adopt rule-based or simple preference-based strategies to rank train transfer plans. However, the insufficient emphasis on the costs of plans and the neglect of reference transfer plans make these existing strategies less effective for the personalized ranking of train transfer plans. To this end, a novel personalized deep network (PlanRanker) is presented in this paper to better address the problem. In PlanRanker, a personalized learning component is first proposed to capture both the query semantics and the target-transfer-plan-relevant personalized interests of a user from the user's behavior log data. Then, we present a cost learning component, where both the price cost and the time cost of a target transfer plan are emphasized and learned. Finally, a reference transfer plan learning component is designed to enable the whole PlanRanker framework to learn from reference transfer plans, which are pieced together by platform users and thus reflect the wisdom of the crowd. PlanRanker is now successfully deployed at Alibaba Fliggy, one of the largest OTPs in China, serving millions of users every day for train ticket reservation. Offline experiments on two production datasets and a country-scale online A/B test at Fliggy both demonstrate the superiority of the proposed PlanRanker over baselines.

Multi-factor Sequential Re-ranking with Perception-Aware Diversification

Feed recommendation systems, which recommend a sequence of items for users to browse and interact with, have gained significant popularity in practical applications. In feed products, users tend to browse a large number of items in succession, so the previously viewed items have a significant impact on users' behavior towards the following items. Therefore, traditional methods that mainly focus on improving the accuracy of recommended items are suboptimal for feed recommendation because they may recommend highly similar items. For feed recommendation, it is crucial to consider both the accuracy and diversity of the recommended item sequences in order to satisfy users' evolving interests when consecutively viewing items. To this end, this work proposes a general re-ranking framework named Multi-factor Sequential Re-ranking with Perception-Aware Diversification (MPAD) to jointly optimize accuracy and diversity for feed recommendation in a sequential manner. Specifically, MPAD first extracts users' interests at different scales from their behavior sequences through graph clustering-based aggregations. Then, MPAD proposes two sub-models to respectively evaluate the accuracy and diversity of a given item by capturing users' evolving interest under the ever-changing context and users' personal perception of diversity from an item sequence perspective. This is consistent with the browsing nature of the feed scenario. Finally, MPAD generates the return list by sequentially selecting optimal items from the candidate set to maximize the joint benefits of accuracy and diversity of the entire list. MPAD has been implemented in Taobao's homepage feed to serve the main traffic, recommending billions of items to hundreds of millions of users every day.
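The sequential accuracy-plus-diversity selection described above can be illustrated with a simple greedy re-ranker. This is a generic MMR-style sketch, not MPAD itself; the trade-off parameter `lam` and all names are illustrative:

```python
# Illustrative sketch (not MPAD): greedy sequential re-ranking that trades a
# relevance score against similarity to already-selected items, in the spirit
# of jointly optimizing accuracy and diversity for the whole list.
import numpy as np

def greedy_rerank(scores, item_vecs, k, lam=0.5):
    """Select k items, each maximizing relevance minus redundancy.

    scores: (n,) relevance scores; item_vecs: (n, d) unit-normalized embeddings;
    lam: trade-off between accuracy (1.0) and diversity (0.0).
    """
    n = len(scores)
    selected, remaining = [], set(range(n))
    while len(selected) < k and remaining:
        best, best_val = None, -np.inf
        for i in remaining:
            # Redundancy: max similarity to anything already in the list.
            red = max((item_vecs[i] @ item_vecs[j] for j in selected), default=0.0)
            val = lam * scores[i] - (1 - lam) * red
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate high-score items and one dissimilar item, the dissimilar item is picked second even though its raw score is lower, which is the behavior a perception-aware diversifier aims for.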

Multi-channel Integrated Recommendation with Exposure Constraints

Integrated recommendation, which aims at jointly recommending heterogeneous items from different channels in a main feed, has been widely applied to various online platforms. Though attractive, integrated recommendation requires the ranking methods to migrate from conventional user-item models to the new user-channel-item paradigm in order to better capture users' preferences at both the item and channel levels. Moreover, practical feed recommendation systems usually impose exposure constraints on different channels to ensure user experience. This leads to greater difficulty in the joint ranking of heterogeneous items. In this paper, we investigate the integrated recommendation task with exposure constraints in practical recommender systems. Our contribution is four-fold. First, we formulate this task as a binary online linear programming problem and propose a two-layer framework named Multi-channel Integrated Recommendation with Exposure Constraints (MIREC) to obtain the optimal solution. Second, we propose an efficient online allocation algorithm to determine the optimal exposure assignment of different channels from a global view of all user requests over the entire time horizon. We prove that this algorithm reaches the optimal point under a regret bound of O(√T) with linear complexity. Third, we propose a series of collaborative models to determine the optimal layout of heterogeneous items at each user request. The joint modeling of user interests, cross-channel correlation, and page context in our models aligns more with the browsing nature of feed products than existing models. Finally, we conduct extensive experiments on both offline datasets and online A/B tests to verify the effectiveness of MIREC. The proposed framework has now been implemented on the homepage of Taobao to serve the main traffic.

AlerTiger: Deep Learning for AI Model Health Monitoring at LinkedIn

Data-driven companies use AI models extensively to develop products and intelligent business solutions, making the health of these models crucial for business success. Model monitoring and alerting in industry pose unique challenges, including a lack of clearly defined model health metrics, label sparsity, and fast model iterations that result in short-lived models and features. As a product, there are also requirements for scalability, generalizability, and explainability. To tackle these challenges, we propose AlerTiger, a deep-learning-based MLOps model monitoring system that helps AI teams across the company monitor their AI models' health by detecting anomalies in models' input features and output scores over time. The system consists of four major steps: model statistics generation, deep-learning-based anomaly detection, anomaly post-processing, and user alerting. Our solution generates three categories of statistics to indicate AI model health, offers a two-stage deep anomaly detection solution to address label sparsity and attain generalizability to monitoring new models, and provides holistic reports for actionable alerts. This approach has been deployed to most of LinkedIn's production AI models for over a year and has identified several model issues whose fixes later led to significant business metric gains.

Assisting Clinical Decisions for Scarcely Available Treatment via Disentangled Latent Representation

Extracorporeal membrane oxygenation (ECMO) is an essential life-supporting modality for COVID-19 patients who are refractory to conventional therapies. However, the proper treatment decision has been the subject of significant debate, and it remains controversial who benefits from this scarcely available and technically complex treatment option. To support clinical decisions, there is a critical need to predict the treatment need and the potential treatment and no-treatment responses. Targeting this clinical challenge, we propose the Treatment Variational AutoEncoder (TVAE), a novel approach for individualized treatment analysis. TVAE is specifically designed to address the modeling challenges posed by treatments like ECMO, with strong treatment selection bias and scarce treatment cases. TVAE conceptualizes the treatment decision as a multi-scale problem. We model a patient's potential treatment assignment and the factual and counterfactual outcomes as part of their intrinsic characteristics that can be represented by a deep latent variable model. The factual and counterfactual prediction errors are alleviated via a reconstruction regularization scheme together with semi-supervision, while the selection bias and the scarcity of treatment cases are mitigated by the disentangled and distribution-matched latent space and the label-balancing generative strategy. We evaluate TVAE on two real-world COVID-19 datasets: an international dataset collected from 1651 hospitals across 63 countries, and an institutional dataset collected from 15 hospitals. The results show that TVAE outperforms state-of-the-art treatment effect models in predicting both the propensity scores and factual outcomes on heterogeneous COVID-19 datasets. Additional experiments also show that TVAE outperforms the best existing models in individual treatment effect estimation on the synthesized IHDP benchmark dataset.

Contextual Self-attentive Temporal Point Process for Physical Decommissioning Prediction of Cloud Assets

As cloud computing continues to expand globally, effective management of decommissioned cloud assets in data centers becomes increasingly important. This work focuses on predicting the physical decommissioning date of cloud assets as a crucial component of reverse cloud supply chain management and data center warehouse operation. The decommissioning process is modeled as a contextual self-attentive temporal point process, which incorporates contextual information to model sequences with parallel events and produces more accurate predictions as more historical data is observed. We conducted extensive offline and online experiments in 20 sampled data centers. The results show that the proposed methodology achieves the best performance among the baselines and reaches a remarkable 94% prediction accuracy in online experiments. This modeling methodology can be extended to other domains with similar workflow-like processes.

M3PT: A Multi-Modal Model for POI Tagging

POI tagging aims to annotate a point of interest (POI) with informative tags, which facilitates many POI-related services, including search, recommendation, and so on. Most of the existing solutions neglect the significance of POI images and seldom fuse the textual and visual features of POIs, resulting in suboptimal tagging performance. In this paper, we propose a novel Multi-Modal Model for POI Tagging, namely M3PT, which achieves enhanced POI tagging through fusing the target POI's textual and visual features, and the precise matching between the multi-modal representations. Specifically, we first devise a domain-adaptive image encoder (DIE) to obtain image embeddings aligned to the semantics of their gold tags. Then, in M3PT's text-image fusion module (TIF), the textual and visual representations are fully fused into the POIs' content embeddings for the subsequent matching. In addition, we adopt a contrastive learning strategy to further bridge the gap between the representations of different modalities. To evaluate the tagging models' performance, we have constructed two high-quality POI tagging datasets from the real-world business scenario of Ali Fliggy. On these datasets, we conducted extensive experiments to demonstrate our model's advantage over uni-modal and multi-modal baselines, and to verify the effectiveness of important components of M3PT, including DIE, TIF, and the contrastive learning strategy.

Interactive Generalized Additive Model and Its Applications in Electric Load Forecasting

Electric load forecasting is an indispensable component of electric power system planning and management. Inaccurate load forecasting may lead to the threat of outages or a waste of energy. Accurate electric load forecasting is challenging when there is limited data or even no data, such as load forecasting on holidays or under extreme weather conditions. As high-stakes decision-making usually follows load forecasting, model interpretability is crucial for the adoption of forecasting models. In this paper, we propose an interactive GAM which is not only interpretable but can also incorporate specific domain knowledge from the electric power industry for improved performance. This boosting-based GAM leverages piecewise linear functions and can be learned through our efficient algorithm. On both public benchmark and electricity datasets, our interactive GAM outperforms current state-of-the-art methods and demonstrates good generalization ability in cases of extreme weather events. We launched a user-friendly web-based tool based on interactive GAM and have incorporated it into our eForecaster product, a unified AI platform for electricity forecasting.
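To give a rough flavor of a boosting-based GAM with piecewise linear shape functions, here is a generic sketch. It is not the paper's algorithm: the quantile knot placement and the crude nearest-knot fitting rule are our own simplifications for illustration.

```python
# Sketch of a boosting-style GAM: each round fits one feature's piecewise
# linear shape function to the current residual, so the final model is a sum
# of interpretable per-feature curves (illustrative, not the paper's method).
import numpy as np

def fit_pl_gam(X, y, n_rounds=20, n_knots=8, lr=0.5):
    n, d = X.shape
    intercept = y.mean()
    pred = np.full(n, intercept)
    shapes = []  # list of (feature index, knot locations, knot values)
    for _ in range(n_rounds):
        resid = y - pred
        best = None
        for j in range(d):
            knots = np.quantile(X[:, j], np.linspace(0, 1, n_knots))
            # Crude knot values: mean residual of points nearest each knot.
            idx = np.argmin(np.abs(X[:, j][:, None] - knots[None, :]), axis=1)
            vals = np.array([resid[idx == k].mean() if (idx == k).any() else 0.0
                             for k in range(n_knots)])
            f = np.interp(X[:, j], knots, vals)  # piecewise linear curve
            sse = ((resid - f) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, knots, vals, f)
        _, j, knots, vals, f = best
        shapes.append((j, knots, lr * vals))
        pred += lr * f
    return intercept, shapes

def predict_pl_gam(intercept, shapes, X):
    pred = np.full(len(X), intercept)
    for j, knots, vals in shapes:
        pred += np.interp(X[:, j], knots, vals)
    return pred
```

Because the final predictor is just an intercept plus per-feature piecewise linear curves, each `(knots, vals)` pair can be plotted directly, which is the interpretability argument made above.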

From Labels to Decisions: A Mapping-Aware Annotator Model

Online platforms regularly rely on human annotators to make real-time operational decisions for tasks such as content moderation. While crowdsourcing models have been proposed for aggregating noisy labels, they do not generalize well when annotators produce labels in a large space, e.g., generated from complex review trees. We study a novel crowdsourcing setting with D possible operational decisions or outcomes, where annotators produce labels in a larger space of size L > D that are mapped to decisions through a known mapping function. For content moderation, such labels can correspond to violation reasons (e.g., nudity, violence), while the space of decisions is binary: remove the content or keep it up. In this setting, making the right decision matters more than estimating the correct underlying label. Existing methods typically separate the label-to-decision mapping from the modeling of annotators, leading to sub-optimal statistical inference efficiency and excessive computational complexity. We propose a novel confusion matrix model for each annotator that leverages this mapping. Our model is parameterized in a hierarchical manner, with population parameters shared across annotators to model shared confusions and individual parameters to admit heterogeneity among annotators. With extensive numerical experiments, we demonstrate that the proposed model substantially improves accuracy over existing methods and scales well for moderate and large L. In a real-world application to content moderation at Meta, the proposed method offers a 13% improvement in AUC over prior methods, including Meta's existing model in production.
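The core of the setting can be illustrated with a toy aggregator that votes in decision space after applying the known label-to-decision mapping. The mapping and labels below are hypothetical examples, not Meta's actual taxonomy or the paper's hierarchical model:

```python
# Toy sketch of the L -> D mapping setting: annotators emit labels from a
# larger space, a known mapping collapses them to decisions, and aggregation
# happens in decision space (illustrative names only).
from collections import Counter

LABEL_TO_DECISION = {        # hypothetical mapping: violation reason -> action
    "nudity": "remove",
    "violence": "remove",
    "hate": "remove",
    "benign": "keep",
}

def decide(labels, mapping=LABEL_TO_DECISION):
    """Majority vote over mapped decisions, not over raw labels."""
    decisions = [mapping[l] for l in labels]
    return Counter(decisions).most_common(1)[0][0]

# Three annotators cite different violation reasons, so no label has a
# majority, yet two of the three mapped decisions agree on "remove".
print(decide(["nudity", "violence", "benign"]))  # prints "remove"
```

This is exactly why deciding in decision space can beat label-space aggregation: disagreement on the reason need not imply disagreement on the action.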

Self-supervised Classification of Clinical Multivariate Time Series using Time Series Dynamics

We improve the accuracy of clinical multivariate time series (MTS) classification (such as EEG and ECG) via a novel self-supervised paradigm that directly captures the dynamics between the different time series, learned jointly to optimize the classification task. Labels in clinical datasets are very often insufficient. One way to address this challenge is to leverage self-supervision. This paradigm attempts to identify a supervisory signal inherent within a dataset to serve as a surrogate label. We present a novel form of self-supervision: the dynamics of clinical MTS. Unlike other self-supervision methods, such as masking, that are intuitive but still heuristic, we propose learning a representation justified by Koopman theory. The latter has been shown to be useful for representing clinical time series and can be used as a surrogate task to improve clinical MTS classification. In the ECG task, we show that our proposed framework achieved higher sensitivity and specificity than the state-of-the-art (SOTA) baseline over numerous common diagnoses. For EEG abnormality classification, our proposed framework also achieved higher sensitivity and specificity than the SOTA baseline. All results are statistically significant. Our technique yields reliable clinical diagnosis in an empirical study employing signals from thousands of patients across multiple clinical tasks and two types of clinical-grade sensors (ECG and EEG), as compared to state-of-the-art machine learning. Leveraging time-series-dynamics self-supervision can help mitigate the lack of labels in clinical datasets used for training machine learning algorithms and significantly improve their performance. Specifically, the ECG system presented in this work is being trialed in hospitals, used by top cardiologists for patient diagnosis and treatment. We believe that the deployment of such cutting-edge technology will significantly improve the accuracy and speed of cardiac assessments.

UA-FedRec: Untargeted Attack on Federated News Recommendation

News recommendation is essential for personalized news distribution. Federated news recommendation, which enables collaborative model learning from multiple clients without sharing their raw data, is a promising approach for preserving users' privacy. However, the security of federated news recommendation is still unclear. In this paper, we study this problem by proposing an untargeted attack on federated news recommendation called UA-FedRec. By exploiting prior knowledge of news recommendation and federated learning, UA-FedRec can effectively degrade model performance with a small percentage of malicious clients. First, the effectiveness of news recommendation highly depends on user modeling and news modeling. We design a news similarity perturbation method that pushes representations of similar news farther apart and those of dissimilar news closer together to disrupt news modeling, and propose a user model perturbation method that moves malicious user updates in the opposite direction of benign updates to disrupt user modeling. Second, updates from different clients are typically aggregated with a weighted average based on their sample sizes. We propose a quantity perturbation method that enlarges the sample sizes of malicious clients within a reasonable range to amplify the impact of malicious updates. Extensive experiments on two real-world datasets show that UA-FedRec can effectively degrade the accuracy of existing federated news recommendation methods, even when defenses are applied. Our study reveals a critical security issue in existing federated news recommendation systems and calls for research efforts to address it. Our code is available at https://github.com/yjw1029/UA-FedRec.
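The aggregation step that quantity perturbation exploits can be sketched as follows. This is a generic FedAvg-style sketch under our own assumptions, not the paper's exact protocol:

```python
# Sketch of sample-size-weighted federated averaging: a malicious client that
# inflates its reported sample size (quantity perturbation) amplifies the
# weight of its update in the server-side average (illustrative only).
import numpy as np

def weighted_aggregate(updates, sizes):
    """Average client updates with weights proportional to reported sizes."""
    w = np.asarray(sizes, dtype=float)
    w /= w.sum()
    return sum(wi * u for wi, u in zip(w, updates))

benign = [np.array([1.0]), np.array([1.0])]
malicious = np.array([-1.0])  # pushed opposite to the benign direction

# Honest sizes: the two benign clients outvote the attacker (result 1/3).
honest = weighted_aggregate(benign + [malicious], [100, 100, 100])
# Inflated size: the same malicious update now dominates (result -3/7).
inflated = weighted_aggregate(benign + [malicious], [100, 100, 500])
```

With equal sizes the aggregate keeps the benign sign; once the malicious client reports a 5x size, the aggregate flips sign, which is precisely the amplification effect the attack relies on.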

DGI: An Easy and Efficient Framework for GNN Model Evaluation

While many systems have been developed to train graph neural networks (GNNs), efficient model evaluation, which computes node embeddings according to a given model, remains to be addressed. For instance, using the widely adopted node-wise approach, model evaluation can account for over 90% of the time in the end-to-end training process due to neighbor explosion, where a node must access its multi-hop neighbors. The layer-wise approach avoids neighbor explosion by conducting computation layer by layer in GNN models. However, layer-wise model evaluation takes considerable implementation effort because users need to manually decompose the GNN model into layers, and different implementations are required for GNN models with different structures.

In this paper, we present DGI, a framework for easy and efficient GNN model evaluation, which automatically translates the training code of a GNN model for layer-wise evaluation to minimize user effort. DGI is general across different GNN models and evaluation requests (e.g., computing embeddings for all or some of the nodes), and supports out-of-core execution on large graphs that cannot fit in CPU memory. Under the hood, DGI traces the computation graph of the GNN model, partitions the computation graph into layers that are suitable for layer-wise evaluation according to tailored rules, and executes each layer efficiently by reordering the computation tasks and managing device memory consumption. Experiment results show that DGI matches hand-written implementations of layer-wise evaluation in efficiency and consistently outperforms node-wise evaluation across different datasets and hardware settings, with speedups that can exceed 1,000x.
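The layer-wise evaluation idea that DGI automates can be sketched for a simple mean-aggregation GNN. This is an illustrative sketch, not DGI's implementation:

```python
# Layer-wise GNN inference: instead of gathering each node's multi-hop
# neighborhood (neighbor explosion), compute all nodes' hidden states one
# layer at a time with full-graph sparse-dense products (illustrative sketch).
import numpy as np

def layerwise_infer(adj, X, weights):
    """adj: (n, n) row-normalized adjacency; weights: list of weight matrices,
    one per layer. Each layer touches every edge exactly once."""
    H = X
    for W in weights:                        # one full-graph pass per layer
        H = np.maximum(adj @ H @ W, 0.0)     # mean-aggregate, transform, ReLU
    return H
```

Each layer is a single pass over the graph, so the total work is linear in the number of layers times edges, rather than growing with the size of every node's multi-hop neighborhood as in node-wise evaluation.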

Learning Multivariate Hawkes Process via Graph Recurrent Neural Network

This paper presents a novel approach for modeling and predicting patterns of events in time-series learning, named the graph recurrent temporal point process (GRTPP). Prior research has focused on using deep learning techniques, such as recurrent neural networks (RNNs) or attention-based sequential data embedding, to model the time-varying intensity of events. However, these models were typically limited to a single intensity function capturing the occurrence of all event types simultaneously. GRTPP addresses this issue by encoding multivariate event sequences into a sequence of graphs, where each node contains information about the event occurrence and time. The sequence of graphs is then embedded into node embeddings for each event type, taking into account the relationships between the event types. By integrating the estimated intensity functions, GRTPP predicts the type and timing of the next event. The proposed GRTPP model offers improved effectiveness and explainability compared to previous models, as demonstrated through empirical evaluations on five real-world datasets and an actual credit card transaction dataset. The code is available at https://github.com/im0j/GRTPP.
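For context, the classical multivariate Hawkes intensity that neural temporal point processes like GRTPP generalize can be written down directly. This is a textbook sketch with an exponential kernel, not GRTPP's learned intensity:

```python
# Classical multivariate Hawkes intensity: each event type i has a base rate
# mu[i] plus exponentially decaying excitation alpha[i][j] from every past
# event of type j (textbook sketch, not a learned neural intensity).
import math

def hawkes_intensity(i, t, history, mu, alpha, beta):
    """history: list of (time, type) pairs with time < t; beta: decay rate."""
    lam = mu[i]
    for tk, j in history:
        if tk < t:
            lam += alpha[i][j] * math.exp(-beta * (t - tk))
    return lam
```

A neural TPP replaces the fixed `mu`/`alpha`/`exp` structure with a learned function of the encoded history, which is what lets models like GRTPP maintain a separate, interrelated intensity per event type.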

Group-based Fraud Detection Network on e-Commerce Platforms

Along with the rapid technological and commercial innovation on e-commerce platforms, there is an increasing number of frauds that bring great harm to these platforms. Many frauds are conducted by organized groups of fraudsters for higher efficiency and lower costs; these are also known as group-based frauds. Despite the high concealment and strong destructiveness of group-based fraud, no existing research work thoroughly exploits the information within the transaction networks of e-commerce platforms for group-based fraud detection. In this work, we analyze and summarize the characteristics of group-based frauds, based on which we propose a novel end-to-end semi-supervised Group-based Fraud Detection Network (GFDN) to support such fraud detection in real-world applications. Experimental results on large-scale e-commerce datasets from Taobao and Bitcoin trading datasets show the superior effectiveness and efficiency of our proposed model for group-based fraud detection on bipartite graphs.

Generating Synergistic Formulaic Alpha Collections via Reinforcement Learning

In the field of quantitative trading, it is common practice to transform raw historical stock data into indicative signals of the market trend. Such signals are called alpha factors. Alphas in formula form are more interpretable and thus favored by practitioners concerned with risk. In practice, a set of formulaic alphas is often used together for better modeling precision, so we need to find synergistic formulaic alpha sets that work well together. However, most traditional alpha generators mine alphas one by one, overlooking the fact that the alphas will be combined later. In this paper, we propose a new alpha-mining framework that prioritizes mining a synergistic set of alphas, i.e., it directly uses the performance of the downstream combination model to optimize the alpha generator. Our framework also leverages the strong exploratory capabilities of reinforcement learning (RL) to better explore the vast search space of formulaic alphas. The contribution of a new alpha to the combination model's performance is used as the return in the RL process, which drives the alpha generator to find better alphas that improve upon the current set. Experimental evaluations on real-world stock market data demonstrate both the effectiveness and efficiency of our framework for stock trend forecasting. The investment simulation results show that our framework is able to achieve higher returns compared to previous approaches.
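To make "formulaic alpha" concrete, here is a toy example of one such formula and a standard way to score it. This is purely illustrative, not an alpha mined by the proposed framework:

```python
# A formulaic alpha is just an expression over price history. Here: a toy
# 5-day mean-reversion alpha, scored by its information coefficient (the
# correlation between alpha values and subsequent returns). Illustrative only.
import numpy as np

def reversal_alpha(prices, window=5):
    """Alpha = -(price / rolling_mean(price, window) - 1): bet on reversion."""
    ma = np.convolve(prices, np.ones(window) / window, mode="valid")
    return -(prices[window - 1:] / ma - 1.0)

def information_coefficient(alpha, fwd_returns):
    """Correlation between an alpha and realized forward returns."""
    return np.corrcoef(alpha, fwd_returns)[0, 1]
```

A synergistic alpha set is then one whose combined signal, not each alpha's individual IC, performs well, which is the gap the framework above targets.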

LibAUC: A Deep Learning Library for X-Risk Optimization

This paper introduces the award-winning deep learning (DL) library LibAUC for implementing state-of-the-art algorithms for optimizing a family of risk functions named X-risks. X-risks refer to a family of compositional functions in which the loss for each data point is defined by contrasting the data point with a large number of others. They have broad applications in AI for solving classical and emerging problems, including but not limited to classification for imbalanced data (CID), learning to rank (LTR), and contrastive learning of representations (CLR). The motivation for developing LibAUC is to address the convergence issues of existing libraries for solving these problems. In particular, existing libraries may not converge, or may require very large mini-batch sizes to attain good performance, due to their use of the standard mini-batch technique in the empirical risk minimization (ERM) framework. Our library is for deep X-risk optimization (DXO), which has achieved great success in solving a variety of tasks for CID, LTR, and CLR. The contributions of this paper include: (1) a new mini-batch based pipeline for implementing DXO algorithms, which differs from existing DL pipelines in its design of controlled data samplers and dynamic mini-batch losses; (2) extensive benchmarking experiments for ablation studies and comparison with existing libraries. The LibAUC library features scalable performance for millions of items to be contrasted, faster and better convergence than existing libraries for optimizing X-risks, seamless PyTorch deployment, and versatile APIs for various loss optimizations. Our library is available to the open source community at https://github.com/Optimization-AI/LibAUC, to facilitate further academic research and industrial applications.

Multi Datasource LTV User Representation (MDLUR)

In this paper, we propose a novel user representation methodology called Multi Datasource LTV User Representation (MDLUR). Our model aims to establish a universal user embedding for downstream tasks, specifically lifetime value (LTV) prediction on specific days after installation. MDLUR combines various data sources, including user information, portrait, and behavior data from the first n days after installation of the social casino game "Club Vegas Slots" developed by Bagelcode. The model overcomes the limitations of conventional approaches, which struggle to effectively utilize various data sources or accurately capture interactions in sparse datasets. MDLUR adopts unique model architectures tailored to each data source. Coupled with robust dimensionality reduction techniques, the model effectively integrates insights from various data sources. Comprehensive experiments on real-world industrial data demonstrate the superiority of the proposed methods compared to SOTA baselines including Two-Stage XGBoost, WhalesDector, MSDMT, and BST. Not only does it outperform these models, but it has also been efficiently deployed and tested in a live environment using MLOps, demonstrating its maintainability. The representation may potentially be applied to a wide range of downstream tasks, including conversion, churn, and retention prediction, as well as user segmentation and item recommendation.

Commonsense Knowledge Graph towards Super APP and Its Applications in Alipay

The recent explosive growth of Super Apps brings great convenience to people's daily life by providing a wide variety of services through mini-programs, including online shopping, travel, finance, and so on. Due to the considerable gap between various scenarios, restrictions on effective information transfer and sharing severely block the efficient delivery of online services, potentially affecting the user's app experience. To deeply understand users' needs, we propose SupKG, a commonsense knowledge graph for Super Apps, to help comprehensively characterize user behaviors across different business scenarios. In particular, SupKG is carefully constructed from multiplex and heterogeneous data sources in Alipay (a well-known Super App in China), and emphasizes abundant spatiotemporal relations and intent-related entities to answer the fundamental question in life services: "which service do users need, at what time, and where".

On the other hand, the successful application of SupKG hinges on an effective form of network representation, i.e., Knowledge Graph Embedding (KGE). However, a series of unresolved issues still need to be carefully considered in the industrial environment: i) bridging language representations with knowledge structure in a unified manner, ii) alleviating the skewed data distribution in SupKG, and iii) effectively characterizing hierarchical structures in SupKG. With these motivations, we develop a novel knowledge graph representation learning framework for SupKG, enabling various downstream applications to benefit from the learned representations of entities and relations. Extensive experiments on the standard knowledge graph completion task demonstrate the consistent and significant performance improvement of our representation learning framework, which also greatly benefits the supplementation of potential knowledge in SupKG. In real-world applications in Alipay, SupKG and the learned representations show the potential superiority of integrating global behaviors in cold-start scenarios and providing high-quality knowledge for warming up graph-based ranking.

Revisiting Neural Retrieval on Accelerators

Retrieval finds a small number of relevant candidates from a large corpus for information retrieval and recommendation applications. A key component of retrieval is modeling (user, item) similarity, which is commonly represented as the dot product of two learned embeddings. This formulation permits efficient inference, commonly known as Maximum Inner Product Search (MIPS). Despite its popularity, the dot product cannot capture complex user-item interactions, which are multifaceted and likely high rank. We hence examine non-dot-product retrieval settings on accelerators, and propose mixture of logits (MoL), which models (user, item) similarity as an adaptive composition of elementary similarity functions. This new formulation is expressive, capable of modeling high-rank (user, item) interactions, and further generalizes to the long tail. When combined with a hierarchical retrieval strategy, h-indexer, we are able to scale MoL up to a 100M-item corpus on a single GPU with latency comparable to MIPS baselines. On public datasets, our approach leads to uplifts of up to 77.3% in hit rate (HR). Experiments on a large recommendation surface at Meta showed strong metric gains and reduced popularity bias, validating the proposed approach's performance and improved generalization.
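A minimal sketch of a mixture-of-logits style similarity, assuming per-head embeddings and a simple linear gate. The shapes and the gating form are our own assumptions for illustration, not the paper's exact architecture:

```python
# Sketch of mixture-of-logits (MoL) style similarity: the overall score is an
# adaptive, gated convex combination of K elementary dot-product similarities,
# which can express higher-rank interactions than a single dot product.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mol_similarity(user_heads, item_heads, gate_w):
    """user_heads, item_heads: (K, d) per-head embeddings;
    gate_w: (K, 2*K*d) linear gating matrix (an illustrative assumption)."""
    logits = np.einsum("kd,kd->k", user_heads, item_heads)  # K elementary sims
    gate_in = np.concatenate([user_heads.ravel(), item_heads.ravel()])
    gates = softmax(gate_w @ gate_in)  # adaptive, per-pair mixture weights
    return gates @ logits
```

Because the gate depends on the specific (user, item) pair, different pairs can weight the elementary similarities differently, which is what makes the composition "adaptive" rather than a fixed linear blend.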

Towards a Generic Framework for Mechanism-guided Deep Learning for Manufacturing Applications

Manufacturing data analytics tasks are traditionally undertaken with Mechanism Models (MMs), which are domain-specific mathematical equations modeling the underlying physical or chemical processes of the tasks. Recently, Deep Learning (DL) has been increasingly applied to manufacturing. MMs and DL have their individual pros and cons, motivating the development of Mechanism-guided Deep Learning Models (MDLMs) that combine the two. Existing MDLMs are often tailored to specific tasks or types of MMs, and can fail to effectively 1) utilize interconnections of multiple input examples, 2) adaptively self-correct prediction errors with error bounding, and 3) ensemble multiple MMs. In this work, we propose a generic, task-agnostic MDLM framework that can embed one or more MMs in deep networks and addresses the three aforementioned issues. We present two diverse use cases where we experimentally demonstrate the effectiveness and efficiency of our models.

A Personalized Automated Bidding Framework for Fairness-aware Online Advertising

Powered by machine learning techniques, online advertising platforms have launched various automated bidding strategy services to facilitate intelligent decision-making for advertisers. However, advertisers experience heterogeneous advertising environments, so the unified bidding strategies widely used in both academia and industry suffer from severe unfairness issues, resulting in significant ad performance disparity among advertisers. In this work, to resolve the unfairness issue and improve overall system performance, we propose a personalized automated bidding framework, namely PerBid, shifting the classical automated bidding strategy from a unified agent to multiple context-aware agents corresponding to different advertiser clusters. Specifically, we first design an ad campaign profiling network to model dynamic advertising environments. By clustering advertisers with similar profiles and generating a context-aware automated bidding agent for each cluster, we can match advertisers with personalized automated bidding strategies. Experiments conducted on a real-world dataset and an online A/B test on the Alibaba display advertising platform demonstrate the effectiveness of PerBid in improving overall ad performance and guaranteeing fairness among heterogeneous advertisers.

Understanding the Semantics of GPS-based Trajectories for Road Closure Detection

The accurate detection of road closures is of great value for real-time updating of digital maps. Existing methods mainly follow the paradigm of detecting drastic changes in traffic statistics (e.g., traffic flow), but they may lead to misidentification since 1) drastic changes in traffic statistics are hard to observe on low-heat roads where passing vehicles are sparse; 2) statistical values are sensitive to noise (e.g., traffic flow for tiny roads and tunnels is prone to miscounting); and 3) statistical values are naturally delayed, and misidentification may occur before they show significant changes. Fortunately, since GPS-based trajectories also exhibit significant abnormal patterns around road closures and have the advantage of fine granularity and timeliness, they can naturally tackle the above challenges. In this paper, we present a novel road closure detection framework based on mining the semantics of trajectories, called T-Closure. We first construct a heterogeneous graph based on the trajectory and the planned route to extract the spatial-topological property of each trajectory, where a node-level auxiliary task is proposed to guide the learning of feature encoders. A multi-view heterogeneous graph neural network (MVH-GNN) with a graph-level auxiliary task is then introduced to capture the semantics of trajectories, where both intra-category relevance and inter-category interaction are considered. Finally, a sequence-level auxiliary task refines the ability of the LSTM to model the semantic relevance among trajectories while enhancing the robustness of our framework. Experiments on four real-world road closure datasets demonstrate the superiority of T-Closure. Online performance shows that T-Closure can detect 7000+ closure events monthly, with a delay of at most 1.5 hours.

GLM-Dialog: Noise-tolerant Pre-training for Knowledge-grounded Dialogue Generation

We present GLM-Dialog, a large-scale language model (LLM) with 10B parameters capable of knowledge-grounded conversation in Chinese, using a search engine to access Internet knowledge. GLM-Dialog offers a series of applicable techniques for exploiting various external knowledge, both helpful and noisy, enabling the creation of robust knowledge-grounded dialogue LLMs with limited proper datasets. To evaluate GLM-Dialog more fairly, we also propose a novel evaluation method that allows humans to converse with multiple deployed bots simultaneously and compare their performance implicitly, instead of rating them explicitly with multidimensional metrics. Comprehensive evaluations from automatic to human perspectives demonstrate the advantages of GLM-Dialog compared with existing open-source Chinese dialogue models. We release both the model checkpoint and source code, and also deploy the model as a WeChat application to interact with users. We offer our evaluation platform online in an effort to promote the development of open-source models and reliable dialogue evaluation systems. All the source code is available on GitHub.

A Collaborative Transfer Learning Framework for Cross-domain Recommendation

In recommendation systems, there are multiple business domains to meet the diverse interests and needs of users, and the click-through rate (CTR) of each domain can be quite different, which creates a demand for CTR prediction modeling tailored to each business domain. The common industry solution is to use a domain-specific model for each domain or to apply transfer learning techniques. The disadvantage of the former is that a single-domain model cannot utilize data from other domains, while the latter leverages all the data from different domains, but the fine-tuned model of transfer learning may trap the model in a local optimum of the source domain, making it difficult to fit the target domain. Meanwhile, significant differences in data quantity and feature schemas between domains, known as domain shift, may lead to negative transfer during the transfer process. To overcome these challenges, we propose the Collaborative Cross-Domain Transfer Learning Framework (CCTL). CCTL evaluates the information gain of the source domain on the target domain using a symmetric companion network and adjusts the information transfer weight of each source domain sample using an information flow network. This approach enables full utilization of other domains' data while avoiding negative transfer. Additionally, a representation enhancement network is used as an auxiliary task to preserve domain-specific features. In comprehensive experiments on both public and real-world industrial datasets, CCTL achieved state-of-the-art (SOTA) scores on offline metrics. The CCTL algorithm has also been deployed in Meituan, bringing a 4.37% CTR and 5.43% GMV lift, which is significant for the business.

Constrained Social Community Recommendation

In online social networks, users with similar interests tend to come together, forming social communities. Nowadays, user-defined communities become a prominent part of online social platforms as people who have joined such communities tend to be more active in social networks. Therefore, recommending explicit communities to users provides great potential to advance online services.

In this paper, we focus on the constrained social community recommendation problem in real applications, where each user can join at most one community. Previous attempts at community recommendation mostly adopt collaborative filtering or random walk-based approaches, while ignoring social relationships between users as well as the local structure of each community. As a result, they derive only an extremely sparse affinity matrix, which degrades model performance. To tackle this issue, we propose ComRec, which simultaneously captures both global and local information on the extended graph during pre-computation, speeding up the training process on real-world large graphs. In addition, we present a labeling component to improve the expressiveness of our framework. We conduct experiments on three Tencent mobile games to evaluate our proposed method. Extensive experimental results show that ComRec consistently outperforms its competitors by up to 12.80% and 6.61% in the corresponding evaluation metrics of offline and online experiments, respectively.

TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter

Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and the pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter, trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.

Empowering Long-tail Item Recommendation through Cross Decoupling Network (CDN)

Industry recommender systems usually suffer from highly-skewed long-tail item distributions where a small fraction of the items receives most of the user feedback. This skew hurts recommender quality especially for the item slices without much user feedback. While there have been many research advances made in academia, deploying these methods in production is very difficult and very few improvements have been made in industry. One challenge is that these methods often hurt overall performance; additionally, they could be complex and expensive to train and serve.

In this work, we aim to improve tail item recommendations while maintaining overall performance with lower training and serving cost. We first find that predictions of user preferences are biased under long-tail distributions. The bias comes from differences between the training and serving data from two perspectives: 1) the item distributions, and 2) users' preferences given an item. Most existing methods mainly attempt to reduce the bias from the item-distribution perspective, ignoring the discrepancy in user preferences given an item. This leads to a severe forgetting issue and results in sub-optimal performance.

To address the problem, we design a novel Cross Decoupling Network (CDN) to reduce the two differences. Specifically, CDN (i) decouples the learning process of memorization and generalization on the item side through a mixture-of-expert architecture; (ii) decouples the user samples from different distributions through a regularized bilateral branch network. Finally, a new adapter is introduced to aggregate the decoupled vectors, and softly shift the training attention to tail items. Extensive experimental results show that CDN significantly outperforms state-of-the-art approaches on popular benchmark datasets. We also demonstrate its effectiveness by a case study of CDN in a large-scale recommendation system at Google.
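As a rough illustration of the item-side decoupling idea, the sketch below gates between a "memorization" expert and a "generalization" expert based on item popularity. The two linear experts and the logistic gate on log-frequency are illustrative assumptions, not CDN's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(x, W):
    """A one-layer 'expert' (linear map + ReLU) standing in for a full MLP."""
    return np.maximum(0.0, x @ W)

d_in, d_out = 8, 4
W_mem = rng.normal(size=(d_in, d_out))   # memorization expert (e.g., fed ID embeddings)
W_gen = rng.normal(size=(d_in, d_out))   # generalization expert (e.g., fed content features)

def moe_item_repr(x, item_freq):
    """Gate softly between experts based on item popularity.

    Popular (head) items lean on the memorization expert; rare (tail) items
    lean on generalization. The logistic gate on log-frequency is one
    illustrative choice of gating signal.
    """
    g = 1.0 / (1.0 + np.exp(-(np.log1p(item_freq) - 3.0)))  # gate in (0, 1)
    return g * expert(x, W_mem) + (1.0 - g) * expert(x, W_gen)

x = rng.normal(size=d_in)
head = moe_item_repr(x, item_freq=10_000)  # gate ≈ 1: mostly memorization expert
tail = moe_item_repr(x, item_freq=1)       # gate ≈ 0.09: mostly generalization expert
```

In CDN the gate is learned and the adapter softly shifts training attention to tail items; here the gate is fixed purely for illustration.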

Towards Disentangling Relevance and Bias in Unbiased Learning to Rank

Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods have ignored: the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on the relevance tower due to such a correlation. We then propose two methods to mitigate the negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conduct a live experiment on a popular web store for four weeks, and find a significant improvement in user clicks over the baseline, which ignores the negative confounding effect.
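The two-tower factorization can be sketched as an additive-in-logit click model, a common form of this architecture; the tiny linear towers and weights below are illustrative stand-ins for the deep networks used in practice:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative towers: tiny linear models standing in for deep networks.
w_rel = rng.normal(size=5)   # relevance tower weights (query/document features)
w_pos = -0.5                 # bias tower weight on (zero-based) position

def click_prob(doc_features, position):
    """Two-tower click model: logits from the two towers are added, so at
    serving time the relevance tower alone can be used for ranking
    (assuming the factorization succeeded and the towers are not confounded)."""
    rel_logit = doc_features @ w_rel
    bias_logit = w_pos * position        # lower positions get lower click odds
    return sigmoid(rel_logit + bias_logit)

feats = rng.normal(size=5)
p_top = click_prob(feats, position=0)
p_low = click_prob(feats, position=9)    # same document, lower rank, lower P(click)
```

The confounding issue the paper identifies arises precisely because `position` in logged data was chosen by a relevance-aware policy, so the bias logit is not independent of the relevance logit.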

Modeling Dual Period-Varying Preferences for Takeaway Recommendation

Takeaway recommender systems, which aim to accurately provide stores that offer foods meeting users' interests, have served billions of users in our daily life. Different from traditional recommendation, takeaway recommendation faces two main challenges: (1) Dual Interaction-Aware Preference Modeling. Traditional recommendation commonly focuses on users' single preferences for items while takeaway recommendation needs to comprehensively consider users' dual preferences for stores and foods. (2) Period-Varying Preference Modeling. Conventional recommendation generally models continuous changes in users' preferences from a session-level or day-level perspective. However, in practical takeaway systems, users' preferences vary significantly during the morning, noon, night, and late night periods of the day. To address these challenges, we propose a Dual Period-Varying Preference modeling (DPVP) for takeaway recommendation. Specifically, we design a dual interaction-aware module, aiming to capture users' dual preferences based on their interactions with stores and foods. Moreover, to model various preferences in different time periods of the day, we propose a time-based decomposition module as well as a time-aware gating mechanism. Extensive offline and online experiments demonstrate that our model outperforms state-of-the-art methods on real-world datasets and it is capable of modeling the dual period-varying preferences. Moreover, our model has been deployed online on Meituan Takeaway platform, leading to an average improvement in GMV (Gross Merchandise Value) of 0.70%.

Robust Multimodal Failure Detection for Microservice Systems

Proactive failure detection of instances is essential to microservice systems because an instance failure can propagate to the whole system and degrade the system's performance. Over the years, many single-modal (i.e., metrics, logs, or traces) data-based anomaly detection methods have been proposed. However, they tend to miss a large number of failures and generate numerous false alarms because they ignore the correlation of multimodal data. In this work, we propose AnoFusion, an unsupervised failure detection approach, to proactively detect instance failures through multimodal data for microservice systems. It applies a Graph Transformer Network (GTN) to learn the correlation of the heterogeneous multimodal data and integrates a Graph Attention Network (GAT) with a Gated Recurrent Unit (GRU) to address the challenges introduced by dynamically changing multimodal data. We evaluate the performance of AnoFusion on two datasets, demonstrating that it achieves F1-scores of 0.857 and 0.922, respectively, outperforming state-of-the-art failure detection approaches.

M5: Multi-Modal Multi-Interest Multi-Scenario Matching for Over-the-Top Recommendation

Matching preferred shows to subscribers is extremely important on Over-the-Top (OTT) platforms. Existing methods do not adequately consider the characteristics of OTT services, i.e., rich meta information, diverse user interests, and mixed recommendation scenarios, leading to sub-optimal performance. This paper introduces Multi-Modal Multi-Interest Multi-Scenario Matching (M5) for OTT recommendation to fully exploit these attributes. A multi-modal embedding layer is first introduced to transform show IDs into both ID embeddings, initialized randomly, and content graph (CG) embeddings derived from node representations pre-trained on a metagraph. To segregate the semantics between ID and CG embeddings, M5 exploits mirrored two-tower modeling in the subsequent layers for efficiency and effectiveness. Specifically, a multi-interest extraction layer is proposed separately for ID and CG behaviors to model users' coarse-grained and fine-grained interests through behavioral categorization, subsidiary decoration, masked-language-modeling-augmented self-attention, and subsidiary-intensity interest calibration. Facing the inherently diverse scenarios, M5 distinguishes scenario differences at both the feature and model levels: it crosses features with scenario indicators and employs Split Mixture-of-Experts to generate the ID and CG user embeddings. Finally, a weighted candidate matching layer is established to calculate the ID- and CG-oriented user-item preferences, which are then merged into a hybrid score with dynamic weighting. Extensive online and offline experiments on two real-world OTT platforms, Hulu and Disney+, reveal that M5 significantly outperforms previous state-of-the-art and online matching algorithms across various scenarios, indicating the effectiveness and robustness of the proposed method.
M5 has been fully deployed on the main traffic of the most popular "For You" sets of both platforms, continuously enhancing the user experience for hundreds of millions of subscribers every day and steadily increasing business revenue.

JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving

Although pre-trained language models (PLMs) have recently advanced research progress in mathematical reasoning, they are not specially designed as capable multi-task solvers, suffering from high cost for multi-task deployment (e.g., one model copy per task) and inferior performance on complex mathematical problems in practical applications. To address these issues, we propose JiuZhang 2.0, a unified Chinese PLM specifically designed for multi-task mathematical problem solving. Our idea is to maintain a moderate-sized model and employ cross-task knowledge sharing to improve model capacity in a multi-task setting. Specifically, we construct a Mixture-of-Experts (MoE) architecture for modeling mathematical text, to capture common mathematical knowledge across tasks. To optimize the MoE architecture, we design multi-task continual pre-training and multi-task fine-tuning strategies for multi-task adaptation. These training strategies can effectively decompose the knowledge from the task data and establish cross-task sharing via expert networks. To further improve the general capacity for solving different complex tasks, we leverage large language models (LLMs) as complementary models to iteratively refine the solutions generated by our PLM via in-context learning. Extensive experiments demonstrate the effectiveness of our model.

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

Large pre-trained code generation models, such as OpenAI Codex, can generate syntax- and function-correct code, making programmers more productive. In this paper, we introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation. CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages as of June 2022. Our extensive experiments suggest that CodeGeeX outperforms multilingual code models of similar scale on both code generation and translation tasks on HumanEval-X. Building upon HumanEval (Python only), we develop the HumanEval-X benchmark for evaluating multilingual models by hand-writing solutions in C++, Java, JavaScript, and Go. In addition, we build CodeGeeX-based extensions for Visual Studio Code, JetBrains, and Cloud Studio, generating 8 billion tokens for tens of thousands of active users per week. Our user study demonstrates that CodeGeeX can help to increase coding efficiency for 83.4% of its users. Finally, CodeGeeX has been publicly accessible since Sep. 2022; we open-sourced its code, model weights, API, extensions, and HumanEval-X at https://github.com/THUDM/CodeGeeX.
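HumanEval-style benchmarks are typically reported with the unbiased pass@k estimator introduced with the original HumanEval benchmark: for n generated samples of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). A direct transcription:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations pass the unit tests."""
    if n - c < k:   # too few failing samples to fill k slots: always a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 200 generations per problem with 50 passing gives pass@1 = 50/200 = 0.25
p1 = pass_at_k(200, 50, 1)
```

Whether HumanEval-X uses exactly this sampling budget is not stated here; the estimator itself is the standard one for functional-correctness benchmarks.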

MIDLG: Mutual Information based Dual Level GNN for Transaction Fraud Complaint Verification

"Transaction fraud" complaint verification, i.e., verifying whether a transaction corresponding to a complaint is fraudulent, is particularly critical to preventing economic loss. Compared with traditional pre-transaction fraud detection, complaint verification imposes higher requirements: 1) an individual tends to exhibit different identities in different complaints, e.g., complainant or respondent, requiring the model to capture identity-related representations corresponding to the complaint; 2) fraud tactics evolve frequently to evade detection, requiring the model to perform stably across different fraud tactics. Previous methods mainly focused on pre-transaction fraud detection, utilizing users' historical information or conducting message passing with GNNs on relationship networks. However, they rarely capture various identity-related representations and ignore the evolution of fraud tactics, leading to failure in complaint verification. To address the above challenges, we propose a mutual information based dual-level graph neural network, namely MIDLG, which defines a complaint as a super-node consisting of the involved individuals and characterizes each individual at both the node level and the super-node level. Furthermore, a mutual information minimization objective is proposed based on the "complaint verification-causal graph" to decouple the model prediction from relying on specific fraud tactics, and thus achieve stability. MIDLG achieves SOTA results in extensive complaint verification experiments on WeChat Finance, an online payment service serving more than 600 million users in China.

Road Planning for Slums via Deep Reinforcement Learning

Millions of slum dwellers suffer from poor accessibility to urban services due to inadequate road infrastructure within slums, and road planning for slums is critical to the sustainable development of cities. Existing re-blocking or heuristic methods are either time-consuming and unable to generalize to different slums, or yield sub-optimal road plans in terms of accessibility and construction costs. In this paper, we present a deep reinforcement learning based approach to automatically lay out roads for slums. We propose a generic graph model to capture the topological structure of a slum, and devise a novel graph neural network to select locations for the planned roads. Through masked policy optimization, our model can generate road plans that connect places in a slum at minimal construction costs. Extensive experiments on real-world slums in different countries verify the effectiveness of our model, which can significantly improve accessibility by 14.3% against existing baseline methods. Further investigations on transferring across different tasks demonstrate that our model can master road planning skills in simple scenarios and adapt them to much more complicated ones, indicating the potential of applying our model in real-world slum upgrading. The code and data are available at https://github.com/tsinghua-fib-lab/road-planning-for-slums.
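The action masking underlying masked policy optimization amounts to zeroing out invalid actions before sampling from the policy. The mask semantics below (which road segments are buildable) are an illustrative assumption:

```python
import numpy as np

def masked_policy(logits, valid_mask):
    """Softmax restricted to valid actions: invalid logits are set to -inf,
    so invalid actions get exactly zero probability (and contribute no
    gradient signal during policy optimization)."""
    masked = np.where(valid_mask, logits, -np.inf)
    z = masked - masked.max()            # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])                # policy-network scores
mask = np.array([True, False, True, False])            # e.g., only segments 0 and 2 buildable
probs = masked_policy(logits, mask)                    # nonzero only where mask is True
```

Note that segment 3 has the highest raw logit but receives zero probability once masked, which is exactly the point: the policy can never propose an infeasible road.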

Online Few-Shot Time Series Classification for Aftershock Detection

Seismic monitoring systems sift through seismograms in real time, searching for target events such as underground explosions. In such a monitoring system, a burst of aftershocks (minor earthquakes that occur after a major earthquake, over days or even years) can be a source of confounding signals and can overload the human analysts of the monitoring system. To alleviate this burden at the onset of a sequence of events (e.g., aftershocks), a human analyst can label the first few of these events and start an online classifier to filter out subsequent aftershock events. We propose FewSig, an online few-shot classification model for time series data for the above use case. The framework of FewSig consists of a selective model, which identifies high-confidence positive events used to update the models, and a general classifier, which labels the remaining events. Our specific technique uses a selective model based on sliding DTW distance and a general classifier based on distance metric learning with Neighborhood Component Analysis (NCA). The algorithm demonstrates surprising robustness when tested on univariate datasets from the UEA/UCR archive. Furthermore, we show two real-world earthquake events where FewSig reduces the human effort in monitoring applications by filtering out the aftershock events.
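The DTW distance at the core of the selective model can be computed with the classic dynamic program. This is plain, unconstrained DTW on 1-D series; the paper's sliding variant and warping-window choices are not reproduced here:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D series, with a
    squared-error local cost and no warping-window constraint."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # D[i, j]: best cost aligning a[:i], b[:j]
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])

# A time-shifted copy of a waveform is close under DTW even though it is
# far apart pointwise, which is why DTW suits aftershock waveform matching.
d = dtw_distance([0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0])
```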

PDAS: A Practical Distributed ADMM System for Large-Scale Linear Programming Problems at Alipay

Linear programming (LP) is arguably the most common optimization problem encountered in practical settings. Important examples include machine learning systems optimization, resource allocation, and other decision-making scenarios. However, even with state-of-the-art (SOTA) solvers, it is extremely challenging to solve the large-scale problems arising in industry settings, which can have up to billions of decision variables and require solutions within a time limit to meet business demands. This paper proposes PDAS, a Practical Distributed ADMM System that solves such problems with a variant of the Alternating Direction Method of Multipliers (ADMM) algorithm. PDAS offers user-friendly interfaces and provides near-linear speedup thanks to its high scalability and excellent performance. It also comes with a failover mechanism to ensure the stability of the iterative process. The convergence, feasibility, and optimality of PDAS have been verified on two real-world datasets, resulting in a 10^-4 average relative deviation from Gurobi. Although SOTA solvers do have an advantage in solving time alone when tested on five small and medium-sized public datasets, PDAS is more promising once modeling time is included. Moreover, when used to solve large-scale LP problems with up to 10^9 decision variables and 10^4 constraints in three real-world scenarios, PDAS achieves at least 2x speedups, well beyond the capabilities of SOTA solvers.
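A single-machine sketch of the ADMM splitting behind such solvers, for a standard-form LP (min c·x subject to Ax = b, x ≥ 0): alternate between a KKT solve on the equality constraints and a clamp onto the nonnegative orthant. PDAS's distributed execution, ADMM variant, and failover mechanism are not reproduced here:

```python
import numpy as np

def admm_lp(c, A, b, rho=1.0, iters=500):
    """ADMM for min c@x s.t. A@x = b, x >= 0, via the splitting
    x in {Ax = b} (equality-constrained QP, solved through its KKT system)
    and z >= 0 (projection), with x = z enforced via the scaled dual u."""
    m, n = A.shape
    z = np.zeros(n)
    u = np.zeros(n)
    # KKT matrix of the x-update is constant across iterations.
    K = np.block([[rho * np.eye(n), A.T],
                  [A, np.zeros((m, m))]])
    for _ in range(iters):
        rhs = np.concatenate([rho * (z - u) - c, b])
        x = np.linalg.solve(K, rhs)[:n]   # x-update: minimize over {Ax = b}
        z = np.maximum(0.0, x + u)        # z-update: project onto x >= 0
        u = u + x - z                     # dual update on consensus x = z
    return z

# Toy LP: min x1 + 2*x2  s.t.  x1 + x2 = 1,  x >= 0  -> optimum (1, 0).
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
sol = admm_lp(c, A, b)
```

In a distributed setting the x-update decomposes across worker partitions, which is what makes ADMM attractive at this scale.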

ReLoop2: Building Self-Adaptive Recommendation Models via Responsive Error Compensation Loop

Industrial recommender systems face the challenge of operating in non-stationary environments, where data distribution shifts arise from evolving user behaviors over time. To tackle this challenge, a common approach is to periodically re-train or incrementally update deployed deep models with newly observed data, resulting in a continual learning process. However, the conventional learning paradigm of neural networks relies on iterative gradient-based updates with a small learning rate, making it slow for large recommendation models to adapt. In this paper, we introduce ReLoop2, a self-correcting learning loop that facilitates fast model adaptation in online recommender systems through responsive error compensation. Inspired by the slow-fast complementary learning system observed in human brains, we propose an error memory module that directly stores error samples from incoming data streams. These stored samples are subsequently leveraged to compensate for model prediction errors during testing, particularly under distribution shifts. The error memory module is designed with fast access capabilities and undergoes continual refreshing with newly observed data samples during the model serving phase to support fast model adaptation. We evaluate the effectiveness of ReLoop2 on three open benchmark datasets as well as a real-world production dataset. The results demonstrate the potential of ReLoop2 in enhancing the responsiveness and adaptiveness of recommender systems operating in non-stationary environments.
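The error-memory idea can be sketched as a nearest-neighbour correction: store recent (feature, residual) pairs, and at serving time add the residual of the closest stored sample to the frozen base model's prediction. The memory size, single-neighbour lookup, and Euclidean distance below are illustrative simplifications, not ReLoop2's design:

```python
import numpy as np
from collections import deque

class ErrorMemory:
    """Ring buffer of (feature, residual) pairs that compensates a stale base
    model's prediction with the residual of the nearest stored sample."""

    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)   # old entries are evicted automatically

    def observe(self, x, y_true, y_pred):
        """Record the prediction error on a newly labelled sample."""
        self.buf.append((np.asarray(x, float), y_true - y_pred))

    def compensate(self, x, y_pred):
        """Correct a fresh prediction using the closest stored error."""
        if not self.buf:
            return y_pred
        x = np.asarray(x, float)
        _, residual = min(self.buf, key=lambda t: np.linalg.norm(t[0] - x))
        return y_pred + residual

# The base model is stale: it keeps predicting 0.2 after the true rate drifts to 0.5.
mem = ErrorMemory()
mem.observe(x=[1.0, 0.0], y_true=0.5, y_pred=0.2)        # logged error sample
corrected = mem.compensate(x=[0.9, 0.1], y_pred=0.2)     # nearby query gets the +0.3 fix
```

Because the memory is refreshed from the incoming stream, the correction tracks distribution shift far faster than gradient updates to the base model.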

A Feature-Based Coalition Game Framework with Privileged Knowledge Transfer for User-tag Profile Modeling

User-tag profiling is an effective way of mining user attributes in modern recommender systems. However, prior research fails to extract users' precise preferences for tags in items due to incomplete feature-input patterns. To convert user-item interactions into user-tag preferences, we propose a novel feature-based framework named Coalition Tag Multi-View Mapping (CTMVM), which identifies and investigates two special features: the Coalition Feature and the Privileged Feature. The former indicates decisive tags in each click, where relationships between tags in one item are treated as a coalition game. The latter represents highly informative features that only occur during training. For the coalition feature, we adopt Shapley Value based Empowerment (SVE) to model the tags in items with a game-theoretic paradigm and drive the network to directly master user preferences for essential tags. For the privileged feature, we present Privileged Knowledge Mapping (PKM) to explicitly distill privileged feature knowledge for each tag into a single embedding, which helps the model predict user-tag preferences at a more fine-grained level. However, the limited capacity of single embeddings restricts the diverse relations between each tag and different privileged features. Therefore, we further propose the Adaptive Multi-View Mapping (AMVM) model, which improves performance by employing multiple mapping networks. Offline experiments on two public datasets and one private dataset show the outstanding performance of CTMVM. After deployment on Alibaba's large-scale recommendation systems, CTMVM achieved improvements of 10.81% and 6.74% in Theme-CTR and Item-CTR respectively, which validates the effectiveness of incorporating the two particular features during training.
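Treating the tags of a clicked item as players in a coalition game, the exact Shapley value can be computed by subset enumeration for small tag sets. The two-tag value function below is a made-up example, not from the paper:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley values: each player's marginal contribution averaged
    over all orderings, via the standard subset-weighted sum."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(others, r):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[p] += weight * (v(set(S) | {p}) - v(set(S)))
    return phi

# Toy tag game: 'sale' alone is worth 1, 'brand' alone 2, together 4
# (a synergy of 1 that the Shapley value splits equally).
def v(S):
    return {frozenset(): 0, frozenset({'sale'}): 1,
            frozenset({'brand'}): 2, frozenset({'sale', 'brand'}): 4}[frozenset(S)]

phi = shapley_values(['sale', 'brand'], v)   # {'sale': 1.5, 'brand': 2.5}
```

Enumeration is exponential in the number of tags; in practice Shapley-based methods rely on sampling or closed-form approximations for larger coalitions.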

C-AOI: Contour-based Instance Segmentation for High-Quality Areas-of-Interest in Online Food Delivery Platform

Online food delivery (OFD) services have become popular globally, serving people's daily needs. Precise area-of-interest (AOI) boundaries help OFD platforms determine customers' exact locations, which is crucial for maintaining consistency in delivery difficulty and providing a uniform customer experience within an AOI. Existing AOI generation methods primarily rely on predefined shapes or density-based clustering, which limits the quality of the contours. Recently, Meituan has treated AOI generation as a binary semantic segmentation problem. Their approach involves multi-step post-processing to address boundary breaks caused by semantic segmentation models, leading to decreased quality and an inefficient learning process. In this paper, we propose a novel method for AOI contour generation called C-AOI (Contour-based Area-of-Interest). C-AOI is an instance segmentation model that focuses on generating high-quality AOI contours. Unlike the former method, which relies on pixel-by-pixel classification, C-AOI starts from the center point of the AOI and regresses the boundary, resulting in a higher-quality boundary at lower computational cost. C-AOI first corrects errors on the contour using a local aggregation mechanism. Then, we propose a novel deforming module called the contour transformer, which captures the global geometry of the object. To enhance the positional relationships among vertices, we introduce a learnable cyclic positional encoding applied to the contour transformer. Finally, to improve boundary details, we propose the Adaptive Matching Loss (AML), which eliminates over-smoothed boundaries and promotes optimized convergence pathways. Experimental results on real-world datasets collected from Meituan demonstrate that C-AOI significantly improves mask and boundary quality compared to Meituan's previous work.
Moreover, its inference speed is comparable to that of E2EC, a state-of-the-art real-time contour-based method. Notably, C-AOI has been deployed on the Meituan platform for producing AOIs.

SESSION: Hands On Tutorials

AI Explainability 360 Toolkit for Time-Series and Industrial Use Cases

With the growing adoption of AI, trust and explainability have become critical concerns, attracting a lot of research attention over the past decade and leading to the development of many popular AI explainability libraries such as AIX360, Alibi, and OmniXAI. Despite this, applying explainability techniques in practice often poses challenges such as inconsistency between explainers, semantically incorrect explanations, or lack of scalability. Furthermore, one of the key modalities that has been less explored, from both the algorithmic and practical points of view, is time series. Many application domains involve time series, including Industry 4.0, asset monitoring, supply chain, and finance, to name a few.

The AIX360 library (https://github.com/Trusted-AI/AIX360) has been incubated by the Linux Foundation AI & Data open-source projects and has gained significant popularity: its public GitHub repository has over 1.3K stars, and it has been broadly adopted in academic and applied settings. Motivated by industrial applications, large-scale client projects, and deployments in software products in the areas of IoT, asset management, and supply chain, the AIX360 library has recently been expanded significantly to address the above challenges. AIX360 now supports the time-series modality, introducing time-series explainers such as TS-LIME, TS Saliency explainer, TS-ICE, and TS-SHAP. It also introduces improvements in generating model-agnostic, consistent, diverse, and scalable explanations, as well as new algorithms for tabular data.

In this hands-on tutorial, we provide an overview of the library with a focus on the latest additions: the time-series explainers and use cases such as forecasting, time-series anomaly detection, and classification. Hands-on demonstrations based on industrial use cases are selected to illustrate practical challenges and how they are addressed. The audience will be able to evaluate different types of explanations, with a focus on practical aspects motivated by real deployments.

Addressing Bias and Fairness in Machine Learning: A Practical Guide and Hands-on Tutorial

As data science and machine learning (ML) increasingly shape our society, the importance of developing fair algorithmic decision-making systems becomes paramount. There is a pressing need to train data scientists and practitioners on handling bias and fairness in real-world scenarios, from early stages of a data science project to maintaining ML systems in production. Existing resources are mostly academic and cover the ML training and optimization aspects of bias mitigation, leaving practitioners without comprehensive frameworks for making decisions throughout a real-world project lifecycle. This tutorial aims to bridge the gap between research and practice, providing an in-depth exploration of algorithmic fairness, encompassing metrics and definitions, practical case studies, data bias understanding, bias mitigation and model fairness audits using the Aequitas toolkit. Participants will be equipped to engage in conversations about bias, assist decision-makers in understanding options and trade-offs, evaluate project scoping aspects influencing fairness outcomes, and define actions and interventions based on model predictions. They will also learn to identify cohorts, target variables, evaluation metrics, and establish bias and fairness goals for different groups. Moreover, participants will gain insights into auditing and mitigating model bias, and implementing continuous monitoring to assess retraining needs. The tutorial addresses the current lack of practical training materials, methodologies, and tools for researchers and developers working on real-world algorithmic decision-making systems. By the conclusion of this hands-on tutorial, attendees will be well-versed in navigating bias-related issues, selecting appropriate metrics, and applying bias audit and mitigation frameworks and tools for informed design decisions in real-world data science systems.
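One of the core audit computations discussed above, per-group selection rates and their disparity relative to a reference group, can be sketched without any toolkit. The Aequitas toolkit automates this across many metrics and group definitions; the code below is a generic illustration, not its API, and the records and 80%-rule band are made-up examples:

```python
def selection_rates(records):
    """Per-group selection rate: the fraction of each group the model
    flags positive. `records` is a list of (group, decision) pairs."""
    totals, positives = {}, {}
    for group, decision in records:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + int(decision)
    return {g: positives[g] / totals[g] for g in totals}

def disparities(rates, reference):
    """Each group's rate divided by the reference group's rate; a common
    fairness screen flags groups falling outside, say, the [0.8, 1.25] band."""
    ref = rates[reference]
    return {g: r / ref for g, r in rates.items()}

records = [('A', 1), ('A', 1), ('A', 0), ('A', 0),   # group A: 2/4 selected
           ('B', 1), ('B', 0), ('B', 0), ('B', 0)]   # group B: 1/4 selected
rates = selection_rates(records)
disp = disparities(rates, reference='A')             # B's disparity: 0.5, flaggable
```

The same pattern extends to error-based metrics (false positive rate, false negative rate) once true labels are included in the records.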

Practical Design of Performant Recommender Systems using Large-scale Linear Programming-based Global Inference

Several key problems in web-scale recommender systems, such as optimal matching and allocation, can be formulated as large-scale linear programs (LPs) [4, 1]. These LPs take predictions from ML models such as probabilities of click, like, etc. as inputs and optimize recommendations made to users. In recent years, there has been an explosion in the research and development of large-scale recommender systems, but effective optimization of business objectives using the output of those systems remains a challenge. Although LPs can help optimize such business objectives, and algorithms for solving LPs have existed since the 1950s [5, 8], generic LP solvers cannot handle the scale of these problems. At LinkedIn, we have developed algorithms that can solve LPs of various forms with trillions of variables in a Spark-based library called "DuaLip" [7], a novel distributed solver that solves a perturbation of the LP problem at scale via gradient-based algorithms on the smooth dual of the perturbed LP. DuaLip has been deployed in production at LinkedIn and powers several very large-scale recommender systems. DuaLip is open-sourced and extensible in terms of features and algorithms.
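The smooth-dual approach can be sketched on a tiny LP: add a small quadratic perturbation (gamma/2)·||x||² so the box-constrained inner minimization has a per-coordinate closed form, then run projected gradient ascent on the dual variables. This is a one-machine caricature of the idea under assumed problem data, not DuaLip's implementation:

```python
import numpy as np

def smooth_dual_lp(c, A, b, gamma=0.1, eta=0.05, iters=500):
    """min c@x s.t. A@x <= b, 0 <= x <= 1, perturbed by (gamma/2)*||x||^2.
    The inner problem in x separates per coordinate (closed-form clip);
    the outer loop is projected gradient ascent on the nonnegative dual."""
    lam = np.zeros(len(b))
    for _ in range(iters):
        x = np.clip(-(c + A.T @ lam) / gamma, 0.0, 1.0)  # closed-form primal response
        lam = np.maximum(0.0, lam + eta * (A @ x - b))   # ascend dual, keep lam >= 0
    return x

# Toy allocation: maximize x1 + 2*x2 (so c = [-1, -2]) with x1 + x2 <= 1.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = smooth_dual_lp(c, A, b)   # converges to roughly (0, 1): all budget to item 2
```

The per-coordinate primal response is what makes the approach scale: with trillions of variables, each coordinate of x can be recovered independently from the (much smaller) dual vector.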

In this first-of-its-kind tutorial, we will motivate the application of LPs to improve recommender systems, cover the theory of key LP algorithms [8, 6], and introduce DuaLip (https://github.com/linkedin/DuaLip), a highly performant Spark-based library that solves extreme-scale LPs for a large variety of recommender system problems. We will describe practical successes of large-scale LP in t