KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

SESSION: Keynote Addresses

Data Science for Financial Applications

David J. Hand

Financial applications of data science provide a perfect illustration of the power of the shift from subjective decision-making to data- and evidence-driven decision-making. In the space of some fifty years, an entire sector of industry has been totally revolutionised. Such applications come in three broad areas: actuarial and insurance, consumer banking, and investment banking. Actuarial and insurance work was one of the earliest adopters of data science ideas, dating from long before the term had been coined, and even before the computer had been invented. But these areas have fallen behind the latest advances in data science technology - which means there is considerable potential for applying modern data analytic ideas. Consumer banking has been described as one of the first and major success stories of the data revolution. Dating from the 1960s, when the first credit cards were launched, techniques for analysing the massive data sets of consumer financial transactions have driven much of the development of data mining and data science ideas. But new model types, and new sources of data, are leading to a rich opportunity for significant developments. In investment banking the “efficient market hypothesis” of classic economics says that it is impossible to predict the financial markets. But this is false - though very nearly true. That means that there is an opportunity to use advanced data analytic methods to exploit the tiny gap between conventional theory and what actually happens. Other data science issues, such as data quality, ethics, and security, along with the need to understand the limitations of models, become particularly pointed in the context of financial applications.

Market Design and Computerized Marketplaces

Alvin E. Roth

Markets and marketplaces are ancient human artifacts, but in recent years they have become ever more important. In part this is because marketplaces are becoming computerized. Together with the introduction of smart phones, this also makes them ubiquitous. We can order car rides to the airport, plane rides to London, and hotel rooms for when we arrive, all on our smartphones. And as we do so we leave a data trail that is easily combined with other streams of data. This is changing not only how we interact with markets, but also how we manage and regard privacy. I’ll discuss some recent developments in computerized markets and speculate about some still to come.

On Big Data Learning for Small Data Problems

Yee Whye Teh

Much recent progress in machine learning has been fueled by the explosive growth in the amount and diversity of data available, and the computational resources needed to crunch through the data. This raises the question of whether machine learning systems necessarily need large amounts of data to solve a task well. An exciting recent development, under the banners of meta-learning, lifelong learning, learning to learn, multitask learning etc., has been the observation that often there is heterogeneity within the data sets at hand, and in fact a large data set can be viewed more productively as many smaller data sets, each pertaining to a different task. For example, in recommender systems each user can be said to be a different task with a small associated data set, and in AI one holy grail is how to develop systems that can learn to solve new tasks quickly from small amounts of data. In such settings, the problem is then how to “learn to learn quickly”, by making use of similarities among tasks. One perspective for how this is achievable is that exposure to lots of previous tasks allows the system to learn rich prior knowledge about the world from which tasks are sampled, and it is with this rich world knowledge that the system is able to solve new tasks quickly. This is a very active, vibrant and diverse area of research, with many different approaches proposed recently. In this talk I will describe a view of this problem from probabilistic and deep learning perspectives, and describe a number of efforts in this direction that I have recently been involved in.

Data for Good: Abstract

Jeannette M. Wing

I use the tagline “Data for Good” to state paronomastically how we as a community should be promoting data science, especially in training future generations of data scientists. First, we should use data science for the good of humanity and society. Data science should be used to better people’s lives. Data science should be used to improve relationships among people, organizations, and institutions. Data science, in collaboration with other disciplines, should be used to help tackle societal grand challenges such as climate change, education, energy, environment, healthcare, inequality, and social justice. Second, we should use data in a good manner. The acronym FATES suggests what “good” means. Fairness means that the models we build are used to make unbiased decisions or predictions. Accountability means to determine and assign responsibility (to someone or to something) for a judgment made by a machine. Transparency means being open and clear to the end user about how an outcome, e.g., a classification, a decision, or a prediction, is made. Ethics for data science means paying attention to both the ethical and privacy-preserving collection and use of data as well as the ethical decisions that the automated systems we build will make. Safety and security (yes, two words for one “S”) means ensuring that the systems we build are safe (do no harm) and secure (guard against malicious behavior).

SESSION: Applied Data Science Track Papers

ActiveRemediation: The Search for Lead Pipes in Flint, Michigan

Jacob Abernethy
Alex Chojnacki
Arya Farahi
Eric Schwartz
Jared Webb

We detail our ongoing work in Flint, Michigan to detect pipes made of lead and other hazardous metals. After elevated levels of lead were detected in residents’ drinking water, followed by an increase in blood lead levels in area children, the state and federal governments directed over $125 million to replace water service lines, the pipes connecting each home to the water system. In the absence of accurate records, and with the high cost of determining buried pipe materials, we put forth a number of predictive and procedural tools to aid in the search and removal of lead infrastructure. Alongside these statistical and machine learning approaches, we describe our interactions with government officials in recommending homes for both inspection and replacement, with a focus on the statistical model that adapts to incoming information. Finally, in light of discussions about increased spending on infrastructure development by the federal government, we explore how our approach generalizes beyond Flint to other municipalities nationwide.

Deploying Machine Learning Models for Public Policy: A Framework

Klaus Ackermann
Joe Walsh
Adolfo De Unánue
Hareem Naveed
Andrea Navarrete Rivera
Sun-Joo Lee
Jason Bennett
Michael Defoe
Crystal Cody
Lauren Haynes
Rayid Ghani

Machine learning research typically focuses on optimization and testing on a few criteria, but deployment in a public policy setting requires more. Technical and non-technical deployment issues get relatively little attention. However, for machine learning models to have real-world benefit and impact, effective deployment is crucial. In this case study, we describe our implementation of a machine learning early intervention system (EIS) for police officers in the Charlotte-Mecklenburg (North Carolina) and Metropolitan Nashville (Tennessee) Police Departments. The EIS identifies officers at high risk of having an adverse incident, such as an unjustified use of force or sustained complaint. We deployed the same code base at both departments, which have different underlying data sources and data structures. Deployment required us to solve several new problems, covering technical implementation, governance of the system, the cost to use the system, and trust in the system. In this paper we describe how we addressed and solved several of these challenges and provide guidance and a framework of important issues to consider for future deployments.

Online Parameter Selection for Web-based Ranking Problems

Deepak Agarwal
Kinjal Basu
Souvik Ghosh
Ying Xuan
Yang Yang
Liang Zhang

Web-based ranking problems involve ordering different kinds of items in a list or grid to be displayed in mediums like a website or a mobile app. In most cases, there are multiple objectives or metrics like clicks, viral actions, job applications, advertising revenue and others that we want to balance. Constructing a serving algorithm that achieves the desired tradeoff among multiple objectives is challenging, especially for more than two objectives. In addition, it is often not possible to estimate such a serving scheme using offline data alone for non-stationary systems with frequent online interventions. We consider a large-scale online application where metrics for multiple objectives are continuously available and can be controlled in a desired fashion by changing certain control parameters in the ranking model. We assume that the desired balance of metrics is known from business considerations. Our approach models the balance criteria as a composite utility function via a Gaussian process over the space of control parameters. We show that obtaining a solution can be equated to finding the maximum of the Gaussian process, practically obtainable via Bayesian optimization. However, implementing such a scheme for large-scale applications is challenging. We provide a novel framework to do so and illustrate its efficacy in the context of LinkedIn Feed. In particular, we show the effectiveness of our method by using both offline simulations as well as promising online A/B testing results. At the time of writing this paper, the method described was fully deployed on the LinkedIn Feed.

Predicting Estimated Time of Arrival for Commercial Flights

Samet Ayhan
Pablo Costas
Hanan Samet

Unprecedented growth is expected globally in commercial air traffic over the next ten years. To accommodate this increase in volume, a new concept of operations has been implemented in the context of the Next Generation Air Transportation System (NextGen) in the USA and the Single European Sky ATM Research (SESAR) in Europe. However, both systems approach airspace capacity and efficiency deterministically, failing to account for external operational circumstances which can directly affect the aircraft’s actual flight profile. A major factor in increased airspace efficiency and capacity is accurate prediction of Estimated Time of Arrival (ETA) for commercial flights, which can be a challenging task due to the non-deterministic nature of environmental factors and air traffic. Inaccurate prediction of ETA can cause potential safety risks and loss of resources for Air Navigation Service Providers (ANSP), airlines and passengers. In this paper, we present a novel ETA Prediction System for commercial flights. The system learns from historical trajectories and uses their pertinent 3D grid points to collect key features such as weather parameters, air traffic, and airport data along the potential flight path. The features are fed into various regression models and a Recurrent Neural Network (RNN), and the best performing models with the most accurate ETA predictions are compared with the ETAs currently in operational use by the European ANSP, EUROCONTROL. Evaluations on an extensive set of real trajectory, weather, and airport data in Europe verify that our prediction system generates more accurate ETAs with a far smaller standard deviation than those of EUROCONTROL. This translates to smaller prediction windows of flight arrival times, thereby enabling airlines to make more cost-effective ground resource allocation and ANSPs to make more efficient flight schedules.

Interpretable Representation Learning for Healthcare via Capturing Disease Progression through Time

Tian Bai
Shanshan Zhang
Brian L. Egleston
Slobodan Vucetic

Various deep learning models have recently been applied to predictive modeling of Electronic Health Records (EHR). In medical claims data, which is a particular type of EHR data, each patient is represented as a sequence of temporally ordered, irregularly sampled visits to health providers, where each visit is recorded as an unordered set of medical codes specifying the patient’s diagnosis and the treatment provided during the visit. Based on the observation that different patient conditions have different temporal progression patterns, in this paper we propose a novel interpretable deep learning model, called Timeline. The main novelty of Timeline is that it has a mechanism that learns time decay factors for every medical code. This allows Timeline to learn that chronic conditions have a longer lasting impact on future visits than acute conditions. Timeline also has an attention mechanism that improves vector embeddings of visits. By analyzing the attention weights and disease progression functions of Timeline, it is possible to interpret the predictions and understand how risks of future visits change over time. We evaluated Timeline on two large-scale real world data sets. The specific task was to predict the primary diagnosis category for the next hospital visit given previous visits. Our results show that Timeline has higher accuracy than the state-of-the-art deep learning models based on RNN. In addition, we demonstrate that the time decay factors and attentions learned by Timeline are in accord with medical knowledge and that Timeline can provide useful insight into its predictions.
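
As a rough illustration of what a per-code time decay factor does, the sketch below weights each medical code by an exponential function of the time since the visit. The exponential form, the code names, and the rate values are assumptions made for illustration only; the abstract does not give the model's actual decay function, and in the model these factors are learned rather than set by hand.

```python
import math

# Hypothetical per-code decay rates (per day). In a trained model these
# would be learned; the values here are invented for illustration.
decay_rate = {"diabetes": 0.001, "flu": 0.1}

def code_weight(code, days_since_visit):
    """Weight of a past medical code, assuming simple exponential decay."""
    return math.exp(-decay_rate[code] * days_since_visit)

# After 90 days, the slowly decaying (chronic) code still carries most of
# its weight, while the quickly decaying (acute) code has almost vanished.
w_chronic = code_weight("diabetes", 90)
w_acute = code_weight("flu", 90)
```

A small decay rate keeps a code influential for a long time, which is the behaviour the abstract attributes to chronic conditions.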

Scalable Query N-Gram Embedding for Improving Matching and Relevance in Sponsored Search

Xiao Bai
Erik Ordentlich
Yuanyuan Zhang
Andy Feng
Adwait Ratnaparkhi
Reena Somvanshi
Aldi Tjahjadi

Sponsored search has been the major source of revenue for commercial web search engines. It is crucial for a sponsored search engine to retrieve ads that are relevant to user queries to attract clicks, as advertisers only pay when their ads get clicked. Retrieving relevant ads for a query typically involves first matching related ads to the query and then filtering out irrelevant ones. Both require understanding the semantic relationship between a query and an ad. In this work, we propose a novel embedding of queries and ads in sponsored search. The query embeddings are generated from constituent word n-gram embeddings that are trained to optimize an event level word2vec objective over a large volume of search data. We show through a query rewriting task that the proposed query n-gram embedding model outperforms the state-of-the-art word embedding models for capturing query semantics. This allows us to apply the proposed query n-gram embedding model to improve query-ad matching and relevance in sponsored search. First, we use the similarity between a query and an ad derived from the query n-gram embeddings as an additional feature in the query-ad relevance model used in Yahoo Search. We show through an online A/B test that using the new relevance model to filter irrelevant ads offline leads to a 0.47% CTR increase and a 0.32% revenue increase. Second, we propose a novel online query to ads matching system, built on an open-source big-data serving engine [30], using the learned query n-gram embeddings. An online A/B test shows that the new matching technique increases the search revenue by 2.32% as it significantly increases the ad coverage for tail queries.
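
The idea of building a query vector from its constituent word n-grams and comparing it with an ad vector can be sketched as follows. This is a minimal illustration, not the paper's method: summing the n-gram vectors, using cosine similarity, and the tiny two-dimensional embedding table are all assumptions.

```python
import math

def ngrams(tokens, max_n=2):
    """All word n-grams of length 1..max_n, shorter n-grams first."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def embed_query(query, table, dim):
    """Sum the vectors of the query's n-grams (assumed composition rule)."""
    vec = [0.0] * dim
    for gram in ngrams(query.split()):
        # N-grams missing from the table contribute nothing.
        for j, x in enumerate(table.get(gram, [0.0] * dim)):
            vec[j] += x
    return vec

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 2-dimensional embeddings, invented for this example.
table = {"running": [1.0, 0.0], "shoes": [0.0, 1.0], "running shoes": [1.0, 1.0]}
query_vec = embed_query("running shoes", table, dim=2)
ad_vec = [1.0, 1.0]  # hypothetical ad embedding
similarity = cosine(query_vec, ad_vec)
```

A similarity score of this kind could then be fed to a relevance model as one feature among others, as the abstract describes.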

Buy It Again: Modeling Repeat Purchase Recommendations

Rahul Bhagat
Srevatsan Muralidharan
Alex Lobzhanidze
Shankar Vishwanath

Repeat purchasing, i.e., a customer purchasing the same product multiple times, is a common phenomenon in retail. As more customers start purchasing consumable products (e.g., toothpastes, diapers, etc.) online, this phenomenon has also become prevalent in e-commerce. However, in January 2014, when we looked at popular e-commerce websites, we did not find any customer-facing features that recommended products to customers from their purchase history to promote repeat purchasing. Also, we found limited research about repeat purchase recommendations and none that deals with the large scale purchase data that e-commerce websites collect. In this paper, we present the approach we developed for modeling repeat purchase recommendations. This work has demonstrated over 7% increase in the product click through rate on the personalized recommendations page of the Amazon.com website and has resulted in the launch of several customer-facing features on the Amazon.com website, the Amazon mobile app, and other Amazon websites.

Rosetta: Large Scale System for Text Detection and Recognition in Images

Fedor Borisyuk
Albert Gordo
Viswanath Sivakumar

In this paper we present a deployed, scalable optical character recognition (OCR) system, which we call Rosetta, designed to process images uploaded daily at Facebook scale. Sharing of image content has become one of the primary ways to communicate information among internet users within social networks such as Facebook, and the understanding of such media, including its textual information, is of paramount importance to facilitate search and recommendation applications. We present modeling techniques for efficient detection and recognition of text in images and describe Rosetta’s system architecture. We perform extensive evaluation of presented technologies, explain useful practical approaches to build an OCR system at scale, and provide insightful intuitions as to why and how certain components work based on the lessons learnt during the development and deployment of the system.

Product Characterisation towards Personalisation: Learning Attributes from Unstructured Data to Recommend Fashion Products

Ângelo Cardoso
Fabio Daolio
Saúl Vargas

We describe a solution to tackle a common set of challenges in e-commerce, which arise from the fact that new products are continually being added to the catalogue. The challenges involve properly personalising the customer experience, forecasting demand and planning the product range. We argue that the foundational piece to solve all of these problems is having consistent and detailed information about each product, which is rarely available or consistent given the multitude of suppliers and types of products. We describe in detail the architecture and methodology implemented at ASOS, one of the world’s largest fashion e-commerce retailers, to tackle this problem. We then show how this quantitative understanding of the products can be leveraged to improve recommendations in a hybrid recommender system approach.

Rotation-blended CNNs on a New Open Dataset for Tropical Cyclone Image-to-intensity Regression

Boyo Chen
Buo-Fu Chen
Hsuan-Tien Lin

A tropical cyclone (TC) is a type of severe weather system that occurs in tropical regions. Accurate estimation of TC intensity is crucial for disaster management. Moreover, the intensity estimation task is the key to understanding and forecasting the behavior of TCs better. Recently, the task has begun to attract attention from not only meteorologists but also data scientists. Nevertheless, it is hard to stimulate joint research between both types of scholars without a benchmark dataset to work on together. In this work, we release such a benchmark dataset, which is a new open dataset collected from satellite remote sensing, for the TC-image-to-intensity estimation task. We also propose a novel model to solve this task based on the convolutional neural network (CNN). We discover that the usual CNN, which is mature for object recognition, requires several modifications when used for the intensity estimation task. Furthermore, we incorporate the domain knowledge of meteorologists, such as the rotation-invariance of TCs, into our model design to reach better performance. Experimental results on the released benchmark dataset verify that the proposed model is among the most accurate models that can be used for TC intensity estimation, while being relatively more stable across all situations. The results demonstrate the potential of applying data science to meteorology study.

Distributed Collaborative Hashing and Its Applications in Ant Financial

Chaochao Chen
Ziqi Liu
Peilin Zhao
Longfei Li
Jun Zhou
Xiaolong Li

Collaborative filtering, especially the latent factor model, has been widely used in personalized recommendation. The latent factor model aims to learn user and item latent factors from historical user-item behaviors. To apply it to real big data scenarios, efficiency becomes the first concern, including offline model training efficiency and online recommendation efficiency. In this paper, we propose a Distributed Collaborative Hashing (DCH) model which can significantly improve both efficiencies. Specifically, we first propose a distributed learning framework, following the state-of-the-art parameter server paradigm, to learn the offline collaborative model. Our model can be learnt efficiently by computing subgradients in minibatches on workers in a distributed manner and updating model parameters on servers asynchronously. We then adopt a hashing technique to speed up the online recommendation procedure. Recommendations can be made quickly by exploiting lookup hash tables. We conduct thorough experiments on two real large-scale datasets. The experimental results demonstrate that, compared with the classic and state-of-the-art (distributed) latent factor models, DCH has comparable performance in terms of recommendation accuracy but offers both fast convergence in offline model training and real-time efficiency in online recommendation. Furthermore, the encouraging performance of DCH is also shown for several real-world applications in Ant Financial.
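
To see why hashing can speed up online recommendation, consider the sketch below, which turns real-valued latent factors into short binary codes and ranks items by Hamming distance to the user's code. Binarizing by sign is an assumption chosen for simplicity (the abstract does not specify how DCH obtains its hash codes), and the user and item vectors are invented.

```python
def to_binary_code(factors):
    """Binarize a latent vector by sign and pack the bits into an int."""
    code = 0
    for value in factors:
        code = (code << 1) | (1 if value >= 0 else 0)
    return code

def hamming_distance(a, b):
    """Number of differing bits between two packed binary codes."""
    return bin(a ^ b).count("1")

# Hypothetical latent factors for one user and three items.
user = [0.8, -0.2, 0.5, -0.9]
items = {
    "item_a": [0.7, -0.1, 0.4, -0.3],
    "item_b": [-0.6, 0.9, -0.2, 0.1],
    "item_c": [0.3, 0.2, 0.6, -0.5],
}

user_code = to_binary_code(user)
# Rank items by how few bits differ from the user's code.
ranked = sorted(
    items,
    key=lambda name: hamming_distance(user_code, to_binary_code(items[name])),
)
```

Comparing packed codes needs only an XOR and a bit count per item, and identical codes can be grouped into hash-table buckets for direct lookup.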

MIX: Multi-Channel Information Crossing for Text Matching

Haolan Chen
Fred X. Han
Di Niu
Dong Liu
Kunfeng Lai
Chenglin Wu
Yu Xu

Short text matching plays an important role in many natural language processing tasks such as information retrieval, question answering, and conversational systems. Conventional text matching methods rely on predefined templates and rules, which are not well suited to short texts with a limited number of words and which generalize poorly to unobserved data. Many recent efforts have been made to apply deep neural network models to natural language processing tasks, which reduces the cost of feature engineering. In this paper, we present the design of Multi-Channel Information Crossing (MIX), a multi-channel convolutional neural network model for text matching, with additional attention mechanisms from sentence and text semantics. MIX compares text snippets at varied granularities to form a series of multi-channel similarity matrices, which are crossed with another set of carefully designed attention matrices to expose the rich structures of sentences to deep neural networks. We implemented MIX and deployed the system on Tencent’s Venus distributed computation platform. Thanks to carefully engineered multi-channel information crossing, evaluation results suggest that MIX outperforms a wide range of state-of-the-art deep neural network models by at least 11.1% in terms of the normalized discounted cumulative gain (NDCG) on the English WikiQA dataset. Moreover, we also performed online A/B tests with real users on the search service of Tencent QQ Browser. Results suggest that MIX raised the number of clicks on the returned results by 5.7%, due to an increased accuracy in query-document matching, which demonstrates the superior performance of MIX in production environments.
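
For readers unfamiliar with the metric, normalized discounted cumulative gain can be computed as in the following minimal sketch. It uses the common linear-gain, log2-discount definition; the cutoff k and the relevance labels are illustrative assumptions, not values from the paper.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevances."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order a model ranked the
# candidate answers (1 = correct answer, 0 = incorrect).
ranked_relevances = [0, 1, 0]
score = ndcg_at_k(ranked_relevances, k=3)
```

A ranking that places the correct answer first scores 1.0; placing it lower is penalized by the logarithmic discount.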

How LinkedIn Economic Graph Bonds Information and Product: Applications in LinkedIn Salary

Xi Chen
Yiqun Liu
Liang Zhang
Krishnaram Kenthapadi

The LinkedIn Salary product was launched in late 2016 with the goal of providing insights on compensation distribution to job seekers, so that they can make more informed decisions when discovering and assessing career opportunities. The compensation insights are provided based on data collected from LinkedIn members and aggregated in a privacy-preserving manner. Given the simultaneous desire for computing robust, reliable insights and for having insights to satisfy as many job seekers as possible, a key challenge is to reliably infer the insights at the company level when there is limited or no data at all. We propose a two-step framework that utilizes a novel, semantic representation of companies (Company2vec) and a Bayesian statistical model to address this problem. Our approach makes use of the rich information present in the LinkedIn Economic Graph, and in particular, uses the intuition that two companies are likely to be similar if employees are very likely to transition from one company to the other and vice versa. We compute embeddings for companies by analyzing the LinkedIn members’ company transition data using machine learning algorithms, then compute pairwise similarities between companies based on these embeddings, and finally incorporate company similarities in the form of peer company groups as part of the proposed Bayesian statistical model to predict insights at the company level. We perform extensive validation using several different evaluation techniques, and show that we can significantly increase the coverage of insights while, in fact, even slightly improving the quality of the obtained insights. For example, we were able to compute salary insights for 35 times as many title-region-company combinations in the U.S. as compared to previous work, corresponding to 4.9 times as many monthly active users. Finally, we highlight the lessons learned from practical deployment of our system.

Scalable Optimization for Embedding Highly-Dynamic and Recency-Sensitive Data

Xumin Chen
Peng Cui
Lingling Yi
Shiqiang Yang

A dataset is highly dynamic and recency-sensitive when new data are generated in high volumes at high speed and carry higher priority for subsequent applications. Embedding, a popular research topic in recent years, aims to represent data in a low-dimensional vector space; it is widely used across data types and has many applications. Generating embeddings on such data at high speed is challenging because high dynamics and recency sensitivity must be handled together while remaining both effective and efficient. Popular embedding methods are usually time-consuming, and common optimization methods are limited because they may not have enough time to converge or cannot handle recency-sensitive sample weights. The problem remains open. In this paper, we propose a novel optimization method named Diffused Stochastic Gradient Descent (D-SGD) for such highly dynamic and recency-sensitive data. The idea is to assign recency-sensitive weights to different samples and to select samples according to their weights when calculating gradients. After the embedding of the selected sample is updated, the related samples are also updated through a diffusion strategy. We propose a Nested Segment Tree to bring the complexity of the recency-sensitive weighting and the diffusion strategy down to no more than that of the iteration step in practice. We also theoretically prove the convergence rate of D-SGD for independent data samples, and empirically demonstrate the efficacy of D-SGD on large-scale real datasets.
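
The step of selecting samples according to recency-sensitive weights can be illustrated with the sketch below. The exponential half-life weighting and the timestamps are assumptions made for illustration, and `random.choices` is a simple stand-in for weighted sampling; it is not the Nested Segment Tree described in the abstract.

```python
import math
import random

def recency_weights(timestamps, now, half_life):
    """Exponentially decayed weight per sample: newer samples weigh more.

    A sample that is `half_life` time units old gets weight 0.5.
    """
    return [math.exp(-math.log(2) * (now - t) / half_life) for t in timestamps]

# Hypothetical arrival times of four samples, oldest first.
timestamps = [0.0, 5.0, 9.0, 10.0]
weights = recency_weights(timestamps, now=10.0, half_life=2.0)

# Draw sample indices for gradient steps in proportion to their weights.
rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = rng.choices(range(len(timestamps)), weights=weights, k=1000)
```

With these weights the newest sample is drawn far more often than the oldest, so recent data dominate the gradient updates.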

Q&R: A Two-Stage Approach toward Interactive Recommendation

Konstantina Christakopoulou
Alex Beutel
Rui Li
Sagar Jain
Ed H. Chi

Recommendation systems, prevalent in many applications, aim to surface to users the right content at the right time. Recently, researchers have aspired to develop conversational systems that offer seamless interactions with users, more effectively eliciting user preferences and offering better recommendations. Taking a step towards this goal, this paper explores the two stages of a single round of conversation with a user: which question to ask the user, and how to use their feedback to respond with a more accurate recommendation. Following these two stages, first, we detail an RNN-based model for generating topics a user might be interested in, and then extend a state-of-the-art RNN-based video recommender to incorporate the user’s selected topic. We describe our proposed system Q&R, i.e., Question & Recommendation, and the surrogate tasks we utilize to bootstrap data for training our models. We evaluate different components of Q&R on live traffic in various applications within YouTube: User Onboarding, Homepage Recommendation, and Notifications. Our results demonstrate that our approach improves upon state-of-the-art recommendation models, including RNNs, and makes these applications more useful, such as a >1% increase in video notifications opened. Further, our design choices can be useful to practitioners wanting to transition to more conversational recommendation systems.

Detection of Apathy in Alzheimer Patients by Analysing Visual Scanning Behaviour with RNNs

Jonathan Chung
Sarah A. Chau
Nathan Herrmann
Krista L. Lanctôt
Moshe Eizenman

Assessment of apathy in patients with Alzheimer’s disease (AD) relies heavily on interviews with caregivers and patients, which can be ambiguous and time consuming. More precise and objective methods of evaluation can better inform treatment decisions. In this study, visual scanning behaviours (VSBs) on emotional and non-emotional stimuli were used to detect apathy in patients with AD. Forty-eight AD patients participated in the study. Sixteen of the patients were apathetic. Patients looked at 48 slides with non-emotional images and 32 slides with emotional images. We describe two methods that use recurrent neural networks (RNNs) to learn differences between the VSBs of apathetic and non-apathetic AD patients. Method 1 uses two separate RNNs to learn group differences between visual scanning sequences on emotional and non-emotional stimuli. The outputs of the RNNs are then combined and used by a logistic regression classifier to characterise patients as either apathetic or non-apathetic. Method 1 achieved an AUC gain of 0.074 compared to a previously presented handcrafted feature method of detecting emotional blunting (AUC handcrafted = 0.646). Method 2 assumes that each individual’s “style of scanning” (stereotypical eye movements) is independent of the content of the visual stimuli and uses the “style of scanning” to normalise the individual’s VSBs on emotional and non-emotional stimuli. Method 2 uses RNNs in a sequence-to-sequence configuration to learn the individual’s “style of scanning”. The trained model is then used to create vector representations that contain information on the individual’s “style of scanning” (content independent) and her/his VSBs (content dependent) on emotional and non-emotional stimuli. The distance between these vector representations is used by a logistic regression classifier to characterise patients as either apathetic or non-apathetic. Using Method 2 the AUC of the classifier improved to 0.814.
The results presented suggest that using RNNs to analyse differences between VSBs on emotional and non-emotional stimuli (a measure of emotional blunting) can improve objective detection of apathy in individual patients with AD.

Assessing Candidate Preference through Web Browsing History

Giovanni Comarela
Ramakrishnan Durairajan
Paul Barford
Dino Christenson
Mark Crovella

Predicting election outcomes is of considerable interest to candidates, political scientists, and the public at large. We propose the use of Web browsing history as a new indicator of candidate preference among the electorate, one that has potential to overcome a number of the drawbacks of election polls. However, there are a number of challenges that must be overcome to effectively use Web browsing for assessing candidate preference - including the lack of suitable ground truth data and the heterogeneity of user populations in time and space. We address these challenges, and show that the resulting methods can shed considerable light on the dynamics of voters’ candidate preferences in ways that are difficult to achieve using polls.

Pangloss: Fast Entity Linking in Noisy Text Environments

Michael Conover
Matthew Hayes
Scott Blackburn
Pete Skomoroch
Sam Shah

Entity linking is the task of mapping potentially ambiguous terms in text to their constituent entities in a knowledge base like Wikipedia. This is useful for organizing content, extracting structured data from textual documents, and in machine learning relevance applications like semantic search, knowledge graph construction, and question answering. Traditionally, this work has focused on text that has been well-formed, like news articles, but in common real world datasets such as messaging, resumes, or short-form social media, non-grammatical, loosely-structured text adds a new dimension to this problem. This paper presents Pangloss, a production system for entity disambiguation on noisy text. Pangloss combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine based on context-dependent document embeddings to achieve better than state-of-the-art results (>5% in F1) compared to other research or commercially available systems. In addition, Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata, which allows rapid disambiguation in streaming contexts and on-device disambiguation in low-memory environments such as mobile phones.

State Space Models for Forecasting Water Quality Variables: An Application in Aquaculture Prawn Farming

Joel Janek Dabrowski
Ashfaqur Rahman
Andrew George
Stuart Arnold
John McCulloch

A novel approach to deterministic modelling of diurnal water quality parameters in aquaculture prawn ponds is presented. The purpose is to provide assistance to prawn pond farmers in monitoring pond water quality with limited data. Obtaining sufficient water quality data is generally a challenge in commercial prawn farming applications. Farmers can sustain large losses in their crop if water quality is not well managed. The model presented provides a means for modelling and forecasting various water quality parameters. It is inspired by data dynamics and does not rely on physical ecosystem modelling. The model is constructed within the Bayesian filtering framework. The Kalman filter and the unscented Kalman filter are applied for inference. The results demonstrate generalisability to both variables and environments. Short-term forecasting with mean absolute percentage errors between 0.5% and 11% is demonstrated.
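
As a rough illustration of the filtering idea only, the sketch below runs a scalar Kalman filter on a random-walk state and computes a mean absolute percentage error. It is not the authors' state space model: the random-walk dynamics, the function names, and the use of `None` for a missing sensor reading are all assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple


def kalman_filter_1d(
    observations: Sequence[Optional[float]],
    process_var: float,
    obs_var: float,
    init_mean: float,
    init_var: float,
) -> List[Tuple[float, float]]:
    """Scalar Kalman filter for a random-walk state (an assumed toy model).

    Returns the filtered (mean, variance) after each time step.
    A None observation is treated as missing: only the predict step runs.
    """
    mean, var = init_mean, init_var
    filtered = []
    for z in observations:
        # Predict: a random walk keeps the mean and inflates the variance.
        var = var + process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            gain = var / (var + obs_var)
            mean = mean + gain * (z - mean)
            var = (1.0 - gain) * var
        filtered.append((mean, var))
    return filtered


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)
```

With a missing reading the filter simply carries its prediction forward, which is one simple way a filtering approach can cope with sparse measurements.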

Automated Audience Segmentation Using Reputation Signals

Maria Daltayanni
Ali Dasdan
Luca de Alfaro

Selecting the right audience for an advertising campaign is one of the most challenging, time-consuming and costly steps in the advertising process. To target the right audience, advertisers usually have two options: a) market research to identify user segments of interest and b) sophisticated machine learning models trained on data from past campaigns. In this paper we study how demand-side platforms (DSPs) can leverage the data they collect (demographic and behavioral) in order to learn reputation signals about end user convertibility and advertisement (ad) quality. In particular, we propose a reputation system which learns interest scores about end users, as an additional signal of ad conversion, and quality scores about ads, as a signal of campaign success. Then our model builds user segments based on a combination of demographic, behavioral and the new reputation signals and recommends transparent targeting rules that are easy for the advertiser to interpret and refine. We perform an experimental evaluation on industry data that showcases the benefits of our approach for both new and existing advertiser campaigns.

SHIELD: Fast, Practical Defense and Vaccination for Deep Learning using JPEG Compression

Nilaksh Das
Madhuri Shanbhogue
Shang-Tse Chen
Fred Hohman
Siwei Li
Li Chen
Michael E. Kounavis
Duen Horng Chau

The rapidly growing body of research in adversarial machine learning has demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarially generated images. This underscores the urgent need for practical defense techniques that can be readily deployed to combat attacks in real-time. Observing that many attack strategies aim to perturb image pixels in ways that are visually imperceptible, we place JPEG compression at the core of our proposed SHIELD defense framework, utilizing its capability to effectively “compress away” such pixel manipulation. To immunize a DNN model from artifacts introduced by compression, SHIELD “vaccinates” the model by retraining it with compressed images, where different compression levels are applied to generate multiple vaccinated models that are ultimately used together in an ensemble defense. On top of that, SHIELD adds an additional layer of protection by employing randomization at test time that compresses different regions of an image using random compression levels, making it harder for an adversary to estimate the transformation performed. This novel combination of vaccination, ensembling, and randomization makes SHIELD a fortified multi-pronged defense. We conducted extensive, large-scale experiments using the ImageNet dataset, and show that our approaches eliminate up to 98% of gray-box attacks delivered by strong adversarial techniques such as Carlini-Wagner’s L2 attack and DeepFool. Our approaches are fast and work without requiring knowledge about the model.

Adaptive Paywall Mechanism for Digital News Media

Heidar Davoudi
Aijun An
Morteza Zihayat
Gordon Edall

Many online news agencies utilize the paywall mechanism to increase reader subscriptions. This method offers a non-subscribed reader a fixed number of free articles in a period of time (e.g., a month), and then directs the user to the subscription page for further reading. We argue that there is no direct relationship between the number of paywalls presented to readers and the number of subscriptions, and that this artificial barrier, if not used well, may disengage potential subscribers and thus may not well serve its purpose of increasing revenue. Moreover, the current paywall mechanism neither considers the user browsing history nor the potential articles which the user may visit in the future. Thus, it treats all readers equally and does not consider the potential of a reader in becoming a subscriber. In this paper, we propose an adaptive paywall mechanism to balance the benefit of showing an article against that of displaying the paywall (i.e., terminating the session). We first define the notion of cost and utility that are used to define an objective function for optimal paywall decision making. Then, we model the problem as a stochastic sequential decision process. Finally, we propose an efficient policy function for paywall decision making. The experimental results on a real dataset from a major newspaper in Canada show that the proposed model outperforms the traditional paywall mechanism as well as the other baselines.

Tax Fraud Detection for Under-Reporting Declarations Using an Unsupervised Machine Learning Approach

Daniel de Roux
Boris Perez
Andrés Moreno
Maria del Pilar Villamil
César Figueroa

Tax fraud is the intentional act of lying on a tax return form with intent to lower one’s tax liability. Under-reporting is one of the most common types of tax fraud; it consists of filling out a tax return form with a lesser tax base. As a result of this act, fiscal revenues are reduced, undermining public investment.

Detecting tax fraud is one of the main priorities of local tax authorities which are required to develop cost-efficient strategies to tackle this problem. Most of the recent works in tax fraud detection are based on supervised machine learning techniques that make use of labeled or audit-assisted data. Regrettably, auditing tax declarations is a slow and costly process, therefore access to labeled historical information is extremely limited. For this reason, the applicability of supervised machine learning techniques for tax fraud detection is severely hindered.

Such limitations motivate the contribution of this work. We present a novel approach for the detection of potentially fraudulent taxpayers using only unsupervised learning techniques, while allowing the future use of supervised learning techniques. We demonstrate the ability of our model to identify under-reporting taxpayers on real tax payment declarations, reducing the number of potentially fraudulent taxpayers to audit. The results demonstrate that our model does not miss declarations that should be marked as suspicious, and that it labels previously undetected tax declarations as suspicious, increasing the operational efficiency of the tax supervision process without needing historical labeled data.

Automatic Discovery of Tactics in Spatio-Temporal Soccer Match Data

Tom Decroos
Jan Van Haaren
Jesse Davis

Sports teams are nowadays collecting huge amounts of data from training sessions and matches. The teams are becoming increasingly interested in exploiting these data to gain a competitive advantage over their competitors. One of the most prevalent types of new data is event stream data from matches. These data enable more advanced descriptive analysis as well as the potential to investigate an opponent’s tactics in greater depth. Due to the complexity of both the data and game strategy, most tactical analyses are currently performed by humans reviewing video and scouting matches in person. As a result, this is a time-consuming and tedious process. This paper explores the problem of automatic tactics detection from event-stream data collected from professional soccer matches. We highlight several important challenges that these data and this problem setting pose. We describe a data-driven approach for identifying patterns of movement that account for both spatial and temporal information which represent potential offensive tactics. We evaluate our approach on the 2015/2016 season of the English Premier League and are able to identify interesting strategies per team related to goal kicks, corners and set pieces.

Applying the Delta Method in Metric Analytics: A Practical Guide with Novel Ideas

Alex Deng
Ulf Knoblich
Jiannan Lu

During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly without extra model assumptions and solve big data problems via closed-form formulas using distributed algorithms at a fraction of the cost of simulation-based procedures like bootstrap. However, certain attributes of the metrics, such as their corresponding data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics by highlighting both its classic and novel applications.
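
To make the Delta method concrete, the sketch below applies its standard first-order form to a ratio of two sample means, for example total clicks divided by total page views when the data are collected per user. This is the textbook ratio case, not code from the paper; the function name and the per-user framing are assumptions for illustration.

```python
from typing import Sequence, Tuple


def ratio_delta_variance(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean(x) / mean(y) and its Delta-method variance.

    x[i] and y[i] are paired per-unit totals (e.g. clicks and page views
    for user i). The first-order approximation is
        Var(R) ~ (var_x - 2*R*cov_xy + R^2 * var_y) / (n * mean_y^2).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unbiased sample variances and covariance across units.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    ratio = mean_x / mean_y
    variance = (var_x - 2.0 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    return ratio, variance
```

The estimate needs only a handful of sums over the data, which is why closed-form approaches of this kind distribute easily and avoid the resampling cost of a bootstrap.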

Releasing eHealth Analytics into the Wild: Lessons Learnt from the SPHERE Project

Tom Diethe
Mike Holmes
Meelis Kull
Miquel Perello Nieto
Kacper Sokol
Hao Song
Emma Tonkin
Niall Twomey
Peter Flach

The SPHERE project is devoted to advancing eHealth in a smart-home context, and supports full-scale sensing and data analysis to enable a generic healthcare service. We describe, from a data-science perspective, our experience of taking the system out of the laboratory into more than thirty homes in Bristol, UK. We describe the infrastructure and processes that had to be developed along the way, describe how we train and deploy Machine Learning systems in this context, and give a realistic appraisal of the state of the deployed systems.

Gotcha - Sly Malware!: Scorpion A Metagraph2vec Based Malware Detection System

Yujie Fan
Shifu Hou
Yiming Zhang
Yanfang Ye
Melih Abdulhayoglu

Because malware causes severe damage and threatens the security of the Internet and computing devices, malware detection has held the attention of both the anti-malware industry and researchers for decades. To combat evolving malware attacks, in this paper we first study how to utilize both content- and relation-based features to characterize sly malware. To model different types of entities (i.e., file, archive, machine, API, DLL) and the rich semantic relationships among them (i.e., file-archive, file-machine, file-file, API-DLL, file-API relations), we then construct a structural heterogeneous information network (HIN) and present a meta-graph based approach to depict the relatedness over files. Because malware detection is a cost-sensitive task, measuring the relatedness over files on the constructed HIN calls for efficient methods to learn latent representations for the HIN. To address this challenge, based on the built meta-graph schemes, we propose a new HIN embedding model, metagraph2vec, as a first attempt to learn low-dimensional representations for the nodes in the HIN, where both the HIN structures and semantics are maximally preserved for malware detection. A comprehensive experimental study on real sample collections from Comodo Cloud Security Center is performed to compare various malware detection approaches. The promising experimental results demonstrate that our developed system Scorpion, which integrates our proposed method, outperforms alternative malware detection techniques. The developed system has already been incorporated into the scanning tool of the Comodo Antivirus product.

Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts.

Donatella Firmani
Marco Maiorino
Paolo Merialdo
Elena Nieddu

In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters, and statistical language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable, as the production of training sets requires only a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that can be used by paleographers as a solid basis to speed up the transcription process at a large scale.

Device Graphing by Example

Keith Funkhouser
Matthew Malloy
Enis Ceyhun Alp
Phillip Poon
Paul Barford

Datasets that organize and associate the many identifiers produced by PCs, smartphones, and tablets accessing the internet are referred to as internet device graphs. In this paper, we demonstrate how measurement, tracking, and other internet entities can associate multiple identifiers with a single device or user after coarse associations, e.g., based on IP-colocation, are made. We employ a Bayesian similarity algorithm that relies on examples of pairs of identifiers and their associated telemetry, including user agent, screen size, and domains visited, to establish pair-wise scores. Community detection algorithms are applied to group identifiers that belong to the same device or user. We train and validate our methodology using a unique dataset collected from a client panel with full visibility, apply it to a dataset of 700 million device identifiers collected over the course of six weeks in the United States, and show that it outperforms several unsupervised learning approaches. Results show mean precision and recall exceeding 90% for association of identifiers at both the device and user levels.
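
The two-stage pipeline described here (pair-wise scores, then grouping) can be sketched in a few lines. The example below is a simplified stand-in, not the authors' algorithm: it scores pairs with a naive-Bayes style log-likelihood ratio over feature agreement, using made-up match probabilities, and groups identifiers with connected components (union-find) in place of a real community detection algorithm. All names and numbers are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Set

# Hypothetical probabilities that a feature matches, given that two
# identifiers belong to the same device vs. different devices.
MATCH_PROBS = {"user_agent": (0.9, 0.1), "screen_size": (0.95, 0.3)}


def llr_score(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Sum of per-feature log-likelihood ratios (same vs. different device)."""
    score = 0.0
    for feature, (p_same, p_diff) in MATCH_PROBS.items():
        if a[feature] == b[feature]:
            score += math.log(p_same / p_diff)
        else:
            score += math.log((1.0 - p_same) / (1.0 - p_diff))
    return score


def group_identifiers(telemetry: Dict[str, Dict[str, str]], threshold: float = 0.0) -> List[Set[str]]:
    """Link pairs scoring above the threshold, then return connected components."""
    parent = {ident: ident for ident in telemetry}

    def find(i: str) -> str:
        # Walk to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(telemetry, 2):
        if llr_score(telemetry[a], telemetry[b]) > threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for ident in telemetry:
        groups.setdefault(find(ident), set()).add(ident)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

In practice the match probabilities would be estimated from labeled example pairs, such as those a panel with full visibility can provide, rather than fixed by hand as they are in this sketch.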

Near Real-time Optimization of Activity-based Notifications

Yan Gao
Viral Gupta
Jinyun Yan
Changji Shi
Zhongen Tao
PJ Xiao
Curtis Wang
Shipeng Yu
Romer Rosales
Ajith Muralidharan
Shaunak Chatterjee

In recent years, social media applications (e.g., Facebook, LinkedIn) have created mobile applications (apps) to give their members instant and real-time access from anywhere. To keep members informed and drive timely engagement, these mobile apps send event notifications. However, sending notifications for every possible event would result in too many notifications which would in turn annoy members and create a poor member experience.

In this paper, we present our strategy of optimizing notifications to balance various utilities (e.g., engagement, send volume) by formulating the problem using constrained optimization. To guarantee freshness of notifications, we implement the solution in a stream computing system in which we make multi-channel send decisions in near real-time. Through online A/B test results, we show the effectiveness of our proposed approach on tens of millions of members.

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Alex Gittens
Kai Rothauge
Shusen Wang
Michael W. Mahoney
Lisa Gerhardt
Prabhat
Jey Kottalam
Michael Ringenburg
Kristyn Maschhoff

Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations (in particular, many linear algebra computations that are the basis for solving common machine learning problems) are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI).

To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation.

We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case stud
