The objective of the Industry Practice Expo track is to bring together leading industry and government practitioners to share insights and experiences that will inspire the KDD community and spread awareness of the variety of seminal, innovative, and proven applications of data mining and knowledge discovery in industry and government. This new track will complement the already established Industry and Government track at KDD, which focuses on peer-reviewed publications.
Speaker: Richard Boire
Introduction
In many data mining exercises, we see information that appears on the surface to demonstrate a particular conclusion. But closer examination of the data reveals that these results are in fact misleading. In this session, we will examine this notion of misleading results in three areas:
- Statistical Issues
- Overstating of Results
- Overfitting
Statistical Issues
Statistical issues such as multicollinearity and outliers can impact results dramatically. We will first outline how these statistical issues can produce misleading results. We will then demonstrate how the data mining practitioner overcomes these issues through data analysis approaches that give the business community results that are both more meaningful and not misleading.
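As a minimal illustration of the multicollinearity issue, the sketch below computes the Pearson correlation between two predictors and the resulting variance inflation factor (VIF), which for a pair of predictors is 1 / (1 - r^2). The data and the names pearson_r and vif_two_predictors are hypothetical and are not taken from the session material.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when a model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictors for five customers.
tenure = [1, 2, 3, 4, 5]
spend = [2, 4, 6, 8, 11]     # almost a multiple of tenure
noise = [1, -1, 0, -1, 1]    # uncorrelated with tenure

vif_collinear = vif_two_predictors(tenure, spend)
vif_independent = vif_two_predictors(tenure, noise)
print(vif_collinear, vif_independent)
```

A common rule of thumb treats a VIF above roughly 10 as a warning sign. Here the nearly collinear pair lands far above that level, while the uncorrelated pair sits at the minimum value of 1.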
Overstating of Results
From a business standpoint, we will also look at results that appear to be too good to be true. In other words, there appears to be some overstating of results within a given data mining solution. First, we will discuss how to identify these situations. Second, we will outline what causes this overstatement of results and detail our approach to overcoming it.
Overfitting
Another topic for discussion is overfitting of results. This is particularly the case when building predictive models. In this section of the seminar, we will define what overfitting is and explain why it is becoming more important for the business community to understand. Once again, analytical approaches will be discussed in terms of how best to handle this issue.
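A small sketch can make the idea concrete. It assumes a toy data set in which the true rule is "respond if x >= 5" and two training labels have been flipped to simulate noise. A model that memorizes the training data (here a 1-nearest-neighbour lookup) scores perfectly on the data it was built on but worse on a holdout sample, while a simpler rule does the opposite. All names and numbers are hypothetical.

```python
# Training sample: true rule is y = 1 when x >= 5, with two noisy labels.
train = [(x, 1 if x >= 5 else 0) for x in range(10)]
train[2] = (2, 1)   # noise: a non-responder recorded as a responder
train[7] = (7, 0)   # noise: a responder recorded as a non-responder

# Holdout sample drawn from the same rule, without the noise.
holdout = [(x + 0.25, 1 if x + 0.25 >= 5 else 0) for x in range(10)]


def memorizer(x):
    """Overfit model: return the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def threshold(x):
    """Simple model: a single cut-off that ignores the noisy labels."""
    return 1 if x >= 5 else 0


def accuracy(model, data):
    """Share of records for which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold)]:
    print(name, accuracy(model, train), accuracy(model, holdout))
```

The gap between training and holdout accuracy is the symptom to look for, which is why performance should be checked on data that was not used to build the model.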
Case Studies
The first case study comes from the financial services area, where the organization was experiencing challenges in profitably upselling regular credit card customers to a gold card. In this case study, we demonstrate how our four-step approach is applied in arriving at a solution to this challenge. The four steps are as follows:
- How to identify the problem
- How we construct the right data environment to conduct our analytics
- What kind of analytics are employed, including techniques such as:
  - Correlation analysis
  - EDA reports
  - Logistic regression
  - Gains charts
- How we apply the learning to a future initiative and what the actual results were
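Of the techniques listed above, the gains chart is the simplest to sketch. The example below assumes that each customer already has a model score (for instance from a logistic regression) and a recorded response. It ranks customers by score and reports the cumulative share of responders captured after each bin. The function name cumulative_gains and the data are hypothetical and do not come from the case study.

```python
def cumulative_gains(scores, responded, n_bins=10):
    """Cumulative share of responders captured after each bin,
    with customers ranked from highest to lowest score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0], reverse=True)
    total = sum(flag for _, flag in ranked)
    if total == 0:
        raise ValueError("no responders in the data")
    n = len(ranked)
    gains = []
    captured = 0
    start = 0
    for b in range(1, n_bins + 1):
        end = (b * n) // n_bins          # index just past the end of bin b
        captured += sum(flag for _, flag in ranked[start:end])
        gains.append(captured / total)
        start = end
    return gains


# Hypothetical scores and upsell responses for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

gains = cumulative_gains(scores, responded, n_bins=5)
print(gains)
```

In this toy data the top 20% of customers by score contains two of the three responders. A model with no predictive power would be expected to capture only about 20% of responders in the same bin.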
Speaker Bio:
Richard Boire's experience in database marketing and predictive analytics dates back to 1983, when he received an MBA from Concordia University in Finance and Statistics. His initial experience at organizations such as Reader’s Digest and American Express allowed him to become a pioneer in the application of predictive modelling technology for all direct marketing programs. This extended to the introduction of models which targeted the acquisition of new customers based on return on investment. With this experience, Richard formed his own consulting company in 1994, now called the Boire Filler Group, a Canadian leader in offering analytical and database services to companies seeking solutions to their existing predictive analytics or database marketing challenges. Richard is a recognized authority on predictive analytics and is considered one of the top five experts in this field in Canada, with expertise and knowledge that are difficult, if not impossible, to replicate. This expertise has evolved into international speaking assignments and workshop seminars in the U.S., England, Eastern Europe, and Southeast Asia. Within Canada, he gives seminars on segmentation and predictive analytics for such organizations as the Canadian Marketing Association (CMA), Direct Marketing News, Direct Marketing Association Toronto, and the Association for Advanced Relationship Marketing (AARM). His written articles have appeared in numerous Canadian publications such as Direct Marketing News, Strategy Magazine, and Marketing Magazine. He has taught applied statistics, data mining and database marketing at a variety of institutions across Canada, including University of Toronto, George Brown College, and Seneca College. Richard is currently Chair of the CMA’s Customer Insight and Analytics Committee and sits on the CMA’s Board of Directors.
He has chaired numerous full-day conferences on behalf of the CMA, including the 2000 and 2002 Database and Technology Seminars and the first-ever Customer Profitability Conference in 2005. He has co-authored white papers on the following topics: ‘Best Practices in Data Mining’ and ‘Customer Profitability: The State of Evolution among Canadian Companies’.
Speaker: Tai Hsu
AliExpress is an online e-commerce platform for wholesale products. Credit cards are one of its payment methods. An online transaction using a credit card is called a "card not present" (CNP) transaction, because the physical card is not swiped through a reader. CNP fraud is also the major type of credit card fraud, and it imposes significant overhead on online operations, sellers, and buyers. To protect customers on our platform, we developed a real-time credit card fraud detection system using machine learning technologies, which allows us to achieve a precision of 97% at a recall of 80%. With the system, we can provide the best online shopping experience for our customers without the high risk of online transactions, which always results in a high operational cost. We will briefly share our experience and practice in the expo.
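For readers less familiar with these two metrics, the sketch below shows how precision and recall are computed from confusion-matrix counts. The counts are hypothetical, chosen only so that the arithmetic reproduces the 97% and 80% figures. They are not data from the AliExpress system.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of flagged transactions that are really fraud.
    Recall: share of all fraudulent transactions that were flagged."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 970 fraudulent transactions in total, 800 flagged.
true_pos = 776    # fraud that was flagged
false_pos = 24    # legitimate transactions flagged by mistake
false_neg = 194   # fraud that slipped through

precision, recall = precision_recall(true_pos, false_pos, false_neg)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

At these rates, 3 of every 100 flagged transactions would be legitimate customers, and 1 in 5 fraudulent transactions would go undetected. Moving the decision threshold trades one rate against the other.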
Speaker Bio:
Dr. Tai Hsu’s research and specialty areas cover algorithms, artificial intelligence, chemistry, computational biology, cybernetics, data mining, machine learning, robotics, and supercomputing. His work on online risk control at AliExpress significantly reduced online risk, to a level competitive with Cybersource’s. His work in machine 3D vision won the best paper award (European Meeting on Cybernetics and Systems Research, 2008 and 2006). His research in computational biology won the NLM award (National Library of Medicine, National Institutes of Health, 2001). His work in quantum chemistry won the best paper award (Journal of Chinese Chemical Society, Taiwan, 1988). He was named the distinguished employee (2001) of Providian Financial Corporation. He has worked at several Fortune 500 companies in the San Francisco Bay Area. He is currently the principal scientist of Alibaba, Inc., as well as the R&D director of Northwestern Polytechnic University. He holds a Ph.D. in CS from Oregon State University, an M.S. in CS from University of Missouri-Rolla, a B.A. in CS plus a minor in math from Wartburg College, and a B.S. in chemistry from National Taiwan University.
Speaker: Mario E. Inchiosa
In more and more industries, competitive advantage hinges on exploiting the largest quantity of data in the shortest possible time - and doing so cost-effectively. Data volumes are growing exponentially, while businesses are striving to deploy sophisticated and computationally intensive predictive analytics. Often, massive data is stored in a data warehouse running on dedicated parallel hardware, but advanced analytics is performed on a separate compute platform. Moving data from the data warehouse to the compute environment can constitute a significant bottleneck. Organizations resort to considering only a fraction of their data or refreshing their analyses infrequently. To address the data movement bottleneck and take full advantage of parallel data warehouse platforms, vendors are offering new in-database analytics capabilities. They are opening up their platforms, allowing users to run their own user-defined functions and statistical models as well as vendor- and partner-supplied advanced analytics on the database platform, close to the data, in parallel, without transporting the data through a host node or corporate network. In this talk, we will present the need for in-database analytics and discuss a number of the new solutions available, highlighting case studies where solution times have been reduced from hours to minutes or seconds.
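The pattern of moving the computation to the data can be sketched on a very small scale. The example below uses Python's built-in sqlite3 module only as a stand-in for a parallel data warehouse: it registers a user-defined aggregate with the database engine and lets the query return one summary row per group instead of exporting every row to a separate analytics environment. The table, the data, and the aggregate name pop_variance are hypothetical.

```python
import sqlite3


class PopVariance:
    """User-defined aggregate: population variance via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def step(self, value):
        # Called by the database engine once per row in the group.
        if value is None:
            return
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (value - self.mean)

    def finalize(self):
        # Called once per group to produce the aggregate's result.
        return self.m2 / self.n if self.n else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 2.0), ("A", 4.0), ("A", 6.0), ("B", 1.0), ("B", 1.0)],
)

# Register the aggregate so it can be called from SQL.
conn.create_aggregate("pop_variance", 1, PopVariance)

# Only one row per store leaves the database, not the raw transactions.
in_db = dict(
    conn.execute("SELECT store, pop_variance(amount) FROM sales GROUP BY store")
)
print(in_db)
```

SQLite runs in the same process as the caller, so this shows only the programming model. The performance benefit described in the talk comes from running such functions in parallel on the warehouse hardware, next to the stored data.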
Speaker Bio:
Dr. Mario E. Inchiosa is U.S. Chief Scientist at Netezza, an IBM Company, where he develops data-intensive high performance computing appliances. His work focuses in particular on the juncture of data warehousing and parallelized advanced analytics and optimization. Dr. Inchiosa received an A.B. in Physics from Harvard College and an A.M. and Ph.D. in Physics from Harvard University. At Harvard, he combined his dual interests in Physics and Computer Science by applying statistical physics to the study of neural network associative memories. He moved on to study the dynamics of neural network associative memories as a post-doc at the Technical University of Munich. Next, he joined SPAWAR Systems Center San Diego, specializing in stochastic non-linear dynamics, signal detection, Monte Carlo simulation, and high performance computing. He was awarded four patents as a result of his research, and he has published over 30 papers, earning Publication of the Year and Technical Publication Excellence awards. In 2001 Dr. Inchiosa joined BiosGroup, a Santa Fe Institute complexity science spin-off (subsequently NuTech Solutions), applying evolutionary algorithms and swarm-like agent based modeling to problems in business and government. He developed pipeline simulation and optimization engines, served as Principal Investigator researching general and geospatial reasoning under uncertainty, and used agent based models to study global market dynamics and co-evolutionary business strategy optimization. As NuTech’s Chief Science Officer, Dr. Inchiosa was involved with Netezza’s acquisition of NuTech as part of Netezza’s strategy to bring advanced analytics capabilities to data warehouse appliances.
Speaker: Colleen McCue
Why just count crime when you can anticipate, prevent and respond more effectively? Companies in the commercial sector have long understood the importance of being able to anticipate or predict future behavior and demand in order to respond efficiently and effectively. Embracing the promise of predictive analytics, the public safety community is moving from a focus on "what happened" to a system that can anticipate future events and effectively deploy resources in front of crime, thereby changing outcomes. While we have become familiar with the use of advanced analytics in support of fraud detection and prevention, techniques similar to those used to support customer loyalty programs and supply chain management have been used to prevent and solve violent crimes, enhance investigative pace and efficacy, support information-based risk and threat assessment, and deploy public safety resources more efficiently. As public safety agencies increasingly are asked to do more with less, the ability to anticipate crime represents a game-changing paradigm shift, enabling information-based tactics, strategy and policy in support of prevention and response. Reporting, collecting and compiling data are necessary but not sufficient for increasing public safety. Ultimately, the ability to anticipate, prevent and respond more effectively will enable us to do more with less and change public safety outcomes.
Speaker Bio:
Dr. Colleen McLaughlin McCue, GeoEye Analytics, brings over 18 years of experience in advanced analytics and the development of actionable solutions to complex information processing problems in the applied public safety and national security environment. Her areas of expertise include the application of data mining and predictive analytics to the analysis of crime and intelligence data, with particular emphasis on deployment strategies, surveillance detection, threat and vulnerability assessment, fraud detection, geospatial predictive analytics, and the behavioral analysis of violent crime. Dr. McCue's experience in the applied law enforcement setting and pioneering work in operationally relevant analytical strategies have been used to support a wide array of national security and public safety clients. Dr. McCue has published her research findings in journals and book chapters, and has authored a book on the use of advanced analytics in the applied public safety environment entitled Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis. Dr. McCue earned her undergraduate degree from the University of Illinois at Chicago and her doctorate in Psychology from Dartmouth College. She completed a five-year postdoctoral fellowship in the Department of Pharmacology & Toxicology at the Medical College of Virginia at the Virginia Commonwealth University.
Speaker: David Norton
Caesars Entertainment, the largest provider of branded casino entertainment, captures a wealth of data for 40 million+ customers through its Total Rewards program. In-depth data analysis has helped Caesars weather the economic downturn by prioritizing marketing spend and expense savings targets and by identifying new revenue opportunities. This talk will describe how closed-loop marketing, state-of-the-art user segmentation, and ongoing experimentation via test and control groups have enabled Caesars Entertainment to achieve all-time high customer satisfaction scores and outperform the competition in a challenging economic climate. The lessons learned are generic and apply across multiple industries. Insights will also be provided on the next wave of challenges to be answered analytically.
Speaker Bio:
David Norton is the Senior Vice President and Chief Marketing Officer at Caesars Entertainment, which operates more than 40 casinos nationwide and 10 others worldwide and has been recognized for its outstanding marketing practices by the Wall Street Journal, Info Week and CIO Magazine. The company’s brands include Caesars, Horseshoe, Harrah’s, World Series of Poker, Paris, Flamingo and several others. Norton is responsible for the company’s direct marketing strategy, Brand Management, Promotions, Alliances, Research, VIP marketing, revenue management, the Total Rewards loyalty program, Internet marketing, multi-cultural marketing, mobile initiatives, Retail, Entertainment, Sales and Travel Services. Prior to joining Harrah’s (now Caesars Entertainment) in October of 1998, Norton worked in the credit card industry with American Express, Household International and MBNA. He has a B.S. in Finance from Boston College, an MBA from Loyola College and a Master’s in Management of Technology from the University of Pennsylvania and the Wharton School.
Speaker: Paul Rejto
Biased and unbiased approaches to developing predictive biomarkers of response to drug treatment will be introduced and their utility demonstrated for cell cycle inhibitors. Opportunities to leverage the growing knowledge of tumors characterized by modern methods to measure DNA and RNA will be shown, including the use of appropriate preclinical models and selection of patients. Furthermore, techniques to identify mechanisms of resistance prior to clinical treatment will be discussed. Prospects for systematic data mining and current barriers to the application of precision medicine in cancer will be reviewed along with potential solutions.
Speaker Bio:
Dr. Rejto is Director of Computational Biology, Oncology Research Unit, Pfizer La Jolla. His research interests include computational methods to support target discovery and validation, animal models, patient selection, resistance modeling and combination therapy. Paul trained in physical and theoretical chemistry (Harvard A.B. magna cum laude; Stanford Ph.D.; UC Berkeley post-doc) and joined Pfizer La Jolla (Agouron) in 1994. During his career, Paul has developed and applied tools for structure-based drug design, led a team that progressed compounds for the treatment of diabetes into the clinic, and built the computational biology group at Pfizer La Jolla. Coaching youth soccer has taught him about leadership.
Speakers: Dan Steinberg and Felipe Fernandez Martinez
The challenge of predicting retail sales on a product-by-product basis throughout a network of retail stores has been researched intensively by applied econometricians and statisticians for decades. The principal tools of analysis have been linear regression with Bayesian-inspired adjustments to stabilize demand curve estimates. The scale of such analytics can be challenging, as retailers often work with more than 100,000 products (SKUs) and typically operate networks of hundreds of brick-and-mortar stores. Department and grocery stores are excellent examples, but fast food restaurants also require such detailed predictive modeling systems. Depending on the objectives of the company, predictions may be required for blocks of time spanning a week or more, or, as in the case of fast food operators, for each 15-minute time interval of the operating day. The authors have modernized industry-standard approaches to such predictive modeling by leveraging advanced data mining techniques and constructing a completely automated prediction and optimization system. The modern techniques are more adept at detecting nonlinear response, accommodating interactions, and automatically sifting through hundreds if not thousands of potential factors influencing sales outcomes. Our results confirm that conventional statistical models miss a substantial fraction of the explainable variance and that the modern methods dominate in terms of performance and speed of model development.
Accurate prediction is required for reliable planning and logistics, and is also essential for optimization. Optimization with respect to pricing, promotion and assortment can be asked for relative to a variety of objectives (e.g. revenue, profits), and short-term and long-term optimization may result in different decisions being taken. A unique challenge for retailers is the large number of constraints to which most complex retail organizations are subject. Contracts and special understandings with valued suppliers will severely constrain a retailer’s flexibility. For example, certain products may not be promoted (or discounted) in isolation, others (say from competitors) may not be promoted jointly, and the cost of goods sold may well depend on the quantities contracted. We discuss how we have resolved such challenges via a cycle of prediction and simulation driven from a database to develop a flexible high-speed system that can deal with arbitrary constraints and arbitrary objectives, and achieve new levels of predictive accuracy and reliability.
Speaker Bios:
Dan Steinberg is CEO and founder of Salford Systems, the developer of the CART® decision tree, MARS® spline regression, TreeNet® gradient boosting, Breiman's RandomForests®, and other influential data mining technology. After earning a PhD in Econometrics at Harvard, Dan began his professional career as a Member of the Technical Staff at Bell Labs, Murray Hill, and then as Assistant Professor of Economics at the University of California, San Diego. His consulting experience at Salford Systems has included complex modeling projects for major banks worldwide, including CitiBank, Chase, American Express, and Credit Suisse, and has included projects in Europe, Australia, New Zealand, Malaysia, Korea, Japan and Brazil. Dan led the teams that won first place awards in the KDD Cup 2000 and the 2002 Duke/TeraData Churn modeling competition, and the teams that won awards in the PAKDD competitions of 2006 and 2007. Dan has published papers in economics, econometrics, and computer science journals, and contributes actively to the ongoing R & D at Salford.
Felipe Fernandez Martinez obtained a degree in Chemical Engineering at the Universidad Michoacana de San Nicolás de Hidalgo in Morelia, México and subsequently completed an MBA at the Instituto Panamericano de Alta Dirección de Empresas (IPADE) and certificates in corporate finance at ESCP Europe and the Instituto Tecnológico Autónomo de México (ITAM). Felipe worked at Carrefour, the world's 2nd largest retailer, for over 12 years. He was Carrefour’s Director of Strategic Projects for Latin America with responsibility for cost optimization, procurement, pricing, supply chain, and implementation of new analytics tools for Carrefour, Brazil. Prior to joining Carrefour Latin America, Felipe worked for the Carrefour Group in senior level positions in Paris, Italy and Mexico as Cost Optimization Director. Felipe currently is CEO and partner at Interefe, where he advises retailers on projects turning complexity into competitive advantages in three main areas of expertise: energy efficiency, analytics and cost optimization. Felipe is fluent in Spanish, French, Portuguese, Italian and English.
Speaker: Ravi Vijayaraghavan
With the coming of age of the web as a mainstream customer service channel, B2C companies have invested substantial resources in enhancing their web presence. Today customers can interact with a company not only through the traditional phone channel but also through chat, email, social media or web self-service. With the availability of web logs, CRM data and text transcripts, these online channels are rich with data and track several aspects of customer behavior and intent. 24/7 Customer Innovation Labs has developed a series of data mining and statistics-driven solutions to improve customer experience in each of these online channels [1].
This talk will focus on solutions we have developed to enhance the performance of web chat as a customer service channel. Two stages of the customer life cycle will be considered for the purpose of this study: new customer acquisition (or sales) and service of existing customers. In customer acquisition, the key objective is to maximize "incremental" revenues through the chat channel, while in customer service the objective is to drive up the quality of customer experience (as measured by customer satisfaction surveys or mined customer sentiments) through chat. In both these scenarios, applications of data mining, text mining and machine learning have been developed and deployed in:
1. Real-time targeting of the right visitors to chat
2. Predicting customer needs
3. Routing customers to the customer service representatives with the right skill set
4. Mining chat transcripts and social media portals to identify key customer issues and customer sentiments
5. Mining representatives’ responses to identify opportunities for improving performance
6. Feeding back learning from 4 and 5 to 1 (better targeting)
[1] Vijayaraghavan, Ravi et al., Predictive Systems for Customer Interactions, Service Systems Implementation, Chapter 18, Springer, 2011.
Speaker Bio:
Ravi Vijayaraghavan is a Vice-President at 24/7 Customer Innovation Labs where he leads the Analytics and Data Sciences Organization. His team builds data-driven solutions and predictive systems that enable superior customer acquisition and customer service through online and offline channels.
Prior to 24/7 Customer, Ravi was at Ford Motor, where he started his career at Ford Research Laboratories. His research at Ford was in the application of large-scale numerical computations for engineering design. Later, he took up a position in the IT Strategy organization. In each of these roles he drove the use of mathematical and quantitative methods to improve decision-making capability. Most recently he led the development and implementation of an analytics-driven solution to improve the profitability of Ford of Brazil.
Ravi was also a Vice President and part of the executive leadership team of Mu Sigma Inc., a Chicago-based pure-play analytical services company, where he was responsible for client management. As a researcher, Ravi has several refereed and invited publications in major scientific and technical journals and has presented as an invited speaker at several international conferences. He has served on leadership committees in academic societies such as Sigma Xi. In 2004, he was the recipient of a Henry Ford Technology award, the highest technical recognition at Ford Motor Company. Ravi holds a B.Tech degree from Indian Institute of Technology, Madras, a PhD in Engineering from University of Wisconsin-Madison and an MBA (with high distinction) in Strategy and Finance from Ross School of Business, University of Michigan, Ann Arbor.
Co-chairs:
- Ying Li, Microsoft
- Rajesh Parekh, Groupon
Advisory Committee:
- Chid Apte, IBM
- John Elder, Elder Research
- Usama Fayyad, OpenInsights
- Brendan Kitts, Lucid Commerce
- Gabor Melli, Simon Fraser University
- Gregory Piatetsky-Shapiro, KDNuggets
- Bharat Rao, Siemens
- Ted Senator, SAIC
- Ashok Srivastava, NASA
- Ramasamy Uthurusamy, GM
For more information please contact the Industry Practice Expo co-chairs - Ying Li and Rajesh Parekh - at industrial_practice@kdd2011.com








