Data Collection and Marketing Tools (English Version)

Knowledge discovery & data mining: Tools, methods, and experiences
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
A tutorial at EDBT2000

Contributors and acknowledgements
- The people at the Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
- The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
- In particular:
  - Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
  - Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
  - Daniel A. KEIM, University of Halle
  - Daniel Silver, CogNova Technologies
- The EDBT2000 board, who accepted our tutorial proposal

Tutorial goals
- Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
- Provide a systematization of the many concepts around this area, along the following lines:
  - the process
  - the methods applied to paradigmatic cases
  - the support environment
  - the research challenges
- Important issues that will not be covered in this tutorial:
  - methods: time series, exception detection, neural nets
  - systems: parallel implementations

Tutorial Outline
1. Introduction and basic concepts
   1. Motivations, applications, the KDD process, the techniques
2. Deeper into DM technology
   1. Decision Trees and Fraud Detection
   2. Association Rules and Market Basket Analysis
   3. Clustering and Customer Segmentation
3. Trends in technology
   1. Knowledge Discovery Support Environment
   2. Tools, Languages and Systems
4. Research challenges

Introduction - module outline
- Motivations
- Application Areas
- KDD Decisional Context
- KDD Process
- Architecture of a KDD system
- The KDD steps in short

Evolution of Database Technology: from data management to data analysis
- 1960s: Data collection, database creation, IMS and network DBMS.
- 1970s: Relational data model, relational DBMS implementation.
- 1980s: RDBMS, advanced
  data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations: “Necessity is the Mother of Invention”
- Data explosion problem:
  - Automated data collection tools, mature database technology, and the Internet lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories.
- We are drowning in information, but starving for knowledge! (John Naisbitt)
- Data warehousing and data mining:
  - On-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

A rapidly emerging field
- Also referred to as: data dredging, data harvesting, data archeology
- A multidisciplinary field:
  - Database
  - Statistics
  - Artificial intelligence
    - Machine learning, expert systems, and knowledge acquisition
  - Visualization methods

Motivations for DM
- Abundance of business and industry data
- Competitive focus - Knowledge Management
- Inexpensive, powerful computing engines
- Strong theoretical/mathematical foundations
  - machine learning & logic
  - statistics
  - database management systems

What is DM useful for?
- Increase knowledge to base decisions upon.
- E.g., impact on marketing
[Figure: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]

The Value Chain
- Data
  - Customer data
  - Store data
  - Demographical data
  - Geographical data
- Information
  - X lives in Z
  - S is Y years old
  - X and S moved
  - W has money in Z
- Knowledge
  - A quantity Y of product A is used in region Z
  - Customers of class Y use x% of C during period D
- Decision
  - Promote product A in region Z.
  - Mail ads to families of profile P
  - Cross-sell service B to clients C

Application Areas and Opportunities
- Marketing: segmentation, customer targeting, ...
- Finance: investment support, portfolio management
- Banking & Insurance: credit and policy approval
- Security: fraud detection
- Science and medicine: hypothesis discovery, prediction, classification, diagnosis
- Manufacturing: process modeling, quality control, resource allocation
- Engineering: simulation and analysis, pattern recognition, signal processing
- Internet: smart search engines, web marketing

Classes of applications
- Market analysis
  - target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
- Risk analysis
  - forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection
- Text (news group, email, documents) and Web analysis.

Market Analysis
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information.

Market Analysis (2)
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification).
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports;
  - statistical summary information (data central tendency and variation)

Risk Analysis
- Finance planning and asset evaluation:
  - cash flow analysis and prediction
  - contingent claim analysis to
    evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions (CI: competitive intelligence).
  - group customers into classes and class-based pricing procedures
  - set pricing strategy in a highly competitive market

Fraud Detection
- Applications:
  - widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach:
  - use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients, rings of doctors, and rings of references

Fraud Detection (2)
- More examples:
  - Detecting inappropriate medical treatment:
    - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
  - Detecting telephone fraud:
    - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  - Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
  - Watch for the PRIVACY pitfall!

What is KDD? A process!
- The selection and processing of data for:
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena.
- Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
[Figure: Data Sources -> Data Consolidation -> Warehouse (Consolidated Data) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g., p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

The KDD Process: Core Problems & Approaches
- Problems:
  - identification of relevant data
  - representation of data
  - search for valid pattern or model
- Approaches:
  - top-down deduction by expert
  - interactive visualization of data/models
  - * bottom-up induction from data *
[Figure labels: Data Mining, OLAP]

The steps of the KDD process
- Learning the application domain:
  - relevant prior knowledge and goals of application
- Data consolidation: creating a target data set
- Selection and preprocessing
  - Data cleaning (may take 60% of effort!)
  - Data reduction and projection:
    - find useful features, dimensionality/variable reduction, invariant representation.
- Choosing functions of data mining
  - summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Interpretation and evaluation: analysis of results.
  - visualization, transformation, removing redundant patterns, ...
- Use of discovered knowledge

The virtuous cycle
[Figure: a cycle through Identify Problem or Opportunity, Knowledge, Act on Knowledge, and Measure Effect of Action; arc labels: Problem, Knowledge, Results, Strategy]

Applications, operations, techniques
[Figure]

Roles in the KDD process
[Figure]

Data mining and business intelligence
[Figure: layered pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom:]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
[Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]

Architecture of a KDD system
[Figure: a Graphical User Interface over Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]

A business intelligence environment
[Figure]

Data consolidation and preparation
Garbage in, garbage out
- The quality of results relates directly to the quality of the data
- 50%-70% of KDD process effort is spent on data consolidation and preparation
- Major justification for a corporate data warehouse

Data consolidation
From data sources to consolidated data repository
[Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) -> Data Consolidation and Cleansing -> Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]
- Determine preliminary list of attributes
- Consolidate data into working database
  - Internal and external sources
- Eliminate or estimate missing values
- Remove outliers (obvious exceptions)
- Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
- Generate a set of examples
  - choose sampling method
  - consider sample complexity
  - deal with volume bias issues
- Reduce attribute dimensionality
  - remove redundant and/or correlating attributes
  - combine attributes (sum, multiply, difference)
- Reduce attribute value ranges
  - group symbolic discrete values
  - quantize continuous numeric values
- Transform data
  - de-correlate and normalize values
  - map time-series data to static representation
- OLAP and visualization tools play a key role

Data mining tasks and methods
- Automated Exploration/Discovery
  - e.g. discovering new market segments
  - clustering analysis
- Prediction/Classification
  - e.g. forecasting gross sales given current factors
  - regression, neural networks, genetic algorithms, decision trees
- Explanation/Description
  - e.g. characterizing customers by demographics and purchase history
  - decision trees, association rules
[Figures: points in the (x1, x2) plane; a curve f(x) against x; a rule of the form "if age ... 35 and income ... $35k then ..."]

Automated exploration and discovery
- Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
- Distance-based numerical clustering
  - metric grouping of examples (K-NN)
  - graphical visualization can be used
- Bayesian clustering
  - search for the number of classes which results in the best fit of a probability distribution to the data
  - AutoClass (NASA) is one of the best examples

Prediction and classification
- Learning a predictive model
- Classification of a new case/sample
- Many methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Genetic algorithms
  - Nearest neighbor clustering algorithms
  - Statistical (parametric and non-parametric)

Generalization and regression
- The objective of learning is to achieve good generalization to new unseen cases.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points
- Models can be validated with a previously unseen test set or using cross-validation methods
[Figure: a curve f(x) against x]

Classification and prediction
- Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
- Use the obtained model to predict some unknown or missing attribute values based on other information.

Summarizing: inductive modeling = learning
Objective: Develop a general model or hypothesis from specific examples
- Function approximation (curve fitting)
- Classification (concept learning, pattern recognition)
[Figures: classes A and B in the (x1, x2) plane; a curve f(x) against x]

Explanation and description
- Learn a generalized hypothesis (model) from selected data
- Description/interpretation of the model provides new knowledge
- Methods:
  - Inductive decision tree and rule systems
  - Association rule systems
  - Link analysis

Exception/deviation detection
- Generate a model of normal activity
- Deviation from the model causes an alert
- Methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Statistical methods
  - Visualization tools

Outlier and exception data analysis
- Time-series analysis (trend and deviation):
  - Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
  - Similarity-based pattern-directed analysis
  - Full vs. partial periodicity analysis
- Other pattern-directed or statistical analysis

Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures:
  - easily understood by humans
  - valid on new or test data with some degree of certainty.
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
- Find all the interesting patterns: Completeness.
  - Can a data mining system find all the interesting patterns?
- Search for only interesting patterns: Optimization.
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
Evaluation
- Statistical validation and significance testing
- Qualitative review by experts in the field
- Pilot surveys to evaluate model accuracy
Interpretation
- Inductive tree and rule models can be read directly
- Clustering results can be graphed and tabled
- Code can be automatically generated by some systems (IDTs, regression models)

Interpretation and evaluation
- Visualization tools can be very helpful
  - sensitivity analysis (I/O relationship)
  - histograms of value distribution
  - time-series plots and animation
  - requires training and practice
[Figure: axes labeled Response, Velocity, Temp]

Important dates of data mining
- 1989 IJCAI Workshop on KDD
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
- 1991-1994 Workshops on KDD
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
- 1995-1998 AAAI Int. Conf. on KDD and DM (KDD95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD
- 1999 SIGKDD99 Conf.

References - general
- P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
- M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
  Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- M. Berry and G. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
- W. H. Inmon, J. D. Welch, and K. L. Glassey. Managing the Data Warehouse. Wiley, 1997.
- T. Mitchell. Machine Learning. McGraw-Hill, 1997.

Main Web resources
- KDD Newsletter and comprehensive website
- ACM SIGKDD
- Journal of Data Mining and Knowledge Discovery

Tutorial Outline
- Introduction and basic concepts
  - Motivations, applications, the KDD process, the techniques
- Deeper into DM technology
  - Decision Trees and Fraud Detection
  - Association Rules and Market Basket Analysis
  - Clustering and Customer Segmentation
- Trends in technology
  - Knowledge Discovery Support Environment
  - Tools, Languages and Systems
- Research challenges
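
Before moving on to association rules, the objective interestingness measures named in this module (support and confidence) and the "first generate all the patterns, then filter out the uninteresting ones" approach can be illustrated with a minimal Python sketch. The toy transactions, the function names, and the thresholds are assumptions made up for illustration and do not come from the tutorial. The sketch uses the usual definitions: support is the fraction of transactions containing an itemset, and confidence of A -> C is support(A and C together) divided by support(A).

```python
from itertools import combinations

# Hypothetical toy data (an assumption, not from the tutorial): one item set per basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(set(antecedent) | set(consequent), transactions) / base


def interesting_rules(transactions, min_support, min_confidence):
    """Generate every single-item rule lhs -> rhs, then keep those passing both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            c = confidence({lhs}, {rhs}, transactions)
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


if __name__ == "__main__":
    for lhs, rhs, s, c in interesting_rules(TRANSACTIONS, 0.5, 0.7):
        print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

This brute-force enumeration only scales to tiny item sets. The second approach listed under "Completeness vs. optimization" (generate only the interesting patterns) is what dedicated association rule algorithms aim for.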