Lecture 4: Data Classification and Prediction
Xu Congfu, Associate Professor, Institute of Artificial Intelligence, Zhejiang University
Courseware for the undergraduate course "Introduction to Data Mining", Zhejiang University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

I. Classification vs. Prediction
- Classification
  - Predicts categorical class labels (discrete or nominal)
  - Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
  - Models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classification result from the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
- A minimal code sketch of this two-step process is given after the supervised/unsupervised comparison below

Classification Process (1): Model Construction
- The training data and a classification algorithm produce the classifier (model)
- Example rule learned from the data: IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction
- Apply the classifier to the testing data, then to unseen data
- Example: for the unseen tuple (Jeff, Professor, 4), the model answers the query "Tenured?"

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
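The following is a minimal sketch of the two-step process described above: construct a model from labeled training data, estimate its accuracy on an independent test set, then apply it to unseen data. The toy tuples, column encoding, and use of scikit-learn are illustrative assumptions, not part of the original lecture material.

# Sketch of the two-step classification process (illustrative only).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy training set in the spirit of the "tenured" example:
# features = (is_professor, years), label = tenured (1 = yes, 0 = no).
X = [[1, 7], [1, 2], [0, 8], [0, 3], [1, 10], [0, 1]]
y = [1, 0, 1, 0, 1, 0]

# Step 1: model construction on the training split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the held-out test set,
# then classify an unseen tuple such as (Jeff, Professor, 4 years).
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Tenured?", model.predict([[1, 4]])[0])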
II. Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability: time to construct the model (training time) and time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

III. Decision Tree Induction: Training Dataset
- This follows an example from Quinlan's ID3 (the "Playing Tennis" example, here recast as predicting buys_computer)

Output: A Decision Tree for "buys_computer"
- Root test: age?
  - age <= 30: test student? (no -> buys_computer = no; yes -> buys_computer = yes)
  - age 31..40: buys_computer = yes
  - age > 40: test credit_rating? (excellent -> buys_computer = no; fair -> buys_computer = yes)

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed to label the leaf
  - There are no samples left

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let S contain s_i tuples of class C_i, for i = 1, ..., m
- Information (entropy) required to classify an arbitrary tuple:
  I(s_1, ..., s_m) = - sum_{i=1..m} p_i * log2(p_i), where p_i = s_i / s
- Entropy of attribute A with values {a_1, a_2, ..., a_v}, where value a_j splits S into a subset containing s_ij tuples of class C_i:
  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age in the same way and obtain Gain(age) = I(9, 5) - E(age); the attribute with the highest gain becomes the split attribute (for a continuous-valued attribute, candidate split-points are evaluated in the same way)
- A small sketch of this computation is given below
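The following is a minimal sketch of the information-gain computation above. The overall class counts (9 "yes", 5 "no") come from the slide; the per-age-group counts assume the standard 14-tuple buys_computer training set used in the lecture's running example.

# Sketch of the ID3 information-gain computation described above.
from math import log2

def info(counts):
    """Entropy I(s1, ..., sm) of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partition):
    """Information gain of splitting class_counts into the given subsets."""
    total = sum(class_counts)
    expected = sum(sum(subset) / total * info(subset) for subset in partition)
    return info(class_counts) - expected

# Overall distribution: 9 buys_computer = "yes", 5 = "no".
print(round(info([9, 5]), 3))                              # 0.940, as on the slide

# Split on age (assumed counts): <=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))    # Gain(age) = 0.246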
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Examples (from the tree above):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Avoiding Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would cause the goodness measure to fall below a threshold
    - It is difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance on the entire distribution

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - Relatively faster learning speed than other classification methods
  - Convertible to simple and easy-to-understand classification rules
  - Can use SQL queries for accessing databases
  - Classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
- Visualization of a decision tree in SGI/MineSet 3.0
- Interactive visual mining by Perception-Based Classification (PBC)

IV. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H, i.e., the initial probability before we observe any data, reflecting background knowledge
- P(X): probability that the sample data is observed
- P(X|H): probability of observing sample X, given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) * P(H) / P(X)
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|X) = argmax_h P(X|h) * P(h)
- Practical difficulty: requires initial knowledge of many probabilities and significant computational cost

Naive Bayes Classifier
- Simplifying assumption: attributes are conditionally independent given the class, e.g., for two attribute values y1 and y2 and class C:
  P(y1, y2 | C) = P(y1 | C) * P(y2 | C)
- No dependence relation between attributes is modeled
- This greatly reduces the computation cost: only the class distribution and per-attribute class-conditional counts are needed
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)

Training Dataset
- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naive Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- Multiply by the class priors P(Ci):
  P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
- Therefore, X belongs to the class buys_computer = "yes"
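The following is a minimal sketch of the naive Bayesian computation just shown. The conditional probabilities are copied from the worked example; the priors 9/14 and 5/14 assume the standard 14-tuple training set, and the dictionary keys are illustrative shorthand for the attribute-value pairs.

# Sketch of the naive Bayes decision for the sample X above.
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
prior = {"yes": 9/14, "no": 5/14}        # assumed class priors for the 14-tuple dataset

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in ("yes", "no"):
    p = prior[c]
    for attr in x:                        # conditional independence assumption
        p *= cond[c][attr]
    scores[c] = p

print(scores)                             # about 0.028 for "yes", 0.007 for "no"
print("predicted class:", max(scores, key=scores.get))   # buys_computer = "yes"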
Naive Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables
  - E.g., hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by a naive Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution (a small sketch of this factorization appears after this Bayesian-network discussion)
- Nodes: random variables; links: dependencies
- Example: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P; the graph has no loops or cycles

Bayesian Belief Network: An Example
- Variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of the values of its parents (FH, S):

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC        0.8       0.5        0.7        0.1
  ~LC       0.2       0.5        0.3        0.9

Learning Bayesian Networks
- Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some variables hidden: use a gradient-descent method, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this purpose
- See D. Heckerman, "Bayesian Networks for Data Mining"
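The following is a minimal sketch of how a belief network specifies a joint distribution: each variable contributes one factor P(variable | its parents), and the joint probability is the product of these factors. The LungCancer CPT values are taken from the example above; the priors for FamilyHistory and Smoker are made-up numbers for illustration only.

# Sketch of the joint-probability factorization for part of the example network.
p_fh = {True: 0.3, False: 0.7}            # assumed prior, not from the lecture
p_s  = {True: 0.4, False: 0.6}            # assumed prior, not from the lecture

# P(LC = true | FH, S) from the CPT above; P(LC = false | ...) = 1 - value.
p_lc_given = {(True, True): 0.8, (True, False): 0.5,
              (False, True): 0.7, (False, False): 0.1}

def joint(fh, s, lc):
    """P(FH = fh, S = s, LC = lc) under the network factorization."""
    p_lc = p_lc_given[(fh, s)] if lc else 1.0 - p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * p_lc

# e.g. probability of a smoker with family history who develops lung cancer
print(joint(True, True, True))            # 0.3 * 0.4 * 0.8 = 0.096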
V. What Is Prediction?
- (Numerical) prediction is similar to classification:
  - Construct a model
  - Use the model to predict a continuous or ordered value for a given input
- Prediction differs from classification:
  - Classification predicts categorical class labels
  - Prediction models continuous-valued functions
- The major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis includes:
  - Linear and multiple regression
  - Nonlinear regression
  - Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 * x
  where w0 (the y-intercept) and w1 (the slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line (a minimal least-squares sketch is given at the end of this regression discussion)
- Multiple linear regression involves more than one predictor variable
  - Training data has the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  - E.g., for 2-D data: y = w0 + w1 * x1 + w2 * x2
  - Solvable by an extension of the least-squares method or by using packages such as SAS or S-Plus
  - Many nonlinear functions can be transformed into this form

Nonlinear Regression
- Some nonlinear models can be expressed as polynomial functions
- A polynomial regression model can be transformed into a linear regression model; for example,
  y = w0 + w1 * x + w2 * x^2 + w3 * x^3
  becomes linear with the new variables x2 = x^2 and x3 = x^3:
  y = w0 + w1 * x + w2 * x2 + w3 * x3
- Other functions, such as power functions, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression is extended to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data): approximate discrete multidimensional probability distributions; also useful for data compression and smoothing
- Regression trees and model trees: trees that predict continuous values rather than class labels

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute over the training tuples that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, i.e., a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not well represented by a simple linear model
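The following is a minimal sketch of simple linear regression y = w0 + w1 * x fitted by the method of least squares, as described above. The data points are illustrative values, not taken from the lecture.

# Sketch of least-squares fitting for simple linear regression.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates:
#   w1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2),  w0 = mean_y - w1 * mean_x
w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
w0 = mean_y - w1 * mean_x

print(f"y = {w0:.3f} + {w1:.3f} * x")     # close to y = 2x for these points
print("prediction for x = 6:", w0 + w1 * 6)

# Polynomial regression can reuse the same machinery after the variable
# transformation x2 = x^2, x3 = x^3 described above (multiple regression).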
Predictive Modeling in Multidimensional Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
- Typically one can only predict value ranges or category distributions
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors that influence the prediction: data relevance analysis via uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis
- (The slides "Prediction: Numerical Data" and "Prediction: Categorical Data" show example visualizations of multidimensional prediction.)

VI. Summary
- Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
- Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms, rough set and fuzzy set approaches.
- Linear, nonlinear, and generalized linear regression models can be used for prediction. Many nonlinear problems can be converted to linear ones by transforming the predictor variables. Regression trees and model trees are also used for prediction.

VII. References
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.). John Wiley & Sons, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.