Data Mining: Data Preprocessing

Data Mining: Concepts and Techniques, Chapter 2

Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - Noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - Inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthday=“03/07/1997”
    - e.g., was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: forms of data preprocessing]

Mining Data Descriptive Characteristics
- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals

Measuring the Central Tendency
- Mean (algebraic measure) (sample vs. population): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (population)
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chopping extreme values before averaging
- Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula (for moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$

Symmetric vs. Skewed Data
- Median, mean and mode of symmetric, positively and negatively skewed data
[Figure: symmetric, positively skewed, and negatively skewed distributions]

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend from the box, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1

Boxplot Analysis
- Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum

Measuring the Dispersion of Data (cont.)
- Variance and standard deviation (sample: $s$, population: $\sigma$)
  - Variance (algebraic, scalable computation): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample), $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population)
  - Standard deviation $s$ (or $\sigma$) is the square root of variance $s^2$ (or $\sigma^2$)

Properties of Normal Distribution Curve
- The normal (distribution) curve
  - From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)
  - From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of it
  - From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 9