Lecture 2: Data Preprocessing Techniques

Xu Congfu, Associate Professor
Institute of Artificial Intelligence, Zhejiang University
Courseware for the Zhejiang University undergraduate course "Introduction to Data Mining"

Outline
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

I. Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42”, Birthday=“03/07/1997”
    - e.g., was rating “1, 2, 3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data comes from
  - “n/a” data values when collected
  - different considerations between the time when the data was collected and when it is analyzed
  - human/hardware/software problems
- Noisy data comes from the process of data
  - collection
  - entry
  - transmission
- Inconsistent data comes from
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- “Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse.”
  Bill Inmon

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

II. Data Cleaning
- Importance
  - “Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)
  - “Data cleaning is the number one problem in data warehousing” (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple
  - usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually
  - tedious + infeasible?
- Fill it in automatically with
  - a global constant: e.g., “unknown”, a new class?!
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as Bayesian formula or decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort data and partition into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N.
  - The most straightforward, but outliers may dominate the presentation
  - Skewed data is not handled well.
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
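
The binning steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original courseware: the function names (equal_depth_bins, equal_width_bins, smooth_by_means, smooth_by_boundaries) are hypothetical, bin means are left unrounded, and ties in boundary smoothing are assumed to go to the lower boundary.

```python
from statistics import mean


def equal_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins


def equal_width_bins(values, n_bins):
    """Split the range [A, B] into n_bins intervals of width W = (B - A) / N."""
    data = sorted(values)
    low, high = data[0], data[-1]
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x in data:
        # Guard against a zero-width range; the maximum value goes in the last bin.
        idx = 0 if width == 0 else min(int((x - low) / width), n_bins - 1)
        bins[idx].append(x)
    return bins


def smooth_by_means(bins):
    """Replace every value in a (non-empty) bin by that bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum.

    Assumption: a value equally close to both boundaries takes the lower one.
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if x - low <= high - x else high for x in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth_bins = equal_depth_bins(prices, 3)
print(depth_bins)
print(smooth_by_means(depth_bins))
print(smooth_by_boundaries(depth_bins))
print(equal_width_bins(prices, 3))
```

On the price data, the equi-depth split reproduces the three bins listed above. The equal-width split with N = 3 uses W = (34 - 4)/3 = 10 and places six of the twelve values in the last interval, which illustrates how equal-width bins can end up unevenly populated.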