Research and Application of an Associative Classification Method Based on Minimum Discrepancy

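Anhui University degree thesis

Abstract (translated from the Chinese original)

With the rapid development of information technology, the volume of data that people must process keeps growing, and discovering valuable knowledge in that data has become an active research topic. Data mining analyzes data to reveal the knowledge it contains, and it has become one of the important goals of future information technology applications. Among the branches of data mining, association rules and classification are two important ones. Association rules mainly describe latent relationships among the data items in a database; by mining the correlation between attributes and class labels, association rule mining can be used for classification, which has considerable practical value.

Algorithm research occupies an especially important place in data mining as a whole. On the one hand, data mining deals with large datasets, so time and space consumption are serious concerns, and an algorithm's efficiency is decisive for whether it can be applied to real-world problems. On the other hand, the algorithm must classify accurately, since high classification accuracy is ultimately the main purpose of applying it in practice.

First, this thesis outlines the basic theory of associative classification and several classical algorithms, and introduces the classical association rule mining algorithms.

Second, to address the limitation of considering only support and confidence when generating association rules, the thesis introduces the concept of minimum discrepancy, improves the associative classification algorithm, and proposes CFPM (classification based on frequent pattern matrix), a classification algorithm based on the minimal frequent pattern matrix. The basic idea and the steps of the algorithm are described in detail. The algorithm is tested on a selection of UCI datasets, and its results are compared with the decision tree algorithm C4.5 and with the classical associative classification methods CBA, CMAR and CPAR. The experiments show that CFPM achieves relatively high classification accuracy on most of the datasets and is better suited to discrete datasets. For a single dataset, the changes in classification accuracy and running time are also measured as the number of records varies.

Third, because the algorithm is inefficient on large datasets, i.e. its running time is too long and its memory overhead too large, the thesis introduces two further concepts: attribute importance and independent attribute values. Classification accuracy and running time are compared in three settings: considering only attribute importance, considering only independent attribute values, and considering both. The experiments show that after unimportant attributes are removed, running time drops sharply while classification accuracy actually improves slightly. This indicates that the removed attributes had a negative effect on classification, that is, they caused misclassification.

Finally, CFPM is applied to a real-world problem. The dataset used is adult, the Census Bureau data from the UCI public repository; the task is to decide, given the attributes and their values, whether a person's annual income exceeds 50K. The experiments show that the associative classification algorithm based on minimum discrepancy can be applied to fairly large datasets and achieves higher classification accuracy than the other algorithms. When the support threshold is raised, running time falls substantially at the cost of a slight drop in accuracy, which indicates that the algorithm is practical.

Keywords: minimum discrepancy; attribute importance; independent attribute values; associative classification

Abstract

With the rapid development of information technology, the amount of data to be processed keeps growing. How to discover valuable knowledge in data has become a research hotspot. Data mining uncovers the knowledge contained in data through data analysis, and has become one of the most important goals of information technology applications. Among the branches of data mining, association rules and classification are two important ones. Association rules are mainly used to describe the potential relations among data items; by mining the relationship between attributes and class labels, association rule mining can be used to classify databases, which has important practical value. In data mining research, the study of algorithms holds a significant position.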
On the one hand, data mining is confronted with large databases, so time and space consumption is one of the most important problems, and the efficiency of an algorithm is key to whether it can be applied in real life. On the other hand, the algorithm must have high classification precision; after all, high precision is the main purpose of applying it in practice.

Firstly, this thesis summarizes the basic theory of associative classification and the classical algorithms, and introduces the classical association rule algorithms.

Secondly, to address the limitation of relying only on min-support and confidence, the concept of min-discrepancy is introduced, the associative classification algorithm is improved, and the algorithm CFPM (classification based on frequent pattern matrix) is proposed, together with its basic idea and steps. Experiments are run on the UCI public databases, and the classification precision is compared with that of the decision tree algorithm and of other classical associative classification algorithms, namely CBA, CMAR and CPAR. The results indicate that CFPM has higher classification precision and is better suited to discrete datasets. Datasets of different sizes are also used to test the classification precision and the runtime.

Thirdly, to address the low efficiency of CFPM, in other words its long runtime and large memory consumption, the concepts of attribute-importance and attribute-independency are proposed. The classification precision and runtime are compared in three situations: considering only attribute-importance, considering only attribute-independency, and considering both. The results indicate that after the unimportant attributes are deleted, the runtime decreases markedly while the classification precision, on the contrary, improves.
The conclusion is that the deleted attributes are unrelated attributes that harm classification precision and can cause misclassification.

Finally, the algorithm CFPM is applied to a real-life problem. The Adult dataset from the UCI public datasets, which contains data from the US Census Bureau, is used in the thesis; the task is to estimate whether a person's annual salary is more than 50K. The results indicate that classification based on association rules can be applied to larger datasets and has higher classification precision. The runtime is greatly reduced at a small cost in classification precision. All of this indicates that the algorithm CFPM is feasible.

Keywords: min-discrepancy; attribute-importance; attribute-independency; associative classification

Contents

Abstract (Chinese)
Abstract