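Anhui University degree thesis: Research and Application of an Association Classification Method Based on Minimum Discrepancy

Abstract (translated from the Chinese original)

With the rapid development of information technology, the volume of data that people need to process keeps growing. How to discover valuable knowledge in data has become a current research focus. Data mining analyzes data to reveal the knowledge it contains, and has become one of the important goals of future information technology applications. Among the branches of data mining, association rules and classification are two important research areas. Association rules are mainly used to describe latent relationships among the data items in a database; by mining the correlation between attributes and class labels, association rule mining can be used for classification, which has great practical value.

Algorithm research holds a particularly important place in data mining research as a whole. On the one hand, data mining deals with large data sets, so time and space consumption are both important concerns, and an algorithm's efficiency is key to whether it can be applied to real-world problems. On the other hand, there is the question of whether the algorithm achieves high classification accuracy; after all, high classification accuracy is the most important requirement for applying an algorithm in practice.

First, this thesis surveys the basic theory of associative classification and several classic algorithms, and introduces the classic association rule mining algorithms.

Second, to address the limitation of considering only support and confidence when generating association rules, the thesis introduces the concept of minimum discrepancy, improves the associative classification algorithm, and proposes CFPM (classification based on frequent pattern matrix), a classification algorithm based on the minimal frequent pattern matrix. The basic idea and steps of the algorithm are described in detail. The algorithm is tested on a number of UCI data sets, and its classification results are compared with the decision tree algorithm C4.5 and the classic associative classification methods CBA, CMAR, and CPAR. The experimental results show that CFPM achieves relatively high classification accuracy on most of the data sets and is better suited to discrete data sets. For a single data set with different numbers of records, the changes in classification accuracy and running time are also measured.

Next, because the algorithm is inefficient on large data sets, that is, its running time is too long and its memory overhead too large, two further concepts are introduced: attribute importance and independent attribute values. Classification accuracy and running time are compared in three settings: considering only attribute importance, considering only independent attribute values, and considering both. The experimental results show that after unimportant attributes are deleted, running time drops sharply while classification accuracy actually improves slightly. This indicates that the deleted attributes have a negative effect on classification, that is, they cause classification errors.

Finally, CFPM is applied to a real-world problem. The data set used is adult, the Census Bureau data in the UCI public repository; the task is to determine, given the attributes and their values, whether a person's annual income exceeds 50K. The experimental results show that the associative classification algorithm based on minimum discrepancy can be applied to larger data sets and achieves higher classification accuracy than other algorithms. When the support threshold is increased, running time falls substantially at the cost of a slight drop in accuracy, which indicates that the algorithm is practically feasible.

Keywords: minimum discrepancy; attribute importance; independent attribute values; associative classification

Abstract

With the rapid development of information technology, the data to be processed grow larger and larger. How to discover valuable knowledge has recently become a research hotspot. Data mining uncovers the knowledge hidden in data through data analysis, and has become one of the most important goals of information technology applications. Among the branches of data mining, association rules and classification are two important ones. Association rules are mainly used to describe potential relations among data items; by mining the relationship between attributes and class labels, association rule mining can be used to classify databases, which has important practical value. In data mining research, research on algorithms holds a significant position.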
On the one hand, data mining is confronted with large databases, and the consumption of time and space is one of the most important problems, so the efficiency of an algorithm is key to whether it can be applied in real life. On the other hand, there is the question of whether the algorithm has high classification precision; after all, high precision is the main condition for applying it in practice.

First, this thesis summarizes the basic theory of association classification and the classical algorithms, and introduces the classical association rule algorithms.

Second, to address the limitation of using only min-support and confidence, the concept of min-discrepancy is introduced, the association classification algorithm is improved, and the basic idea and steps of the algorithm CFPM (classification based on frequent pattern matrix) are proposed. Experiments are run on UCI public databases, and the classification precision is compared with that of the decision tree algorithm and other classical association classification algorithms, for example CBA, CMAR, and CPAR. The results indicate that CFPM has higher classification precision and is better suited to discrete datasets. Datasets of different sizes are also used to test classification precision and runtime.

Third, to address the low efficiency of CFPM, in other words its long runtime and large memory cost, the concepts of attribute-importance and attribute-independency are proposed. Classification precision and runtime are compared in three situations: considering only attribute-importance, considering only attribute-independency, and considering both. The results indicate that after the unimportant attributes are deleted, the runtime decreases markedly while the classification precision, by contrast, improves.
The conclusion is that the deleted attributes are unrelated attributes that have a negative effect on classification precision and can cause inaccurate classification.

Finally, the algorithm CFPM is applied to a real-life problem. The Adult dataset from the UCI public datasets, which contains data from the US Census Bureau, is used in the thesis; the task is to estimate whether a person's annual salary is more than 50K. The results indicate that classification based on association rules can be applied to larger datasets and has higher classification precision. The runtime is greatly reduced at a small cost in classification precision. All of this indicates that the algorithm CFPM is feasible.

Keywords: Min-discrepancy; attribute-importance; attribute-independency; association classification
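The abstract refers to the support and confidence of association rules without defining them. As a minimal sketch, assuming the standard definitions for a class association rule (antecedent -> class label), the two measures could be computed as shown below. The records, attribute names, and the function rule_support_confidence are hypothetical illustrations and are not the thesis's CFPM algorithm; min-discrepancy is not defined in this excerpt and is therefore not implemented.

```python
from typing import Dict, List, Tuple

# One record: (attribute -> value mapping, class label). Hypothetical toy format.
Record = Tuple[Dict[str, str], str]


def rule_support_confidence(
    records: List[Record], antecedent: Dict[str, str], label: str
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule `antecedent -> label`.

    support    = fraction of all records matching antecedent AND label
    confidence = fraction of antecedent-matching records that have label
    """
    # Class labels of the records whose attributes satisfy the antecedent.
    matched_labels = [
        cls
        for attrs, cls in records
        if all(attrs.get(k) == v for k, v in antecedent.items())
    ]
    n_antecedent = len(matched_labels)
    n_rule = sum(1 for cls in matched_labels if cls == label)
    support = n_rule / len(records) if records else 0.0
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy data, loosely styled after an income-classification task.
toy_records: List[Record] = [
    ({"education": "college", "hours": "high"}, ">50K"),
    ({"education": "college", "hours": "low"}, "<=50K"),
    ({"education": "school", "hours": "high"}, "<=50K"),
    ({"education": "college", "hours": "high"}, ">50K"),
]

support, confidence = rule_support_confidence(
    toy_records, {"education": "college"}, ">50K"
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data the rule education=college -> >50K holds for 2 of 4 records (support 0.5) and for 2 of the 3 records with education=college (confidence about 0.67). A rule generator that filters on these two thresholds alone is the baseline whose limitation the thesis says it addresses.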