Research on Data Mining Algorithms Based on Rough Granular Computing (Degree Thesis, Engineering)

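Institution code: 10293
Security classification:

Master's Thesis

Thesis title: Research on Data Mining Algorithms Based on Rough Granular Computing

Student ID: 1010061514
Name: Chen Long
Supervisor: Zhang Tengfei
Discipline: Pattern Recognition and Intelligent Systems
Research direction: Data Mining
Degree applied for: Master of Engineering
Date of submission: February 2013

Study of Data Mining Based on Rough Set and Granular Computing

Thesis Submitted to Nanjing University of Posts and Telecommunications for the Degree of Master of Engineering

By Chen Long
Supervisor: Prof. Zhang Tengfei
Feb. 2013

Abstract (Chinese)

Real-world data sets are growing rapidly in size. Mining the patterned information or knowledge hidden within data is becoming increasingly important, which has made data mining an active research topic. As data mining techniques have matured, studies have found that data often contain large amounts of approximate, fuzzy, and indiscernible information. To handle the indiscernibility problem, many data mining algorithms have been combined with rough set theory and granular computing theory. This work applies rough set and granular computing theory to data sets that exhibit fuzziness, and mainly covers the following aspects:

1. A single-dimension hierarchical granulation attribute reduction algorithm is proposed. The work analyzes the problem of inconsistent granulation conditions that arises when neighborhood methods are used for attribute reduction on continuous data: when a distance metric is the criterion for the approximation relation, applying the same approximation threshold to distance computations over different dimensions inevitably introduces errors in classification accuracy. The single-dimension hierarchical granulation attribute reduction algorithm granulates the neighborhood of each data object attribute by attribute, using a uniform distance threshold. It then computes the classification performance of the data set from the properties relating equivalence granules on adjacent levels of the network-sequence hierarchical granulation model. Experiments show that the algorithm reduces the number of subjective parameters that must be supplied, achieves good reduction performance, and lowers the loss of necessary information.

2. A rough K-means clustering algorithm based on an intra-cluster imbalance measure is proposed. Earlier rough K-means algorithms and their improved variants concentrate on the fuzziness of boundary objects and on the dissimilarity of data points between clusters, and do not consider the intra-cluster differences caused by where samples are located in the distribution. The intra-cluster imbalance measure effectively reflects how the contribution of a data object within its cluster varies with its distance from the mean center. Simulation analysis on UCI data shows that the algorithm makes clusters more compact internally and better separated from one another.

3. A rough K-means clustering algorithm with a density-adaptive intra-cluster imbalance measure is proposed. The imbalanced distribution of data objects within a cluster shows up not only in their distance from the mean center but also in how densely a region is populated. Objects that lie far from the center but in a dense region should also have their importance within the cluster reflected. The density-adaptive intra-cluster imbalance rough K-means algorithm makes the step by which the mean centers move in each iteration more accurate and more flexible. Simulation results show that the algorithm achieves high clustering accuracy and converges faster.

In summary, research on data mining algorithms based on rough set and granular computing theory provides useful support for handling the indiscernibility problem in data mining algorithms, and has good theoretical value and significance.

Key words: rough set, granular computing, attribute reduction, clustering algorithm

Abstract

Real-world data sets are expanding by leaps and bounds. Mining the patterned information or knowledge hidden within data is becoming increasingly important, which has made data mining a hot research topic. Studies find that indiscernible information often exists in data, and many data mining algorithms cannot adapt to processing such data. To deal with the indiscernibility problem, many data mining algorithms are combined with rough set theory and granular computing theory. This research mainly includes the following aspects:

1. A single-dimension hierarchical granulated attribute reduction algorithm.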
In handling attribute reduction for continuous data, neighborhood granulation conditions are not uniform. When a distance metric is the standard for measuring the approximation relation, using the same approximation threshold for distances computed over different dimensions inevitably leads to errors in classification accuracy. The single-dimension hierarchical granulated attribute reduction algorithm constructs the neighborhood system under the same threshold condition, and uses the hierarchical granulation relationship to calculate the classification accuracy. Experiments show that the algorithm achieves good attribute reduction while maintaining high classification accuracy.

2. A rough K-means clustering algorithm based on the imbalance degree of a cluster. Past rough K-means algorithms and their improved variants focus on the indiscernibility of boundary objects and on the differences between data points in different clusters, without considering differences in the data distribution within a cluster. The imbalance degree effectively reflects the importance of a data object within a cluster according to its distance from the mean center. Simulation analysis on UCI data shows that the clustering algorithm makes clusters more compact internally and better separated from one another.

3. An improved imbalance degree of a cluster. Not only distance but also dense regions influence the distribution of data. The importance of data that are distant from the center but lie in a high-density region should also be reflected. The rough K-means clustering algorithm based on the density self-adaptive imbalance degree of a cluster makes the iterative update of the mean centers, and its movement step, more accurate and more flexible.
The simulation results show that the clustering algorithm achieves high accuracy and converges faster.

In summary, research on data mining algorithms based on rough set theory provides support for dealing with indiscernibility, and has good theoretical value and significance.

Key words: Attribute Reduction, Clustering, K-means, Rough Set, Granular Computing

Contents

Glossary of Special Terms
Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Research Status
1.2.1 Development of Attribute Reduction Algorithms .