A Study of Chinese Text Classification Based on Category Concepts

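Security classification:        Confidentiality period:

Master's Degree Thesis

Title: A Study of Chinese Text Classification Based on Category Concepts
Student ID: 035008
Name:
Major: Circuits and Systems
Supervisor:
School: School of Telecommunication Engineering
Date:

Beijing University of Posts and Telecommunications, Master's Degree Thesis

A Study of Chinese Text Classification Based on Category Concepts

Abstract

Advances in network technology and the openness of the Internet have gradually turned it into a comprehensive repository of resources. More and more information is transmitted around the world over the Internet, and more and more information accumulates on it; judging by current trends, the network is set to become the main source from which people obtain information. The Internet, however, is poorly organized and lacks the necessary structure, and the sheer volume and disorder of its information make it increasingly difficult for people to find the content that interests them.

Extracting useful information from large volumes of data is the task of data mining. Because text is the main carrier of information on the Internet, text mining has become one of the active areas of data mining as the Internet has grown rapidly. Text classification is the foundation and core of text mining.

Text classification can be done manually or automatically. Traditional text classification is manual, and this approach has many drawbacks: long turnaround, high cost, low efficiency, the need for large numbers of trained staff, and poor consistency of results. Since the 1990s, automatic text classification based on machine learning has increasingly become the mainstream. Compared with the manual approach, it offers short turnaround, high efficiency, savings in human resources, and highly consistent results. Yet since research on automatic text classification began, its accuracy has never reached a satisfactory level. Today's rapid growth of information on the Internet gives text classification broad room to develop, and automatic text classification faces unprecedented opportunities and challenges; how to improve classification accuracy has become a focus of research.

The vector space model is one of the most widely used models in automatic text classification. Working from the vector space model, we found that a sound vector representation of a text is a critical precondition for classifying it correctly. In traditional classification methods, however, each feature selection algorithm has its own strengths and weaknesses, and the features selected do not represent the text well, which limits classification accuracy to a large extent. Taking this as our starting point, we analyze the conditions that a feature term should satisfy and propose a feature selection method based on category concepts. Unlike traditional feature selection methods, which consider only the surface form of the words in a text, our method works mainly from the semantic concepts of the words and also takes the category information of each feature into account: it selects feature terms that strongly indicate a single category and builds the feature space from them. In our experiments, we ran the traditional feature selection methods and the proposed method on the same data set with the same classification algorithm; the experimental data show that our feature selection algorithm achieves comparatively high precision and recall.

Keywords: text classification; vector space model; HowNet; category concept

A Study on Category Conception in Text Classification

Abstract

The development of network technology and the openness of the Internet are gradually making it an all-round repository of resources. More and more information is being delivered to every part of the world, and more and more information is gathering on the Internet. Judging by the trend of development, the network will become the main source from which people obtain information. But the Internet is badly organized, and the volume and disorder of its information make it harder and harder to find content of interest.

The task of data mining is to extract useful information from a mass of data. Because text is the main carrier of information on web pages, text mining is becoming one of the focal points of data mining as the Internet develops rapidly. Text classification is the foundation and core of text mining.

Text classification includes manual methods and automatic methods.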
Conventional text classification is manual and has many shortcomings, such as long turnaround, high cost, low efficiency, the need for large numbers of professional staff, and low consistency of results. Since the 1990s, automatic text classification based on machine learning has gradually become the mainstream. Compared with the manual approach, it offers short turnaround, high efficiency, and high consistency of results. Despite these merits, the accuracy of automatic text classification has not yet been satisfactory. The rapid growth of information on the Internet gives text classification a wide stage; it faces both opportunities and challenges, and research now focuses on how to improve the accuracy of classification results.

The vector space model is one of the most widely used models in text classification. Working from the vector space model, we found that a sound vector representation of a text is a key precondition for accurate classification. In many conventional text classification systems, however, each feature selection method has its own strengths and weaknesses, and the selected features do not represent the texts well, which restricts any improvement in accuracy. Our study starts from this point: after analyzing the conditions that features should satisfy, we propose a new feature selection method based on category concepts. Conventional feature selection methods consider only the surface form of the words in a text. In contrast, our method mainly analyzes the underlying concepts of the words and at the same time takes the category information of the features into account. It selects the concepts that strongly indicate a single class to form its feature space. In our experiments, we compared conventional feature selection methods with our method under the same conditions, using the same corpus and the same classification algorithm.
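The results showed that the proposed feature selection method based on category concepts achieves comparatively high precision and recall.

Keywords: text classification; VSM; HowNet; category concept

Table of Contents

Abstract (Chinese)
Abstract (English)
Chapter 1  Introduction
  1.1 Research Background and Significance
  1.2 Data Mining
    1.2.1 Origins of Data Mining
    1.2.2 Definition of Data Mining
    1.2.3 Content and Nature of Data Mining Research
    1.2.4 Functions of Data Mining
    1.2.5 Future Research Directions in Data Mining
  1.3 Text Mining
    1.3.1 Definition of Text Mining
    1.3.2 Categories of Text Mining
  1.4 Organization of This Thesis
  1.5 Chapter Summary
Chapter 2  Text Classification Techniques
  2.1 Overview of Text Classification
    2.1.1 Development and Applications of Text Classification
    2.1.2 Definition of Text Classification
    2.1.3 Types of Text Classification
    2.1.4 Text Classification Models
    2.1.5 VSM-Based Text Classification
  2.2 Feature Extraction Techniques
    2.2.1 Document Frequency
    2.2.2 Information Gain
    2.2.3 Mutual Information
    2.2.4 CHI
    2.2.5 Term Strength
    2.2.6 Expected Cross Entropy
    2.2.7 Odds Ratio
    2.2.8 Weight of Evidence for Text
  2.3 Classification Techniques
    2.3.1 Simple Distance Vector Classification
    2.3.2 TFIDF-Based Rocchio Algorithm
    2.3.3 Naive Bayes Model
    2.3.4 K-Nearest Neighbor Algorithm
    2.3.5 Decision Trees
    2.3.6 Neural Networks
    2.3.7 Support Vector Machines
  2.3 Chapter Summary
Chapter 3  Feature Selection Method Based on Category Concepts
  3.1 Problem Analysis
  3.2 HowNet
    3.2.1 Introduction to HowNet
    3.2.2 Concept Disambiguation
    3.2.3 Synonyms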