Research on an Assistant System for Grading Subjective Tests Based on Text Mining

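Hebei University of Technology (河北工业大学) Master's Thesis
Name: 赵彦春 (Zhao Yanchun)
Degree applied for: Master
Major: Control Science and Engineering
Supervisor: 魏玮 (Wei Wei)
2009-12-01

Abstract

Computer grading of objective tests has essentially been achieved. For subjective tests, however, because of the way they are answered and their complexity, no examination system can yet grade them automatically with good results. With continuing progress in data mining, pattern recognition, artificial intelligence, and natural language understanding, grading subjective tests by computer will gradually become a reality. When teachers grade subjective tests, they generally first check the scoring points in a student's answer, that is, the keywords: the more keywords, the higher the score. They then consider how similar the keywords are to the standard answer: the higher the similarity, the higher the score. This thesis uses text classification algorithms from text mining to assign each subjective-test answer to a category associated with a score; the final score of the answer is the score of the category to which it belongs.

The thesis first studies the preprocessing of subjective-test text, covering basic theory such as word segmentation algorithms, the vector space model (VSM), and feature selection, together with two text classification algorithms, the nearest-neighbour method and the BP neural network, and analyses each algorithm. Second, it builds a specialized lexicon for a given subject that can be adjusted dynamically, and proposes segmenting text by combining forward maximum matching with the specialized lexicon, to address ambiguous segmentation and out-of-vocabulary words. Third, it proposes a matching-substitution method to optimize the segmentation result; this overcomes the small word units and large word counts of the raw segmentation and brings the keyword set closer to the students' answers. Next, it computes the document frequency of the keywords, performs feature selection by keeping a given number of them, and uses the vector space model to represent each text as a feature vector. Finally, kNN and the BP neural network are used to classify the preprocessed student answers; each is tested on subjective-test text, and the accuracy and effectiveness of the two classification algorithms are analysed. The more accurate algorithm is chosen to build the grading assistant system.

The thesis focuses on two problems, Chinese automatic word segmentation and the classification of subjective-test text. Simulation experiments are run in Matlab, and the program is written in Visual C++. A comparison of the two classification algorithms shows experimentally that the BP neural network is more accurate. In the early stage of computer grading, subjective tests are graded jointly by teachers and the computer. Because the grading assistant system is adaptive, it keeps improving as the number of experimental samples grows, ultimately reaching the goal of the computer grading subjective tests independently.

Keywords: subjective tests, text classification, Chinese automatic word segmentation, kNN method, BP neural network

THE RESEARCH ON THE ASSISTANT SYSTEM FOR GRADING SUBJECTIVE TESTS BASED ON TEXT MINING

ABSTRACT

Automatic grading of objective tests is well established. Subjective tests, however, are still not graded well automatically, because of their characteristics and complexity. With the development of data mining, pattern recognition, artificial intelligence, and natural language understanding, grading subjective tests by computer will gradually become a reality. When grading subjective tests, teachers commonly check the keywords in students' answers: the more keywords an answer contains, the higher the mark. They then examine how close the student's answer is to the standard answer: the closer it is, the higher the mark. In this thesis, subjective-test answers are classified into categories by text classification, a text mining technique, and the mark of an answer is the mark of the category to which it belongs.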
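The segmentation step summarized above, forward maximum matching against a specialized lexicon, can be sketched as follows. This is a minimal illustration in Python, not the thesis's Visual C++ implementation: the function name `fmm_segment`, the sample lexicon entries, and the single-character fallback for unmatched text are all assumptions made here.

```python
def fmm_segment(text, lexicon, max_word_len=None):
    """Segment `text` by forward maximum matching against `lexicon` (a set of words).

    At each position the longest lexicon word starting there is taken;
    if none matches, a single character is emitted (an assumed fallback).
    """
    if max_word_len is None:
        # Longest entry in the lexicon bounds the candidate length.
        max_word_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical specialized lexicon for one subject.
lexicon = {"文本", "分类", "文本分类", "算法", "神经网络"}
print(fmm_segment("文本分类算法", lexicon))

# A dynamically adjustable lexicon: adding a term changes later segmentations.
print(fmm_segment("得分点", lexicon))
lexicon.add("得分点")
print(fmm_segment("得分点", lexicon))
```

Adding a term to the set changes how later texts are segmented, which is one simple reading of a dynamically adjustable lexicon. The thesis's own lexicon update mechanism and its matching-substitution step are not reproduced here.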
First, the preprocessing of subjective-test text is studied, including word segmentation, the vector space model (VSM), and feature selection, and two main text classification algorithms, kNN and the BP neural network, are analysed. Second, a specialized lexicon for a given subject is created; it can be adjusted dynamically. For word segmentation, a method combining forward maximum matching (FMM) with the specialized lexicon is proposed to address segmentation ambiguity and out-of-vocabulary words. Third, the segmentation results are refined by the matching-substitution method proposed in this thesis, which overcomes the problems of small word units and a large number of words and yields keywords that are close to the students' answers. Next, the document frequency of each keyword is computed and at most a given number of keywords is selected; each text is then represented as a vector using the VSM. Finally, to find the algorithm best suited to grading subjective tests, kNN and the BP neural network are both used to classify the answers. In the experiments, both algorithms are tested on subjective-test text, and their accuracy and validity are analysed from the results. The more accurate classification algorithm is then chosen to build the assistant system for grading subjective tests. The work centres on Chinese word segmentation and text classification; the simulation experiments are carried out in Matlab and the program is written in Visual C++. The experiments show that the BP neural network is more accurate. In the initial stage of computer grading, teachers mark the tests with the help of the computer.
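The document-frequency selection, vector space representation, and kNN classification described above can be sketched in Python as follows. It is a toy illustration under stated assumptions, not the thesis's Matlab experiment: the function names, the use of raw term counts as vector weights, cosine similarity, majority voting, and the sample tokens and marks are all hypothetical choices made here.

```python
import math
from collections import Counter


def select_by_document_frequency(docs, limit):
    """Return up to `limit` terms, ranked by how many documents contain them."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    return ranked[:limit]


def to_vector(tokens, vocabulary):
    """Represent a segmented text as term counts over the selected vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_mark(query, train_vectors, train_marks, k=3):
    """Return the most common mark among the k most similar training answers."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: cosine(query, train_vectors[i]),
                   reverse=True)
    votes = Counter(train_marks[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical segmented answers with teacher-assigned marks (the categories).
answers = [
    ["segmentation", "lexicon", "matching"],
    ["segmentation", "lexicon", "matching", "ambiguity"],
    ["segmentation", "lexicon"],
    ["segmentation"],
    ["lexicon", "noise"],
]
marks = [5, 5, 3, 1, 1]

vocabulary = select_by_document_frequency(answers, 3)
train_vectors = [to_vector(a, vocabulary) for a in answers]

new_answer = ["segmentation", "lexicon", "matching", "extra"]
print(knn_mark(to_vector(new_answer, vocabulary), train_vectors, marks))
```

Treating each mark as a class label mirrors the idea that an answer's score is the score of the category it is assigned to. The BP neural network that the thesis reports as more accurate would replace `knn_mark` with a trained classifier over the same vectors.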
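The grading system is adaptive, so it improves as the number of samples increases, and eventually the computer will grade subjective tests independently.

KEY WORDS: subjective tests, text classification, Chinese automatic word segmentation, kNN, BP neural network

Declaration of Originality

I solemnly declare that the thesis submitted is the result of research I carried out under the guidance of my supervisor. Except for content already cited and marked in the text, the research results of this thesis contain no content from works created by others, whether published or unpublished. Other individuals and groups who contributed to the research in this thesis have been clearly identified in the text. I bear the legal responsibility for this declaration of originality.

Signature of thesis author:    Date: 2010.01.07

Statement on Authorization for the Use of Thesis Copyright

I fully understand the regulations of Hebei University of Technology (河北工业大学) on collecting, preserving, and using theses, and agree to the following: to submit printed and electronic copies of the thesis as the university requires; the university has the right to keep printed and electronic copies of the thesis and to preserve it by photocopying, reduced-format printing, scanning, digitization, or other means; the university has the right to provide catalogue retrieval and reading access to all or part of this thesis; the university has the right, under the relevant regulations, to send copies and electronic versions of the thesis to the relevant state departments or institutions; provided it is not for profit, the university may reproduce part or all of the thesis as appropriate for academic activities. (A classified thesis is subject to this authorization once it is declassified.)

Signature of thesis author:    Date: 2010.01.07
Supervisor's signature:    Date: 2010.01.07

Chapter 1 Introduction

1-1 Research Background and Significance

1-1-1 Source of the Topic

With ever more students and a heavy marking workload, it is very difficult for teachers to grade every paper thoroughly and carefully. Paperless examinations have long been in use, typically built from single-choice, multiple-choice, and fill-in-the-blank questions. In computer-based examinations, automatic marking of single-choice, multiple-choice, and fill-in-the-blank questions is already mature and is used in large examination systems, for example well-known programmer examinations abroad and the National Computer Rank Examination in China. This form of examination tests indirectly how firmly students have grasped the material, but it has limitations (it does not probe deeply into students' command of the knowledge), which is why the introduction of subjective tests has its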