Research Paper
Chinese Journal of Library and Information Science (CJLIS) Vol. 5 No. 4, 2012, pp. 77-92
National Science Library, Chinese Academy of Sciences
http://www.chinalibraries.net

Received Nov. 27, 2012; Revised Dec. 3, 2012; Accepted Dec. 22, 2012
Translated with permission from New Technology of Library and Information Service (in Chinese), 2012, 28: 36

A method for improving the accuracy of automatic indexing of Chinese-English mixed documents*

Yan ZHAO1,2

... String matching; Accuracy of automatic indexing; Cybernetics; Dedicated hepatitis B virus (HBV) database

With the development of the Internet and communications technologies, people speaking different languages communicate with one another more frequently, and the number of multilingual documents is increasing rapidly. These documents cannot be used without being indexed first. However, the significant differences between languages in grammatical rules and word formation rules make it difficult to index multilingual documents accurately. Given the importance of indexing in document processing, we conduct a study on the automatic indexing of Chinese-English mixed documents, with the aim of finding a method for increasing the accuracy of multilingual document indexing.

1 A brief introduction of document indexing

Document indexing is the process of assigning index terms to a document according to defined rules and criteria. By describing the document with index terms, an originally isolated document is connected with existing conceptual systems. Indexing makes the document easier to search, store and use. Depending on the classification criterion, document indexing can be divided into assignment indexing and derivative indexing, or into manual indexing and automatic indexing, etc. The production of the keywords in context (KWIC) index by Luhn of IBM was a great step forward in the development of modern indexing methods characterized by computer-assisted automatic indexing [1].
Assignment indexing, in which index terms are taken from a controlled vocabulary, is also called controlled indexing [2]. Indexing is the first step in text processing, and its quality plays a decisive role in later steps such as knowledge classification, data mining and knowledge discovery. Because both the controlled vocabulary and the target documents are changeable and complex, it is hard to ensure the accuracy of an indexing method. This is especially true for multilingual document indexing (exemplified by Chinese-English mixed documents), owing to the significant differences between the languages in grammatical rules and word formation rules.

2 Factors affecting the accuracy of automatic indexing of Chinese-English mixed documents

How to ensure accuracy has long been one of the major problems in computer-assisted automatic assignment indexing. For years, scholars both in China and abroad have focused on this research area, but numerous research results ...

... in “rheumatoid factors (RF)”, “rheumatoid factors” was marked while “RF” was left behind. Similarly, for the term “anti-calmodulin antibodies (anti-CaM)”, “anti-calmodulin antibodies” was marked while “anti-CaM” was not.

4.2.3 Indexing efficiency

The whole indexing process took nearly 9 hours. The reasons for this long running time are analyzed as follows.

Firstly, there are over 35,000 entries, and more than 50% of them are long terms of more than 10 characters each. Secondly, there were no separate controlled vocabularies for Chinese documents and English documents, which caused a substantial amount of ineffective matching. Thirdly, for convenience in programming, the present indexing system adopted the brute-force (BF) algorithm, which is considered the simplest and least efficient string matching algorithm.
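To illustrate why brute-force matching is costly, the following is a minimal Python sketch of the general technique, not the indexing system's actual code. The function names `brute_force_find` and `index_document` are hypothetical and chosen for illustration only.

```python
def brute_force_find(text: str, pattern: str) -> list[int]:
    """Return every start offset at which pattern occurs in text.

    Brute-force (BF) matching: try each start position in turn and
    compare character by character until a mismatch or a full match.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    hits = []
    for start in range(n - m + 1):
        j = 0
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:  # every character of the pattern matched
            hits.append(start)
    return hits


def index_document(text: str, vocabulary: list[str]) -> dict[str, list[int]]:
    """Hypothetical helper: scan the text once per vocabulary term."""
    found = {}
    for term in vocabulary:
        offsets = brute_force_find(text, term)
        if offsets:
            found[term] = offsets
    return found
```

In the worst case each term costs time proportional to the text length times the term length, and the text is rescanned for every term. With tens of thousands of entries, many of them long, and no separation by language, this repeated scanning is consistent with the long running time reported above.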
4.3 Improvement of the indexing process

As mentioned in Section 4.2, the recall of the indexing system is 97.37%, which is acceptable because the controlled vocabulary is relatively complete. However, the precision of 88.54% is far from satisfactory, and mis-indexing remains the main reason. Based on the analysis in Section 3, we introduce cybernetics theory to improve indexing quality in the three phases. Moreover, Chinese and English texts need to be processed differently.

4.3.1 Feed-forward control

Before indexing, we pre-processed the controlled vocabulary and the target database.

(i) Pre-processing controlled vocabulary

We divided the original controlled vocabulary, which contains both Chinese and English content, into 3 parts: Chinese terms, English terms and mixed Chinese-English terms. The Chinese controlled vocabulary consists only of Chinese characters, and the English controlled vocabulary contains only English characters, Greek letters and Arabic numerals. All remaining entries were grouped under the Chinese-English mixed controlled vocabulary.
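The three-way split of the controlled vocabulary described above could be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `classify_term` and `split_vocabulary` are hypothetical, "Chinese characters" is taken to mean the CJK Unified Ideographs block, and English terms are allowed a few separators (space, hyphen, parentheses and similar) that the text does not list.

```python
import re

# Assumption: "Chinese characters" = CJK Unified Ideographs (U+4E00..U+9FFF).
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")

# Assumption: English terms may contain Latin letters, Arabic numerals,
# Greek letters (U+0370..U+03FF) and a few separators not named in the text.
ENGLISH_ONLY = re.compile(r"[A-Za-z0-9\u0370-\u03ff \-'(),./]+")


def classify_term(term: str) -> str:
    """Return 'chinese', 'english' or 'mixed' for one vocabulary entry."""
    if CHINESE_ONLY.fullmatch(term):
        return "chinese"
    if ENGLISH_ONLY.fullmatch(term):
        return "english"
    return "mixed"  # everything else goes to the mixed vocabulary


def split_vocabulary(terms: list[str]) -> dict[str, list[str]]:
    """Divide a controlled vocabulary into three sub-vocabularies."""
    parts = {"chinese": [], "english": [], "mixed": []}
    for raw in terms:
        term = raw.strip()
        if term:  # skip blank entries
            parts[classify_term(term)].append(term)
    return parts
```

For example, `split_vocabulary(["乙型肝炎", "HBV DNA", "HBV感染"])` places the first entry in the Chinese part, the second in the English part and the third in the mixed part.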