Introduction to GATE Features
Lu Tingming (鲁廷明)
June 9, 2009

Contents
  Overview
  Features
  Limitations of this study

Overview (1)
  GATE is a General Architecture for Text Engineering.
  It was developed by the Natural Language Processing Research Group within the Department of Computer Science at the University of Sheffield.

Overview (2)
  Language Resources (LRs) are data-only resources such as documents and corpora.
  Processing Resources (PRs) are resources whose character is principally programmatic or algorithmic, such as a tokeniser or a POS tagger.
  Applications model a control strategy for the execution of PRs. There are two main types of pipeline:
    Simple pipelines
    Corpus pipelines

Overview (4)
  [Diagram labels: document, annotationSet, annotationType, feature]

Features
  Tokeniser
    Splits text into tokens. Each Token annotation carries the following features:
      kind: Word, Number, Symbol, Punctuation, SpaceToken
      orth: upperInitial, allCaps, lowerCase, mixedCaps
      length
      string
  Sentence Splitter
    Splits text into sentences.

Features
  Gazetteer (dictionary lookup)
    lists.def contains:
      country.lst:location:country
    country.lst contains:
      China
      Chine
      Chypre
      Colombia
      Colombie

Features
  Part of Speech Tagger
    Assigns part-of-speech tags. Some tags are wrong, for example:
      I will study hard this year.
      JJ (adjective; should be RB, adverb)

Features
  Semantic Tagger
    This is the NE Transducer; it performs named entity recognition.
  Orthographic Coreference (Orthomatcher)
    The Orthomatcher module adds identity relations between named entities found by the semantic tagger, in order to perform coreference.
  Pronominal Coreference
    Links person names and pronouns, for example: John Smith, he, him, John, he.

Features
  Document Reset
    Removes all the annotation sets and their contents, apart from the one containing the document format analysis (Original Markups).

Features
  Verb Group Chunker
    The rules cover finite (is investigating), non-finite (to investigate), participles (investigated), and special verb constructs (is going to investigate).
  Noun Phrase Chunker
    Marks noun phrases in text.

Features
  OntoText Gazetteer
    Produces results similar to the ANNIE Gazetteer, but uses a different algorithm.
  Flexible Gazetteer
    The Flexible Gazetteer provides users with the flexibility to choose their own customized input and an external Gazetteer.
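The gazetteer files shown on the Gazetteer slide (a lists.def entry of the form file:majorType:minorType pointing to a plain list file) can be illustrated with a minimal Python sketch. This is not GATE's implementation: the function names parse_lists_def, build_gazetteer and lookup are hypothetical, the list files are supplied as in-memory strings rather than read from disk, and matching is a simple whole-word, case-sensitive search. Only the file contents are taken from the slides.

```python
import re


def parse_lists_def(text):
    """Parse lines of the form 'file.lst:majorType:minorType' (types optional)."""
    index = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split(":")
        major = parts[1] if len(parts) > 1 else ""
        minor = parts[2] if len(parts) > 2 else ""
        index[parts[0]] = (major, minor)
    return index


def build_gazetteer(index, list_files):
    """Map every list entry to the features declared for its list file."""
    entries = {}
    for filename, (major, minor) in index.items():
        for raw in list_files.get(filename, "").splitlines():
            entry = raw.strip()
            if entry:
                entries[entry] = {"majorType": major, "minorType": minor}
    return entries


def lookup(text, entries):
    """Return (start, end, features) for each whole-word match, in text order."""
    found = []
    for entry, features in entries.items():
        pattern = r"\b" + re.escape(entry) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.end(), features))
    return sorted(found, key=lambda item: (item[0], item[1]))


# File contents as shown on the Gazetteer slide (held in memory for this sketch).
LISTS_DEF = "country.lst:location:country\n"
LIST_FILES = {"country.lst": "China\nChine\nChypre\nColombia\nColombie\n"}

GAZETTEER = build_gazetteer(parse_lists_def(LISTS_DEF), LIST_FILES)

if __name__ == "__main__":
    for start, end, features in lookup("Colombia and China signed a deal.", GAZETTEER):
        print(start, end, features)
```

Each match is reported with character offsets and the major and minor type from lists.def, which mirrors the idea of a lookup annotation carrying features; a real gazetteer would typically use a more efficient matching structure than one regular expression per entry.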
  Gazetteer List Collector
    Inserts entities of the specified annotation types into the corresponding lists of the specified gazetteer and generates a statistics file (one record per entity, in the form entity name$count).

Features
  Tree Tagger
    The TreeTagger is a language-independent part-of-speech tagger that annotates text with part-of-speech and lemma information.
    It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart.
    The TreeTagger has been successfully used to tag German, English, French, Italian, Dutch, Spanish, Bulgarian, Russian, Greek, Portuguese, Chinese and Old French texts, and it is adaptable to other languages if a lexicon and a manually tagged training corpus are available.
    Analysing an English file from the command line succeeded:
      cd treetaggerbin
      tag-english.bat news1.txt
    However, we were unable to integrate it into GATE.

Features
  Stemmer
    Each Token is annotated with a new feature "stem", with the stem for that word as its value.
  GATE Morphological Analyzer
    Considering one token and its part-of-speech tag at a time, it identifies the token's lemma and an affix. These values are then added as features on the Token annotation.
  MiniPar Parser
    It takes one sentence as input and determines the dependency relationships between the words of the sentence.

Features
  RASP Parser
    RASP (Robust Accurate Statistical Parsing) is a robust parsing system for English. It comprises the following four PRs:
      RASP2 Tokenizer
      RASP2 POS Tagger
      RASP2 Morphological Analyser
      RASP2 Parser: creates multiple dependency annotations to represent a parse of each sentence.
    RASP is only supported on Linux operating systems.
  SUPPLE Parser
    SUPPLE is a bottom-up parser that constructs syntax trees and logical forms for English sentences. It requires a Prolog interpreter.
  Stanford Parser

Features
  Montreal Transducer
    Many of the key features introduced in the Montreal Transducer (MT) have now been ported in some form into the standard JAPE transducer.
    The standard JAPE transducer is likely to be more stable, and bugs will be fixed more rapidly than with the MT.
    It is similar to the standard JAPE transducer and was not investigated in this study.

Features
  Chinese Plugin
    The Chinese plugin contains a simple application for Chinese NE recognition (chinese.gapp).

Features
  Chemistry Tagger
    This GATE module is designed to tag a number of chemistry items in running text. Currently the tagger tags compound formulas (e.g. SO2, H2O, H2SO4), ions (e.g. Fe3+, Cl-), and element names and symbols (e.g. Sodium and Na). Limited support for compound names is also provided (e.g. sulphur dioxide), but only when they are followed by a compound formula (in parentheses or commas).

Features
  Flexible Exporter
    Lets you select several annotation types from one annotation set, write the document with those annotations to a file, and rename the annotation types in the output file.
  Annotation Set Transfer
    Transfers (or copies) part of the annotations in one annotation set to another annotation set, so that the resulting set can serve as input to other PRs for further processing.
    For example, we might wish to perform named entity recognition on the body of an HTML text, but not on the headers. After tokenising and performing gazetteer lookup on the whole text, we would use the Annotation Set Transfer to transfer those annotations (created by the tokeniser and gazetteer) into a new annotation set, and then run the remaining NE resources, such as the semantic tagger and coreference modules, on them.

Features
  Information Retrieval in GATE
    The current implementation
