资源预览内容
第1页 / 共24页
第2页 / 共24页
第3页 / 共24页
第4页 / 共24页
第5页 / 共24页
第6页 / 共24页
第7页 / 共24页
第8页 / 共24页
第9页 / 共24页
第10页 / 共24页
亲,该文档总共24页,到这儿已超出免费预览范围,如果喜欢就下载吧!
资源描述
Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words,Dmitry Davidov, Ari Rappoport The Hebrew University ACL 2006,Introduction,Discovering word categories, sets of words sharing a significant aspect of their meaning context feature vectors pattern-based discovery Manually prepared pattern set (ex. x and y) requiring POS tagging or partial or full parsing,Pattern Candidates,high frequency word (HFW) A word appearing more than TH times per million words Ex. and, or, from, to content word (CW) word appearing less than TC times per a million words,Pattern Candidates,meta-patterns obey the following constraints at most 4 words exactly two content words no two consecutive CWs Example CHC, CHCH, CHHC, and HCHC from x to y (HCHC), x and y (CHC), x and a y (CHHC),Symmetric Patterns,In order to find a usable subset from pattern candidates, we focus on the symmetric patterns Example x belongs to y (asymmetric relationships) X and y (asymmetric relationships),Symmetric Patterns,We use single pattern graph G(P) to identifying symmetric patterns there is a directed arc A(x, y) from node x to node y iff the words x and y both appear in an instance of the pattern P as its two CWs x precedes y in P SymG(P), the symmetric subgraph of G(P), containing only the bidirectional arcs and nodes of G(P),Symmetric Patterns,We compute three measures on G(P) M1 counts the proportion of words that can appear in both slots of the pattern M2, M3 measures count the proportion of the number of symmetric nodes and edges in G(P),Symmetric Patterns,We removed patterns that appear in the corpus less than TP times per million words We remove patterns that are not in the top ZT in any of the three lists We remove patterns that are in the bottom ZB in at least one of the lists,Discovery of Categories,words that are highly interconnected are good candidates to form a category word relationship graph G merging all of the single-pattern graphs into a single unified graph clique是一個圖中兩兩相鄰的一個點集,或是一個完全子圖,The Clique-Set Method,Strong n-cliques subgraphs containing n nodes that are all bidirectionally interconnected A clique Q defines a category that contains the nodes in Q plus all of the nodes that are (1) at least unidirectionally connected to all nodes in Q (2) bidirectionally connected to at least one node in Q,The Clique-Set Method,we use 2-cliques For each 2-cliques, a category is generated that contains the nodes of all triangles that contain 2-cliques and at least one additional bidirectional arc. For example book and newspaper, newspaper and book, book and note, note and book and note and newspaper,The Clique-Set Method,a pair of nodes connected by a symmetric arc can appear in more than a single category define exactly three strongly connected triangles, ABC,ABD,ACE arc (A,B) yields a category A,B,C,D arc (A,C) yields a category A,C,B,E,Category Merging,In order to cover as many words as possible, we use the smallest clique, a single symmetric arc. This creates redundant categories We use two simple merge heuristics If two categories are identical we treat them as one Given two categories Q,R, we merge them iff theres more than a 50% overlap between them,Corpus Windowing,In order to increase category quality and remove categories that are too context-specific Instead of running the algorithm of this section on the whole corpus, we divide the corpus into windows of equal size we select only those categories that appear in at least two of the windows,Evaluation,three corpora, two for English and one for Russian BNC, containing about 100M words Dmoz, 68GB containing about 8.2G words from 50M web pages Russian corpus was assembled from many web sites and carefully filtered for duplicates, to yield 33GB and 4G words,Evaluation,Thresholds thresholds TH, TC, TP ,ZT ,ZB ZB, were determined by memory size 100, 50, 20, 100, 100 window size 5% of the different words in the corpus assigned into categories,V=number of different words in the corpus. W=number of words belonging to at least one categories. C= is the number of categories AS= is the average category size,Human Judgment Evaluation,Part I given 40 triplets of words and were asked to rank them using 1-5 (1 is best) 40 triplets were obtained as follows 20 of our categories were selected at random from the non-overlapping categories 10 triplets were selected from the categories produced by k-means (Pantel and Lin, 2002) 10 triplets were generated by random selection of content words,Human Judgment Evaluation,Part II, subjects were given the full categories of the triplets that were graded as 1 or 2 in Part I They were asked to grade the categories from 1 (worst) to 10 (best) according to how much the full category had met the expectations they had when seeing only the triplet,Human Judgment Evaluation,shared meaning average percentage of triplets that were given scores of 1 or 2,WordNet-Based Evaluation,Took 10 WN subsets referred to a
收藏 下载该资源
网站客服QQ:2055934822
金锄头文库版权所有
经营许可证:蜀ICP备13022795号 | 川公网安备 51140202000112号