
A database approach to content-based XML retrieval

Djoerd Hiemstra
University of Twente, Centre for Telematics and Information Technology
P.O. Box 217, 7500 AE Enschede, The Netherlands
d.hiemstra@utwente.nl

Abstract

This paper describes a first prototype system for content-based retrieval from XML data. The system's design supports both XPath queries and complex information retrieval queries based on a language modelling approach to information retrieval. Evaluation using the INEX benchmark shows that it is beneficial to bias the system towards retrieving large XML fragments over small fragments.

1 Introduction

This paper describes a number of fundamental ideas and starting points for building a system that seamlessly integrates data retrieval and information retrieval (IR) functionality into a database system. We describe a first prototype system that was developed according to these ideas and starting points, and report on experimental results of the system on the INEX collection. The current prototype system supports only a small part of the functionality that we envision for future systems. In the upcoming years we will build a number of such prototype systems in the CIRQUID (Complex Information Retrieval Queries in a Database) project, which is funded by the Netherlands Organisation for Scientific Research (NWO).

The CIRQUID project bridges the gap between the structured query capabilities of XML query languages and relevance-oriented querying. Current techniques for XML querying, originating from the database field, do not support relevance-oriented querying. On the other hand, techniques for ranking documents, originating from the information retrieval field, typically do not take document structure into account. Ranking is of the utmost importance when large collections are queried, to assist the user in finding the most relevant documents in a retrieved set.
The paper is organised as follows: Section 2 describes our database approach to relevance-oriented querying of XML documents. Section 3 reports the experimental results of our first prototype system. Finally, Section 4 concludes this paper.

2 A multi-model approach

A three-level design of DBMSs, distinguishing a conceptual, a logical, and a physical level, provides the best opportunity for balancing flexibility and efficiency. In our approach, we take the three-level architecture to its extreme. Not only do we guarantee logical and physical data independence between the three levels, we also map the conceptual data model used by the end users to a physical implementation using different data models at different levels of the database architecture: the so-called "multi-model" database approach [26].

[Figure; recoverable labels: rewrite rules, Extension, Logical Layer (Moa), Relational storage of XML, Optimisation, XPath]

so "title:INEX" means that the title of the document should contain the word INEX. The last query also shows additional term weighting, stating that the user finds XML much more important than IR.

1 Note that most retrieval systems do not distinguish upper case from lower case, and confuse the acronym "IT" with the very common word "it".

These examples suggest that at the logical level, our system should support algebraic constructs for proximity of terms, mandatory terms, a logical OR, term weighting, etc. To support proximity operators, the system should at least store term position information at the physical level.

2.2 Moa and Language Models

Parts of a prototype multi-model database system have already been developed with the extensible object algebra Moa [14] as the logical layer. An open question in this set-up is how Moa, which provides a highly structured nested object model with sets and tuples, can be adapted to managing semi-structured data.
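As an aside on the earlier requirement that term positions be stored to support proximity operators, the following is a minimal sketch of one way such information could be kept and queried. It is not the prototype's storage scheme: the function names `build_positional_index` and `within_distance`, the element identifiers, and the sample texts are all hypothetical, and whitespace tokenisation is assumed for simplicity. Terms are kept case-sensitive, so that "IT" and "it" remain distinct.

```python
from collections import defaultdict


def build_positional_index(elements):
    """Map each term to {element_id: [positions]}.

    `elements` is a hypothetical dict of element id -> text content.
    Tokenisation is plain whitespace splitting and is case-sensitive.
    """
    index = defaultdict(dict)
    for element_id, text in elements.items():
        for position, term in enumerate(text.split()):
            index[term].setdefault(element_id, []).append(position)
    return index


def within_distance(index, term_a, term_b, max_distance):
    """Return the sorted ids of elements in which term_a and term_b
    occur at most max_distance token positions apart."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    matches = []
    # Only elements containing both terms can satisfy the constraint.
    for element_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[element_id]
               for pb in postings_b[element_id]):
            matches.append(element_id)
    return sorted(matches)


# Hypothetical sample data, for illustration only.
elements = {
    "article[1]/title": "XML retrieval in INEX",
    "article[1]/sec[1]": "IT systems store XML and it helps retrieval",
    "article[2]/title": "database systems",
}
index = build_positional_index(elements)
adjacent = within_distance(index, "XML", "retrieval", 1)
nearby = within_distance(index, "XML", "retrieval", 4)
```

Here `adjacent` contains only the element in which the two terms are neighbours, while `nearby` also admits the element in which they are four positions apart. A relational realisation could store the same (term, element, position) triples as rows of a table.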
In this paper we will not get into Moa, but direct our attention to the language modelling approach to information retrieval as proposed in [9, 18] to guide the definition of the logical layer of our system.

The basic idea behind the language modelling approach to information retrieval is that we assign to each XML element $X$ the probability that the element is relevant, given the query $Q = q_1, \cdots, q_n$. Using Bayes' rule we can rewrite that as follows.

$$P(X \mid q_1, q_2, \cdots, q_n) = \frac{P(q_1, q_2, \cdots, q_n \mid X)\, P(X)}{P(q_1, q_2, \cdots, q_n)} \qquad (1)$$

Note that the denominator on the right-hand side does not depend on the XML element $X$. It might therefore be ignored when a ranking is needed. The prior $P(X)$, however, should only be ignored if we assume a uniform prior, that is, if we assume that all elements are equally likely to be relevant in the absence of a query. Some non-content information, e.g. the number of accesses by other users to an XML element, or the length of an XML element, might be used to determine $P(X)$.

Let's turn our attention to $P(q_1, q_2, \cdots, q_n \mid X)$. The use of probability theory might here be justified by modelling the process of generating a query $Q$ given an XML element as a random process. If we assume that this page in the INEX proceedings is an XML el