DocDB component

The amount of information and data stored on the web and in personal repositories is growing exponentially, driven by rapidly increasing storage capacity and network bandwidth and by the emergence of new languages and standards for describing and saving data. As a consequence, search and navigation in the information space are becoming increasingly challenging, both for users and for automated computer agents. So far there have been three major ways to provide access to large collections of data items:

Fixed classification scheme: data items are classified into tree-like topic hierarchies and accessed by browsing the hierarchy from the root topic down to a specific topic of interest. Typical examples of this approach are the DMOZ web directory, the Dewey Decimal Classification system used in libraries, and personal classifications of favorite web pages and emails. Such classification hierarchies have long been used by humans as an effective and intuitive way to organize their knowledge according to their (subjective) view of a domain of interest. The main advantage of this approach is that the user is in full control of the hierarchy structure and of content placement. The major disadvantage is that, as the classification structure grows larger, the number of possible places in which to classify or look for a data item increases substantially, which makes both tasks very time-consuming and error-prone;

Faceted classification: data items are assigned a set of attributes (or facets) defined on mutually exclusive domains, such that the values of these attributes can be represented as a range (e.g., real numbers from 1.0 to 10.0), as a list (e.g., color names), or as a hierarchy which encodes taxonomic relations among them (e.g., a taxonomy of geographic names, which partitions the world into continents, continents into countries, countries into provinces, and so on, based on the “located-in” relation). In order to locate a desired data item, the user starts by assigning a value to any one of the facets, then assigns a value to another, and so on until the data item is found. Each time a value is assigned to (or further specified for) a facet, the set of displayed objects is reduced to those which possess the corresponding values for the selected attributes. This approach has been successfully applied in e-commerce applications; two typical examples are the eBay Internet auction site and a faceted classification of Boston restaurants. The major advantage of this approach is that the user can easily locate an item from different perspectives by selecting facets in different orders. The main disadvantage is that facets are usually suited only to a rather homogeneous set of objects, all of which can be described using the same static set of predefined attributes;

Direct search: the user searches for data items by providing a set of keywords which describe (the contents of) the data items. Typical examples from this category are the Google and Amazon search engines. The advantage of this approach is that the user can quickly access data items without having to browse topic hierarchies or specify facet values; it has proved especially effective when the user provides a discriminative set of keywords. The disadvantage is that the search becomes very ineffective if the keywords provided do not discriminatively identify the object, or if the user simply does not know how to describe the object. Apart from this, direct search approaches usually make no distinction between an object's data and its metadata.
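The step-by-step narrowing used in faceted classification can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Item` class, the `narrow` function, the `LOCATED_IN` table, and the sample restaurants are all hypothetical and are not taken from DocDB or from any of the systems named above. The sketch shows a range facet (rating), a hierarchical facet (location, based on the “located-in” relation), and the fact that the order in which facets are selected does not change the final result.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A data item described by facet values (hypothetical model)."""
    name: str
    facets: dict = field(default_factory=dict)


# Hypothetical hierarchical facet: child -> parent under "located-in".
LOCATED_IN = {
    "Boston": "Massachusetts",
    "Massachusetts": "USA",
    "USA": "North America",
    "Trento": "Italy",
    "Italy": "Europe",
}


def is_within(place, region):
    """Return True if `place` equals `region` or lies below it in the hierarchy."""
    while place is not None:
        if place == region:
            return True
        place = LOCATED_IN.get(place)
    return False


def narrow(items, facet, matches):
    """Keep only the items whose value for `facet` satisfies `matches`."""
    return [it for it in items if facet in it.facets and matches(it.facets[facet])]


# Invented sample data, for illustration only.
ITEMS = [
    Item("Trattoria A", {"location": "Boston", "rating": 8.5}),
    Item("Diner B", {"location": "Boston", "rating": 6.0}),
    Item("Osteria C", {"location": "Trento", "rating": 9.0}),
]


def in_usa(value):
    return is_within(value, "USA")


def well_rated(value):
    return 7.0 <= value <= 10.0


# Selecting the location facet first, then the rating facet ...
by_location_first = narrow(narrow(ITEMS, "location", in_usa), "rating", well_rated)
# ... or the rating facet first, then the location facet.
by_rating_first = narrow(narrow(ITEMS, "rating", well_rated), "location", in_usa)

print([it.name for it in by_location_first])
print([it.name for it in by_rating_first])
```

Each call to `narrow` corresponds to one facet selection by the user. An item that lacks the selected facet is dropped, which reflects the limitation noted above: facets work best when all items share the same set of attributes.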

The DocDB system is a hybrid document management system which builds on the three approaches presented above. It combines the power of classification schemes to describe the domain of interest, the expressiveness of facets to describe the multidimensional properties of data objects, and the effectiveness of direct search to complement the user’s information search and browsing experience. By combining the three approaches, DocDB seeks to overcome the disadvantages of each individual approach. DocDB supports various kinds of data objects (e.g., HTML pages, email messages, multimedia files) with diverse sets of attributes, including those defined by the user. In direct search, DocDB explicitly discriminates between data and metadata and allows the user to specify range, exact match, and pattern queries on metadata. Apart from this, DocDB provides a semantic layer, which allows for new and more meaningful operations on data items in all three components of the system, i.e., classification schemes, facets, and direct search.
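To make the distinction between data and metadata in direct search more concrete, the following Python sketch combines a keyword query on content with range, exact match, and pattern queries on metadata. It is a rough illustration only and does not reproduce DocDB's actual interface: the `Document` class, the predicate helpers, the use of glob-style patterns, and the sample documents are all assumptions made for this example.

```python
import fnmatch
from dataclasses import dataclass, field


@dataclass
class Document:
    """A data object with content (data) kept apart from its metadata."""
    content: str
    metadata: dict = field(default_factory=dict)


def keywords(*words):
    """Keyword query: looks at the content only, never at the metadata."""
    return lambda d: all(w.lower() in d.content.lower().split() for w in words)


def exact(key, value):
    """Exact match query on one metadata attribute."""
    return lambda d: d.metadata.get(key) == value


def in_range(key, low, high):
    """Range query (inclusive) on one metadata attribute."""
    def pred(d):
        v = d.metadata.get(key)
        return v is not None and low <= v <= high
    return pred


def pattern(key, glob):
    """Pattern query on one metadata attribute (glob syntax is an assumption)."""
    return lambda d: key in d.metadata and fnmatch.fnmatchcase(str(d.metadata[key]), glob)


def search(docs, *predicates):
    """Return the documents that satisfy every predicate."""
    return [d for d in docs if all(p(d) for p in predicates)]


# Invented sample data, for illustration only.
DOCS = [
    Document("Meeting notes about the faceted search prototype",
             {"type": "email", "year": 2008, "sender": "alice@example.org"}),
    Document("Faceted classification overview page",
             {"type": "html", "year": 2006, "title": "overview.html"}),
    Document("Holiday photo",
             {"type": "image", "year": 2008}),
]

hits = search(
    DOCS,
    keywords("faceted"),                # data: content keywords
    exact("type", "email"),             # metadata: exact match
    in_range("year", 2007, 2009),       # metadata: range
    pattern("sender", "*@example.org"), # metadata: pattern
)
print([d.metadata["sender"] for d in hits])
```

Because `keywords` inspects only the content, a search for the word "email" returns nothing here even though one document has `type` set to `email`; that attribute is reachable only through the metadata predicates.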

Contact: Uladzimir Kharkevich and Vincenzo Maltese (component managers)