JeromeDL/TechPaper


JeromeDL - Bringing Digital Libraries to the Semantic Web Era

Paweł Bugalski, Mariusz Cygan, Krystian Samp, Adam Westerski


Abstract

In this white paper we describe JeromeDL 2.0, a full-fledged open source digital library system, which exemplifies how digital libraries benefit from the Semantic Web. JeromeDL 2.0 is an important step forward from JeromeDL 1.0, building on the feedback gathered over the past years. Many improvements have been made and brand new functionality has been added.

JeromeDL 2.0 is not backwards compatible with earlier releases. Nevertheless, JeromeDL 2.0 offers converters that allow upgrading from previous versions without loss of data.


Introduction

The Internet is a huge source of information. Unfortunately, the desired data is not always easy to find. Sometimes you have to spend hours searching. What is more, the quality of this information is very often low.

Dedicated digital libraries can address these problems. Using technologies such as XML and Web Services, digital libraries make searching faster and more accurate. They also hold information that is well organized and certified.

Motivations

  • Support for different kinds of bibliographic metadata, such as DublinCore, BibTeX, and MARC21, at the same time.
  • Currently created semantic descriptions are based on NLP and statistical analysis of resources, neglecting information-rich sources (like MARC21) created by humans.
  • User-oriented behavior - usually the user has no control over his profile information (including statistical usage data) and the profile is not integrated in any way with other datasets.
  • Interlinking heterogeneous digital library networks - so far libraries tend to be single islands or groups of islands connected with a specific protocol.

Outline of the Paper

  • The Anatomy of Digital Library System
  • Semantic Sources of Information in Digital Libraries
  • Distributed Digital Library Systems
  • Semantically Enhanced Information Retrieval
  • Usage of MarcOnt ontology within the library


The Anatomy of Digital Library System

A digital library system contains a user interface and middleware, as in a classic three-tier architecture. The data engine handles catalog information along with the resources. Additionally, there is a special interface for librarians to manage the content and catalog descriptions. To enable interoperability between library systems, a communication interface is defined as well.

Architecture

Image:JeromeDL2TechPaper2.jpg


The JeromeDL architecture (see Figure 2) is an instantiation of the presented architecture with special focus on the exploitation of Semantic Web based metadata (RDF, FOAF, and ontologies). The JeromeDL architecture consists of:

  • resource management, where a resource can be described by a semantic description according to the Jerome ontology, in addition to a fulltext index of the resource’s content and MARC21 and BibTeX bibliographic descriptions. While every user is able to add resources via a web interface, an administrative interface for librarians (JeromeAdmin) allows them to manage and add metadata (MARC21, BibTeX bibliographic, and ontological annotations) to the knowledge base and to check and approve user submissions.
  • searching and browsing (see section 2.3) based on Semantic Web data.
  • users’ profile management (based on FOAF) (see section 2.2)
  • a communication link to the outside world which enables searching in a network of digital libraries.


The content of the JeromeDL database can be searched in real time not only through the web pages of the digital library but also from other digital libraries and other web applications, through a special web services interface based on the Extensible Library Protocol (ELP)[8] (see 2.4).


Semantic Sources of Information in Digital Libraries

FOAFRealm

MarcOnt

Distributed Digital Library Systems

Digital libraries possess all the merits of normal libraries. Information is usually well organized and reliable. Furthermore, they make the searching process faster and more accurate. Connecting digital libraries together would let a user obtain results from many sources with one query.


Standard Communication Protocols Overview

As more and more digital libraries are created, the problem of connecting them arises. Dispersion, different ontologies and different metadata formats all make communication harder. To solve these issues, special communication protocols were developed and introduced.


4.1.1. Z39.50

ANSI/NISO Z39.50 [1] is the American National Standard Information Retrieval Application Service Definition and Protocol Specification for Open Systems Interconnection. ANSI/NISO Z39.50 defines a standard way for two computers to communicate for the purpose of information retrieval. Z39.50 makes it easier to use large information databases by standardizing the procedures and features for searching and retrieving information. Specifically, Z39.50 supports information retrieval in a distributed, client and server environment where a computer operating as a client submits a search request (i.e., a query) to another computer acting as an information server. Software on the server performs a search on one or more databases and creates a result set of records that meet the criteria of the search request. The server returns records from the result set to the client for processing. The power of Z39.50 is that it separates the user interface on the client side from the information servers, search engines, and databases. Z39.50 provides a consistent view of information from a wide variety of sources, and it offers client implementers the capability to integrate information from a range of databases and servers.


4.1.2. DIENST

DIENST is a protocol for communication with distributed digital library servers. Its architecture is built on individually defined services that, when combined, create a distributed digital library. The functionality of DIENST includes storage of and access to resources (digital objects), deposit of new resources, discovery and browsing of those resources, and user registration. Communication with and among individual DIENST services takes place via an open protocol. The basic services defined in the protocol are:


  • Repository Service - stores digital documents, each with a unique name; supports multiple versions and different components
  • Index Service - serves queries
  • Query Mediator Service - dispatches queries to appropriate index servers
  • Info Service - returns information about the state of a server
  • Collection Service - provides information about how the services interact
  • Registry Service - stores information about users

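The division of labour between the Query Mediator and Index services can be illustrated with a minimal Python sketch. It is not DIENST code and does not follow the DIENST wire protocol; the class names, the in-memory title index and the keyword matching are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IndexService:
    """Hypothetical stand-in for an index service: answers keyword queries."""
    name: str
    titles: dict[str, str] = field(default_factory=dict)  # document name -> title

    def search(self, keyword: str) -> list[str]:
        # Return the names of documents whose title contains the keyword.
        kw = keyword.lower()
        return sorted(doc for doc, title in self.titles.items() if kw in title.lower())


@dataclass
class QueryMediator:
    """Hypothetical mediator: dispatches one query to every known index service."""
    indexes: list[IndexService]

    def search(self, keyword: str) -> dict[str, list[str]]:
        # Keep hits grouped by index so the caller can see where each one came from.
        results = {}
        for index in self.indexes:
            hits = index.search(keyword)
            if hits:
                results[index.name] = hits
        return results
```

A caller would talk only to the mediator, for example `QueryMediator([cs_index, library_index]).search("semantic")`, and never to the individual index services.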

4.1.3. OAI-PMH

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides an application-independent interoperability framework based on metadata harvesting. OAI-PMH defines two classes of participants:

  • Data Providers - provide free access to metadata and may provide free access to full texts or other resources
  • Service Providers - harvest and store metadata
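A service provider harvests by sending HTTP requests that carry one of the six OAI-PMH 2.0 verbs plus verb-specific arguments. The Python sketch below only builds such a request URL; the helper name `build_oai_request` and the base URL in the usage note are illustrative assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build the GET URL a service provider would send to a data provider."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb always comes first, followed by the verb-specific arguments.
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"
```

For example, `build_oai_request("http://example.org/oai", "ListRecords", metadataPrefix="oai_dc")` asks a (hypothetical) data provider for all records in the Dublin Core format.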


Image:JeromeDL2TechPaper_003.jpg

Distributed Communication in JeromeDL

Extensible Library Protocol

The Extensible Library Protocol (ELP) serves as a means of communication between libraries. It can be used to distribute queries, gather results and exchange other information. These three simple tasks make it possible to build a network of cooperating libraries that acts as a single instance. Resources (books, articles etc.) can be spread around the world across several libraries, each representing different interests, and one simple query issued in one place can search all of the data.

The protocol is not designed only for JeromeDL. Among its benefits are its simplicity and the possibility of using it in other applications. As a result, ELP can be used to access a network of libraries from any program. It also makes it possible to build a network of heterogeneous libraries, in which ELP acts as a common, standardized layer. To fulfil these needs ELP uses XML, with well-defined elements, as its message format. A natural way of exchanging such messages is Web Service technology, which hides implementation details and defines how to access communication endpoints.

Instances that expose such endpoints have to be organized in some infrastructure. It must be chosen carefully because it influences scalability, speed and overall performance. Many existing solutions and considerations are used in various P2P programs. JeromeDL uses HyperCuP to create a network of libraries. HyperCuP constructs a topology that reflects a hypercube. This geometry is a compromise between the amount of traffic generated during communication, the distances between nodes, and scalability. It also has other advantages, such as decentralization and simplicity of use.

JeromeDL can create a network of instances using ELP, which provides new functionality such as distributed search and retrieval. Many front ends were created to make ELP available. There is a web interface where the user can request distributed processing, e.g. when searching for resources. In addition, many servlets are available, such as RdfQuery.
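The traffic and distance trade-off of a hypercube topology can be made concrete with a short Python sketch of a broadcast over a complete hypercube. It is a simplification for illustration, not HyperCuP or ELP code: it assumes exactly 2^d nodes labelled 0 to 2^d - 1 and ignores nodes joining or leaving; the function names are hypothetical.

```python
def neighbors(node: int, dimensions: int) -> list[int]:
    """Neighbors of a node in a complete hypercube: labels differing in one bit."""
    return [node ^ (1 << d) for d in range(dimensions)]


def broadcast(origin: int, dimensions: int) -> list[tuple[int, int]]:
    """Return the (sender, receiver) messages that reach every other node once.

    A node that received the query over dimension d forwards it only over
    higher dimensions, so no node receives the same query twice.
    """
    messages = []
    pending = [(origin, 0)]  # (node, first dimension it may forward on)
    while pending:
        node, first_dim = pending.pop()
        for d in range(first_dim, dimensions):
            target = node ^ (1 << d)  # neighbor across dimension d
            messages.append((node, target))
            pending.append((target, d + 1))
    return messages
```

With 8 libraries (three dimensions) a query issued at any node reaches the other 7 with exactly 7 messages, and no message travels more than 3 hops from the origin.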


WSMX - Why and Why not yet?

WSMX (Web Service Modeling eXecution environment) is the reference implementation of WSMO (Web Service Modeling Ontology). It is an execution environment for business application integration. It integrates enhanced web services for various business applications and provides the means to automatically discover, mediate and execute remote business procedures. It differs from other such attempts by trying to execute a goal in an environment of heterogeneous ontologies, using adapters and mediators, rather than creating a single multipurpose ontology.

Although WSMX is an extremely powerful infrastructure, in the world of digital libraries one particular use is especially interesting. The need to integrate many different digital libraries has been present for as long as digital libraries themselves have been part of the global network. There have been many successful attempts to provide applicable protocols and formats, to mention just a few: ELP, DIENST, OAI-PMH, BibTeX, Z39.50, and MARC21. They have many advantages and one major disadvantage: they are not interoperable, and as a result the digital libraries that use them form many heterogeneous library networks that cannot talk to each other. That situation is unwanted and difficult to resolve.

The proposed usage of WSMX as “common ground” on which those heterogeneous networks exchange knowledge is a clean and scalable solution. The straightforward solution to the problem described above would be to create two-way adapters between each pair of existing protocols. That task would not only be difficult, but the resulting bindings would also be hard to maintain. The WSMX-based solution avoids those problems. For every network based on a different protocol, a WSMX node should be developed. Such a node would be a mediation layer in which a protocol query is raised to a WSMO goal that describes the query in terms of an ontology describing the given protocol. Next, the execution of the goal would take place inside WSMX. Any incompatibilities between goals and web services that stem from the ontologies describing different protocols would be resolved by WSMX mediators and adapters. Then the receiving node would lower the executed operation to its underlying protocol. The results of such an execution would travel back the other way around.

The described use case shows that using WSMX to resolve communication issues between heterogeneous library networks is a promising direction for the future. However, the current state of WSMX development does not encourage its usage in a production environment. The currently released version of WSMX, version 0.2, is far behind the code in the repositories and does not offer useful functionality. Moreover, because the next WSMX release is being developed so dynamically, it is hard to follow the changes taking place in the project's repository. Additionally, many of the tools that are planned for release with WSMX are not ready yet. Coding parts of the described framework without the tools designed to make this process semi-automatic would be wasted effort. Considering that WSMX development will be funded for at least two more years and that the current speed of development is more than satisfactory, it can be presumed that these problems will soon be resolved. One of the coming releases of WSMX should then be a good basis for developing a scalable framework for integrating heterogeneous networks of digital libraries.
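The difference between pairwise adapters and a common mediation layer can be sketched in a few lines of Python. The sketch does not use WSMX or WSMO in any way: the `MediationHub` class, the dict used as the common query form, and the lift/lower callbacks are hypothetical stand-ins for raising a query to a goal and lowering it to the target protocol.

```python
from typing import Callable

# Hypothetical common representation of a query: field name -> value.
CommonQuery = dict[str, str]


class MediationHub:
    """Hypothetical hub: each protocol registers one lift and one lower function."""

    def __init__(self) -> None:
        self._lift: dict[str, Callable[[str], CommonQuery]] = {}
        self._lower: dict[str, Callable[[CommonQuery], str]] = {}

    def register(self, protocol: str,
                 lift: Callable[[str], CommonQuery],
                 lower: Callable[[CommonQuery], str]) -> None:
        self._lift[protocol] = lift
        self._lower[protocol] = lower

    def translate(self, query: str, source: str, target: str) -> str:
        # Raise the query to the common form, then lower it to the target protocol.
        return self._lower[target](self._lift[source](query))


def pairwise_adapters(n: int) -> int:
    """Number of two-way adapters needed to connect n protocols directly."""
    return n * (n - 1) // 2
```

For the six names listed above, direct integration would need `pairwise_adapters(6)`, that is 15, two-way adapters, whereas the hub approach needs one node per protocol, six in total, and adding a seventh protocol adds one node instead of six adapters.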


OAI-PMH Integration

As described above, OAI-PMH is a versatile protocol for metadata harvesting. JeromeDL, as a digital library, can be abstracted as a repository of metadata (and documents). For that reason JeromeDL exposes the OAI-PMH repository interface. Most of the features specified in the Harvesting Protocol (version 2.0) [2] document are implemented, except those that are not suitable for JeromeDL (e.g. JeromeDL does not keep track of deleted resources, so it does not provide that information to OAI harvesters). Also, the JeromeDL response is never broken into parts; it is always sent in a single HTTP response. Of the metadata formats declared by OAI-PMH, only Dublin Core (namespace prefix: oai_dc) is currently supported.

The internal architecture of JeromeDL is based on RDF storage, which contains MarcOnt bibliographic descriptions of resources. To translate those into Dublin Core metadata, the OAI-PMH implementation makes extensive use of MarcOnt Mediation Services. The result of that transformation, however, is dc+rdf (an RDF document that uses Dublin Core terms as predicates). To finally obtain oai_dc metadata, an additional custom XSLT template is applied.
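The last step of that pipeline, from Dublin Core statements to an oai_dc record, can be sketched as follows. The sketch uses Python's ElementTree in place of the XSLT template, which is not reproduced in this paper; the function name `to_oai_dc` and the input shape (pairs of predicate URI and literal) are assumptions. Only the two namespace URIs are taken from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use the conventional prefixes when serializing.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def to_oai_dc(statements: list[tuple[str, str]]) -> str:
    """Serialize (Dublin Core predicate URI, literal) pairs as an oai_dc record."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for predicate, literal in statements:
        if not predicate.startswith(DC_NS):
            continue  # skip anything that is not a Dublin Core element
        local_name = predicate[len(DC_NS):]
        child = ET.SubElement(root, f"{{{DC_NS}}}{local_name}")
        child.text = literal
    return ET.tostring(root, encoding="unicode")
```

Calling `to_oai_dc([(DC_NS + "title", "Ulysses")])` yields an `oai_dc:dc` element with a single `dc:title` child; statements with predicates outside the Dublin Core namespace are dropped.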


Bibster Integration

Semantically Enhanced Information Retrieval

Search Engine

Semantic Social Collaborative Filtering

Social Multifaceted Browsing

Usage of MarcOnt ontology within the library

Introduction

As stated in (3.2), the MarcOnt Ontology is designed for storing various kinds of bibliographic information. It brings together the most popular bibliographic formats, which makes it the most universal one. Because of this feature, and because an ontology is well suited to describing semantic information, the MarcOnt ontology has been fully integrated with the second release of the JeromeDL library. This chapter describes how the MarcOnt ontology was introduced into the Jerome digital library. It shows how MarcOnt has been connected to existing JeromeDL code and which new features had to be added to strengthen the library's semantic functionality.


Book description types

Book content information stored in JeromeDL can be divided into two types: structure information and bibliographic description. The bibliographic description, containing author data, title etc., is the part covered by the MarcOnt ontology. On the other hand, there is a large amount of data specific to JeromeDL, such as the location of the book cover or of other book-related files on the JeromeDL server. This information is called the structure description (figure 1).


Image:JeromeDL2TechPaper_004.jpg


Data storage mechanism

Storing bibliographic data in an ontology-friendly format such as RDF is an incentive to do the same with the structure information, so that storage is unified across the library (figure 2). At the time the second release of JeromeDL was created, the best way of storing that kind of information was offered by an RDF database engine called Sesame. It is a fast-growing project that offers a stable version of its software for storing, querying and inferencing over RDF.


Image:JeromeDL2TechPaper_002.jpg


Getting even closer to the Semantic Web - ways of migrating data into RDF format

Although creating applications that use RDF technology from the very beginning of their life cycle is fairly straightforward with known standards, introducing a new format into existing stable software can involve a serious amount of work and has to be thoroughly planned. IT is well known for the dynamics of its growth and change, therefore applications have to be planned in a way that enables modifications without affecting the end user. This is a very important feature when it comes to digital libraries. In the first stable version of JeromeDL all book data were stored in an XML database, according to XSD schemas specific to the needs of the semantic library. One of the goals of the second release was to migrate those data into an RDF database without changing any references that are visible to the user of the library. In large systems like JeromeDL that are already in use, such operations have to be done with extreme caution and step by step. The whole process can be divided into two phases. The first is creating an engine that is able to convert information kept in plain XML (or another format specific to the system) to RDF. The second is embedding the engine into the system so that it can convert existing data and, in the beginning, also be used with the old user interfaces to store new data (figure 3).

Creating the converter engine is purely technical work and depends very much on the base format and the ontologies that are used for the RDF solution. The important thing is to give it an interface that allows it to be used for the purposes discussed earlier. The part that needs special care is integration with the system. RDF solutions cannot be pushed into an application all at once; the process has to be gradual. The main reason is that a complete switch could be very risky for the stability of the system, and developing it could take too much time without any visible effects. This also matters for open source applications, which require constant changes to keep users as well as developers interested in them. Within JeromeDL the converter engine consisted of two parts: MarcOnt Mediation Services [...], used for converting bibliographic descriptions into the MarcOnt ontology, and a separate engine for converting structure information. Connecting them with JeromeDL involved, as stated before, creating a conversion tool for old data and also merging the engine with the existing user interface for storing and retrieving book data. Since creating a separate conversion tool is rather simple, most of the attention had to be directed toward changing the system interfaces to work with RDF.
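The first phase, a converter from plain XML to RDF, can be illustrated with a minimal Python sketch. It is not the JeromeDL converter or MarcOnt Mediation Services: the flat `<book>` record, the `http://example.org/book#` namespace and both function names are assumptions chosen for illustration, and every value is emitted as a plain literal.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace; the real MarcOnt and structure vocabularies differ.
EX = "http://example.org/book#"


def xml_to_triples(xml_text: str, subject: str) -> list[tuple[str, str, str]]:
    """Turn each child element of a flat XML record into one RDF statement."""
    record = ET.fromstring(xml_text)
    return [(subject, EX + child.tag, (child.text or "").strip()) for child in record]


def to_ntriples(triples: list[tuple[str, str, str]]) -> str:
    """Serialize triples with literal objects in N-Triples syntax."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes, quotes and newlines as N-Triples requires.
        escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)
```

Keeping the converter behind two small functions like these reflects the point made above: the same engine can be called by a one-off migration tool for old data and by the existing interfaces that store new data.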


Summary / Conclusions

The RDF format is becoming more and more popular, and JeromeDL is a good example of how to use it to gain specific advantages. Adopting new ways of storing information has brought the software more up to date with the standards used in the Semantic Web. This chapter outlined the main concepts of RDF usage in the library. It was also meant to draw attention to the fact that, as Semantic Web technologies evolve, storage formats and standards may change, so developers need to plan semantic applications in a way that makes those changes as quick and simple as possible.


References

[1] http://www.cni.org/pub/NISO/docs/Z39.50-brochure/50.brochure.toc.html

[2] http://www.openarchives.org/OAI/openarchivesprotocol.html


Corrib cluster project is supported by Enterprise Ireland under Grant No. ILP/05/203, Science Foundation Ireland under Grant No. SFI/02/CE1/I131.
Hosted at DERI, NUI Galway.