LIMES: Large Infrastructure in Mathematics - Enhanced Services
Access to Research Infrastructures RTD project HPRI-CT-1999-50002
Fifth Framework Programme of the European Community for Research and Technological Development


 

Description of the Zentralblatt MATH Web interface

software

Claude Goutorbe

November 19, 2001

Version 0.2

 
 
 
 
 
LIMES Project Deliverable
Project Name: Large Infrastructure in Mathematics - Enhanced Services
Project Acronym: LIMES
Project Number: HPRI-CT-1999-50002
Deliverable Title: Description of the Zentralblatt MATH Web interface software
Deliverable Number: part of D1.1, D1.2
Version Number: 0.2
Date: November 19, 2001
Principal Author(s): Claude Goutorbe 
Cellule Mathdoc UJF & CNRS 
e-mail: Claude.Goutorbe@ujf-grenoble.fr 
Deliverable Kind: Documentation
Deliverable Type: Restricted
Abstract: This chapter describes the software that is used to implement the web retrieval and display software of Zentralblatt-MATH

Table of contents

Introduction

This document describes the retrieval and display software that is used in the current world wide web interface to the ZMATH database.

The retrieval and display software can be seen as built of three layers.

The top layer, also called "the application" handles user input, generates a query that is passed to the bottom layer (the search engine) through an intermediate python extension module which is the glue between the application and the database search engine. This glue layer passes back the query results to the application level, who is responsible for presenting them to the user.

Emphasis is put on how the user visible part is implemented, in the section entitled 'The application layer'. We then describe how this user interface is coupled with low level software that actually implements searching ('The glue layer') , and the search engine itself ('The search engine layer'), which is known as 'edbm'.

This documentation is mainly intended as an help for maintaining and modifying the software, and as such is a little technical, but remains a high level description. Readers should refer to the source code for implementation details.

The software needs to implement a number of user visible features, such as:

It also needs to support some administrative features such as: access control, logging of user requests (for example for statistical purposes). One desirable feature is that the software be easily customized and extended to meet new or local requirements, to adapt to changes in the database format, or to adapt to other similar databases.

For this reason most of the input and output processing has been implemented in Python, an interpreted, rapid development language, that can be fully integrated with modules where speed is of concern: the search engine itself.
 
 

1. The Application (Python) layer

This is where request handling and results presentation are actually done, and can thus be considered as "the main program".
 

1.1 Application invocation


The Python level application is currently invoked through a C program that embeds the python interpreter, and not directly. This C program also implements access control. It is essentially a wrapper that is used to launch the real application. Specifically:

Full control is now given to the application by calling its "run" method.

Note that this way of invoking the application imposes a few constraints on its structure. Namely, it must be implemented as some class instance residing in some module, it must define access control files, and it must have a run method. Although these constraints are not very strong, they can be further relaxed by  writing an application  as a standard python program invoked directly through the python interpreter. This can be especially useful for implementing simple, 'quick and dirty'  features that might be needed at a development site, when the full processing machinery described in the next sections is not needed.
This requires an 'edbm enabled' python interpreter, in other words the glue module that implements the search engine interface must either be compiled into the interpreter or available as a dynamically loadable module (this module is not currently distributed).

1.2 Access control

From the point of view of the access control module, an access control file is a text file having the following structure: An IP number specification may be:

1.3 Application structure

As for essentially any web cgi database application, the duty of the application layer is: A lot of logic is common to such applications and is currently implemented in the module "edbmw3.py", which defines base classes and utility functions defining this common behaviour. Derived classes are then used in the actual application to handle specific needs.

The following discusses input processing, query building and output generation, describing what is currently done by the base classes and what should be done by the derived classes.

1.3.1 Input processing

The base application class (edbmw3.Application) implements the following behaviour:
The class constructor:

1.3.2 Query building

The base application class does nothing in this respect, although it probably could and maybe should.

The derived class first opens the database and delegates actual query building to another module, which is specific to the database we are querying (in this case, this is the ZMATH database, and the database specific module is "MATHDB.py").

This database specific module is aware of what fields and indexes exist, and how the data has been indexed. It uses this information to build a command language equivalent of the user's query: based on keys in the cgidata dictionary, it can identify search criteria and also has the duty of editing user input to take into account indexing rules or other special features such as automatic truncation of author names.
Most of the logic to do this is implemented in the edbmw3.Query and edbmw3.AdvancedQuery classes. Derived classes in the database specific modules only define input editing methods that are called transparently.

1.3.3 Running the query

The command language query just built is passed to the search engine via the Python extension module which returns a result set or an error. Based on values in the cgidata and config dictionaries, the application then chooses an output shell to instantiate and run.

1.3.4 Output generation

The base application class also defines a model for output handling. Since we handle many output formats, actual output generation is delegated to separate classes (so called "output shells") that should be defined by the derived application.
 

The base class for all output shells is "edbmw3.OutputShell" (except very simple output routines that are based on a fixed template).
 

The output document is defined by specifying a tree of names, which represent elements to be inserted into the output buffer, some of them being repeatable.
The default tree is defined as:

The only requirement on this tree specification is that the root must be named 'doc'.
 

The base output generation method (edbmw3.OutputShell.do_element) walks down this tree (depth-first, left-to-right), doing the following:
[the term "string" in the following is to be understood as: any object that can be represented as a string using Python's str function. ]

An output shell should initialize itself by defining the dictionaries that are used in the above  process. The entire application object is passed to the shell's constructor, meaning that the output shell has access to the entire application context (http request, configuration, environment,...).

After shell initialization, the application starts the main output loop by calling the shell's 'run' method which essentially amounts to a call of do_element('doc').

Other shell classes in edbmw3.py include HTMLShell, which adds features necessary for html output (for example it defines head_list['doc'] as 'Content-type:text/html\n\n<html>')

Problems with output

1.3.5 Sending the output to the user

Some output routines handle formatting completely by themselves and write their output directly to the process standard output (the user's browser).
In the case of graphical output, however (e.g. dvi output), these routine are only preparing input for an external formatter (e.g. TeX). It is the duty of the application to spawn this formatter once its input has been prepared. This is done by invoking shell scripts whose task is to set up the environment for the external formatter (usually TeX), to run it, and to send its output to the the user's browser.

1.3.6 Last minute tasks

The very last thing that is done is to write information to a log file.
The format of the log file is described in another chaper of this deliverable.

1.3.7 Utility classes and modules

A number of utilities that are generally useful for input as well as output processing are available. They include:

2. The glue layer

This is a standard Python extension module written in C, whose purpose is to wrap the search engine so that it can be accessed from Python code.  Conversely, it knows how to translate search results into Python objects that can be manipulated by the application layer.

This module defines two new Python object types: database and result set.

A database object is created ('opened') by the 'db'  function, which is passed a tuple of directory paths corresponding to the physical database 'slices'.

Database objects have three methods:

Database objects also behave partly as sequence objects: for example db[0] is the first item in the database; db[-1] is the last item in the database; len(db) returns the total number of items in the database.

Result set objects behave like sequence objects, hence indexing a result set yields the corresponding result set item: result[0] is the first item in the result set.

A database item is returned as a triple (fileid, itemid, data) where fileid is the number of the database slice that contains the item, itemid is an internal identifier (usually of no interest to the application), and data is itself a tuple where each element is the value of the corresponding item's field. Null values are returned as the Python object 'None'.

Since all these structures are not immediately apparent in the distributed source code, here is a short example:
 

[goutorbe@math-gobi goutorbe]$ python
Python 1.5.2 (#10, Feb  7 2001, 14:10:32)  [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import edbm
>>> db = edbm.db(('/home/goutorbe/edbmw3/data/3',))
>>> len(db)
463408
>>> db[0]
(0, 1, ('01633832', 'Kinsey, L.Christine; Moore, Teresa E.', None,
'Symmetry, shape, and space. An introduction to mathematics through geometry.',
'English', 'New York, NY: Key College Publishing/Springer. xviii,
 494 p. \\sterling 48.00; \\$ 54.95; DM 138.99; sFr. 119.94 (2002).
 [ISBN 1-930190-09-3/hbk]', '2002', 'B',
'00A06\00251-01', None, None, 'Review in preparation', None, None))
>>> (first, last, total, result) = db.find('au=kinsey, l*', 1, 5)
>>> total
1
>>> last
1
>>> result[0]
(0, 1, ('01633832', 'Kinsey, L.Christine; Moore, Teresa E.', None,
'Symmetry, shape, and space. An introduction to mathematics through geometry.',
'English', 'New York, NY: Key College Publishing/Springer. xviii,
 494 p. \\sterling 48.00; \\$ 54.95; DM 138.99; sFr. 119.94 (2002).
 [ISBN 1-930190-09-3/hbk]', '2002', 'B',
'00A06\00251-01', None, None, 'Review in preparation', None, None))
>>> result[1]
Traceback (innermost last):
  File "<stdin>", line 1, in ?
IndexError: index out of bounds
>>> result
<edbmresultset object at 80f17f0>
>>>

Coupling the application and the glue layer

It should be noted that almost no data model exists at this level or at the search engine level.:
a database item is simply a tuple of values. It is up to the application to know that certain fields are indeed lists, or references to foreign databases. It would certainly be useful to introduce a data model.
However, it is not clear if it should be implemented in this extension module, or purely as Python coded wrappers around the database and results set objects that are implemented by this module.

As a first step, a next version of the software will probably use such python-level wrappers, as a thin layer between the application and the glue layer that should allow easier and cleaner coding at the application level.

3. The search engine

This is the deepest layer of the software that actually executes a query against a database returning a set of item identifiers that match the query, if any.
 

3.1 edbm databases

We will  describe what is an edbm database, what is a result set and how a query is processed.

A database is a set of one or more 'slices' , which are physically stored in separate file system directories that contain data files and index files. The structure of data files is inherited from a previous retrieval system that is used for the cdrom version of the Zentralblatt-MATH database. It has the advantage of being very simple.

A database item is composed of a fixed number of fields (which may be empty) and is stored as a single string.  A data file can be considered as an ordered set of items (according to a database specific comparison function), and an item's index in this set is used as the item's identifier. Auxiliary files hold mapping information that allows the retrieval of a single item given its identifier.

Index files are implemented as B+-trees (using a third party library, berkeley db). For a given key, the corresponding leaf of the tree holds an ordered array of integers, which are the identifiers of the corresponding database entries.

To support expression searching (proximity), word position information is generated for certain indexes; in this case the leaf corresponding to a given key also includes for each item identifier a pointer into a separate file that holds this position information.

Hence, running a query is the process of generating an array of integers that are the identifiers of matching items (actually generating one array for each database slice). At the C level, a result set is a structure that holds these arrays.
 

3.2 the command language

The search engine receives a query as a command language string.   The query is first parsed into a syntax tree and then executed against each database slice, producing a result set.

Here is the formal syntax of the command language

   find_expression ::= field_expression
                   ::= find_expression bool_op field_expression

   bool_op ::= '&' | '|' | '^' 

   field_expression ::= name '=' '(' bool_expr ')'
                    ::= name rel_op string
                    ::= name '=' string ':' string
                    ::= '(' find_expression ')'

   rel_op ::= '<' | '<=' | '>' | '>='

   bool_expr ::= term
             ::= bool_expr '|' term

   term ::= factor
        ::= term '&' factor
        ::= term '^' factor

   factor ::= string
          ::= '(' bool_expr ')'
and some example queries:
 
cc = 58g* & an = 885* All about section "58G" in volume 885
cc = 58g* & py = 1996:1998 All about section "58G" for years 1996 to 1998
au = (bender,* ^ canfield,*) The author is Bender, without Canfield
au = (bender,* ^ canfield,*) | au = (canfield,* ^ bender,*) The author is Bender or Canfield, but not both
au = ((bender,* ^ canfield,*) | (canfield,* ^ bender,*)) same query as above
au = ((bender,* ^ canfield,*) | (canfield,* ^ bender,*)) & py >= 1997 same query, but the item has been published after 1996
ti = (riemann* & (manifold* | variete*)) 
ti = riemann* & manifold*  Syntax error. The correct query is: ti = (riemann* & manifold*)
ti = riemann* manifold*  Expression search: will find "riemannian manifold", "riemannian manifolds", etc...
bi = (bose einstein | einstein bose)  ...
ti = (fermi dirac ^ bose einstein)  ...

3.3 query execution

The primitive operations that are implemented are the following: These operations return intermediate result sets. Implementing the query language's  operators
(and, or, not) is the process of computing the intersection, union, or difference of these result sets, thus yielding a set of matching item identifiers. The resulting set is sorted by increasing identifier values, so that sequentially accessing the result set yields database entries in the 'natural 'order of the database.

In the case of expression searching, a result set consisting of identifiers of items that match all expression words is first built. Word position information is then used to filter out items that actually match the phrase at hand.

The resulting set  is then reduced to a slice corresponding to the first and last members requested, and returned to the caller, which can now retrieve an actual database item by requesting for example member number  N.