LIMES: Large Infrastructure in Mathematics - Enhanced
Services
Access to Research Infrastructures RTD project HPRI-CT-1999-50002
Fifth Framework Programme of the European Community
for Research and Technological Development
Description of the Zentralblatt MATH Web interface
software
Claude Goutorbe
November 19, 2001
Version 0.2
| LIMES Project Deliverable |
| Project Name: |
Large Infrastructure in Mathematics - Enhanced Services |
| Project Acronym: |
LIMES |
| Project Number: |
HPRI-CT-1999-50002 |
| Deliverable Title: |
Description of the Zentralblatt MATH Web interface software |
| Deliverable Number: |
part of D1.1, D1.2 |
| Version Number: |
0.2 |
| Date: |
November 19, 2001 |
| Principal Author(s): |
Claude Goutorbe
Cellule Mathdoc UJF & CNRS
e-mail: Claude.Goutorbe@ujf-grenoble.fr |
| Deliverable Kind: |
Documentation |
| Deliverable Type: |
Restricted |
| Abstract: |
This chapter describes the software that is used to implement the web
retrieval and display software of Zentralblatt-MATH |
Table of contents
Introduction
This document describes the retrieval and display software that is used
in the current world wide web interface to the ZMATH database.
The retrieval and display software can be seen as built of three layers.
The top layer, also called "the application" handles user input, generates
a query that is passed to the bottom layer (the search engine) through
an intermediate python extension module which is the glue between the application
and the database search engine. This glue layer passes back the query results
to the application level, who is responsible for presenting them to the
user.
Emphasis is put on how the user visible part is implemented, in the
section entitled 'The application layer'. We then describe how this user
interface is coupled with low level software that actually implements searching
('The glue layer') , and the search engine itself ('The search engine layer'),
which is known as 'edbm'.
This documentation is mainly intended as an help for maintaining and
modifying the software, and as such is a little technical, but remains
a high level description. Readers should refer to the source code for implementation
details.
The software needs to implement a number of user visible features, such
as:
-
multiple query forms (corresponding to different level of user expertise
or profiles),
-
query forms and result display in multiple natural languages
-
multiple output formats (e.g. html, dvi, pdf, ...).
It also needs to support some administrative features such as: access control,
logging of user requests (for example for statistical purposes). One desirable
feature is that the software be easily customized and extended to meet
new or local requirements, to adapt to changes in the database format,
or to adapt to other similar databases.
For this reason most of the input and output processing has been implemented
in Python, an interpreted, rapid development language, that can be fully
integrated with modules where speed is of concern: the search engine itself.
1. The Application (Python) layer
This is where request handling and results presentation are actually done,
and can thus be considered as "the main program".
1.1 Application invocation
The Python level application is currently invoked through a C program
that embeds the python interpreter, and not directly. This C program also
implements access control. It is essentially a wrapper that is used to
launch the real application. Specifically:
-
it reads the file whose name is passed as the first argument expecting
lines consisting of (keyword value) pairs, but is actually interested only
in finding the "APPLICATION" keyword. All other keywords are simply put
in the process environment without even looking at them (and can be used
to pass information to the application that will run)
-
the value corresponding to the "APPLICATION" keyword should be a dotted
name "modulename.classname"
-
the module identified by "modulename" is imported
-
the class identified by "classname" is instantiated, with no arguments
(this means that the class' constructor cannot have arguments, or uses
only arguments that have a default value)
-
the instance just built now initializes itself. It should have an attribute
named "acl_files" which is a tuple of access control file names.
-
access control takes place. If it is successfull (the user indeed has access),
a special instance method "set_user_info" is called with the entire acl
file line that matched (it is not an error if this method does not exist,
we just proceed.) If it is not successfull, the method "no_access" is called
and we either stop (if it returns 0) or continue (if it returns 1) This
is used to implement "demo mode".
Full control is now given to the application by calling its "run" method.
Note that this way of invoking the application imposes a few constraints
on its structure. Namely, it must be implemented as some class instance
residing in some module, it must define access control files, and it must
have a run method. Although these constraints are not very strong, they
can be further relaxed by writing an application as a standard
python program invoked directly through the python interpreter. This can
be especially useful for implementing simple, 'quick and dirty' features
that might be needed at a development site, when the full processing machinery
described in the next sections is not needed.
This requires an 'edbm enabled' python interpreter, in other words
the glue module that implements the search engine interface must either
be compiled into the interpreter or available as a dynamically loadable
module (this module is not currently distributed).
1.2 Access control
From the point of view of the access control module, an access control
file is a text file having the following structure:
-
empty lines or lines beginning with '#' are ignored
-
other lines begin with an IP number specification.
-
An IP number specification ends at the first of: end of line, blank, TAB
character, '#' character.
-
The rest of the line is ignored for the purpose of access control.
An IP number specification may be:
-
a full IP number (193.54.238.197)
-
a class A,B,C network number (193.54)
-
a network or machine range (123.99.12-97 or 123.99.100.3-6)
-
a network and mask in CIDR format (194.199.16.0/20)
-
a network and mask in a.b.c.d format (193.54.238.192/255.255.255.224)
1.3 Application structure
As for essentially any web cgi database application, the duty of the application
layer is:
-
to process user input,
-
to build a database query based on this input
-
to send the results back to the user's browser.
A lot of logic is common to such applications and is currently implemented
in the module "edbmw3.py", which defines base classes and utility functions
defining this common behaviour. Derived classes are then used in the actual
application to handle specific needs.
The following discusses input processing, query building and output
generation, describing what is currently done by the base classes and what
should be done by the derived classes.
1.3.1 Input processing
The base application class (edbmw3.Application) implements the following
behaviour:
The class constructor:
-
is called with the process environment as the only parameter
-
parses the user's cgi request and builds a dictionnary like object ("cgidata")
that represents user input
-
reads a configuration file (whose path is obtained from the environment)
and builds a dictionnary ("config") The configuration file is an ascii
file containing (key, value) pairs whose interpretation is up to the derived
application. The derived class ("zmath.mathdbapp") augments this behaviour,
for example by processing certain keywords in the configuration file in
a special way.
1.3.2 Query building
The base application class does nothing in this respect, although it probably
could and maybe should.
The derived class first opens the database and delegates actual query
building to another module, which is specific to the database we are querying
(in this case, this is the ZMATH database, and the database specific module
is "MATHDB.py").
This database specific module is aware of what fields and indexes exist,
and how the data has been indexed. It uses this information to build a
command language equivalent of the user's query: based on keys in the cgidata
dictionary, it can identify search criteria and also has the duty of editing
user input to take into account indexing rules or other special features
such as automatic truncation of author names.
Most of the logic to do this is implemented in the edbmw3.Query and
edbmw3.AdvancedQuery classes. Derived classes in the database specific
modules only define input editing methods that are called transparently.
1.3.3 Running the query
The command language query just built is passed to the search engine via
the Python extension module which returns a result set or an error. Based
on values in the cgidata and config dictionaries, the application then
chooses an output shell to instantiate and run.
1.3.4 Output generation
The base application class also defines a model for output handling. Since
we handle many output formats, actual output generation is delegated to
separate classes (so called "output shells") that should be defined by
the derived application.
The base class for all output shells is "edbmw3.OutputShell" (except
very simple output routines that are based on a fixed template).
The output document is defined by specifying a tree of names, which
represent elements to be inserted into the output buffer, some of them
being repeatable.
The default tree is defined as:
-
'doc' -> 'header', 'body', 'footer' '
-
body' -> 'item' (repeatable: represents a result set item)
The only requirement on this tree specification is that the root must be
named 'doc'.
The base output generation method (edbmw3.OutputShell.do_element) walks
down this tree (depth-first, left-to-right), doing the following:
[the term "string" in the following is to be understood as: any object
that can be represented as a string using Python's str function. ]
-
if the element is defined as repeatable, a special method is called to
get the next "value". This usually applies only to the next result set
item, and is done transparently in the base class.
A derived class may however define field editing methods that should
be named 'edit_<fieldname>' that will be called after reading the next
result set item: this is typically used to split a field that is logically
a list into its components. (e.g. author names)
-
get a list of strings that should be inserted into the output just before
the element itself. This is done by first calling a derived class method
called "start_<elementname>" (e.g start_foo for the element whose name
is "foo"), if that method exists. Otherwise, lookup the "head_list" dictionary
(head_list['foo']) to find out what to put in front of the element, if
anything.
-
if the element is a leaf : get a list of strings that should be inserted
into the output as the value of this element. This is done by first calling
a derived class method called "do_<elementname>" (e.g. do_foo), if that
method exists. Otherwise if the element has been statically defined in
the "elements" dictionary, use that (i.e. insert elements['foo'] into the
output).
Statically defined elements are useful for things like logos, images
or fixed text that do not depend on the current database item.
-
otherwise recursively process the element's subtree
-
similar processing occurs to get what should be inserted after the element
itself (e.g. call end_foo or lookup tail_list['foo'])
An output shell should initialize itself by defining the dictionaries that
are used in the above process. The entire application object is passed
to the shell's constructor, meaning that the output shell has access to
the entire application context (http request, configuration, environment,...).
After shell initialization, the application starts the main output loop
by calling the shell's 'run' method which essentially amounts to a call
of do_element('doc').
Other shell classes in edbmw3.py include HTMLShell, which adds
features necessary for html output (for example it defines head_list['doc']
as 'Content-type:text/html\n\n<html>')
Problems with output
-
The output routines use information that comes from a lot of places (http
request, environment, configuration,...) It is not very clear how to simplify
this, but probably something should be done.
-
the output routines in turn generate input, for example in the form of
html links or buttons that send a query to the application when clicked
by the user. These automatically generated queries must be consistent with
the processing that is done during input handling: a change in the way
that some field is indexed must potentially be made in two places, which
is easily overlooked.
-
Since the raw data that comes from the database is basically unstructured,
output shells usually need to parse it in order to make use of it.
This is currently done in the output routines themselves either directly
(via 'edit' methods) or by calling some parsing routines in a specialized
module (mathutil.py). This leads to duplicate code and maintenance problems.
The task of parsing data should be wrapped up in an object that encapsulates
the raw result set and delivers already transformed database items to the
output shells. This would be further simplified by changing the database
scheme so that it is less often necessary to "peek" at data to decide how
to handle it.
1.3.5 Sending the output to the user
Some output routines handle formatting completely by themselves and write
their output directly to the process standard output (the user's browser).
In the case of graphical output, however (e.g. dvi output), these routine
are only preparing input for an external formatter (e.g. TeX). It is the
duty of the application to spawn this formatter once its input has been
prepared. This is done by invoking shell scripts whose task is to set up
the environment for the external formatter (usually TeX), to run it, and
to send its output to the the user's browser.
1.3.6 Last minute tasks
The very last thing that is done is to write information to a log file.
The format of the log file is described in another chaper of this deliverable.
1.3.7 Utility classes and modules
A number of utilities that are generally useful for input as well as output
processing are available. They include:
-
edbmutil.py: contains functions to convert strings to and from different
encodings (e.g. TeX to iso-8859-1). These routines are currently slow and
should be moved to a C module. This module also defines a Buffer object
that provides temporary storage on behalf of an output shell.
-
mathutil.py: contains a number of functions that know how to parse the
unstructured data stored in the database into more structured objects that
can be manipulated more easily by the output routines.
-
htmlfmt.py: contains classes and functions useful in the context of generating
html output.
2. The glue layer
This is a standard Python extension module written in C, whose purpose
is to wrap the search engine so that it can be accessed from Python code.
Conversely, it knows how to translate search results into Python objects
that can be manipulated by the application layer.
This module defines two new Python object types: database and result
set.
A database object is created ('opened') by the 'db' function,
which is passed a tuple of directory paths corresponding to the physical
database 'slices'.
Database objects have three methods:
-
find(query, first, last) where query is a command language
query string, first and last define the result set's slice that should
be returned.
This method returns a tuple (first, last, total, result): first and
last define the actual result set slice that is being returned (last may
be different from what was asked for, since the total number of matching
items may be smaller than what was requested); total is the total number
of matching items; result is a result set object (defined below) containing
the requested slice.
-
count(query) returns the number of items that match query.
This
is just a convenience method since the same result can be obtained using
the find method.
-
close() : guess what it does.
Database objects also behave partly as sequence objects: for example db[0]
is
the first item in the database; db[-1] is the last item in the
database; len(db) returns the total number of items in the database.
Result set objects behave like sequence objects, hence indexing a result
set yields the corresponding result set item: result[0] is the
first item in the result set.
A database item is returned as a triple (fileid, itemid, data) where
fileid is the number of the database slice that contains the item,
itemid
is
an internal identifier (usually of no interest to the application), and
data
is
itself a tuple where each element is the value of the corresponding item's
field. Null values are returned as the Python object 'None'.
Since all these structures are not immediately apparent in the distributed
source code, here is a short example:
[goutorbe@math-gobi goutorbe]$ python
Python 1.5.2 (#10, Feb 7 2001, 14:10:32) [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import edbm
>>> db = edbm.db(('/home/goutorbe/edbmw3/data/3',))
>>> len(db)
463408
>>> db[0]
(0, 1, ('01633832', 'Kinsey, L.Christine; Moore, Teresa E.', None,
'Symmetry, shape, and space. An introduction to mathematics through geometry.',
'English', 'New York, NY: Key College Publishing/Springer. xviii,
494 p. \\sterling 48.00; \\$ 54.95; DM 138.99; sFr. 119.94 (2002).
[ISBN 1-930190-09-3/hbk]', '2002', 'B',
'00A06\00251-01', None, None, 'Review in preparation', None, None))
>>> (first, last, total, result) = db.find('au=kinsey, l*', 1, 5)
>>> total
1
>>> last
1
>>> result[0]
(0, 1, ('01633832', 'Kinsey, L.Christine; Moore, Teresa E.', None,
'Symmetry, shape, and space. An introduction to mathematics through geometry.',
'English', 'New York, NY: Key College Publishing/Springer. xviii,
494 p. \\sterling 48.00; \\$ 54.95; DM 138.99; sFr. 119.94 (2002).
[ISBN 1-930190-09-3/hbk]', '2002', 'B',
'00A06\00251-01', None, None, 'Review in preparation', None, None))
>>> result[1]
Traceback (innermost last):
File "<stdin>", line 1, in ?
IndexError: index out of bounds
>>> result
<edbmresultset object at 80f17f0>
>>>
Coupling the application and the glue layer
It should be noted that almost no data model exists at this level or at
the search engine level.:
a database item is simply a tuple of values. It is up to the application
to know that certain fields are indeed lists, or references to foreign
databases. It would certainly be useful to introduce a data model.
However, it is not clear if it should be implemented in this extension
module, or purely as Python coded wrappers around the database and results
set objects that are implemented by this module.
As a first step, a next version of the software will probably use such
python-level wrappers, as a thin layer between the application and the
glue layer that should allow easier and cleaner coding at the application
level.
3. The search engine
This is the deepest layer of the software that actually executes a query
against a database returning a set of item identifiers that match the query,
if any.
3.1 edbm databases
We will describe what is an edbm database, what is a result set and
how a query is processed.
A database is a set of one or more 'slices' , which are physically stored
in separate file system directories that contain data files and index files.
The structure of data files is inherited from a previous retrieval system
that is used for the cdrom version of the Zentralblatt-MATH database. It
has the advantage of being very simple.
A database item is composed of a fixed number of fields (which may be
empty) and is stored as a single string. A data file can be considered
as an ordered set of items (according to a database specific comparison
function), and an item's index in this set is used as the item's identifier.
Auxiliary files hold mapping information that allows the retrieval of a
single item given its identifier.
Index files are implemented as B+-trees (using a third party library,
berkeley db). For a given key, the corresponding leaf of the tree holds
an ordered array of integers, which are the identifiers of the corresponding
database entries.
To support expression searching (proximity), word position information
is generated for certain indexes; in this case the leaf corresponding to
a given key also includes for each item identifier a pointer into a separate
file that holds this position information.
Hence, running a query is the process of generating an array of integers
that are the identifiers of matching items (actually generating one array
for each database slice). At the C level, a result set is a structure that
holds these arrays.
3.2 the command language
The search engine receives a query as a command language string.
The query is first parsed into a syntax tree and then executed against
each database slice, producing a result set.
Here is the formal syntax of the command language
find_expression ::= field_expression
::= find_expression bool_op field_expression
bool_op ::= '&' | '|' | '^'
field_expression ::= name '=' '(' bool_expr ')'
::= name rel_op string
::= name '=' string ':' string
::= '(' find_expression ')'
rel_op ::= '<' | '<=' | '>' | '>='
bool_expr ::= term
::= bool_expr '|' term
term ::= factor
::= term '&' factor
::= term '^' factor
factor ::= string
::= '(' bool_expr ')'
and some example queries:
| cc = 58g* & an = 885* |
All about section "58G" in volume 885 |
| cc = 58g* & py = 1996:1998 |
All about section "58G" for years 1996 to 1998 |
| au = (bender,* ^ canfield,*) |
The author is Bender, without Canfield |
| au = (bender,* ^ canfield,*) | au = (canfield,* ^ bender,*) |
The author is Bender or Canfield, but not both |
| au = ((bender,* ^ canfield,*) | (canfield,* ^ bender,*)) |
same query as above |
| au = ((bender,* ^ canfield,*) | (canfield,* ^ bender,*)) &
py >= 1997 |
same query, but the item has been published after 1996 |
| ti = (riemann* & (manifold* | variete*)) |
|
| ti = riemann* & manifold* |
Syntax error. The correct query is: ti = (riemann* & manifold*) |
| ti = riemann* manifold* |
Expression search: will find "riemannian manifold", "riemannian manifolds",
etc... |
| bi = (bose einstein | einstein bose) |
... |
| ti = (fermi dirac ^ bose einstein) |
... |
3.3 query execution
The primitive operations that are implemented are the following:
-
lookup a key in some index, returning the list of item identifiers that
match (e.g 'au = foobar'' may return item numbers 6, 78, 23456)
-
lookup a prefix in some index (e.g. 'au = foobar*'' may return item numbers
6, 78, 23456, 35678)
-
lookup a key range ('py = 1989:2000')
-
lookup keys that compare greater, greater or equal, smaller, smaller or
equal to a given key ('py >= 1998')
These operations return intermediate result sets. Implementing the query
language's operators
(and, or, not) is the process of computing the intersection, union,
or difference of these result sets, thus yielding a set of matching item
identifiers. The resulting set is sorted by increasing identifier values,
so that sequentially accessing the result set yields database entries in
the 'natural 'order of the database.
In the case of expression searching, a result set consisting of identifiers
of items that match all expression words is first built. Word position
information is then used to filter out items that actually match the phrase
at hand.
The resulting set is then reduced to a slice corresponding to
the first and last members requested, and returned to the caller, which
can now retrieve an actual database item by requesting for example member
number N.