Computer-assisted data extraction from the
taxonomical literature
Jim Diederich*, Renaud Fortuner**, and Jack Milton*
*Department of Mathematics, University of California, Davis, CA 95616,
USA
dieder@math.ucdavis.edu
milton@math.ucdavis.edu
**11, place de la Frézellière, 86420 Monts sur Guesnes, France. Correspondant du Muséum National d'Histoire Naturelle, Paris, France. fortuner@wanadoo.fr
Key words: biological character data, computer-assisted data extraction, published data, identification, biology, nematodes
This paper may be cited if a proper reference is given:
Diederich, J., Fortuner, R. & Milton, J. (1999). Computer-assisted data
extraction from the taxonomical literature. Virtual publication on web
site: http://math.ucdavis.edu/~milton/genisys.html.
Abstract
This article presents some problems associated with the acquisition of
morphological data from printed descriptions of taxa and our solution to
these problems.
An electronic tool, the Terminator, was used in 1993 for testing our approach.
After an article has been scanned into electronic form and run through
OCR processing, the Terminator helps the operator identify all the characters
present in the description, record them in a standard format, and store
them prior to the creation of a database. Some difficulties one can expect
in the creation and the population of such a database are discussed. The
prototype has been implemented primarily for use with descriptions of plant-parasitic
nematodes. The complexity of our character set necessitated new concepts
to handle it, which in turn required changes to the tools originally
built to create and manage it. When this task is completed, the Terminator
will have to be modified accordingly, and it will be reformulated into
a generic tool, as we see no inherent reasons that it cannot be adapted
for other biological domains.
Introduction
We have previously (Diederich et al., 1997a) proposed a representation
of morpho-anatomical characters based on a decomposition of traditional
systematic characters into: i) a biological structure (taken in a hierarchy
from the whole organism to systems, organs, tissues, cells, cell organelles,
and molecules); ii) the aspect of this structure that is being described
(taken from a list of about 20 basic properties, see Diederich, 1997);
and iii) the state or value taken by the basic property in a particular
species or individual.
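As a present-day sketch, this three-part decomposition can be modeled as a simple record. The field names below are our own shorthand, not identifiers from the actual schema, and the example character is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Character:
    structure: str  # a biological structure, e.g. an organ
    property: str   # one of the ~20 basic properties
    state: str      # the state or value taken in a given species

# A nematode character in this representation:
c = Character(structure="Spermatheca", property="shape", state="round")
```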
Once a character set has been created for a particular biological group
using this representation, it can be used to create a morpho-anatomical
database that will house the descriptions of taxa (species, genera, families,
etc.) in this group. The question remains as to how to populate this database.
This article proposes a possible approach that has been prototyped and
tested with a set of nematode descriptions.
Creating a character database
New data vs. published data
The first question to be answered is whether we should create a character
database from new data gathered for this purpose or rely on published data.
The first option would solve many problems such as those presented by missing
data and format ambiguities. However, this would not be possible for taxonomic
groups of more than a few species. Recording new data for, e.g., the 8,000
populations of plant-parasitic nematodes that have been described over
the last hundred years would require, first, conducting a worldwide sampling
expedition; then extracting and processing 8,000 samples (actually many
more, because it would take more than one sample to find each targeted
population); then recording the data, to the tune of more than one day per sample. This
is just not an option for practicing nematologists.
The alternative is to use published data. Existing published data are far
from perfect, but they exist. While problems raised by missing data and
differences in format are unavoidable, a great amount of published data
has been recorded by some of the best (i.e., reliable) past or still active
taxonomists. We cannot be certain that our new data would be more accurate.
Some of the data from the literature is probably better than anything a
systematic worldwide survey would provide, depending on the expertise of
the surveyors.
The creation of a general database with published data is the only practical
option, but the question remains as to how to do it . This can be done
by hand with the operator reading the data directly from the printed documents
and entering it in the proper place in existing forms, or data extraction
and entry can be assisted by computer, using several possible methods discussed
below.
This article describes the approach we propose for the creation of such
a database, as well as some difficulties we can expect to encounter during
this task. The task remains daunting, but we have designed electronic tools
to help in this endeavor. While ours is by no means an automated system,
it makes the task at least possible.
The tools were prototyped a few years ago during the NEMISYS project (Nematode
Identification System), but they are of a sufficiently general nature that
we see no major problems in extrapolating them to other biological domains.
We have therefore begun the GENISYS Project (General Identification System),
which brings the construction of similar databases in other areas within
reach.
Possible approaches to published data acquisition
There are two basic methods that can be used to acquire the data: manual
data entry, and electronic data entry via a scanner and optical character
recognition (OCR) processing. With the second method, there are three options:
(i) store only the text without trying to extract and store character data
in a database, (ii) use natural language processing to aid placement of
the data in the database, or (iii) use some other means short of natural
language processing.
Manual data entry
Manual entry can be cost effective if the amount of data to be entered
is reasonably small. Even if the amount is substantial it can be entered
manually if it can be done in a very systematic way by entering data for
the same few characters repeatedly. Furthermore, if the data is simple,
the errors can probably be kept to a minimum, though one cannot escape
the tedium of the continual typing. Another advantage of manual entry is
that it eliminates the need for any pre-processing such as scanning, but
the saving would not be great, as most of the pre-processing can be done by
data-entry personnel.
Unfortunately, these conditions are rarely met in biological descriptions.
The descriptions, outside of a few basic measurements, are not written
in any systematic fashion and so one must often jump from one character
to a much different character to enter the data. A typical nematode description
contains about 50 to 80 characters out of a list of over 5,000 characters
that currently exist in our system. A few characters are found in all descriptions,
but others were used only by one or a few authors. The data are complex,
often involving measurements, thus making it likely there will be entry
mistakes. With a vast literature the list of known characters is not stable,
and more characters must be added to it with each processed description,
particularly early in the task. The general morphological database we are
proposing to construct will be so large that manual data entry would likely
be done not by a taxonomist but by a data-entry operator, i.e., someone
who would be unfamiliar with the domain and who would not be able to make
expert judgments about biological data. However, some character names may
not be easy to find in the published descriptions, and only experts may
be able to decide exactly what character is being described. This would
make it impossible to entrust to a data entry operator who is not a taxonomist
the task of extracting the characters directly from a printed description.
The creation of a general data matrix (i.e., one that includes all possible
characters for all existing species in a large group) is an almost impossible
task using manual entry, which probably explains why nobody has ever succeeded
in such an endeavor.
Scanning published descriptions
Using the method of electronic scanning and OCR processing presumes
a large amount of data, enough to warrant the investment in a scanner and
OCR software, although the costs are quite low. This initial part of the
process can be done by persons without any scientific training, if they
are given explicit scanning instructions. OCR processing will introduce
errors into the text, either spelling errors or scanning errors in measurements,
such as substituting the letter l or o for the digits one or zero, or missing
or adding decimals, but there are ways for handling such problems. The
time required to scan, process OCR, and spell-check is quite reasonable,
as discussed below. It should be noted that an ever increasing amount of
taxonomic descriptions are already available in an electronic form from
the author or the publisher, which will simplify the process for such new
data.
After the text of the description has been transferred to an electronic
medium, the three electronic options for data extraction listed above are
available, and the choice depends to some extent on the intended use of
the data.
Full text storage
The first option (simply storing and retrieving the electronic text)
only requires correcting errors introduced by the OCR. Spelling errors
can be corrected through a word processor, and this requires only that
a list of terms from the application domain has been entered in the user-defined
dictionary. Correcting the numerical errors could be tedious without
some preprocessing software with a good interface to help make the corrections.
The corrected text could then be retrieved by keyword methods (Salton,
1989), or in the future by text-based intelligent systems (Jacobs, 1992)
when they have been significantly developed. However, for uses of the data
for such purposes as identification this would be of limited value and
would be very inefficient unless the user had already guessed the probable
identity of the organism and needs only to check the description of the
corresponding species. The user would still have to retrieve all of the
articles containing a description of the candidate species and then locate
the descriptions in each article. For many uses the data are required in
a structured form, for example for similarity comparisons, and full text
would be insufficient for such tasks.
It would be much more difficult to retrieve descriptions with specific
values for particular characters. For this kind of search, keyword methods
in themselves do not work very well without some kind of enhancement. Some
data is extremely cryptic, and it is unlikely that searching the descriptions
would find the desired data. Published descriptions of nematode species
often include abbreviations: in nematology, the traditional character
Ratio a (the ratio of length to width of Body) often is
represented simply by the letter a. It is unlikely that text retrieval
systems could differentiate between the article a and the ratio
a and find the latter in descriptions. Also, distributing the copies
of a published article raises copyright concerns. Copyright protection
for data is a hotly debated topic (Samuelson, 1992; 1996).
Natural language processing
The second option is to use some natural language processing (NLP;
see Special Issue on NLP, 1996) to help place the characters in a database.
NLP has moved in this direction in recent years due to the huge quantities
of available on-line text, which need to be processed to be useful, as there
is too much information available to be read in detail. This work is done
under the rubric of Information Extraction (IE) (Cowie & Lehnert, 1996)
and the general thrust of this work is to summarize text for the intended
audience such as news reports and financial reports and transactions (Jacobs,
1992).
The preparation of the text after scanning would be the same as in the
first option, though it could be assumed that the spell checker would rely
on the terms already entered in the lexicon prepared for NLP. However,
the time and expertise required to create a lexicon and knowledge base
for the terms in a domain would be considerable, the lexicon would have
to be updated as new articles were processed, and new terms would have
to be added. Though NLP can do well in restricted domains, it is unclear
how well it could do in handling biological descriptions. The output
would have to be checked carefully by a domain expert, since reading articles
requires a lot of expert judgment. New lexicons would have to be built
for each additional biological domain or for processing descriptions in
other languages. It is unclear what later savings in time would result
from building a lexicon. We are unaware of any inexpensive systems that
could be used for NLP, and it is unclear that the necessary enhancements
to such a system, discussed below, would be included if such a package
becomes available.
Use of a schema
The last option is based on the schema, i.e., a formal list of morphological
characters and related information organized for use in a database system.
This schema originates from the character set created according to the
principles and guidelines set by Diederich (1997) and Diederich et al.
(1997a). The schema is used as the launching point for acquiring the data
itself. The schema, as we have defined it in the previous two references,
contains a great deal of the word combinations one expects to find in the
descriptions, a significant part of the form information takes in the
sciences (Harris et al., 1989).
This approach requires the text of the descriptions to be transferred
to an electronic medium, but the terms for the spell checker would be available
from the schema and many of the problems with handling OCR errors with
measurements can be finessed. Given that a lexicon is not constructed,
the system would have to be keyword based. This makes the transition from
one biological domain to another or from one language to another easier
than with NLP systems. While basic keyword systems may need some enhancements
to give reasonable performance, the developers of NLP systems readily admit
"they are fast, portable, relatively inexpensive, and relatively easy
to learn" while "By contrast, natural language processing can
be slow, brittle, and expensive" (Jacobs, 1992). Also, it is crucial
that the interface be well designed to make alternative actions quick and
efficient when the keyword approach fails. It is this option that we describe
in this paper, and we feel that it is a reasonable balance between options
one and two. Perhaps the main point here is that a well-constructed list
of characters and a good interface provide a solid foundation for data
acquisition, with otherwise fairly simple concepts for processing.
The Terminator: a tool for semi-automated data extraction
The Terminator, so named because it is based on key words, or terms,
is a set of tools for (i) reading electronic versions of descriptions and
aiding with the decomposition of characters and the placement of the data
in records, (ii) reading tables of data after having been rearranged in
a simpler format, (iii) reviewing, changing, and recovering the data, and
(iv) storing the data in a form that can be used to import the data into
a commercial DBMS. In conjunction with these activities, a schema tool
assists in the creation and management of the schema. The schema creation
and management functions are of critical importance, as without a good
schema the data would be nearly useless: fraught with redundancy, lacking
uniform structure and meaning, and containing information that cannot easily
be found and used. At this point neither a biological database management
system nor a biological knowledge base management system is commercially
available to support the complex relationships among biological characters
needed in a large identification system. However, we have taken an important
step in this direction by building the prototype of a schema tool to manage
large sets of characters.
Prototypes of the Terminator and the schema tool were built in the early
1990s and used for testing our approach in 1993 (see below). However, full-size
tools have not been prepared, not because of any intrinsic problem
with our concepts, but because the size of our schema is so large and the
difficulties so fundamental that we have spent the time since then working
on concepts and practical considerations to solve these (Diederich, 1997;
Diederich et al., 1997a; 1997b).
Extracting data from the printed literature using the Terminator includes
several main steps: (i) retrieving the appropriate articles from the literature,
(ii) scanning them electronically using commercial optical character recognition
(OCR) software, (iii) spell-checking, (iv) dividing the text into blocks
(normally one block consists of the description of one population), and
(v) running the Terminator tools to extract the data either from text or
from tables. The domain expert identifies the appropriate articles, and
then steps i, ii, and iii can be done by operators with very little knowledge
of the domain. Steps iv and v should be done by a trained domain expert,
at least in the present state of the prototype.
The Terminator Interface, Processing text
Basic Actions
Since interface design is a critical element in creating a usable and
efficient system, it is important to give some idea of how the Terminator
operates. Indeed, throughout the development of the Terminator prototype,
interface considerations proved to be a very important driving element
of the overall design of the system (Diederich & Milton, 1993). User
interactions also help to illustrate additional information which must
be captured in the course of processing species descriptions. What is described
here provides the essential elements of a specification of the system.
It must be noted that the description below is based on the 1993 prototype,
as no current version of the tool exists. The future version of the Terminator
will integrate new concepts defined since 1993, and its operating principles
and interface will be somewhat different, but the 1993 prototype gives a
good illustration of the principles and philosophy we put into the tool.
Figure 1 shows the window for the Terminator.
The scanned text appears in pane 12. To begin to process this text, the
operator clicks the mouse with the cursor on the button labeled Next
in the collection of buttons numbered 11. The Terminator proceeds through
the text in chunks (roughly sentences or phrases separated by punctuation
within a sentence), as this or other buttons are repeatedly pushed. The
chunk highlighted in pane 12 reads spermatheca round with round sperms.
The Terminator then suggests an interpretation: a list of
candidate characters appears in pane 10. The best candidate character (Spermatheca,
shape, round) has been highlighted by the system. It happens to be
the correct character in this example. Once a candidate character is highlighted,
pushing the Accept button 11 causes it to appear in pane 7 as an
accepted character. The operator can also proceed to the next chunk by
pushing the Accept/next button which both accepts the current choice
and moves to consider the next chunk. During description processing, a
majority of the operations performed by the operator involve the accept/next
button alone.
If a chunk is related to more than one character the Terminator usually
comes up with suggestions for all characters, and they are displayed in
the list in pane 10. The operator can accept as many as are appropriate.
There is no limitation, and several records can be created from a single
chunk.
Alternative Actions
Sometimes the Terminator will not find the correct character at all.
For example, during the processing of the chunk incompletely areolated
on tail (Figure 2), the system was
unable to find the correct character. However, two of the candidates, Areolations/kind
on tail/irregular areolations and Areolations/kind on whole body/irregular
areolations, do include the correct structure, Areolations.
If item 1 in pane 10 is selected, a menu item for that pane called Get
path causes a path through the hierarchy to be highlighted in panes
1 through 5: Exoskeleton, to which Areolations belongs, is highlighted
in pane 1; the main structure, Cuticle, is highlighted in pane 2; the Areolations
itself is highlighted in pane 3; and the list of corresponding properties
appears in pane 4. The operator can scroll the list of those properties,
select the correct one (kind on tail in this case), and finally
select the correct state (irregular areolations in this case)
from pane 5.
Sometimes the characters appearing in pane 10 are too far from the mark
and do not even include the correct structure name. In this case the operator
can highlight a word in the chunk, e.g., part of a structure, and use the
Find command. The system finds all places where this word appears
in the schema and produces a list. When the operator selects one of the
characters in the list, it is displayed in panes 1 through 5 and it can
be accepted as previously. If all else fails the operator has the option
of scrolling and selecting the correct biological system in pane 1, structure
in pane 2, substructure in pane 3, property in pane 4, and state in pane
5 or value in pane 6. This happens very infrequently.
Pane 6 is used to handle numerical values found within a chunk when problems
arise, e.g., when there has been a scanning error. The operator uses pane
6 to enter a new set of values or modify an existing one.
The biological structure, basic property, and state/value panes have some
menu items that assist in understanding the schema, such as definitions
or synonyms for the biological structures and basic properties in
the schema. The status panes (panes 8 in Figure 1) are used to record important
information about the data, such as the stage being described (female,
male, juvenile, cyst, etc.), the type of specimens (holotype, paratypes,
syntype, etc., or non-type material), the errors, if any, in
the original description (not valid), the difficult characters that
the operator wants to be checked by the expert (review), the diagnostic
characters, and general remarks made by the operator. Qualifiers that are
found in the text (sometimes, usually, often, rarely, etc.) can
also be entered.
The Terminator Interface, Processing tables
Some articles contain data in tables. Tables cannot be processed in
the same way as text, because (i) the chunks are not free-flowing and are
not delimited by commas, which requires a different approach to processing,
(ii) in a scanned document converted to ASCII the rows and columns often
do not line up, which makes them difficult to read, (iii) sometimes different
columns represent different populations that must be kept separate, (iv)
character names and other information in the headers often are more cryptic
than in written text, which requires additional interpretation, and (v)
there are many different formats for tables, with headers as different
populations, different species, or characters, although characters are
the row headers in most tables.
We resolved some of these problems during the scanning stage and others
with modifications to the Terminator prototype, although tables are handled
in the current version of Terminator by doing more manual work than would
be ideal. For example, delimiters are entered manually to indicate the
beginning of each header and character. Also, all of the various formats
found in the literature are reduced manually during pre-processing to the
single format that is most commonly found (columns = populations; rows
= characters). This means that, at the expense of some manual processing,
the current Terminator can be used to extract data from tables. The Terminator
must accommodate the fact that each column refers to a different population
or a different species, but at least the rows are most often characters.
Description Processing by the Terminator
In this section we discuss how the Terminator determines the list of
characters to propose in the Suggested pane in the window (pane
10 of Figure 1). Again, this is based on the prototype version of 1993
and the operating principles of a future updated version may be somewhat
different.
When the text is first processed, the system attempts to determine where
the true periods are that mark the end of sentences, as opposed to periods
that appear in measurements, in abbreviations, or as stray marks misinterpreted
by the OCR as periods. Once a sentence is identified it is decomposed into
chunks, phrases that are separated by commas and semicolons. Some errors
in measurements can be automatically corrected: l's and o's (ells
and ohs) are converted to 1's and 0's (ones and zeros).
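A minimal sketch of these two steps (chunk splitting and ell/oh correction) in present-day Python; the patterns are our own approximations, not the prototype's code:

```python
import re

def fix_ocr_digits(token: str) -> str:
    # In a token that otherwise looks like a measurement, convert
    # stray ells and ohs to ones and zeros.
    if re.fullmatch(r"[0-9lo.]+", token) and re.search(r"[0-9]", token):
        return token.replace("l", "1").replace("o", "0")
    return token

def split_chunks(sentence: str) -> list:
    # Chunks are phrases separated by commas and semicolons.
    return [c.strip() for c in re.split(r"[,;]", sentence) if c.strip()]
```

For example, fix_ocr_digits("l2.4") yields "12.4", while a plain word such as "tail" is left alone.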
The system uses what is primarily a keyword approach. The keywords are
those that appear in the hierarchical schema of structures along with their
properties and states. However, the schema does not have to have a hierarchy
of structures and substructures to work. A flat schema would work as well.
Common words such as articles and prepositions can easily be eliminated
from the list of keywords by creating a fairly small list of common words
to be excluded.
Our approach does not rely on exact word matching, nor does it require
that roots of words be specified. For example, the words annulus, annuli,
annule, annulated, and annules are all sufficiently close that
it is not necessary to specify or determine whether annul is a root.
Terms that do not match closely enough can be put into a global list of
synonymous terms for explicit matching. Also, synonyms are allowed in the
schema, so many synonyms are automatically handled in the hints, as discussed
below. It is useful for a few variations in spelling, such as oesophagus
vs. esophagus, to have a global list of synonyms so that they do
not need to be listed as synonymous everywhere in the schema where one
or the other appears.
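This kind of approximate matching can be sketched today in a few lines with the standard library's difflib; the 0.75 threshold is an arbitrary choice of ours, not a value from the prototype:

```python
from difflib import SequenceMatcher

def close_enough(word: str, term: str, threshold: float = 0.75) -> bool:
    # "annules" and "annuli" score about 0.77, so they match without
    # any explicit root such as "annul" being specified.
    return SequenceMatcher(None, word.lower(), term.lower()).ratio() >= threshold
```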
Keyword matching alone does not work particularly well. Generally, proposed
enhancements to keyword-based systems include means for weighing and ranking
the selected candidates and statistical methods for associating terms in
a dictionary. We have found that we get reasonable performance using keywords
with just some simple heuristics.
One of the main features used in the system is a set of hints, which are
automatically generated by the system after the schema is created. (While
the hints comprise the main elements used by the system, other information
in the schema plays a role, such as whether the property is quantitative
or qualitative so that quantitative properties are not considered if measurements
are not found in the sentence.) A hint has the form [word, pattern, schema-element].
For example, ['tail', ('tail' 'end' 'mucro'), <structure element>]
is one hint for the word tail when it is encountered in a sentence.
The pattern consists of the words comprising the name or a synonym of the
name of the schema-element, e.g. ('tail' 'end' 'mucro') in this example.
The <structure element> is a structure, property of a structure,
or state of a property of a structure that is specific to this hint. Likewise
['tail', ('tail' 'end' 'annuli'), <structure element>] is also a
hint for the word tail.
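In outline, and with data structures of our own choosing (the schema elements here are placeholder strings), a hint and its selection against a sentence might look like:

```python
# A hint: (trigger word, pattern words, schema element it points to).
HINTS = [
    ("tail", ("tail", "end", "mucro"), "Tail end mucro"),
    ("tail", ("tail", "end", "annuli"), "Tail end annuli"),
]

def select_structure_hints(sentence: str, min_matches: int = 2) -> list:
    # Keep a hint if its trigger word occurs in the sentence and
    # enough of its pattern words also occur.
    words = set(sentence.lower().replace(".", "").split())
    selected = []
    for trigger, pattern, element in HINTS:
        if trigger in words:
            if sum(1 for p in pattern if p in words) >= min_matches:
                selected.append(element)
    return selected
```

For the sentence The end of the tail is rounded, both hints are selected here, since two of the three pattern words match; in the Terminator they would then be rejected by further criteria such as must words.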
For each sentence, the system attempts to determine which biological structures
are possible candidates before treating individual chunks. This is done
by first selecting those structure hints whose first entry, i.e., the word,
matches sufficiently well with some word in the sentence. Some of the hints
selected are then discarded if their patterns fail to match other words
in the sentence sufficiently well. For example, the sentence: The end
of the tail is rounded generates several hints, including the two hints
in the example above, because words in the sentence match two of the three
words in their pattern. In this particular case, however, the two resulting
<structure element>s would be rejected due to other criteria discussed
below. No other words in this example hint at other structures. For example,
the word rounded does not match any structure names, but it does
match several state names, and it will be used at a later stage when chunks
are processed. The surviving hints for the structure elements are used
to form a hierarchy of nodes for the sentence. The nodes of the hierarchy
represent structures and substructures of those elements well matched with
the words in the sentence. The system then proceeds with the analysis of
each chunk of the sentence.
Each chunk within the sentence inherits a copy of the hierarchy of nodes
for the sentence. Then, the system selects hints for properties and states
that match words within the chunk sufficiently well. If the corresponding
properties and states are associated with structures already in the hierarchy
of nodes for the sentence, they are added as nodes below their respective
structures. The other hints are discarded, unless they belong to substructures
of other structures in the hierarchy. For example, for the chunk Lip
region rounded, ... the original hierarchy of nodes includes a node
for the structure Lip region. In the schema, the property state
rounded is associated with a structure named Outline that
is not in the original hierarchy of nodes because it does not appear in
the sentence. However, Outline is a substructure of the structure
Lip region, which is a node in the hierarchy. Consequently, the
hint associated with the state rounded is used to add a node for
Outline to the hierarchy for the given chunk.
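A toy version of this implicit-reference step, using a hypothetical two-entry schema fragment of our own construction:

```python
# Hypothetical schema fragment: parent -> substructures, state -> owning structure.
SUBSTRUCTURES = {"Lip region": {"Outline"}}
STATE_OWNER = {"rounded": "Outline"}

def extend_hierarchy(nodes: set, chunk_words) -> set:
    # Add a node for a structure that never appears in the text when
    # one of its states does and its parent is already in the hierarchy.
    for word in chunk_words:
        owner = STATE_OWNER.get(word)
        if owner and owner not in nodes:
            if any(owner in SUBSTRUCTURES.get(p, ()) for p in nodes):
                nodes.add(owner)
    return nodes
```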
A number of simple strategies are used to compensate for the system's lack
of linguistic knowledge. The example in the previous paragraph shows how
the system deals with implicit references. Another strategy is to designate
some words as must words. For instance, in the examples of hints
previously given, the words mucro and annuli are designated
as structure words that must appear in the sentence for the hint
to be added to the hierarchy of nodes for the sentence or a chunk of the
sentence. It is easy for the domain expert to establish a list of structure
names that are must words, that is, words that one expects to see in most
situations when the corresponding structure is discussed. Must words are
also given for character states. For example, in lemon-shaped,
the word lemon is a must word, while shape is not. The net
effect is to eliminate hints like ['tail', ('tail' 'end' 'mucro'), <structure
element>] if the must word mucro is absent from a sentence or
chunk, as in The end of the tail is rounded, but to retain it if
mucro is present. If a word like end, which is not a must
word, is absent from the sentence the hint would not automatically be discarded,
but would be evaluated for overall matching as previously discussed.
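The must-word test itself is a simple filter; a sketch, with an illustrative must-word list of our own:

```python
MUST_WORDS = {"mucro", "annuli", "lemon"}

def survives_must_words(pattern, text_words) -> bool:
    # A hint survives only if every must word in its pattern
    # actually appears in the sentence or chunk.
    return all(w in text_words for w in pattern if w in MUST_WORDS)

# Words of the sentence "The end of the tail is rounded":
words = set("the end of the tail is rounded".split())
```

Here the hint pattern ('tail' 'end' 'mucro') fails against this sentence, because the must word mucro is absent, but it would survive if mucro were present.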
After all the nodes that have matched sufficiently well have been added
to the chunk's hierarchy the lowest level elements in the hierarchy become
possible candidates. A candidate may consist of a structure, one of its
properties, and a state, e.g., Body, posture, weak C. Some structures
and properties may match well, but it may happen that the system does not
find any state hint if the state used in the description is not currently
in the schema. For example, in a sentence such as The body posture is
a widely open C, if widely open C is not yet part of the schema
we would still want to maintain Body, posture as a candidate. Each
candidate is scored on how well it matches at each level in the node hierarchy
and these scores are combined to form an overall score. The candidates
are rank-ordered, first by simple heuristics, and then by their scores
within a given rank. For example, candidate characters having hints for
states or values are ranked above those that do not. If two states for
the same property are candidates, only the one with the highest score is
maintained in the list.
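The ranking step can be illustrated with a short sketch. The Candidate fields, the mean used to combine level scores, and the tuple sort key are all assumptions made for this example, not the prototype's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Candidate:
    structure: str
    prop: str
    state: Optional[str]             # None when no state hint matched
    level_scores: Tuple[float, ...]  # match score at each hierarchy level

    @property
    def score(self) -> float:
        # Combine the per-level scores into one overall score (mean here).
        return sum(self.level_scores) / len(self.level_scores)

def rank(candidates):
    # Keep only the highest-scoring state for each (structure, property),
    # as when two states for the same property are candidates.
    best = {}
    for c in candidates:
        key = (c.structure, c.prop)
        if key not in best or c.score > best[key].score:
            best[key] = c
    # Heuristic rank first (a candidate with a state outranks one without),
    # then overall score within a rank.
    return sorted(best.values(), key=lambda c: (c.state is None, -c.score))

cands = [
    Candidate("Body", "posture", "open C", (1.0, 0.9, 0.8)),
    Candidate("Body", "posture", "spiral", (1.0, 0.9, 0.2)),
    Candidate("Tail", "shape", None, (0.9, 0.7)),
]
for c in rank(cands):
    print(c.structure, c.prop, c.state, round(c.score, 2))
```

Note that the stateless Tail/shape candidate is kept, mirroring the Body, posture example in the text, but is ranked below candidates that matched a state.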
There are a few other elements in the system, such as Exact Hints
and Table Hints, which help in specific expected situations. For
example, measurements are often given in the form <character abbreviation>=
<measurement>, such as a = 12.4, which gives the value for
ratio a. If the same character appears in a table, the row header
might simply be the letter a. A special table hint would allow the
system to recognize this character.
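An exact hint for measurements of this form can be illustrated with a small regular expression; the list of recognized abbreviations below is a hypothetical sample, not the actual nematode character set.

```python
import re

# Illustrative abbreviations only (e.g., the ratios a, b, c and length L).
KNOWN_ABBREVS = {"a", "b", "c", "L"}

# Matches "<single-letter abbreviation> = <number>", e.g. "a = 12.4".
MEASUREMENT = re.compile(r"\b([a-zA-Z])\s*=\s*(\d+(?:\.\d+)?)")

def exact_hints(text):
    """Return (abbreviation, value) pairs for recognized measurements."""
    return [(m.group(1), float(m.group(2)))
            for m in MEASUREMENT.finditer(text)
            if m.group(1) in KNOWN_ABBREVS]

print(exact_hints("a = 12.4, b = 6.1"))
```

A table hint would work analogously, matching the bare letter a as a row header instead of requiring the `=` form.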
Performance
In this section we examine the efficiency of the various activities
associated with the extraction of published data, from pre-processing the
printed text (i.e., scanning the document, optical character recognition
(OCR), and spell checking) through extracting the data (i.e., running the
Terminator), paying particular attention to how much time it takes to accomplish
each step, according to tests made in 1993. We did not try to conduct formal
replicated tests with different operators. In fact, the operator can take
many different courses of action with a highly interactive system like
the Terminator, and the time to carry out some actions is very much dependent
on a complex context that is difficult to characterize. For example, a
schema change may take only a few seconds when it requires only a simple
action (e.g., adding a state or a basic property to an existing structure)
or it may require a lengthy study of a complex situation by the domain
expert. Our goal in 1993 was to demonstrate proof of concept, that is,
that this is a practical tool for the job of building this type of database.
We tested the system on sixteen descriptions as part of an initial effort
to obtain data for approximately fifty species of plant-parasitic nematodes
of interest to the California Department of Food and Agriculture. The descriptions
were scanned and spell-checked by a student assistant and the Terminator
was used to process 16 description blocks, representing 12 text blocks
and 4 table blocks. The descriptions were not processed with the aim of
obtaining the fastest possible times; the times reflect the nematologist
(RF) working efficiently but carefully, as time allowed, over several weeks
to obtain error-free data.
On the average it took 12 minutes per page (with a range of 7 to 28 min/page)
to pre-process descriptions (16 min/page for data presented in tables)
prior to data extraction by the Terminator. The average description (text
block) was 1.3 pages long. This pre-processing time included scanning the
article, sometimes rescanning in case of a poor scan, running the OCR processing,
and spell checking using a word processor. It should be noted that the
time needed for OCR processing (using Accutext™ software) varied greatly
depending on whether a training function (the Verifier) was on or
off, with the verifier taking significantly more time than straight OCR
operations (i.e., after training is completed and the verifier is off).
Since we were processing from different journals, the verifier was often
on, which slowed the process. Descriptions that included data arranged
in tables took longer because of the need to reduce the various formats
used for printed tables into the single format that can be read by the
Terminator.
Data Extraction Times
Tables 1 and 2 give the times for data extraction from 16 descriptions
using the Terminator. Table 1 gives
the results for processing descriptions without any data tables (12 descriptions),
and Table 2 gives results for 4 data
tables processed separately. The processing time included the time the
Terminator takes to process the text as well as the interactive time the
operator takes to accept or input the data. The number of characters extracted
from the description and the number of schema changes were also recorded.
On average for the 16 descriptions, there were 64 characters per description
and approximately 3 characters were processed per minute. The average time
was 21.5 minutes per description. Processing the descriptions without data
tables (Table 1) took 26.3 minutes per description, with 2.80 characters
per minute and 74 characters per description. The schema had to be modified
4.9 times per textual description. This rather large number is probably
due to the fact that the tests were made at a time when the concepts of
schema building were not fully developed and the nematode schema used included
many missing characters and improprieties. Changes would be far less extensive
with the current schema. The number of schema changes and the time for
processing the text were greater than for tables.
The pre-processing operator was paid $5.58/hr and the scientist extracting
the data was paid $25/hr, which means that it cost about $12.50 to completely
extract the data from one description. It would cost about $100,000 for
extracting the data from the 8,000 published descriptions of plant parasitic
nematodes, a sizable sum but well within funding possibilities of large
organizations.
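The cost estimate can be checked against the times and rates reported above; a worked calculation, using the 26.3 min/description extraction time for textual descriptions, 1.3 pages per description, and 12 min/page of pre-processing:

```python
# Pre-processing by the assistant at $5.58/hr: 12 min/page, 1.3 pages.
preprocess_cost = (12 * 1.3) / 60 * 5.58   # about $1.45

# Extraction by the scientist at $25/hr: 26.3 min per textual description.
extract_cost = 26.3 / 60 * 25.0            # about $10.96

per_description = preprocess_cost + extract_cost
print(round(per_description, 2))           # close to the "$12.50" figure

# Extrapolating to the published literature on plant-parasitic nematodes:
print(8000 * 12.50)                        # $100,000
```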
Qualitative Performance
In addition to processing times, we also recorded the qualitative performance
of the Terminator by checking how well it found the correct characters
on its own (without the help of the operator) in one of the descriptions
(i.e., Table 1, line 2), selected at random. Since the test involved running
the Terminator a second time on a description that had already been processed,
there were no characters missing from the schema. Therefore, this gives
a better indication of the Terminator's performance when the schema is
nearly complete (i.e., it includes all the characters that are present
in the description being processed). It should be noted that the Terminator
was tested at a time when we were still grappling with problems associated
with the decomposition and representation of traditional characters. Solutions
described in Diederich (1997) and Diederich et al. (1997a) will make it
much easier to build an initial and partially complete schema, even before
starting with description processing.
In this test, if the correct candidate character, i.e., correct structure,
property, and state/value, appeared in the display (among the first five
suggested candidates), we called this a perfect match. In such cases, the
operator only had to accept the character, possibly after selecting the
correct one from among the displayed list of five candidates. Since these
are very fast operations, we gave them the maximum score.
If the correct structure and/or property appeared in the first five characters
but had the wrong state or value, we called this an incomplete match. In
this case the operator had to select the candidate, get a path through
the schema hierarchy, select the correct state/value or input the correct
value, and accept the character from the hierarchical display. These are
fast operations too, but not as fast as accepting the correct character
directly from the initial pane.
Finally, in some cases, the state/value was missing because of our design
decision to display only one candidate among all those that have the same
property but different states and to print only the first measurement with
each candidate even when there are multiple measurements in a chunk. In
other cases, the data were missing because the Terminator takes a conservative
approach to numerical problems introduced by the scanner. For example,
the measurement I 8 (I7-2o) would not be corrected to 18 (17-20)
and recognized as an average and range. Instead the Terminator recognizes
the range and the number 8 as separate values and the operator needs to
enter the correct values in pane 6.
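This conservative policy can be sketched as follows; the regular expression and the clean-prefix rule are illustrative assumptions, not the Terminator's actual parser.

```python
import re

# A range given in parentheses, e.g. "(17-20)".
RANGE = re.compile(r"\((\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\)")

def parse_measurement(text):
    """Return (average, (low, high)).

    The average is accepted only when the text before the range is one
    clean number; an OCR-damaged token such as "I 8" is NOT silently
    corrected to 18 and is left (as None) for the operator to resolve.
    """
    m = RANGE.search(text)
    rng = (float(m.group(1)), float(m.group(2))) if m else None
    prefix = text[:m.start()].strip() if m else text.strip()
    avg = float(prefix) if re.fullmatch(r"\d+(?:\.\d+)?", prefix) else None
    return avg, rng

print(parse_measurement("18 (17-20)"))   # clean average and range
print(parse_measurement("I 8 (17-20)")) # damaged average left to operator
```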
We called it a mismatch when the correct candidate was not among the first
five candidates. When this happens, the operator had to do a Find
on a selected word to get to the correct character. An alternative would
be to scroll through the list of other candidates. Choosing one method
over another is a question of personal preference. Doing a Find
and accepting from the path is approximately as fast as scrolling, while
scrolling is more demanding because it requires reading through the candidates.
The results of this test were as follows. For measurement characters, usually
simple and easier to recognize, the system achieved a perfect match 67%
of the time, an incomplete match 25% of the time, and a mismatch the remaining
8% of the time, to yield an overall score of 76%. Even with the more difficult
descriptive characters the Terminator did quite well, as the percentages
were 55, 31, and 14, respectively, for a score of 71% on the descriptive
characters only.
It should be noted that, even in those cases in which the Terminator did
not find the correct character on its own, the alternative actions described
above allowed the operator to process the data. In this sense, the overall
success rate for the tool together with its human operator was 100%.
Discussion
The Terminator offers a viable means of capturing and supplying data
from the published literature for use in populating general morphological
databases for any biological group. It may also be useful in other areas,
perhaps outside biology. This is an important advance, since other approaches
either would not work at all (e.g., when manual data extraction and entry
is not a practical option) or would require extensive effort with the resulting
costs possibly outweighing the benefits. As we indicated, natural language
processing in biological domains might require considerable independent
effort in each domain. Not only was the Terminator successfully used to
extract data in nematology, but it proved to be fast, easy, flexible, and
reliable. Much of this is due to its interface and related functionality.
The Terminator would also be particularly well suited to the concept of
an electronic journal as an on-line database. Electronic publication is
making progress, but the electronic journals we have seen implemented or
tested so far propose full-text materials, that is, materials in which
the data are not formatted and are embedded in extraneous material. This
makes queries difficult and many types of direct use virtually impossible.
A typical keyword search often returns unwanted elements resulting from
query string mismatches, which is a downside of this approach.
Alternatively, an electronic journal could consist of one or several on-line
databases. Authors would use the Terminator to represent their data in the
format of these general databases prior to adding the formatted data to
the general pool of descriptions.
The Terminator interface has been refined and tuned many times, not only
to make actions fast, but also to make the layout and the nature of the actions
easy for the operator to perform, considerably reducing the fatigue
factor. Easy usage relies first and foremost on both obvious and subtle
aspects of the interface that require a great deal of tuning. Though we
have not fully used the concept of Visual plan (Diederich &
Milton, 1993), which relies on the arrangement of the various panes in
the tool for suggesting the most likely course of action, it has influenced
our design. It has made the use of this complex tool relatively easy to
remember, even when one has not used it for a considerable time.
Easy usage helps with reliability too, since the operator can then be more
attentive to the data that is being entered. However, smooth operation
can actually be a problem in that it can lull the operator into a false
sense of security. Fortunately, warnings such as those we have implemented
for the change of type and stage will help overcome this potential problem.
A tool is also available that is used to review the results, review data
that has been tagged by one or more of the various indicators discussed
earlier, and show which chunks have not had any data accepted yet in a
description.
Another aspect of the tool is that it gives the operator ways to handle
unusual or difficult situations. Without such capabilities it might become
difficult to manage large sets of descriptions. Just keeping track of what
needs to be reexamined could be troublesome. For example, there may be
problems in deciding where to place a piece of data. In Globodera pallida
the live females are white, but after they die they become dark brown cysts
full of eggs. In the chunk Some populations pass through a four to six
week cream stage before turning brown, it is not clear whether the
cream color should be attributed to the female or to the cyst. The Terminator
solution to this type of problem is to rely on the flexibility built in
the schema and in the knowledge base that will be attached to it. In case
of ambiguous statements or characters, the operator chooses one of the
options (e.g., cyst, color) and explains the nature of the ambiguity in
a memo attached to the character. Later, appropriate relationships will
link the two characters so that an entry in one can be compared to an entry
in the other.
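A hypothetical record for this strategy might look as follows; all field names are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterEntry:
    structure: str
    prop: str
    state: str
    memo: str = ""           # free-text note explaining any ambiguity
    related: list = field(default_factory=list)  # cross-links added later

# The operator chooses one placement (cyst, color) and records the doubt.
entry = CharacterEntry(
    structure="cyst", prop="color", state="cream",
    memo=("Ambiguous in Globodera pallida: the four-to-six-week cream "
          "stage may belong to the female rather than the cyst."))

# Later, a relationship links the two characters so entries can be compared.
entry.related.append(("female", "color"))
print(entry.structure, entry.state, "| memo attached:", bool(entry.memo))
```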
The Future Version of Terminator
As indicated above, the 1993 prototype is no longer available and funding
is being sought to create a new version of the tool. The major changes
and principles we determined for designing the characters will necessitate
reformulating the design of the Terminator from top to bottom.
For example, the prototype showed that even a relatively simple Terminator
could be used to do the job but it did not try to "learn" from
the data it had processed. The new one would.
Obviously, it is too early to indicate what this tool will be, how it will
operate, and what its interface will be. These points will be defined in
the design phase of the implementation of the tool, once it is funded.
At this moment, we can only indicate some avenues we will explore.
The interface: easier to use and more intuitive
We will try to implement fully the visual plan concept.
The data: uniform representation
We will integrate the new concepts defined since the time the prototype
was built: basic properties, structures, views. For example, the prototype
top 5-panel display will be replaced by a 3-panel display showing structures,
basic properties, and states/values.
Extraction and formatting engine: smarter
Operating principles will take advantage of the new concepts. For example,
the search of terms in the prototype put on the same footing all the terms
in the schema. As a consequence, the list of suggestions at times was quite
wrong. For example, if the text included "Tail round", the tool
would also propose "Spermatheca/shape, round", because both characters
share the state round. The final version will give priority to structures
over basic properties and values.
Acknowledgements
This work was supported in part by a grant from the California Department of Food and Agriculture, Exotic Pest Research N° 3498.
References
Anon. (1996). Communications of the ACM, Special Issue on Natural Language Processing, 39(1), Jan. 1996.
Cowie, J. and Lehnert, W. (1996). Information extraction. Communications of the ACM, 39, 80-91.
Diederich, J. (1997). Basic properties for biological databases: character development and support. Journal of Mathematical and Computer Modelling (in press).
Diederich, J., Fortuner, R. and Milton, J. (1997a). Construction and integration of large character sets for nematode morpho-anatomical data. Fundamental and Applied Nematology (in press).
Diederich, J., Fortuner, R. and Milton, J. (1997b). A general structure for biological databases. In: Bridge, P., Jeffries, P., Morse, D.R. and Scott, P.R. (eds) Information Technology, Plant Pathology and Biodiversity. CAB International, Wallingford, UK (in press).
Diederich, J. and Milton, J. (1993). Expert workstations: a tool-based approach. In: Fortuner, R. (ed.) Advances in computer methods for systematic biology: artificial intelligence, databases, computer vision. The Johns Hopkins University Press, Baltimore, pp. 103-123.
Harris, Z., Gottfried, M., Ryckman, T., Mattick, P. Jr., Daladier, A., Harris, T.N. and Harris, S. (1989). The form of information in science. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Jacobs, P.S. (1992). Text-based intelligent systems: current research and practice in information extraction and retrieval. Lawrence Erlbaum Associates, Hillsdale, N.J.
Salton, G. (1989). Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, Mass.
Samuelson, P. (1992). Copyright law and electronic compilation of data. Communications of the ACM, 35, 27-32.
Samuelson, P. (1996). Legal protection for database contents. Communications of the ACM, 39, 17-23.