Memory Based Learning and the Interpretation of Numbers in Archeological Reports Hans Paijmans
Sander Wubben
University of Maastricht; Institute for Knowledge and Agent Technology
University of Tilburg (student BDM)
[email protected]
[email protected]
ABSTRACT
very dierent if we want to re ognize numbers, but the se-
This paper des ribes the pra ti al appli ation of Memory Based Learning algorithms to the retrieval of do uments in a spe i domain: ar heologi al papers and reports in the Dut h language.
It fo uses on the re ognition of phrases
in texts that ontain numbers, either in arabi or roman numerals, as ardinals, or as domain-spe i terms ('Middle Ages'). The numbers then are interpreted as hronologi al dates, measures or geographi al lo ations and in orporated in a do ument retrieval system.
Keywords Ma hine Learning, Information Retrieval, Named Entity Re ognition, ar heology
1.
INTRODUCTION During the last de ades of the twentieth entury mu h
resear h has been done in text interpretation and data ex-
le tion of training instan es and subsequent pro essing is rather dierent.
1
As a ase, let us onsider an institution su h as the RACM , where a great number of do uments and reports about ar heologi al ex avations, site surveys and similar do uments are stored ele troni ally (and used). A
ess to the information in the reports is either by straightforward keyword sear h or by a separate database in whi h relevant attributes of the do uments are entered by human operators. There exist proje ts to reate a more involved XML markup based on CIDOC/CRM, e.g, by J. Holmen and his ollaborators, [11℄, but here no automati extra tion from instan es in the text into the tags is envisaged. In general, keyword sear h in the do uments an be automated in a number of ways, but sear hing for do uments or pages that refer to spe i semanti ontent is all but impossible if the system is indexed by keywords only. The generation of a database with attributes by humans is a labourious task, and in pra ti e large ba klogs o
ur and/or data will be omitted.
This is obviously the result of the fa t that im-
So, on the one hand, we have ompli ated ontologies that
mense quantities of (ma hine readable) text and do uments
are extremely di ult to satisfy, and on the other super-
tra tion.
have been produ ed over the last twenty years. It was not
ial keyword a
ess. Our approa h is to start with the key-
unnatural that the methods that promised to extra t infor-
word a
ess, and add the semanti ontent in stages.
mation from su h large quantities be ame popular; among
this, we are onstru ting an information retrieval system,
them Memory Based Learning, that was able to lassify texts
Open Boek, that automati ally extra ts, translates and in-
or parts thereof.
For
dexes su h attributes from the text. However, this is only
Of ourse there is a ertain lag between the development
a prelude for a mu h more di ult and ambitious task: the
of s ienti methods, and the appli ation in pra ti e, and as
identi ation of phrases and images that refer to individual
far as ommer ial appli ations are on erned, su h appli a-
obje ts.
tions are often ' losed sour e'. This means that neither the
In this paper we present our progress in the re ognition
exa t method nor the sour e ode of the programs are visible
and interpretation of numbers and numeri al data. The sys-
to the user or the interested developer. This in turn makes it
tem, Open Boek, is a 'use ase' for the CATCH proje t ,
di ult to apply su h methods to new domains or to rene
more in parti ular for RICH
the performan e. Also, the systems des ribed in the litera-
also opportunities for other resear h groups in the proje t,
ture as often as not are tailored for English texts. Finally,
su h as KICH . The ar heologi al framework is that of of
the energy of main resear h seems to have been entered
the National Referen e olle tion(NR ).
2
3
4
and MITCH , but it oers
5
on Named Entity re ognition dire ted at the re ognition of entities like, well, names([2℄). Of ourse the method is not
2. OPEN BOEK: THE CURRENT SYSTEM The design of Open Boek
6
hinges on a few requirements.
1
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission from the author. DIR 2007, March 28-29, Leuven, Belgium. Copyright 2007 Hans Paijmans and Sander Wubben.
Rijksdienst voor Ar heologie, Cultuurlands hap en Monumenten
2 Continuous A
ess to Cultural Heritage 3 Reading Images in the Cultural Heritage 4 Mining for Information in Texts from the Cultural Heritage 5 KennisInfrastru tuur CultuurHistorie 6
The te hni al report on Open Boek an be found online as http://www.referentie olle tie.nl/Openboek/ob_rapport.pdf
To start with, the system should be both immediatly usable
3.1 The corpus
from the beginning of the proje t, and remain open for additions and hanges. Therefore we adopted a tight modular approa h, where the modules ('experts') ommuni ated in ASCII les, using Unix text tools where possible. This may have had some impa t on performan e, but makes it easier to inspe t intermediary results. Also, it invites experimenting with dierent modules that have similar fun tionality. Then there is the format of the original do uments. Although some do uments in the olle tion are written in Word, it was found that this format is not amendable to manip-
Considering we are applying Natural Language Pro essing te hnology (NLP) in the eld of ar haeology, we need to have data that is relevant for this spe i domain.
8
random from the site E-Depot Nederlandse Ar heologie , dealing with ar haeologi al ex avations and histori al resear h.
to limit ourselves to that format. As a rst step, the pdfs are onverted to individual HTMLpages and separate images, keeping as mu h of the original layout as possible. Then, the text proper was extra ted from the HTML. One typi al database, of 750 pdf do uments, yielded 50 MB text, ontained in 30.000 pages and as many images. At retrieval time, the HTML pages are displayed, but a link to the original pdf-le is at all times present. Of
ourse, it is not ne essary to have a pdf le; the do uments
an be delivered in HTML dire tly, but most do uments in databases like we envisage are in pdf format anyway.
Together, these 75 reports ontain 420,369 words
and pun tuation marks.
3.2 The classes
ulation by third party developers. pdf-les are, and as the majority of the do uments was available in pdf, we de ided
To
obtain this data, we sele ted 75 ar haeologi al reports at
To automati ally make expli it the information embedded in numeri s in these ar haeologi al reports, we rst need to extra t and lassify these numeri s.
Before we an do
this, we should dene the lasses we want to be found. The CIDOC/CRM model [5℄, a well-known standard in the eld of ultural heritage, was used as a referen e. A table that shows the lasses we sele ted is in 2. The majority of the numeri s we found an be tted into the CIDOC model. The remaining items proved to be numeri s only relevant inside the individual do ument they were in, su h as image numbers and page numbers; these are found in the lasses 'Referen e' and 'Other'.
3.3 Extraction
SMART The extra ted texts were indexed using the venerable SMART
7
program, developed between 1965 and 1970 by Salton , that is however still doing very well in the TREC ontests [3℄. SMART is an implementation of the Ve tor Spa e Model[15℄, whi h essentially retrieves do uments on keyword ombinations and, most importantly, ranks them a
ording to some measure of relevan e. The SMART program oers several distin t weighting methods for individual words and we are still debating whi h is the best for this parti ular purpose. In any ase, it serves as a fast and reliable indexing and retrieval engine and so an be the basis for a very usable do ument retrieval system. The reation of the keyword indexes for a database of this size is typi ally a matter of a few minutes on a modern PC running SuSE Linux.
Memory Based Learning
We rst tokenized the text, seperating words from pun tuation (the pun tuation hara ters are also onsidered tokens). A feature extra tor was then programmed to extra t the numeri s and equivalent phrases from the texts, together with four words of leading and four words of trailing ontext. This means every instan e has nine features that an be used in the MBL stage. A simple rule-beased NE (Named Entity)
2 and 100 were extra ted, but also the written, the arabi and the roman ardinals and ordinals like twee, tweede, 2-de or 2e
extra tion omponent made sure not only items like
(two, se ond, 2nd, et etera). A downside of this NE re ognizer is that words that begin with a numeri omponent or are similar to numeri s need to be spe i ially dened (or at least the mali ious part of it) so they don't get extra ted. Dening 'a hter' a morpheme that is not to be extra ted will prevent words like
a hternaam, a hterbank and a hter from
being extra ted when sear hing for ombinations beginning with
a ht (eight).
An english equivalent would be 'ten' be-
Then, we use MBL (Memory Based Learning) te hniques
ing a prex in 'tend'; the re ognizer would then extra t 'ten'
to extra t 'interesting' phrases from the texts, e.g. hrono-
and 'tenthousand' but not 'tender' or 'tenden ies'.
logi al dates, measures and similar onstru ts (see se tion
We tested the feature extra tor on a small dataset on-
3 below). These phrases then are parsed a
ording to their
sisting of random pages from random ar haeologi al reports.
lass, and the disambigued and normalized information is
This set ontained 9,158 tokens.
inserted as tags in the HTML-le. New indexes are reated
found 506 numbers in this set whereas the feature extra tor
The human annotator
from these tags. The nal result is that the system knows
found 518. Of these ndings, 498 were identi al. Assuming
that a parti ular day or year lies within the 'Middle Ages', or
the human annotator made no mistakes, this results in a
in the twelfth entury, or in the XII-th entury or whatever
pre ision of 0,96 and a re all of 0,98 for the feature extra -
phrase is used in the do ument, and so is able to present
tor.
texts, relevant for that parti ular date. The generation of these indexes and markups is slow; for the database men-
3.4 Manual annotation
tioned here it would take two or three days to re ognize the phrases and add the markups, but there ertainly is room for improvement.
It is not very hard to extra t numbers from texts, but knowing what these numbers mean is quite something else. Considering the fa t that for example timespans an be written in dozens of dierent ways, MBL was our rst hoi e. To
3. 7
OPEN BOEK: MACHINE LEARNING
For more information on SMART and a tutorial see: [14, 13℄
be able to perfom MBL, a set of examples is needed. These examples an be generated by manually annotating the ex-
8
http://edna.itor.org/nl/
..lead 2
lead 1
numeri
trail 1
trail 2..
boren
tot
1
m
onder
steentijd
)
8800
-
4900
ook
Bijlage
III
)
.
van
put
7
zijn
geen
werden
ook
dertien
fragmenten
kwarts
•
Modied value dieren e metri :
a method of looking
at o-o
uren e of values with target lasses to determine how similar the values of a feature are, using exemplar weights [4℄.
•
Information Gain feature weighting:
measures the dis-
rimination value , i.e. how mu h ea h isolated feature
ontributes to the orre t lassi ation. Table 1:
Examples of numeri s with leading and
trailing ontent
•
Five nearest neighbours used for extrapolation:
ve
nearest neighbours are used for lassifying obje ts based on losest training examples in the feature spa e. tra ted numeri s from the ar haeologi al reports. To do so, annotations we onstru ted a database of 25,070 instan es,
Exponential De ay weighting with parameters a = 1.1 and b = 1.1: gives exponentially less weight to more
ea h onsisting of nine features a
ompanied by the lass
distant neighbours for lassi ation.
a web-based environment was reated.
With the manual
•
assigned by the human annotator. These settings were used to perform a tenfold- rossvalidation
ode
lass
E42
Obje t identier
E47
Spatial oordinates
E52
Time-span
E54 E54
per .
F-s ore
18.7 %
0.92
1.2 %
0.95
16.0 %
0.96
Dimension: depth/height
2.9 %
0.86
Dimension: length/width
1.7 %
0.90
E54
Dimension: diameter
0.8 %
0.85
E54
Dimension: thi kness
0.7 %
0.79
E54
Dimension: surfa e
1.7 %
0.62
E54
Dimension: volume
0.2 %
0.62
E54
Dimension: weight
0.2 %
0.85
E54
Dimension: other
1.5 %
0.88
E60
number
7.6 %
0,90
Referen e
6.2 %
0.93
40.6 %
0.95
Other
test on the remaining data (22,563 instan es). If we were to
lassify every instan e as belonging to the biggest lass, being 'Other', our a
ura y would be 40.6 per ent. Therefore, we an set the baseline for our system to 40.6 per ent, as every per entage higher than this number would be an improvement in the performan e of our system. F-S ore beta = 1 mi roav.
0.93
F-S ore beta = 1 ma roav.
0.87
AUC, mi roav.
0.96
AUC, ma roav.
0.92
overall a
ura y
0.94
Table 3: F-S ores, Area under the ROC-Curve and overall a
ura y obtained with TiMBL
Table 2: Breakdown in lasses with odes, per entages and F-s ores
3.5 Memory based learning For the MBL we used TiMBL 5.1 [6℄
9
, a de ision-tree-
based implementation of k-nearest neighbour lassi ation (KNN). KNN lassi ation is a method of lassifying obje ts based on the losest training examples mapped in the feature spa e. TiMBL uses indexes in the instan e memory extensively and therefore an handle dis rete data and large numbers of various examples well [6℄. The settings that proved most e ient for a development set onsisting of 2,507 random instan es were:
•
IB1 lassi ation algorithm: learning algorithm.
a simple instan e-based
It simply stores all example in-
stan es and nds the losest one.
larity fun tion where the instan es are des ribed by
With a total of 94 % of the instan es lassied orre tly,
attributes. The algorithm is quite similar to the k-
the MBL- omponent performs well above the baseline of
n
9
Figure 1: Information gain of the nine features
IB1 uses a simi-
nearest neighbour algorithm. However, IB1 normalizes
40.6 % Implementing it is therefore a
eptable.
its atributes' ranges, pro esses instan es in rementally
that don't perform very well are, not surprisingly, gener-
Classes
and assumes missing values to be maximally dierent
ally the smallest lasses as shown in table 2.
from the value present. A drawba k is that the storage
does not have many examples from these lasses, and thus
The system
of all the items an take up large amounts of mem-
does a poorer job at lassifying them. Another thing to take
ory [1℄.
into a
ount is the fa t that these lasses are all sub lasses from 'E54 Dimension', and are more akin and therefore more
Available from http://ilk.uvt.nl
likely to be onfused with ea h other.
As demonstrated in earlier resear h [2℄, gure 1 shows us that the loser the word is to the fo us, the higher its infor-
that the proje t has two very distin t goals.
The rst is
to build a text retrieval system, that in orporates modern
mation gain gets. This means that words loser to the fo us
te hniques for the interpretation of ertain semanti lasses,
(in our ase a numeri ) are more important for the lassi-
su h as hronology or geography.
ation of that numeri .
we have the following tasks before us:
When at equal distan e, words
To omplete this stage,
in front have a slightly higher information gain value than trailing words.
Something that makes our situation inter-
•
At the moment we are onsidering (but have not yet
esting, is the fa t that the information gain of the numeri
implemented) stand-o representation [7℄, where dif-
itself is a tually
ferent tag-stru tures for HTML and XML are stored
lower than the information gain of its dire t
neighbours. This means that the numeri itself ontributes
in separate les.
less to its lassi ation than the words dire tly in front or
and will prevent many problems with oni ting tag
in the ba k of it. These results stress the ambiguity of nu-
stru tures.
This will result in faster indexing
meri s in ar heaologi al texts and the need to use ontext to disambiguate the numeri s before the numeri al information
•
The se ond task, s heduled for early 2007 is the development of a system to re ognize and disambiguate
therein an be made expli it.
spatial referen es, and add the orre t oordinates for
3.6 Making the information explicit
intera tive use, in ooperation with KICH
11
. We will
Now that we have a database with lassied examples,
use mu h the same approa h as for the numeri data.
we an start to make the numeri information expli it by
In fa t, the urrent system is already able to re ognize
implementing a parsing and tagging system. The text needs
spatial oordinates and display the orresponding area
to be tagged with the lassi ations done by TiMBL on the
using Googlemaps.
numeri s generated by our feature extra tor. This is not the
•
only task of our system: the tagged information needs to be represented in a standardized form. Consider the senten es
de eeuw)('From
van 1601 tot 1700
and
sele ted and implemented, to redu e the number of
In de 17-
keywords.
1601 to 1700' and 'In the 17th entury').
•
Although their stru tures vary, they have the same meaning.
In the rst senten e,
1601
and
1700
will both have
ventory must be ompiled of other, similar wishes.
17th will have been lassied as being
•
a Timespan. However, we want our parser to tag both into De ember the 31st 1700.
•
This an be a hieved by looking at the ontext. In this
Currently most index les are plain ASCII, and are
senten e is essential. For this task we programmed a rule-
pro essed by the standard Unix text utilities.
based parser using a list of keywords to ombine with the
lassied instan es. It inserts HTML-tags for the lassi atimespan-tags in the HTML-le. While doing this, it was a number- omponent, su h as the 'Middle Ages' or the 'Iron Age' with their orre t dates
10
. In the rare o
asion where
'1601' would have been lassied as 'Timespan' and '1700' as something else and the tagger expe ts a 'Timespan', it will treat '1700' as a timespan and in lude it in the tag.
If and when performan e be omes a problem, indexes and other data should be stored in a SQL-database.
example the word 'eeuw' (meaning entury) in the se ond
minor addition to also add histori al periods that have no
The system an also be trained on the re ognition of other entities, su h as (of ourse) proper names.
stan es as Timespans spanning from January the rst 1601
tions of the numeri s and the expli it hronologi al data for
The hronology and the spatial oordinates are of obvious interest for the ar heologi al ommunity. An in-
been lassied by our system as 'Timespans', whereas in the se ond senten e only
Also, an Open Sour e stemmer for Dut h should be
Still more interesting (and more di ult) is the nal task set before us: to arry the interpretation of NL text to the point that we an identify phrases, passages and images that refer to (ar heologi al) obje ts mentioned in the text. We will des ribe our ideas and approa h to that problem in a dierent paper[12℄.
5. ACKNOWLEDGEMENTS This work was supported by NWO/CATCH under grant
1692
01/01/1692 AD - 31/12/1692 AD
640.002.401. No Mi rosoft software was used for the exper-
1674 - 1680
01/01/1674 AD - 31/12/1680 AD
iments mentioned in the paper or for the preparation of the
13de en 14de eeuw
01/01/1201 AD - 31/12/1400 AD
paper itself.
1 januari 1988
01/01/1988 AD - 01/01/1988 AD
6. REFERENCES Table 4:
[1℄ D. W. Aha, D. Kibler, and M. K. Albert.
Some examples of generated data in the
Instan e-based learning algorithms.
Timespan-tags
Ma hine Learning,
7:3766, 1990. [2℄ S. Bu hholz and A. van den Bos h. Integrating seed
4.
OPEN BOEK: COMPLETION
names and n-grams for a named entity list and
lassier. 2000.
We mentioned already that the re ognition of numeri val-
[3℄ C. Bu kley. Looking at limits and tradeos: Sabir
ues is only the rst step in the reation of Open Boek, and
10
It must be kept in mind that the Iron Age in Western Europe is a very dierent period from the Iron Age in Gree e or Egypt
resear h at TREC 2005. In Harman and Voorhees [10℄.
11
KennisInfrastru tuur CultuurHistorie is a sister proje t of RICH, and spe ially interested in spatial referen es.
[4℄ S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symboli features.
Learn., 10(1):5778,
Ma h.
1993.
[5℄ N. Crofts, M. Doerr, T. Gill, S. Stead, and M. Sti. Denition of the ido on eptual referen e model 4.2. Te hni al report, CIDOC, 2005. [6℄ W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bos h. Timbl: Tilburg memory based learner, version 5.1, referen e guide. ilk te hni al report 04-02. Te hni al report, Tilburg University, 2004. [7℄ S. Dipper. Xml-based stand-o representation and exploitation of multi-level linguisti annotation. In
Berliner XML Tage 2005, 2005. Pro eedings of the Se ond International Conferen e on Language Resour es and Evaluation (LREC 2000), 2000. D. Harman, editor. The Fourth Text REtrieval Conferen e (TREC-4). National Institute of
[8℄ L. Dybkjaer, editor.
[9℄
Standards and Te hnology, 1999.
The Fourteenth Text REtrieval Conferen e (TREC-14). National
[10℄ D. Harman and E. Voorhees, editors.
Institute of Standards and Te hnology, 2005. [11℄ J. Holmen, C.-H. Ore, and O. Eide. Do umenting two
CAA 2003 - Computer Appli ations and Quantitative Methods in Ar haeology. Bar International Series 1127 histories at on e: digging into ar haeology. In
2004, 2003. [12℄ J. Paijmans. Re ognizing related data in ar heologi al reports, 2007. [13℄ J. J. Paijmans. Tutorial for the smart information retrieval system. http://pi0959.kub.nl/Paai/Onderw/Smart/hands.html, augustus 1996.
Explorations in the Do ument Ve tor Model of Information Retrieval. PhD thesis,
[14℄ J. J. Paijmans.
Katholieke Universiteit Brabant, 1999.
Introdu tion to Modern Information Retrieval. M Graw-Hill New York [et . ℄ -
[15℄ G. Salton and M. J. M Gill. 448 pp., 1983.