Dir-2007

October 2019
PDF

Download

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA

Overview

Download & View Dir-2007 as PDF for free.

More details

Words: 4,097
Pages: 5

Preview
Full text

Memory Based Learning and the Interpretation of Numbers in Archeological Reports Hans Paijmans

Sander Wubben

University of Maastricht; Institute for Knowledge and Agent Technology

University of Tilburg (student BDM)

[email protected]

[email protected]

ABSTRACT

very dierent if we want to re ognize numbers, but the se-

This paper des ribes the pra ti al appli ation of Memory Based Learning algorithms to the retrieval of do uments in a spe i domain: ar heologi al papers and reports in the Dut h language.

It fo uses on the re ognition of phrases

in texts that ontain numbers, either in arabi or roman numerals, as ardinals, or as domain-spe i terms ('Middle Ages'). The numbers then are interpreted as hronologi al dates, measures or geographi al lo ations and in orporated in a do ument retrieval system.

Keywords Ma hine Learning, Information Retrieval, Named Entity Re ognition, ar heology

1.

INTRODUCTION During the last de ades of the twentieth entury mu h

resear h has been done in text interpretation and data ex-

le tion of training instan es and subsequent pro essing is rather dierent.

1

As a ase, let us onsider an institution su h as the RACM , where a great number of do uments and reports about ar heologi al ex avations, site surveys and similar do uments are stored ele troni ally (and used). A

ess to the information in the reports is either by straightforward keyword sear h or by a separate database in whi h relevant attributes of the do uments are entered by human operators. There exist proje ts to reate a more involved XML markup based on CIDOC/CRM, e.g, by J. Holmen and his ollaborators, [11℄, but here no automati extra tion from instan es in the text into the tags is envisaged. In general, keyword sear h in the do uments an be automated in a number of ways, but sear hing for do uments or pages that refer to spe i semanti ontent is all but impossible if the system is indexed by keywords only. The generation of a database with attributes by humans is a labourious task, and in pra ti e large ba klogs o

ur and/or data will be omitted.

This is obviously the result of the fa t that im-

So, on the one hand, we have ompli ated ontologies that

mense quantities of (ma hine readable) text and do uments

are extremely di ult to satisfy, and on the other super-

tra tion.

have been produ ed over the last twenty years. It was not

ial keyword a

ess. Our approa h is to start with the key-

unnatural that the methods that promised to extra t infor-

word a

ess, and add the semanti ontent in stages.

mation from su h large quantities be ame popular; among

this, we are onstru ting an information retrieval system,

them Memory Based Learning, that was able to lassify texts

Open Boek, that automati ally extra ts, translates and in-

or parts thereof.

For

dexes su h attributes from the text. However, this is only

Of ourse there is a ertain lag between the development

a prelude for a mu h more di ult and ambitious task: the

of s ienti methods, and the appli ation in pra ti e, and as

identi ation of phrases and images that refer to individual

far as ommer ial appli ations are on erned, su h appli a-

obje ts.

tions are often ' losed sour e'. This means that neither the

In this paper we present our progress in the re ognition

exa t method nor the sour e ode of the programs are visible

and interpretation of numbers and numeri al data. The sys-

to the user or the interested developer. This in turn makes it

tem, Open Boek, is a 'use ase' for the CATCH proje t ,

di ult to apply su h methods to new domains or to rene

more in parti ular for RICH

the performan e. Also, the systems des ribed in the litera-

also opportunities for other resear h groups in the proje t,

ture as often as not are tailored for English texts. Finally,

su h as KICH . The ar heologi al framework is that of of

the energy of main resear h seems to have been entered

the National Referen e olle tion(NR ).

2

3

4

and MITCH , but it oers

5

on Named Entity re ognition dire ted at the re ognition of entities like, well, names([2℄). Of ourse the method is not

2. OPEN BOEK: THE CURRENT SYSTEM The design of Open Boek

6

hinges on a few requirements.

1

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission from the author. DIR 2007, March 28-29, Leuven, Belgium. Copyright 2007 Hans Paijmans and Sander Wubben.

Rijksdienst voor Ar heologie, Cultuurlands hap en Monumenten

2 Continuous A

ess to Cultural Heritage 3 Reading Images in the Cultural Heritage 4 Mining for Information in Texts from the Cultural Heritage 5 KennisInfrastru tuur CultuurHistorie 6

The te hni al report on Open Boek an be found online as http://www.referentie olle tie.nl/Openboek/ob_rapport.pdf

To start with, the system should be both immediatly usable

3.1 The corpus

from the beginning of the proje t, and remain open for additions and hanges. Therefore we adopted a tight modular approa h, where the modules ('experts') ommuni ated in ASCII les, using Unix text tools where possible. This may have had some impa t on performan e, but makes it easier to inspe t intermediary results. Also, it invites experimenting with dierent modules that have similar fun tionality. Then there is the format of the original do uments. Although some do uments in the olle tion are written in Word, it was found that this format is not amendable to manip-

Considering we are applying Natural Language Pro essing te hnology (NLP) in the eld of ar haeology, we need to have data that is relevant for this spe i domain.

8

random from the site E-Depot Nederlandse Ar heologie , dealing with ar haeologi al ex avations and histori al resear h.

to limit ourselves to that format. As a rst step, the pdfs are onverted to individual HTMLpages and separate images, keeping as mu h of the original layout as possible. Then, the text proper was extra ted from the HTML. One typi al database, of 750 pdf do uments, yielded 50 MB text, ontained in 30.000 pages and as many images. At retrieval time, the HTML pages are displayed, but a link to the original pdf-le is at all times present. Of

ourse, it is not ne essary to have a pdf le; the do uments

an be delivered in HTML dire tly, but most do uments in databases like we envisage are in pdf format anyway.

Together, these 75 reports ontain 420,369 words

and pun tuation marks.

3.2 The classes

ulation by third party developers. pdf-les are, and as the majority of the do uments was available in pdf, we de ided

To

obtain this data, we sele ted 75 ar haeologi al reports at

To automati ally make expli it the information embedded in numeri s in these ar haeologi al reports, we rst need to extra t and lassify these numeri s.

Before we an do

this, we should dene the lasses we want to be found. The CIDOC/CRM model [5℄, a well-known standard in the eld of ultural heritage, was used as a referen e. A table that shows the lasses we sele ted is in 2. The majority of the numeri s we found an be tted into the CIDOC model. The remaining items proved to be numeri s only relevant inside the individual do ument they were in, su h as image numbers and page numbers; these are found in the lasses 'Referen e' and 'Other'.

3.3 Extraction

SMART The extra ted texts were indexed using the venerable SMART

7

program, developed between 1965 and 1970 by Salton , that is however still doing very well in the TREC ontests [3℄. SMART is an implementation of the Ve tor Spa e Model[15℄, whi h essentially retrieves do uments on keyword ombinations and, most importantly, ranks them a

ording to some measure of relevan e. The SMART program oers several distin t weighting methods for individual words and we are still debating whi h is the best for this parti ular purpose. In any ase, it serves as a fast and reliable indexing and retrieval engine and so an be the basis for a very usable do ument retrieval system. The reation of the keyword indexes for a database of this size is typi ally a matter of a few minutes on a modern PC running SuSE Linux.

Memory Based Learning

We rst tokenized the text, seperating words from pun tuation (the pun tuation hara ters are also onsidered tokens). A feature extra tor was then programmed to extra t the numeri s and equivalent phrases from the texts, together with four words of leading and four words of trailing ontext. This means every instan e has nine features that an be used in the MBL stage. A simple rule-beased NE (Named Entity)

2 and 100 were extra ted, but also the written, the arabi and the roman ardinals and ordinals like twee, tweede, 2-de or 2e

extra tion omponent made sure not only items like

(two, se ond, 2nd, et etera). A downside of this NE re ognizer is that words that begin with a numeri omponent or are similar to numeri s need to be spe i ially dened (or at least the mali ious part of it) so they don't get extra ted. Dening 'a hter' a morpheme that is not to be extra ted will prevent words like

a hternaam, a hterbank and a hter from

being extra ted when sear hing for ombinations beginning with

a ht (eight).

An english equivalent would be 'ten' be-

Then, we use MBL (Memory Based Learning) te hniques

ing a prex in 'tend'; the re ognizer would then extra t 'ten'

to extra t 'interesting' phrases from the texts, e.g. hrono-

and 'tenthousand' but not 'tender' or 'tenden ies'.

logi al dates, measures and similar onstru ts (see se tion

We tested the feature extra tor on a small dataset on-

3 below). These phrases then are parsed a

ording to their

sisting of random pages from random ar haeologi al reports.

lass, and the disambigued and normalized information is

This set ontained 9,158 tokens.

inserted as tags in the HTML-le. New indexes are reated

found 506 numbers in this set whereas the feature extra tor

The human annotator

from these tags. The nal result is that the system knows

found 518. Of these ndings, 498 were identi al. Assuming

that a parti ular day or year lies within the 'Middle Ages', or

the human annotator made no mistakes, this results in a

in the twelfth entury, or in the XII-th entury or whatever

pre ision of 0,96 and a re all of 0,98 for the feature extra -

phrase is used in the do ument, and so is able to present

tor.

texts, relevant for that parti ular date. The generation of these indexes and markups is slow; for the database men-

3.4 Manual annotation

tioned here it would take two or three days to re ognize the phrases and add the markups, but there ertainly is room for improvement.

It is not very hard to extra t numbers from texts, but knowing what these numbers mean is quite something else. Considering the fa t that for example timespans an be written in dozens of dierent ways, MBL was our rst hoi e. To

3. 7

OPEN BOEK: MACHINE LEARNING

For more information on SMART and a tutorial see: [14, 13℄

be able to perfom MBL, a set of examples is needed. These examples an be generated by manually annotating the ex-

8

http://edna.itor.org/nl/

..lead 2

lead 1

numeri

trail 1

trail 2..

boren

tot

1

m

onder

steentijd

)

8800

-

4900

ook

Bijlage

III

)

.

van

put

7

zijn

geen

werden

ook

dertien

fragmenten

kwarts

•

Modied value dieren e metri :

a method of looking

at o-o

uren e of values with target lasses to determine how similar the values of a feature are, using exemplar weights [4℄.

•

Information Gain feature weighting:

measures the dis-

rimination value , i.e. how mu h ea h isolated feature

ontributes to the orre t lassi ation. Table 1:

Examples of numeri s with leading and

trailing ontent

•

Five nearest neighbours used for extrapolation:

ve

nearest neighbours are used for lassifying obje ts based on losest training examples in the feature spa e. tra ted numeri s from the ar haeologi al reports. To do so, annotations we onstru ted a database of 25,070 instan es,

Exponential De ay weighting with parameters a = 1.1 and b = 1.1: gives exponentially less weight to more

ea h onsisting of nine features a

ompanied by the lass

distant neighbours for lassi ation.

a web-based environment was reated.

With the manual

•

assigned by the human annotator. These settings were used to perform a tenfold- rossvalidation

ode

lass

E42

Obje t identier

E47

Spatial oordinates

E52

Time-span

E54 E54

per .

F-s ore

18.7 %

0.92

1.2 %

0.95

16.0 %

0.96

Dimension: depth/height

2.9 %

0.86

Dimension: length/width

1.7 %

0.90

E54

Dimension: diameter

0.8 %

0.85

E54

Dimension: thi kness

0.7 %

0.79

E54

Dimension: surfa e

1.7 %

0.62

E54

Dimension: volume

0.2 %

0.62

E54

Dimension: weight

0.2 %

0.85

E54

Dimension: other

1.5 %

0.88

E60

number

7.6 %

0,90

Referen e

6.2 %

0.93

40.6 %

0.95

Other

test on the remaining data (22,563 instan es). If we were to

lassify every instan e as belonging to the biggest lass, being 'Other', our a

ura y would be 40.6 per ent. Therefore, we an set the baseline for our system to 40.6 per ent, as every per entage higher than this number would be an improvement in the performan e of our system. F-S ore beta = 1 mi roav.

0.93

F-S ore beta = 1 ma roav.

0.87

AUC, mi roav.

0.96

AUC, ma roav.

0.92

overall a

ura y

0.94

Table 3: F-S ores, Area under the ROC-Curve and overall a

ura y obtained with TiMBL

Table 2: Breakdown in lasses with odes, per entages and F-s ores

3.5 Memory based learning For the MBL we used TiMBL 5.1 [6℄

9

, a de ision-tree-

based implementation of k-nearest neighbour lassi ation (KNN). KNN lassi ation is a method of lassifying obje ts based on the losest training examples mapped in the feature spa e. TiMBL uses indexes in the instan e memory extensively and therefore an handle dis rete data and large numbers of various examples well [6℄. The settings that proved most e ient for a development set onsisting of 2,507 random instan es were:

•

IB1 lassi ation algorithm: learning algorithm.

a simple instan e-based

It simply stores all example in-

stan es and nds the losest one.

larity fun tion where the instan es are des ribed by

With a total of 94 % of the instan es lassied orre tly,

attributes. The algorithm is quite similar to the k-

the MBL- omponent performs well above the baseline of

n

9

Figure 1: Information gain of the nine features

IB1 uses a simi-

nearest neighbour algorithm. However, IB1 normalizes

40.6 % Implementing it is therefore a

eptable.

its atributes' ranges, pro esses instan es in rementally

that don't perform very well are, not surprisingly, gener-

Classes

and assumes missing values to be maximally dierent

ally the smallest lasses as shown in table 2.

from the value present. A drawba k is that the storage

does not have many examples from these lasses, and thus

The system

of all the items an take up large amounts of mem-

does a poorer job at lassifying them. Another thing to take

ory [1℄.

into a

ount is the fa t that these lasses are all sub lasses from 'E54 Dimension', and are more akin and therefore more

Available from http://ilk.uvt.nl

likely to be onfused with ea h other.

As demonstrated in earlier resear h [2℄, gure 1 shows us that the loser the word is to the fo us, the higher its infor-

that the proje t has two very distin t goals.

The rst is

to build a text retrieval system, that in orporates modern

mation gain gets. This means that words loser to the fo us

te hniques for the interpretation of ertain semanti lasses,

(in our ase a numeri ) are more important for the lassi-

su h as hronology or geography.

ation of that numeri .

we have the following tasks before us:

When at equal distan e, words

To omplete this stage,

in front have a slightly higher information gain value than trailing words.

Something that makes our situation inter-

•

At the moment we are onsidering (but have not yet

esting, is the fa t that the information gain of the numeri

implemented) stand-o representation [7℄, where dif-

itself is a tually

ferent tag-stru tures for HTML and XML are stored

lower than the information gain of its dire t

neighbours. This means that the numeri itself ontributes

in separate les.

less to its lassi ation than the words dire tly in front or

and will prevent many problems with oni ting tag

in the ba k of it. These results stress the ambiguity of nu-

stru tures.

This will result in faster indexing

meri s in ar heaologi al texts and the need to use ontext to disambiguate the numeri s before the numeri al information

•

The se ond task, s heduled for early 2007 is the development of a system to re ognize and disambiguate

therein an be made expli it.

spatial referen es, and add the orre t oordinates for

3.6 Making the information explicit

intera tive use, in ooperation with KICH

11

. We will

Now that we have a database with lassied examples,

use mu h the same approa h as for the numeri data.

we an start to make the numeri information expli it by

In fa t, the urrent system is already able to re ognize

implementing a parsing and tagging system. The text needs

spatial oordinates and display the orresponding area

to be tagged with the lassi ations done by TiMBL on the

using Googlemaps.

numeri s generated by our feature extra tor. This is not the

•

only task of our system: the tagged information needs to be represented in a standardized form. Consider the senten es

de eeuw)('From

van 1601 tot 1700

and

sele ted and implemented, to redu e the number of

In de 17-

keywords.

1601 to 1700' and 'In the 17th entury').

•

Although their stru tures vary, they have the same meaning.

In the rst senten e,

1601

and

1700

will both have

ventory must be ompiled of other, similar wishes.

17th will have been lassied as being

•

a Timespan. However, we want our parser to tag both into De ember the 31st 1700.

•

This an be a hieved by looking at the ontext. In this

Currently most index les are plain ASCII, and are

senten e is essential. For this task we programmed a rule-

pro essed by the standard Unix text utilities.

based parser using a list of keywords to ombine with the

lassied instan es. It inserts HTML-tags for the lassi atimespan-tags in the HTML-le. While doing this, it was a number- omponent, su h as the 'Middle Ages' or the 'Iron Age' with their orre t dates

10

. In the rare o

asion where

'1601' would have been lassied as 'Timespan' and '1700' as something else and the tagger expe ts a 'Timespan', it will treat '1700' as a timespan and in lude it in the tag.

If and when performan e be omes a problem, indexes and other data should be stored in a SQL-database.

example the word 'eeuw' (meaning entury) in the se ond

minor addition to also add histori al periods that have no

The system an also be trained on the re ognition of other entities, su h as (of ourse) proper names.

stan es as Timespans spanning from January the rst 1601

tions of the numeri s and the expli it hronologi al data for

The hronology and the spatial oordinates are of obvious interest for the ar heologi al ommunity. An in-

been lassied by our system as 'Timespans', whereas in the se ond senten e only

Also, an Open Sour e stemmer for Dut h should be

Still more interesting (and more di ult) is the nal task set before us: to arry the interpretation of NL text to the point that we an identify phrases, passages and images that refer to (ar heologi al) obje ts mentioned in the text. We will des ribe our ideas and approa h to that problem in a dierent paper[12℄.

5. ACKNOWLEDGEMENTS This work was supported by NWO/CATCH under grant

1692

01/01/1692 AD - 31/12/1692 AD

640.002.401. No Mi rosoft software was used for the exper-

1674 - 1680

01/01/1674 AD - 31/12/1680 AD

iments mentioned in the paper or for the preparation of the

13de en 14de eeuw

01/01/1201 AD - 31/12/1400 AD

paper itself.

1 januari 1988

01/01/1988 AD - 01/01/1988 AD

6. REFERENCES Table 4:

[1℄ D. W. Aha, D. Kibler, and M. K. Albert.

Some examples of generated data in the

Instan e-based learning algorithms.

Timespan-tags

Ma hine Learning,

7:3766, 1990. [2℄ S. Bu hholz and A. van den Bos h. Integrating seed

4.

OPEN BOEK: COMPLETION

names and n-grams for a named entity list and

lassier. 2000.

We mentioned already that the re ognition of numeri val-

[3℄ C. Bu kley. Looking at limits and tradeos: Sabir

ues is only the rst step in the reation of Open Boek, and

10

It must be kept in mind that the Iron Age in Western Europe is a very dierent period from the Iron Age in Gree e or Egypt

resear h at TREC 2005. In Harman and Voorhees [10℄.

11

KennisInfrastru tuur CultuurHistorie is a sister proje t of RICH, and spe ially interested in spatial referen es.

[4℄ S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symboli features.

Learn., 10(1):5778,

Ma h.

1993.

[5℄ N. Crofts, M. Doerr, T. Gill, S. Stead, and M. Sti. Denition of the ido on eptual referen e model 4.2. Te hni al report, CIDOC, 2005. [6℄ W. Daelemans, J. Zavrel, K. van der Sloot, and A. van den Bos h. Timbl: Tilburg memory based learner, version 5.1, referen e guide. ilk te hni al report 04-02. Te hni al report, Tilburg University, 2004. [7℄ S. Dipper. Xml-based stand-o representation and exploitation of multi-level linguisti annotation. In

Berliner XML Tage 2005, 2005. Pro eedings of the Se ond International Conferen e on Language Resour es and Evaluation (LREC 2000), 2000. D. Harman, editor. The Fourth Text REtrieval Conferen e (TREC-4). National Institute of

[8℄ L. Dybkjaer, editor.

[9℄

Standards and Te hnology, 1999.

The Fourteenth Text REtrieval Conferen e (TREC-14). National

[10℄ D. Harman and E. Voorhees, editors.

Institute of Standards and Te hnology, 2005. [11℄ J. Holmen, C.-H. Ore, and O. Eide. Do umenting two

CAA 2003 - Computer Appli ations and Quantitative Methods in Ar haeology. Bar International Series 1127 histories at on e: digging into ar haeology. In

2004, 2003. [12℄ J. Paijmans. Re ognizing related data in ar heologi al reports, 2007. [13℄ J. J. Paijmans. Tutorial for the smart information retrieval system. http://pi0959.kub.nl/Paai/Onderw/Smart/hands.html, augustus 1996.

Explorations in the Do ument Ve tor Model of Information Retrieval. PhD thesis,

[14℄ J. J. Paijmans.

Katholieke Universiteit Brabant, 1999.

Introdu tion to Modern Information Retrieval. M Graw-Hill New York [et . ℄ -

[15℄ G. Salton and M. J. M Gill. 448 pp., 1983.

Dir-2007

Overview

More details

Related Documents

Dir2007-02