ASHGATE
Third Edition
Indexing and searching languages
INTRODUCTION
This chapter introduces a range of principles associated with the concept of indexing and searching languages. The primary focus of this chapter is on indexing languages that are used to represent subjects. Accordingly the chapter starts by exploring the concept of a subject. Next the types of indexing languages are introduced and, finally, a range of principles that apply to the application and use of indexing languages. At the end of this chapter you will:

• recognize the complexities associated with naming subjects and the need to identify relationships between subjects
• be familiar with different approaches to indexing and with associated concepts such as specificity and exhaustivity
• understand the difference between natural and controlled languages
• be aware of the principles of vocabulary control
• understand the principles of thesaurus construction
• be familiar with the search facilities that are available to support post-coordinate searching.
APPROACHES TO SUBJECT RETRIEVAL
Users often approach information sources not with names in mind, but with a question that requires an answer or a topic for study. Users seek documents or information concerned with a particular subject. This is a common approach to information sources and, in order to provide for it, the document or document representation must include enough data to ensure that items on specific subjects are retrieved.

What is a specific subject? A rabbit is a rabbit; but is it? Europeans will have in mind the European rabbit, Americans the cottontail; they belong to different genera. A rabbit is a concrete entity - that is, we can see it and pick it up (preferably not by its ears) and define it by its physical characteristics (long ears, furry, weighs around a kilogram) and behaviour (hopping movement, digs burrows, breeds freely). Abstract concepts can be more difficult to pin down. Some are fairly straightforward, like Music (encyclopaedia definition: 'the organized movement of sounds through a continuum of time'); some, like Geography ('the science that deals with the distribution and arrangement of all the elements of the earth's surface') look straightforward until we think of the vast scope of the subject; while Games defies definition - the philosopher Wittgenstein concluded after a long study that the subject could only be defined through examples.

Not only may subjects be in themselves difficult to define, we must remember that they do not exist in isolation in the way that named entities do. If we are looking for information on William Shakespeare, Mount Everest or Microsoft, we can be sure when we have found it that we have come to the right place, as these are all 'classes of one'. Common subjects, on the other hand, form networks of conceptual relationships with other subjects. If we are trying to identify a rabbit, we may not be entirely sure that what we saw was not a hare; the reader in a library looking for the geography books is likely to be directed to at least three widely separated sets of shelves; and library readers everywhere are notorious for asking for the games section when what they are looking for is information on chess. Any system of subject retrieval must then have a mechanism for directing users to other, closely related, subjects.
WHY INDEX?
Now that cheap online storage and retrieval of full text are commonplace, the value of indexing has been questioned. If the individual words in a text are immediately accessible in any combination, why go to the trouble of constructing indexes at all? Why not simply search for combinations of words from the text? The assumption behind this attitude is that a text is 'about' what it mentions. Fairthorne put his finger on the weakness of this assumption:

    Moby Dick is about a whale, Othello is about a handkerchief, and about other things. The difficulties are to identify which of the things mentioned refer to relevant topics, and how to deal with topics of the document that are not mentioned explicitly. Parts of the document are not always what the entire document is about, nor is a document usually about the sum of the things it mentions. (Fairthorne, 1969, p. 29)

In other words, this paragraph has just mentioned a whale and a handkerchief, but nobody would suggest that it is about those things. It is the indexer's job to ensure that a document's overall topic and, perhaps, its major constituent themes are adequately represented.
WHAT IS A DOCUMENT ABOUT?

Information retrieval is in general concerned with what a text is about rather than what it means. What is meaning? One point of view holds that meaning is inferential: a scientific paper may be about a statistical correlation between tobacco smoking and lung cancer; what it means is that the one may cause the other. Another argument states that indexers should adopt a neutral position and not attempt to impose upon the reader their views on what a document means. There is also the point of view - grounded in literary theory - that meaning is interactive (and to that extent subjective), the result of the interaction between the text and the individual reader. Perhaps the most powerful argument against indexers attempting to represent the meaning of documents is economic: it would simply take too long to do. A trained indexer can grasp what a document is about by scanning it rapidly. To attempt to extract its meaning would involve far closer study, as well as requiring expert subject knowledge.

APPROACHES TO INDEXING

An indexing language can be defined as the terms or codes that might be used as access points in an index. A searching language can be defined as the terms that are used by a searcher when specifying a search requirement. If the terms or codes are assigned by an indexer when a database is created, then the indexing language is used in indexing. The same terms or codes may also be used as access points to records during searching. While the indexing language may be distinct from the searching language, clearly, if retrieval is to be successful, the two must be closely related.

Indexing languages may be of two different types: controlled-indexing languages or assigned-term systems, and natural-indexing languages or derived-term systems. Each of these is briefly discussed below.

Controlled-indexing languages (assigned-term systems)
With these languages a person controls the terms that are used as index terms. Controlled-indexing languages may be used for names and other labels but much emphasis is placed upon languages with terms that describe subjects. Normally an authority list identifies the terms that may be assigned. Indexing involves a person assigning terms from this list to specific documents on the basis of subjective interpretations of the concepts in the document; in this process the indexer exercises some intellectual discrimination in choosing appropriate terms.
There are two types of subject-based controlled-indexing languages: alphabetical-indexing languages and classification schemes. In alphabetical-indexing languages, such as are recorded in thesauri and subject headings lists, subject terms are the alphabetical names of subjects. Control is exercised over which terms are used, and relationships between terms are indicated, but the terms themselves are ordinary words. In classification schemes each subject is represented by a code or notation. Classification schemes are particularly concerned to place subjects in a framework that crystallizes their relationships one to another.

More generally though, classification is implicit in all indexing. A document in which content is wholly or partially specified in the index term RABBITS is thereby classed with other documents to which the same specification has been applied. Controlled-indexing languages take the process of classification one stage further, by displaying semantic links, between rabbits and hares for example. Formal bibliographic classification schemes, such as the DDC and LCC classifications, display these relationships in a systematic manner. They are able, in addition, through their notation to exclude particular connotations of meaning: thus DDC's 599.322 denotes rabbits as zoological entities, but not as pets (which would be 636.9322).

Thesauri have always been a feature of the document management systems that have been designed to manage larger collections. They are increasingly featuring in OPACs and other information retrieval environments, and their applicability for Internet applications is of interest. Thesauri typically show the controlled indexing term, with related, narrower and broader terms, as shown in Figure 5.6. They may be displayed in a window during search strategy formulation, to aid a user in the selection of terms. Often terms can be selected from the thesaurus listing simply by clicking on them. Hypertext links in thesauri listings can be used to move between different occurrences of the same term in the list.

Another application of thesauri is as a basis for automatic indexing. All terms in the documents that appear in the thesaurus will generate an entry in the inverted index. Related applications of thesauri are in the creation of semantic nets and semantic knowledge bases.

Natural-indexing languages (uncontrolled or derived-term systems)
These languages are not really a distinct or stable language in their own right, but rather are the 'natural' or ordinary language of the document being indexed. Strictly, natural language systems are only one type of derived-term system. A derived-term system is one where all descriptors are taken from the document being indexed. Thus, author indexes, title indexes and citation indexes, as well as natural language subject indexes, are derived-term systems. Any terms that appear in the document may be candidates for index terms. Emphasis has traditionally been on the terms in titles and abstracts, but increasingly the full text of the document is used as the basis for indexing.

Natural language indexing using the full text of the document may be very detailed, and in some systems some mechanism for deciding which terms are the most important in the indexing of a given document may be appropriate. Such mechanisms are often based upon statistical analysis of the relative frequency of occurrence of terms. Natural language indexing can be executed by a human indexer, or automatically by the computer. The computer might index every term in the document, apart from a limited stop-list of very common terms, or may only index those terms that have been listed in a computer-held thesaurus.

Natural language indexing and controlled language indexing are used extensively in many information retrieval applications. Both are used in retrieval on CD-ROM, via the online search services, in document management systems and in online public access catalogues. Controlled-indexing languages are claimed to be more consistent and therefore more efficient and straightforward for the searcher, but research has failed to prove this convincingly. The dilemma facing systems designers is that to offer anything other than natural language indexing in the context of the huge databanks available through the Internet would be prohibitively expensive. On the other hand, controlled language indexing is seen as valuable in a supportive environment for inexperienced users because they do not need to navigate all the variations inherent in natural language. Significant effort is being directed towards the development of system interfaces that manage this variability, either implicitly or explicitly, on behalf of the user. Many databases include terms from controlled indexing languages (often including both alphabetical indexing languages and classification schemes) and also support searching on the text of the record, thus covering all options.
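The two automatic options described above (index every term outside a stop-list, or index only terms found in a computer-held thesaurus) can be illustrated with a minimal sketch. The stop-list, the thesaurus terms, the two sample titles and the function names are all invented for illustration and are not taken from any real system.

```python
import re
from collections import defaultdict

# Hypothetical stop-list and thesaurus, invented for illustration only.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are"}
THESAURUS_TERMS = {"rabbits", "hares", "pets"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents, allowed_terms=None):
    """Map each indexed term to the set of document ids that contain it.

    If allowed_terms is given, only those terms are indexed (the
    thesaurus-restricted option); otherwise every term that is not on
    the stop-list is indexed.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term in STOP_LIST:
                continue
            if allowed_terms is not None and term not in allowed_terms:
                continue
            index[term].add(doc_id)
    return dict(index)


# Two invented document titles standing in for full texts.
documents = {
    1: "The care of rabbits and hares",
    2: "Keeping rabbits as pets in the home",
}

full_index = build_inverted_index(documents)
thesaurus_index = build_inverted_index(documents, THESAURUS_TERMS)
```

In this sketch `full_index` contains every word except the stop-list entries, while `thesaurus_index` contains only the three thesaurus terms; a real system would also need to handle word variants, phrases and term weighting.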
FEATURES OF RETRIEVAL SYSTEMS

EXHAUSTIVITY AND CONTENT SPECIFICATION

It was suggested above that indexes attempt to specify content by means of single words or phrases. Clearly, the whole of the subject content cannot be specified by anything less than the complete text (and may well require more, in the way of footnotes and other glosses, as with the heavily annotated editions of classic writers). Indexing has to try to sum up the salient points, while ignoring the non-essentials. This can be done at a number of levels, which, even though they are presented here as distinct strata, form a continuum. Exhaustivity of indexing is the name given to the depth of indexing which it is the policy of a given indexing system to employ. Exhaustivity is therefore a management decision. The level of exhaustivity at which a system operates can either be built into the system (for example, by restricting the number of fields available for index terms), or it can be controlled operationally, by giving instructions to indexers.

[Figure 5.1 Comparing uncontrolled and controlled indexing languages]
Summarization refers to the process of conveying the overall subject content of a document in a single word or a short phrase or structured heading: for example, RABBITS, or BREEDS OF RABBITS, or RABBITS - BREEDS. Indexing at the level of summarization is commonly applied to graphic material - photographs and the like - which convey information perceptually; and also - particularly - to books, which normally have their own detailed indexing systems in the form of back-of-book indexes and contents lists. Library catalogues and published bibliographies are nearly always indexes at the level of summarization. So, too, are some published indexes to periodical literature: for example, British Humanities Index and the range of indexes - Humanities Index, Education Index, etc. - published by the H. W. Wilson Company.

A second level of exhaustivity is found in many databases that are indexes to collections or to periodical literature, and select the most significant subjects in the text - often around six to twelve controlled descriptors. In addition, the words in the title and abstract are available for searching. Contents lists operate at this level.

Even more exhaustive are back-of-book type indexes: indexes to individual documents, which should list every subject discussed in the text (Figure 5.2).

The ultimate level of exhaustivity is provided by the text itself. In full-text retrieval systems any word or phrase is potentially available for searching. (Most systems have a stop-list of very common words that have not been indexed and cannot therefore be retrieved.) At this point we have a concordance rather than an index.
1. Summarization
   Subject heading: Indexing (supplied by the Library of Congress)
   Title: Indexing books
   Series title: Chicago guides to writing, editing and publishing

2. Most significant subjects
   Chapter headings:
   1. Introduction to book indexing
   2. The author and the index
   3. Getting started
   4. Structure of entries
   5. Arrangement of entries
   6. Special concerns in indexing
   7. Names, names, names
   8. Format and layout of the index
   9. Editing the index
   10. Tools for indexing

3. Detailed subject specification
   Index (part):
   Abbreviations
      double-posting, 130
      explaining, 12, 70
      spelling out, 128-29
   access points
      converting subentries to main headings, 219
      main heading as primary, 17, 211
   acronyms
      alphabetizing, 130

4. The full text
   Alphabetizing of Abbreviations and Acronyms
   Abbreviations and acronyms should be alphabetized in the same way as the other entries in the index, whether letter-by-letter or word-by-word. They are not usually alphabetized as if they were spelled out. An exception many publishers allow is that the abbreviation U.S. may be alphabetized as though spelled out. This allows a term like U.S. Bureau of Reclamation to interfile with other entries such as United States Coast Guard.

Figure 5.2 Levels of exhaustivity within a single work
Source: Mulvany, 1994.

Figure 5.3 gives examples of indexing the same journal article at different levels of exhaustivity. Compare the Abstract, Descriptors, and Identifiers in each. (NB: ERIC and PsycInfo define 'Identifiers' differently.) ERIC's abstract has 44 words; PsycInfo's has 96. For Descriptors and Identifiers the word counts are 20 and 46.

[Figure 5.3 Indexing at different levels of exhaustivity]

Specificity

Specificity is an aspect of controlled language systems. It refers to the vocabulary of the system, and denotes the extent to which we are able to specify subject content when indexing. Dewey Decimal Classification, for example, specifies rabbits as domestic animals at class 636.9322. This class is, however, unable to specify individual breeds of rabbit: there are no subclasses for lop-eared or Angora rabbits or any other breed. Neither can this class distinguish between rabbits kept as pets and rabbits grown for meat or for their fur. A specialist manual on keeping pet Angora rabbits has to be classed with all other works on rabbits as domestic animals. This clearly makes searching less precise, as the searcher has to sift through a number of marginally relevant items all classed at the same place. Specificity thus improves the precision of a search: that is, its ability to sift out unwanted material.

Special systems (i.e. systems confined to one subject area or other field) often use differential levels of specificity. Topics that are central to the subject field are indexed at a higher level of specificity than peripheral subjects. For example, if Domestic Animals is a system's principal subject field, it would be quite likely to make specific provision for the various breeds of rabbit. If the subject field is something remote, however, there might be no specific provision even for rabbits: we might have to include them under a more general term, like Pets.

Specificity and exhaustivity are related to the extent that in practice greater exhaustivity needs to be matched by greater specificity in the indexing terms. Most book indexes, for example, are both specific and exhaustive. The combination of specificity and exhaustivity is often referred to as depth of indexing.
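The effect of specificity on precision can be put in numerical terms using the usual definitions of precision (the proportion of retrieved items that are relevant) and recall (the proportion of relevant items that are retrieved). The sketch below is illustrative only: the ten-document collection, the document numbers and the variable names are assumptions made up for the example.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical collection: documents 1-10 are all classed at one broad
# class for rabbits as domestic animals; only documents 1 and 2 are
# actually about keeping pet Angora rabbits.
relevant = {1, 2}
broad_class_search = set(range(1, 11))  # everything at the broad class
specific_term_search = {1, 2, 3}        # a more specific (hypothetical) term

broad_precision = precision(broad_class_search, relevant)        # 2 of 10
specific_precision = precision(specific_term_search, relevant)   # 2 of 3
broad_recall = recall(broad_class_search, relevant)
specific_recall = recall(specific_term_search, relevant)
```

Both searches find the two relevant documents, so recall is the same, but the broad class forces the searcher to sift through eight unwanted items where the more specific term leaves only one.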
Complex topics
A final set of definitions concerns the way complex topics are handled. A document may not be simply about rabbits or apples or chess; it may very well deal with some more precise aspect, like breeds of rabbits, or the effect of prestorage heat treatment on the shelf life of apples. The traditional method of dealing with complex topics has been to encapsulate as much of the topic as possible into a single heading, RABBITS - BREEDS for example, or APPLES - SHELF LIFE - EFFECT OF HEAT TREATMENT. Indexes of this kind are known as pre-coordinate indexes, because the topics that comprise a heading are strung together or coordinated by the indexer in advance of any searches that may be carried out on any of the topics represented within the heading. These indexes require elaborate rules for the consistent construction of headings, and will be considered in Chapter 6.

Because of the lack of flexibility of pre-coordinate headings, many systems employ a quite different method of handling complex topics. Here the subject of a document is represented by a number of one-concept terms - these are the descriptors in Figure 5.3 - and the searcher is able to combine as many or as few of them as are required, using Boolean logic: for example, CHILD ATTITUDES AND MARITAL CONFLICT. Systems employing this method of indexing and searching are known as post-coordinate systems. The earliest of these systems were card-based, using specialized stationery and other equipment, but nearly all systems in use today are computerized.

User-friendliness
In 1960 Calvin Mooers expounded his famous law:
    An information system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have it.
The corollary of this is that the use of information systems and services will be increased if steps are taken to improve their user-friendliness. Some of the factors influencing user-friendliness are:
• Accessibility: the service should be physically accessible to users.
• Ease of use: the service should be within the intellectual capabilities of its users. While great strides have been made in improving user interfaces (see Chapter 4), there is always a trade-off between ease of use and system capabilities. The more powerful the functionality of a system, the more complex the instructions and protocols for using it, and the greater its proneness to error on the part of the user.
• System error, as opposed to user error. This includes system malfunctions, and output errors caused by inadequacies in the system, as with some automatic indexing and retrieval systems.
• Form of output: output may be in the form of actual documents, or document surrogates; if the latter, output may or may not be downloadable. The least convenient output is non-downloadable document surrogates, for example, with manually searched catalogues.
• Delay: whether in accessing the service, or in obtaining the search results.
SEARCH FACILITIES IN POST-COORDINATE SEARCHING

SEARCH LOGIC

Search logic is the means of specifying combinations of terms that must be matched for successful retrieval. Boolean search logic is employed in searching most systems. It may be used to link terms from either controlled- or natural-indexing languages, or both. The logic is used to link the terms that describe the concepts present in the statement of the search. As many as 20 to 30 or more search terms may be linked together by search logic in order to frame a search statement. Search logic permits the inclusion in the search statement of synonyms and related terms, and also specifies acceptable and unacceptable search-term combinations. Search strategies often need to be more complex with natural language terms, in order to accommodate all the potential spelling variations and near-synonyms. In an online search the search statements are entered one at a time, and feedback is available at each stage. The searcher specifies a search statement and the computer responds with the number of relevant records. With this type of search facility, the search strategy can be refined to yield a satisfactory output.

The Boolean logic operators are AND, OR and NOT:

• AND reduces the number of items retrieved: CHILDREN AND PARENTS retrieves items in which both terms occur.
• OR increases the number of items retrieved: CHILDREN OR PARENTS retrieves items in which either term occurs.
• NOT subtracts the second term from the first: CHILDREN NOT PARENTS retrieves items in which only the first term occurs.

The operators are subject to some variation. A few systems use AND NOT or ANDNOT. Also, operators may often be abbreviated, so that on Dialog * can be used to represent AND and + for OR.

It is common to use more than one operator in a search statement, as in for instance: CHILDREN AND PARENTS AND CONFLICT OR DISCORD. Once more than one operator has been introduced, the priority of execution needs to be considered. In the example above it is necessary to specify whether the search should be conducted as (CHILDREN AND PARENTS AND CONFLICT) OR DISCORD or as CHILDREN AND PARENTS AND (CONFLICT OR DISCORD). The latter is the intended order of execution, and must be specified by the use of parentheses. The use of parentheses in formulating a search statement is often known as nesting. Each software package (or search service) has its own priority rules (for example, AND may always be processed before OR), and successful searching depends on heeding these rules, and making appropriate use of parentheses. Nesting forces priority, and offers a clear specification from the searcher's perspective.
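The three operators, and the difference that nesting makes, can be sketched with ordinary set operations. The record numbers in the postings below are invented for illustration; no real database or search service is being modelled.

```python
# Hypothetical postings: each term maps to the ids of the records indexed by it.
postings = {
    "CHILDREN": {1, 2, 3, 5},
    "PARENTS": {1, 2, 4},
    "CONFLICT": {1, 4, 6},
    "DISCORD": {2, 6, 7},
}

children = postings["CHILDREN"]
parents = postings["PARENTS"]
conflict = postings["CONFLICT"]
discord = postings["DISCORD"]

# AND is set intersection, OR is set union, NOT is set difference.
both = children & parents        # CHILDREN AND PARENTS
either = children | parents      # CHILDREN OR PARENTS
only_first = children - parents  # CHILDREN NOT PARENTS

# The same four terms grouped in the two ways discussed above.
left_grouped = (children & parents & conflict) | discord
nested = children & parents & (conflict | discord)
```

With these postings `left_grouped` is `{1, 2, 6, 7}`, because every record indexed by DISCORD is admitted regardless of the other terms, while `nested` is `{1, 2}`. The explicit parentheses in the code play the same role as nesting in a search statement.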
RELEVANCE (CONFIDENCE) RANKING AND BEST MATCH SEARCH LOGIC

A weakness of Boolean searching is that it returns straight hit or miss responses, and items that partially fulfil the search specifications are excluded. For example, a search on CHILDREN AND PARENTS AND (CONFLICT OR DISCORD) would not return items containing the terms PARENTS AND CONFLICT but not CHILDREN. Many search systems now relevance rank results, listing items matching any of the search terms, with the best matches first. This can be done in a variety of ways, e.g.:

CHILDREN AND PARENTS AND CONFLICT AND DISCORD
...
CHILDREN
PARENTS
CONFLICT
DISCORD
A variation common in Web search engines is to use implicit OR, then relevance rank the results so that AND combinations are ranked before OR combinations, and adjacency before either. (This is one reason for the huge search sets generated by many simple Web searches.) The user could for example simply enter CHILDREN PARENTS CONFLICT DISCORD.
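One simple way to realize best-match ranking is to treat the query as an implicit OR and order the records by the number of query terms they match. The sketch below assumes that approach; it does not model adjacency, and the postings and the function name `rank_by_matches` are invented for illustration.

```python
def rank_by_matches(postings, query_terms):
    """Return (record_id, matched_term_count) pairs, best matches first."""
    counts = {}
    for term in query_terms:
        for record_id in postings.get(term, set()):
            counts[record_id] = counts.get(record_id, 0) + 1
    # Most matched terms first; ties are broken by record id so that the
    # output order is predictable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical postings: each term maps to the ids of the records indexed by it.
postings = {
    "CHILDREN": {1, 2, 3, 5},
    "PARENTS": {1, 2, 4},
    "CONFLICT": {1, 4, 6},
    "DISCORD": {2, 6, 7},
}

ranked = rank_by_matches(postings, ["CHILDREN", "PARENTS", "CONFLICT", "DISCORD"])
```

Record 4 here is indexed by PARENTS and CONFLICT but not CHILDREN. A strict Boolean search on CHILDREN AND PARENTS AND (CONFLICT OR DISCORD) would exclude it, whereas the ranked list places it in the middle, below the records that match three terms and above those that match only one.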
In most search statements it is possible to designate certain concepts as being more significant than their neighbours. In its role in formulating search profiles, weighted-term logic may be introduced either as a search logic in its own right, or as a means of reducing or ranking (relevancy ranking) the search output from a search whose basic logic is Boolean.

In an application where weighted-term logic is the primary search logic, each search term in a search profile is allocated a weight. These weights can be allocated by the searcher, but more commonly are allocated automatically. Automatic allocation of weights is usually based on the inverse frequency algorithm, which weights terms in accordance with the inverse frequency of their occurrence in the database. Thus common words are not seen to be particularly valuable in uniquely identifying documents. A further refinement considers both the frequency and the positioning of the terms - i.e., words in important positions (titles, headers, early in the document) are given a higher ranking than words appearing elsewhere.

If the weights are assigned by the searcher, they are associated with a relevance rating on a document which is found containing that term as a search term. Search profiles combine terms and their weights in a simple sum, and items rated as suitable for retrieval must have a combined weight that meets or exceeds a specified threshold weight. A simple Selective Dissemination of Information (SDI) type profile showing the use of weighted-term logic is shown below:
Search description: The use of radioactive isotopes in measuring the productivity of soil.

A simple search profile (which does not explore all possible synonyms) might be:

8 Soil                      4 Plants
7 Radioisotopes             3 Food
7 Isotopes                  2 Environment
6 Radioactive               2 Agriculture
5 Radiation                 1 Water
5 Agricultural chemistry    1 Productivity

A threshold weight appropriate to the specificity of the searcher's enquiry must be established. For instance, a threshold weight of 12 would retrieve documents with the following combinations of terms assigned, and these documents or records would be regarded as relevant:

Soil and Plants
Soil and Radioisotopes
Soil and Agricultural chemistry
Radioisotopes and Agricultural chemistry
Soil and Food and Agriculture

Documents with the following terms assigned would be rejected on the grounds that their combined weights from each of the terms identified in the records did not reach the pre-selected threshold:

Productivity and Water
Food and Soil
Radioactive and Agriculture
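The arithmetic behind this example can be checked with a short sketch. The weights and the threshold of 12 are those of the profile above; the function names are illustrative, and the comparison is assumed to be 'greater than or equal to', since the example treats Soil and Plants (8 + 4 = 12) as retrieved.

```python
# Weights from the search profile in the worked example.
PROFILE = {
    "Soil": 8,
    "Radioisotopes": 7,
    "Isotopes": 7,
    "Radioactive": 6,
    "Radiation": 5,
    "Agricultural chemistry": 5,
    "Plants": 4,
    "Food": 3,
    "Environment": 2,
    "Agriculture": 2,
    "Water": 1,
    "Productivity": 1,
}
THRESHOLD = 12


def score(assigned_terms, profile=PROFILE):
    """Sum the profile weights of the terms assigned to a record."""
    return sum(profile.get(term, 0) for term in assigned_terms)


def is_retrieved(assigned_terms, threshold=THRESHOLD):
    """A record is retrieved when its combined weight reaches the threshold.

    Assumption: 'reaches' means >=, as implied by the worked example.
    """
    return score(assigned_terms) >= threshold


accepted = [
    ["Soil", "Plants"],
    ["Soil", "Radioisotopes"],
    ["Soil", "Agricultural chemistry"],
    ["Radioisotopes", "Agricultural chemistry"],
    ["Soil", "Food", "Agriculture"],
]
rejected = [
    ["Productivity", "Water"],
    ["Food", "Soil"],
    ["Radioactive", "Agriculture"],
]
```

The accepted combinations score 12, 15, 13, 12 and 13 respectively; the rejected ones score 2, 11 and 8.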
Alternatively no threshold weight may be used, and then users will simply be presented with records in ranked order, and can make their own choice as to how far down the list they choose to scan. Weighted-term search logic may also be used to supplement Boolean logic. Here weighted-term logic is a means of limiting or ranking the output from a search that has been conducted with the use of a search profile that was framed in terms of Boolean logic operators. In the search, and prior to display or printing, references or records are ranked according to the weighting that they achieve, and records with sufficiently high rankings will be deemed most relevant, and be selected for display or printing. In this application, relevancy ranking is most often achieved through an analysis of the number of occurrences of search terms or hits in the document.
The inverted indexes that need to be created to support Boolean searching and relevance ranking are different. An inverted index may be stored in the form of a large matrix, with each row corresponding to an individual term, and each column to an individual record. A Boolean search simply requires that each of the cells in the matrix has a value of 1 or 0. A mechanism that uses some type of term-weighting scheme will require each cell of the matrix to have a value z, where z is the result of a more complicated function of a number of variables. These values may be calculated on the basis of term occurrences. Each record may be considered as a vector or sequence of values.
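The difference between a Boolean matrix row and a weighted, thresholded search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scheme of any particular system: the term weights, the record identifiers and the helper names (`boolean_row`, `score`, `weighted_search`) are all hypothetical, weights are simply summed, and a record is taken to pass if its score is at least the threshold.

```python
# Hypothetical term weights chosen by the searcher (illustrative values only).
weights = {"soil": 7, "plants": 5, "food": 3, "agriculture": 2}

# Index terms assigned to each record (hypothetical records).
records = {
    "R1": {"soil", "plants"},
    "R2": {"food", "agriculture"},
    "R3": {"soil", "food", "agriculture"},
}


def boolean_row(term):
    """One row of a Boolean matrix: 1 if the record carries the term, else 0."""
    return {rec: int(term in terms) for rec, terms in records.items()}


def score(terms):
    """Sum the weights of the search terms that are present in a record."""
    return sum(w for t, w in weights.items() if t in terms)


def weighted_search(threshold=None):
    """Rank records by score; if a threshold is given, drop records below it."""
    scored = [(score(terms), rec) for rec, terms in records.items()]
    if threshold is not None:
        scored = [(s, rec) for s, rec in scored if s >= threshold]
    # Highest score first; ties are broken by record identifier.
    return [rec for s, rec in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

Calling `weighted_search(12)` keeps only the records whose summed weights reach the threshold, whereas `weighted_search()` returns every record in ranked order and leaves the cut-off to the user.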
SEARCH FACILITIES
Standard retrieval facilities are available in most information retrieval applications. These facilities have been developed to cater for a text-based environment, where the user does not know what documents are available and/or does not know the terms by which records can be retrieved. In other database applications, where records can be retrieved through pre-assigned codes, many of the facilities listed below are not necessary. These facilities cater for the uncertainty in document-based systems, such as those of the external online search services, document management systems, CD-ROM applications and online public access catalogues. In command-based systems these facilities are accessed through the use of an appropriate command; in GUIs the options are likely to be embedded in pull-down menus, or buttons and check boxes on dialog boxes.
Set-up facilities
These facilities set up the environment in which the search will proceed and are therefore environment-dependent. Help and news are common, as well as connection facilities, sometimes in the form of logon and logoff facilities. Web-based interfaces often also offer access to information about the search service, its databases and customer service arrangements. The selection of database is usually a further preliminary.
Selecting search terms
Identification of search terms can be assisted by the display of search-term or index listings. The display may show index or search terms and, sometimes, their number of postings.
Entering search terms
Once a search term has been selected it must be entered. This may merely involve clicking on the term in a search-term listing, typing the term in, or using the term as a component in a more complex search statement. Entering the term itself may be sufficient, or a specific command might need to be issued. The system responds by creating a set of records indexed by that term and displays the search term and the number of records in the set.
Combining search terms
Search terms may be combined into search statements with the aid of a search logic as discussed previously in this chapter. Boolean search logic or relevance ranking is common.
Entering phrases
Many search engines and OPACs allow the user to enter a search phrase, such as PURCHASING BOOKS ON THE INTERNET. As discussed above, the system will often treat this as an implicit OR search, although some search engines may process phrases as if each of the terms were linked together with AND. Thus the above phrase would be searched as: PURCHASING AND BOOKS AND INTERNET.
Specifying sections of documents or fields in records to be searched
The ability to search for the occurrence of terms in a specific section of a document or in specific fields in a record facilitates more precise searching. For example, through the specification of whether a search might be conducted on a subject field or author field it may be possible to differentiate between documents on a person (say SHAKESPEARE) as subject and as author. In order to be able to specify appropriate field labels, it is necessary to know the fields in a given database and which fields are indexed for successful field-based searching. Often it may be possible to search on a combination of fields or sections.
Truncation and search-term strings
Truncation supports searching on word stems. By using the truncation character at either end of a word, the system can be instructed to search for a string of characters, regardless of whether that string is a complete word. For example, if the user asks for a search on COUNTR* this would retrieve records including words such as Country, Countries, Countryside and Countrywide. The use of truncation eliminates the need to specify each word variant, and thus simplifies search strategies. This is particularly useful in natural language information retrieval systems where word variations are uncontrolled. The most basic truncation is right-hand truncation, where characters to the right of the character string are ignored. Left-hand truncation can be useful in circumstances where a variety of prefixes might occur. This is particularly useful in searching chemical databases. For example, *CHLORIDE might retrieve records of 'chloride' with various prefixes. Truncation, or masking as it is called in this context, is sometimes also available in the middle of words. Here truncation can be useful to cater for alternative spellings. So, for example, NA*IONAL will search for records with National and Nacional. In order to control the array of word variants that might be retrieved as the result of a truncation, in some systems it is possible to specify the number of characters that are to appear after the truncated string. For example, EMPLOY??? might select terms with a maximum of three additional characters.
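Right-hand, left-hand and internal truncation can all be modelled by translating the search term into a regular expression and testing it against the index terms. The following is a minimal sketch, assuming `*` stands for any run of word characters and each `?` for at most one character; real systems use differing symbols, and the function names here are hypothetical.

```python
import re


def truncation_to_regex(pattern):
    """Translate a truncated search term into a compiled regular expression.

    Assumed conventions for this sketch: '*' matches any run of word
    characters (possibly empty) and '?' matches at most one word character.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w?")
        else:
            parts.append(re.escape(ch))
    # Anchor both ends so that only whole index terms are matched.
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def match_terms(pattern, index_terms):
    """Return the index terms that satisfy the truncated pattern."""
    rx = truncation_to_regex(pattern)
    return [t for t in index_terms if rx.match(t)]
```

With an index containing Country, Countries, Countryside and County, `match_terms("COUNTR*", ...)` returns the first three only, because County lacks the stem COUNTR.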
Proximity, adjacency and context searching
Often a subject is best described by a phrase of two, three or more words. Subjects such as Information Retrieval and Competitive Advantage need two words to describe them. It is useful if a search can be performed for such phrases. One obvious option is to search for the two words ANDed together, for example, INFORMATION AND RETRIEVAL. This should retrieve records containing the phrase but will also retrieve other records where these two words appear, but where they do not appear next to each other. This method, then, only allows crude phrase searching. Another option is to store such terms as phrases, possibly by inserting hyphens to mark phrases. Then, for example, INFORMATION-RETRIEVAL would be stored as one term in the inverted file. This method is satisfactory but is primarily applicable to controlled indexing; phrases must be marked at input, and searchers must enter the term in exactly the form in which it was originally entered.
A more flexible option is the use of proximity operators. There are various different kinds of proximity operators. These can require that:
• two words appear next to each other; for example INFORMATION ADJ RETRIEVAL, INFORMATION (N) RETRIEVAL, 'INFORMATION RETRIEVAL', depending on the search system
• two words appear within the same field, sentence or paragraph. The first of these is available on most search systems, and is obligatory on some
• two words be within a specified distance of one another, with the maximum number of words to come between the two words indicated by the user; e.g. Dialog has INFORMATION (W.3) RETRIEVAL
• two words be within a specified distance of one another, with the maximum number of words to come between the two words set by the system. The operator NEAR is found on many Web search engines.
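The adjacency and distance operators listed above amount to comparing word positions. A minimal sketch follows, assuming text is split on whitespace with no punctuation handling; the function names and the `max_between` and `ordered` parameters are hypothetical and do not correspond to the syntax of any named search service.

```python
def positions(words, term):
    """Return the positions at which term occurs in a list of words."""
    return [i for i, w in enumerate(words) if w.lower() == term.lower()]


def within(text, first, second, max_between=0, ordered=True):
    """True if the two terms occur with at most max_between words between them.

    max_between=0 with ordered=True models strict adjacency; ordered=False
    accepts the terms in either order, in the manner of a NEAR-style operator.
    """
    words = text.split()
    for i in positions(words, first):
        for j in positions(words, second):
            # Number of words lying between the two occurrences.
            gap = j - i - 1 if j > i else i - j - 1
            if i != j and gap <= max_between and (j > i or not ordered):
                return True
    return False
```

For example, `within("information storage and retrieval", "information", "retrieval", max_between=2)` succeeds, while the same call with `max_between=1` fails because two words intervene.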
Range searching and limiting
Range searching is particularly useful when selecting records on the basis of numeric or date fields. Range operators might, for instance, be used to select records according to a price field or publication date field. Fairly common range operators are:
EQ    equal to
NE    not equal to
GT    greater than
NG    not greater than
LT    less than
NL    not less than
WL    within the limits
OL    outside the limits.
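The range operators translate directly into comparisons on a field value. The sketch below is illustrative only: the two limit operators are written WL and OL here, the limits are assumed to be inclusive, and the record layout and function names are hypothetical.

```python
import operator

# Two-letter mnemonics mapped to Python comparison functions.
COMPARISONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "GT": operator.gt, "NG": operator.le,   # not greater than = less or equal
    "LT": operator.lt, "NL": operator.ge,   # not less than = greater or equal
}


def range_match(value, op, low, high=None):
    """Apply one range operator to a field value."""
    if op == "WL":                           # within the limits (inclusive)
        return low <= value <= high
    if op == "OL":                           # outside the limits
        return not (low <= value <= high)
    return COMPARISONS[op](value, low)


def select(records, field, op, low, high=None):
    """Return the records whose field satisfies the range operator."""
    return [r for r in records if range_match(r[field], op, low, high)]
```

A call such as `select(records, "year", "WL", 1995, 1999)` would keep records published from 1995 to 1999 inclusive.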
Although range searching is not appropriate in these contexts, examination of the contents of specific fields may allow searches to be limited by document type, language or source.
Displaying search or results sets
This facility shows the user how many documents, search terms and references were found, and thereby indicates whether it might be appropriate to further refine the search.
Displaying records
Once a successful search has been performed, it is necessary to display the records. OPACs first display one-line records and then allow the user to display the full record. Online search services offer a variety of commands for displaying records on the screen, offline printing and downloading. Default formats are the norm, but user-defined formats are becoming more common. These allow users to specify the range of fields to be displayed in the records, and other features of the display. In addition to specifying the record format, users need to be able to specify which records are to be displayed. OPACs tend to let users select records and display them one at a time. CD-ROM and online search services have commands or options that allow the set of records for display to be indicated.
Storing search sets
Many retrieval systems store sets of search specifications by assigning them a running number. The sets can then be reused within the same search session. This permits the user to construct the search in stages, combining search sets by number:
Set 1: CONFLICT
Set 2: DISCORD
Set 3: 1 OR 2
Set 4: CHILDREN
Set 5: PARENTS
Set 6: 4 AND 5
Set 7: 3 AND 6
This may look long-winded, but it reduces typographical errors and gives the searcher the opportunity to reuse any of the terms in different combinations in order to refine the search. A search statement in the form CHILDREN AND PARENTS AND (CONFLICT OR DISCORD) will generate only one search set, and the terms cannot be reused in other combinations. With systems that do not store search sets, a search has to be entered as a single statement, and once the user moves on to the next search the old search is lost. There may be a facility for refining a search (by performing a search on the set of search results instead of on the whole database), or for storing searches for reuse.
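The set-by-set approach can be sketched as a session object that numbers each result set so that later statements can refer back to it. This is an illustration only: the class name `SearchSession`, its methods and the toy inverted index in the usage note are hypothetical and not modelled on any particular retrieval system.

```python
class SearchSession:
    """Keeps numbered result sets so later statements can refer back to them."""

    def __init__(self, index):
        self.index = index      # inverted index: term -> set of record ids
        self.sets = {}          # set number -> set of record ids

    def _store(self, result):
        """Assign the next running number to a result set and return it."""
        number = len(self.sets) + 1
        self.sets[number] = result
        return number

    def find(self, term):
        """Create a set of the records indexed by a single term."""
        return self._store(set(self.index.get(term.upper(), set())))

    def combine(self, left, op, right):
        """Combine two stored sets by number with AND, OR or NOT."""
        a, b = self.sets[left], self.sets[right]
        result = {"AND": a & b, "OR": a | b, "NOT": a - b}[op]
        return self._store(result)
```

With a toy index such as `{"CONFLICT": {1, 2}, "DISCORD": {3}, "CHILDREN": {1, 3, 4}, "PARENTS": {1, 3, 5}}`, the sequence `find("conflict")`, `find("discord")`, `combine(1, "OR", 2)`, `find("children")`, `find("parents")`, `combine(4, "AND", 5)`, `combine(3, "AND", 6)` builds seven sets, and any of them remains available for reuse in a different combination.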
Search management
Search management includes opportunities to review the search strategy that has been adopted and, permanently or temporarily, to save a search profile for subsequent use. Temporary saves are useful for searches where a searcher might wish to reflect on a search, or otherwise come back and complete the search at a later point in time. Permanent saving of the search profile is usually associated with current awareness or selective dissemination of information. The search profile will be run on behalf of the user at regular intervals in order to identify new material, and this will be sent to the user as current awareness notifications. Intelligent agents and other push technologies that refine profiles over a period of time are a recent innovation in this area.
Advanced display options
Records in full-text databases are long, and a full record usually occupies several screens. In such circumstances, special display facilities can support browsing through relevant portions of the text. The ability to stop as soon as the screen is full is useful, as are facilities for moving backwards and forwards through the document. If the text is divided into numbered paragraphs, it is possible to select paragraphs for display. Another approach is to use a KWIC facility, which shows relevant index terms with bits of adjacent text in small windows. Another option that might prove useful is the ability to sort a set of records into order before displaying. Numeric or financial data may be best displayed in reverse or descending order. Some financial databases offer statistical presentation and analysis.
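A KWIC (keyword in context) display of the kind mentioned above can be approximated by showing each occurrence of a search term with a few words on either side. The sketch below is a simplification under stated assumptions: the function name `kwic`, the `width` parameter and the bracket notation are hypothetical, and only a handful of punctuation marks are stripped before matching.

```python
def kwic(text, term, width=3):
    """Return each occurrence of term with up to `width` words on either side."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        # Compare case-insensitively, ignoring common trailing punctuation.
        if w.strip(".,;:()").lower() == term.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}".strip())
    return lines
```

Each returned line is one small window of text, so a long full-text record can be scanned for relevant passages without paging through every screen.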
Multi-file searching
Where, as with the online search services, there are a number of databases that might generate relevant records in response to one search, multi-file search facilities are beneficial. The most user-friendly multi-file search option is where other databases can be searched without reformulating the strategy. This requires the system to make the appropriate adjustments in search terms and fields to be searched. The most refined multi-file searching then goes on to produce an integrated set of records drawn from several databases, and with duplicate records eliminated. Many online search services have a database of databases, such as Dialog's Dialindex (File 411). Available databases are grouped by subject. The searcher specifies a subject group to be searched and enters a search specification. The system returns the number of hits for each database.
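Both multi-file facilities described above, a count of hits per database and an integrated set with duplicates removed, can be sketched briefly. The assumptions are considerable: each database is a plain list of records, the same `query` function is valid for every database, and duplicates are recognised by a caller-supplied `key`, which is far cruder than real duplicate detection. All names are hypothetical.

```python
def hits_per_database(databases, query):
    """Count matching records in each database, as a database of databases might."""
    return {name: sum(1 for rec in recs if query(rec))
            for name, recs in databases.items()}


def merged_results(databases, query, key):
    """Merge matching records from all databases, keeping the first of each duplicate."""
    seen, merged = set(), []
    for recs in databases.values():
        for rec in recs:
            if query(rec) and key(rec) not in seen:
                seen.add(key(rec))
                merged.append(rec)
    return merged
```

The first function supports the choice of which databases to search; the second produces the single integrated list.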
Displaying the thesaurus
Where a controlled-indexing language has been used to provide index terms, a thesaurus will often be available in both printed and online formats. This thesaurus displays the controlled vocabulary used and relationships between terms, and is therefore a useful tool in narrowing or broadening searches. It is useful if the thesaurus can be displayed in a window to assist users as they attempt to develop a search strategy. Free-language thesauri that show relationships between terms may be available on some systems, but these take considerable effort to set up. GUIs offer fascinating opportunities for the display of graphical thesauri, showing multiple tree structures and explode options.
Hypermedia
Many systems, including the WWW, offer hypermedia searching. True hypertext searching relies upon an indexer establishing conceptual links between documents. Creators of Web pages often do this when they indicate which terms are to be used as links to other pages. However, in a large database this is very labour-intensive. An alternative is to rely upon the content, including the text and other objects in the record, and use the occurrence of objects or terms as the basis for hypermedia links. Thus if the same term or object appears in two records or documents, the user may move from one record to another by, say, clicking on the term or object and without explicitly returning to the index.
PRINCIPLES OF LANGUAGE CONTROL
At its simplest, a controlled language is a list (known as a thesaurus) of permitted terms and terms which are semantically related. All controlled languages have an alphabetical display sequence. There may also be a systematic (classified) sequence, either as the principal display or as an adjunct to the alphabetical display.
The vocabulary of a controlled language comprises the available terms used for indexing. Such terms describe the content of a document, and so are called descriptors. These may be words, or they may be coded into the notation of a classification schedule, where the notation translates the concepts behind the words. In either case, any relationships between the terms are fixed and permanent.
How does vocabulary control work in practice? Consider the term wagon. The word has a plural, wagons, as well as an alternative spelling, waggon(s), so we must opt for one or other spelling, and have some means of alerting those who use other forms of the word. A wagon is a wheeled vehicle for transporting freight; but it can denote a range of specific vehicles, according to whether the transport is by rail or by road and, if the latter, whether it has an engine or is drawn by a horse or tractor. So for indexing purposes we may well wish to limit our definition to just one of these. If road vehicles, they are often known by their manufacturer's name and perhaps the name of the model. Finally, there are other words or phrases whose meaning is synonymous or nearly so (cart, truck, lorry), or which belong to the same category but have a wider or narrower meaning (road vehicle, pick-up truck), or which, while not strictly belonging to the same category, are an essential part of the definition of a wagon (freight, goods).
With this in mind, let us look at the formal rules of vocabulary control. The basic source is the International and (identical) British Standard Guide to Establishment and Development of Monolingual Thesauri (ISO 2788:1986, BS 5723:1987; ISO, 1986).
METHODS OF VOCABULARY CONTROL
The methods of vocabulary control are:
• The form of a term (e.g. its grammatical form and spelling) is controlled.
• A choice is made between two or more synonyms or near synonyms to express the same concept.
• A decision is made on whether to admit proper names.
• A term may be deliberately restricted in meaning to the most effective meaning for the purposes of the thesaurus.
A thesaurus uses a range of symbols to indicate semantic relations. The commonest list is:
SN     Scope Note
USE    Use [another term in preference to this one]
UF     Used For
BT     Broader Term
NT     Narrower Term
RT     Related Term
CONSTRUCTION OF DESCRIPTORS
Terms used in indexing (descriptors) conform to one of the following types:
• Concrete entities:
  things and their physical parts: Reptiles, Feet, Microforms, Tropical regions
  materials: Solvents, Leather, Iron.
• Abstract entities:
  actions or events: Frost, Walking, Sleep
  abstract entities and properties: Hardness, News, Feminism, Poverty
  disciplines or sciences: Archaeology, Physics
  units of measurement: Kilometres.
If a candidate term does not conform to one of these types, it should not be used as it stands. In many cases, a term can be made to conform by being modified in accordance with the rules for controlling word forms. These rules are:
• Avoid verbs: use Cookery, Opposition not Cook, Oppose.
• Do not use attributes (adjectives or adverbs) on their own, but only to help define an entity: Yellow fever, Fast food; but not Yellow or Fast on their own. Very occasionally an attribute may be found on its own as a descriptor, if a noun is implied, e.g. Baroque [style].
• Avoid adjectives or adverbs of degree, unless they have a technical meaning: Small firms, Very high frequency.
• Use nouns and noun phrases, including adjectival and prepositional phrases as appropriate: Women workers, Prisoners of war.
• Use the plural number for 'count' nouns (how many?): Buildings, Paintings; also for substances or materials treated as a class with more than one member: Plastics, Poisons.
• Use singular for non-count nouns (how much?): Snow, Painting (notice that it is possible to use both singular and plural if their meanings are distinct), Physics (which is not a plural!); also for parts of the body which occur singly: Mouth, Respiratory system, but Lips, Lungs.
• Use the most widely accepted spelling: Romania, not Roumania. However, 'widely accepted' begs the question: by whom? Some readers will already have noticed the British (rather than American) English spellings of Archaeology and Kilometres above.
• Use slang or jargon only if well established and there is no acceptable alternative: Hippies have been with us for long enough to become established (except that they now seem to be transforming themselves into New Age travellers), but the Yuppies of the 1980s fell victim to economic depression. The application of this rule requires a fine judgement. The English language is full of neologisms, which journalists are quick to seize on. Indexers however are a cautious breed, loath to admit a new word that may drop out of fashion after a year or two. Art index did not accept Art nouveau as a heading until the 1950s, and the Library of Congress indexed computers as Electronic calculating machines until 1973. These are extreme cases, but serve to illustrate the fact that controlled vocabularies are poor at keeping up with changes in terminology.
• Use abbreviations and acronyms only if they are unambiguous and in common use within the subject field. Words like Radar have ceased to be regarded as acronyms, and it saves space and time to list bodies such as UNICEF as acronyms; but WHO or BP can only lead to ambiguity and misunderstanding. Again, there are grey areas: CD-ROMs? URLs?
• Differentiate homographs by a qualifier within parentheses: Cranes (lifting equipment); Cranes (birds). Homographs have the same spelling but different meanings. In a specialist thesaurus only one meaning might apply, and will be clear from the context.
• Use a scope note (SN) to exclude possible alternative meanings; or where the meaning is not immediately apparent; or to instruct indexers how a term is to be used. Scope notes in subject headings lists often start with a stock phrase, such as 'Here are entered . . .', as in this example from LCSH:
Conditionality (International relations)
Here are entered works on the requirement that nations meet certain conditions, such as restructuring their economies or respecting human rights, to be eligible to receive foreign aid or loans, or to have normal relations with other nations.
Classification schedules may use SN or some other formula. This example shows some of the kinds of instructional note that may appear after a DDC heading:
Grammar
Descriptive study of morphology and syntax
Including case, categorial, generative, relational grammar
Class here grammatical relations; parts of speech; comprehensive works on phonology and morphology, on phonology and syntax, or on all three
Class derivational etymology in 412
• Do not invert phrases: Storage batteries not Batteries, storage. (This particular example also carries the risk of ambiguity.)
The inversion rule is quite explicit. Compound terms, that is, multi-word concepts, have for many years been the bugbear of controlled language systems. The problem is whether to invert phrases and, if so, to what extent. In manually searched indexes, inversion brings a useful collocation, as the eye can run down a list from (say) 'Dogs' to 'Dogs, gun', 'Dogs, sporting' or 'Dogs, working'. The problem is that this construction could be extended to 'Dogs, hot', which would benefit nobody. Inverted headings are inherently unpredictable, and a distraction to searching. Today's rules of thesaurus construction presuppose machine-searched indexes, which with keyword access are indifferent to word order and are not intended for sequential searching. Compound terms then should not be inverted. Other restrictions also apply, notably that they are to be avoided altogether if a noun phrase can be factorized down:
Garage doors        use instead    Garages AND Doors
Coal mining         use instead    Coal AND Mining
Animal behaviour    use instead    Animals AND Behaviour
Factorizing is used when each separate word retains its original meaning. Factorizing should not be used for terms where the original meaning has been lost (Deck chairs), where a different type of entity is denoted (Silk flowers), where one term is used metaphorically (Elbow joints) or where one or both terms are semantically empty on their own (Family problems). Where compound topics like coal mining are admitted, they are in effect pre-coordinated terms, and are described as having a high pre-coordination level. Normal principles of thesaurus construction require compounds to be reduced to their constituent elements (e.g. Coal mining is retrieved by a Boolean search on COAL AND MINING). Occasionally a single descriptor may pre-coordinate two or more concepts, as with Life satisfaction (a search on LIFE AND SATISFACTION would be likely to generate all manner of false coordinations); or Student evaluation of teacher performance (to make it clear who is evaluating whom). Child behaviour is another instance. It could be factorized into CHILDREN AND BEHAVIOUR without loss of meaning; but the phrase is one that is widely used and understood, so that the convenience and precision of having a ready-made phrase could be considered to outweigh the disadvantage of making the vocabulary larger than is strictly necessary.
SEMANTIC RELATIONSHIPS
Next, semantic relationships must be considered. Semantic relationships between terms are, as their name implies, built into the meanings of the terms. They are permanent, in that they do not change according to whatever document is being indexed or searched. Semantic relationships are stable, i.e., they remain constant within an indexing language and do not change to accommodate the indexing requirements of particular documents. In theory, they ought to be transferable between indexing languages, but in practice other considerations (e.g. disciplinary bias and the degree of specificity required) often militate against this. The rules govern the relationships in meaning between pairs of words (for example Seas/Oceans, Legs/Knees, Food/Diet), or more precisely how the meaning of the second word relates to the first. There are three basic types of relationship: Equivalence, Hierarchical, and Associative.
Equivalence relationships
Equivalence relationships are relationships where two or more terms are regarded as having the same meaning. One is the preferred term (descriptor); all others are non-preferred terms (non-descriptors). Non-preferred terms are indicated in a thesaurus by the instruction:
Non-preferred term USE Preferred term
Asses USE Donkeys
Under the preferred term is placed the reciprocal of this instruction:
Preferred term UF Non-preferred term
Donkeys UF Asses
which serves as a check for both indexers and searchers. Older subject headings lists replace USE and UF with see and x, i.e.:
Asses see Donkeys
Donkeys x Asses
The first of these is displayed both in the subject headings list and in indexes based on it. The reciprocal x serves only as an aid to the indexer, and appears in the subject headings list only, and not in the index. The following subcategories of equivalence relationship have been distinguished:
1. Variant spellings, word forms, abbreviations, etc. These exemplify the rules for word form control shown above.
Roumania USE Romania        Romania UF Roumania
ROM USE Read Only Memory    Read Only Memory UF ROM
Mouse USE Mice              Mice UF Mouse
Singulars and plurals are distinguished if the plural is irregular and would file a considerable distance away from the singular.
Selling USE Sale            Sale UF Selling
Sea food USE Seafood        Seafood UF Sea food
This is because of the filing implications: filing sequences are often 'word by word' as opposed to 'letter by letter'.
Coal mining USE Coal AND Mining
This is an example of 'semantic factoring'.
2. Synonyms. Synonyms are rarely if ever completely interchangeable. In these examples, the co-reference is exact, but they differ in their usage.
Asses USE Donkeys                Donkeys UF Asses
Noble gases USE Inert gases      Inert gases UF Noble gases
Wireless USE Radio               Radio UF Wireless
Elevators USE Lifts              Lifts UF Elevators
German measles USE Rubella       Rubella UF German measles
Tax planning USE Tax avoidance   Tax avoidance UF Tax planning
3. Quasi-synonyms. These are terms whose meanings are different but overlap in ordinary usage, and which are treated as synonymous for indexing purposes.
Deceleration USE Acceleration    Acceleration UF Deceleration
Softness USE Hardness            Hardness UF Softness
The above two examples are antonyms: they represent different viewpoints of the same property continuum. The following two examples might or might not be regarded as equivalent, depending on subject field:
Fostering USE Adoption    Adoption UF Fostering
Barley USE Cereals        Cereals UF Barley
These would only be regarded as synonymous if they were on the fringe of the subject field of the thesaurus, where the generic level is set rather higher than for central themes. A thesaurus on social welfare would almost certainly have Fostering RT Adoption (an associative relationship), and one on agriculture would have Barley BT Cereals (a hierarchical relationship). This last is an example of 'upward posting': treating a narrower term as if it were equivalent to, rather than a species of, its broader term.
In a classification schedule synonyms are often shown in parentheses, e.g. DDC's 796.334 Soccer (Association football). Quasi-synonyms may be shown by an inclusion note, e.g. 796.33 Inflated ball driven by foot. Example: pushball (where Pushball is not given a specific place in the classification). (Note in passing that in classifications like DDC that are primarily designed for shelf arrangement, headings like 'Inflated ball driven by foot' are not intended as a verbal approach: their purpose is to define precisely and unambiguously the scope of a class together with any subclasses it may have.)
Hierarchical relationships
Here both terms are permitted terms (descriptors) and are linked in a broader-to-narrower hierarchy. This use of 'hierarchy' and 'hierarchical' is precise and technical, and is not to be confused with common, looser usages that denote any kind of descending sequence. Indexers are able to select the most specific term available to index concepts within the document. Searchers can extend a search by transferring from a first access term to a broader (more general) or narrower (more specific) term. Hierarchical relationships are indicated by BT (broader term) and NT (narrower term), e.g.
Sparrows BT Birds    Birds NT Sparrows
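The reciprocal entries described so far (USE with UF, BT with NT) can be generated mechanically, so that every instruction entered under one term produces its counterpart under the other. The sketch below is a hypothetical in-memory structure, not the format of any published thesaurus; the names `RECIPROCAL`, `add_relation` and `preferred` are assumptions made for illustration.

```python
from collections import defaultdict

# Each relation symbol paired with the symbol used for its reciprocal entry.
RECIPROCAL = {"USE": "UF", "UF": "USE", "BT": "NT", "NT": "BT", "RT": "RT"}


def add_relation(thesaurus, term, relation, target):
    """Record 'term relation target' together with the reciprocal entry."""
    thesaurus[term][relation].add(target)
    thesaurus[target][RECIPROCAL[relation]].add(term)


def preferred(thesaurus, term):
    """Follow any USE instruction; a term without one is itself preferred."""
    targets = thesaurus[term]["USE"] if term in thesaurus else set()
    return sorted(targets) or [term]


thesaurus = defaultdict(lambda: defaultdict(set))
add_relation(thesaurus, "Asses", "USE", "Donkeys")
add_relation(thesaurus, "Sparrows", "BT", "Birds")
```

After these two calls the entry for Donkeys carries UF Asses and the entry for Birds carries NT Sparrows, without either having been typed in separately. Because `preferred` returns a list, a factored instruction entered as two USE targets would yield both constituent terms.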
There are three subcategories of hierarchical relationship:
1.
Generic relationships are easiest to spot: they can be verified by the rule-of
thumb: some As are B: all Bs are A: Protest NT Rebellion Reptiles NT Snakes
Rebellion BT Protest Snakes BT Reptiles
But not: Pets NT Budgerigars
149
ACCESS
\4rhy does this not qualify? Because not all budgerigars are pets. NI, hierarchies are unique (a snake is a reptile and not any other kind of lir,i: creature), but the next example is a polyhierarchy (as is the Brain examp below). Rocks NT Coal Fossil fuels NT Coal
2.
Coal BT Rocks Coal BT Fossil fuels
Partitiue, or whole-part relationship. This applies only if the part is uniclto the whole: Science NT Chemistry
Head NT Brain Central nervous system NT Brain Canada NT Ontario
Chemistry BT Science Brain BT Head Brain BT Central nervous syster. Ontario BT Canada
But not: Buildings NT Doors
A door is a necessary part of any building, but cars, railway carriages r. also have doors.
3. Instance (class-of-one). Proper names may be acceptable or not, according to policy. (Sometimes they are regarded as identifiers rather than descriptors, and as such excluded from the thesaurus.)

Mountain regions NT Alps
Alps BT Mountain regions
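The rule of thumb for generic relationships can be read as a set test. The following is a minimal sketch under stated assumptions: the function name is_generic and the sample class members are invented for illustration, and 'some As are B; all Bs are A' is interpreted as 'B is a non-empty proper subset of A'.

```python
def is_generic(narrower_members: set, broader_members: set) -> bool:
    """Return True if all Bs are A and some (but not all) As are B."""
    # A non-empty proper subset: every B is an A, and A has other members too.
    return bool(narrower_members) and narrower_members < broader_members


# Hypothetical class members, chosen only to illustrate the test.
reptiles = {"adder", "gecko", "tortoise"}
snakes = {"adder"}
pets = {"caged budgerigar", "goldfish"}
budgerigars = {"caged budgerigar", "wild budgerigar"}

print(is_generic(snakes, reptiles))    # Reptiles NT Snakes qualifies
print(is_generic(budgerigars, pets))   # Pets NT Budgerigars does not
```

The second call fails the test because the wild budgerigar is a budgerigar but not a pet, which mirrors the reasoning given for rejecting Pets NT Budgerigars.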
Hierarchical relations modulate, i.e. they move a step at a time through the hierarchy, e.g.:

Science NT Chemistry                 Chemistry BT Science
Chemistry NT Organic chemistry       Organic chemistry BT Chemistry

and not, e.g., Science NT Organic chemistry. If an intermediate step is omitted, there is a very real likelihood that a search will skip over potentially useful material.

Older subject headings lists use see also and its reciprocal xx, e.g.:

Science see also Chemistry
Chemistry xx Science
The first of these would appear both in the subject headings list and in indexes based on them. The reciprocal xx serves only as an aid to the indexer, and appears in the subject headings list only and not in the index. By convention, indexes display see also references only in a downwards (generic-to-specific) direction. Users of these indexes are presumed to be aware that (e.g.) Chemistry is part of Science. See also does not distinguish hierarchical and associative relationships (see below).

In a classification schedule indentation is used to indicate hierarchical relationships, often with different typefaces. There may be a corresponding lengthening of the notation for more specific terms; but many bibliographic classification schemes either do this inconsistently (e.g. DDC) or do not set out to do it at all, as with the LCC and Bliss Bibliographic Classification (BC; 2nd edition, BC2).
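The reciprocal and step-by-step nature of BT/NT links can be sketched as a small data structure. This is an illustrative sketch only: the class name Hierarchy and its methods are assumptions made for the example, not part of any standard. Each link is stored one step at a time in both directions, a term may have more than one BT (polyhierarchy), and a search is extended by walking the NT links rather than by storing shortcuts such as Science NT Organic chemistry.

```python
from collections import defaultdict


class Hierarchy:
    """Stores reciprocal BT/NT links, one hierarchical step per link."""

    def __init__(self):
        self.bt = defaultdict(set)  # term -> its immediate broader terms
        self.nt = defaultdict(set)  # term -> its immediate narrower terms

    def link(self, broader: str, narrower: str) -> None:
        # Recording NT always records the reciprocal BT as well.
        self.nt[broader].add(narrower)
        self.bt[narrower].add(broader)

    def all_narrower(self, term: str) -> set:
        """Collect every narrower term by following NT links step by step."""
        found, stack = set(), [term]
        while stack:
            for child in self.nt.get(stack.pop(), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


hierarchy = Hierarchy()
hierarchy.link("Science", "Chemistry")
hierarchy.link("Chemistry", "Organic chemistry")
hierarchy.link("Rocks", "Coal")          # polyhierarchy: Coal has two BTs
hierarchy.link("Fossil fuels", "Coal")

print(sorted(hierarchy.bt["Coal"]))
print(sorted(hierarchy.all_narrower("Science")))
```

Because only single steps are stored, Organic chemistry is reached from Science through Chemistry, so the intermediate level is never skipped.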
Associative relationship

This relationship is not easy to categorize. The International Standard offers two useful clues:

One of the terms should be strongly implied, according to the frames of reference shared by the users of an index, whenever the other is employed as an indexing term . . . It will frequently be found that one of the terms is a necessary component in any explanation or definition of the other. (ISO, 1986, p. 17)

As with hierarchical relationships, both terms are descriptors. RT (related term) is the thesaural symbol, and its reciprocal is the same. The following are typical:

Buildings RT Doors
Doors RT Buildings
This is the more usual kind of partitive relationship, where the part is not unique to the whole, and is therefore regarded as an associative relationship rather than as hierarchical.
Entomology RT Insects
Insects RT Entomology
Illumination RT Lamps          Lamps RT Illumination
Programming RT Software        Software RT Programming
Harvesting RT Crops            Crops RT Harvesting
Poisons RT Toxicity            Toxicity RT Poisons
Insects RT Insecticides        Insecticides RT Insects
France RT French               French RT France

Adjectives are not normally permitted as descriptors. Here, 'French' can be used as a noun, to denote the people of France; or as the first word of a phrase, e.g. French music.
Silk flowers RT Flowers
Flowers RT Silk flowers
A silk flower is not a flower.
Handicapped children RT Schools for handicapped children

Nested phrases of this kind do not require a reciprocal RT. (Notice in passing that the phrase would now be considered socially unacceptable. A more acceptable phrase today would be: Schools for children with special needs.)

Subject headings lists have in the past used see also and xx. These are the same symbols as are used to denote hierarchical relations, and are used in the same way. It is not uncommon for associative see also references to appear in one direction only.

Classification schedules are by their nature set out hierarchically. There may occasionally be found references to associated topics in other parts of the schedule, e.g. in DDC:

790 Recreational & performing arts
    Class the sociology of recreation in 306.48
There is often a temptation to enter RTs as BT/NT (broader term/narrower term). In some cases it is very difficult to determine whether a relationship should be entered as BT/NT or RT/RT. A rule of thumb is to check whether both terms belong to the same basic type (abstract or concrete entities). If they do not, the relationship cannot be hierarchical, and must be associative (if it is to be admitted at all): e.g., Entomology is a discipline or science (an abstract entity), and cannot therefore belong to the same hierarchy as Insects (which are concrete entities).
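The basic-type check lends itself to a short sketch. The names BASIC_TYPE, may_be_hierarchical and add_related are hypothetical, and the assignment of terms to 'abstract' or 'concrete' is an assumption made for the example. The sketch shows only the rule of thumb: terms of different basic types are never linked as BT/NT, and an RT link is stored in both directions.

```python
from collections import defaultdict

# Hypothetical assignment of terms to basic types (an assumption).
BASIC_TYPE = {
    "Entomology": "abstract",   # a discipline
    "Insects": "concrete",
    "Reptiles": "concrete",
    "Snakes": "concrete",
}

related = defaultdict(set)      # term -> its RTs


def may_be_hierarchical(a: str, b: str) -> bool:
    """Rule of thumb: BT/NT is possible only if both terms share a basic type."""
    return BASIC_TYPE[a] == BASIC_TYPE[b]


def add_related(a: str, b: str) -> None:
    """Record an associative link together with its reciprocal RT."""
    related[a].add(b)
    related[b].add(a)


# Entomology (abstract) and Insects (concrete) differ in basic type,
# so the link can only be associative.
if not may_be_hierarchical("Entomology", "Insects"):
    add_related("Entomology", "Insects")

print(sorted(related["Insects"]))
```

Passing the check does not by itself make a relationship hierarchical; it only rules out the cases that cannot be.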
Another temptation is to make RTs indiscriminately. The principle of RTs is that there must be an immediate and necessary relationship between the two terms. If the relationship is not direct or not necessary, then the terms either should not be related at all, or at best should be linked indirectly, through a third term. Consider the following:

Authors RT Books
           Publications
           Textbooks

Books and Textbooks are hierarchically subordinate to Publications. It would be enough therefore to make the one reference

Authors RT Publications

and let users find their own way if they wish to pursue the references through Publications NT Books and Books NT Textbooks.

It is also tempting to link topics that are only indirectly connected, e.g., by sharing a common BT. So there is little point in links such as Mice RT Hamsters, simply because they share the common BT Rodents. Subject headings lists have in the past been guilty of some highly tendentious see alsos, such as Journalism see also Libel and slander, with its implication that journalists go around libelling people for a living; or Psychical research see also Personality disorders, suggesting that only those with personality disorders engage in psychical research.
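Both checks on indiscriminate RTs can be sketched in a few lines. This is an illustration under assumptions: the function names ancestors, prune_related and share_broader are invented here, and the BT table holds only the terms used in the examples. The first check keeps just the broadest of a set of hierarchically connected RT candidates; the second flags a proposed RT whose only justification is a shared BT.

```python
# Immediate broader terms for the example terms only (toy data).
BT = {
    "Books": {"Publications"},
    "Textbooks": {"Books"},
    "Mice": {"Rodents"},
    "Hamsters": {"Rodents"},
}


def ancestors(term: str, bt: dict) -> set:
    """All broader terms reachable from `term`, one step at a time."""
    seen, stack = set(), [term]
    while stack:
        for parent in bt.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def prune_related(candidates: set, bt: dict) -> set:
    """Drop any candidate RT that has a broader term among the candidates."""
    return {c for c in candidates if not (ancestors(c, bt) & candidates)}


def share_broader(a: str, b: str, bt: dict) -> bool:
    """True if the two terms have an immediate BT in common."""
    return bool(bt.get(a, set()) & bt.get(b, set()))


print(prune_related({"Books", "Publications", "Textbooks"}, BT))
print(share_broader("Mice", "Hamsters", BT))
```

A True result from share_broader is a prompt for the compiler to reconsider the link, not an automatic rejection: two terms with a common BT may still have an immediate and necessary relationship of their own.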
FACET ANALYSIS
The simplest way to present a thesaurus is to arrange all the descriptors with their relationships into a single alphabetical sequence. Many thesauri are of this kind. However, much of the effectiveness of a thesaurus is lost if this alphabetical display is not backed up by means of some kind of systematic display. Because alphabetical order scatters subjects indiscriminately, it is not possible to obtain an overview of the way the subject matter of the thesaurus is structured. The most widely used technique for creating systematic displays is facet analysis. Facet analysis involves:

1. A set of terms representing simple concepts: that is, the descriptors created by applying the rules of thesaurus construction.
2. The grouping of the terms into a number of mutually exclusive categories, called facets, using just one characteristic of division at a time.
3. Organizing the facets into a limited number of fundamental categories. These are generalized categories which can be adapted to any subject field, and define the role of a term within the overall scheme of the thesaurus. Some examples of fundamental categories are given in Figure 5.4. In a working thesaurus these categories will inevitably be heavily adapted to the subject matter in hand. The Art and Architecture Thesaurus for example calls Concrete entities its Objects facet, and Time, its Styles and Periods facet. The Multilingual Egyptological Thesaurus has an even more specialized list of facets: Acquisition; Present Location; Category; Provenance; Dating; Material; Technique; State of Preservation; Description; Language; Writing; Category of Text; Text Content; Divine Names; Royal Names.
4. In many cases, a notation will be required to fix the filing value of each term in a systematic sequence.
Entities
  Abstract: Archaeology, Kilometres, News
  Concrete:
    Naturally occurring: Titanium
    Living: Girls
    Man-made: Paintings
    Complex: Buildings
  Properties (Attributes): Speed, Elasticity
  Materials, Constituents: Adhesives
  Parts: Limbs, Doors
Actions
  Processes (internal, intransitive): Glaciation
  Operations (external, transitive): Marketing, Cookery
Place (location, environment): London
Time: Nineteenth century, Summer

Figure 5.4 Examples of fundamental categories
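The grouping and notation steps of facet analysis can be sketched as follows. This is a minimal illustration, not a prescribed method: the facet labels are simplified from the fundamental categories, and the notation scheme (a letter per facet, a two-digit number per term) is an assumption chosen only to show how a notation fixes filing order.

```python
from collections import defaultdict

# Each term is assigned to exactly one facet (mutually exclusive categories).
TAGGED_TERMS = [
    ("Archaeology", "Entities (abstract)"),
    ("Paintings", "Entities (man-made)"),
    ("Buildings", "Entities (complex)"),
    ("Elasticity", "Properties"),
    ("Speed", "Properties"),
    ("Glaciation", "Processes"),
    ("Marketing", "Operations"),
    ("Cookery", "Operations"),
    ("London", "Place"),
    ("Nineteenth century", "Time"),
]


def group_by_facet(tagged_terms):
    """Group terms under their facet, keeping facets in order of first use."""
    facets = defaultdict(list)
    for term, facet in tagged_terms:
        facets[facet].append(term)
    return {facet: sorted(terms) for facet, terms in facets.items()}


def assign_notation(facets):
    """Hypothetical notation: one letter per facet, a running number per term."""
    notation = {}
    for letter, facet in zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ", facets):
        for number, term in enumerate(facets[facet], start=1):
            notation[term] = f"{letter}{number:02d}"
    return notation


facets = group_by_facet(TAGGED_TERMS)
notation = assign_notation(facets)
for term, code in sorted(notation.items(), key=lambda item: item[1]):
    print(code, term)
```

Sorting by the notation rather than by the term itself is what lets the systematic sequence differ from alphabetical order.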
DISPLAYING THE THESAURUS

The principal symbols denoting functions and relations have been described in 'Principles of language control' above. To recap, they are:

• SN: Scope note, defining or restricting the meaning of a word within the indexing language.
• USE: The term preceding this symbol is a non-preferred term. The preferred term follows this symbol.
• UF: The reciprocal of USE. The term that follows this symbol is a non-preferred term.
• BT: The term that follows is a broader term: another preferred term, but having a more general meaning.
• NT: The term that follows is a narrower term: another preferred term, but having a more specific meaning.
• RT: The term that follows is a related term: another preferred term, having a meaning that is associated with the term preceding the symbol, but not one of the types described above.
By convention, the symbols are listed after each term in the above order. The International Standard permits other symbols in addition to BT/NT if finer distinctions are considered necessary for denoting hierarchical relations:

• TT: the term that follows is the top term in the hierarchy to which the term preceding this symbol belongs.
• BTG: broader term (generic)
• NTG: narrower term (generic)
• BTP: broader term (partitive)
• NTP: narrower term (partitive).

See 'Principles of language control' above for explanations of generic and partitive.
It also mentions the possibility of completely language-free symbols, e.g. < for BT and > for NT, but these refinements and alternatives are little used. (The INSPEC Thesaurus uses TT, and the British Standards Institution Root Thesaurus uses < etc.) Individual thesauri may use symbols other than these, either instead of the recommended symbols (e.g. the American Psychological Association Thesaurus abbreviates to U, B, N and R) or to denote something else, for example a classification code, or the date a term was introduced.
Figure 5.5 shows a typical thesaurus record (from ERIC).
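The conventional order of the symbols can be made concrete with a small formatter. This is a sketch under assumptions: the function name format_record, the dictionary layout and the two-space indentation are illustrative choices, and the sample record reproduces only a few of the relations of the ERIC descriptor.

```python
# Conventional listing order of the relationship symbols.
SYMBOL_ORDER = ("SN", "USE", "UF", "BT", "NT", "RT")


def format_record(term: str, relations: dict) -> str:
    """Lay out one thesaurus record with its symbols in the conventional order."""
    lines = [term]
    for symbol in SYMBOL_ORDER:
        for value in sorted(relations.get(symbol, [])):
            lines.append(f"  {symbol} {value}")
    return "\n".join(lines)


# Abridged sample record; the full descriptor has many more NTs and RTs.
record = {
    "RT": ["Recidivism", "Conflict"],
    "NT": ["Violence", "Aggression", "Crime"],
    "BT": ["Social Behavior"],
    "UF": ["Anti Social Behavior"],
    "SN": ["Behavior that violates the normative rules of society"],
}

print(format_record("ANTISOCIAL BEHAVIOR", record))
```

Whatever order the relations are stored in, the printed record always runs SN, USE, UF, BT, NT, RT, so users learn where to look for each kind of link.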
ALPHABETICAL DISPLAY
Because alphabetical order is self-evident, every thesaurus has an alphabetical display of terms and their relations. Usually this is the principal display. Not infrequently it is the only display. Often, though, the alphabetical display is supplemented by other displays.
ROTATED DISPLAY
Where phrases comprise a high proportion of preferred terms, a second alphabetical display may be found, usually a machine-generated rotated display of terms, to ensure access to embedded words in phrases. This form of display (see Figure 5.6) is confined to printed thesauri, as keyword access is normally available in machine-held thesauri.
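A rotated display is straightforward to generate mechanically. The sketch below is illustrative only: the function name rotated_display is an assumption, and a production version would normally skip stop words such as 'of' and 'for', which this one does not. Every word of every term becomes an access point, and the entries are filed by that word.

```python
def rotated_display(terms):
    """Return (access word, full term) pairs, filed by access word."""
    entries = []
    for term in terms:
        for word in term.split():
            # Each word in a phrase becomes an entry point to the whole term.
            entries.append((word.upper(), term.upper()))
    return sorted(entries)


for word, term in rotated_display(["Computer anxiety", "Anxiety", "Test anxiety"]):
    print(f"{word:<10} {term}")
```

The embedded word 'anxiety' now gathers all three terms together, which a plain alphabetical display of the phrases would scatter under A, C and T.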
SYSTEMATIC DISPLAY
Alphabetical subject order is unfortunately quite arbitrary: 'catalogue' is followed in one encyclopaedia by Catalonia, Catalpa (a tree), and Catalysis. To counteract this, by bringing related topics together, various forms of systematic display are frequently found. They may or may not involve facet analysis. Representative examples include:
• Subject group displays (see Figure 5.7). These are looser associations than hierarchical term displays, and for that reason are more common in the social sciences. ERIC's Group display is a typical example.
• Hierarchical term displays (see Figure 5.8). These may start at the top term of a hierarchy (hence the use of TT in the alphabetical display), successive narrower terms being shown by indentation. Medical Subject Headings (MeSH) has 'tree structures': hierarchical displays with notations, so that the alphabetical display shows only the notational codes for the tree structures, and the user has to consult the tree structures to expand a search. (Most search systems for Medline have an EXPLODE command which ORs NTs automatically to the search specification.) A few thesauri include a complete hierarchical display under each descriptor in the alphabetical sequence. The Thesaurus of Scientific, Technical and Engineering Terms is of this type. The ERIC Thesaurus has a two-way hierarchical display in a separate sequence.
• Systematic display (see Figure 5.9). The hierarchical term display described above supplements the main alphabetical display of the thesaurus. It is, however, sometimes desirable to reverse the roles of these two displays, so that the hierarchical display becomes the principal display and the alphabetical display functions as an index to it. Categories of terms, which share properties in common but may not necessarily belong to the same hierarchy, are placed in as close juxtaposition as the limitations of linear sequencing will allow. The sequence of this systematic display, while helpful to indexer and searcher alike, is however not self-evident. In order to fix the sequence and link it to the alphabetical index, an address code using alphanumeric symbols must be attached to each descriptor.
  There are a number of advantages to a systematic display. While it cannot juxtapose every semantic relation, complete hierarchies are displayed and many other related terms are to be found in the vicinity. It is economical of space, in that there is no repetition of hierarchical information. Finally, the arrangement of the principal sequence is by address codes and not by language. This independence of language makes systematic arrangement ideally suited to a multilingual thesaurus.
  With systematic display there are many loose ends to watch. The arrangement can only take care of one hierarchy at a time, so polyhierarchical terms must have their additional broader terms specially written in. The symbol BT(A) is sometimes used to denote an additional broader term. Scope notes, non-preferred terms, and associative relations must all be written in, usually into the systematic display, but some thesaurus compilers prefer to write these into the alphabetical display, to give both displays a more equal significance. Finally, unless the systematic sequence is very basic it will need guidance in the form of 'node labels'.
• Graphic displays (see Figure 5.10) of semantic relations are of two basic kinds: tree structures and arrowgraphs. With both, hierarchical relations are displayed in a two-dimensional format. The example is of the sample arrowgraph in BS 5723:1987.

ANTISOCIAL BEHAVIOR                                Mar. 1980
CIJE: 892    RIE: 700    GC: 530
SN  Behavior that violates the normative rules, standards, understandings, or expectations of society
UF  Anti Social Behavior (1966 1980)
    Socially Deviant Behavior (1966 1980)
NT  Aggression
    Cheating
    Child Abuse
    Child Neglect
    Crime
    Driving While Intoxicated
    Elder Abuse
    Emotional Abuse
    Homicide
    Incest
    Sexual Abuse
    Sexual Harassment
    Stealing
    Terrorism
    Vandalism
    Verbal Abuse
    Violence
BT  Social Behavior
RT  Alcohol Abuse
    Alcoholism
    Behavior Disorders
    Behavior Problems
    Conflict
    Drug Abuse
    Drug Addiction
    Illegal Drug Use
    Lying
    Obscenity
    Prosocial Behavior
    Recidivism

Figure 5.5 Thesaurus record showing relationships

ANTI SEGREGATION PROGRAMS (1967 1980) use RACIAL INTEGRATION
ANTI SEMITISM
ANTI SOCIAL BEHAVIOR (1966 1980) use ANTISOCIAL BEHAVIOR
ANTISOCIAL BEHAVIOR
ANTITHESIS
ANXIETY
COMPUTER ANXIETY
MATHEMATICS ANXIETY
SEPARATION ANXIETY
TEST ANXIETY

Figure 5.6 Thesaurus display: rotated display

ANTISOCIAL BEHAVIOR
CHEATING
CHILD ABUSE
CHILD NEGLECT
COMMUNITY PROBLEMS
CONFLICT
CRIME
CRIMINALS
DELINQUENCY
DELINQUENCY CAUSES
DISCIPLINE PROBLEMS
DRIVING WHILE INTOXICATED
DRUG ABUSE
DRUG ADDICTION
ELDER ABUSE
EMOTIONAL ABUSE
FAMILY PROBLEMS

Figure 5.7 Thesaurus display: subject group display

Figure 5.8 Thesaurus display: hierarchical term display

Figure 5.9 Thesaurus display: systematic display (Thesaurus of Play Terms)

Figure 5.10 Thesaurus display: graphic display
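The hierarchical term display and the EXPLODE facility described above both follow directly from the stored NT links. The sketch below is an illustration under assumptions: the function names are invented, the small NT table is toy data loosely modelled on the ERIC example, and the dot-per-level indentation and quoted OR string are illustrative formats rather than the syntax of any particular search system.

```python
# Immediate narrower terms (toy data loosely based on the ERIC example).
NT = {
    "Behavior": ["Social Behavior"],
    "Social Behavior": ["Antisocial Behavior"],
    "Antisocial Behavior": ["Aggression", "Crime", "Violence"],
    "Violence": ["Family Violence"],
}


def hierarchical_display(term, nt, depth=0):
    """List a term and its narrower terms, one dot per level of indentation."""
    lines = [". " * depth + term]
    for child in sorted(nt.get(term, [])):
        lines.extend(hierarchical_display(child, nt, depth + 1))
    return lines


def descendants(term, nt):
    """Every narrower term below `term`, in display order."""
    out = []
    for child in sorted(nt.get(term, [])):
        out.append(child)
        out.extend(descendants(child, nt))
    return out


def explode(term, nt):
    """OR the term together with all of its narrower terms."""
    return " OR ".join(f'"{t}"' for t in [term] + descendants(term, nt))


print("\n".join(hierarchical_display("Antisocial Behavior", NT)))
print(explode("Violence", NT))
```

The same walk down the hierarchy serves both purposes: printed with indentation it is a display for browsing, and joined with OR it widens a search to every narrower term.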
THESAURUS USE

A thesaurus may be used in the indexing and/or searching of databases in three possible ways:

• In indexing but not in searching: this is the 'indexing thesaurus', where the database is mostly used for simple searching, often by less expert searchers. Retrieval is helped if a wide range of terms is used to index each record. Some electronic systems will automatically map a user from the 'unpreferred' to the preferred term.
• In searching but not in indexing: this is the 'searching thesaurus'. The thesaurus assists the searching of a free-text database by suggesting additional search terms. This can be done automatically ('query expansion') if the thesaurus is available online. A searching thesaurus tends to provide a wider set of terms as entry vocabulary than a traditional thesaurus, and make greater use of automatic construction techniques. Many experienced searchers use a thesaurus when carrying out natural language searches, especially on full-text files. The thesaurus is used, not as a source of indexing terms, but as a reminder of semantically related terms to be added to a building block for searching.
• In both indexing and in searching: this is the traditional way in which a thesaurus is used. The same thesaurus is used for indexing (by the database compilers) and for searching (by users who know how to use a thesaurus and have one available). This kind of use presumes expert searchers.

In indexing or in searching, terms may be added to the descriptor list or to the search statement without the explicit knowledge of the indexer or searcher.
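Mapping from a non-preferred to a preferred term, and automatic query expansion, can be sketched together. The tables and function names below are assumptions for illustration, populated with terms used as examples earlier in the chapter; a real system would draw them from the full thesaurus and would usually let the searcher choose which suggestions to accept.

```python
# Toy thesaurus fragments (illustrative data only).
USE = {"Fostering": "Adoption", "Barley": "Cereals"}   # entry vocabulary
NT = {"Birds": ["Sparrows"]}
RT = {"Insects": ["Entomology", "Insecticides"]}


def map_to_preferred(term, use):
    """Follow a USE reference; preferred terms are returned unchanged."""
    return use.get(term, term)


def expand_query(term, use, nt, rt):
    """Return the preferred term followed by its narrower and related terms."""
    preferred = map_to_preferred(term, use)
    extra = nt.get(preferred, []) + rt.get(preferred, [])
    return [preferred] + [t for t in extra if t != preferred]


print(map_to_preferred("Barley", USE))
print(expand_query("Insects", USE, NT, RT))
print(expand_query("Birds", USE, NT, RT))
```

When this happens silently, terms are added to the search statement without the searcher's explicit knowledge, which is the situation described in the closing sentence above.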
SUMMARY
This chapter has explored issues associated with subject retrieval, and specifically introduced indexing and searching languages. The two main kinds of indexing languages are controlled and natural language. Systems designers need to decide when to use each of these. Other features of retrieval systems are exhaustivity and content specification, specificity, and the way in which complex topics are represented. Search facilities in post-coordinate searching are important in supporting users in their searching of controlled and natural language. Boolean search logic is widely used, but relevance ranking and best match search logic are becoming more significant. Other search facilities include: truncation, proximity searching, range searching and search sets. Thesauri are one way of recording controlled languages. They establish the terms that are to be used as descriptors for subjects, and relationships between subjects. Relationships are of three types: equivalence, hierarchical and associative. Facet analysis is useful in offering a perspective on the structure of relationships between subjects. The main means for displaying thesauri is the alphabetic display, but there are also a variety of approaches to systematic display that are useful in displaying relationships.
REFERENCES AND FURTHER READING

Basch, R. (1989) The seven deadly sins of full text searching. Database, 12 (4), August, 15-23.
Basch, R. (1991) My most difficult search. Database, 14 (3), June, 65-7.
Bates, M. J. (1977) Factors affecting subject catalog search success. Journal of the American Society for Information Science, 28 (3), 161-9.
Bates, M. J. (1989) Re-thinking subject cataloguing in the online environment. Library Resources and Technical Services, 33 (4), October, 400-412.
Brenner, E. H. (1996) Beyond Booleans: New Approaches to Information Retrieval. Philadelphia, PA: National Federation of Abstracting and Indexing Services.
British Standards Institution (1985) BSI Root Thesaurus, 2nd edn. Milton Keynes: British Standards Institution.
Chang, S. J. and Rice, R. E. (1993) Browsing: a multi-dimensional framework. Annual Review of Information Science and Technology, 28, 231-76.
Clausen, H. (1997) Online, CD-ROM and the Web: is it the same difference? Aslib Proceedings, 49 (7), 177-83.
Cleverdon, C. W. (1967) The Cranfield tests on index language devices. Aslib Proceedings, 19, 173-94.
Cousins, S. A. (1992) Enhancing subject access to OPACs: controlled vocabulary versus natural language. Journal of Documentation, 48 (3), 291-309.
Creth, S. D. (1995) A changing profession: central roles for the academic librarian. Advances in Librarianship, 19, 85-98.
Ellis, D. (1991) Hypertext: origins and use. International Journal of Information Management, 11 (1), 5-13.
Ellis, D. (1992) The physical and cognitive paradigms in information retrieval research. Journal of Documentation, 48 (1), 45-64.
Ellis, D. (1996) Progress and Problems in Information Retrieval. London: Library Association Publishing.
Ellis, D., Furner-Hines, J. and Willett, P. (1994) On the creation of hypertext links in full text documents: measurements of interlinker consistency. Journal of Documentation, 50 (2), 67-98.
Enser, P. G. B. (1993) Query analysis in a visual information retrieval context. Journal of Document and Text Management, 1 (1), 28-52.
Fairthorne, R. A. (1969) Content analysis, specification and control. Annual Review of Information Science and Technology, 4, 71-109.
Fidel, R. (1991) Searchers' selection of search keys: II. Controlled vocabulary or free-text searching. Journal of the American Society for Information Science, 42 (7), 501-14.
Fidel, R. (1992) Who needs controlled vocabulary? Special Libraries, 83 (1), 1-9.
Ingwersen, P. (1992) Information Retrieval Interaction. London: Taylor Graham.
INSPEC Thesaurus. (Annual.) London: INSPEC.
International Standards Organization (1986) Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. ISO 2788:1986. Geneva: ISO.
Khan, K. and Locatis, C. (1998) Searching through cyberspace: the effects of link display and link density on information retrieval from hypertext on the World Wide Web. Journal of the American Society for Information Science, 49 (2), 176-82.
Lancaster, F. W. (1977) Vocabulary control in information retrieval systems. Advances in Librarianship, 7, 1-40.
Lancaster, F. W. (1986) Vocabulary Control for Information Retrieval, 2nd edn. Arlington, VA: Information Resources Press.
Lancaster, F. W. and Sandore, B. (1997) Technology and Management in Library and Information Services. London: Library Association Publishing.
Lancaster, F. W. and Warner, A. J. (1993) Information Retrieval Today. Arlington, VA: Information Resources Press.
Large, A., Beheshti, J., Breuleux, A. and Renaud, A. (1994) A comparison of information retrieval from print and CD-ROM versions of an encyclopaedia by elementary school students. Information Processing and Management, 30 (4), 499-511.
Marchionini, G. (1995) Information Seeking in Electronic Environments. Cambridge: Cambridge University Press.
Markey, K., Atherton, P. and Newton, C. (1980) An analysis of controlled vocabulary and free text search statements in online searches. Online Review, 4 (3), 225-36.
Milstead, J. L. (1995) Invisible thesauri: the year 2000. Online and CD-ROM Review, 19 (2), 93-4.
... C. (1993) Thesaurus management software. Encyclopedia of Library and Information Science, 51, 389-407.
Mulvany, N. C. (1994) Indexing Books. Chicago: University of Chicago Press.
Nicholson, S. (1997) Indexing and abstracting on the World Wide Web: an examination of six Web databases. Information Technology and Libraries, 16 (2), 73-81.
Rowley, J. E. (1990) A comparison between free language and controlled language indexing and searching. Information Services and Use, 10 (3), 147-55.
Rowley, J. (1994) The controlled versus natural indexing languages debate revisited: a perspective on information retrieval practice and research. Journal of Information Science, 20 (2), 108-19.
Thesaurus of Psychological Index Terms. Arlington, VA: American Psychological Association.
Webber, S., Baile, C., Cameron, A. and Eaton, J. (1994) UKOLUG Quick Guide to Online Commands, ... edn. London: UKOLUG.
Weinberg, B. H. (1995a) Library classification and information retrieval thesauri: comparison and contrast. Cataloging and Classification Quarterly, 19 (3/4), 23-44.
Weinberg, B. H. (1995b) Why postcoordination fails the searcher. Indexer, 19 (3), April, 155-9.
Welsh, T., Murphy, K., Duffy, T. and Goodrum, D. (1993) Accessing elaborations on core information in a hypermedia environment. Educational Technology Research and Development, 41 (2), 19-34.
Willpower Information. Comparison of Thesaurus Management Software for PCs. Available from World Wide Web.