Building and Applying a Concept Hierarchy Representation of a User Profile

Nikolaos Nanas
Knowledge Media Institute, The Open University, Milton Keynes, U.K.
[email protected]

Victoria Uren
Knowledge Media Institute, The Open University, Milton Keynes, U.K.
[email protected]

Anne De Roeck
Department of Computing and Mathematics, The Open University, Milton Keynes, U.K.
[email protected]

ABSTRACT

Term dependence is a natural consequence of language use. Its successful representation has been a long-standing goal for Information Retrieval research. We present a methodology for the construction of a concept hierarchy that takes into account the three basic dimensions of term dependence. We also introduce a document evaluation function that allows the use of the concept hierarchy as a user profile for Information Filtering. Initial experimental results indicate that this is a promising approach for incorporating term dependence in the way documents are filtered.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design, Verification

Keywords
Term Dependence, Concept Hierarchies, Information Filtering

1. INTRODUCTION

"The statistically dependent placement of words in text is a natural consequence of the way people think and communicate." In his classic paper "Semantic road maps for literature searchers" [6], Doyle describes two basic phenomena that cause statistical dependencies between terms. "Language redundancy" refers to the habitual use of lexical compounds as semantic units, while "reality redundancy" relates to the pattern of reference to various aspects of the topic which is being discussed. In the rest of this paper we will refer to the term correlations caused by these two phenomena as lexical correlations and topical correlations respectively. Finally, Doyle attributes a third phenomenon, "documentation redundancy", to document series like progress reports and newsletters that repeatedly and periodically refer to the same spectrum of topics. Doyle goes on to argue that in building a retrieval system one has to take all three of these dimensions of term dependence into account. While term weighting can tackle documentation redundancy, the other two phenomena require a different treatment. To deal with language redundancy, Doyle suggests joining terms with strong adjacent, lexical correlations into a single compound term. For reality redundancy, on the other hand, he proposes expressing the non-adjacent, topical correlations with a hybrid structure that combines the characteristics of both hierarchies and associative graphs. In other words, he envisions the use of appropriately constructed connectionist networks that express the hierarchical topic-subtopic relations between terms. Connectionist networks have been the principal approach to representing term dependence in Information Retrieval (IR). Usually, pure associative graphs, which only exploit the stochastic dependence that is caused by the above three phenomena, are adopted [23, 19, 4]. In contrast, recent research has focused on the construction of hierarchical networks based on either subsumption relations [21] or lexical relations between terms [1, 18]. Despite their extensive use in IR research, connectionist networks have not been as popular for Information Filtering (IF). Exceptions include the IF systems described in [22, 16, 11]. So far no single network structure has managed to represent all three dependence dimensions, and the application of such networks to IF has been limited.

In this paper we present a methodology for the construction of a concept hierarchy that complies with the above requirements. A hierarchy is constructed from a set of user specified documents on a topic of interest. A document evaluation function on the constructed hierarchy is then proposed for filtering documents according to that topic. In that sense the concept hierarchy functions as a representation of the user's interests, i.e. a user profile. The profile's filtering performance has been evaluated on the Reuters RCV1 collection. The results indicate that the proposed approach is promising.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR’03, July 28–August 1, 2003, Toronto, Canada. Copyright 2003 ACM 1-58113-646-3/03/0007 ...$5.00.


2. RELATED WORK

In IR, it was recognized early that the term independence assumption is false and that ways of taking term dependence into account should be found. Connectionist approaches have played an important role in this endeavor, and different connectionist networks have been proposed. Here we concentrate on those that explicitly represent the associations between terms. In these networks, terms are represented by nodes and the associations between them by links. In order to incorporate term dependence in probabilistic IR, van Rijsbergen proposes the use of an appropriately constructed "dependence tree" [23]. The dependence tree is derived as the maximum spanning tree of the associative graph that represents the statistical dependence between every possible pair of terms. van Rijsbergen suggests the use of the constructed dependence tree for query expansion. Bhatia has adopted the latter idea to provide personalized query expansion based on a user profile represented by a dependence tree [3]. Query expansion can also be based on thesauri. However, the manual construction of a thesaurus is expensive and time consuming, so methods for the automatic generation of thesauri are in great demand. Park et al. propose the automatic construction of a thesaurus using a term similarity measure on a "collocation map" [19]. A collocation map, introduced by [9], is a sigmoid Bayesian network that encodes the statistical associations between terms in a given document collection. A similar approach has been described by [4] for automatic subject indexing, an application very similar to query expansion. In contrast to the above connectionist networks, where the associations between terms are stochastic, approaches for the automatic construction of hierarchical networks have also been proposed. These networks are usually referred to as "Concept", "Topic" or "Subject" hierarchies, and can be applied to the organization, summarization and interactive accessing of information.

One method for the automatic construction of a concept hierarchy is through the use of subsumption associations between terms ("Subsumption Hierarchies") [21]. A term ti is said to subsume tj if the documents that contain tj are a subset of the documents containing ti. Another approach is to rely on frequently occurring words within phrases or lexical compounds. The creation of such "Lexical Hierarchies" has been explored by [1, 18]. In addition to the above two approaches, Lawrie et al. have investigated the generation of a concept hierarchy using a combination of a graph theoretic algorithm and a language model [15]. The approach has been shown to perform comparably to both subsumption and lexical hierarchies. Finally, an evaluation indicating that subsumption hierarchies are advantageous compared to lexical hierarchies has been conducted by [14]. Despite their extensive use in IR research, connectionist networks of the type discussed here have not been as popular in IF. Exceptions include the INFOrmer [22] and PSUN [16] filtering systems, which adopt an associative term network to represent user interests. For constructing the network, terms are considered to be associated if they appear in the same phrase. In the case of INFOrmer, document evaluation is performed using a spreading activation function on the associative term network. A similar approach has also been adopted by [11]. In PSUN, on the other hand, an extra representational layer of "supervisors" is introduced to group strongly associated terms into lexical compounds. When these compounds are found in a document, the corresponding supervisors contribute to the document's relevance. So, in contrast to pure associative term networks that only represent stochastic relations between terms, concept hierarchies have the potential of representing topic-subtopic relations. However, while subsumption hierarchies do not take into account the lexical relations between terms, lexical hierarchies are based only on such relations. In addition, although recent connectionist approaches to IF have made significant steps towards representations of the user interests that incorporate term dependence, these systems are based only on lexical, phrase-based relations between terms. Both a methodology for constructing a hierarchical network that tackles all three dependence dimensions and a way of applying such a network to IF are thus legitimate research goals.

3. BUILDING A PERSONAL CONCEPT HIERARCHY

In this section we present a methodology for constructing a concept hierarchy that represents the content of a set of user specified documents which distinguishes them from the rest of the documents in a complete collection. The methodology can therefore be applied for the initialization of a user profile that represents the underlying documents' topic. The construction takes place in three phases that progressively tackle the three term dependence dimensions. Documentation redundancy is dealt with in section 3.1, where term weighting is applied to identify those of the unique terms in the user specified documents that are most specific to those documents. The extracted terms are used as the concept hierarchy's building blocks. In section 3.2, term associations are derived using a sliding window approach in order to generate an associative graph, from which a concept hierarchy is extracted in section 3.3. These two phases have the joint effect that both lexical and topical correlations are represented by the final concept hierarchy.

3.1 Term Weighting and Selection

Our first goal is to identify the most appropriate terms for building the concept hierarchy (user profile). Given a number R of user specified documents, we initially apply stop word removal and stemming to reduce the space of unique terms in the documents. Term weighting is then used to assess the specificity of terms to the documents' underlying topic, i.e. their ability to distinguish documents about that topic from the rest of the documents in the collection. The most specific terms are selected to populate the profile (fig. 1). We thus tackle documentation redundancy by filtering out terms that do not relate to the underlying topic. Of course, the exact number of profile terms is a parameter that is subject to optimization. If we ignore term dependence, then this initial, unconnected profile version can be used to evaluate documents; for example, the inner product between the profile's and the document's vector representations, as described in [12], can be applied. We use a novel term weighting method called Relative Document Frequency (RelDF), which measures the relative importance of terms within the user specified documents and within a general collection of documents. The idea behind the approach is analogous to the relative frequency technique that has been suggested by Edmundson and Wyllys [7].
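As an illustrative sketch with toy data (the document sets below are hypothetical), RelDF assigns each term the weight r/R minus n/N, where r and n are the term's document frequencies in the user specified documents and in the whole collection, and the highest-weighted terms populate the profile:

```python
def reldf(term, user_docs, collection):
    """Relative Document Frequency (equation 1): r/R - n/N.
    user_docs and collection are lists of sets of (stemmed) terms."""
    R, N = len(user_docs), len(collection)
    r = sum(term in d for d in user_docs)   # user docs containing the term
    n = sum(term in d for d in collection)  # collection docs containing it
    return r / R - n / N

# Toy data: "filtering" occurs in every user doc but is rare overall.
user_docs = [{"filtering", "profile"}, {"filtering", "term"}]
collection = user_docs + [{"sport"}, {"news"}, {"news", "sport"}, {"market"}]
w = reldf("filtering", user_docs, collection)   # r/R = 1, n/N = 2/6
profile_terms = sorted({t for d in user_docs for t in d},
                       key=lambda t: reldf(t, user_docs, collection),
                       reverse=True)[:3]        # keep the highest-weighted terms
```

Terms specific to the user documents score close to 1, while terms common across the whole collection are pushed towards negative weights.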


Figure 1: Specific terms populate the profile. (Extracted, specific terms are shown as nodes.)

Figure 2: Associative links between profile terms. (Profile terms are connected by symmetric, associative links.)

(Hence the adopted name.) Based on the assumption that special or technical words are rarer in general usage than in documents about the corresponding subjects, they presented a number of ways for assessing the relative frequency of terms within a document and within a general collection. Their approach has been highlighted by Doyle as a potential remedy for documentation redundancy [6]. In a similar way, we assume that terms pertaining to the topic of interest to the user will appear in a larger percentage of the user specified documents than of the complete collection. The method assigns to each term a weight in the interval (-1, 1), according to the difference between the term's probabilities of appearance in the user specified documents and in the general collection. We define RelDF using equation 1, where N is the number of documents in the collection, R is the number of user specified documents, and r and n are respectively the number of user specified documents and the number of documents in the collection that contain term t. While the first part of the equation (r/R) favors those terms that exhaustively describe the user specified documents, and therefore the underlying topic of interest, the second part (-n/N) biases the weighting towards terms that are specific within the general collection. Experiments presented in [17] have indicated that RelDF outperforms a large number of existing term weighting methods when the user specifies a small number of documents.

RelDF(t) = r/R - n/N    (1)

3.2 Link Generation and Weighting

Having extracted the most specific terms from the user specified documents, the next step is to appropriately associate them. A span of contiguous words that slides through each document's text, i.e. a "sliding window", is used for selecting the extracted terms to be linked. The size of the window defines the kind of associations that we take into account. A small window of a few words is usually called "Local Context" and is appropriate for identifying adjacent, lexical correlations. "Topical Context", on the other hand, is defined by a larger window that incorporates several sentences. The goal of topical context is to identify semantic relations between terms that are repeatedly used in text discussing a certain topic [10] (topical correlations). To identify topical correlations we adopt topical context. For the work presented here we have chosen a window of size 20 (10 words at either side of an extracted term), which is larger than the typical size of local context. Topical correlations between terms that appear within the window are measured using a formula similar to the one adopted by [19]. In addition, the formula has been extended to measure the lexical correlations between terms by their average distance in text. More specifically, a weight wij is assigned to the link between two extracted terms ti and tj using the following formula:

wij = (frij^2 / (fri * frj)) * (1/d)    (2)

In equation 2, frij is the number of times ti and tj appear within the sliding window, fri and frj are respectively the number of occurrences of ti and tj in the user specified documents, and d is the average distance between the two linked terms. Two extracted terms that appear next to each other have a distance of 1, while if n words intervene between them the distance is n + 1. The above process connects the extracted profile terms of figure 1 with symmetric, associative links (figure 2). The first fraction of equation 2 measures the likelihood that the two extracted terms will appear within the sliding window. The second fraction, on the other hand, is a measure of how close the two terms usually appear. As a result, a link's weight is a combined measure of the statistical dependencies caused by both lexical and topical correlations. Following Doyle, we could tackle language redundancy by joining terms into a single lexical compound if their link's weight were over a certain threshold. Nevertheless, we have chosen not to do so, for reasons explained later.

3.3 Extracting a Hierarchy

In order to extract a concept hierarchy out of the associative graph of figure 2, a way of identifying topic-subtopic relations between terms is required. We adopt Forsyth and Rada's hypothesis that the more documents a term appears in, the more general the term is assumed to be [8]. Some of the profile terms will broadly define the underlying topic, while others co-occur with a general term and provide its attributes, specializations and related concepts [14]. Based on this hypothesis, profile terms are ordered according to decreasing Relevant Document Frequency (RDF), i.e. their document frequency in the user specified documents. The higher a term's rank, the more general the term



is assumed to be. The terms are then assigned to three layers according to RDF thresholds. For the current work we used the following thresholds: if RDFmax is the RDF of the most frequent profile term, then profile terms with RDF >= 0.8 * RDFmax are assigned to the top layer, those with RDF >= 0.4 * RDFmax to the middle layer, and the rest of them to the third, lowest layer (fig. 3). This division allows the implementation of the document evaluation function that is described in the next section. We are currently experimenting with document evaluation functions that do not require this division and can instead be applied on the continuous hierarchical network that results from the initial term ordering. Terms in the top layer define the general topic of interest, terms in the middle layer correspond to related subtopics, and terms in the lowest layer comprise the subvocabulary used when the topic is discussed. Links that cross layer boundaries connect general terms in the top layers with more specific terms in the layers below (fig. 3: links in black). In other words, cross-layer links reflect topic-subtopic relations between terms. The profile terms and the cross-layer links comprise an acyclic, hierarchical network that complies with most of the design principles set by Sanderson and Croft for the generation of a concept hierarchy using subsumption [21]. However, in contrast to subsumption hierarchies, where the links are generated based on pure co-occurrence data, the adopted link generation and weighting process combines co-occurrence with distance between terms. In conclusion, while topic-subtopic relations are represented by cross-layer links, lexical correlations are reflected by the link weights.

Figure 3: Concept hierarchy profile representation. (Profile terms are ordered by decreasing RDF into three layers: top (1st), middle (2nd) and lowest (3rd). Cross-layer links are shown in black, with arrows indicating the direction of energy dissemination.)

4. DOCUMENT EVALUATION

The above methodology generates, out of a set of user specified documents on a topic of interest, a concept hierarchy that takes into account all three dependence dimensions. In this section we address how to use this profile representation for document filtering. We have drawn ideas from the application of neural networks [13, 24] and semantic networks [5, 2, 12] to IR. According to these connectionist approaches, documents, queries and index terms are represented by nodes. Terms that are contained in a document or a query are linked to the corresponding nodes. Links between terms and between documents are not used. Retrieval takes place on the basis of a spreading activation function. A fixed amount of energy is initially deposited at the query node. The energy is then disseminated through the network towards the document nodes, and a document's relevance is finally estimated by the amount of energy that the document receives. The difference in our case is that the network consists only of terms. Spreading activation functions have also been applied in the case of associative graphs that express the stochastic correlations between terms [4, 22]. Due to the inherent lack of direction in these networks, an initial energy is assigned to the terms that appear in a specific document and is then iteratively disseminated through the network until an equilibrium is reached. The document evaluation function that is established here is based on a directed spreading activation function that combines the characteristics of the above approaches. Although the network contains only terms, the direction implied by the hierarchy, and more specifically its layering, is taken into account. Given a document D, an activation of 1 (binary indexing) is passed to those profile terms that appear in D. The initial energy of an activated term ti is Ei = 1 * wi, where wi is the weight that has been assigned to the term by the term weighting method (RelDF). If and only if an activated term ti is directly linked to another activated term tj in a higher profile layer, an amount of energy Eij = Ei * wij is disseminated by ti to tj through the corresponding link, where wij is the weight of the link between ti and tj. A direction from lower to higher profile layers is thus imposed on the cross-layer links (fig. 3: arrows). Activated terms in the top two layers update their initial energy by the amounts of energy that they receive from activated terms in the layers below. For example, the updated energy Ej of term tj is Ej = Ej + sum over ti in Al of Ei * wij, where Al is the set of activated terms that are directly linked to tj and that appear in lower profile layers. Terms in the middle layer update their energies first and terms in the top layer last (feedforward). As a result, activated terms in the middle layer disseminate part of their "updated" energy. If Ah is the set of activated terms in the top two layers that have either received or disseminated energy, then the document's score is calculated by equation 3, where k is the number of words in D. Only terms in Ah are thus allowed to directly contribute to the document's relevance. This constraint is a drawback of the current implementation for two reasons. First, according to this constraint, term T in figure 3 does not contribute to the document's relevance, despite representing a subtopic of interest. This can negatively affect the filtering performance, especially for profiles with a small number of terms and consequently of links. In addition, the current document evaluation function favors terms in the top two layers. Activated terms in the lowest layer contribute only implicitly, by the energy that they disseminate to activated terms in the layers above. As already mentioned, we are currently experimenting with continuous evaluation functions which tackle both of the above issues.



SD = ( sum over ti in Ah of Ei ) / log(k)    (3)

Note that, since profile terms are only the most specific terms in the user specified documents (see section 3.1), very few terms are assigned to the top layer, some more to the middle, and the majority to the lowest layer.

The above establish a non-linear document evaluation function that takes into account the term dependencies that the concept hierarchy represents. As energy disseminates from


lower to higher hierarchical layers, terms in the higher profile layers that appear in the document together with their associated terms in the layers below have an increased contribution to the document's score. Therefore, a document about both topics and related subtopics of interest receives a higher score than a document about the same number of unrelated topics. Terms in the lowest profile layer, on the other hand, disseminate their energy only if they are found together with their associated topics and subtopics of interest in the layers above. In other words, these terms contribute implicitly to the document's relevance if they are found in the topical context that they were extracted from. Topical correlations between terms are thus taken into account. Lexical correlations are also taken into account, due to the way links are weighted. The amount of energy disseminated between two terms of different profile layers is larger if they are part of a lexical compound. If, on the other hand, the compound's terms are found in the same layer, their contribution to the document's score can still be enhanced because of the large number of links that they probably have in common: terms found in adjacent positions in text share many links to other terms. In addition to taking both topical and lexical correlations into account in the way documents are evaluated, the proposed approach is computationally cheap. In contrast to traditional spreading activation approaches, where numerous computational cycles are required for the network to reach an equilibrium, document evaluation takes place in two forward steps. The combination of these two characteristics of the proposed document evaluation process could represent a significant improvement over existing connectionist approaches to IF [22, 16, 11].
Furthermore, as we will shortly discuss in section 6, adaptation of the proposed profile representation to changes in the user interests can be effectively achieved.
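As an illustrative sketch (not the authors' implementation; the three-term profile, its weights and links below are hypothetical), the directed, feed-forward evaluation of section 4 might look like:

```python
import math

def score_document(doc_words, term_weight, term_layer, links):
    """Feed-forward spreading activation over a 3-layer profile (section 4).
    term_weight: RelDF weight w_i per profile term
    term_layer:  1 = top, 2 = middle, 3 = lowest
    links:       {(ti, tj): w_ij} cross-layer links with ti in a lower
                 layer than tj; energy flows from lower to higher layers
    """
    activated = set(doc_words) & set(term_weight)
    energy = {t: term_weight[t] for t in activated}   # E_i = 1 * w_i
    touched = set()
    # Middle-layer terms update before top-layer terms (feedforward),
    # so middle-layer terms disseminate part of their updated energy.
    for receiving_layer in (2, 1):
        for (ti, tj), wij in links.items():
            if (ti in activated and tj in activated
                    and term_layer[tj] == receiving_layer
                    and term_layer[ti] > term_layer[tj]):
                energy[tj] += energy[ti] * wij        # E_j += E_i * w_ij
                touched.update((ti, tj))
    # Equation 3: A_h = activated top-two-layer terms that received or
    # disseminated energy; S_D = (sum of their energies) / log(k).
    a_h = {t for t in touched if term_layer[t] <= 2}
    return sum(energy[t] for t in a_h) / math.log(len(doc_words))

# Hypothetical profile: "retrieval" (top), "term" (middle), "weighting"
# (lowest), with two cross-layer links, scoring an 8-word document.
weights = {"retrieval": 0.9, "term": 0.6, "weighting": 0.5}
layers = {"retrieval": 1, "term": 2, "weighting": 3}
links = {("weighting", "term"): 0.5, ("term", "retrieval"): 0.4}
doc = ["retrieval", "term", "weighting", "of", "index", "terms", "in", "text"]
score = score_document(doc, weights, layers, links)
```

The two-pass loop mirrors the paper's two forward steps: "weighting" feeds "term", whose updated energy then feeds "retrieval".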

5. EVALUATION

To evaluate the performance of the proposed hierarchical profile, we have performed some pilot experiments on a slight modification of the TREC-2001 routing subtask. The Text REtrieval Conference (TREC) has been held annually since 1992. Its purpose is to provide a standard infrastructure for the large-scale evaluation of IR systems. TREC-2001 adopts the Reuters Corpus Volume 1 (RCV1), an archive of 806,791 English language news stories that has recently been made freely available for research purposes (http://about.reuters.com/researchandstandards/corpus/index.asp). The stories have been manually categorized according to topic, region, and industry sector [20]. The TREC-2001 filtering track is based on 84 out of the 103 RCV1 topic categories. Furthermore, it divides RCV1 into 23,864 training stories and a test set comprising the rest of the stories (for more details on the TREC-2001 filtering track see http://trec.nist.gov/data/t10_filtering/T10filter_guide.htm). According to TREC's routing task, systems are allowed to use the complete relevance information and any nonrelevance-related information from the training set. Systems are evaluated on the basis of the best 1000 scoring documents, using the average uninterpolated precision (AUP) measure. The AUP is defined as the sum of the precision values at each point in the list where a relevant document appears, divided by the total number of relevant documents.

We have deviated from the strict TREC routing guidelines in two ways. To reduce the required completion time, the experiments have been conducted on only the first 10 (R1-R10) of the 84 topic categories. Furthermore, only 30 relevant documents per topic, and not the hundreds provided by the training set, were allowed for constructing the corresponding profile. This amount was considered a better approximation of the real number of documents that a user might provide. For each of the 10 topics a profile was constructed using the best terms on the basis of RelDF weighting. We have experimented with different numbers of extracted terms: 20, 40, 60, 80 and 100. For each possible combination of topic and number of extracted terms, two different kinds of profile were constructed. In the first case no links between terms were generated and documents were evaluated using the inner product measure (unconnected profile version). In the second case the same extracted terms were used to generate the concept hierarchy and documents were evaluated using the introduced approach (hierarchical profile version). Since the same weighted terms have been used in both cases, any difference in performance is due to the incorporation of term dependence by the hierarchical profile representation. Figure 4 presents, for each number of profile terms, the average AUP score over all 10 topics for the unconnected and hierarchical profile versions. The X axis corresponds to the number of profile terms and the Y axis presents the average AUP. Table 1 presents, for each topic, the average AUP score over the different numbers of profile terms.

Table 1: Per Topic Average AUP Score

Topic   unconnected   hierarchical
R1      0.00148       0.00271
R2      0.0667        0.06628
R3      0.00193       0.00201
R4      0.01738       0.03274
R5      0.01482       0.00609
R6      0.07184       0.07561
R7      0.032         0.03234
R8      0.07115       0.07242
R9      0.12967       0.11122
R10     0.17488       0.17582

The results indicate that the hierarchical profile performs better than the unconnected profile for large numbers of profile terms, for which representing term dependence is of particular importance. Its inferior performance for small numbers of profile terms is justified to some extent by the fact that the hierarchy is not substantially complete. The fact that terms do not contribute to the document's relevance if they do not receive or disseminate energy, as mentioned in section 4, is an additional reason. It is also important to note that the performance of the hierarchical profile is not significantly affected by the number of profile terms: it favors only those combinations of terms that comprise a meaningful hierarchy. Finally, its overall performance is better for 7 out of the 10 topics. For topics R1 and R4 in


particular, the hierarchical profile performs almost twice as well as the unconnected version. For topic R5, however, the opposite appears to happen. One reason could be that, although topic R5 is relatively specific (Reuters code C1511 [20]), it has a large number of documents in the test set (22,813). This particularity of topic R5, in combination with the small number of initialization documents, may have resulted in profile overspecialization to the subjects being discussed during the time period that the initialization documents derive from. These findings encourage us to believe that the extracted concept hierarchy is a promising approach for representing a user profile. Of course, further improvements can be achieved by appropriate fine tuning of parameters like the window size, and in general via improvements in the document evaluation function.

Figure 4: Experimental results. (Average AUP score, 0.00E+00 to 8.00E-02 on the Y axis, against number of profile terms, 0 to 120 on the X axis, for the unconnected and hierarchical profile versions.)
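The AUP measure used in these experiments, i.e. the sum of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents, can be sketched as follows (the ranked list below is hypothetical):

```python
def average_uninterpolated_precision(ranked_doc_ids, relevant_ids):
    """Sum of precision at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    total, hits = 0.0, 0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank        # precision at this rank
    return total / len(relevant_ids)

# Hypothetical ranking: relevant docs at ranks 1 and 3, one relevant doc
# ("d7") not retrieved at all, which lowers the score.
aup = average_uninterpolated_precision(["d1", "d5", "d2", "d9"],
                                       {"d1", "d2", "d7"})
```

Note that, unlike interpolated precision, relevant documents missing from the ranked list directly reduce the score through the denominator.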

6.

CONCLUSIONS AND FUTURE WORK

We have presented a methodology that generates a concept hierarchy from a set of user-specified documents about a topic of interest, through a series of processes that take into account document, language and reality redundancy. Appropriately weighted and selected terms are used as the hierarchy's nodes, while hierarchical links and their weights represent topical and lexical correlations between the selected terms. A computationally cheap document evaluation function has allowed us to apply the concept hierarchy as a user profile for information filtering. Initial experimental results are promising and point towards further improvements. These include experimentation with document evaluation functions that exploit a continuous ordering of terms. We should also note that the hierarchical profile structure can represent more than one topic of interest. In the context of information filtering, this means that distinct or overlapping hierarchical networks can be constructed for each topic to compose a user profile. A document's topic could then be indicated by the hierarchical network that is more strongly activated. Automatic summarization can also be achieved by using the document evaluation function to score individual sentences within the document. Since no syntactic rules are used, the approach can also be applied to the personalized filtering of any medium for which meaningful ordered features can be extracted.

Current work focuses on using this research as part of the development of an adaptive information filter called Nootropia. Adaptation of the hierarchical profile is achieved via an economic model in which profile terms are assigned “credit” analogous to their relevant document frequency and must pay part of it as “rent” on the basis of the layer they occupy. Credit is awarded or deducted according to user feedback, and as a result terms are promoted or demoted from one profile layer to another. This process results in new cross-layer links, while terms that are demoted from the lowest layer are purged from the profile. They are replaced by new informative terms extracted from documents that received positive feedback. Term and link weights are also updated according to temporary profiles, generated with the process described here from the sets of documents that have received positive and negative feedback. In conclusion, the hierarchical representation that we have described is not static: it can be adapted to changes in the user's interests.
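As a rough illustration of the credit/rent adaptation scheme, the sketch below adjusts each term's credit by user feedback, charges rent according to the layer the term occupies, and purges terms whose credit runs out. The thresholds, rent values and feedback rule are invented for the example and are not Nootropia's actual parameters.

```python
# Assumed layer structure: higher credit places a term in a higher layer,
# and higher layers charge more rent per filtering cycle.
LAYER_RENT = {0: 0.05, 1: 0.1, 2: 0.2}   # rent per cycle, per layer (assumed)
LAYER_THRESHOLDS = [0.3, 0.6]            # credit needed for layers 1 and 2 (assumed)

def update_profile(profile, feedback):
    """profile: {term: credit}; feedback: {term: reward (+) or penalty (-)}."""
    updated = {}
    for term, credit in profile.items():
        credit += feedback.get(term, 0.0)             # credit from user feedback
        layer = sum(credit >= t for t in LAYER_THRESHOLDS)
        credit -= LAYER_RENT[layer]                   # pay rent for the occupied layer
        if credit > 0:                                # insolvent terms are purged
            updated[term] = credit
    return updated

profile = {"network": 0.7, "hierarchy": 0.35, "noise": 0.04}
profile = update_profile(profile, {"network": 0.1, "noise": -0.02})
print(sorted(profile))  # ['hierarchy', 'network']: "noise" has been purged
```

Under a scheme of this kind, terms that keep attracting positive feedback accumulate credit and climb the hierarchy, while uninformative terms gradually pay themselves out of the profile, freeing space for new terms extracted from positively rated documents.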

7. ADDITIONAL AUTHORS
Additional author: John Domingue (Knowledge Media Institute, email: [email protected]).

8. REFERENCES
[1] P. Anick and S. Tipirneri. The paraphrase search assistant: Terminological feedback for iterative information seeking. In M. Hearst, F. Gey, and R. Tong, editors, 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 153–159, 1999.
[2] R. K. Belew. Adaptive information retrieval: using a connectionist representation to retrieve and learn about documents. ACM, 1989.
[3] S. K. Bhatia. Selection of search terms based on user profile. In Proceedings of the 1992 ACM/SIGAPP Symposium on Applied Computing, pages 224–233. ACM Press, 1992.
[4] Y. Chung, W. M. Pottenger, and B. R. Schatz. Automatic subject indexing using an associative neural network. In Proceedings of the 3rd ACM Conference on Digital Libraries, pages 59–68. ACM Press, 1998.
[5] F. Crestani. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 1997.
[6] L. B. Doyle. Semantic road maps for literature searchers. Journal of the ACM (JACM), 8(4):553–578, 1961.
[7] H. P. Edmundson and R. E. Wyllys. Automatic abstracting and indexing-survey and recommendations. Communications of the ACM, 4(5):226–234, 1961.
[8] R. Forsyth and R. Rada. Machine Learning: Expert Systems and Information Retrieval, chapter Putting an Edge. Ellis Horwood, London, 1986.
[9] Y. S. Han, Y. K. Han, and K. Choi. Lexical concept acquisition from collocation map. In Workshop on Acquisition of Lexical Knowledge from Text, 31st Annual Meeting of the ACL, Columbus, Ohio, 1993.
[10] N. Ide and J. Veronis. Word sense disambiguation: The state of the art. Computational Linguistics, 24(1):1–40, 1998.
[11] A. Jennings and H. Higuchi. A personal news service based on a user model neural network. IEICE Transactions on Information and Systems, 75(2):198–209, 1992.
[12] W. P. Jones and G. W. Furnas. Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society of Information Science, 38(6):420–442, May 1986.
[13] K. L. Kwok. A neural network for probabilistic information retrieval. In 12th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–30, Cambridge, Massachusetts, USA, 1989.
[14] D. Lawrie and B. W. Croft. Discovering and comparing topic hierarchies. In RIAO 2000 Conference, pages 314–330, 2000.
[15] D. Lawrie, B. W. Croft, and A. Rosenberg. Finding topic words for hierarchical summarization. In 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 349–357, New Orleans, Louisiana, United States, 2001. ACM Press.
[16] M. Mc Elligott and H. Sorensen. An evolutionary connectionist approach to personal information filtering. In 4th Irish Neural Networks Conference, pages 141–146, University College Dublin, Ireland, 1994.
[17] N. Nanas, V. Uren, A. De Roeck, and J. Domingue. A comparative study of term weighting methods for information filtering. Technical Report KMi-TR-128, Knowledge Media Institute, The Open University, 2003. http://kmi.open.ac.uk/publications/papers/kmi-tr-128.pdf.
[18] C. G. Nevill-Manning, I. H. Witten, and G. W. Paynter. Lexically-generated subject hierarchies for browsing large collections. International Journal on Digital Libraries, 2(2-3):111–123, 1999.
[19] Y. C. Park, Y. S. Han, and K.-S. Choi. Automatic thesaurus construction using bayesian networks. Information Processing and Management, 32(5):543–553, 1996.
[20] T. Rose, M. Stevenson, and M. Whitehead. The reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proceedings of the Third International Conference on Language Resources and Evaluation, 2002.
[21] M. Sanderson and M. Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206–213. ACM Press, 1999.
[22] H. Sorensen, A. O'Riordan, and C. O'Riordan. Profiling with the informer text filtering agent. Journal of Universal Computer Science, 3(8):988–1006, 1997.
[23] C. J. van Rijsbergen. A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 33(2):106–199, 1977.
[24] R. Wilkinson and P. Hingston. Using the cosine measure in a neural network for document retrieval. In 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 202–210. ACM Press, 1991.