Using English to Retrieve Software

M. R. Girardi and B. Ibrahim
University of Geneva

Authors' address: Centre Universitaire d'Informatique - Universite de Geneve, 24, rue General Dufour, CH-1211 Geneve 4, Switzerland. E-mail: {girardi, bertrand}@cui.unige.ch. Fax: +41 (22) 320.29.27. Tel: +41 (22) 705.76.57.

Abstract

This paper describes ROSA, a software reuse system based on the processing of the natural language descriptions of software artifacts. Lexical, syntactic and semantic analysis of software descriptions is performed to automatically extract both verbal and nominal phrases from descriptions and to use this information to create frame-based indexing units for software components. Retrieval similarity measures provide good retrieval effectiveness by supporting semantic matching and the processing of lexical relationships between terms. Some results from an experiment evaluating retrieval effectiveness are discussed.
1 Introduction
This paper describes ROSA (Reuse Of Software Artifacts), a software reuse system based on the processing of the natural language descriptions of software artifacts [9][10][11][12]. The system aims at being cost-effective and domain independent while providing good retrieval effectiveness. Automatic indexing is required to make software retrieval systems cost-effective: reuse systems that index software components manually are difficult and expensive to set up. We propose an approach for the automatic indexing of software components that considers the semantics of commonly used classification schemes for software components. Indexing terms are automatically extracted through lexical, syntactic and semantic processing of both user queries and software descriptions. The knowledge required for both retrieval and classification activities is also acquired from public domain lexicons. These features give the system general applicability, independent of the application domain of software libraries.

Most software retrieval systems retrieve components through a set of keywords provided by a user employing either a controlled vocabulary or a free one. These systems are simple and effective for experienced users, but their effectiveness breaks down for users not familiar with the proper terminology. Such users may not know the proper keywords, and may therefore use a synonym, a related term, or a more general or specialized term. In such cases, keyword-based systems fail, because they either provide no answer or retrieve a great number of irrelevant software components. Also, it is friendlier for a user to submit a query in natural language than to think in terms of proper keywords, corresponding classification schemes or boolean combinations of keywords. Users should not be concerned with the internal organization of software libraries. On the other hand, the effectiveness of keyword-based systems is limited by the so-called "keyword barrier", i.e. the systems are unable to improve retrieval effectiveness beyond a certain performance threshold [18].
This situation is particularly critical in software retrieval, where users require high precision, i.e. they expect to retrieve only the best components for reuse, avoiding having to select the best one from a list containing many irrelevant components. Natural language processing techniques, and the conceptual information that can be extracted from descriptions of software components, are potentially useful to increase retrieval effectiveness [13]. There are also other reasons to use natural language for software retrieval. In addition to source code, it is also interesting to retrieve all the products of the development cycle, such as requirements and design specifications. Source code is available in different programming languages, and different specification languages and formalisms are used for the other development products. The natural language documentation of these products makes it possible to handle all these artifacts uniformly.

The paper is organized as follows. Section 2 summarizes related work in the area of reuse systems and on the application of natural language processing techniques to information retrieval systems. Section 3 outlines the main mechanisms in the current version of the reuse system. Section 4 describes the semantic formalism used to identify, in a software description, the knowledge needed to catalogue the component in a library. Section 5 introduces the mechanisms defined for the analysis of descriptions (morpholexical, syntactic and semantic analysis) and the semantic structure of the Knowledge base. Some heuristics used to remove both syntactic and semantic ambiguities from software descriptions are also outlined. Section 6 presents the mechanisms used to process queries and retrieve existing components, along with the measures used for the similarity analysis of the indexing structures. Section 7 describes some empirical results on retrieval effectiveness. Section 8 concludes the paper with some remarks on the main advantages and limitations of ROSA and further research.
2 Related work
Research related to this work comprises software reuse systems and the application of natural language analysis techniques to information retrieval systems.
2.1 Software reuse systems
Wood and Sommerville [31] propose a frame-based software component catalogue that aims at capturing the meaning of software components. The catalogue consists of a frame for the basic function that a software component performs, with slots for the objects manipulated by the component. The frames are not used to parse input sentences. Instead, the retrieval system presents them as templates to be filled in by the user via menus. The time-consuming search through the possible fillers is left to the user. The indexing task, i.e. the construction of frames for the software component catalogue, is performed manually.

Prieto-Diaz [21] proposes a faceted classification scheme for cataloguing software components. This scheme has also been adopted by other reuse proposals, like the REBOOT project [26]. The scheme uses facets as descriptors of software components. These facets have a meaning similar to the semantic cases defined in this work. The scheme consists of four facets: the function performed by the component, the object manipulated by the function, the data structure where the function takes place (medium or location) and the system to which the function belongs.
The first three facets are considered sufficient to describe the functionality of the component. For example, the following terms, corresponding to each of the facets, describe the 'grep' family of Unix components, which 'search' (function facet) for a 'string' (object facet) in a 'file' (data structure/location facet) and are part of a 'line editor' (system facet). Catalogues are constructed manually and retrieval is performed by specifying keywords for each facet in the scheme.

PROTEUS [8] is a software reuse system that supports different classification methods for software components (simple keyword-based, faceted, enumerated graph and attribute value) in order to make a comparative empirical evaluation of those methods. The architecture of the system has been designed to be flexible enough to support new classification and retrieval methods.

Relatively few software reuse systems take advantage of the conceptual information available in the natural language documentation of software components to improve retrieval effectiveness. Existing systems may be classified in two basic groups: free-text indexing reuse systems and knowledge-based reuse systems.
2.1.1 Free-text indexing reuse systems
Free-text indexing systems automatically extract keyword attributes from the natural language queries provided by the user, and these attributes are used to locate software components. Similarly, software components are classified in the software library by indexing them according to keyword attributes extracted from their natural language documentation. These systems work at the lexical level, so they do not use the syntactic and semantic information available in a natural language description. Some of the systems that follow this approach are described below.

The GURU system [17] classifies software components according to attributes automatically extracted from their natural language documentation by using an indexing scheme based on the notions of lexical affinity and quantity of information. The retrieval system accepts queries in free-style natural language, and the same technique used to index software components is applied to queries. Even without attempting any understanding of the query and without using any kind of semantic knowledge, the system exhibits better precision than single keyword-based systems by constructing queries with pairs of words that occur in the text, according to the concept of lexical affinity. An extension of the GURU approach is proposed in [14] for object-oriented class libraries. In this approach, besides the natural language documentation of components, source code is also a source of information to classify components. The information automatically extracted from source code derives from inheritance and part_of relationships.

The RSL (Reusable Software Library) [2] includes the RSL database and four subsystems: a library management system, a user query system, a software retrieval and evaluation system and a computer-aided design system. The library management system, used to populate and maintain the software library, automatically scans source code files and extracts specially labelled comment statements with attributes such as keywords describing the functionality of components, author, date created, etc. The user query system has a natural language front-end and interacts with the user in order to refine the query and reduce the set of retrieved components.
The software retrieval and evaluation system retrieves and scores software components through the specification of keywords describing the attributes of components. Finally, the computer-aided design system is used to include retrieved and selected components into an object-oriented design specification.
2.1.2 Knowledge-based reuse systems
Knowledge-based reuse systems perform some kind of syntactic and semantic analysis of natural language specifications without pretending to understand the documents. They are based on a knowledge base which stores semantic information about the application domain and about natural language itself. These systems are usually more powerful than traditional keyword retrieval systems. However, they usually require enormous human resources: knowledge bases are created for each application domain and populated from scratch. Some of the systems that follow this approach are described below.

LaSSIE (Large Software System Information Environment) [3] is composed of a knowledge base, a semantic analysis algorithm based on formal inference and a user interface incorporating a graphical browser and a natural language parser. LaSSIE helps programmers search for useful information about a large software system and also supports the retrieval of components for reuse. The construction of the knowledge base is currently a manual process, and future directions of the project point to the definition of knowledge acquisition mechanisms to automate that process.

NLH/E (Natural Language Help/English) [28] is a question-answering system that accepts queries in English. The authors propose a frame-based catalogue for software components, organized as a case grammar. These frames are created manually, but queries are parsed using the case grammar. The parser returns the tree of matched case frames. If there is no possible match that consumes all the input, the parser returns the "best" matches, i.e. those that account for most of the input. The catalogue is specific to the Common Lisp domain. The input is restricted to imperative sentences or nominal phrases.

Nie [20] proposes a software retrieval system where both queries and software descriptions are parsed using a semantic grammar. A knowledge base for a restricted application domain has been created to assist the parsing process. Knowledge consists of a set of classes modelling data and operations in the domain and a set of relationships among concepts in the classes.

YAKRSIS [1] is a software information system that provides a natural language query interface and uses knowledge bases containing either generic or specific knowledge about a certain domain. The system has been used to implement a cross-reference tool for C++ and a tool to locate UNIX commands. An experiment is reported where a knowledge base has been built modelling the functionality of 70 Unix commands. Recall and precision values of nearly 1 are reported from the experiment.
2.2 Natural language processing in information retrieval
Natural Language Processing (NLP) techniques have been applied in information retrieval systems mainly at the lexical, syntactic and semantic levels. The availability of machine-readable dictionaries made it possible to use lexical processing techniques in information retrieval.
Machine-readable dictionaries have also been used in information retrieval to index text and queries by word senses instead of by the words themselves [16]. These approaches aim at increasing retrieval precision by providing a solution to the polysemy phenomenon, i.e. the fact that words have multiple meanings. Another linguistic phenomenon that reduces retrieval effectiveness, particularly recall, is synonymy, i.e. having multiple words or phrases to describe the same concept. Lexicons containing synonyms, generalizations and specializations of terms, or other related information, are now available to address this problem [19].

NLP techniques for syntactic analysis have been used in information retrieval to index text by elements more complex than simple words, like noun phrases. Templates [4] and parsing techniques [22][27] have been used to determine the boundaries of noun phrases and to use them to index text and queries. Syntactic analysis in information retrieval is mostly restricted to noun phrases and is therefore simpler than what is usually required in NLP applications. It has not been demonstrated, however, that any of these syntactic indexing techniques significantly improves retrieval performance [7]. One problem of indexing by noun phrases is that there are many different syntactic forms to express the same concept (e.g. 'software retrieval' and 'retrieval of software'). Thus, finding similarities between two noun phrases that refer to a similar concept is a difficult process: the phrase must have the same syntactic structure in both the document and the query to make a match possible, and this situation seldom happens. Another problem is the fact that queries normally denote supersets of what the relevant documents denote. If the query is 'software engineering methods' we want, among others, all the documents about 'object oriented software engineering methods'. The syntactic structures of the query and of the expressions making up the documents will be different, and we need to take into account the semantic superset/subset relationship. One approach to these problems is to normalize the syntactic structures into a canonical form. Noun phrases like 'software retrieval' and 'retrieval of software' produce the same canonical or dependency structure and can therefore be retrieved by the same query. This approach is taken by the CLARIT system [5], where text and queries are indexed by noun phrases in a normalized form.

Experiments on retrieval by syntactic phrases have shown poor results. Furthermore, experiments on retrieval by statistical phrases exhibit better effectiveness than purely syntactic methods for information retrieval [7]. Salton [23] considers that "syntactic systems not based on contextual and other semantic considerations can not be completely successful in eliminating ambiguities and constructing useful indexing units". Even if the application of pure syntactic analysis techniques in information retrieval has not shown real improvements in retrieval effectiveness, syntactic techniques have demonstrated their usefulness as auxiliary mechanisms for semantic processing. Syntactic methods are used to take advantage of the text structure, to link words (e.g. how an adjective is linked to a noun) and to resolve some problems related to negations, conjunctions, etc. Earlier approaches to semantic processing in information retrieval were based on the definition of a formal language into which input text could be turned.
More recently, knowledge representation frameworks from artificial intelligence, like semantic networks [32], frames [29] and Schank's primitive acts [18][25][29], have been used to represent knowledge by specifying primitives or simple concepts and then combining them to define more complex concepts. Most information retrieval systems that use semantic analysis techniques are conceptual information retrieval systems [15][18]. In a conceptual information retrieval system the user requests information, and the information is given directly, not just a reference to where it may be found.
3 Overview of ROSA
Figure 1 shows an overview of the current version of the reuse system. The current system consists of a classification mechanism and a retrieval mechanism. The classification system catalogues the software components in a knowledge base using their natural language descriptions. An acquisition mechanism automatically extracts from software descriptions the knowledge needed to catalogue the components in the knowledge base. The system extracts lexical, syntactic and semantic information from simple verbal and nominal sentences in the description, and this knowledge is used to create a frame-based internal representation for the software component. The interpretation mechanism used for the analysis of a description does not pretend to understand the meaning of a description, but rather to automatically acquire enough information to construct useful indexing units for software components. Semantic analysis of descriptions follows the rules of a semantic formalism. The formalism consists of a case system as well as constraints and heuristics to perform the translation of the description into an internal representation. Both syntactic and semantic rules are implemented in a grammar that parses descriptions into a set of frames. The semantic formalism is based on semantic relationships between the noun phrases and the verb in a sentence.
[Figure 1 - Overview of ROSA. A query and a software component description are both submitted to lexical, syntactic and semantic analysis, supported by the lexicon. Software classification stores the internal representation of the component in the Knowledge base; query classification produces an internal representation of the query, which the retrieval step matches against the Knowledge base through matching and similarity analysis to produce the candidate component(s).]
These semantic relationships ensure that similar software descriptions have similar internal representations. A classification scheme for software components derives from the semantic formalism, through a set of generic frames. The internal representation of a description constitutes the indexing unit for the software component, constructed as an instance of these generic frames. Public domain lexicons are used to obtain the lexical information needed during the parsing process. The WordNet lexicon [19] is used to obtain morphological information, grammatical categories of terms and semantic relationships between terms. The Knowledge base is a base of frames where each software component has a set of associated frames containing the internal representation of its description, along with other information associated with the component (source code, executable examples, software attributes, etc.).

An analysis mechanism similar to the one applied to software descriptions is used to map a query in free text to an internal representation. The retrieval system uses the set of frames generated for the query to identify similar ones in the Knowledge base. The retrieval system looks for and selects components from the Knowledge base based on the closeness of the frames associated with the query and with the software descriptions. Closeness measures are derived from the semantic formalism and from the conceptual distance between the terms in the frames that are compared. Software components are scored according to their closeness value with respect to the user query. The components with a score higher than a controlled threshold become the candidates for retrieval.

As it stands now, the system deals with simple imperative or nominal sentences for both queries and software component descriptions. An imperative sentence [30] consists of a verb with, possibly, an embedded noun phrase and perhaps some embedded prepositional phrases (e.g. 'search a file for a string'). The subject is omitted and the verb sequence consists of a verb in the infinitive form. A prepositional phrase consists of a preposition and a noun phrase. A noun phrase consists of a noun, along with a determiner, perhaps one or more modifiers (adjective, past or present participle, noun) and, possibly, one or more qualifiers. Imperative sentences describe simple actions that are performed by a software component and, perhaps, the object manipulated by the action, the manner in which the action is performed and other semantic information related to the action. Nominal sentences [30] consist of a noun phrase possibly followed by some prepositional phrases (e.g. 'a windowed interface to ftp'). Nominal sentences usually describe the component object (like the data structure or tool) with semantic information related to the object.
4 The semantic formalism
The main purpose of the retrieval mechanism is to find out that descriptions like the following ones are equivalent or similar:

(a) search a string in a file
(b) search a file for a string
(c) look for a string in a file
(d) examine a text file for lines matching a pattern
A partial solution to this problem is to find a representation language that reduces the expressions to a canonical form. A canonical form requires that every expression equivalent to a given one be reduced to a single form by means of an effective procedure, so that the test of equivalence between descriptions can be reduced to a test of identity of the canonical forms. Having a canonical form would allow determining that descriptions (a), (b) and (c) are equivalent. Although some authors consider it unlikely that there could be a canonical form for English [32], there are some proposals of canonical representation languages for English, like the Conceptual Dependency Theory [25], which has already been tried in information retrieval systems [18]. This approach argues that different sentences with the same basic meaning are given the same representation in a conceptual dependency graph. Unfortunately, the amount of world knowledge and inferential machinery needed to experiment with and evaluate a formalism like the one proposed by the Conceptual Dependency Theory limits its application to small and well-known domains. On the other hand, having a canonical form does not solve the problem of representing similar but not necessarily equivalent expressions like (a) and (d).

The representation formalism proposed in this work provides a partial canonical form to represent software descriptions. The representation is not completely canonical because it allows for some ambiguities. Instead of reducing a description to a single form, the analysis mechanism infers all possible interpretations of a sentence. Heuristics are used to disambiguate alternative interpretations but, if there is not enough knowledge to do so, alternative semantic structures are generated for the same description. Through this semantic representation, the similarity analysis infers that sentences (a), (b) and (c) are equivalent (because they have the same semantic structure and the content of the semantic structures is the same or related by the synonymy relationship) and that these sentences are similar to (d) (because they have the same semantic structure and the content of the semantic structures is related by semantic relationships like hyponymy and hypernymy). Looking for similarities between software descriptions (e.g. generalizations and specializations) is important because software functionality is usually factored into a generic component (with a generalized software description) and because queries are often specific to particular user requirements.

The semantic formalism used to generate the internal representation of both queries and natural language descriptions of software components is inspired by the linguistic case theory of Fillmore [6]. The formalism consists of a case system for simple imperative or nominal sentences in a software description, a case system for noun phrases in those sentences and a case system for other information of the software artifact, along with some constraints and heuristics.
4.1 Case systems
Definition 1 Case system for imperative and nominal sentences in software descriptions

The case system for an imperative or nominal sentence in a software description consists basically of a sequence of one or more semantic cases:

    Sentence → Σ_{j ∈ C} c_j

where:
    C = {j | j is a semantic case}
    c_j = the term in the sentence associated with the semantic case 'j'
A sentence consists of a verb (representing an action), omitted in the case of a nominal sentence, possibly followed by a noun phrase (representing the direct object of the action) and perhaps some embedded prepositional phrases (representing entities or actions related to the main action or object). For instance, the sentence 'search a file for a string' consists of the verb 'search', in the infinitive form, followed by the noun phrase 'a file', which represents the object manipulated by the action, and by the prepositional phrase 'for a string', which represents the goal of the 'search' action. The case system associates a semantic case with each of these syntactic compounds. In the example below, the semantic cases 'Action', 'Location' and 'Goal' are associated respectively with the verb, the direct object and the prepositional phrase:

    'search a file for a string' --> c_Action + c_Location + c_Goal
Definition 2 Semantic cases

Semantic cases show how prepositional phrases are semantically related to the verb or to the main object in a sentence. For instance, in the sentence 'search a file for a string', the semantic case 'Goal' associated with the prepositional phrase 'for a string' shows the target of the action 'search'. The semantic cases currently considered in this work are described in Table 1.

Definition 3 Semantic case generators

A semantic case consists either of a verb (the 'Action' semantic case) or of a case generator (possibly omitted) followed by a noun or verbal phrase:

    c_i --> (Semantic case generator) + [Noun phrase | Verbal phrase]
A semantic case generator reveals the presence of a particular semantic case in a sentence. Semantic case generators are mainly prepositions. For instance, in the sentence 'search a file for a string', the preposition 'for' in the prepositional phrase 'for a string' suggests the 'Goal' semantic case. Table 2 shows some generators of the semantic cases considered in this work. For instance, the preposition 'for' can suggest any of the semantic cases 'Goal', 'Purpose' and 'Duration'. The technique used for the semantic analysis of descriptions tries to disambiguate these different interpretations by applying a set of heuristics (based on both syntactic and semantic knowledge) to the associated prepositional phrase that follows in the sentence.
Semantic case   Description
Action          The main function performed by the component
Agent           The software component that implements the object functionality (implicit)
Comparison      The entity compared with the main object
Condition       The premise upon which the action is performed
Destination     The destination for the main object of the component
Duration        How long an action lasts
Goal            The main object, the target or result of the action
Instrument      The entity with which the action is performed
Location        Either the main object, where the action occurs, or the entity (e.g. the data structure) used as a medium for the action
Manner          The mode by which the action is performed
Purpose         Either the main functionality of the component or the reason for the main action
Source          The origin of the main object
Time            When the action occurs

Table 1 - Major semantic cases found in simple sentences describing software components
Semantic case   Case generator
Comparison      as, between, like, with
Condition       as, at, except, how, unless, until
Destination     to
Duration        during, for, through
Goal            for, to
Instrument      'by using', with, using, via
Location        at, in, into, on, onto, through, within
Manner          by
Purpose         for, to
Source          from
Time            at, in, into, on, through

Table 2 - Semantic cases and some of their case generators
Definition 4 Case system for noun phrases in software descriptions

The case system for a noun phrase consists of a sequence of zero or more cases representing modifiers of a head noun and one case representing the head noun in the noun phrase (simple noun phrase case system), or of these same cases with an additional case that represents a complementary noun phrase associated with the main noun phrase (composite noun phrase case system):

    Noun_phrase → Σ_{j ∈ Mod} Mod_j + Head + (complementary_np)

where:
    Mod = {j | j is a modifier case} = {adjective_modifier, participle_modifier, noun_modifier}

The optional complementary noun phrase case consists of the case generators 'of' or 'about' followed by a noun phrase:

    complementary_np → Complementary_np_case_generator + Noun_phrase

where:
    Complementary_np_case_generator = {of, about}
For instance, the phrase 'retrieval of software artifacts' is a composite noun phrase where the preposition 'of' suggests a complementary noun phrase case, representing the simple noun phrase 'software artifacts'. The noun phrase case 'head noun' is associated with the term 'retrieval' in the main noun phrase. The noun phrase cases 'noun_modifier' and 'head noun' are associated with the terms 'software' and 'artifacts', respectively, in the complementary noun phrase.

Definition 5 Case system for software attributes

The case system for the additional information available in a software artifact consists of a sequence of cases representing the attributes of the software artifact (name, description, examples, test collections, reuse attributes, etc.):

    Component → c_name + c_description + ...
Definition 6 Case system constraints and heuristics

The following constraints and heuristics govern the case systems. They state syntactic and semantic rules to identify both semantic and noun phrase cases in a sentence of a software description.

1. No semantic case may appear twice.
2. The 'Action' semantic case is obtained from the main verb in the sentence.
3. The case generator of either the 'Location' or the 'Goal' semantic case is occasionally empty, in which case the semantic case is obtained from the direct object in the sentence. Otherwise, the semantic case is obtained from a prepositional phrase in the sentence containing a case generator of the semantic case.
4. Other semantic cases are obtained from the prepositional phrases in the sentence containing a case generator of the semantic case.
5. The 'Purpose' and 'Manner' semantic cases consist of an appropriate case generator followed by a verbal phrase. Other semantic cases consist of an appropriate case generator followed by a noun phrase.
6. Two interpretations of a composite noun phrase (e.g. 'retrieval of software artifacts') are considered: (a) the complementary noun phrase case is identified (e.g. the noun phrase 'software artifacts' is associated with the complementary noun phrase case), and (b) both the modifiers and the head noun in the complementary noun phrase are considered as modifiers of the main noun phrase (e.g. the composite noun phrase is interpreted as 'software artifacts retrieval').

The case system, the constraints and the heuristics supporting the semantic formalism are implemented in the grammar used for the analysis of the descriptions.
5 The analysis of the descriptions
Morpholexical, syntactic and semantic analysis of software descriptions is performed to map a description to a frame-based internal representation.
5.1 Morpholexical analysis
The purpose of morpholexical analysis is to process the individual words in a sentence to recognize their standard forms (e.g. 'search' as the standard form of 'searching'), their grammatical categories (verb, noun, adjective, adverb) and their semantic relationships with other words in a lexicon (like synonyms, hypernyms and hyponyms). Morpholexical analysis also performs the processing of collocations and idioms, i.e. cases where two or more individual words are used in habitual association (multi-word lexical entries). Collocations consist of associations of modifiers (like adjectives) with nouns (e.g. 'personal computer'), multiple nouns (e.g. 'data processor'), verb-particle combinations (e.g. 'take over'), verb-preposition combinations (e.g. 'look for') and verb/noun combinations (e.g. 'take care'). Morpholexical analysis produces a set of facts stating the grammatical categories of words and collocations:

    lex(<word>, <grammatical category>).

For instance:

    lex('search', verb).
    lex('search', noun).
Morpholexical information is obtained from the WordNet lexicon [19].

5.1.1 Semantic relations between terms
Two semantic relations between terms are currently considered: synonymy and hyponymy/hypernymy. The following predicates are used to represent these relations.
Definition 7 Synonymy

The predicate synonym(x,y,c) means that the term 'y' is a synonym of the term 'x' in the lexical category 'c'. For instance:

    synonym('computer', 'data processor', noun).

The relationship is symmetric and transitive (for a given sense):

    synonym(x,y,c) -> synonym(y,x,c).
    synonym(x,y,c), synonym(y,z,c) -> synonym(x,z,c).
Definition 8 Hyponymy/Hypernymy

The predicate hyponym(x,y,c,d) means that the term 'y' is a hyponym (a specialization) of the term 'x' at a distance 'd' in the lexical category 'c'. For instance:

    hyponym('computer', 'PC', noun, 1).

The predicate hypernym(x,y,c,d) means that the term 'y' is a hypernym (a generalization) of the term 'x' at a distance 'd' in the lexical category 'c'. For instance:

    hypernym('computer', 'machine', noun, 1).

The hyponymy/hypernymy relation is usually called the subordination/superordination, subset/superset or IS_A_KIND_OF relation. The following properties characterize the relationship:

    hyponym(x,y,c,d), hyponym(y,z,c,e) -> hyponym(x,z,c,d+e).
    hyponym(x,y,c,d) -> hypernym(y,x,c,d).
Morpholexical analysis of user queries generates a set of facts stating the synonyms, generalizations and specializations of the terms in the query for a particular grammatical class:

    synonym(<word>, <synonym>, <grammatical category>).
    hyponym(<word>, <hyponym>, <grammatical category>, <distance>).
    hypernym(<word>, <hypernym>, <grammatical category>, <distance>).

like, for instance:

    synonym('search', 'look for', verb).
    synonym('search', 'hunt', noun).
    hyponym('search', 'finger', verb, 1).
    hyponym('search', 'looking for', noun, 1).
    hypernym('search', 'examine', verb, 1).
    hypernym('search', 'activity', noun, 1).
    hypernym('search', 'act', noun, 2).
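To make the two definitions concrete, the following is a minimal sketch of how such relations can be derived in Prolog (the implementation language of the ROSA prototype, see section 7). The synonym_fact, hyponym_fact and hypernym_fact base predicates are hypothetical stand-ins for facts extracted from WordNet, and the sample facts are illustrative:

    % Illustrative base facts (assumed to come from WordNet).
    synonym_fact('computer', 'data processor', noun).
    hyponym_fact('computer', 'PC', noun, 1).
    hypernym_fact('computer', 'machine', noun, 1).

    % Synonymy is symmetric (Definition 7); the transitive rule is omitted
    % here, since a naive closure over symmetric facts would not terminate.
    synonym(X, Y, C) :- synonym_fact(X, Y, C).
    synonym(X, Y, C) :- synonym_fact(Y, X, C).

    % One specialization step, whichever direction it is stored in.
    hypo_step(X, Y, C, D) :- hyponym_fact(X, Y, C, D).
    hypo_step(Y, X, C, D) :- hypernym_fact(X, Y, C, D).

    % Distances accumulate along hyponymy chains (Definition 8).
    hyponym(X, Y, C, D) :- hypo_step(X, Y, C, D).
    hyponym(X, Z, C, D) :-
        hypo_step(X, Y, C, D1),
        hyponym(Y, Z, C, D2),
        D is D1 + D2.

    % Every hyponym link is a hypernym link read in the other direction.
    hypernym(Y, X, C, D) :- hyponym(X, Y, C, D).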
5.2 Syntactic and semantic analysis
Just after morpholexical analysis, both syntactic and semantic analysis of software descriptions are performed, using the definite clause grammar supporting the case system, a portion of which is outlined in Figure 2.
Sentence            --> Imperative_sentence | Nominal_sentence.

Imperative_sentence --> Action (Direct_object_case) (Other_cases).
Nominal_sentence    --> Direct_object_case (Other_cases).

Direct_object_case  --> Goal | Location.
Other_cases         --> Semantic_case (Other_cases).
Semantic_case       --> Comparison | Condition | Destination | Duration | Goal | Instrument | Location | Manner | Purpose | Source | Time.
                        Semantic_case ∉ {Other_cases}

Action              --> verb (adverb).
Comparison          --> Comparison_marker Noun_phrase.
Condition           --> Condition_marker Noun_phrase.
Destination         --> Destination_marker Noun_phrase.
Duration            --> Duration_marker Noun_phrase.
Goal                --> (Goal_marker) Noun_phrase.
Instrument          --> Instrument_marker Noun_phrase.
Location            --> (Location_marker) Noun_phrase.
Manner              --> Manner_marker Imperative_sentence.
Purpose             --> Purpose_marker Imperative_sentence.
Source              --> Source_marker Noun_phrase.
Time                --> Time_marker Noun_phrase.

Comparison_marker   --> BETWEEN | AS | LIKE | WITH.
Location_marker     --> IN | ON | INTO | WITHIN | AT | THROUGH.
. . . (see Table 2)

Noun_phrase         --> Main_np (Qualifier).
Main_np             --> (Pre_determiner) (Determiner) (Ordinal) (Cardinal) (Modifiers) Head.
Qualifier           --> Complementary_np | Other_qualifier.
Complementary_np    --> Complementary_np_marker Noun_phrase.
Modifiers           --> Modifier (Modifiers).
Modifier            --> Adjective_modifier | Participle_modifier | Noun_modifier.
Complementary_np_marker --> OF | ABOUT.
. . .

Figure 2 - A portion of the definite clause grammar supporting the case system. Names in all capitals, initial capitals and all lower case indicate, respectively, terminal symbols, non-terminal symbols and lexical categories. Terms in parentheses are optional. The grammar implements a subset of the grammar rules for imperative and nominal sentences in English [30].
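As a concreteness check, the following is a small executable fragment of such a grammar in Prolog DCG form. It covers only the pattern of the running example, and its three-word lexicon is illustrative rather than the system's actual grammar or lexicon:

    % A minimal, executable DCG fragment (sketch; not the full Figure 2 grammar).
    lex(search, verb).  lex(file, noun).  lex(string, noun).

    sentence(vp(Action, Cases)) --> action(Action), cases(Cases).
    action(A) --> [A], { lex(A, verb) }.
    % The Location case generator may be empty (constraint 3 of Definition 6).
    cases([location(N) | Cs]) --> noun_phrase(N), more_cases(Cs).
    % 'for' acting as a Goal case generator (Table 2).
    more_cases([goal(N)]) --> [for], noun_phrase(N).
    more_cases([]) --> [].
    noun_phrase(np(N)) --> [a], [N], { lex(N, noun) }.

    % ?- phrase(sentence(S), [search, a, file, for, a, string]).
    % S = vp(search, [location(np(file)), goal(np(string))]).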
The classification mechanism uses the grammar to parse software descriptions. Following the parsing process, a set of semantic rules is applied to remove semantic case ambiguities, and a set of frame-based semantic structures is generated, representing the internal structures of the software descriptions. Syntactic and semantic analysis also includes the processing of 'support' verbs, i.e. verbs that support the execution of an action (e.g. 'do', 'perform', 'carry out', 'make'), in phrases of the form:

    'support verb' + an action
or
    'support verb' + the action of ...

The analyzer substitutes for the 'support' verb and the action the proper verb that implements the action, found by looking the action up in a dictionary. For instance, since 'retrieval' is 'the act or process of retrieving', a phrase like 'perform a retrieval' is substituted by the verb 'retrieve':

    perform a retrieval --> retrieve.
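A sketch of this substitution on a tokenized phrase is shown below; the nominalization/2 table stands in for the dictionary lookup described above and is illustrative:

    % Support verbs that trigger the substitution.
    support_verb(do).
    support_verb(perform).
    support_verb(make).

    % 'retrieval' is 'the act or process of retrieving' (assumed dictionary entry).
    nominalization(retrieval, retrieve).

    % 'perform a retrieval' --> 'retrieve'
    normalize([SV, Det, Noun | Rest], [Verb | Rest]) :-
        support_verb(SV),
        member(Det, [a, an, the]),
        nominalization(Noun, Verb),
        !.
    normalize(Phrase, Phrase).

    % ?- normalize([perform, a, retrieval], P).
    % P = [retrieve].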
5.2.1 The classification scheme
Figure 3 outlines the frame-based classification scheme for modelling the internal representation of software components. The classification scheme consists of a hierarchical structure of generic frames ('IS_A_KIND_OF' relationship). Frames that are instances of these generic frames ('IS_A' relationship) implement the indexing units associated with software components. The hierarchical structure allows alternative interpretations of a particular syntactic compound. The major generic frames of the Knowledge base are described in Figure 4.

Case_frame          --> FRAME Frame_name Hierarchical_link CASES Case_list.
Hierarchical_link   --> IS_A Frame_name | IS_A_KIND_OF Frame_name
Case_list           --> Case (Case_list)
Case                --> Case_name Facet
Case_name           --> Semantic_case_name | NP_case_name | Component_case_name
Semantic_case_name  --> Action | Agent | Comparison | Condition | Destination | Duration | Goal | Instrument | Location | Manner | Purpose | Source | Time
NP_case_name        --> Head | Adjective_modifier | Participle_modifier | Noun_modifier | Complementary_np.
Component_case_name --> Name | Description | ...
Facet               --> VALUE Value | DOMAIN Frame_name | CATEGORY Lexical_category
Value               --> string | Instance_name
Lexical_category    --> verb | adj | noun | adv | component_id | text.
Frame_name          --> root_frame | verb_phrase | noun_phrase | component.
Instance_name       --> Component_name&_&Frame_name&_&Num.
Component_name      --> string.
Num                 --> integer

Figure 3 - A classification scheme for modelling the internal structures of software components
FRAME verb_phrase
  IS_A_KIND_OF root_frame
  CASES
    Action      CATEGORY verb
    Agent       DOMAIN component
    Comparison  DOMAIN noun_phrase
    Condition   DOMAIN noun_phrase
    Destination DOMAIN noun_phrase
    Duration    DOMAIN noun_phrase
    Goal        DOMAIN noun_phrase
    Instrument  DOMAIN noun_phrase
    Location    DOMAIN noun_phrase
    Manner      DOMAIN verb_phrase
    Purpose     DOMAIN verb_phrase
    Source      DOMAIN noun_phrase
    Time        DOMAIN noun_phrase.

FRAME noun_phrase
  IS_A_KIND_OF root_frame
  CASES
    Adjective_modifier  CATEGORY adj
    Participle_modifier CATEGORY verb
    Noun_modifier       CATEGORY noun
    Head                CATEGORY noun
    Complementary_np    DOMAIN noun_phrase.

FRAME component
  IS_A_KIND_OF root_frame
  CASES
    Name        CATEGORY component_id
    Description CATEGORY text
    . . . {other information associated with the component, e.g. source code, executable examples, reuse attributes, etc.}

Figure 4 - Generic frames of the Knowledge Base
The generic frames model the semantic structures associated with verb phrases (the 'verb_phrase' frame), noun phrases (the 'noun_phrase' frame) and the other information associated with software components (name, description, source code, executable examples, reuse attributes, etc., in the 'component' frame). Semantic cases, noun phrase cases and cases related to software attributes are represented as slots in a frame. According to the semantic formalism, the semantic cases defined in the sentence case system are assigned to the slots of the generic verb_phrase frame, while noun phrase cases and component cases are assigned to the generic noun_phrase frame and component frame, respectively. 'Facets' are associated with each slot in a frame, describing either the value of the case or the name of the frame where the value is instantiated (the 'VALUE' facet), the type of frame that describes its internal structure (the 'DOMAIN' facet) or the lexical category of the case (the 'CATEGORY' facet). For instance, the 'Location' slot in the verb_phrase frame has a 'DOMAIN' facet indicating that its constituents are described in a frame of type 'noun_phrase'. Through the parsing process, the interpretation mechanism maps the verb, the direct object and each prepositional phrase in a sentence into a semantic case, based on both syntactic features and the identified case generators.
FRAME verb_phrase_1
  IS_A verb_phrase
  CASES
    Agent    VALUE grep_component
    Action   VALUE 'search'
    Location VALUE grep_noun_phrase_1
    Goal     VALUE grep_noun_phrase_2.

FRAME grep_noun_phrase_1
  IS_A noun_phrase
  CASES
    Head VALUE 'file'.

FRAME grep_noun_phrase_2
  IS_A noun_phrase
  CASES
    Head VALUE 'string'.

FRAME grep_component
  IS_A component
  CASES
    Name        VALUE 'grep'
    Description VALUE 'search a file for a string'.

Figure 5 - An indexing structure for the "grep" command: "search a file for a string"
Figure 5 shows the indexing structure for the 'grep' family of Unix commands, built from the description 'search a file for a string'. An instance of the verb_phrase frame is generated by instantiating the slots corresponding to the semantic cases identified in the description ('Action', 'Location' and 'Goal'). These cases have an associated 'VALUE' facet indicating either the value of the slot (as 'search' for the 'Action' case) or the name of the instance frame holding its value (grep_component, grep_noun_phrase_1 and grep_noun_phrase_2 for the semantic cases 'Agent', 'Location' and 'Goal', respectively).

5.2.2 Interpretation mechanism
Through the parsing process, the interpretation mechanism maps each prepositional phrase in a sentence to a semantic case, based on the identified case generator. Sentences such as

(a) search a file for a string,
(b) search a string in a file,
(c) look for a string in a file, or
(d) examine the text file for lines matching a pattern

produce the same semantic structure as in Figure 5 (i.e. instance frames with the same semantic cases) and either the same 'VALUE' facet or one that is related by semantic relationships:

    synonym('look', 'search', verb).
    hypernym('search', 'examine', verb, 1).
    hyponym('file', 'text file', noun, 1).
    hypernym('string', 'line', noun, 1).
Because a semantic case generator can suggest more than one semantic case (Table 3), more than one semantic structure can be generated for the analyzed description. For instance, the analysis of the sentence 'search a file for a string' generates two alternative semantic structures, because of the ambiguity of the 'for' case generator: the one of Figure 5, where the semantic case 'Goal' is identified, and the one of Figure 6, where the semantic case 'Duration' is identified. (Note that at this level there is no ambiguity between these semantic cases and the 'Purpose' semantic case: the 'Purpose' interpretation is discarded during syntactic analysis, because the phrase in the prepositional phrase is a nominal and not a verbal one.)
Ambiguous case generator   Semantic cases
as                         comparison, condition
at                         condition, location, time
for                        duration, goal
in                         location, time
into                       location, time
on                         location, time
through                    location, time
to                         destination, goal, location
with                       comparison, instrument

Table 3 - Ambiguity in semantic cases
FRAME verb_phrase_1
  IS_A verb_phrase
  CASES
    Agent    VALUE grep_component
    Action   VALUE 'search'
    Location VALUE grep_noun_phrase_1
    Duration VALUE grep_noun_phrase_2.

Figure 6 - Wrong interpretation of the description "search a file for a string". The 'Duration' semantic case is identified, but the interpretation is discarded through lexical knowledge of the terms in the lexicon.
Heuristics are used to reduce the number of alternative structures that are generated and thus to increase the precision of the interpretation. These heuristics are based on knowledge about the phrase in the semantic case. For instance, the rules:

    Duration(term) --> hypernym(term, 'time period').
    Time(term)     --> hypernym(term, 'time period').

state a precondition for the 'Duration' and 'Time' semantic cases: the term in the 'Duration' semantic case must be a specialization of the term 'time period'. The first rule suggests that the 'Duration' semantic case is not a good interpretation of the prepositional phrase 'for a string', because 'string' is something different from a period of time. Thus, the rule makes it possible to disambiguate the alternative interpretations of the sentence 'search a file for a string'. The interpretation mechanism therefore discards frames where an occurrence of a 'Duration' or 'Time' semantic case has been identified and the associated head noun is not a time unit. In the case where the head noun is a valid time unit, ambiguous interpretations (like 'Location', 'Goal' and 'Condition') are discarded based on the associated case generator.
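One way to encode this discarding step, reusing the hypernym/4 relation of section 5.1.1, is sketched below; case_of/3 and head_noun/2 are hypothetical accessors over the frame instances:

    % Preconditions on case terms ('Duration' and 'Time' are constrained here).
    case_precondition(duration, Term) :- hypernym(Term, 'time period', noun, _).
    case_precondition(time,     Term) :- hypernym(Term, 'time period', noun, _).

    constrained_case(duration).
    constrained_case(time).

    % A candidate frame is discarded if some constrained case has a head noun
    % that fails its precondition.
    discard(Frame) :-
        case_of(Frame, Case, NP),          % hypothetical: enumerate case slots
        constrained_case(Case),
        head_noun(NP, Head),               % hypothetical: head noun of the NP frame
        \+ case_precondition(Case, Head).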
6 Query processing and similarity analysis
A user query is an imperative or nominal sentence. An analysis mechanism similar to the one applied to software descriptions is used to map the query to a frame-based internal representation. The internal representation is then compared with the ones associated with software descriptions in the Knowledge base. A similarity analysis is performed and matching components are retrieved and scored according to their closeness to the user's query.

The retrieval mechanism provides a good level of precision by reducing the number of irrelevant components that are retrieved. The retrieval is based on the detection of similar semantic structures, i.e. structures that share all or some semantic cases and where there are some lexical relationships between the terms in the shared semantic cases. Precision is ensured by basing the similarity analysis on the syntactic and semantic information available in the internal representations of both the query and the software descriptions. For instance, retrieved candidates for the query 'search a file for a string' would consist of descriptions with semantic structures similar to the ones of Figure 5: actions similar to 'search', something similar to 'file' as the object manipulated by the action and something similar to a 'string' as the goal of the action. Thus, a retrieved component description like 'look for a string in a file', where all semantic cases are matched and terms are identical or synonyms, has the highest score in the similarity analysis. On the other hand, component descriptions irrelevant to that query, like 'file the search tree as a string', which would be retrieved with a high score, if present, by a simple keyword retrieval mechanism, are discarded by this retrieval mechanism. Recall is ensured by allowing partial matching of the semantic structures and by considering synonyms, hyponyms and hypernyms of the terms in the semantic cases. Thus, a retrieved component description for the query above like 'browse a file', where only the 'Action' and 'Location' semantic cases are matched and where the case terms are at a greater conceptual distance ('browse' is a hyponym of 'search'), is also considered a relevant description, but it has a lower score than descriptions like 'look for a string in a file'.
6.1 Similarity computation
Definition 9 Similarity between a query and a software component

The similarity value between a query 'q' and a software component 'c' is computed by comparing the frame structures associated with the internal representations of the query and of the software component and taking the maximum value:

    S(q, c) = max_{k,d,l} S_Case(FS_qk, FS_cdl)
The internal representation of a query consists of a set of frame structures {FS_qk} representing the different interpretations 'k' of a query 'q'. The internal representation of a software component consists of a set of frame structures {FS_cdl}, representing the different interpretations 'l' of the different descriptions 'd' that can be associated with a software component 'c'.
Definition 10 Similarity between frame structures

The function S_Case(FS_qk, FS_cdl) below measures the similarity between the frame structure FS_qk (the interpretation 'k' of the user query 'q') and the frame structure FS_cdl (the interpretation 'l' of the description 'd' of the software component 'c'). The function is based both on the number of semantic cases that have a match and on the closeness of the terms in the matched cases. Semantic cases are weighted by a factor w_j that represents the relative importance of a semantic case in describing the functionality of the component (current experiments assume the same relative importance for all semantic cases in a query):

    S_Case(FS_qk, FS_cdl) = Σ_{∀j ∈ SC_qk ∩ SC_cdl} w_j · sc_closeness(sc_qkj, sc_cdlj)

with

    Σ_{∀j ∈ SC_qk ∪ SC_cdl} w_j = 1

where:
    SC_qk = {j | j is a semantic case in the frame structure of the query 'q' with interpretation 'k'}
    SC_cdl = {j | j is a semantic case in the frame structure of the component 'c' with description 'd' and interpretation 'l'}
    sc_qkj = the term of the semantic case 'j' in the frame structure of the query 'q' with interpretation 'k'
    sc_cdlj = the term of the semantic case 'j' in the frame structure of the component 'c' with description 'd' and interpretation 'l'
    w_j = the weight of the semantic case 'j'
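The following sketch computes this measure over a flattened Case-Term representation of two frame structures (hypothetical; the real structures are nested frames, and sc_closeness/4 is assumed to dispatch as in Definition 11 below). Uniform weights over the union of cases are used, as assumed in the experiments:

    % s_case(+QueryCases, +ComponentCases, -Similarity)
    s_case(QueryCases, CompCases, S) :-
        pairs_keys(QueryCases, QK),
        pairs_keys(CompCases, CK),
        union(QK, CK, Union),
        length(Union, N),
        W is 1.0 / N,                     % uniform w_j summing to 1
        aggregate_all(sum(WCl),
                      ( member(J-Tq, QueryCases),
                        member(J-Tc, CompCases),
                        sc_closeness(J, Tq, Tc, Cl),
                        WCl is W * Cl ),
                      S).

    % ?- s_case([action-search, location-file, goal-string],
    %            [action-look_for, location-file, goal-string], S).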
Definition 11 Closeness between semantic cases

The closeness between semantic cases is a function of the conceptual distance between the terms in the semantic cases, according to their 'DOMAIN' or 'CATEGORY' facet (single terms, noun phrases or verb phrases):

    sc_closeness(sc_qkj, sc_cdlj) =
        np_closeness(sc_qkj, sc_cdlj)   if DOMAIN(j) = noun_phrase
        closeness(sc_qkj, sc_cdlj)      if FACET(j) = CATEGORY
        S_Case(sc_qkj, sc_cdlj)         if DOMAIN(j) = verb_phrase

6.1.1 Closeness measure between single terms
Definition 12 Conceptual distance between single terms

The function dist(x,y) measures the conceptual distance between the single terms or collocations 'x' and 'y' by considering the distance between the terms in a lexicon, according to the lexical relationships of synonymy, hyponymy and hypernymy described in section 5.1.1.
    dist(x, y) =
        0                                                 if x = y or synonym(x, y)
        min {d | hyponym(x, y, d) or hypernym(x, y, d)}
        ∞                                                 otherwise
Definition 13 Closeness between single terms

The function closeness(x,y) measures the closeness between the single terms or collocations 'x' and 'y'. The function is inversely proportional to the conceptual distance between the terms as given in Definition 12 (nearer terms have a greater closeness value):

    closeness(x, y) = v^dist(x,y), with 0 < v < 1
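A direct transcription of Definitions 12 and 13, reusing the synonym/3, hyponym/4 and hypernym/4 relations of section 5.1.1 and taking v = 0.5 as in Table 4, might look as follows (a sketch; aggregate_all/3 is the SWI-Prolog aggregation predicate):

    dist(X, X, _, 0) :- !.
    dist(X, Y, C, 0) :- synonym(X, Y, C), !.
    dist(X, Y, C, D) :-
        aggregate_all(min(D0),
                      ( hyponym(X, Y, C, D0) ; hypernym(X, Y, C, D0) ),
                      D),
        !.
    dist(_, _, _, inf).                 % unrelated terms: infinite distance

    % closeness(x,y) = v^dist(x,y), with v = 0.5
    closeness(X, Y, C, V) :-
        dist(X, Y, C, D),
        (   D == inf
        ->  V = 0.0
        ;   V is 0.5 ** D
        ).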
Example: Table 4 shows the computation of the distance and closeness values between some single terms and collocations.

6.1.2 Closeness measure between simple noun phrases
An instance of a simple noun phrase frame 'x' consists of a head 'hx' (the main noun in the phrase) with a list 'Mx' of zero, one or more modifiers (adjectives, nouns, participles). Two assumptions are made in computing the conceptual distance between noun phrases:

• a list of modifiers specializes a head noun, and the length of the list gives the distance between the whole noun phrase and its head (Figure 7);
• the distance between noun phrases is the distance between their head nouns increased by the distance between the two lists of modifiers (Figure 8).

Definition 14 Conceptual distance between noun phrases

The function simple_np_dist(x,y) (Figure 8) computes the conceptual distance between two noun phrase frames 'x' and 'y' as the sum of the conceptual distance between the head nouns and the conceptual distance between the lists of modifiers:

    simple_np_dist(x, y) = dist(hx, hy) + ml_dist(Mx, My)

where:
    x, y = noun phrase instance frames
    hx, hy = the head nouns of frames x and y, respectively
    Mx, My = the lists of modifiers of frames x and y, respectively
Definition 15 Conceptual distance between modifiers

The matrix M_DIST(Mx, My), of size m × m, where m = max[length(Mx), length(My)], defines the conceptual distance between pairs of modifiers (m_xi, m_yj) from the lists Mx and My.
y                    dist   closeness (v = 0.5)   Lexical relationships in WordNet
'PC'                 0      1                     synonym('personal computer', 'PC')
'desktop computer'   1      0.5                   hyponym('personal computer', 'desktop computer', 1)
'laptop'             2      0.25                  hyponym('personal computer', 'laptop', 2)
'digital computer'   1      0.5                   hypernym('personal computer', 'digital computer', 1)
'computer'           2      0.25                  hypernym('personal computer', 'computer', 2)

Table 4 - The distance and closeness values between some single terms or collocations and the collocation 'personal computer'
[Figure 7 - A list of modifiers specializes a head noun:
    dist(Mx hx, hx) = dist(hx, Mx hx) = length(Mx)
e.g. dist('input file', 'file') = 1.]

[Figure 8 - The distance between two noun phrases is the distance between the heads increased by the distance between the lists of modifiers:
    simple_np_dist(x, y) = ml_dist(Mx, My) + dist(hx, hy).]
    m_dist(m_xi, m_yj) =
        min[dist(m_xi, m_yj), 2]   if 0 < i ≤ length(Mx) and 0 < j ≤ length(My)
        1                          otherwise
The conceptual distance between modifiers is computed by taking the minimum of the single-term distance between the modifiers, according to the measure in Definition 12, and a distance value of 2, in line with the assumption that a modifier specializes a head noun. As illustrated in Figure 9, a maximum distance of 2 between the lists mx1 hx and my1 hx is obtained by considering that my1 hx is a specialization of hx and that hx is a generalization of mx1 hx.
[Figure 9 - The distance between lists of modifiers. The matrix M_DIST(mx1 mx2, my1 my2) contains the pairwise values m_dist(m_xi, m_yj) = min[dist(m_xi, m_yj), 2]; when the lists have different lengths, as in M_DIST(mx1, my1 my2), the matrix is completed with 1-distance values. From the diagonals of the matrices:

    ml_dist(mx1 mx2, my1 my2) = min[m_dist(mx1, my1) + m_dist(mx2, my2), m_dist(mx1, my2) + m_dist(mx2, my1)]
    ml_dist(mx1, my1 my2) = min[m_dist(mx1, my1) + 1, m_dist(mx1, my2) + 1] ]
For instance:

    m_dist(computer, computer) = min(0, 2) = 0
    m_dist(PC, computer)       = min(1, 2) = 1
    m_dist(person, computer)   = min(∞, 2) = 2
In the case of lists of modifiers that have different lengths, the matrix is completed with distance values of '1', reflecting that each modifier in the longer list can specialize the other phrase. For instance, the '1' distance values in the matrix M_DIST(mx1, my1 my2) in Figure 9 show that either mx1 my1 hx or mx1 my2 hx should be considered as a specialization of mx1 hx.

Definition 16 - Conceptual distance between lists of modifiers
The function ml_dist(Mx,My) computes the conceptual distance between the lists of modifiers Mx = mx1 mx2 ... mxlength(Mx) and My = my1 my2 ... mylength(My). The distance is computed by considering all the sets built with pairs of different modifiers from Mx and My, and taking the minimum sum of the distances between the pairs of modifiers over all those sets¹. Figure 9 illustrates the computation of the conceptual distance between the lists of modifiers mx1 mx2 and my1 my2, as the minimum of the sums of the elements of the main and secondary diagonals of the associated distance matrix.

ml_dist(Mx,My) = min k=1,...,m [ Σ(i,j)∈DIAGr(k) m_dist(mxi,myj) , Σ(i,j)∈DIAGl(k) m_dist(mxi,myj) ]

where,
DIAGr(k) = { (i,j) | i = [(1+n) mod m] + 1, j = [(k+n) mod m] + 1, n = 1,...,m }
DIAGl(k) = { (i,j) | i = [(k+m-n) mod m] + 1, j = [(1+n) mod m] + 1, n = 1,...,m }
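As a cross-check of Definition 16, here is a sketch of ml_dist. Instead of enumerating the DIAG index sets explicitly, it minimizes over all one-to-one pairings of matrix rows with columns, which coincides with the diagonal enumeration for lists of up to three modifiers (the usual case); this equivalence is an assumption of the sketch, not a claim of the paper.

```python
from itertools import permutations

# A sketch of Definition 16's distance between lists of modifiers. Instead of
# enumerating the DIAG index sets, it minimizes over all one-to-one pairings
# of matrix rows and columns, which coincides with the diagonal enumeration
# for lists of up to three modifiers; unmatched positions (lists of different
# lengths) contribute a distance of 1, as in Figure 9. Modifier order is
# ignored, per the footnote to this definition.
def ml_dist(Mx: list, My: list) -> float:
    m = max(len(Mx), len(My))
    if m == 0:
        return 0.0
    def entry(i: int, j: int) -> float:
        # Entry (i, j) of M_DIST: 1 where a list had to be padded.
        if i < len(Mx) and j < len(My):
            return m_dist(Mx[i], My[j])
        return 1.0
    return min(sum(entry(i, j) for i, j in enumerate(p))
               for p in permutations(range(m)))

print(ml_dist(["PC"], ["computer"]))          # 1: hypernym at level 1
print(ml_dist(["input"], ["input", "text"]))  # 1: exact match plus one padded '1'
```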
Definition 17 - Closeness between simple noun phrases
The function simple_np_closeness(x,y) measures the closeness between two noun phrase frames x and y. The function is inversely proportional to the conceptual distance measure between noun phrase frames in Definition 14. Closeness is computed by either considering (0 < u < 1) or discarding (u = 1) the modifiers in the noun phrase, to evaluate the relative effect of modifiers on effectiveness results:

simple_np_closeness(x,y) = u^ml_dist(Mx,My) · closeness(hx,hy),   0 < u ≤ 1
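A direct transcription of Definition 17, reusing ml_dist() and closeness() from the sketches above; the hyponym tuple is the relationship listed for this example in Table 5.

```python
# A direct transcription of Definition 17, reusing ml_dist() and closeness()
# from the sketches above; u = 1 makes the modifier factor vanish, which is
# how the definition discards modifiers. The hyponym tuple is the
# relationship listed for this example in Table 5.
RELATIONS.append(("hyponym", "machine", "computer", 1))

def simple_np_closeness(hx, Mx, hy, My, u: float = 0.9) -> float:
    return (u ** ml_dist(Mx, My)) * closeness(hx, hy)

# 'machine program' vs. 'computer program': heads match exactly, modifiers
# are one hyponym level apart, giving 0.9 as in Table 5.
print(simple_np_closeness("program", ["machine"], "program", ["computer"]))
```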
6.1.3 Closeness measure between composite noun phrases
A composite noun phrase x = <mainx, complementx> is a noun phrase composed of a main simple noun phrase mainx and a complementary noun phrase complementx. For instance, 'retrieval of software artifacts' is a composite noun phrase composed of the main simple noun phrase 'retrieval' and the complementary simple noun phrase 'software artifacts'.

Definition 18 - Closeness between composite noun phrases
The closeness between a composite noun phrase frame x and a composite noun phrase frame y is computed as follows:

np_closeness(x,y) = max[simple_np_closeness(mainx,mainy),
                        w · simple_np_closeness(complementx,complementy),
                        w · simple_np_closeness(mainx,complementy),
                        w · simple_np_closeness(complementx,mainy)]
    if x and y are composite noun phrases

np_closeness(x,y) = max[simple_np_closeness(mainx,mainy),
                        w · simple_np_closeness(mainx,complementy)]
    if x is a simple noun phrase and y is a composite noun phrase

np_closeness(x,y) = max[simple_np_closeness(mainx,mainy),
                        w · simple_np_closeness(complementx,mainy)]
    if x is a composite noun phrase and y is a simple noun phrase

where,
w = the significance of a match with a complementary noun phrase (0 ≤ w ≤ 1).

1. Note that the order of modifiers in a list is not currently considered in the computation of the distance measure.
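A sketch of Definition 18's case analysis follows. A noun phrase is encoded here as a (main, complement) pair of (head, modifiers) tuples, with complement = None for a simple noun phrase; this tuple encoding is illustrative, not ROSA's frame representation.

```python
# A sketch of Definition 18's case analysis. A noun phrase is encoded as a
# (main, complement) pair of (head, modifiers) tuples, with complement = None
# for a simple noun phrase; this tuple encoding is illustrative, not ROSA's
# frame representation.
def np_closeness(x, y, w: float = 0.9) -> float:
    (main_x, comp_x), (main_y, comp_y) = x, y
    candidates = [simple_np_closeness(*main_x, *main_y)]
    if comp_y is not None:
        candidates.append(w * simple_np_closeness(*main_x, *comp_y))
    if comp_x is not None:
        candidates.append(w * simple_np_closeness(*comp_x, *main_y))
    if comp_x is not None and comp_y is not None:
        candidates.append(w * simple_np_closeness(*comp_x, *comp_y))
    return max(candidates)

# 'list of PC programs' vs. 'computer program' (Table 5), after lemmatization:
# the complement 'PC programs' matches the whole phrase 'computer program'.
x = (("list", []), ("program", ["PC"]))
y = (("program", ["computer"]), None)
print(np_closeness(x, y))  # 0.9 * 0.9 = 0.81 with u = 0.9, up to float rounding
```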
Example: Table 5 shows some examples of the closeness values between noun phrases.
7 Evaluation of retrieval effectiveness
Both the retrieval and the classification systems of ROSA have been implemented in BIM Prolog on a SUN platform. This section describes an experiment with the retrieval prototype. The experiment aims at evaluating retrieval effectiveness by measuring, among other things:
• the effectiveness of the proposed similarity measures for different parameter values and weighting factors;
• the influence on retrieval effectiveness of increasing/decreasing the maximum level of generalizations/specializations of terms considered for the matching process; and
• the influence on retrieval effectiveness of processing modifiers.
The experiment also aims at comparing the retrieval effectiveness of our approach with that of an available retrieval mechanism supporting, like ROSA, both queries in natural language and scoring of the retrieved material. Other experiments evaluating the effectiveness of the classification mechanism are introduced in [9].
7.1 Method
To evaluate the effectiveness of the retrieval mechanism of ROSA, the traditional information retrieval measures 'recall' and 'precision' were used. 'Recall' measures the proportion of relevant material that is retrieved, while 'precision' measures the proportion of the retrieved material that is relevant:

Recall = RelRet / Rel        (1)
Precision = RelRet / Ret     (2)

where,
RelRet = number of retrieved components that are relevant
Rel = number of relevant components in the knowledge base
Ret = number of retrieved components
Formulas (1) and (2) are not well suited to retrieval mechanisms that, like ROSA, produce a ranking of the retrieved material. Instead, the following normalized formulas from [24], which we have rewritten, were used in our experiments:

Recall = 1 - [ Σ i=1..Rel (Rank_i - i) ] / ( Rel · (N - Rel) )        (3)
Noun phrase           np_closeness               np_closeness               Lexical relationships in WordNet
                      u = 1, v = 0.9, w = 0.9    u = 0.9, v = 0.9, w = 0.9
computer program      1                          1
program               1                          0.9
machine program       1                          0.9                        hyponym('machine', 'computer', 1)
PC software           0.9                        0.81                       hypernym('PC', 'computer', 1), hyponym('software', 'program', 1)
list of PC programs   0.9                        0.81                       hypernym('PC', 'computer', 1)

Table 5 - The closeness value between some noun phrases and the noun phrase 'computer program'
Precision = 1 - log( Π i=1..Rel (Rank_i / i) ) / log C(N, Rel)        (4)
where,
N = number of components in the knowledge base
Rank_i = actual rank of the i-th relevant retrieved component
C(N, Rel) = the binomial coefficient, i.e. the number of ways of choosing Rel components among N
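For concreteness, here is a small Python sketch of formulas (1)-(4); math.comb supplies the binomial coefficient C(N, Rel), and the ranked examples anticipate the discussion below.

```python
import math

# A sketch of formulas (1)-(4); math.comb(N, Rel) is the binomial coefficient
# C(N, Rel). 'ranks' holds the 1-based positions of the relevant components
# in the ranked retrieval list.
def recall(rel_ret: int, rel: int) -> float:        # formula (1)
    return rel_ret / rel

def precision(rel_ret: int, ret: int) -> float:     # formula (2)
    return rel_ret / ret

def norm_recall(ranks: list, n: int) -> float:      # formula (3)
    rel = len(ranks)
    dev = sum(r - i for i, r in enumerate(sorted(ranks), start=1))
    return 1 - dev / (rel * (n - rel))

def norm_precision(ranks: list, n: int) -> float:   # formula (4)
    rel = len(ranks)
    log_prod = sum(math.log(r / i) for i, r in enumerate(sorted(ranks), start=1))
    return 1 - log_prod / math.log(math.comb(n, rel))

# The ideal ranking [1, 2, 3] yields 1.0 for both measures; the actual
# ranking [2, 5, 8] discussed below yields slightly lower values.
print(norm_recall([1, 2, 3], n=418), norm_precision([1, 2, 3], n=418))
print(norm_recall([2, 5, 8], n=418), norm_precision([2, 5, 8], n=418))
```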
Formulas (3) and (4) allow the evaluation of retrieval systems that provide a score of the retrieved material. The 'rank' variable describes the actual rank of the relevant retrieved material within the list of retrieved material. For instance, if the number of components in the knowledge base that are relevant to a given query (Rel) is 3, then the ideal ranking will be [1,2,3], where the relevant components appear in the first, second and third places respectively. In that case, both recall and precision will be one. The actual ranking will usually be different (e.g. it could be [2,5,8], indicating that the relevant components are ranked in the second, fifth and eighth places), with lower values of 'recall' and 'precision'.

The test collection consisted of 20 queries and a knowledge base containing the frame-based internal representation of 418 general purpose UNIX commands. The knowledge base was populated automatically through the classification mechanism described in section 5. Table 6 shows the set of test queries with their relevant command names and short descriptions. They were constructed by asking a user to describe, through an imperative sentence, the functionality of UNIX commands he uses frequently.

Different values of parameters/weighting factors were tested in order to observe their influence on retrieval effectiveness. We tested:
• the significance of the closeness value between single terms in the computation of the similarity measure, by increasing/decreasing the 'v' parameter (Definition 13);
• the significance of the complementary noun phrase case in the noun phrase similarity computation, by considering (w > 0) or discarding (w = 0) the case (Definition 18);
• the significance of modifiers in the noun phrase similarity computation, by considering (0 < u < 1) or discarding (u = 1) them (Definition 17).
Query                                  Relevant commands - Short descriptions
1  go to another directory            cd - change working directory
2  change the current directory       cd - change working directory
3  show directory                     pwd - display the pathname of the current directory
                                      ls - list the contents of a directory
4  give the files in a directory      ls - list the contents of a directory
5  print directory content            ls - list the contents of a directory
6  copy a file to another file        cp - copy files
                                      dd - convert and copy files with various data formats
7  create a new file from another file  cp - copy files
                                      dd - convert and copy files with various data formats
8  delete a file                      rm - remove files or directories
9  erase a directory                  rmdir - remove files or directories
10 erase all files in a directory     rm - remove files or directories
11 delete a file from the disk        rm - remove files or directories
12 stop the current processes         kill - send a signal to a process, or terminate a process
13 show the running processes         ps - display the status of current processes
                                      mps - display the status of current processes on MP system
14 list existing processes            ps - display the status of current processes
                                      mps - display the status of current processes on MP system
15 look for a string in a file        grep, egrep, fgrep - search a file for a string or regular expression
                                      strings - find printable strings in an object file or binary
16 locate a file on the disk          find - find files by name, or by other characteristics
                                      whereis - locate the binary, source, and manual page files for a command
17 find commands by keyword           man - display reference manual pages; find reference pages by keyword
                                      apropos - locate commands by keyword lookup
18 show current date                  date - display or set the date
19 create a new directory             mkdir - make a directory
20 show the number of words in a file wc - display a count of lines, words and characters

Table 6 - Test queries with their relevant command names and short descriptions
All semantic cases in a query were considered as having the same significance. Therefore, we assigned equal values to the associated weighting factors (wj). The values of recall and precision were calculated according to the ranking of the retrieved components, considering both the best and the worst place in the ranked list. A ranked list consists of a list of retrieved components ordered by their similarity values with respect to the user query.
Some components may have the same similarity value. For these components, whether a retrieved component appears before or after the others depends only on the order in which they were processed. We therefore compute the effectiveness values for the best place in the ranked list (the first one in the sublist of retrieved components having the same similarity value) and for the worst place (the last one of that sublist), as sketched below.

WAIS (Wide Area Information Server) of Thinking Machines Corporation was used for comparative purposes. The system scores retrieved material according to a weighting function whose description is not available. The values of recall and precision were computed according to the ranking of the retrieved components, but considering only the worst place of the components in the ranked list. The reason for discarding best positions in the ranked list was the lack of discriminating similarity values in the scored list, i.e. the same score is frequently assigned to a very large list of components. Thus, computing recall and precision for the best places in the scored list may produce results that do not reflect the actual effectiveness of the system.
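The following sketch illustrates the best-place/worst-place bookkeeping for tied similarity values; the data structures are illustrative, not ROSA's internal representation.

```python
# A sketch of the best-place/worst-place bookkeeping for tied similarity
# values: within a run of components sharing the same score, a relevant
# component is optimistically ranked first in the run and pessimistically
# ranked last. Data structures are illustrative, not ROSA's internal ones.
def best_and_worst_ranks(scores: list, relevant: set) -> dict:
    """scores: (component, similarity) pairs; returns component -> (best, worst)."""
    ordered = sorted(scores, key=lambda cs: cs[1], reverse=True)
    ranks, pos = {}, 0
    while pos < len(ordered):
        run = [c for c, s in ordered if s == ordered[pos][1]]  # tied components
        best, worst = pos + 1, pos + len(run)
        for c in run:
            if c in relevant:
                ranks[c] = (best, worst)
        pos += len(run)
    return ranks

print(best_and_worst_ranks([("cd", 0.9), ("pwd", 0.9), ("ls", 0.5)], {"pwd"}))
# {'pwd': (1, 2)}: 'pwd' ties with 'cd', so it ranks 1 at best and 2 at worst
```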
7.2 Analysis of results
Table 7 shows ROSA retrieval effectiveness results for different parameter values, computed as the average between the best and worst positions of retrieved components in the ranked list. High values of both precision and recall were obtained. The best results were obtained with high parameter values affecting all the closeness values between single terms, modifier matching and complementary noun phrase matching, and by either discarding generalizations/specializations of terms or allowing a maximum of one level of generalization/specialization in the computation of the distance between terms (row 1 in Table 7).

Decreasing the significance of the closeness value between single terms through the 'v' parameter (row 2 in Table 7) causes a small loss of 'precision' (-1.09%), without affecting the 'recall' value. Decreasing the value of the 'w' parameter affecting the complementary noun phrase case in the computation of the closeness between noun phrases (row 3 in Table 7) causes a small loss of both 'recall' (-1.00%) and 'precision' (-2.17%). Discarding the complementary noun phrase relationship (row 4 in Table 7) results in a considerable loss of both 'recall' (-11.00%) and 'precision' (-13.04%). Decreasing the significance of modifiers in the computation of the closeness value between noun phrases through the 'u' parameter (row 5 in Table 7) produces a small reduction of the precision value (-2.17%). Discarding modifiers in the computation of that measure (row 6 in Table 7) affects both 'precision' (-4.35%) and 'recall' (-1.00%).

Finally, 'recall' and 'precision' remain unchanged when either discarding generalization/specialization of terms or considering these relationships at a maximum distance of one in the generalization/specialization hierarchy (row 1 in Table 7). However, we observed an important loss of precision (-18.48%) when allowing up to two levels of generalization/specialization (row 7 in Table 7). The main reason for this behavior is that there is no treatment of word senses in the current version of the system. Therefore, looking for generalizations/specializations of terms may produce many generalizations/specializations that are irrelevant for the actual sense a term has. Note that 'recall' is also lower in row 7 than in row 1 (-3.00%), due to the fact that the 'recall' measure takes into account the ranking of relevant components, which can be impacted by the generalization/specialization level.
Row  Parameters                                         Recall Avg.  Loss of Recall  Precision Avg.  Loss of Precision
1    v = u = w = 0.90, Max. level gen./spec. = {0,1}    1.00         -               0.92            -
2    v = 0.50, u = w = 0.90, Max. level gen./spec. = 1  1.00         0.00 %          0.91            -1.09 %
3    v = u = 0.90, w = 0.50, Max. level gen./spec. = 1  0.99         -1.00 %         0.90            -2.17 %
4    v = u = 0.90, w = 0.00, Max. level gen./spec. = 1  0.89         -11.00 %        0.80            -13.04 %
5    v = w = 0.90, u = 0.50, Max. level gen./spec. = 1  1.00         0.00 %          0.90            -2.17 %
6    v = w = 0.90, u = 1.00, Max. level gen./spec. = 1  0.99         -1.00 %         0.88            -4.35 %
7    v = u = w = 0.90, Max. level gen./spec. = 2        0.97         -3.00 %         0.75            -18.48 %

Table 7 - ROSA effectiveness results for different parameter values
System                       Recall Avg.  Precision Avg.
ROSA                         0.99         0.89
WAIS                         0.73         0.47
Improvement ROSA over WAIS   +35.62 %     +89.36 %

Table 8 - WAIS and ROSA comparative results
Table 8 shows ROSA versus WAIS retrieval effectiveness results, computed by considering the worst place of retrieved components in the ranked list, for the same test collection as in the previous experiment. An important improvement in both precision (+89.36%) and recall (+35.62%) was observed.

The effectiveness results have shown the validity of the proposed similarity measures. Unlike other approaches to phrase-based indexing and retrieval, improvements in 'precision' values do not hurt 'recall'. High values of 'recall' are obtained by allowing partial matching of semantic cases and matching of complementary noun phrases. Precision is enhanced by increasing the significance of all the closeness values between single terms, the complementary noun phrase case and the modifier closeness in the computation of the similarity measure between phrases.

The results are very encouraging. However, retrieval effectiveness depends not only on the quality of the similarity measures but also on other factors. First, it depends greatly on the quality of component descriptions, i.e. on whether they describe the component with exhaustivity and specificity. Bad descriptions will necessarily produce poor retrieval results. Component descriptors in ROSA were produced from the short descriptions of UNIX commands. These descriptions provide considerable specificity but not exhaustivity; more descriptors for each component should be available in the knowledge base. Second, effectiveness also depends on the quality of the available dictionaries. The WordNet lexicon [19] has been extremely useful in our experiments. However, many word senses, as well as many semantic relationships between terms, were not found. Considering the large number of WordNet users, we can expect that future releases will be more complete.
8 Final remarks
An approach for a software reuse system based on the processing of the natural language descriptions of software components has been presented. The approach addresses three main problems of software reuse systems: precision, cost-effectiveness and user-friendliness. Precision is ensured by constructing accurate software indexing units from the lexical, syntactic and semantic information extracted from the descriptions of software components. Cost-effectiveness is obtained through the automatic indexing of software components. Processing of free-text user queries allows for a more user-friendly retrieval system.

Initial experimental results are encouraging. The classification approach is feasible and good retrieval effectiveness has been obtained. More experimentation is needed with larger corpora from diverse software collections (e.g. the Smalltalk library, the ASSET collection, etc.). The NLP approach presents clear cost advantages over either manual indexing approaches or knowledge-based systems where knowledge bases are constructed manually. On the other hand, the retrieval effectiveness of our approach (and, in general, of all approaches to automatic indexing) depends not only on the effectiveness of the classification mechanism and retrieval similarity measures; it also depends on the quality of the software descriptions that are available. Descriptions covering components with both specificity and exhaustivity are needed to ensure a reasonable retrieval effectiveness.

Further research is ongoing to increase the capabilities of the current system. Basically, we would like to include support for more complex software descriptions, increase the assistance to the selection and customization processes, and define mechanisms to acquire knowledge from reuse experiences. Selection assistance aims at helping users evaluate retrieved candidates and measure reuse effort, through standard information about the component (documentation, source code, executable examples, etc.) and through some measures of the flexibility of the component, such as frequency of reuse, previous customization effort, generality, modularity, code complexity, self-descriptiveness, quality of documentation, portability, reliability, reuse guidelines, etc.

As pointed out above, software artifacts are not always well documented, or their descriptions may not be appropriate from the point of view of different users. This can affect retrieval effectiveness. Users' experience with the reuse system could provide useful information to improve the descriptive power of the software library, either automatically or through a library administrator. A learning system would make it possible to increase retrieval effectiveness by improving the indexing structures of software components in the Frame Base through users' experiences with the retrieval mechanism. This mechanism would be based on user feedback following the retrieval process.
References

[1] R. Adams, An Experiment in Software Retrieval, in Proceedings of the 4th European Software Engineering Conference (ESEC'93), (I. Sommerville and M. Paul eds.), LNCS 717, Springer-Verlag, Garmisch-Partenkirchen, Germany, September 1993, pp. 380-396.
[2] B. Burton, R. Aragon, S. Bailey, K. Koehler and L. Mayes, The Reusable Software Library, IEEE Software, vol. 4, no. 4, 25-33 (1987).
[3] P. Devanbu, R. Brachman, P. Selfridge and B. Ballard, LaSSIE: A Knowledge-based Software Information System, CACM, vol. 34, no. 5, 34-49 (1991).
[4] M. Dillon and A. S. Gray, FASIT: A Fully Automatic Syntactically Based Indexing System, Journal of the American Society for Information Science, vol. 34, no. 2, 99-108 (1983).
[5] D. A. Evans, K. Ginther-Webster, M. Hart, R. G. Leferts and I. A. Monarch, Automatic Indexing Using Selective NLP and First-Order Thesauri, in Proceedings of RIAO'91, Universitat Autonoma de Barcelona, Barcelona, April 1991, pp. 624-643.
[6] C. Fillmore, The case for case, in Universals in Linguistic Theory, (E. Bach and R. Harms eds.), Rinehart and Winston, New York, 1968, pp. 1-88.
[7] J. L. Fagan, The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval, Journal of the American Society for Information Science, vol. 40, no. 2, 115-139 (1989).
[8] W. Frakes and T. Pole, Proteus: A Software Reuse Library System that Supports Multiple Representation Methods, SIGIR Forum, vol. 24, no. 3, 43-55 (1990).
[9] M. R. Girardi and B. Ibrahim, Automatic Indexing of Software Artifacts, in Proceedings of the Third International Conference on Software Reusability, IEEE Computer Society, Rio de Janeiro, Brazil, November 1-4, 1994. (To be published)
[10] M. R. Girardi and B. Ibrahim, A Similarity Measure for Retrieving Software Artifacts, in Proceedings of the Sixth International Conference on Software Engineering and Knowledge Engineering (SEKE'94), Knowledge Systems Institute, Jurmala, Latvia, June 21-23, 1994, pp. 478-485.
[11] M. R. Girardi and B. Ibrahim, A software reuse system based on natural language specifications, in Proceedings of the 5th International Conference on Computing and Information (ICCI'93), (O. Abou-Rabia, C. Chang and W. Koczkodaj eds.), IEEE Computer Society, Sudbury, Ontario, Canada, May 27-29, 1993, pp. 507-511.
[12] M. R. Girardi and B. Ibrahim, An approach to improve the effectiveness of software retrieval, in Proceedings of the 3rd Irvine Software Symposium (ISS'93), (D. J. Richardson and R. N. Taylor eds.), IRUS, University of California, Irvine, April 30, 1993, pp. 89-100.
[13] M. R. Girardi and B. Ibrahim, New Approaches for Reuse Systems, in Position Paper Collection of the Second International Workshop on Software Reuse, (E. Guerrieri ed.), Lucca, Italy, March 24-26, 1993.
[14] R. Helm and Y. S. Maarek, Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object-Oriented Class Libraries, in Proceedings of OOPSLA/ECOOP'91, ACM SIGPLAN Notices, vol. 26, no. 11, 47-61 (1991).
[15] P. S. Jacobs and L. F. Rau, SCISOR: Extracting Information from On-line News, CACM, vol. 33, no. 11, 88-97 (1990).
[16] R. Krovetz and W. B. Croft, Word Sense Disambiguation Using Machine-Readable Dictionaries, in Proceedings of the 12th International SIGIR Conference on Research and Development in Information Retrieval, (N. J. Belkin and W. B. Croft eds.), Boston, 1989, pp. 127-136.
[17] Y. Maarek, D. Berry and G. Kaiser, An Information Retrieval Approach For Automatically Constructing Software Libraries, IEEE Transactions on Software Engineering, vol. 17, no. 8, 800-813 (1991).
[18] M. L. Mauldin, Conceptual Information Retrieval - A Case Study in Adaptive Partial Parsing, Kluwer, Boston, 1991.
[19] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller, Five Papers on WordNet, CSL Report 43, Princeton University, July 1990.
[20] J. Nie, F. Paradis and J. Vaucher, Using Information Retrieval for Software Reuse, in Proceedings of the 5th International Conference on Computing and Information (ICCI'93), (O. Abou-Rabia, C. Chang and W. Koczkodaj eds.), IEEE Computer Society, Sudbury, Ontario, Canada, May 27-29, 1993, pp. 448-452.
[21] R. Prieto-Diaz, Implementing Faceted Classification for Software Reuse, CACM, vol. 34, no. 5, 89-97 (1991).
[22] G. Salton and M. Smith, On the Application of Syntactic Methodologies in Automatic Text Analysis, in Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'89), Cambridge, Massachusetts, June 25-28, 1989, pp. 137-150, Special Issue of SIGIR Forum.
[23] G. Salton, Automatic Text Processing - The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, Reading, 1989.
[24] G. Salton, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983, p. 181.
[25] R. C. Schank, Conceptual information processing, North-Holland, New York, 1975.
[26] G. Sindre and E. Karlsson, A Method for Software Reuse through Large Component Libraries, in Proceedings of the 5th International Conference on Computing and Information (ICCI'93), (O. Abou-Rabia, C. Chang and W. Koczkodaj eds.), IEEE Computer Society, Sudbury, Ontario, Canada, May 27-29, 1993, pp. 464-468.
[27] A. Smeaton, Using Morpho-Syntactic Language Analysis in Phrase Matching, in Proceedings of RIAO'91, Universitat Autonoma de Barcelona, Barcelona, April 1991, pp. 415-430.
[28] W. Tichy, R. Adams and L. Holter, NLH/E: A natural language help system, in Proceedings of the 11th ICSE, Pittsburgh, May 1989, pp. 364-374.
[29] P. H. Winston, Artificial Intelligence, Addison-Wesley, Reading, 1984.
[30] T. Winograd, Language as a Cognitive Process. Volume I: Syntax, Addison-Wesley, Reading, 1983.
[31] M. Wood and I. Sommerville, An Information Retrieval System for Software Components, SIGIR Forum, vol. 22, no. 3-4, 1988.
[32] W. Woods, What's in a link: foundations for Semantic Networks, in Readings in Knowledge Representation, (J. Brachman and H. Levesque eds.), Morgan Kaufmann, 1985.