Team 2

TRACEABILITY WITH LSI

Traceability
• The ability to link between different artifacts. Example artifacts: code, user manuals, design documentation, development wikis, etc.
• In particular, link code to relevant requirements, sections in design documents, test cases, and other structured and free-text artifacts.
• Also, link from requirements, design documents, etc. back to code.

A Comparison of Traceability Techniques for Specifications


What's Traceability Good For?

• Program comprehension: which code segment implements which specific requirement, and vice versa.
• Impact analysis: keeping non-code artifacts up to date.
• Requirement tracing: discovering what code needs to change to handle a new requirement, and determining whether a specification is completely implemented and covered by tests.

Challenges

• Scalability: a large number of artifacts.
• Heterogeneity: a large number of different document formats and programming languages.
• Noise: free-text information (natural language) contains conjunctions, prepositions, abbreviations, etc.; some information may be outdated, or just plain wrong.

Prior work:
• Recovering Traceability Links in Software Artifact Management Systems using information retrieval methods [Lucia et al., 2007]
• Recovering Traceability Links between Code and Documentation [Antoniol et al., 2002; Deerwester et al., 1990; Marcus and Maletic, 2003]

Example

/* The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}

Traceability Link Recovery

Find the basic parts of method identifiers (e.g., separating Auto_due into Auto and due).

[Pipeline diagram: the Source Code Component feeds Identifiers Extraction → Identifiers Separation → Text Normalization → INDEXER, producing a list of identifiers; the document path feeds Query Extraction → Document Classifier, producing a scored document list.]

Text Normalization (for software documents):
• Letter transformation: capital to small letters.
• Stop-word removal: articles, punctuation, etc.
• Morphological analysis: plural to singular, infinitive, etc.

Pre-processing

Before text preprocessing:

  Copyright owners grant member companies of the OMG permission to make a limited …

After text preprocessing:

  copyright owner grant member compani omg permiss make limit …

• Lower-casing, stop-word removal, number removal, stemming, etc.
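The preprocessing step above can be sketched in Python. The stop-word list and the suffix-stripping stemmer below are simplified stand-ins of my own; a real pipeline would use a full stop list and a Porter stemmer, which produces the stemmed forms shown on the slide (compani, permiss, limit).

```python
import re

# Toy stop-word list (a real pipeline uses a much longer one).
STOPWORDS = {"the", "of", "a", "to", "and", "in", "is", "that"}

def crude_stem(word):
    # Strip a few common English suffixes; a stand-in for a Porter stemmer.
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lower-case, keep only alphabetic tokens (drops numbers and punctuation),
    # remove stop words, then stem.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Copyright owners grant member companies of the OMG permission"))
```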

Words Extraction

/* The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;

    /** Creates a new File… */
    public FileImpl(String nativePath ...) { … }

    /** … */
    private String f(..) { … }
}

Words are extracted from:
• Class names
• Public function names
• Public function arguments and return types
• Comments

Words Expansion

Before: … NativePath, fileName, …
After:  … NativePath, Native, Path, fileName, File, Name, delete_all_elements, …

• Use well-known coding standards (camelCase, underscores) to split identifiers into sub-words.
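The expansion step can be sketched with a regex-based splitter; it assumes identifiers follow common camelCase and snake_case conventions, and keeps the original identifier alongside its sub-words as in the slide's example.

```python
import re

def expand(identifier):
    # First split on underscores (snake_case), then split each part on
    # camelCase boundaries; keep the original identifier as well.
    words = []
    for part in identifier.split("_"):
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [identifier] + words

print(expand("nativePath"))           # ['nativePath', 'native', 'Path']
print(expand("delete_all_elements"))  # ['delete_all_elements', 'delete', 'all', 'elements']
```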

Vector Space Model

Vector Space Model (VSM) [Salton et al., 1975]
• Each document d is represented by a vector of ranks of the terms in the vocabulary:
  vd = [rd(w1), rd(w2), …, rd(w|V|)]
• The query is similarly represented by a vector.
• The similarity between the query and a document is the cosine of the angle between their respective vectors.
• VSM assumes terms are independent, but:
  • some terms are likely to appear together;
  • terms can have different meanings depending on context.
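The VSM ranking described above can be sketched with raw term-frequency vectors and cosine similarity (a hypothetical two-document corpus; real systems typically use tf-idf weights rather than raw counts):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {"d1": "human machine interface", "d2": "graph minors survey"}
query = "human interface"

qvec = Counter(query.split())
scores = {d: cosine(qvec, Counter(text.split())) for d, text in docs.items()}
print(max(scores, key=scores.get))  # d1: it shares two terms with the query
```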

Vector Space Model

Classic IR might lead to poor retrieval because:
• Unrelated documents might be included in the answer set.
• Relevant documents that do not contain at least one index term are not retrieved.
• Reasoning: retrieval based on index terms is vague and noisy.

The term-document matrix has a very high dimensionality: are there really that many important features for each document and term?

Example  from

Lillian Lee

auto engine bonnet tires lorry boot

car emissions hood make model trunk

make hidden Markov model emissions normalize

Synonymy

Polysemy

Will have small cosine

Will have large cosine

but are related

but not truly related

Latent Semantic Indexing

• The user's information need is more related to concepts and ideas than to index terms.
• A document that shares concepts with another document known to be relevant might be of interest.
• LSI [Deerwester et al., 1990]:
  • enhances the semantics of long descriptions;
  • dimensionality reduction can improve effectiveness;
  • reduction can find surprising relationships!
• The key idea is to map documents and queries into a lower-dimensional space (i.e., one composed of higher-level concepts, which are fewer in number than the index terms).
• Retrieval in this reduced concept space might be superior to retrieval in the space of index terms.

Lecture 12, Information Retrieval

Latent Semantic Indexing

Singular Value Decomposition (SVD)
• A unique mathematical decomposition of a matrix into the product of three matrices:
  • two with orthonormal columns,
  • one with singular values on the diagonal.
• A tool for dimension reduction: finds the optimal projection into a low-dimensional space.

Singular Value Decomposition

Compute the singular value decomposition of the t × d term-document matrix W:

  W(t×d) = T(t×r) · S(r×r) · Dᵀ(r×d)

• D: a representation of the documents in r dimensions.
• T: a matrix for transforming new documents.
• S: the diagonal matrix of singular values; gives the relative importance of the dimensions.
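The decomposition can be sketched with NumPy on a toy term-document matrix (my own example values, not from the slides):

```python
import numpy as np

# Toy t x d term-document count matrix (rows = terms, columns = documents).
M = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=float)

# M = T @ diag(s) @ Dt, with s the singular values in decreasing order.
T, s, Dt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(T @ np.diag(s) @ Dt, M)   # exact reconstruction

k = 2                                        # keep the top-k concepts
M_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]  # rank-k approximation of M

# Document coordinates in the reduced concept space.
doc_coords = np.diag(s[:k]) @ Dt[:k, :]
print(doc_coords.shape)                      # (2, 3): 3 documents, 2 concepts
```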

LSI Term Matrix T

• The T matrix gives a vector for each term in LSI space.
• Multiply it by a new document vector to "fold in" new documents into LSI space.
• LSI is a rotation of the term space:
  • In the original matrix, terms are d-dimensional.
  • The new space has lower dimensionality.
  • Dimensions are groups of terms that tend to co-occur in the same documents: synonyms, contextually related words, variant endings.

Document matrix D D

matrix

coordinates of documents in LSI space same dimensionality as T vectors can compute the similarity between a term

and a document

http://lsi.research.telcordia.com/

Lecture 12

Information Retrieval

22

Dimension Reduction


Improved Retrieval with LSI

• New documents and queries are "folded in": multiply the vector by T.
• Compute similarity for ranking as in VSM: compare queries and documents by dot product.
• Improvements come from reduction of noise:
  • No need to stem terms (variants will co-occur).
  • No need for a stop list: stop words are used uniformly throughout the collection, so they tend to appear in the first dimension.
• No speed or space gains, though…
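Fold-in and dot-product ranking can be sketched as follows; note that alongside multiplying by T, the standard fold-in formula also rescales by the inverse singular values, q_k = S_k^{-1} T_k^T q, which is what this sketch uses:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
M = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)

T, s, Dt = np.linalg.svd(M, full_matrices=False)
k = 2
Tk, Sk_inv, Dk = T[:, :k], np.diag(1.0 / s[:k]), Dt[:k, :]

q = np.array([1.0, 1.0, 0.0, 0.0])  # query containing terms 0 and 1
q_k = Sk_inv @ Tk.T @ q             # fold the query into the k-concept space

scores = q_k @ Dk                   # dot-product similarity per document
print(scores.shape)                 # one score per document: (4,)
```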

Computing an Example: Technical Memo Titles

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Computing an Example

[The matrices M, K, S, and D, and the rank-2 reconstruction (new M), are shown as figures on the slides.]

From the raw matrix M:
r(human, user) = -.38
r(human, minors) = -.29

From the rank-2 reconstruction of M:
r(human, user) = .94
r(human, minors) = -.83
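The correlations above come from the classic Deerwester et al. (1990) example. The 12 × 9 count matrix below is my reconstruction of that standard example and may differ in detail, but the qualitative effect (the human-user correlation turning strongly positive after rank-2 reduction) holds:

```python
import numpy as np

# Rows: human, interface, computer, user, system, response, time, EPS,
#       survey, trees, graph, minors. Columns: c1-c5, m1-m4.
M = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def term_corr(A, i, j):
    # Pearson correlation between the row profiles of terms i and j.
    return float(np.corrcoef(A[i], A[j])[0, 1])

T, s, Dt = np.linalg.svd(M, full_matrices=False)
M2 = T[:, :2] @ np.diag(s[:2]) @ Dt[:2, :]   # rank-2 reconstruction

human, user = 0, 3
print(term_corr(M, human, user), term_corr(M2, human, user))
```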

Incremental LSI  Allows

the fast and low-cost computation of traceability links by using the results from previous LSI computation.  Avoids the full cost of LSI computation for TLR by analyzing the changes to documentation and source code in different versions of the system,  and then derive the changes to the set of documentation-to-source code traceability links.

Evaluation

Experiment on LEDA
• Trace links from the manual to the source code.
• Find out which parts of the source code are described by a given manual section.

IR Quality Measures  Precision

 Recall

@ n:

@ n:

 Average

precision:

A Comparison of Traceability Techniques for Specifications

34
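The three measures can be sketched directly; `retrieved` is assumed to be a ranked list of document ids and `relevant` the set of correct answers (hypothetical names):

```python
def precision_at(n, retrieved, relevant):
    # Fraction of the top-n results that are relevant.
    return sum(1 for d in retrieved[:n] if d in relevant) / n

def recall_at(n, retrieved, relevant):
    # Fraction of all relevant documents found in the top n.
    return sum(1 for d in retrieved[:n] if d in relevant) / len(relevant)

def average_precision(retrieved, relevant):
    # Mean of precision values at each rank where a relevant doc appears.
    hits, total = 0, 0.0
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

print(average_precision(["a", "x", "b"], {"a", "b"}))  # (1/1 + 2/3) / 2 ≈ 0.833
```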

Experiment of LEDA

[Results figures omitted.]

Figure 2. Time cost with iLSI, threshold = 0.7

Conclusions  Latent

semantic indexing provides an interesting conceptualization of the IR problem  It allows reducing the complexity of the underline representational framework which might be explored, for instance, with the purpose of interfacing with the user

Conclusions  Traceability

between code and documentation in real world systems is effective via IR techniques.  For realistic datasets the Vector Space Model, which did not perform dimensionality reduction where shown to be the most effective.

A Comparison of Traceability Techniques for Specifications

40

References

• Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang. "Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution Management." ASE 2008: 59-68.
• G. Antoniol, G. Canfora, G. Casazza, A. D. Lucia, and E. Merlo. "Recovering Traceability Links Between Code and Documentation." IEEE Trans. Softw. Eng., 28(10):970-983, 2002.
• Dumais, S. T., Furnas, G. W., Landauer, T. K., and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.
• Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the Society for Information Science, 41(6), 391-407.
• Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.
