Team 2
TRACEABILITY WITH LSI
Traceability
The ability to link between different artifacts. Example artifacts: code, user manuals, design documentation, development wikis, etc.
In particular, link code to:
• Relevant requirements
• Sections in design documents
• Test cases
• Other structured and free-text artifacts
Also, link from requirements, design documents, etc. back to code.
What's Traceability Good For?
Program comprehension
• Which code segment implements which specific requirement, and vice versa
Impact analysis
• Keeping non-code artifacts up to date
Requirement tracing
• Discover what code needs to change to handle a new requirement
• Aid in determining whether a specification is completely implemented and covered by tests
Challenges
• Scalability: large number of artifacts
• Heterogeneity: large number of different document formats and programming languages
• Noise: free-text information (natural language) contains conjunctions, prepositions, abbreviations, etc.; some information may be outdated, or just plain wrong
Prior work:
• Recovering traceability links in software artifact management systems using information retrieval methods [Lucia et al., 2007]
• Recovering traceability links between code and documentation [Antoniol et al., 2002; Deerwester et al., 1990; Marcus and Maletic, 2003]
Example
/* * The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;
    /** * Creates a new File… */
    public FileImpl(String nativePath ...) { … }
    /** * … */
    private String f(..) { … }
}
Traceability Link Recovery
• Find the basic parts of method identifiers (e.g., split Auto_due into Auto and due)
[Figure: the recovery pipeline. Source-code side: source code component, code path, identifiers extraction, identifiers separation, text normalization, list of identifiers. Document side: document path, query extraction, INDEXER, document classifier, scored document list.]
Text Normalization
[Figure: normalization of software documents (identified by their document path) before they reach the INDEXER.]
• Letter transformation: capital to small letters
• Stopwords removal: articles, punctuation, etc.
• Morphological analysis: plural to singular, infinitive forms, etc.
Pre-processing
Text preprocessing example:
Before: "Copyright owners grant member companies of the OMG permission to make a limited …"
After:  "copyright owner grant member compani omg permiss make limit …"
• Lower-casing, stop-word removal, number removal, stemming, etc.
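A minimal preprocessing sketch in Python, assuming NLTK's Porter stemmer is available; the stop-word list here is an illustrative subset, not the one used in the original work:

    import re
    from nltk.stem import PorterStemmer   # assumed dependency; any Porter implementation works

    STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "are"}  # illustrative subset
    stemmer = PorterStemmer()

    def preprocess(text):
        """Lower-case, keep alphabetic tokens only, remove stop words, and stem."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

    print(" ".join(preprocess(
        "Copyright owners grant member companies of the OMG permission to make a limited")))
    # -> copyright owner grant member compani omg permiss make limit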
Words Extraction
/* * The File interface provides… */
public class FileImpl extends FilePOA {
    private String nativefileName;
    /** * Creates a new File… */
    public FileImpl(String nativePath ...) { … }
    /** * … */
    private String f(..) { … }
}
Words are extracted from:
• Class name
• Public function names
• Public function arguments and return types
• Comments
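As a rough, regex-based illustration (not the extractor used in the paper), the sketch below pulls the class name, public method/constructor signatures with their arguments, and comment text from a Java source string; the source string is a simplified version of the slide's example:

    import re

    JAVA_SRC = """
    /* The File interface provides... */
    public class FileImpl extends FilePOA {
        private String nativefileName;
        /** Creates a new File... */
        public FileImpl(String nativePath) { }
    }
    """

    def extract_words(src):
        """Collect class names, public method/constructor names, arguments, return types, comments."""
        words = []
        words += re.findall(r"\bclass\s+(\w+)", src)                      # class names
        for m in re.finditer(r"\bpublic\s+([\w<>\[\]]+\s+)?(\w+)\s*\(([^)]*)\)", src):
            words.append(m.group(2))                                      # method/constructor name
            words += re.findall(r"\w+", m.group(3))                       # argument types and names
            if m.group(1):
                words.append(m.group(1).strip())                          # return type, if any
        for comment in re.findall(r"/\*.*?\*/", src, flags=re.S):         # block and Javadoc comments
            words += re.findall(r"[A-Za-z]+", comment)
        return words

    print(extract_words(JAVA_SRC))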
Words Expansion
Before: … NativePath, fileName, …
After:  NativePath, Native, Path, fileName, File, Name, delete_all_elements, …
• Split identifiers into sub-words using well-known coding conventions (camel case, underscores)
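A sketch of such splitting, assuming camel-case and underscore conventions:

    import re

    def split_identifier(identifier):
        """Split an identifier into sub-words by underscores and camel-case boundaries."""
        parts = []
        for chunk in identifier.split("_"):
            parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", chunk)
        return parts

    for ident in ["NativePath", "fileName", "delete_all_elements"]:
        print(ident, "->", split_identifier(ident))
    # NativePath -> ['Native', 'Path']
    # fileName -> ['file', 'Name']
    # delete_all_elements -> ['delete', 'all', 'elements']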
Vector Space Model
Vector Space Model (VSM) [Salton et al., 1975]
• Each document d is represented by a vector of ranks of the terms in the vocabulary: v_d = [r_d(w_1), r_d(w_2), …, r_d(w_|V|)]
• The query is similarly represented by a vector
• The similarity between the query and a document is the cosine of the angle between their respective vectors
Limitations:
• Assumes terms are independent, yet some terms are likely to appear together
• Terms can have different meanings depending on context
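A minimal VSM sketch using raw term counts and cosine similarity; the vocabulary and texts are illustrative, and the actual weighting scheme used in the compared systems (e.g. tf-idf) may differ:

    import numpy as np

    def term_vector(tokens, vocabulary):
        """Raw term-frequency vector of a token list over a fixed vocabulary."""
        return np.array([tokens.count(w) for w in vocabulary], dtype=float)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    vocabulary = ["file", "interface", "create", "delete", "path"]
    doc   = "the file interface create a new file from a path".split()
    query = "delete a file by path".split()

    v_doc, v_query = term_vector(doc, vocabulary), term_vector(query, vocabulary)
    print(cosine(v_query, v_doc))   # similarity score used to rank documents against the query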
Vector Space Model
Classic IR might lead to poor retrieval because:
• Unrelated documents might be included in the answer set
• Relevant documents that do not contain at least one index term are not retrieved
• Reasoning: retrieval based on index terms is vague and noisy
The term-document matrix has a very high dimensionality: are there really that many important features for each document and term?
Example (from Lillian Lee)
• Synonymy: "auto engine bonnet tires lorry boot" vs. "car emissions hood make model trunk" — the two documents will have a small cosine, but they are related.
• Polysemy: "car emissions hood make model trunk" vs. "make hidden Markov model emissions normalize" — the two documents will have a large cosine, but they are not truly related.
Latent Semantic Indexing
Latent Semantic Indexing
• The user's information need is more related to concepts and ideas than to index terms
• A document that shares concepts with another document known to be relevant might be of interest
• LSI [Deerwester et al., 1990] enhances the semantics of long descriptions: dimensionality reduction can improve effectiveness, and can find surprising relationships
• The key idea is to map documents and queries into a lower-dimensional space (i.e., composed of higher-level concepts, which are fewer in number than the index terms)
• Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
Latent Semantic Indexing
Singular Value Decomposition:
• A unique mathematical decomposition of a matrix into the product of three matrices: two with orthonormal columns, and one with singular values on the diagonal
• A tool for dimension reduction: it finds the optimal projection into a low-dimensional space
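A small sketch of the decomposition with NumPy, on an illustrative toy matrix (not data from the paper):

    import numpy as np

    # A toy 4-term x 3-document count matrix.
    M = np.array([[1., 0., 1.],
                  [1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])

    T, s, Dt = np.linalg.svd(M, full_matrices=False)   # M = T @ diag(s) @ Dt
    print(np.allclose(M, T @ np.diag(s) @ Dt))         # True: exact reconstruction
    print(np.allclose(T.T @ T, np.eye(3)))             # True: T has orthonormal columns
    print(s)                                            # singular values, in decreasing order

    # Rank-k truncation keeps only the k largest singular values (dimension reduction).
    k = 2
    M_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]         # best rank-k approximation of M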
Singular Value Decomposition
Compute the singular value decomposition of the term-document matrix M (entries w_td, t terms × d documents):
M = T S D^T, where T is t × r, S is r × r (diagonal), and D^T is r × d
• D is a representation of M's documents in r dimensions
• T is a matrix for transforming new documents into the same space
• The diagonal of S gives the relative importance of the dimensions
LSI Term Matrix T
• The T matrix gives a vector for each term in LSI space
• Multiply by a new document vector to "fold in" new documents into LSI space
• LSI is a rotation of the term space
  ○ In the original matrix, terms are d-dimensional
  ○ The new space has lower dimensionality
  ○ Its dimensions are groups of terms that tend to co-occur in the same documents: synonyms, contextually-related words, variant endings
Document Matrix D
• The D matrix gives the coordinates of the documents in LSI space
• Document vectors have the same dimensionality as the T vectors
• One can compute the similarity between a term and a document
http://lsi.research.telcordia.com/
Dimension Reduction
Improved Retrieval with LSI
• New documents and queries are "folded in": multiply the vector by T
• Compute similarity for ranking as in VSM: compare queries and documents by dot product
• Improvements come from reduction of noise
  ○ No need to stem terms (variants will co-occur)
  ○ No need for a stop list: stop words are used uniformly throughout the collection, so they tend to appear in the first dimension
• No speed or space gains, though…
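A sketch of folding a query into LSI space and ranking documents, continuing the toy NumPy example above. The fold-in uses the common textbook form q_k = S_k^{-1} T_k^T q, a detail the slide summarizes as "multiply by T":

    import numpy as np

    def lsi_fit(M, k):
        """Rank-k LSI model of a term-document matrix M."""
        T, s, Dt = np.linalg.svd(M, full_matrices=False)
        return T[:, :k], s[:k], Dt[:k, :].T        # term matrix, singular values, doc coordinates

    def fold_in(q, T_k, s_k):
        """Map a term-count vector q (query or new document) into the k-dimensional LSI space."""
        return (T_k.T @ q) / s_k

    def rank(query_vec, doc_coords):
        """Rank documents by cosine similarity to the folded-in query."""
        sims = doc_coords @ query_vec / (
            np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(-sims), sims

    # Toy 4-term x 3-document matrix and a one-term query.
    M = np.array([[1., 0., 1.],
                  [1., 1., 0.],
                  [0., 1., 1.],
                  [0., 0., 1.]])
    T_k, s_k, D_k = lsi_fit(M, k=2)
    q = np.array([1., 0., 0., 0.])                  # query containing only the first term
    order, sims = rank(fold_in(q, T_k, s_k), D_k)
    print(order, sims)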
Computing an Example: Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Computing an Example
[Figures: the term-document matrix M built from the titles; its SVD factors K, S, and D; and the reduced-rank reconstruction "new M".]
• In the original matrix M: r(human, user) = -0.38, r(human, minors) = -0.29
• In the reduced-rank reconstruction (new M): r(human, user) = 0.94, r(human, minors) = -0.83
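A sketch rebuilding this classic example (from Deerwester et al., 1990) with NumPy, assuming the usual choice of index terms (content words occurring in at least two titles) and a rank-2 reduction; it reproduces correlations close to the figures above:

    import numpy as np

    # The nine technical memo titles from the slide (c1-c5: human-computer interaction, m1-m4: graphs).
    titles = [
        "Human machine interface for ABC computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "System and human system engineering testing of EPS",
        "Relation of user perceived response time to error measurement",
        "The generation of random, binary, ordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV: Widths of trees and well-quasi-ordering",
        "Graph minors: A survey",
    ]
    # Index terms of the classic example: content words that occur in at least two titles.
    terms = ["human", "interface", "computer", "user", "system", "response",
             "time", "eps", "survey", "trees", "graph", "minors"]

    # Term-document matrix M (12 terms x 9 documents) of raw occurrence counts.
    M = np.array([[t.lower().count(term) for t in titles] for term in terms], dtype=float)

    def row_corr(A, t1, t2):
        """Pearson correlation between the rows of A for terms t1 and t2."""
        return np.corrcoef(A[terms.index(t1)], A[terms.index(t2)])[0, 1]

    print("M:     r(human,user) = %.2f  r(human,minors) = %.2f"
          % (row_corr(M, "human", "user"), row_corr(M, "human", "minors")))

    # SVD and rank-2 reconstruction ("new M").
    T, s, Dt = np.linalg.svd(M, full_matrices=False)
    M2 = T[:, :2] @ np.diag(s[:2]) @ Dt[:2, :]
    print("new M: r(human,user) = %.2f  r(human,minors) = %.2f"
          % (row_corr(M2, "human", "user"), row_corr(M2, "human", "minors")))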
Incremental LSI
• Allows fast, low-cost computation of traceability links by reusing the results of the previous LSI computation
• Avoids the full cost of an LSI computation for traceability link recovery by analyzing the changes to documentation and source code between versions of the system, and then deriving the changes to the set of documentation-to-source-code traceability links
Evaluation
Experiment on LEDA
• Trace links from the manual to the source code: find out which parts of the source code are described by a given manual section
IR Quality Measures
• Precision @ n: the fraction of the top n retrieved links that are correct, i.e. |relevant ∩ top n| / n
• Recall @ n: the fraction of all correct links that appear in the top n, i.e. |relevant ∩ top n| / |relevant|
• Average precision: the mean of precision @ n taken at each rank n where a correct link is retrieved
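A small sketch of these measures, assuming a ranked list of candidate links and a set of known-correct links; the link names are hypothetical:

    def precision_at(ranked, relevant, n):
        """Fraction of the top-n ranked links that are relevant."""
        return len([x for x in ranked[:n] if x in relevant]) / n

    def recall_at(ranked, relevant, n):
        """Fraction of all relevant links retrieved within the top n."""
        return len([x for x in ranked[:n] if x in relevant]) / len(relevant)

    def average_precision(ranked, relevant):
        """Mean of precision@n over the ranks n at which a relevant link appears."""
        hits, precisions = 0, []
        for n, link in enumerate(ranked, start=1):
            if link in relevant:
                hits += 1
                precisions.append(hits / n)
        return sum(precisions) / len(relevant) if relevant else 0.0

    ranked = ["sec3->file.c", "sec3->list.c", "sec3->graph.c", "sec3->io.c"]   # hypothetical links
    relevant = {"sec3->file.c", "sec3->io.c"}
    print(precision_at(ranked, relevant, 2), recall_at(ranked, relevant, 2),
          average_precision(ranked, relevant))
    # 0.5 0.5 0.75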
Experiment of LEDA
[Results figures; Figure 2: Time cost with iLSI, threshold = 0.7]
Conclusions
• Latent semantic indexing provides an interesting conceptualization of the IR problem
• It reduces the complexity of the underlying representational framework, which might be explored, for instance, for the purpose of interfacing with the user
Conclusions
• Traceability between code and documentation in real-world systems can be recovered effectively via IR techniques
• For realistic datasets, the Vector Space Model, which does not perform dimensionality reduction, was shown to be the most effective
References
• Hsinyi Jiang, Tien N. Nguyen, Ing-Xiang Chen, Hojun Jaygarl, Carl K. Chang. "Incremental Latent Semantic Indexing for Automatic Traceability Link Evolution Management." ASE 2008, 59-68.
• G. Antoniol, G. Canfora, G. Casazza, A. D. Lucia, and E. Merlo. "Recovering Traceability Links Between Code and Documentation." IEEE Trans. Softw. Eng., 28(10):970-983, 2002.
• Dumais, S. T., Furnas, G. W., Landauer, T. K., and Deerwester, S. (1988). "Using Latent Semantic Analysis to Improve Information Retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285.
• Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). "Indexing by Latent Semantic Analysis." Journal of the Society for Information Science, 41(6), 391-407.
• Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.