Relevance Ranking and Relevance Feedback

Carl Staelin

Motivation - Feast or famine

• Queries return either too few or too many results
• Users are generally looking for the best document with a particular piece of information
• Users don’t want to look through hundreds of documents to locate the information

⇒ Rank documents according to expected relevance!

Lecture 3

Information Retrieval and Digital Libraries

Page 2

Model

• Can we get user feedback?
    • Document score is influenced by similarity to previously user-rated documents
• Can we utilize external information?
    • E.g., how many other documents reference this document?

Queries

• Most queries are short
    • One to three words
• Many queries are ambiguous
    • “Saturn”
        • Saturn the planet?
        • Saturn the car?

Sample “Internal” Features

• Term frequency
• Location of term appearance in document
• Capitalization
• Font
• Relative font size
• Bold, italic, …
• Appearance in <title>, <meta>, <h?>, … tags
• Co-location with other (relevant) words

Sample “External” Features

• Document citations
    • How often is it cited?
    • How important are the documents that cited it?
    • Relevance of text surrounding hyperlink
    • Relevance of documents citing this document
• Location within website (e.g., height in the directory structure or click distance from /)
• Popularity of pages from similar queries
    • Search engine links often connect through the search site so they can track click-throughs
• Your idea here…

Problem

Given all these features, how do we rank the search results?

• Often use similarity between query and document
• May use other factors to weight ranking score, such as citation ranking
• May use an iterative search which ranks documents according to similarity/dissimilarity to query and previously marked relevant/irrelevant documents

Relevance Feedback

Often an information retrieval system does not return useful information on the first try!
• If at first you don’t succeed, try, try again
• Find out from the user which results were most relevant, and try to find more documents like them and fewer like the others
• Assumption: relevant documents are somehow similar to each other and different from irrelevant documents
• Question: how?

Relevance Feedback Methods

• Two general approaches:
    • Create new queries with user feedback
    • Create new queries automatically
• Re-compute document weights with new information
• Expand or modify the query to more accurately reflect the user’s desires

Vector Space Re-Weighting

• Given a query $Q$ with its query vector $q$
• Initial, user-annotated results $D$
    • $D_r$ = relevant, retrieved documents
    • $D_n$ = irrelevant, retrieved documents
    • $d_i$ are the document weight vectors
• Update weights on query vector $q$
• Re-compute similarity score to new $q$

Vector Space Re-Weighting

• Basic idea:
    • Increase weight for terms appearing in relevant documents
    • Decrease weight for terms appearing in irrelevant documents
• There are a few standard equations…

Vector Space Re-Weighting

• Rocchio:
    $q' = \alpha q + \frac{\beta}{|D_r|} \sum_{d_i \in D_r} d_i - \frac{\gamma}{|D_n|} \sum_{d_i \in D_n} d_i$
• Ide regular:
    $q' = \alpha q + \beta \sum_{d_i \in D_r} d_i - \gamma \sum_{d_i \in D_n} d_i$
• Ide Dec_hi:
    $q' = \alpha q + \beta \sum_{d_i \in D_r} d_i - \gamma \max_{d_i \in D_n}(d_i)$
    (here $\max$ selects the highest-ranked irrelevant document)
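The Rocchio update above can be sketched in a few lines of Python. This is only an illustration, not code from the course project: the function name `rocchio`, the sparse term-to-weight dictionaries, the toy "Saturn" vectors, and the choice of 1.0 for all three coefficients are assumptions made for the example.

```python
def rocchio(q, relevant, irrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the updated query vector q' as a term -> weight dict.

    q          : query vector as a term -> weight dict
    relevant   : list of document vectors judged relevant (D_r)
    irrelevant : list of document vectors judged irrelevant (D_n)
    The coefficient defaults are illustrative, not recommended values.
    """
    updated = {term: alpha * weight for term, weight in q.items()}
    for docs, coeff in ((relevant, beta), (irrelevant, -gamma)):
        if not docs:
            continue  # an empty set contributes nothing (avoids dividing by zero)
        scale = coeff / len(docs)  # beta/|D_r| or -gamma/|D_n|
        for d in docs:
            for term, weight in d.items():
                updated[term] = updated.get(term, 0.0) + scale * weight
    return updated


# Hypothetical toy data: the user wants Saturn the planet, not the car.
q0 = {"saturn": 1.0}
d_r = [{"saturn": 1.0, "planet": 1.0}, {"saturn": 1.0, "rings": 1.0}]
d_n = [{"saturn": 1.0, "car": 1.0}]

new_q = rocchio(q0, d_r, d_n)
print(new_q)  # {'saturn': 1.0, 'planet': 0.5, 'rings': 0.5, 'car': -1.0}
```

In this toy run the updated query gains positive weight on "planet" and "rings", which were not in the original query, and a negative weight on "car". Dropping the division by `len(docs)` would give the Ide regular variant.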
Vector Space Re-Weighting

• The initial query vector $q_0$ will have non-zero weights only for terms appearing in the query
• The query vector update process can add weight to terms that don’t appear in the original query
    • Automatic expansion of the original query terms
• Some terms can end up having negative weight!
    • E.g., if you want to find information on the planet Saturn, “car” could have a negative weight…

Probabilistic Re-Weighting

• After initial search, get feedback from user on document relevance
• Use relevance information to recalculate term weights
• Re-compute similarity and try again…

Probabilistic Re-Weighting

Remember from last time:

$Sim_{ij} \approx \sum_k w_{ik} w_{jk} \left( \ln\frac{P(t_k|R)}{1 - P(t_k|R)} + \ln\frac{1 - P(t_k|\bar{R})}{P(t_k|\bar{R})} \right)$

Initial estimates, before any feedback:

• $P(t_k|R) = 0.5$
• $P(t_k|\bar{R}) = n_k / N$; $n_k$ = # docs containing $t_k$, $N$ = # docs in the corpus

⇒ $Sim_{ij} = \sum_k w_{ik} w_{jk} \ln\frac{N - n_k}{n_k}$

Probabilistic Re-Weighting

Given document relevance feedback:

• $D_r$ = set of relevant retrieved docs
• $D_{rk}$ = subset of $D_r$ containing $t_k$

⇒ $P(t_k|R) = |D_{rk}| / |D_r|$
⇒ $P(t_k|\bar{R}) = (n_k - |D_{rk}|) / (N - |D_r|)$

Probabilistic Re-Weighting

• Substituting the new probabilities gives
    $Sim_{ij} = \sum_k w_{ik} w_{jk} \ln\frac{|D_{rk}| / (|D_r| - |D_{rk}|)}{(n_k - |D_{rk}|) / (N - |D_r| - (n_k - |D_{rk}|))}$
• However, small values of $|D_{rk}|$ and $|D_r|$ can cause problems, so usually a fudge factor is added

Probabilistic Re-Weighting

• Effectively updates query term weights
• No automatic query expansion
    • Terms not in the initial query are never considered
• No “memory” of previous weights

Query Expansion Via Local Clustering

• Create new queries automatically
• Cluster initial search results
• Use clusters to create new queries
• Compute $S_k(n)$, which is the set of keywords similar to $t_k$ that should be added to the query
• $D$ = set of retrieved documents
• $V$ = vocabulary of $D$

Association Clustering

Find terms that frequently co-occur within documents

• $s_{kl} = c_{kl} = \sum_i f_{ik} f_{il}$

Or, normalized association matrix $s$

• $s_{kl} = c_{kl} / (c_{kk} + c_{ll} - c_{kl})$

Association cluster $S_k(n)$

• $S_k(n)$ = $n$ largest $s_{kl}$ s.t. $l \neq k$

Metric Clustering

Measure distance between keyword appearances in a document

• $r(t_k, t_l)$ = # words between $t_k$ and $t_l$
• $c_{kl} = \sum_{t_k} \sum_{t_l} \frac{1}{r(t_k, t_l)}$ (summed over the word occurrences that stem to $t_k$ and $t_l$)
• Normalized $s_{kl} = c_{kl} / (|t_k| \times |t_l|)$
• $|t_k|$ = # of words stemmed to $t_k$
• $S_k(n)$ is the same as before…

Scalar Clustering

• Terms with similar association clusters are more likely to be synonyms
• $s_k$ is the row vector from association clustering $s_{kl}$
• $s_{kl} = (s_k \cdot s_l) / (|s_k| \times |s_l|)$
• $S_k(n)$ is the same as before…

Query Expansion Via Local Context Analysis

Combine information from initial search results and global corpus

• Break retrieved documents into fixed-length passages
• Treat each passage as a document, and rank order them using the initial query
• Compute the weight of each term in top ranked passages using a TFIDF-like similarity with query
• Take the top $m$ terms and add them to the original query with weight:
    $w = 1 - 0.9 \, (rank / m)$

Local Context Analysis

Compute weight of each term in top ranked passages

• $N$ = # of passages
• $n_k$ = # of passages containing $t_k$
• $pf_{ik}$ = frequency of $t_k$ in passage $p_i$
• $f(t_k, t_l) = \sum_i pf_{ik} \, pf_{il}$
• $idf_k = \max(1, \log_{10}(N / n_k) / 5)$
• $Sim(q, t_k) = \prod_{t_l \in query} \left( \delta + \frac{\ln(f(t_k, t_l) \times idf_k)}{\ln N} \right)^{idf_l}$
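The local context analysis scoring above can be sketched as follows. This is a minimal illustration under several assumptions that are not in the slides: passages arrive already tokenized, `delta` defaults to 0.1, the logarithm term is treated as 0 when a candidate term never co-occurs with a query term (the formula is undefined there), query terms that appear in no passage are skipped, and `rank` starts at 1. The names `lca_scores` and `expansion_weights` are hypothetical.

```python
import math
from collections import Counter


def lca_scores(passages, query, delta=0.1):
    """Score every non-query term t_k with Sim(q, t_k).

    passages : list of token lists (the top-ranked passages)
    query    : list of query terms
    """
    n_passages = len(passages)                      # N
    if n_passages < 2:
        raise ValueError("need at least two passages so that ln(N) > 0")
    pf = [Counter(p) for p in passages]             # pf[i][t] = frequency of t in passage i
    vocab = set().union(*passages)
    n = {t: sum(1 for c in pf if t in c) for t in vocab}   # n_k

    def idf(t):
        return max(1.0, math.log10(n_passages / n[t]) / 5.0)

    def cooccurrence(tk, tl):
        # f(t_k, t_l): Counter returns 0 for terms missing from a passage
        return sum(c[tk] * c[tl] for c in pf)

    # Assumption: query terms found in no passage are ignored.
    query_terms = [t for t in query if t in n]
    scores = {}
    for tk in vocab:
        if tk in query:
            continue
        sim = 1.0
        for tl in query_terms:
            f = cooccurrence(tk, tl)
            # Assumption: the log term is 0 when there is no co-occurrence.
            log_term = math.log(f * idf(tk)) / math.log(n_passages) if f > 0 else 0.0
            sim *= (delta + log_term) ** idf(tl)
        scores[tk] = sim
    return scores


def expansion_weights(scores, m):
    """Weight the top m terms with w = 1 - 0.9 * (rank / m), rank starting at 1."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:m]
    return {t: 1.0 - 0.9 * (rank / m) for rank, t in enumerate(ranked, start=1)}


# Hypothetical toy passages for the query "saturn".
passages = [
    ["saturn", "planet", "rings"],
    ["saturn", "planet", "moon"],
    ["saturn", "car", "dealer"],
]
scores = lca_scores(passages, ["saturn"])
weights = expansion_weights(scores, m=2)
```

In this toy run "planet" co-occurs with "saturn" in two passages, so it receives the highest score and the largest expansion weight. The remaining terms tie, so which of them fills the second slot is arbitrary.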
Global Analysis

• Examine all documents in the corpus
• Create a corpus-wide data structure that is used by all queries

Similarity Thesaurus

• $itf_i = \ln(|V| / |V_i|)$
• $w_{ik} = \frac{(0.5 + 0.5 \, f_{ik} / \max_i(f_{ik})) \, itf_i}{\sqrt{\sum_j (0.5 + 0.5 \, f_{jk} / \max_j(f_{jk}))^2 \, itf_j^2}}$
• $c_{kl} = \sum_i w_{ik} w_{il}$

Query Expansion With Similarity Thesaurus

• $w_{qk}$ = weight from above with query
• $sim(q, t_l) = \sum_k w_{qk} c_{kl}$
• Take top $m$ terms according to $sim(q, t_l)$
• Assign query weights to new terms
    • $w_k = sim(q, t_k) / \sum_l w_{ql}$
• Re-run with new (weighted) query

SONIA Feedback

SONIA is a meta-search engine

• Clusters search results
• Extracts keywords describing each cluster
• Allows user to expand search within cluster

ifile Feedback

ifile is an automatic mail filer for mh

• Email is automatically filed in folders
• User refile actions provide feedback on poor classification decisions
• Does not use positive feedback
    • No reinforcement of correct decisions, only correction based on bad decisions

Search Engine Feedback

Search engines rarely allow relevance feedback

• Too CPU-intensive
• Most people aren’t willing to wait
• Search engines typically operate near the edge of their capacity
    • Google has 2,000 servers(?) and 300TB of data

Meta-Search Engine Feedback

Meta-search engines collect and process results from several search engines

• Most search engines do not allow users to specify query term weights
• Some search engines allow users to specify “negative” query terms
• User relevance feedback might be used to weight results by search engine

Pet Peeves

What are your pet peeves when trying to locate information from

• Web?
• Intranet?
• Email?
• Local disk?
• Peer-to-peer network?
• Who knows?

Information Types

What types of information do you typically try to find?

• Technical papers?
• Product information? (e.g., books, motherboards, …)
• Travel? (e.g., where to go for vacation, and what to do there?)
• People? (e.g., Mehran Sahami)
• …

Project

• First draft of search is available
    • It is incomplete and buggy
    • Can look at architecture to see how you might extend it
• Python w/ extensions
    • Should be installed on cluster
    • RPMs/SRPMs available on web

http://www.hpl.hp.com/personal/Carl_Staelin/cs236601/software/