Relevance Ranking and Relevance Feedback
Carl Staelin
Motivation - Feast or famine
• Queries return either too few or too many results
• Users are generally looking for the best document with a particular piece of information
• Users don't want to look through hundreds of documents to locate the information
⇒ Rank documents according to expected relevance!
Model
• Can we get user feedback?
  • Document score is influenced by similarity to previously user-rated documents
• Can we utilize external information?
  • E.g., how many other documents reference this document?
Queries
• Most queries are short
  • One to three words
• Many queries are ambiguous
  • "Saturn": Saturn the planet? Saturn the car?
Sample “Internal” Features
• Term frequency
• Location of term appearance in document
• Capitalization
• Font
• Relative font size
• Bold, italic, …
• Appearance in <title>, <meta>, … tags
• Co-location with other (relevant) words
Sample “External” Features
• Document citations
  • How often is it cited?
  • How important are the documents that cited it?
  • Relevance of the text surrounding the hyperlink
  • Relevance of the documents citing this document
• Location within website (e.g., height in the directory structure or click distance from /)
• Popularity of pages from similar queries
  • Search engine results often link through the search site so that click-throughs can be tracked
Your idea here…
Problem
Given all these features, how do we rank the search results?
• Often use similarity between query and document
• May use other factors to weight the ranking score, such as citation ranking
• May use an iterative search which ranks documents according to similarity/dissimilarity to the query and to previously marked relevant/irrelevant documents
Relevance Feedback
• Often an information retrieval system does not return useful information on the first try!
• If at first you don't succeed, try, try again
• Find out from the user which results were most relevant, and try to find more documents like them and fewer like the others
• Assumption: relevant documents are somehow similar to each other and different from irrelevant documents
• Question: how?
Relevance Feedback Methods
Two general approaches:
• Create new queries with user feedback
• Create new queries automatically
In either case, the system can:
• Re-compute document weights with new information
• Expand or modify the query to more accurately reflect the user's desires
Vector Space Re-Weighting
• Given a query Q with its query vector q
• Initial, user-annotated results D:
  • Dr = relevant, retrieved documents
  • Dn = irrelevant, retrieved documents
  • di are the document weight vectors
• Update the weights on the query vector q
• Re-compute the similarity scores against the new q
Vector Space Re-Weighting
Basic idea:
• Increase weight for terms appearing in relevant documents
• Decrease weight for terms appearing in irrelevant documents
There are a few standard equations…
Vector Space Re-Weighting
Rocchio:
  q' = α q + (β/|Dr|) ∑_{di∈Dr} di − (γ/|Dn|) ∑_{di∈Dn} di
Ide regular:
  q' = α q + β ∑_{di∈Dr} di − γ ∑_{di∈Dn} di
Ide dec-hi:
  q' = α q + β ∑_{di∈Dr} di − γ max_{di∈Dn}(di)
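For concreteness, here is a minimal sketch of Rocchio-style re-weighting in Python with NumPy; the function name and the α=1.0, β=0.75, γ=0.15 defaults are illustrative choices, not values from the lecture.

```python
import numpy as np

def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query re-weighting.

    q          -- original query vector (1-D array over the vocabulary)
    relevant   -- list of document weight vectors marked relevant (Dr)
    irrelevant -- list of document weight vectors marked irrelevant (Dn)
    Returns the updated query vector q'.
    """
    q_new = alpha * np.asarray(q, dtype=float)
    if relevant:                        # + (beta / |Dr|) * sum of relevant vectors
        q_new += beta * np.mean(relevant, axis=0)
    if irrelevant:                      # - (gamma / |Dn|) * sum of irrelevant vectors
        q_new -= gamma * np.mean(irrelevant, axis=0)
    # Terms absent from the original query can gain weight (automatic expansion),
    # and some terms may end up with negative weight (see the next slide).
    return q_new
```

Ide regular drops the 1/|Dr| and 1/|Dn| normalizations, and Ide dec-hi subtracts only the single highest-ranked irrelevant document; both are one-line changes to this sketch.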
Vector Space Re-Weighting
• The initial query vector q0 will have non-zero weights only for terms appearing in the query
• The query vector update process can add weight to terms that don't appear in the original query
  • Automatic expansion of the original query terms
• Some terms can end up having negative weight!
  • E.g., if you want to find information on the planet Saturn, "car" could have a negative weight…
Probabilistic Re-Weighting
• After the initial search, get feedback from the user on document relevance
• Use the relevance information to recalculate term weights
• Re-compute similarity and try again…
Probabilistic Re-Weighting
Remember from last time:
  Sim_ij = ∑_k w_ik w_jk (ln(P(t_k|R) / (1 − P(t_k|R))) + ln((1 − P(t_k|R̄)) / P(t_k|R̄)))
With the initial estimates
  P(t_k|R) = 0.5
  P(t_k|R̄) = n_i / N, where n_i = # of docs containing t_k
⇒ Sim_ij = ∑_k w_ik w_jk ln((N − n_i) / n_i)
Probabilistic Re-Weighting
Given document relevance feedback:
  D_r = set of relevant retrieved docs
  D_rk = subset of D_r containing t_k
⇒ P(t_k|R) = |D_rk| / |D_r|
  P(t_k|R̄) = (n_i − |D_rk|) / (N − |D_r|)
Probabilistic Re-Weighting
Substituting the new probabilities gives
  Sim_ij = ∑_k w_ik w_jk ln( (|D_rk| / (|D_r| − |D_rk|)) · ((N − |D_r| − (n_i − |D_rk|)) / (n_i − |D_rk|)) )
However, small values of |D_rk| and |D_r| can cause problems (zero numerators or denominators), so usually a small adjustment factor (commonly 0.5) is added to each count.
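A minimal sketch of this re-weighting, assuming queries and documents are represented as term-weight dictionaries; the function names are illustrative, and the 0.5 smoothing is the common adjustment factor mentioned above.

```python
import math

def prob_term_weight(N, n_i, Dr, Drk, adjust=0.5):
    """ln(P(t|R)/(1-P(t|R))) + ln((1-P(t|not R))/P(t|not R)) with smoothing.

    N    -- total number of documents in the collection
    n_i  -- number of documents containing the term
    Dr   -- |Dr|, number of retrieved documents the user marked relevant
    Drk  -- |Drk|, number of those relevant documents containing the term
    """
    p_r = (Drk + adjust) / (Dr + 2 * adjust)             # P(t|R)
    p_n = (n_i - Drk + adjust) / (N - Dr + 2 * adjust)   # P(t|not R)
    return math.log(p_r / (1.0 - p_r)) + math.log((1.0 - p_n) / p_n)

def similarity(q_weights, d_weights, term_weights):
    """Sim = sum_k w_qk * w_dk * term_weight_k over terms shared by query and doc."""
    shared = q_weights.keys() & d_weights.keys()
    return sum(q_weights[t] * d_weights[t] * term_weights[t] for t in shared)
```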
Probabilistic Re-Weighting
• Effectively updates query term weights
• No automatic query expansion
  • Terms not in the initial query are never considered
• No "memory" of previous weights
Query Expansion Via Local Clustering
• Create new queries automatically
  • Cluster initial search results
  • Use clusters to create new queries
• Compute S_k(n), which is the set of keywords similar to t_k that should be added to the query
• D = set of retrieved documents
• V = vocabulary of D
Association Clustering
• Find terms that frequently co-occur within documents:
  s_kl = c_kl = ∑_i f_ik f_il, where f_ik is the frequency of term t_k in document d_i
• Or, use the normalized association matrix:
  s_kl = c_kl / (c_kk + c_ll − c_kl)
• Association cluster:
  S_k(n) = the n largest s_kl such that l ≠ k
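A minimal sketch of association clustering, assuming the retrieved documents have already been reduced to per-document term-frequency dictionaries (function and variable names are illustrative):

```python
from collections import defaultdict

def association_clusters(doc_term_freqs, n, normalized=True):
    """Build S_k(n) for every term from local co-occurrence counts.

    doc_term_freqs -- list of dicts, one per retrieved document,
                      mapping term -> frequency in that document (f_ik)
    n              -- number of neighbors to keep per term
    Returns a dict: term t_k -> list of the n terms with the largest s_kl.
    """
    c = defaultdict(float)                          # c_kl = sum_i f_ik * f_il
    for freqs in doc_term_freqs:
        for tk, fk in freqs.items():
            for tl, fl in freqs.items():
                c[(tk, tl)] += fk * fl

    terms = {t for freqs in doc_term_freqs for t in freqs}
    clusters = {}
    for tk in terms:
        scores = {}
        for tl in terms:
            if tl == tk:
                continue
            ckl = c.get((tk, tl), 0.0)
            if normalized:                          # s_kl = c_kl / (c_kk + c_ll - c_kl)
                denom = c[(tk, tk)] + c[(tl, tl)] - ckl
                scores[tl] = ckl / denom if denom else 0.0
            else:
                scores[tl] = ckl
        clusters[tk] = sorted(scores, key=scores.get, reverse=True)[:n]
    return clusters
```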
Metric Clustering
• Measure the distance between keyword appearances in a document:
  r(t_k, t_l) = # of words between an occurrence of t_k and an occurrence of t_l
• c_kl = ∑ over all occurrence pairs of t_k and t_l of 1 / r(t_k, t_l)
• Normalized: s_kl = c_kl / (|t_k| |t_l|)
  • |t_k| = # of words stemmed to t_k
• S_k(n) is the same as before…
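A minimal metric-clustering sketch, assuming each retrieved document is an ordered list of stemmed tokens; here r(t_k, t_l) is taken as the distance in word positions (adjacent words have distance 1), and the names are illustrative:

```python
from collections import defaultdict

def metric_correlations(doc_tokens_list):
    """Compute normalized metric correlations s_kl from token positions.

    doc_tokens_list -- list of documents, each a list of stemmed tokens in order
    Returns (s, counts): s[(tk, tl)] = c_kl / (|tk| * |tl|), counts[tk] = |tk|.
    """
    c = defaultdict(float)
    counts = defaultdict(int)              # |t_k| = number of words stemmed to t_k
    for tokens in doc_tokens_list:
        for t in tokens:
            counts[t] += 1
        # accumulate 1/r for every pair of occurrences of distinct terms
        for i, tk in enumerate(tokens):
            for j, tl in enumerate(tokens):
                if i != j and tk != tl:
                    c[(tk, tl)] += 1.0 / abs(i - j)   # r = distance in word positions
    s = {(tk, tl): v / (counts[tk] * counts[tl]) for (tk, tl), v in c.items()}
    return s, counts
```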
Scalar Clustering
• Terms with similar association clusters are more likely to be synonyms
• s_k is the row vector of association values s_kl for term t_k
• New similarity: s_kl = (s_k · s_l) / (|s_k| |s_l|), the cosine of the two rows
• S_k(n) is the same as before…
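A minimal sketch of scalar clustering on top of an association matrix, assuming NumPy and an array indexed consistently with a term list (names are illustrative):

```python
import numpy as np

def scalar_clusters(assoc, terms, n):
    """Scalar clustering: compare terms by the cosine of their association rows.

    assoc -- square NumPy array, assoc[k, l] = s_kl from association clustering
    terms -- list of term strings, indexed consistently with assoc
    n     -- number of neighbors to keep per term
    Returns a dict: term -> list of its n most similar terms.
    """
    norms = np.linalg.norm(assoc, axis=1)
    norms[norms == 0] = 1.0                     # avoid division by zero
    cos = (assoc @ assoc.T) / np.outer(norms, norms)
    clusters = {}
    for k, tk in enumerate(terms):
        order = np.argsort(-cos[k])             # most similar first
        clusters[tk] = [terms[l] for l in order if l != k][:n]
    return clusters
```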
Query Expansion Via Local Context Analysis
Combine information from the initial search results and the global corpus:
• Break retrieved documents into fixed-length passages
• Treat each passage as a document, and rank-order the passages using the initial query
• Compute the weight of each term in the top-ranked passages using a TF-IDF-like similarity with the query
• Take the top m terms and add them to the original query with weight:
w = 1 – 0.9(rank / m)
Local Context Analysis
Compute the weight of each term in the top-ranked passages:
  N = # of passages
  n_k = # of passages containing t_k
  pf_ik = frequency of t_k in passage p_i
  f(t_k, t_l) = ∑_i pf_ik pf_il
  idf_k = max(1, log10(N/n_k) / 5)
  Sim(q, t_k) = ∏_{t_l ∈ query} (δ + ln(f(t_k, t_l) · idf_k) / ln(N))^idf_l
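A minimal sketch of the local context analysis scoring and expansion step, assuming the passages are already ranked and reduced to term-frequency dictionaries; δ, the names, and the handling of query terms that appear in no passage are illustrative assumptions.

```python
import math
from collections import defaultdict

def lca_expansion_terms(query_terms, top_passages, m, delta=0.1):
    """Rank candidate expansion terms against the query using local context analysis.

    query_terms  -- list of query terms t_l
    top_passages -- list of dicts, one per top-ranked passage, term -> frequency (pf_ik)
    m            -- number of expansion terms to add (assumes len(top_passages) > 1)
    Returns a list of (term, weight) pairs with weight = 1 - 0.9 * (rank / m).
    """
    N = len(top_passages)
    n = defaultdict(int)                         # n_k = # of passages containing t_k
    for pf in top_passages:
        for t in pf:
            n[t] += 1

    def cooccur(tk, tl):                         # f(t_k, t_l) = sum_i pf_ik * pf_il
        return sum(pf.get(tk, 0) * pf.get(tl, 0) for pf in top_passages)

    def idf(tk):                                 # idf_k = max(1, log10(N / n_k) / 5)
        return max(1.0, math.log10(N / n[tk]) / 5.0)

    def sim(tk):                                 # product over the query terms t_l
        score = 1.0
        for tl in query_terms:
            f = cooccur(tk, tl)
            inner = delta + (math.log(f * idf(tk)) / math.log(N) if f > 0 else 0.0)
            score *= inner ** (idf(tl) if tl in n else 1.0)
        return score

    candidates = [t for t in n if t not in query_terms]
    ranked = sorted(candidates, key=sim, reverse=True)[:m]
    return [(t, 1.0 - 0.9 * (rank / m)) for rank, t in enumerate(ranked, start=1)]
```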
Global Analysis
• Examine all documents in the corpus
• Create a corpus-wide data structure that is used by all queries
Similarity Thesaurus
Build a term-term correlation matrix over the whole corpus (each term is represented as a vector over documents):
  itf_i = ln(|V| / |V_i|), where V is the corpus vocabulary and V_i is the vocabulary of document d_i
  w_ik = ((0.5 + 0.5 f_ik / max_i(f_ik)) · itf_i) / sqrt(∑_j ((0.5 + 0.5 f_jk / max_j(f_jk)) · itf_j)²)
  c_kl = ∑_i w_ik w_il
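A minimal sketch of building the similarity thesaurus with NumPy, assuming a documents-by-terms frequency matrix; giving zero weight to terms absent from a document and normalizing each term vector to unit length are assumptions consistent with the usual similarity-thesaurus formulation rather than details stated on the slide.

```python
import numpy as np

def similarity_thesaurus(freq):
    """Build the term-term correlation matrix c_kl from a corpus frequency matrix.

    freq -- NumPy array of shape (num_docs, num_terms), freq[i, k] = f_ik
    Returns c with c[k, l] = sum_i w_ik * w_il.
    """
    vocab_size = np.count_nonzero(freq.any(axis=0))                 # |V|
    doc_vocab = np.maximum(np.count_nonzero(freq > 0, axis=1), 1)   # |V_i|
    itf = np.log(vocab_size / doc_vocab)                            # itf_i = ln(|V| / |V_i|)

    max_per_term = np.maximum(freq.max(axis=0), 1)                  # max_i(f_ik) per term
    tf = np.where(freq > 0, 0.5 + 0.5 * freq / max_per_term, 0.0)   # 0 when term absent
    w = tf * itf[:, None]                                           # unnormalized w_ik
    norms = np.linalg.norm(w, axis=0, keepdims=True)                # per-term vector norm
    norms[norms == 0] = 1.0
    w = w / norms
    return w.T @ w                                                  # c_kl = sum_i w_ik w_il
```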
Query Expansion With Similarity Thesaurus
• w_qk = the weight of term t_k in the query q, computed with the weighting above
• sim(q, t_l) = ∑_k w_qk c_kl
• Take the top m terms according to sim(q, t_l)
• Assign query weights to the new terms:
  w_k = sim(q, t_k) / ∑_l w_ql
• Re-run with the new (weighted) query
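A minimal sketch of the expansion step, reusing the correlation matrix c from the previous sketch (the array layout and names are illustrative assumptions):

```python
import numpy as np

def expand_query(q_weights, c, m):
    """Expand a weighted query using the similarity thesaurus.

    q_weights -- NumPy array of length num_terms, w_qk for each term (0 if absent)
    c         -- term-term correlation matrix from the similarity thesaurus
    m         -- number of expansion terms to add
    Returns the expanded query weight vector.
    """
    sim = q_weights @ c                                  # sim(q, t_l) = sum_k w_qk c_kl
    candidates = np.where(q_weights == 0)[0]             # terms not already in the query
    top = candidates[np.argsort(-sim[candidates])][:m]   # top m by sim(q, t_l)
    expanded = q_weights.astype(float).copy()
    expanded[top] = sim[top] / q_weights.sum()           # w_k = sim(q, t_k) / sum_l w_ql
    return expanded
```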
SONIA Feedback
• SONIA is a meta-search engine
• Clusters search results
• Extracts keywords describing each cluster
• Allows the user to expand the search within a cluster
ifile Feedback
• ifile is an automatic mail filer for mh
• Email is automatically filed in folders
• User refile actions provide feedback on poor classification decisions
• Does not use positive feedback
  • No reinforcement of correct decisions, only correction based on bad decisions
Search Engine Feedback
• Search engines rarely allow relevance feedback
  • Too CPU-intensive
  • Most people aren't willing to wait
  • Search engines typically operate near the edge of their capacity (Google has 2,000 servers(?) and 300 TB of data)
Meta-Search Engine Feedback
• Meta-search engines collect and process results from several search engines
  • Most search engines do not allow users to specify query term weights
  • Some search engines allow users to specify "negative" query terms
  • User relevance feedback might be used to weight results by search engine
Pet Peeves
What are your pet peeves when trying to locate information from:
• Web?
• Intranet?
• Email?
• Local disk?
• Peer-to-peer network?
• Who knows?
Information Types
What types of information do you typically try to find?
• Technical papers?
• Product information? (e.g., books, motherboards, …)
• Travel? (e.g., where to go for vacation, and what to do there?)
• People? (e.g., Mehran Sahami)
• …
Project
• First draft of search is available
  • It is incomplete and buggy
  • Can look at the architecture to see how you might extend it
• Python w/ extensions
  • Should be installed on the cluster
  • RPMs/SRPMs available on the web:
    http://www.hpl.hp.com/personal/Carl_Staelin/cs236601/software/