Review of Last Lecture CS419/519 Information Filtering and Retrieval Jon Herlocker Dept. of Computer Science Oregon State University Tu/Th Day 2
Today
• Tour of an research information retrieval system – Traditional IR search engine – Collaborative filtering recommendations – Crawling, pre-processing, and indexing – Personalization – User study
Project Idea Presentations
• Project idea presentations • Introduction to IR • Assignments for next week
What is Information Retrieval? • How does it differ from databases (data retrieval)?
Data Retrieval vs. Information Retrieval Data retrieval Data Table Exact match Matching SQL(artificial) Query specification Complete Model Deterministic Highly structured Content Data object Matching Items wanted Query language
Information retrieval Information Document, image, other Partial match, best match Relevant Natural Incomplete Probabilistic Less structured
Table by Xin Xao, Drexel University
1
Data Retrieval vs. Information Retrieval • Information retrieval solutions may incorporate data retrieval
Next Models of information retrieval
– Data retrieval as a subset of information retrieval
• For this class, data retrieval alone is not interesting
Traditional Information Need Model
Some Assumptions of Traditional Model
1. User has an information need 2. Users forms a query 3. IR system makes best match with documents 4. User evaluates ranked documents 5. Is the need met? 6a. Yes -> done 6b. No -> reformulate query
• It is possible for the user to specify their exact needs • Document texts are functionally equivalent to information needs – Essentially document retrieval
• The information need remains constant throughout search process • The user will always recognize relevant documents (Belkin, Oddy, & Brooks)
Relevance?
Other Models
• What does relevance mean? • A document might be relevant for many reasons
• Anomalous States of Knowledge (ASKs)
– – – –
Answers a question with a fact Gives part of an answer Gives link to the answer Gives related information
• Relevance is subjective! – We’ll return to this discussion later
– (Belkin, Oddy, and Brooks) – A recognized anomaly in a user’s state of knowledge that the user is not able to specify specifically
• Berry-Picking Model – (Bates 90) – Interesting information scattered like berries on bushes – The query is continually shifting
2
Takeaway Messages • •
Textbook Roadmap
Modeling information need and user activity is complex Be aware of simplistic assumptions that most IR work makes
MIR – Chap. 2 - Models
Models for Retrieval
• Goal is to retrieve documents that are relevant to a user’s single information need
• What documents best match the need described by a query? • Modeling
– Based on those traditional assumptions discussed earlier
– Information need • From a query
– Information content • From a document
– Closeness or similarity • Between need and content
• Based on content analysis – Measurable attributes of queries and documents
3
Generic Document Retrieval Model
Best Matching Documents
Information Need Ranking Algorithm
Query Language Representation of Information Need
Representation of Document Content
Prior Knowledge & Assumptions
Models Covered In this class: • Boolean Model • Vector Space Model Potentially more if time permits and interest.
Documents
Index Terms • Roughly a word or phrase describing the content of a document • Manual Indexing – A human reads or scans a document and assigns it index terms – (i.e. Library of Congress subject terms)
Full-text Indexing Models • Boolean model • Vector space model • Probabilistic model
• Automatic Indexing – Full-text indexing – Every word in the document becomes an index term
Generic Full-Text Indexing Definitions • K = {k1, ..., kt}
– Set of all index terms
• wi, j > 0
– Weight of term ki in document dj – wi,j = 0 if ki is not in dj
Boolean Model • Compute vectors for each document (dj) • If a keyword ki appears in dj, then wi,j is 1, otherwise 0
• Each document is describe as a vector – dj = (w1,j, w2,j, ..., wt,j) Modern Information Retrieval, Baeza-Yates & Ribeiro-Neto
4
Boolean Query Language • Terms – Words – Phrases
• Operators – AND – OR – NOT
Rules for Boolean Logic • DeMorgan’s Law – NOT (A AND B) = (NOT A) OR (NOT B) – NOT(A OR B) = (NOT A) AND (NOT B)
• Search for “Boolean Logic” if you want to know more
Example Boolean Queries • • • • • •
House House AND Corvallis House OR Corvallis (House OR Condo) AND Corvallis House AND Oregon AND NOT Eugene (House OR Condo) AND Corvallis and NOT Eugene
Informal Pseudo Boolean Notation • Evolved in web search engines • +house +Corvallis – house AND Corvallis
• +house +Oregon –Eugene – House and Oregon and NOT Eugene
• House +Corvallis –?
Pseudo Boolean Notation
Ordering of Retrieved Documents
• House +Corvallis
• Pure Boolean has no order
– (Corvallis AND House) OR Corvallis
• +House +Condo Corvallis Salem – (House and Condo and Corvallis) OR (House and Condo and Salem) OR (House and Condo)
– All returned documents are equally relevant
• In reality, different approaches can be taken – Chronologically – Order by number of times a specified term occurs – Other approaches – get further and further away from Boolean.
5
Boolean Searching • Upsides – Easy to implement – Simple queries are easy – Query language gives significant control over results
• Downsides – Binary relevance decision
Who Uses Boolean Searches • Everybody until about ten-fifteen years ago • Even now, many commercial systems (library catalogs, abstracting services, etc)
• No ordering criteria • Usually too much or too little
– Syntax can be complex
Boolean model – Data Retrieval or Information Retrieval? • Very close – hard to distinguish • Differences
Proximity operators • NEAR • WITHIN
– Enormous number of attributes – A document only has values for a few of those attributes – Inefficient to store and search using traditional data retrieval methods – Ordering may still be important
6