Cs419-519 Slides Part 2

July 2020
PDF

Download

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA

Overview

Download & View Cs419-519 Slides Part 2 as PDF for free.

More details

Words: 989
Pages: 6

Preview
Full text

Review of Last Lecture CS419/519 Information Filtering and Retrieval Jon Herlocker Dept. of Computer Science Oregon State University Tu/Th Day 2

Today

• Tour of an research information retrieval system – Traditional IR search engine – Collaborative filtering recommendations – Crawling, pre-processing, and indexing – Personalization – User study

Project Idea Presentations

• Project idea presentations • Introduction to IR • Assignments for next week

What is Information Retrieval? • How does it differ from databases (data retrieval)?

Data Retrieval vs. Information Retrieval Data retrieval Data Table Exact match Matching SQL(artificial) Query specification Complete Model Deterministic Highly structured Content Data object Matching Items wanted Query language

Information retrieval Information Document, image, other Partial match, best match Relevant Natural Incomplete Probabilistic Less structured

Table by Xin Xao, Drexel University

1

Data Retrieval vs. Information Retrieval • Information retrieval solutions may incorporate data retrieval

Next Models of information retrieval

– Data retrieval as a subset of information retrieval

• For this class, data retrieval alone is not interesting

Traditional Information Need Model

Some Assumptions of Traditional Model

1. User has an information need 2. Users forms a query 3. IR system makes best match with documents 4. User evaluates ranked documents 5. Is the need met? 6a. Yes -> done 6b. No -> reformulate query

• It is possible for the user to specify their exact needs • Document texts are functionally equivalent to information needs – Essentially document retrieval

• The information need remains constant throughout search process • The user will always recognize relevant documents (Belkin, Oddy, & Brooks)

Relevance?

Other Models

• What does relevance mean? • A document might be relevant for many reasons

• Anomalous States of Knowledge (ASKs)

– – – –

Answers a question with a fact Gives part of an answer Gives link to the answer Gives related information

• Relevance is subjective! – We’ll return to this discussion later

– (Belkin, Oddy, and Brooks) – A recognized anomaly in a user’s state of knowledge that the user is not able to specify specifically

• Berry-Picking Model – (Bates 90) – Interesting information scattered like berries on bushes – The query is continually shifting

2

Takeaway Messages • •

Textbook Roadmap

Modeling information need and user activity is complex Be aware of simplistic assumptions that most IR work makes

MIR – Chap. 2 - Models

Models for Retrieval

• Goal is to retrieve documents that are relevant to a user’s single information need

• What documents best match the need described by a query? • Modeling

– Based on those traditional assumptions discussed earlier

– Information need • From a query

– Information content • From a document

– Closeness or similarity • Between need and content

• Based on content analysis – Measurable attributes of queries and documents

3

Generic Document Retrieval Model

Best Matching Documents

Information Need Ranking Algorithm

Query Language Representation of Information Need

Representation of Document Content

Prior Knowledge & Assumptions

Models Covered In this class: • Boolean Model • Vector Space Model Potentially more if time permits and interest.

Documents

Index Terms • Roughly a word or phrase describing the content of a document • Manual Indexing – A human reads or scans a document and assigns it index terms – (i.e. Library of Congress subject terms)

Full-text Indexing Models • Boolean model • Vector space model • Probabilistic model

• Automatic Indexing – Full-text indexing – Every word in the document becomes an index term

Generic Full-Text Indexing Definitions • K = {k1, ..., kt}

– Set of all index terms

• wi, j > 0

– Weight of term ki in document dj – wi,j = 0 if ki is not in dj

Boolean Model • Compute vectors for each document (dj) • If a keyword ki appears in dj, then wi,j is 1, otherwise 0

• Each document is describe as a vector – dj = (w1,j, w2,j, ..., wt,j) Modern Information Retrieval, Baeza-Yates & Ribeiro-Neto

4

Boolean Query Language • Terms – Words – Phrases

• Operators – AND – OR – NOT

Rules for Boolean Logic • DeMorgan’s Law – NOT (A AND B) = (NOT A) OR (NOT B) – NOT(A OR B) = (NOT A) AND (NOT B)

• Search for “Boolean Logic” if you want to know more

Example Boolean Queries • • • • • •

House House AND Corvallis House OR Corvallis (House OR Condo) AND Corvallis House AND Oregon AND NOT Eugene (House OR Condo) AND Corvallis and NOT Eugene

Informal Pseudo Boolean Notation • Evolved in web search engines • +house +Corvallis – house AND Corvallis

• +house +Oregon –Eugene – House and Oregon and NOT Eugene

• House +Corvallis –?

Pseudo Boolean Notation

Ordering of Retrieved Documents

• House +Corvallis

• Pure Boolean has no order

– (Corvallis AND House) OR Corvallis

• +House +Condo Corvallis Salem – (House and Condo and Corvallis) OR (House and Condo and Salem) OR (House and Condo)

– All returned documents are equally relevant

• In reality, different approaches can be taken – Chronologically – Order by number of times a specified term occurs – Other approaches – get further and further away from Boolean.

5

Boolean Searching • Upsides – Easy to implement – Simple queries are easy – Query language gives significant control over results

• Downsides – Binary relevance decision

Who Uses Boolean Searches • Everybody until about ten-fifteen years ago • Even now, many commercial systems (library catalogs, abstracting services, etc)

• No ordering criteria • Usually too much or too little

– Syntax can be complex

Boolean model – Data Retrieval or Information Retrieval? • Very close – hard to distinguish • Differences

Proximity operators • NEAR • WITHIN

– Enormous number of attributes – A document only has values for a few of those attributes – Inefficient to store and search using traditional data retrieval methods – Ordering may still be important

6

Cs419-519 Slides Part 2

Overview

More details

Related Documents

Interim Slides Part 2

Cs419-519 Slides Part 2

Interim Slides Part 1

Interim Slides Part 3

Slides 2

Slides 2