Information Retrieval 1

  • Uploaded by: Chethan.M
  • October 2019

IR – Introduction

Created by Chethan.M

Information Retrieval
ISiM Syllabus: MISM 623 – Information Retrieval Systems

Course Objectives

This course examines information retrieval within the context of full-text datasets. The students should be able to understand and critique existing information retrieval systems and to design and build information retrieval systems themselves. The course will introduce students to traditional methods as well as recent advances in information retrieval (IR), handling and querying of textual data. The focus will be on newer techniques of processing and retrieving textual information, including hypertext documents available on the World Wide Web.

Course Outline

Topics covered include:
• IR Models
  o Boolean Model
  o Vector Space Model
  o Relational DBMS
  o Probabilistic Models
  o Language Models
• Web Information Retrieval
  o citation network analysis
  o social collaboration (PageRank and HITS algorithms)
• Term Indexing
  o Zipf's Law
  o term weighting
• Searching and Data Structures
  o Inverted files to support Boolean and Vector Models
  o Clustering
    – non-hierarchical
    – single pass
    – reallocation
    – hierarchical agglomerative
  o String Searching
  o Tries, binary tries, binary digital tries, suffix trees, etc.
• Retrieval Effectiveness Evaluation
  o Recall, Precision, Fallout
  o Comparing systems using average precision

Course Readings (Chethan is using):
1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar


Example of IR: Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval.

What is Information Retrieval?

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Motivation

Information retrieval deals with the representation, storage, and organization of, and access to, information items. The representation and organization of the information items should give the user easy access to the information of interest. Unfortunately, characterizing the user's information need is not a simple problem. Given the user query, the key goal of an IR system (search engine) is to retrieve information that might be useful or relevant to the user. The emphasis is on the retrieval of information as opposed to the retrieval of data.

Information Retrieval is:
• The indexing and retrieval of textual documents.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving efficiently from large sets of documents.
• Selectivity.
• Finding some desired information in a store of information.
• IR = a process of selecting from a source.
• IR and literature searching (finding documents).

Information Retrieval System

“An information retrieval system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of the system is to capture wanted items and to filter out unwanted items.”

Information retrieval systems can be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales.
• In web search, the system has to provide search over billions of documents stored on millions of computers.
• At the other extreme is personal information retrieval. In the last few years, consumer operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search).
• In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry.


Data Retrieval:
• Which documents contain a set of keywords
• Well-defined semantics
• A single erroneous object implies failure

Information Retrieval:
• Information about a subject or topic
• Semantics is frequently loose
• Small errors are tolerated
• Natural-language (NLP) retrieval over unstructured data
• Ranking and relevance

IR System:
• Interprets the contents of information items.
• Generates a ranking which reflects relevance.
• The notion of relevance is most important.

Data Retrieval vs. Information Retrieval

What we’re retrieving
• Databases: Structured data. Clear semantics based on a formal model.
• Information Retrieval: Mostly unstructured. Free text with some metadata.

Queries we’re posing
• Databases: Formally defined queries. Unambiguous.
• Information Retrieval: Vague, imprecise information needs (often expressed in natural language).

Results we get
• Databases: Exact. Always correct in a formal sense.
• Information Retrieval: Sometimes relevant, often not.

Interaction with system
• Databases: One-shot queries.
• Information Retrieval: Interaction is important (relevance feedback).

Text Database vs. Database

1. Text database: emphasis on retrieval processing. Database: emphasis on transaction processing.
2. Text database: no data updates. Database: data updates.
3. Text database: no data-integrity enforcement. Database: data integrity.
4. Text database: unstructured data (e.g. book, web page). Database: structured data (e.g. student record, registration data).


History (Past)

• 1960-70’s:
  – Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents.
  – Development of the basic Boolean and vector-space models of retrieval.
  – Prof. Salton and his students at Cornell University were the leading researchers in the area.

• 1980’s:
  – Large document database systems, many run by companies:
    • Lexis-Nexis
    • Dialog
    • MEDLINE

• 1990’s:
  – Searching FTPable documents on the Internet
    • Archie
    • WAIS
  – Searching the World Wide Web
    • Lycos
    • Yahoo
    • Altavista

• 1990’s continued:
  – Organized Competitions
    • NIST TREC
  – Recommender Systems
    • Ringo
    • Amazon
    • NetPerceptions
  – Automated Text Categorization & Clustering

• 2000’s:
  – Link analysis for Web Search
    • Google
  – Automated Information Extraction
    • Whizbang
    • Fetch
    • Burning Glass
  – Question Answering
    • TREC Q/A track


• 2000’s continued:
  – Multimedia IR
    • Image
    • Video
    • Audio and music
  – Cross-Language IR
    • DARPA TIDES
  – Document Summarization

Present

Sources of data:
• Electronic libraries
• University documents
• Online data (web sites)

Examples:
• AltaVista
• Google
• Etc.

Past, Present and Future
• The library was the first organization to practise IR.
• Indexing was done by academic and private bodies.
• Searching techniques (past, in the library):
  – Title, subject
  – Hierarchical search systems (e.g. Dewey Decimal), controlled vocabularies, collections of abstracts
• Searching techniques (present, in the library):
  – Department (of a faculty), term index
  – Development of user-interface formats
  – Electronic services
  – Hypertext services

Related Research Areas of IR (Future)
• Electronic Commerce on the Web (Digital Library Online)
• Database Management
• Library and Information Science
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Machine Learning (ML)

Typical IR Task
• Given:
  – A corpus of textual natural-language documents.
  – A user query in the form of a textual string.
• Find:
  – A ranked set of documents that are relevant to the query.

Fig: Typical IR task. A document corpus and a query string are fed to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).
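
The typical IR task (corpus plus query string in, ranked documents out) can be sketched in a few lines of Python. This is only an illustration, not a method prescribed by these notes: the scoring rule (count how often the query words occur in each document), the `tokenize` and `rank` helper names, and the three toy documents are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(corpus, query):
    """Return (doc_id, score) pairs, best first, skipping zero-score documents.

    Score = total number of occurrences of the query words in the document.
    This is an illustrative choice, not a standard weighting scheme.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Sort by descending score; break ties by doc_id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Toy corpus invented for this example.
corpus = {
    "Doc1": "Information retrieval finds relevant documents in large collections.",
    "Doc2": "A database stores structured records and answers exact queries.",
    "Doc3": "Web search is retrieval over billions of web documents.",
}

ranking = rank(corpus, "information retrieval documents")
print(ranking)  # [('Doc1', 3), ('Doc3', 2)]
```

Doc2 shares no word with the query, so it is left out of the ranked list rather than returned with a score of zero.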

Relevance
• Relevance is a subjective judgment and may include:
  – Being on the proper subject.
  – Being timely (recent information).
  – Being authoritative (from a trusted source).
  – Satisfying the goals of the user and his/her intended use of the information (information need).
• Much of IR depends upon the idea that similar vocabulary -> relevant to the same queries.
• Usually we look for documents matching the query words.
• “Similar” can be measured in many ways:
  – String matching/comparison
  – Same vocabulary used
  – Probability that documents arise from the same model
  – Same meaning of text

Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  – “restaurant” vs. “café”
  – “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  – “bat” (baseball vs. mammal)
  – “Apple” (company vs. fruit)
  – “bit” (unit of data vs. act of eating)

Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.

IR Basic Concepts
• The User Task
  – Retrieval
    • information or data
    • purposeful
  – Browsing
    • glancing around
    • example: F1; cars, Le Mans, France tourism


Fig: Interaction of the user with the retrieval system through distinct tasks (retrieval and browsing over a database).

• Document representation is viewed as a continuum: the logical view of the documents might shift.
• The document set is reduced to a term index.
• Indexing can be automatic or done by a specialist.
• Full text: every occurrence of every word in the document.
• Selected keywords: obtained through stop-word removal and stemming.
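
The shift from full text to selected keywords can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the tiny stop list, the suffix list, and the `naive_stem` function, which is a deliberately crude stand-in for a real stemmer (such as Porter's algorithm) and will mis-stem many English words.

```python
import re

# Tiny illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters of stem.

    Crude placeholder for a real stemming algorithm.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Map full text to index terms: tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = index_terms("The students are indexing and searching the documents")
print(terms)  # ['student', 'index', 'search', 'document']
```

The full text has eight word occurrences; the logical view kept for indexing has four terms.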

Two main IR functions:
1. Indexing (system perspective)
   - Text processing
   - Index construction
2. Retrieval (user perspective)
   - User interface
   - Query processing
   - Searching from the index (index lookup)
   - Search result ranking

Fig: IR System: (1) Indexing

Fig: IR System: (2) Retrieval

IR System Architecture

Fig: The process of retrieving information. Boxes in the figure: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Text Database, and the index (inverted file). Labeled flows: text, user need, user feedback, query, logical view, retrieved docs, ranked docs.

IR System Components
• Text Operations form index words (tokens).
  – Stopword removal
  – Stemming
• Indexing constructs an inverted index of word-to-document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
  – Query input and document output.
  – Relevance feedback.
  – Visualization of results.
• Query Operations transform the query to improve retrieval:
  – Query expansion using a thesaurus.
  – Query transformation using relevance feedback.
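
The components above can be wired together in a small Python sketch. It is a toy under stated assumptions rather than a reference design: the function names, the stop list, the hand-made `THESAURUS`, the three documents, and the relevance metric (sum of term frequencies of the query tokens) are all invented for the example. Stemming is omitted to keep it short, and "cafe" is written without its accent because the simple tokenizer only handles a-z.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # toy list

# Hypothetical hand-made thesaurus used only for this example.
THESAURUS = {"restaurant": ["cafe"], "prc": ["china"]}


def text_operations(text):
    """Text operations: tokenize, lowercase, and remove stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Indexing: map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text_operations(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def expand_query(tokens, thesaurus):
    """Query operations: append thesaurus synonyms to the query tokens."""
    expanded = list(tokens)
    for token in tokens:
        for synonym in thesaurus.get(token, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


def search_and_rank(index, query, thesaurus=None):
    """Searching and ranking: look up each token, sum term frequencies per doc."""
    tokens = text_operations(query)
    if thesaurus:
        tokens = expand_query(tokens, thesaurus)
    scores = defaultdict(int)
    for token in tokens:
        for doc_id, tf in index.get(token, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by doc_id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "Doc1": "A cafe in the city centre",
    "Doc2": "The best restaurant in the city",
    "Doc3": "China is a large country",
}
index = build_inverted_index(docs)

plain = search_and_rank(index, "city restaurant")
expanded = search_and_rank(index, "city restaurant", THESAURUS)
print(plain)     # [('Doc2', 2), ('Doc1', 1)]
print(expanded)  # [('Doc1', 2), ('Doc2', 2)]
```

Without expansion, Doc1 matches only on "city". With the thesaurus, "restaurant" also pulls in "cafe", so Doc1 ties with Doc2, which is one way a system can soften the synonym problem of plain keyword search.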


References:
1. Modern Information Retrieval / by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar
3. Intelligent Information Retrieval and Web Search / Raymond Mooney, University of Texas at Austin
4. Introduction to Information Retrieval (IR) / T. Keerati Boonchote
