IR – Introduction
Created by Chethan.M
ISiM Syllabus: MISM 623 – Information Retrieval Systems

Course Objectives
This course examines information retrieval within the context of full-text datasets. Students should be able to understand and critique existing information retrieval systems and to design and build information retrieval systems themselves. The course introduces traditional methods as well as recent advances in information retrieval (IR) and the handling and querying of textual data. The focus is on newer techniques for processing and retrieving textual information, including hypertext documents available on the World Wide Web.

Course Outline
Topics covered include:
• IR Models
  o Boolean Model
  o Vector Space Model
  o Relational DBMS
  o Probabilistic Models
  o Language Models
• Web Information Retrieval
  o Citation network analysis
  o Social collaboration (PageRank and HITS algorithms)
• Term Indexing
  o Zipf's Law
  o Term weighting
• Searching and Data Structures
  o Inverted files to support Boolean and Vector models
  o Clustering
    • Non-hierarchical (single pass, reallocation)
    • Hierarchical agglomerative
  o String searching
  o Tries, binary tries, binary digital tries, suffix trees, etc.
• Retrieval Effectiveness Evaluation
  o Recall, Precision, Fallout
  o Comparing systems using average precision

Course Readings:
1. Modern Information Retrieval / Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001.
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, 2008.
Example of IR: Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval.

What is Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Motivation: Information retrieval deals with the representation, storage, organization of, and access to information items. The representation and organization of the information items should provide the user with easy access to the information in which he is interested. Unfortunately, characterizing the user's information need is not a simple problem. Given the user query, the key goal of an IR system (search engine) is to retrieve information which might be useful or relevant to the user. The emphasis is on the retrieval of information as opposed to the retrieval of data.

Information Retrieval is…
• The indexing and retrieval of textual documents.
• Concerned firstly with retrieving documents relevant to a query.
• Concerned secondly with retrieving from large sets of documents efficiently.
• Selectivity: finding some desired information in a store of information (IR = selecting from a source).
• Related to literature searching (finding documents).

Information Retrieval System: "An information retrieval system is a device interposed between a potential user of information and the information collection itself. For a given information problem, the purpose of the system is to capture wanted items and to filter out unwanted items."

Information retrieval systems can be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. At the other extreme is personal information retrieval; in the last few years, consumer operating systems have integrated information retrieval (such as Apple's Mac OS X Spotlight or Windows Vista's Instant Search). In between is the space of enterprise, institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation's internal documents, a database of patents, or research articles on biochemistry.
Data Retrieval:
• Which documents contain a set of keywords?
• Well-defined semantics.
• A single erroneous object implies failure.

Information Retrieval:
• Information about a subject or topic.
• Semantics is frequently loose.
• Small errors are tolerated.
• Natural-language retrieval over unstructured data.
• Ranking and relevance.

IR System:
• Interprets the contents of information items.
• Generates a ranking which reflects relevance.
• The notion of relevance is most important.

Data Retrieval vs Information Retrieval
• What we're retrieving: Databases hold structured data with clear semantics based on a formal model; IR deals with mostly unstructured free text with some metadata.
• Queries we're posing: Database queries are formally defined and unambiguous; IR information needs are vague and imprecise, often expressed in natural language.
• Results we get: Database results are exact, always correct in a formal sense; IR results are sometimes relevant, often not.
• Interaction with the system: Databases support one-shot queries; in IR, interaction is important (relevance feedback).
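To make the contrast concrete, the following is a minimal Python sketch (the toy corpus, the helper names data_retrieval and information_retrieval, and the simple overlap score are all assumptions made for illustration, not part of the course material):

    # Illustrative sketch only: exact "data retrieval" semantics versus
    # tolerant, ranked "information retrieval" semantics on a toy corpus.
    docs = {
        "Doc1": "information retrieval systems rank documents by relevance",
        "Doc2": "database systems answer exact structured queries",
        "Doc3": "web search engines retrieve relevant documents for users",
    }

    def data_retrieval(keywords):
        """Well-defined semantics: return only documents containing ALL keywords.
        One missing keyword means the document is not returned at all."""
        return [name for name, text in docs.items()
                if all(k in text.split() for k in keywords)]

    def information_retrieval(keywords):
        """Loose semantics: rank every document by how many keywords it contains;
        partial matches are tolerated and ordered by score."""
        scored = {name: sum(k in text.split() for k in keywords)
                  for name, text in docs.items()}
        return sorted((n for n, s in scored.items() if s > 0),
                      key=lambda n: scored[n], reverse=True)

    print(data_retrieval(["relevant", "documents"]))         # -> ['Doc3']
    print(information_retrieval(["relevant", "documents"]))  # -> ['Doc3', 'Doc1']

Here "Doc1" is missing one keyword, so exact-match retrieval drops it entirely, while ranked retrieval still returns it below the better match.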
Text Database vs Database
1. Text database: emphasis on retrieval processing. Database: transaction processing.
2. Text database: no data updates. Database: data updates.
3. Text database: no data-integrity constraints. Database: data integrity.
4. Text database: no fixed data structure (e.g., a book or a web page). Database: structured records (e.g., a student record or registration data).
History (Past)
• 1960–70's:
  – Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.
  – Development of the basic Boolean and vector-space models of retrieval.
  – Prof. Salton and his students at Cornell University were the leading researchers in the area.
• 1980's:
  – Large document database systems, many run by companies:
    • Lexis-Nexis
    • Dialog
    • MEDLINE
• 1990's:
  – Searching FTPable documents on the Internet
    • Archie
    • WAIS
  – Searching the World Wide Web
    • Lycos
    • Yahoo
    • Altavista
• 1990's continued:
  – Organized Competitions
    • NIST TREC
  – Recommender Systems
    • Ringo
    • Amazon
    • NetPerceptions
  – Automated Text Categorization & Clustering
• 2000's:
  – Link analysis for Web Search
    • Google
  – Automated Information Extraction
    • Whizbang
    • Fetch
    • Burning Glass
  – Question Answering
    • TREC Q/A track
• 2000's continued:
  – Multimedia IR
    • Image
    • Video
    • Audio and music
  – Cross-Language IR
    • DARPA Tides
  – Document Summarization
Present
• Sources of data: electronic libraries, university documents, online data (web sites).
• Examples: AltaVista, Google, etc.

Past, Present and Future
• The library was the first organization for IR.
• Indexes assigned by academic and private institutions.
• Searching techniques (past, in the library):
  – Title, subject
  – Hierarchical search systems (e.g., Dewey Decimal), controlled vocabularies, collections of abstracts
• Searching techniques (present, in the library):
  – Department (of a faculty), term index
  – Developing formats in the user interface
  – Electronic services
  – Hypertext services

Related Research Areas of IR (Future)
• Electronic Commerce on the Web (Digital Libraries Online)
• Database Management
• Library and Information Science
• Artificial Intelligence (AI)
• Natural Language Processing (NLP)
• Machine Learning (ML)

Typical IR Task
• Given:
  – A corpus of textual natural-language documents.
  – A user query in the form of a textual string.
• Find:
  – A ranked set of documents that are relevant to the query.
Fig: The typical IR task – a query string and a document corpus are fed to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …).
Relevance
• Relevance is a subjective judgment and may include:
  – Being on the proper subject.
  – Being timely (recent information).
  – Being authoritative (from a trusted source).
  – Satisfying the goals of the user and his/her intended use of the information (information need).
• Much of IR depends upon the idea that
  – Similar vocabulary -> relevant to the same queries
• Usually look for documents matching query words
• "Similar" can be measured in many ways
  – String matching/comparison
  – Same vocabulary used
  – Probability that documents arise from the same model
  – Same meaning of text

Keyword Search
• The simplest notion of relevance is that the query string appears verbatim in the document.
• A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words); a small sketch of both notions appears after the Intelligent IR list below.

Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  – "restaurant" vs. "café"
  – "PRC" vs. "China"
• May retrieve irrelevant documents that include ambiguous terms.
  – "bat" (baseball vs. mammal)
  – "Apple" (company vs. fruit)
  – "bit" (unit of data vs. act of eating)

Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.
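The following is a rough Python sketch of the two notions of relevance above, using an invented example sentence; it also hints at the synonymy problem, since a query for "restaurant" scores nothing against a document that only mentions "café":

    from collections import Counter

    def verbatim_match(query, document):
        """Strict notion: the query string appears verbatim in the document."""
        return query.lower() in document.lower()

    def bag_of_words_score(query, document):
        """Looser notion: count how often the query words occur in the document,
        ignoring their order ("bag of words"). Tokenization here is a naive
        whitespace split, so punctuation is not stripped."""
        doc_counts = Counter(document.lower().split())
        return sum(doc_counts[word] for word in query.lower().split())

    doc = "The new café opened; the restaurant across the street closed."
    print(verbatim_match("restaurant across the street", doc))   # True
    print(verbatim_match("café restaurant", doc))                # False (not verbatim)
    print(bag_of_words_score("café restaurant", doc))            # 2 (order ignored)
    print(bag_of_words_score("restaurant", "We ate at a nice café"))  # 0 (synonym missed)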
IR Basic Concepts
• The User Task
  – Retrieval
    • information or data
    • purposeful
  – Browsing
    • glancing around
    • e.g., F1; cars; Le Mans; France tourism
Fig: Interaction of the user with the retrieval system through distinct tasks (retrieval and browsing over the database).
• Document representation viewed as a continuum: the logical view of documents might shift from the full document set (full text) to a set of index terms.
• Indexing may be automatic or done by a specialist.
• Full text: all word occurrences in the document; index terms: selected keywords.
• Stopword removal
• Stemming
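Below is a small Python sketch of these text operations; the stopword list is invented and the suffix-stripping stemmer is deliberately naive (a real system would use something like the Porter stemmer), so treat it as an illustration rather than a reference implementation:

    import re

    # Sketch of the text operations that turn full text into index terms:
    # tokenization, stopword removal, and (very naive) suffix-stripping stemming.
    STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}

    def tokenize(text):
        """Lowercase and split on non-letters (simplistic tokenizer)."""
        return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

    def stem(word):
        """Toy stemmer: strip a few common suffixes. Unlike Porter's algorithm
        it has no context rules, so it will over- and under-stem."""
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def index_terms(text):
        """Full text -> selected index terms (tokens minus stopwords, stemmed)."""
        return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

    print(index_terms("The indexing and retrieval of textual documents"))
    # -> ['index', 'retrieval', 'textual', 'document']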
Two Main IR Functions:
1. Indexing (system perspective)
   - Text processing
   - Index construction
2. Retrieval (user perspective)
   - User interface
   - Query processing
   - Searching from the index (index lookup)
   - Search result ranking

Fig: IR System – (1) Indexing.
Fig: IR System – (2) Retrieval.
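As a compact illustration of the two functions (not the design of any particular system), the Python sketch below builds an inverted index from a toy document set and then answers a query by index lookup and a simple term-overlap ranking; the document texts and function names are assumptions made for the example:

    from collections import defaultdict

    # Indexing (system perspective): build an inverted index,
    # term -> set of document ids containing that term.
    def build_inverted_index(docs):
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():      # text processing kept trivial here
                index[term].add(doc_id)
        return index

    # Retrieval (user perspective): query processing, index lookup, result ranking.
    def retrieve(index, query):
        scores = defaultdict(int)
        for term in query.lower().split():         # query processing
            for doc_id in index.get(term, set()):  # index lookup
                scores[doc_id] += 1                # ranking by number of matching terms
        return sorted(scores, key=scores.get, reverse=True)

    docs = {
        "Doc1": "boolean and vector space retrieval models",
        "Doc2": "inverted files support boolean retrieval",
        "Doc3": "clustering of web documents",
    }
    index = build_inverted_index(docs)
    print(retrieve(index, "boolean retrieval models"))  # -> ['Doc1', 'Doc2']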
IR System Architecture
Fig: The process of retrieving information – the user's need and feedback enter through the User Interface; Text Operations produce the logical view of the text, Indexing builds the Inverted Index File over the Text Database (via the Database Manager), Query Operations transform the query, Searching retrieves documents from the index, and Ranking turns the Retrieved Docs into Ranked Docs returned to the user.

IR System Components
• Text Operations forms index words (tokens).
  – Stopword removal
  – Stemming
• Indexing constructs an inverted index of word-to-document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
  – Query input and document output.
  – Relevance feedback.
  – Visualization of results.
• Query Operations transform the query to improve retrieval:
  – Query expansion using a thesaurus.
  – Query transformation using relevance feedback.
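To illustrate the Ranking component, here is a small Python sketch that scores documents against a query using TF-IDF weights and cosine similarity, one common relevance metric from the vector space model; the toy corpus and the exact weighting variant are assumptions, not something these notes prescribe:

    import math
    from collections import Counter

    docs = {
        "Doc1": "information retrieval ranks documents by relevance",
        "Doc2": "relevance feedback improves query operations",
        "Doc3": "the inverted index maps terms to documents",
    }

    def tfidf_vector(text, df, n_docs):
        """TF-IDF weights: term frequency times log(N / document frequency)."""
        tf = Counter(text.lower().split())
        return {t: f * math.log(n_docs / df[t]) for t, f in tf.items() if t in df}

    def cosine(u, v):
        """Cosine similarity between two sparse term-weight vectors."""
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    # Document frequencies over the corpus, then one TF-IDF vector per document.
    df = Counter(t for text in docs.values() for t in set(text.lower().split()))
    doc_vecs = {d: tfidf_vector(text, df, len(docs)) for d, text in docs.items()}

    def rank(query):
        q_vec = tfidf_vector(query, df, len(docs))
        return sorted(docs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)

    print(rank("relevance of documents"))  # -> ['Doc1', 'Doc2', 'Doc3']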
References:
1. Modern Information Retrieval / Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2001.
2. Introduction to Information Retrieval / Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, 2008.
3. Intelligent Information Retrieval and Web Search / Raymond Mooney, University of Texas at Austin.
4. Introduction to Information Retrieval (IR) / T. Keerati Boonchote.