Information Retrieval - Introduction to Digital Libraries - Lecture Slides, Slides of Digital Communication Systems


Typology: Slides

2012/2013

Uploaded on 04/29/2013

awais



Information Retrieval

Introduction

 Information retrieval is the process of locating the most relevant information to satisfy a specific information need.

 Traditionally, we used databases and keywords to locate information.

 The most common modern application is search engines.

 Historically, the technology has been developed from the mid-50s onwards, with a lot of fundamental research conducted pre-Internet!

More Terminology

 Searching/Querying
   Retrieving all the possibly relevant results for a given query.

 Indexing
   Creating indices of all the documents/data to enable faster searching/querying.

 Ranked retrieval
   Retrieval of a set of matching documents in decreasing order of estimated relevance to the query.

Models for IR

 Boolean model
   Queries are specified as boolean expressions and only documents matching those criteria are returned.
   e.g., apples AND bananas

 Vector model
   Both queries and documents are specified as lists of terms and mapped into an n-dimensional space (where n is the number of possible terms). The relevance then depends on the angle between the vectors.
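The Boolean model above can be sketched in a few lines. This is a toy illustration, not part of the slides: the document contents and the `boolean_and` helper are hypothetical.

```python
# Boolean model sketch: each document is a set of terms, and a query
# like "apples AND bananas" matches documents containing every term.
docs = {
    "Doc1": {"apples", "bananas", "pears"},
    "Doc2": {"apples", "cherries"},
}

def boolean_and(query_terms, docs):
    """Return the ids of documents containing every query term (Boolean AND)."""
    return {d for d, terms in docs.items() if set(query_terms) <= terms}

matches = boolean_and(["apples", "bananas"], docs)  # only Doc1 has both terms
```

Note that the result is an unranked set: a document either satisfies the expression or it does not, which is exactly the limitation that ranked retrieval addresses.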

Extended Boolean Models

 Any modern search engine that returns no results for a very long query probably uses some form of boolean model!
   Altavista, Google, etc.
   Vector models are not as efficient as boolean models.

 Some extended boolean models filter on the basis of boolean matching and rank on the basis of term weights (tf.idf).

Filtering and Ranking

 Filtering
   Removal of non-relevant results.
   Filtering restricts the number of results to those that are probably relevant.

 Ranking
   Ordering of results according to calculated probability of relevance.
   Ranking puts the most probably relevant results at the "top of the list".

Inverted (Postings) Files

 An inverted file for a term contains a list of document identifiers that correspond to that term.

[Figure: two original documents containing occurrences of "apples" and "bananas", alongside the resulting inverted files, e.g. apples → Doc1: 1, Doc2: 4; bananas → Doc1: 3, Doc2: 1.]
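A minimal sketch of building an inverted file, assuming a toy corpus chosen to mirror the figure's counts (the corpus contents and `build_inverted_file` name are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

# Toy corpus; per-document term counts stand in for the "Doc: count" entries.
corpus = {
    "Doc1": ["apples", "bananas", "bananas", "bananas"],
    "Doc2": ["apples", "apples", "apples", "apples", "bananas"],
}

def build_inverted_file(corpus):
    """Map each term to a list of (document id, in-document frequency) pairs."""
    index = defaultdict(list)
    for doc_id, terms in corpus.items():
        for term, count in Counter(terms).items():
            index[term].append((doc_id, count))
    return dict(index)

index = build_inverted_file(corpus)
# index["apples"]  -> [("Doc1", 1), ("Doc2", 4)]
# index["bananas"] -> [("Doc1", 3), ("Doc2", 1)]
```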

Implementation of Inverted Files

 Each term corresponds to a list of weighted document identifiers.
   Each term can be a separate file, sorted by weight.
   Terms, document identifiers and weights can be stored in an indexed database.

 Search engine indices can easily take 2 times as much space as the original data.
   The MG system (part of Greenstone) uses index compression and claims 1/3 as much space as the original data.

IF Optimisation Example

[Figure: an inverted file shown as a column of (Id, W) pairs, before and after the transformation described below.]

 Start from the original inverted file and sort on the W(eight) column.

 Subtract each weight from the previous value; the result is the transformed inverted file, and this is what is encoded and stored.

 To get the original data back: W[1] = W'[1], and W[i] = W[i-1] + W'[i].

 Note: we can do this with the Id column instead!
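The transform above is delta (gap) encoding. A small sketch, using hypothetical function names and an example weight column:

```python
def delta_encode(weights):
    """Replace each weight with its difference from the previous value.
    On a sorted column the deltas are small, which compresses well."""
    return [w - prev for prev, w in zip([0] + weights, weights)]

def delta_decode(deltas):
    """Recover the original column: W[1] = W'[1], W[i] = W[i-1] + W'[i]."""
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out

weights = [1, 2, 3, 7]            # a weight column, already sorted
encoded = delta_encode(weights)   # [1, 1, 1, 4]
assert delta_decode(encoded) == weights
```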

Boolean Ranking

 Assume a document D and a query Q are both n-term vectors.

 Then the inner product is a measure of how well D matches Q:

  Similarity(D, Q) = Σ_{t=1..n} d_t · q_t

 Normalise so that long vectors do not adversely affect the ranking:

  Similarity(D, Q) = (1 / (|D| |Q|)) Σ_{t=1..n} d_t · q_t
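The inner product and its normalised form can be written directly from the formulas. A sketch with hypothetical vectors D and Q over the same term ordering:

```python
from math import sqrt

def inner_product(d, q):
    """Similarity(D, Q) = sum over terms t of d_t * q_t."""
    return sum(dt * qt for dt, qt in zip(d, q))

def normalised_similarity(d, q):
    """Divide by |D| |Q| so long vectors do not adversely affect the ranking."""
    norm = sqrt(sum(x * x for x in d)) * sqrt(sum(x * x for x in q))
    return inner_product(d, q) / norm if norm else 0.0

D = [1, 3, 0]  # hypothetical term weights for a document
Q = [1, 1, 0]  # query vector over the same terms
```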

tf.idf

 Term frequency (tf)
   The number of occurrences of a term in a document: terms which occur more often in a document have higher tf.

 Document frequency (df)
   The number of documents a term occurs in: popular terms have a higher df.

 In general, terms with high "tf" and low "df" are good at describing a document and discriminating it from other documents, hence tf.idf (term frequency * inverse document frequency).
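A minimal tf.idf sketch using the common tf * log(N / df) form (the toy corpus and the `tf_idf` helper are illustrative assumptions, not from the slides):

```python
from math import log

def tf_idf(term, doc_terms, corpus):
    """Weight a term in a document as tf * log(N / df)."""
    tf = doc_terms.count(term)                                 # term frequency
    df = sum(1 for terms in corpus.values() if term in terms)  # document frequency
    n = len(corpus)                                            # total documents
    return tf * log(n / df) if df else 0.0

corpus = {
    "Doc1": ["apples", "bananas", "bananas"],
    "Doc2": ["bananas"],
    "Doc3": ["cherries"],
}
# "apples" occurs in 1 of 3 documents, so per occurrence it discriminates
# Doc1 better than "bananas", which occurs in 2 of 3 documents.
```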

Inverse Document Frequency

 Common formulation:

  w_t = log_e(1 + N / f_t)

 Where f_t is the number of documents term t occurs in (document frequency) and N is the total number of documents.

 Many different formulae exist; all increase the importance of rare terms.

 Now, weight the query in the ranking formula to include an IDF with the TF:

  Similarity(D, Q) = (1 / (|D| |Q|)) Σ_{t=1..n} d_t · log_e(N / f_t) · q_t
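The w_t formulation above is one line of code. A sketch, with a hypothetical helper name and example numbers:

```python
from math import log

def idf_weight(n_docs, doc_freq):
    """w_t = log_e(1 + N / f_t): the rarer the term, the larger its weight."""
    return log(1 + n_docs / doc_freq)

# With N = 1000 documents, a term appearing in 10 of them is weighted
# far more heavily than one appearing in 500.
rare, common = idf_weight(1000, 10), idf_weight(1000, 500)
```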

Vector Ranking

 In n-dimensional Euclidean space, the angle between two vectors is given by:

  cos θ = (X · Y) / (|X| |Y|)

 Note:
   cos 90° = 0 (orthogonal vectors shouldn't match)
   cos 0° = 1 (corresponding vectors have a perfect match)

 Cosine θ is therefore a good measure of similarity of vectors.

 Substituting good tf and idf formulae in X · Y, we then get a similar formula to before (except we use all terms t[1..N]).
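The cosine measure follows directly from that identity. A small sketch (the `cosine` helper is a hypothetical name):

```python
from math import sqrt

def cosine(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|): 1 for parallel vectors, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Orthogonal vectors score 0; vectors pointing the same way score 1,
# regardless of their lengths.
```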

Term Document Space

 A popular view of inverted files is as a matrix of terms and documents:

            Doc1   Doc2
  Bananas     1      4
  Apples      3      1
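In this view, a row is a term's postings across all documents and a column is a document vector. A sketch using the example counts above (the dict layout is an illustrative choice):

```python
# Term-document matrix: rows are terms, columns are Doc1 and Doc2.
matrix = {
    "Bananas": [1, 4],
    "Apples": [3, 1],
}

# Reading a column top-to-bottom gives a document's term vector,
# ready for the inner-product and cosine measures above.
doc1_vector = [matrix[t][0] for t in ("Bananas", "Apples")]  # Doc1 column
doc2_vector = [matrix[t][1] for t in ("Bananas", "Apples")]  # Doc2 column
```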