Information Retrieval - Introduction to Digital Libraries - Lecture Slides, Slides of Digital Communication Systems


Typology: Slides

2012/2013

Uploaded on 04/29/2013

awais



Information Retrieval

Introduction

 Information retrieval is the process of locating the most relevant information to satisfy a specific information need.

 Traditionally, we used databases and keywords to locate information.

 The most common modern application is search engines.

 Historically, the technology has been developed from the mid-50s onwards, with a lot of fundamental research conducted pre-Internet!

More Terminology

 Searching/Querying
   Retrieving all the possibly relevant results for a given query.

 Indexing
   Creating indices of all the documents/data to enable faster searching/querying.

 Ranked retrieval
   Retrieval of a set of matching documents in decreasing order of estimated relevance to the query.

Models for IR

 Boolean model
   Queries are specified as boolean expressions and only documents matching those criteria are returned.
   e.g., apples AND bananas

 Vector model
   Both queries and documents are specified as lists of terms and mapped into an n-dimensional space (where n is the number of possible terms). The relevance then depends on the angle between the vectors.
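The Boolean model above can be sketched in a few lines. This is a toy illustration, not part of the slides: the document contents and the `boolean_and` helper are hypothetical.

```python
# Boolean model sketch: each document is a set of terms, and a query
# like "apples AND bananas" matches documents containing every term.
docs = {
    "Doc1": {"apples", "bananas", "pears"},
    "Doc2": {"apples", "cherries"},
}

def boolean_and(query_terms, docs):
    """Return the ids of documents containing every query term (Boolean AND)."""
    return {d for d, terms in docs.items() if set(query_terms) <= terms}

matches = boolean_and(["apples", "bananas"], docs)  # only Doc1 has both terms
```

Note that the result is an unranked set: a document either satisfies the expression or it does not, which is exactly the limitation that ranked retrieval addresses.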

Extended Boolean Models

 Any modern search engine that returns no results for a very long query probably uses some form of boolean model!
   Altavista, Google, etc.
   Vector models are not as efficient as boolean models.

 Some extended boolean models filter on the basis of boolean matching and rank on the basis of term weights (tf.idf).

Filtering and Ranking

 Filtering
   Removal of non-relevant results.
   Filtering restricts the number of results to those that are probably relevant.

 Ranking
   Ordering of results according to calculated probability of relevance.
   Ranking puts the most probably relevant results at the "top of the list".

Inverted (Postings) Files

 An inverted file for a term contains a list of document identifiers that correspond to that term.

[Figure: two original documents containing occurrences of "apples" and "bananas", alongside the resulting inverted files, e.g. apples → Doc1: 1, Doc2: 4; bananas → Doc1: 3, Doc2: 1.]
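A minimal sketch of building an inverted file, assuming a toy corpus chosen to mirror the figure's counts (the corpus contents and `build_inverted_file` name are illustrative, not from the slides):

```python
from collections import Counter, defaultdict

# Toy corpus; per-document term counts stand in for the "Doc: count" entries.
corpus = {
    "Doc1": ["apples", "bananas", "bananas", "bananas"],
    "Doc2": ["apples", "apples", "apples", "apples", "bananas"],
}

def build_inverted_file(corpus):
    """Map each term to a list of (document id, in-document frequency) pairs."""
    index = defaultdict(list)
    for doc_id, terms in corpus.items():
        for term, count in Counter(terms).items():
            index[term].append((doc_id, count))
    return dict(index)

index = build_inverted_file(corpus)
# index["apples"]  -> [("Doc1", 1), ("Doc2", 4)]
# index["bananas"] -> [("Doc1", 3), ("Doc2", 1)]
```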

Implementation of Inverted Files

 Each term corresponds to a list of weighted document identifiers.
   Each term can be a separate file, sorted by weight.
   Terms, document identifiers and weights can be stored in an indexed database.

 Search engine indices can easily take 2 times as much space as the original data.
   The MG system (part of Greenstone) uses index compression and claims 1/3 as much space as the original data.

IF Optimisation Example

[Figure: an inverted file shown as a column of (Id, W) pairs, before and after the transformation described below.]

 Start from the original inverted file and sort on the W(eight) column.

 Subtract each weight from the previous value; the result is the transformed inverted file, and this is what is encoded and stored.

 To get the original data back: W[1] = W'[1], and W[i] = W[i-1] + W'[i].

 Note: we can do this with the Id column instead!
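The transform above is delta (gap) encoding. A small sketch, using hypothetical function names and an example weight column:

```python
def delta_encode(weights):
    """Replace each weight with its difference from the previous value.
    On a sorted column the deltas are small, which compresses well."""
    return [w - prev for prev, w in zip([0] + weights, weights)]

def delta_decode(deltas):
    """Recover the original column: W[1] = W'[1], W[i] = W[i-1] + W'[i]."""
    out, running = [], 0
    for d in deltas:
        running += d
        out.append(running)
    return out

weights = [1, 2, 3, 7]            # a weight column, already sorted
encoded = delta_encode(weights)   # [1, 1, 1, 4]
assert delta_decode(encoded) == weights
```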

Boolean Ranking

 Assume a document D and a query Q are both n-term vectors.

 Then the inner product is a measure of how well D matches Q:

  Similarity(D, Q) = Σ_{t=1..n} d_t · q_t

 Normalise so that long vectors do not adversely affect the ranking:

  Similarity(D, Q) = (1 / (|D| |Q|)) Σ_{t=1..n} d_t · q_t
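The inner product and its normalised form can be written directly from the formulas. A sketch with hypothetical vectors D and Q over the same term ordering:

```python
from math import sqrt

def inner_product(d, q):
    """Similarity(D, Q) = sum over terms t of d_t * q_t."""
    return sum(dt * qt for dt, qt in zip(d, q))

def normalised_similarity(d, q):
    """Divide by |D| |Q| so long vectors do not adversely affect the ranking."""
    norm = sqrt(sum(x * x for x in d)) * sqrt(sum(x * x for x in q))
    return inner_product(d, q) / norm if norm else 0.0

D = [1, 3, 0]  # hypothetical term weights for a document
Q = [1, 1, 0]  # query vector over the same terms
```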

tf.idf

 Term frequency (tf)
   The number of occurrences of a term in a document: terms which occur more often in a document have higher tf.

 Document frequency (df)
   The number of documents a term occurs in: popular terms have a higher df.

 In general, terms with high "tf" and low "df" are good at describing a document and discriminating it from other documents, hence tf.idf (term frequency * inverse document frequency).
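A minimal tf.idf sketch using the common tf * log(N / df) form (the toy corpus and the `tf_idf` helper are illustrative assumptions, not from the slides):

```python
from math import log

def tf_idf(term, doc_terms, corpus):
    """Weight a term in a document as tf * log(N / df)."""
    tf = doc_terms.count(term)                                 # term frequency
    df = sum(1 for terms in corpus.values() if term in terms)  # document frequency
    n = len(corpus)                                            # total documents
    return tf * log(n / df) if df else 0.0

corpus = {
    "Doc1": ["apples", "bananas", "bananas"],
    "Doc2": ["bananas"],
    "Doc3": ["cherries"],
}
# "apples" occurs in 1 of 3 documents, so per occurrence it discriminates
# Doc1 better than "bananas", which occurs in 2 of 3 documents.
```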

Inverse Document Frequency

 Common formulation:

  w_t = log_e(1 + N / f_t)

 Where f_t is the number of documents term t occurs in (document frequency) and N is the total number of documents.

 Many different formulae exist; all increase the importance of rare terms.

 Now, weight the query in the ranking formula to include an IDF with the TF:

  Similarity(D, Q) = (1 / (|D| |Q|)) Σ_{t=1..n} d_t · log_e(N / f_t) · q_t
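The w_t formulation above is one line of code. A sketch, with a hypothetical helper name and example numbers:

```python
from math import log

def idf_weight(n_docs, doc_freq):
    """w_t = log_e(1 + N / f_t): the rarer the term, the larger its weight."""
    return log(1 + n_docs / doc_freq)

# With N = 1000 documents, a term appearing in 10 of them is weighted
# far more heavily than one appearing in 500.
rare, common = idf_weight(1000, 10), idf_weight(1000, 500)
```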

Vector Ranking

 In n-dimensional Euclidean space, the angle between two vectors is given by:

  cos θ = (X · Y) / (|X| |Y|)

 Note:
   cos 90° = 0 (orthogonal vectors shouldn't match)
   cos 0° = 1 (corresponding vectors have a perfect match)

 Cosine θ is therefore a good measure of similarity of vectors.

 Substituting good tf and idf formulae in X · Y, we then get a similar formula to before (except we use all terms t[1..N]).
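The cosine measure follows directly from that identity. A small sketch (the `cosine` helper is a hypothetical name):

```python
from math import sqrt

def cosine(x, y):
    """cos(theta) = (X . Y) / (|X| |Y|): 1 for parallel vectors, 0 for orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Orthogonal vectors score 0; vectors pointing the same way score 1,
# regardless of their lengths.
```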

Term Document Space

 A popular view of inverted files is as a matrix of terms and documents:

            Doc1   Doc2
  Bananas     1      4
  Apples      3      1
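In this view, a row is a term's postings across all documents and a column is a document vector. A sketch using the example counts above (the dict layout is an illustrative choice):

```python
# Term-document matrix: rows are terms, columns are Doc1 and Doc2.
matrix = {
    "Bananas": [1, 4],
    "Apples": [3, 1],
}

# Reading a column top-to-bottom gives a document's term vector,
# ready for the inner-product and cosine measures above.
doc1_vector = [matrix[t][0] for t in ("Bananas", "Apples")]  # Doc1 column
doc2_vector = [matrix[t][1] for t in ("Bananas", "Apples")]  # Doc2 column
```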