CSE 435/535 Fall 2021 – Draft Syllabus

Information Retrieval

Reg # 18953/18954

Lecture: Monday, Wednesday 12:30 pm – 1:50 pm (Buffalo time)

Knox 110

Instructor:  Rohini K. Srihari

 

Description:

 

This course introduces students to text-based information retrieval (IR) techniques, i.e., search engines. The course begins with the fundamentals of processing large-scale, multilingual text document collections. Various IR models, such as the Boolean model, the vector space model, and probabilistic models, will be studied. Efficient indexing techniques will be discussed for (i) general document collections, (ii) specialized collections (e.g., Wikipedia, biomedical, patents), and (iii) high-velocity data such as social media. Techniques for improving search efficiency and performance, as well as evaluation methodology, will be covered. The latter part of the course focuses on web search, including link analysis techniques such as PageRank and HITS. Word vectors (Word2vec, GloVe) generated through neural models, and their use in IR systems, will be introduced. Students will work on programming projects (implemented on the AWS cloud computing platform) to gain hands-on expertise in building IR systems. This course provides the foundation for the follow-on course (CSE 635), which covers natural language processing (NLP) and deeper text mining solutions.

 

Prerequisites: Programming expertise (Java, Python), Linear Algebra

 

Textbook: Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schütze, Cambridge University Press (2008, online version 2012)

Note: an online version of this book is available at http://informationretrieval.org

Other, more recent reference material will be made available on the Piazza site during the semester.

 

Instructor: Rohini K. Srihari, Professor, Dept. of Computer Science & Eng

338D Davis Hall

 

TAs:
Sougata Saha

Souvik Das

TBD

 

COVID Protocol:

 

1.     All students must be vaccinated before coming to campus.

2.     Masks must be worn at all times during class.  Anyone not wearing a mask will be asked to leave.

Course Details:

 

1.     You are expected to attend all lectures and to complete all readings on time.  Recordings will be made available shortly after each live class concludes.  The recordings are meant to serve as study aids, not as a substitute for attending class.

 

2.     There will be 4 programming assignments in this course. The assignments cover the configuration of Solr for a particular search task, building of search indexes, evaluation of IR models, and a final (group) project requiring the development of a complete IR solution based on a real-world problem. All programming assignments will require the use of an AWS account; more information on this will be provided in class.

 

3.  We will use Piazza for course-related discussion. Class notes will be posted there prior to class. Projects and announcements will also be posted on this site. Piazza should be used for Q&A related to the course, and particularly to the projects.

*** You should not post class materials (notes, exams, projects) on public sites: this would be a violation of intellectual property rights. ***

 

4.  Please read the department policy on academic dishonesty; it will be strictly enforced.

 


 

 

COURSE SCHEDULE

 

 

Week and Date | Topics | Readings* | Key Activities
Week 1 (Aug 30, Sept 1) | Introduction to IR; Conceptual Models of IR; Boolean Model | Chapters 1, 2 | Project 1 release; create Twitter and AWS accounts
Week 2 (Sept 6 holiday, Sept 8) | Tokenization; Text analysis: stop lists, stemming; Dictionaries, Tolerant Retrieval | Chapter 3, Supplements | Recitation: Solr and AWS setup (hands-on)
Week 3 (Sept 13, 15) | Index Construction; Distributed Indexing and Search, Hadoop | Chapter 4, Supplements |
Week 4 (Sept 20, 22) | Text Properties: Heaps' and Zipf's Laws; Index Compression; Vector-Space Model | Chapters 5, 6 | Project 1 due on Sept 19; Project 2 release
Week 5 (Sept 27, 29) | TF-IDF Weighting; Scoring and Ranking in IR Systems | Chapters 6, 7 |
Week 6 (Oct 4, 6) | Evaluation; Machine Learned Ranking | Chapter 8, Handouts | Midterm 1
Week 7 (Oct 11, 13) | Relevance Feedback; Query Expansion: Local and Global | Chapter 9 | Project 2 due on Oct 14; Project 3 release
Week 8 (Oct 18, 20) | Probabilistic IR: Okapi BM25, DFR, Language Models | Chapters 11, 12 |
Week 9 (Oct 25, 27) | Probabilistic IR, continued; Text Classification | Chapters 13, 14 |
Week 10 (Nov 1, 3) | Web Search; Web Crawling | Chapters 19, 20 | Project 3 due on Nov 5
Week 11 (Nov 8, 10) | Social Network Analysis: Link Analysis, PageRank, HITS | Chapter 21, Handouts | Midterm 2; Project 4 release
Week 12 (Nov 15, 17) | Word Vectors: Latent Semantic Indexing, Word2Vec, GloVe, Doc2Vec | Chapter 18, Handouts |
Week 13 (Nov 22) | Using Word Embeddings in Search; Computational Advertising | Handouts |
Nov 24-27 | ***THANKSGIVING BREAK*** | |
Week 14 (Nov 29, Dec 1) | E-commerce and Social Media Search; Knowledge Graphs | Handouts |
Week 15 (Dec 6, 8) | Student Project Presentations | | Project 4 due on Dec 10

 

*Chapters are from the Introduction to Information Retrieval textbook unless otherwise specified.