CSE 435/535
Fall 2021 – Draft Syllabus
Information Retrieval
Reg # 18953/18954
Lecture: Monday, Wednesday 12:30 pm – 1:50 pm (Buffalo time)
Knox 110
Instructor: Rohini K. Srihari
Description:
This course will introduce students to text-based information retrieval (IR) techniques, i.e., search engines. The course begins with the fundamentals of processing large-scale, multilingual text document collections. Various IR models such as the Boolean model, vector space model, and probabilistic models will be studied. Efficient indexing techniques for (i) general document collections, (ii) specialized collections (e.g. Wikipedia, biomedical, patents) and (iii) high-velocity data such as social media will be discussed. Techniques for improving search efficiency and performance, as well as evaluation methodology, will be covered. The latter part of the course will focus on web search, including link analysis techniques such as PageRank and HITS. Word vectors (Word2vec, GloVe) generated through neural models, and their use in IR systems, will be introduced. Students will work on programming projects (implemented on the AWS cloud computing platform) to gain hands-on expertise in building IR systems. This course provides the foundation for the follow-on course (CSE 635), which discusses natural language processing (NLP) and deeper text mining solutions.
Prerequisites: Programming expertise (Java, Python); Linear Algebra
Textbook: Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schütze, Cambridge University Press (2008, online version 2012)
Note: an online version of this book is available at http://informationretrieval.org
Other, more recent reference material will be made available on the Piazza site during the semester.
Instructor: Rohini K. Srihari, Professor, Dept. of Computer Science & Eng
338D Davis Hall
TAs:
Sougata Saha
Souvik Das
TBD
Covid Protocol
1. All students must be vaccinated before coming to campus.
2. Masks must be worn at all times during class. Anyone not wearing a mask will be asked to leave.
Course Details:
1. You are expected to attend all lectures and to complete all readings on time. Recordings will be made available shortly after each live class concludes. The recordings are meant to serve as study aids, not as a substitute for attending class.
2. There will be 4 programming assignments in this course. The assignments cover the configuration of Solr for a particular search task, building of search indexes, evaluation of IR models, and a final (group) project requiring the development of a complete IR solution based on a real-world problem. All programming assignments will require the use of an AWS account; more information on this will be provided in class.
3. We will use Piazza for course-related discussion. Class notes will be posted there prior to class. Projects and announcements will also be posted on this site. Piazza should be used for Q&A related to the course, particularly the projects.
*** You should not post class materials (notes, exams, projects) on public sites: this would be a violation of Intellectual Property rights ***
4. Please read department policy on academic dishonesty; this will be enforced strictly.
COURSE SCHEDULE
Week and Date | Topics | Readings* | Key Activities
Week 1 | Introduction to IR; Conceptual Models of IR; Boolean Model | Chapters 1, 2 | Project 1 release; create Twitter and AWS accounts
Week 2, Sept 8 | Tokenization; Text analysis: stop lists, stemming; Dictionaries, Tolerant Retrieval | Chapter 3 | Recitation: Solr, AWS setup (hands-on)
Week 3 | Index Construction; Distributed Indexing and Search; Hadoop | Chapter 4 |
Week 4 | Text Properties: Heaps' and Zipf's Laws | Chapters 5, 6 | Project 1 due on Sept 19; Project 2 release
Week 5 | TF-IDF Weighting; Scoring and Ranking in IR Systems | Chapters 6, 7 |
Week 6 | Evaluation; Midterm 1 | Chapter 8, Handouts | Midterm 1
Week 7 | Relevance Feedback; Project 3 release | Chapter 9 | Project 2 due on Oct 14; Project 3 release
Week 8 | Probabilistic IR: Okapi (BM25), DFR, Language Models | Chapters 11, 12 |
Week 9 | Probabilistic IR (contd.); Text Classification | Chapters 13, 14 |
Week 10 | Web Search | Chapters 19, 20 | Project 3 due on Nov 5
Week 11 | Midterm 2; Social Network Analysis: Link Analysis, PageRank, HITS; Project 4 release | Chapter 21, Handouts | Midterm 2; Project 4 release
Week 12 | Word Vectors: Latent Semantic Indexing | Chapter 18 |
Week 13 | Using Word Embeddings in Search; Computational Advertising | Handouts |
Nov 24-27 | ***THANKSGIVING BREAK*** | |
Week 14, Nov 29, Dec 1 | E-commerce, Social Media Search; Knowledge Graphs | Handouts |
Week 15 | Student Project Presentations | | Project 4 due on Dec 10
*Chapters are from the Introduction to Information Retrieval textbook unless otherwise specified.