CS/EE 506/606: Information Retrieval

Synopsis

How many information retrieval systems have you used since waking up this morning? Probably more than you think. Information retrieval systems, including but not limited to web search engines, product recommender systems, library catalogues, and social media applications, represent vital tools for navigating our modern information ecosystem. The underlying algorithms and technologies that power these systems come from every corner of computer and information science, and have a rich and fascinating history.

In this course, we will study the art and science of information retrieval. We will cover a wide range of technical topics and applications of IR. Furthermore, because information is generally produced and consumed by humans, we will have a particular focus on issues surrounding human users of IR systems.

The course will not involve a final exam; however, it will involve a final project, which will include an in-class presentation as well as a formal written paper. There will also be several homework assignments, including a pilot study for the final project. Furthermore, students will be expected to present at least one paper in class as well as participate in class discussions. There will also be a considerable amount of reading assigned, which students will be expected to actually do.

Fig. 1: Vannevar Bush's Memex, a hypothetical electro-mechanical hypertext system described in 1945 and arguably the blueprint for modern information systems.

Learning objectives

By the end of the course, students will:

Possess a working knowledge of the fundamental models of information retrieval;
Be familiar with a variety of result weighting schemes (TF/IDF, etc.);
Be able to critically analyze a search user interface;
Be conversant with evaluation methodologies for IR, both system- and user-oriented.

Prerequisites

A working knowledge of programming is required for this course. While not strictly required, many parts of this course will be easier if you are familiar with standard mathematical concepts used in NLP: n-gram language models, Bayesian statistics, etc.

Instructor

Fig. 2: The instructor, next to a large sculpture of the majestic Crotaphytus bicinctores.

CS/EE 5/606 is being taught by Steven Bedrick. He can usually be found in his natural habitat, Gaines Hall room 19. While he has no set office hours, GH is far enough off the beaten path that you should probably schedule something with him before making the schlep.

We strongly encourage you to consult the Student Health Center for guidance about any pre-travel immunizations that may be required before visiting Gaines Hall.

Textbook

In addition to a large number of articles and book chapters, we will be using the following texts extensively:

Manning C, Raghavan P, and Schütze H. Introduction to Information Retrieval. Cambridge University Press, 2008. Available Online
Hearst M. Search User Interfaces. Cambridge University Press, 2009. Available Online

Schedule

Date	Topic	Reading	HW Assigned
Mar 31 (T)	Course Overview; Information Behavior	Hearst Ch. 3, Case Ch. 3, Belkin (1980), and Patterson (2001).	HW1
Apr 2 (Th)	IR Basics	Manning, et al. Ch. 1 and 2
Apr 7 (T)	IR Models: Boolean, Vector, Probabilistic	Manning, et al. Ch. 6; Zhai 2007 (only sections 1 and 2); Sparck Jones 1972 and Robertson 2004.	HW2
Apr 9 (Th)	Index Construction/Optimization/Compression	Manning, et al. Ch. 4 and 5
Apr 14 (T)	Experimental Evaluation	Manning et al. Ch. 8, Hearst Ch. 2, Cleverdon 1991, and Käki & Aula 2008.
Apr 16 (Th)	Web search, PageRank	Manning et al. Ch. 19 and 20, Leskovec Ch. 5, Kurland & Lee 2010, and Bing 2014.
	Note: The Manning textbook also has a good chapter on link analysis; it overlaps enough with the Leskovec chapter that I'm not assigning it as reading, but you might find it useful and/or easier to follow than the Leskovec chapter.
Apr 21 (T)	No Class — Steven in Bethesda
Apr 23 (Th)	Search UI/UX (Presenter: Joe Hamilton)	Hearst in Baeza-Yates & Ribeiro-Neto, Ch. 2; Hearst Ch. 1, 4 and 5, Wu et al. 2012, Clarke et al. 2007, and Guan 2007.
	There is some overlap in today's readings, but much less than it may initially appear.
Apr 28 (T)	Learning From User Behavior	Jiang 2013, Jones & Klinkner 2008, Agichtein et al. 2006, and Lagun et al. 2014.
Apr 30 (Th)	Relevance Feedback	Manning et al. Ch 9, Lee & Croft 2013, Caballero & Akella 2012
May 5 (T)	Query suggestion/reformulation (Presenter: Joseph Hackman)	Hearst Ch. 6, Huang & Efthimiadis 2009, Jain et al. 2011, Ozertem et al. 2012
May 7 (Th)	Machine Learning & Ranking (Presenter: Shiran)	Manning et al. Ch 15, Liu 2009 sections 1–5, Zhu et al. 2014
	In the Manning chapter, focus on the section on "Machine learning methods in ad hoc information retrieval" (15.4). Don't be put off by the length of the Liu paper: the page layout involves very large margins, and it's not as long as it looks!
May 12 (T)	Multimedia Retrieval (Presenter: Krystal)	Larson & Jones 2012 (Sections 1, 2, and 6), Mei et al. 2014, Kennedy & Naaman 2008, Apostolova 2013, Zhang et al. 2012	Note: The Larson and Mei articles are background; Krystal will be presenting the others
May 14 (Th)	Document clustering (Presenter: Joel)	Manning et al. Ch 17, Slaney 2008, Cohen 2010, Chappell 2013
May 19 (T)	Microblog search, Time & Space (Presenter: Meikun)	Teevan 2011 and Woodward 2015 (as background), Bennett 2011, Cheng 2014, Mishra 2014	Pilot project presentations!
May 21 (Th)	Cross-Language IR (Presenter: Allison)	Zhou et al. 2012 (as background); Oard 2008; Steichen et al. 2015; Nikoulina et al. 2012
May 26 (T)	Guest Lecture: Bill Hersh, MD (OHSU)	Hersh 2014, Lin 2008, Stanton 2014	NOTE: Class will start at 3:15 today
May 28 (Th)	Guest Lecture: Stephen Wu, PhD (Mayo Clinic, TrapIt)
June 2 (T)	No class: NAACL
June 4 (Th)	No class: NAACL
June 11 (Th)	Project presentations!
June 12 (F)	Project presentations!

Logistics

CS/EE 5/606 will be held Tuesdays and Thursdays, from 4:00 to 5:30 PM, in GH5.

Homework

We will have homework, (basically) all of which will involve programming. The point of the homework is to give you "hands on" experience with the algorithms and techniques we'll be covering, not to learn how to write production-ready code. For some of the assignments, I will provide "scaffolding" code that may save you significant time; mostly, this code will be written in either Python or Java. If you want to use something else to do the assignment, you are of course free to do so.

The homework assignments will all come with a "due date." Assignments will be due at 11:59 pm (Portland time) on their due date. If you think you will need additional time to complete an assignment, let me know as soon as possible. If something serious and unexpected comes up at the last minute (illness, family emergency, etc.), we'll work something out.

The deliverables for the final project are an in-class presentation and a short paper done in the style of a conference submission: a maximum of eight pages, not counting references. The writeup will be due on June 15. Unless otherwise agreed, this will be a hard deadline, as grades are due later that week.

Grading

Your grade will be based on three things: in-class participation (including paper presentations) (30%), homework (30%), and the final project (40%).
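
As a rough illustration of how these weights combine, here is a minimal Python sketch. The weights come from the sentence above; the function name `course_score`, the dictionary keys, and the assumption that each component is scored on a 0-100 scale are hypothetical and used only for illustration, not a statement of how grades are actually computed.

```python
# Weights as stated in the syllabus: participation 30%, homework 30%, project 40%.
# Everything else here (names, the 0-100 scale) is an illustrative assumption.
WEIGHTS = {"participation": 0.30, "homework": 0.30, "project": 0.40}


def course_score(scores):
    """Return the weighted average of the per-component scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical student: 90 for participation, 80 for homework, 85 for the project.
    print(round(course_score({"participation": 90, "homework": 80, "project": 85}), 2))
```

Because the final project carries the largest weight, a given change in the project score moves the overall score more than the same change in either of the other two components.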

Resources

Useful books

Baeza-Yates R, Ribeiro-Neto B. Modern Information Retrieval: The Concepts and Technology Behind Search, 2nd ed. Addison-Wesley, 2011. Some Chapters Available Online
Case, D. Looking for Information: A Survey of Research on Information Seeking, Needs and Behavior, 3rd ed. Emerald Group Publishing, 2012.
Grossman DA, Frieder O. Information Retrieval: Algorithms and Heuristics, 2nd ed. Springer Netherlands, 2004.
Hearst M. Search User Interfaces. Cambridge University Press, 2009. Available Online
Ingwersen P and Järvelin K. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, 2005.
Leskovec J, Rajaraman A, and Ullman J. Mining of Massive Data Sets, 2nd ed. Cambridge University Press, 2014. Available Online
Manning C, Raghavan P, and Schütze H. Introduction to Information Retrieval. Cambridge University Press, 2008. Available Online

Websites of note

We will be filling these in as we go along!

Open-source IR systems:
- Lucene
- ElasticSearch
- Lemur
  - Indri (Lemur's language-model based system)
- Terrier
- Whoosh

"The Information Palace"

Articles

Agichtein E, Brill E, Dumais S. Improving Web Search Ranking by Incorporating User Behavior Information. SIGIR '06. pp. 19–26.
Apostolova E, You D, Xue Z, Antani S, Demner-Fushman D, Thoma GR. Image retrieval from scientific publications: Text and image content processing to separate multipanel figures. J Am Soc Inf Sci. 2013;64(5):893–908.
Belkin NJ. Anomalous States of Knowledge as a Basis for Information Retrieval. Canadian journal of information and library science. 1980 Jan 1;5:133–43.
Bennett PN, Radlinski F, White RW, Yilmaz E. Inferring and Using Location Metadata to Personalize Web Search. Proceedings of SIGIR '11. pp. 135–44.
Bing L, Guo R, Lam W, Niu Z-Y, Wang H. Web Page Segmentation with Structured Prediction and Its Application in Web Page Classification. Proceedings of SIGIR '14. pp. 767–76.
Caballero KL, Akella R. Incorporating Statistical Topic Information in Relevance Feedback. Proceedings of SIGIR '12. pp. 1093–4.
Chappell T, Geva S, Nguyen A, Zuccon G. Efficient Top-k Retrieval with Signatures. Proceedings of the 18th Australasian Document Computing Symposium, 2013. pp. 10–17.
Cheng Z, Caverlee J, Barthwal H, Bachani V. Who is the Barbecue King of Texas?: A Geo-spatial Approach to Finding Local Experts on Twitter. Proceedings of SIGIR '14. pp. 335–44.
Clarke CLA, Agichtein E, Dumais S, White RW. The Influence of Caption Features on Clickthrough Patterns in Web Search. Proceedings of SIGIR '07, pp. 135–42.
Cleverdon CW. The significance of the Cranfield tests on index languages. Proceedings of SIGIR '91, pp. 3–12.
Cohen T, Schvaneveldt R, Widdows D. Reflective Random Indexing and indirect inference: A scalable method for discovery of implicit connections. J Biomed Inform. 2010;43(2):240–56.
Zhou D, Truran M, Brailsford T, Wade V, Ashman H. Translation techniques in cross-language information retrieval. Computing Surveys 2012 Nov;45(1):1–44.
Guan Z, Cutrell E. An Eye Tracking Study of the Effect of Target Rank on Web Search. CHI '07. pp. 417–20.
Hersh WR. Information Retrieval and Digital Libraries. In: Shortliffe EH, Cimino JJ, eds. Biomedical Informatics. Springer, 2014.
Huang J, Efthimiadis EN. Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs. Proceedings of CIKM '09. pp. 77–86.
Jain A, Ozertem U, Velipasaoglu E. Synthesizing High Utility Suggestions for Rare Web Search Queries. Proceedings of SIGIR '11. pp. 805–14.
Jiang D, Pei J, Li H. Mining Search and Browse Logs for Web Search: A Survey. ACM Trans Intell Syst Technol. 2013;4(4):57:1–57:37.
Jones R, Klinkner KL. Beyond the Session Timeout: Automatic Hierarchical Segmentation of Search Topics in Query Logs. Proceedings of CIKM '08. pp. 699–708.
Käki M, Aula A. Controlling the complexity in comparing search user interfaces via user studies. Information Processing and Management. 2008;44(1):82–91.
Kennedy LS, Naaman M. Generating Diverse and Representative Image Search Results for Landmarks. Proceedings of WWW '08, pp. 297–306.
Kurland O, Lee L. PageRank Without Hyperlinks: Structural Reranking Using Links Induced by Language Models. ACM Trans Inf Syst. 2010;28(4):18:1–18:38.
Lagun D, Hsieh C-H, Webster D, Navalpakkam V. Towards Better Measurement of Attention and Satisfaction in Mobile Search. Proceedings of SIGIR '14. pp. 113–22.
Larson M, Jones GJF. Spoken Content Retrieval: A Survey of Techniques and Technologies. Found Trends Inf Retr. 2012;5(4–5):235–422.
Lee KS, Croft WB. A deterministic resampling method using overlapping document clusters for pseudo-relevance feedback. Information Processing and Management. 2013;49(4):792–806.
Lin J, DiCuccio M, Grigoryan V, Wilbur W. Navigating information spaces: A case study of related article search in PubMed. Information Processing and Management. 2008;44(5):1771–83.
Liu T-Y. Learning to Rank for Information Retrieval. Found Trends Inf Retr. 2009;3(3):225–331.
Mei T, Rui Y, Li S, Tian Q. Multimedia Search Reranking: A Literature Survey. ACM Comput Surv. 2014;46(3):38:1–38:38.
Mishra N, White RW, Ieong S, Horvitz E. Time-critical Search. Proceedings of SIGIR '14. pp. 747–56.
Nikoulina V, Kovachev B, Lagos N, Monz C. Adaptation of statistical machine translation model for cross-lingual information retrieval in a service context. Association for Computational Linguistics; 2012.
Oard DW, He D, Wang J. User-assisted query translation for interactive cross-language information retrieval. Information Processing and Management. 2008;44:181–211.
Ozertem U, Chapelle O, Donmez P, Velipasaoglu E. Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions. Proceedings of SIGIR '12. pp. 25–34.
Patterson ES, Roth EM, Woods DD. Predicting vulnerabilities in computer-supported inferential analysis under data overload. Cognition, Technology & Work. 2001;3(4):224–37.
Robertson S. Understanding inverse document frequency: on theoretical arguments for IDF. Journal of Documentation. 2004;60(5):503–20.
Slaney M, Casey M. Locality-Sensitive Hashing for Finding Nearest Neighbors. Signal Processing Magazine, IEEE. 2008 Mar;25(2):128–31.
Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation. 1972;28(1):11–21.
Stanton I, Ieong S, Mishra N. Circumlocution in Diagnostic Medical Queries. Proceedings of SIGIR '14. pp. 133–42.
Steichen B, Freund L. Supporting the Modern Polyglot: A Comparison of Multilingual Search Interfaces. Proceedings of CHI 2015. pp. 3483–92.
Teevan J, Ramage D, Morris MR. #TwitterSearch: A Comparison of Microblog Search and Web Search. WSDM '11. pp. 35–44.
Woodward A, Kleppmann M. Real-time full-text search with Luwak and Samza. Presented at FOSDEM 2015. (Blog post, retrieved 5/14/2015)
Wu W-C, Kelly D, Huang K. User Evaluation of Query Quality. Proceedings of SIGIR '12. pp. 215–24.
Zhai C. Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval. 2007;2(3):137–213.
Zhang YC, Séaghdha DO, Quercia D, Jambor T. Auralist: Introducing Serendipity into Music Recommendation. Proceedings of ACM Web Search and Data Mining 2012. pp. 13–22.
Zhu Y, Lan Y, Guo J, Cheng X, Niu S. Learning for Search Result Diversification. Proceedings of SIGIR '14. pp. 293–302.

Student Access Statement

Our program is committed to all students achieving their potential. If you have a disability or think you may have a disability (physical, learning, hearing, vision, psychological) that may need a reasonable accommodation, please contact Student Access at (503) 494-0082 or e-mail studentaccess@ohsu.edu to discuss your needs. You can also find more information at www.ohsu.edu/student-access. Because accommodations can take time to implement, it is important to have this discussion as soon as possible. All information regarding a student’s disability is kept in accordance with relevant state and federal laws.

CS/EE 5/606: Information Retrieval

Spring 2015, Tuesdays & Thursdays at 2:15