Leveraging "The Wisdom of the Crowds" for Efficient Tagging and Retrieval of Documents from the Historic Newspaper Archive of the New York Public Library
Funding for this work is provided by the National Endowment for the Humanities,
NEH HD-51153-10. The project has been designated a "WE THE PEOPLE" project.
- Barbara Taranto, New York Public Library
Computers may have defeated humans at chess and arithmetic, but there are many areas where the human mind still excels, such as visual cognition and language processing (Comm. of ACM, Vol 52, No 3, March ’09). If one mind is good, it has been argued that several minds working together are likely to outperform individuals, and even experts, at certain tasks. This project aims to leverage the wisdom of the crowds (von Ahn, 2008) to collaboratively tag historical newspaper articles in the holdings of the New York Public Library (NYPL). Patrons and scholars will be encouraged to generate custom tags for articles they read and use often; these tags will be integrated into a metadata library and evaluated for their contribution to improving retrieval performance. The text of the newspaper articles, along with the user-generated tags, will be subjected to statistical analysis and machine learning for automatic categorization. The creation and analysis of this corpus is likely to enable advanced search mechanisms on these holdings, making them more useful to the general public.
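As a rough illustration of how user-generated tags might contribute to retrieval, the sketch below merges tags into each article's term counts and ranks articles by cosine similarity to a query. It is not the project's actual system: the function names, the `tag_weight` parameter, and the sample articles are all hypothetical and chosen only for the example.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Lowercase, split on whitespace, and strip basic punctuation.
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def build_vector(article_text, tags, tag_weight=2):
    # Term counts from the article body; each tag token is added
    # tag_weight times (an assumed weighting, not taken from the project).
    counts = Counter(tokenize(article_text))
    for tag in tags:
        for token in tokenize(tag):
            counts[token] += tag_weight
    return counts

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, articles):
    # Return ids of articles with non-zero similarity, best match first.
    q = Counter(tokenize(query))
    scored = [(cosine(q, build_vector(a["text"], a["tags"])), a["id"])
              for a in articles]
    return [aid for score, aid in sorted(scored, reverse=True) if score > 0]

# Hypothetical sample data, invented for illustration only.
ARTICLES = [
    {"id": "a1",
     "text": "The mayor opened the new bridge on Tuesday.",
     "tags": ["infrastructure", "city government"]},
    {"id": "a2",
     "text": "Prices of wheat rose sharply at the exchange.",
     "tags": ["commodities"]},
]

if __name__ == "__main__":
    # "infrastructure" never appears in the article text, only in a tag.
    print(search("infrastructure", ARTICLES))
```

In this toy setup, the query "infrastructure" matches article `a1` only because a reader tagged it, which is the kind of retrieval gain the tags are meant to be evaluated for.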
BODHI Project Details
- Haimonti Dutta, Rebecca J. Passonneau, Austin Lee, Axinia Radeva, Boyi Xie, David Waltz and Barbara Taranto, "Learning Parameters of the K-Means Algorithm from Subjective Human Annotation", The 24th International FLAIRS Conference, Special Track on Data Mining, Palm Beach, FL, May 18-20, 2011.
- Austin Lee, Haimonti Dutta, Rebecca Passonneau, David Waltz and Barbara Taranto, "Topic Identification from Historic Newspaper Articles of the New York Public Library: A Case Study", 5th Annual Machine Learning Symposium, NYAS, 2010.
System Architecture Diagram
Screenshots of the OCR corrector