Previous Research Projects
Center for Computational Learning Systems (CCLS), Columbia University
Machine Learning on the Historic Newspaper Archive of the NYPL

This research is funded by the National Endowment for the Humanities, and the project is designated a "We the People" project.

Computers may have defeated humans in chess and arithmetic, but there are many areas where the human mind still excels, such as visual cognition and language processing (Comm. of ACM, Vol 52, No 3, March ’09). If one mind is good, it has been argued that several minds working together can outperform individuals, and even experts, on certain tasks. This project aims to leverage the wisdom of crowds (von Ahn, 2008) to collaboratively tag historical newspaper articles in the holdings of the New York Public Library (NYPL). Patrons and scholars will be encouraged to generate custom tags for articles they read and use often; these tags will be integrated into a metadata library and evaluated for how much they improve document retrieval. Novel machine learning algorithms will be designed for automatic categorization of newspaper articles. The creation and analysis of this corpus will enable advanced search mechanisms on these holdings, making them more useful to the general public.


Related Projects: Chronicling America, National Digital Newspaper Program

Related Publications
  • Haimonti Dutta and William Chan, "Using Community Structure Detection to Rank Annotators when Ground Truth is subjective", NIPS Workshop on Human Computation in Science and Computational Sustainability, Lake Tahoe, Dec 7-8, 2012.
  • Haimonti Dutta, Rebecca J. Passonneau, Austin Lee, Axinia Radeva, Boyi Xie, David Waltz and Barbara Taranto, "Learning Parameters of the K-Means Algorithm from Subjective Human Annotation", The 24th International FLAIRS Conference, Special Track on Data Mining, Palm Beach, FL, May 18-20, 2011.
  • Austin Lee, Haimonti Dutta, Rebecca Passonneau, David Waltz and Barbara Taranto, "Topic Identification from Historic Newspaper Articles of the New York Public Library: A Case Study", 5th Annual Machine Learning Symposium, NYAS, 2010.
Smart Grid Related Projects

This work is sponsored by the Consolidated Edison Company of New York.
  • Estimating the Mean Time Between Failures (MTBF) for Electrical Feeders in the Primary Distribution System - Development of regression models such as Classification and Regression Trees (CART) and Support Vector Regression (SVR). The work is particularly challenging because feeders behave differently from season to season, depending on temperature and load fluctuations in the system.
  • Ranking Electrical Feeders according to their Susceptibility to Failure - The goal of this project is to rank feeder cables according to their susceptibility to failure. We have been experimenting with different machine learning techniques such as RankBoost, Martingale Ranking and scores obtained from Support Vector Machines. In addition, we are interested in analyzing features to find out what actually causes the feeders to fail. This involves investigating univariate and multivariate feature extraction techniques.
  • Predicting Manhole Events - The goal of this project was to rank structures (manholes, service boxes) according to their susceptibility to explosions, fires and other serious events.
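As a point of reference for the MTBF and ranking tasks above, the sketch below computes an empirical MTBF per feeder (operating hours divided by observed failures) and sorts feeders from most to least failure-prone. It is only a naive baseline for illustration, not the CART, SVR, or RankBoost models used in these projects; the `Feeder` record, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feeder:
    # Hypothetical record; real feeder data would carry many more attributes.
    name: str
    hours_in_service: float
    failures: int


def mtbf(feeder: Feeder) -> float:
    """Empirical MTBF: operating hours divided by the number of failures."""
    if feeder.failures == 0:
        # No observed failures: treat the MTBF as unbounded.
        return float("inf")
    return feeder.hours_in_service / feeder.failures


def rank_by_susceptibility(feeders: list[Feeder]) -> list[Feeder]:
    """Order feeders from most to least failure-prone (lowest MTBF first)."""
    return sorted(feeders, key=mtbf)


# Made-up sample: one year (8760 hours) of service for three feeders.
feeders = [
    Feeder("A", 8760.0, 4),
    Feeder("B", 8760.0, 1),
    Feeder("C", 8760.0, 0),
]
ranking = [f.name for f in rank_by_susceptibility(feeders)]
```

A learned model would replace `mtbf` as the sort key with a predicted score that accounts for seasonal temperature and load effects.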
Related Publications
  • Rebecca J. Passonneau, Ashish Tomar, Somnath Sarkar, Haimonti Dutta and Axinia Radeva, "Multivariate Assessment of a Repair Program for a New York City Electrical Grid", 11th International Conference on Machine Learning and Applications ICMLA, Special Session on Machine Learning in Energy Applications, Boca Raton, FL, Dec 13 - 15, 2012.
  • Boyi Xie, Rebecca J. Passonneau, Haimonti Dutta, Jing-Yeu Miaw, Axinia Radeva, Ashish Tomar and Cynthia Rudin. "Progressive Clustering with Learned Seeds: An Event Categorization System for Power Grid." 24th International Conference on Software Engineering and Knowledge Engineering (SEKE 2012). Redwood City, CA. July 1-3, 2012.
  • Phil Gross, Ansaf Salleb-Aouissi, Haimonti Dutta and Albert Boulanger, "Ranking Electrical Feeders of the New York Power Grid", 3rd Annual Machine Learning Symposium at the New York Academy of Sciences (NYAS), New York, October, 2008.
  • Haoyun Feng, Haimonti Dutta and Ansaf Salleb-Aouissi, "On Improving Probability Estimate Trees", 3rd Annual Workshop for Women in Machine Learning (WiML) held in conjunction with Neural Information Processing Systems (NIPS), Vancouver, B.C., 2008.
  • Phil Gross, Ansaf Salleb-Aouissi, Haimonti Dutta and Albert Boulanger, "Susceptibility Ranking of Electrical Feeders: A Case Study", Technical Report, CCLS-08-04.
  • Cynthia Rudin, Becky Passonneau, Axinia Radeva, Haimonti Dutta, Steve Ierome and Delfina Isaac, "Predicting Vulnerability to Manhole Events in Manhattan: A Preliminary Machine Learning Approach", Submitted to Machine Learning Research.
  • Haimonti Dutta, Cynthia Rudin, Becky Passonneau, Fred Seibel, Nandini Bhardwaj, Axinia Radeva, Zhi An Liu, Steve Ierome and Delfina Isaac, "Visualization of Manhole and Precursor-Type Events for the Manhattan Electrical Distribution System", Workshop on Geo-Visualization of Dynamics, Movement and Change, 11th AGILE International Conference on Geographic Information Science, Girona, Spain, 2008.
Medicine and Healthcare Research

Machine Learning for Understanding Epilepsy - This work is a collaborative effort with Drs. Catherine Schevon and Ron Emerson from the Columbia University Medical School. It is funded by a seed grant (Research Initiatives in Science and Engineering, RISE) from Columbia University and a National Science Foundation grant (IIS-0916186). The goal of this project is to use machine learning techniques to develop seizure prediction and detection algorithms and to understand the underlying causes of the disorder. The main challenge is the huge volume (30+ TB) of EEG data recorded from a small number of patients. Visit the project page here.

Learning from Hospital Readmission Data - To reduce the costs hospitals incur from repeated readmissions, the project team was asked to study readmission data from a hospital on the East Coast of the USA and to analyze and predict possible readmissions from factors such as length of hospital stay, type of disease, treatments offered, and other sensitive patient health record information.
  • Distributed Data Mining for Large Astronomy Databases.
  • Distributed Kernel Density Estimation.
  • Ensemble Classification Techniques.
  • Grid Mining
  • Data Stream Monitoring
BioInformatics Research Center (BRC), UMBC
Computation of Phylogenetic trees from protein / DNA sequences
This project was done at the Bioinformatics Research Center, under the supervision of Dr Madhu Nayakkankuppam, Department of Mathematics and Statistics, University of Maryland Baltimore County. View Abstract
It involved
  • Study of existing methods for computation of phylogenetic trees from DNA / protein sequences and relevant literature review
  • Parallelization of the problem of obtaining phylogenetic trees to obtain better performance in determination of the best tree
  • Examination of optimization techniques
  • Use of statistical methods to reduce search space of trees generated to obtain best tree
  • Comparison of results with existing software (PHYLIP)
  • Extensive coding in C++ and MPI (message passing interface)
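To give a sense of why parallelization and search-space reduction matter here, the sketch below counts the distinct unrooted binary tree topologies on n labeled taxa, which is the standard double factorial (2n-5)!!. It is a small illustration in Python, not part of the project's C++/MPI code; the function name is hypothetical.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted binary tree topologies on n labeled taxa: (2n-5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (empty product when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


# Topology counts for a few taxon-set sizes.
counts = {n: num_unrooted_trees(n) for n in (3, 4, 5, 10)}
```

Ten taxa already admit more than two million topologies, so exhaustive evaluation quickly becomes infeasible as sequences are added.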
Temple University, Philadelphia
  • Analysis of aircraft performance using the FltWinds software and a Neural Network based approach (Supervisor: Dr Zoran Obradovic. Collaborators: Lockheed Martin Inc, Valley Forge, Philadelphia.) Technical Report Presentation
  • Study of Isomorphism in Trees (Supervisor: Dr Igor Rivin, Department of Mathematics, Temple University.)
  • Feature Extraction, Classification and Searching in Medical Image Databases (Supervisor: Dr Vasilis Megaloikonomou)
  • Shape Representation and Matching in Medical Tumor Databases (Supervisor: Dr Zoran Obradovic)
  • Design and Analysis of Feature Hidden Markov Models for Protein Classification (Supervisor: Dr Vasilis Megaloikonomou, Collaborator: Dr Predrag Radivojac)