Current Research Projects
Machine Learning on the Historic Newspaper Archive of the NYPL

This research is funded by the National Endowment for the Humanities, and the project is designated as a "We the People" project.

Computers may have defeated humans at chess and arithmetic, but there are many areas where the human mind still excels, such as visual cognition and language processing (Comm. of ACM, Vol 52, No 3, March ’09). If one mind is good, it has been argued that several minds together are likely to outperform individuals, and even experts, on certain tasks. This project aims to leverage the wisdom of crowds (von Ahn, 2008) to collaboratively tag historical newspaper articles in the holdings of the New York Public Library (NYPL). Patrons and scholars will be encouraged to generate custom tags for articles they read and use often; these tags will be integrated into a metadata library and evaluated for their contribution to improving document retrieval. Novel machine learning algorithms will be designed for automatic categorization of newspaper articles. The creation and analysis of this corpus will enable advanced search mechanisms on these holdings, making them more useful to the general public.


Related Projects: Chronicling America, National Digital Newspaper Program

Related Publications
  • Haimonti Dutta and William Chan, "Using Community Structure Detection to Rank Annotators when Ground Truth is subjective", NIPS Workshop on Human Computation in Science and Computational Sustainability, Lake Tahoe, Dec 7-8, 2012.
  • Haimonti Dutta, Rebecca J. Passonneau, Austin Lee, Axinia Radeva, Boyi Xie, David Waltz and Barbara Taranto, "Learning Parameters of the K-Means Algorithm from Subjective Human Annotation", The 24th International FLAIRS Conference, Special Track on Data Mining, Palm Beach, FL, May 18-20, 2011.
  • Austin Lee, Haimonti Dutta, Rebecca Passonneau, David Waltz and Barbara Taranto, "Topic Identification from Historic Newspaper Articles of the New York Public Library: A Case Study", 5th Annual Machine Learning Symposium, NYAS, 2010.
Smart Grid Related Projects

This work is sponsored by the Consolidated Edison Company of New York.
  • Estimating the Mean Time Between Failures (MTBF) for Electrical Feeders in the Primary Distribution System - Development of regression models such as Classification and Regression Trees (CART) and Support Vector Regression (SVR). The work is particularly challenging because feeders behave differently in different seasons, depending on temperature and load fluctuations in the system.
  • Ranking Electrical Feeders according to their Susceptibility to Failure - The goal of this project is to rank feeder cables according to their susceptibility to failure. We have been experimenting with different machine learning techniques such as RankBoost, Martingale Ranking and scores obtained from Support Vector Machines. In addition, we are interested in analyzing features to find out what actually causes the feeders to fail. This involves investigating univariate and multivariate feature extraction techniques.
  • Predicting Manhole Events - The goal of this project was to rank structures (manholes, service boxes) according to their susceptibility to explosions, fires and other serious events.
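As a rough illustration of the quantities named in the first two items above, the following minimal sketch computes an empirical MTBF (observed time divided by number of failures) and orders feeders from most to least failure-prone. It is not the project's regression or ranking model; the feeder names, failure counts, and function names are hypothetical and chosen only for the example.

```python
from typing import Dict, List, Tuple


def empirical_mtbf(observation_days: float, failure_count: int) -> float:
    """Observed time divided by failure count; infinite if no failure was seen."""
    if observation_days <= 0:
        raise ValueError("observation_days must be positive")
    if failure_count < 0:
        raise ValueError("failure_count must be non-negative")
    if failure_count == 0:
        return float("inf")
    return observation_days / failure_count


def rank_by_susceptibility(
    failures: Dict[str, int], observation_days: float
) -> List[Tuple[str, float]]:
    """Return (feeder, MTBF) pairs, smallest MTBF (most failure-prone) first."""
    scored = [
        (name, empirical_mtbf(observation_days, count))
        for name, count in failures.items()
    ]
    # Ties on MTBF are broken by feeder name so the order is deterministic.
    return sorted(scored, key=lambda item: (item[1], item[0]))


# Hypothetical failure counts observed over one year (assumed data).
feeder_failures = {"feeder_A": 4, "feeder_B": 0, "feeder_C": 10}
ranking = rank_by_susceptibility(feeder_failures, observation_days=365.0)

for name, mtbf in ranking:
    print(f"{name}: MTBF = {mtbf} days")
```

A learned model such as CART or SVR would replace the simple ratio with a prediction from feeder features (for example season, temperature and load), but the ranking step over the resulting scores would look much the same.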
Related Publications
  • Rebecca J. Passonneau, Ashish Tomar, Somnath Sarkar, Haimonti Dutta and Axinia Radeva, "Multivariate Assessment of a Repair Program for a New York City Electrical Grid", 11th International Conference on Machine Learning and Applications ICMLA, Special Session on Machine Learning in Energy Applications, Boca Raton, FL, Dec 13 - 15, 2012.
  • Boyi Xie, Rebecca J. Passonneau, Haimonti Dutta, Jing-Yeu Miaw, Axinia Radeva, Ashish Tomar and Cynthia Rudin. "Progressive Clustering with Learned Seeds: An Event Categorization System for Power Grid." 24th International Conference on Software Engineering and Knowledge Engineering (SEKE 2012). Redwood City, CA. July 1-3, 2012.
  • Phil Gross, Ansaf Salleb-Aouissi, Haimonti Dutta and Albert Boulanger, "Ranking Electrical Feeders of the New York Power Grid", 3rd Annual Machine Learning Symposium at the New York Academy of Sciences (NYAS), New York, October, 2008.
  • Haoyun Feng, Haimonti Dutta and Ansaf Salleb-Aouissi, "On Improving Probability Estimate Trees", 3rd Annual Workshop for Women in Machine Learning (WiML) held in conjunction with Neural Information Processing Systems (NIPS), Vancouver, B.C., 2008.
  • Phil Gross, Ansaf Salleb-Aouissi, Haimonti Dutta and Albert Boulanger, "Susceptibility Ranking of Electrical Feeders: A Case Study", Technical Report, CCLS-08-04.
  • Cynthia Rudin, Becky Passonneau, Axinia Radeva, Haimonti Dutta, Steve Ierome and Delfina Isaac, "Predicting Vulnerability to Manhole Events in Manhattan: A Preliminary Machine Learning Approach", Submitted to Machine Learning Research.
  • Haimonti Dutta, Cynthia Rudin, Becky Passonneau, Fred Seibel, Nandini Bhardwaj, Axinia Radeva, Zhi An Liu, Steve Ierome and Delfina Isaac, "Visualization of Manhole and Precursor-Type Events for the Manhattan Electrical Distribution System", Workshop on Geo-Visualization of Dynamics, Movement and Change, 11th AGILE International Conference on Geographic Information Science, Girona, Spain, 2008.
Medicine and Healthcare Research

Machine Learning for Understanding Epilepsy - This work is a collaborative effort with Drs. Catherine Schevon and Ron Emerson from the Columbia University Medical School. It is funded by a seed grant (Research Initiatives in Science and Engineering, RISE) from Columbia University and a National Science Foundation grant (IIS-0916186). The goal of this project is to use machine learning techniques to develop seizure prediction and detection algorithms, as well as to understand the underlying causes of the disorder. The main challenge is that a huge volume (30+ TB) of EEG data has been recorded from a small number of patients. Visit the project page here.

Learning from Hospital Readmission Data - In a bid to reduce the costs incurred by hospitals for multiple readmissions, the project team was asked to study readmission data from a hospital on the East Coast of the USA and to analyze and predict possible readmissions from factors such as length of hospital stay, type of disease, treatments offered, and other sensitive patient health record information.
Large-scale Machine Learning - Development of efficient distributed and parallel algorithms for pattern recognition

Related Publications
    Distributed Support Vector Machines
  • Chase Hensel and Haimonti Dutta, "GERMS: a distributed sub-Gradient ERM Solver", 4th Annual Machine Learning Symposium at the New York Academy of Sciences (NYAS), New York, November, 2009.
  • Chase Hensel and Haimonti Dutta, "GADGET SVM: a Gossip-bAseD sub-GradiEnT SVM Solver", International Conference on Machine Learning (ICML), Numerical Mathematics in Machine Learning Workshop, Montreal, Quebec, 2009.

    Releasing GADGET SVM ver1.0

    Distributed Linear Programming
  • Xianshu Zhu, Tushar Mahule, Haimonti Dutta, Sugandha Arora, Hillol Kargupta, Kirk D. Borne: Peer-to-peer distributed text classifier learning in PADMINI. Statistical Analysis and Data Mining 5 (5): 446-462, 2012.
  • Haimonti Dutta, "A Randomized Gossip-based Algorithm for Classification on Peer-to-Peer Networks", In Proceedings of the NIPS Workshop on Big Learning: Algorithms, Systems, and Tools for Learning at Scale, Granada, Spain, Dec 2011.
  • Haimonti Dutta and Hillol Kargupta, "Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments", 10th International Workshop on High Performance Data Mining (HPDM) held in conjunction with the International Conference on Data Mining (ICDM), Pisa, Italy.
  • Haimonti Dutta and Ananda Mathur, "Distributed Optimization Strategies for Mining on Peer-to-Peer Networks", Accepted for publication in International Conference on Machine Learning and Applications (ICMLA), 2008. (Nominated for the Best Paper Award)
Theory and Algorithms
  • Ranking from Decision Trees
Past Research Projects