Combating Disinformation: Detection and Attribution

This effort focuses on using natural language processing and deep learning methods to identify disinformation in news and social media and, more importantly, to provide evidence that it reflects false or misleading information. A key initial step is identifying the phrases or sentences that represent the key claims being made. Each of these is then used to search a set of reputable news sources for supporting or refuting evidence. The data sets span several topics, including political news and, more recently, Covid-related fake news. A paper describing the data set used in this research was presented and published at ACL 2019.
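As a rough illustration of this pipeline (not the actual system), the sketch below extracts candidate claim sentences from an article and ranks passages from a small set of trusted sources by TF-IDF similarity. The corpus, the sentence-splitting heuristic, and the scoring are simplified assumptions.

```python
# Illustrative sketch of claim extraction + evidence retrieval (assumptions:
# toy corpus, naive sentence splitting, TF-IDF cosine similarity as the ranker).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extract_claims(article_text):
    """Naive claim extraction: treat each reasonably long sentence as a candidate claim."""
    return [s.strip() for s in article_text.split(".") if len(s.split()) > 3]

def retrieve_evidence(claim, trusted_passages, top_k=3):
    """Rank trusted passages by TF-IDF cosine similarity to the claim."""
    vectorizer = TfidfVectorizer().fit(trusted_passages + [claim])
    passage_vecs = vectorizer.transform(trusted_passages)
    claim_vec = vectorizer.transform([claim])
    scores = cosine_similarity(claim_vec, passage_vecs)[0]
    ranked = sorted(zip(trusted_passages, scores), key=lambda x: -x[1])
    return ranked[:top_k]

article = "The new vaccine alters human DNA. Officials denied the earlier report."
trusted = [
    "Health authorities state that mRNA vaccines do not alter human DNA.",
    "The city council approved the new budget on Tuesday.",
]
for claim in extract_claims(article):
    for passage, score in retrieve_evidence(claim, trusted):
        print(f"{score:.2f}  {claim!r} <-> {passage!r}")
```

In the full setting, the similarity ranker would be replaced by a learned retrieval and stance-classification model, but the claim-then-evidence structure is the same.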

Conversational AI Systems (Chatbots)

This is a recent effort focused on developing chatbots that can leverage rich content in generating responses (similar to question answering systems) and reflect empathy by generating tone-appropriate responses. During the Covid pandemic, people have found themselves increasingly isolated, including patients in hospitals. Intelligent and empathetic chatbots can help alleviate some of the despondence caused by that isolation. These models leverage state-of-the-art language models (such as GPT-2 and GPT-3) along with innovative deep learning architectures. We are also exploring the use of knowledge graph embeddings for deeper reasoning.
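The sketch below shows one simplified way a generation-based chatbot turn could be produced with a public GPT-2 checkpoint via the Hugging Face transformers pipeline. The prompt format and the empathy prefix are illustrative assumptions, not the project's actual architecture.

```python
# Minimal sketch of a generation-based chatbot turn with a public GPT-2 model.
# The dialogue format and tone-setting prefix are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def respond(dialogue_history, user_utterance, max_new_tokens=40):
    # Condition generation on the running dialogue plus a tone-setting prefix.
    prompt = (
        "The assistant replies in a warm, empathetic tone.\n"
        + "\n".join(dialogue_history)
        + f"\nUser: {user_utterance}\nAssistant:"
    )
    out = generator(prompt, max_new_tokens=max_new_tokens,
                    do_sample=True, top_p=0.9, num_return_sequences=1)
    # The pipeline returns the prompt plus the continuation; keep only the reply.
    return out[0]["generated_text"][len(prompt):].strip()

print(respond(["User: I have been in the hospital for a week.",
               "Assistant: That sounds exhausting. How are you holding up?"],
              "I feel pretty lonely today."))
```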

Social Unrest Prediction

One aspect of this research has focused on leveraging diverse global data sources (including locally sourced data and “big data” such as sensor data) together with predictive analytics to provide early warning of social and economic disruption in emerging economies. It involves several components: text mining and classification, fusion of information sources to quantify the stress associated with various indicators, and computational models for predicting disruption specific to a place and time. An IARPA challenge defined the problem as accurately predicting the expected number of riots and protests in a given city on a given date. Increasing the lead time of a prediction makes the information more useful for peacekeepers and development/aid organizations. The data sources include text (news, social media), climate data, commodity prices, and numerous other structured and unstructured sources. Ground truth is obtained from resources such as ACLED, a manually curated database of conflict events throughout the globe. The approach uses deep learning models including neural embeddings, RNNs, and, more recently, attention-based models.
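As an illustrative (not actual) sketch of the count-forecasting formulation, the PyTorch snippet below trains a small GRU over weekly feature vectors for a set of cities to predict the expected number of protest events, using a Poisson regression loss. The feature dimensions and the data are synthetic assumptions.

```python
# Illustrative sketch of event-count forecasting: a GRU over weekly feature
# vectors per city, trained with a Poisson loss to predict next-week protest
# counts. Dimensions and data are synthetic assumptions.
import torch
import torch.nn as nn

class ProtestCountForecaster(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, weeks, n_features)
        _, h = self.gru(x)             # h: (1, batch, hidden)
        return self.head(h[-1]).exp()  # positive expected count (Poisson rate)

model = ProtestCountForecaster(n_features=8)
loss_fn = nn.PoissonNLLLoss(log_input=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic example: 16 cities, 12 weeks of 8 features each, one target count.
x = torch.randn(16, 12, 8)
y = torch.poisson(torch.full((16, 1), 3.0))
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

An attention-based encoder can be substituted for the GRU without changing the Poisson-rate output head.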

Multilingual Text Mining

My work on multilingual text mining includes languages that are less commonly taught (LCTL). These languages are characterized by a lack of critical resources such as electronic dictionaries, annotated corpora (for training machine learning models), and other rich resources used in English NLP such as WordNet. I have supervised PhD students in adapting resources from other languages to automatically generate critical tools and data resources. This work, published in COLING and the journal TALIP, discussed the use of transfer learning to project English semantic role labels (from PropBank) onto an Urdu corpus; we showed how Hindi resources could be adapted for this task. Work on Urdu sentiment analysis, namely identifying opinion entities, has demonstrated the need for features different from those used in English. I have also worked on multilingual social media, which is characterized by phenomena such as code-switching: my team competed in a task that involved processing code-switched search queries. A 2016 COLING paper discussed the summarization of globally trending hashtags, including a robust, language-agnostic system for extracting multiword expressions.
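The toy sketch below illustrates the general idea of annotation projection: given word alignments between an English sentence labeled with PropBank-style roles and its Urdu translation, each label is copied to its aligned target token. The sentences, alignments, and labels are illustrative assumptions, not the published method.

```python
# Toy sketch of cross-lingual label projection over a parallel sentence pair.
# Alignments and labels here are illustrative assumptions.
def project_labels(src_labels, alignments, tgt_len):
    """src_labels: role label per source token (None = unlabeled).
    alignments: (src_idx, tgt_idx) word-alignment pairs.
    Returns projected labels for the target sentence."""
    tgt_labels = [None] * tgt_len
    for src_idx, tgt_idx in alignments:
        if src_labels[src_idx] is not None and tgt_labels[tgt_idx] is None:
            tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels

# "The boy ate an apple" -> PropBank-style labels on source tokens.
src_labels = ["ARG0", None, "V", None, "ARG1"]
# Toy alignment to a 5-token target sentence (indices only, SOV word order).
alignments = [(0, 0), (4, 2), (2, 4)]
print(project_labels(src_labels, alignments, tgt_len=5))
# -> ['ARG0', None, 'ARG1', None, 'V']
```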