Ying Sun



My research interests relate broadly to new approaches to searching and accessing information.

Machine Learning for Access and Retrieval

Currently, my research focuses on automatic text classification with attention to characteristics above and beyond topicality. This involves research in two areas: representing a document's non-topical properties, and using machine learning to find the best classification or clustering rules.


The first group of characteristics beyond topicality used to represent texts in information systems, such as level of difficulty, reliability of the sources, and objectivity of content, are called "Qualitative Properties." They are usually associated with the "quality" or "style" of a text. My research has shown that certain qualitative properties of a text can be identified automatically.

We are now making progress on identifying the second set of text characteristics. These are called "content aspects," and they refer to broad perspectives in the text. To give an immediate example, as described in the proposal for this work,


A discussion of "Syrian military capability" might appear in an article which is primarily concerned with technology issues, primarily concerned simply with comparison of military forces, or primarily concerned with an assessment of an overall political situation.


My specific research focus in this project goes beyond the traditional bag-of-words representation. I am exploring representations based on linguistic categories and on entity identification. I have examined four levels of linguistic features. Character-level features are mainly punctuation marks, special symbols, capitalized words, and character-based length counts. Lexical-level features include a list of special words. Structural-level features include counts of part-of-speech tags. Derivative features are ratios and other features derived from the previous three levels. Future work includes identifying patterns composed of multiple individual language features.
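As a rough illustration of these feature levels, the following Python sketch computes a few character-level, lexical-level, and derivative features from a text. The feature names, the `SPECIAL_WORDS` list, and the `extract_features` function are assumptions made for this example and do not come from the project itself. Structural-level features (part-of-speech counts) are left out because they would require an external tagger.

```python
import re
import string

# Hypothetical list of "special words"; the actual list used is not given here.
SPECIAL_WORDS = {"however", "allegedly", "reportedly", "must", "should"}


def extract_features(text):
    """Return a dict of character-level, lexical-level, and derivative features."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = len(words)

    # Character-level: punctuation, special symbols, capitalization, lengths.
    features = {
        "n_chars": len(text),
        "n_punct": sum(ch in string.punctuation for ch in text),
        "n_question": text.count("?"),
        "n_capitalized": sum(w[0].isupper() for w in words),
        "n_words": n_words,
    }

    # Lexical-level: occurrences of words from the special-word list.
    features["n_special"] = sum(w.lower() in SPECIAL_WORDS for w in words)

    # Derivative: ratios built from the counts above (guarding against empty text).
    denom = n_words or 1
    features["avg_word_len"] = sum(len(w) for w in words) / denom
    features["capitalized_ratio"] = features["n_capitalized"] / denom
    features["special_ratio"] = features["n_special"] / denom
    return features
```

Calling `extract_features` on each document would give one numeric vector per text, which could then be passed to a learning method.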

With regard to machine learning techniques, I am working to compare learning methods for assigning weights to the many possible features. There is some evidence about which learning methods are especially good for classifying texts by their content. My own work tests whether these conclusions hold true for classifying texts on non-topical properties, and seeks suitable learning techniques. I have explored a variety of statistical and machine learning techniques, including linear regression, logistic regression, decision trees (C4.5), and support vector machines (SVM).
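To make the idea of "assigning weights to features" concrete, here is a minimal sketch of one of the methods named above, logistic regression, written in plain Python with batch gradient descent. It is not the implementation used in this research. The toy data, the two feature names, the class labels, and the hyperparameters `lr` and `epochs` are all invented for illustration.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit feature weights by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the estimated probability of the positive class is >= 0.5."""
    return int(sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5)


# Toy data (invented): each row is [capitalized_ratio, special_ratio];
# label 1 stands for "subjective" and 0 for "objective".
rows = [[0.10, 0.30], [0.15, 0.25], [0.12, 0.35],
        [0.30, 0.02], [0.35, 0.05], [0.28, 0.01]]
labels = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic(rows, labels)
```

The learned `weights` show how strongly, and in which direction, each feature pushes a text toward the positive class. Comparing methods would mean training several such learners on the same feature vectors and measuring their accuracy on held-out texts.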

This work is motivated by the HITIQA project, which is supported by ARDA. The representation and learning results have been applied to HITIQA, a Question Answering system. Another application of the work is to cooperative filtering: the identified non-topical text characteristics could be incorporated in user profiles. Typically the similarity between a query and a topic is measured by the "diagonal" linear combination of the index terms. However, we believe that in many cases beyond topicality, the terms and linguistic features are not orthogonal. To achieve better results, it is desirable to introduce a new metric for defining the similarity.
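One possible reading of the "diagonal" combination is a weighted dot product, in which each feature interacts only with itself. A natural generalization, shown here only as an assumption about what such a new metric could look like, is a bilinear form with a full matrix whose off-diagonal entries let related features contribute to the similarity. The function names, vectors, and matrix values below are hypothetical.

```python
def diagonal_similarity(query, doc, weights):
    """Weighted dot product: each feature interacts only with itself."""
    return sum(w * q * d for w, q, d in zip(weights, query, doc))


def bilinear_similarity(query, doc, metric):
    """Compute query^T * M * doc; off-diagonal entries of M let
    different features (terms or linguistic features) interact."""
    return sum(
        query[i] * metric[i][j] * doc[j]
        for i in range(len(query))
        for j in range(len(doc))
    )


# Hypothetical three-feature space; all values are invented for illustration.
query = [1.0, 0.0, 0.0]
doc = [0.0, 1.0, 0.0]

weights = [1.0, 1.0, 1.0]
metric = [
    [1.0, 0.6, 0.0],  # features 0 and 1 are treated as related (0.6)
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

diag_score = diagonal_similarity(query, doc, weights)  # no shared features
full_score = bilinear_similarity(query, doc, metric)   # related features count
```

With these values the diagonal measure sees no overlap between the query and the document, while the full matrix gives them a nonzero similarity because their features are treated as related. When the matrix is diagonal, the two measures coincide.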


System Evaluation

In addition to system design and implementation, I am equally interested in various issues of system evaluation. I have worked on the AntWorld system, and developed the user interface tools for an evaluation of the impact of the system on the work product of user groups. This "Cross Evaluation" technique has since been used in a complex Challenge Workshop, sponsored by ARDA, for the evaluation of Question Answering systems in the AQUAINT program, and it continues to be used in the HITIQA system evaluation work.

User Modeling


My view of information systems includes the human factors involved, and I am also interested in user modeling. I have studied user logs, seeking to identify rules that can select useful query terms which appear later in a user's session. The sources include the system's dialog with users, the system-generated answers (clustered text passages), and users' copied texts. If this can be accomplished, it will let an information system respond more appropriately at every step of the interaction.


Professional Experience:


Jan 2006 – Aug 2006, Senior Datamining Engineer, RelevantNoise Inc. New Jersey

Web Blog mining for business purposes

- Tone assessment

- Splog identification

- Blog searching


2002 – Dec 2005, Research Assistant

HITIQA -- High Quality Interactive Question-answering project
Dr. Paul Kantor, Director, SCILS, Rutgers University
Dr. Tomek Strzalkowski, Principal Investigator, SUNY, Albany


-       Identify language features and use machine learning approaches to automatically classify documents on criteria other than topical relevance;

-       Design and monitor system evaluation;

-       Conduct system log/user behavior analysis to predict potential query terms. 


Experience includes running several search engines and language analysis tools; conducting various kinds of statistical analysis; writing Java programs and Perl scripts for pattern recognition and text processing; writing web applications with JSP and Java Servlets; designing database schemas to support data collection; and cooperating with system developers on site.


2004, Summer Internship at Pacific Northwest National Laboratory, Richland, WA

Metrics for Question Answering Systems (ARDA Challenge Workshop 2004)
Dr. Emile Morse, NIST


Conducted research on identifying new metrics for Question Answering systems evaluation. The work included four QA systems.

-       Designed the cross-evaluation experiment;

-       Developed and maintained data collection tools (JSP, MySQL);

-       Analyzed data and contributed to the final report.


April 2004 – January 2005, Research Team Member

Dr. Nicholas Belkin, Principal Investigator, SCILS, Rutgers University

Conducted research on document genre classification with linguistic features as indicators.


May 2001 – July 2002, Research Assistant

IM-EVAL-- Evaluation of Collaborative Information Management Tool: AntWorld
Dr. Paul Kantor, Principal Investigator, SCILS, Rutgers University


-       Designed and developed a new evaluation model – Cross Evaluation for collaborative systems;

-       Implemented a set of systems for evaluation purposes;

-       Coordinated experiments;

-       Conducted statistical analyses and contributed to the final report.


September 1999 – 2002, Research Team Member

RDLDL -- Rutgers Distributed Laboratory for Digital Libraries
Dr. Paul Kantor, Director, SCILS, Rutgers University

Attended and organized seminars on related topics.


April 2000 - August 2001, Research Team Member

Rutgers Interactive TREC
Dr. Nicholas Belkin, Principal Investigator, SCILS, Rutgers University

Studied the effectiveness of various system features in helping users with query formulation.


Sept 1997 - June 1999, Programmer, R&D Department, Peking University Library, Beijing China.

-       Collaborated in all aspects of design and implementation of the Peking University Integrated Library System;

-       Responsible for building the classification subsystem;

-       Designed and implemented the client-end interface and server application in Java;

-       Designed and managed the data storage and retrieval schemas (Oracle DBMS).

Last updated June 2004