
Applied Research

Sentiment Analysis

Description:

Analyzing text with respect to its sentiment can be extremely valuable to an individual who is looking for information about a company, a product, or a service. Beyond such individual needs, companies may also benefit from automatic sentiment analysis by obtaining a timely picture of how their products, their services, or more generally their name is viewed by their customers. In addition, sentiment analysis may play a role in monitoring a company’s competition. We also note that the blogosphere is a rapidly expanding environment where consumers go to find or submit opinions that may be ripe for mining.

Current approaches tend to divide the problem space into sub-problems, for example, creating a lexicon of useful features that can help classify sentences (or portions of sentences) into categories of positive, negative or neutral. Existing techniques often try to identify words, phrases and patterns that indicate viewpoints. This has proven difficult, however, since it is not just the presence of a keyword that matters, but its context. For instance, "This is a great decision" conveys clear sentiment, but "The announcement of this decision produced a great amount of media attention" is neutral. We examine the Blawgosphere (legal blogs) in a two-phase process. The first phase uses subjectivity analysis to determine whether a given sentence is neutral or subjective. The second uses polarity analysis to determine whether the resulting subjective sentences are positive or negative in their sentiment. Recently the primary focus of our research has expanded beyond legal blogs to include negative news articles as well.
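
The two-phase process can be illustrated with a minimal sketch. It is not the classifier used in this research: the cue lists, the quantity-noun context rule, and all function names are hypothetical and chosen only to reproduce the "great decision" versus "great amount" contrast.

```python
import re

# Hypothetical cue lists for illustration; a real system would learn these.
POSITIVE = {"great", "excellent", "wise"}
NEGATIVE = {"terrible", "poor", "flawed"}
# Hypothetical context rule: a cue directly followed by a quantity noun
# ("a great amount") is treated as an intensifier, not as an opinion.
QUANTITY_NOUNS = {"amount", "deal", "number", "many"}


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def opinion_cues(sentence):
    """Return the cue words that survive the context check."""
    tokens = tokenize(sentence)
    cues = []
    for i, tok in enumerate(tokens):
        if tok in POSITIVE or tok in NEGATIVE:
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            if following not in QUANTITY_NOUNS:
                cues.append(tok)
    return cues


def is_subjective(sentence):
    """Phase 1: subjectivity analysis (neutral vs. subjective)."""
    return bool(opinion_cues(sentence))


def polarity(sentence):
    """Phase 2: polarity analysis, meant for subjective sentences only."""
    score = sum(1 if cue in POSITIVE else -1 for cue in opinion_cues(sentence))
    # Ties are resolved as positive; this choice is arbitrary.
    return "positive" if score >= 0 else "negative"


def classify(sentence):
    if not is_subjective(sentence):
        return "neutral"
    return polarity(sentence)
```

Calling classify on the two example sentences separates them only because of the context rule; a keyword match alone would label both as positive.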

Publications:


Document Clustering

Description:

Computational resources for research in legal environments have historically implied remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, lawyers' electronic work product is growing enormously within these environments, specifically within law firms. Along with this growth has come the need for accelerated knowledge management: automated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner. In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to organizing collections, and an adjunct to search.

To explore this approach further, we have conducted sets of successively more complex clustering experiments using primary and secondary law documents as well as actual law firm data. Tests were run to determine the efficiency and effectiveness of a number of essential clustering functions. After examining the performance of traditional, or hard, clustering applications, we investigated soft clustering (multiple cluster assignments) as well as hierarchical clustering. We show that these latter clustering approaches are effective, in terms of both internal and external quality measures, and useful to legal researchers. Moreover, such techniques can ultimately assist in the automatic or semi-automatic generation of taxonomies for subsequent use by classification programs.
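
The difference between hard and soft cluster assignment can be shown with a small sketch. It assumes bag-of-words vectors, cosine similarity, fixed centroids, and a similarity threshold of 0.3; none of these choices comes from the experiments described here, and the cluster names and texts are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hard_assign(doc, centroids):
    """Hard clustering: exactly one cluster, the most similar centroid."""
    return max(centroids, key=lambda name: cosine(doc, centroids[name]))


def soft_assign(doc, centroids, threshold=0.3):
    """Soft clustering: every cluster whose similarity meets the threshold."""
    return sorted(
        name for name, c in centroids.items() if cosine(doc, c) >= threshold
    )


# Invented centroids and document, for illustration only.
centroids = {
    "contracts": vectorize("contract breach damages offer acceptance"),
    "employment": vectorize("employment employee termination wages"),
}
doc = vectorize("employee breach of employment contract")

print(hard_assign(doc, centroids))
print(soft_assign(doc, centroids))
```

A document about breach of an employment contract is forced into a single cluster under hard assignment, while soft assignment can place it in both the contracts and the employment clusters.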

Publications:


Duplicate Identification

Description:

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a 'fingerprint' of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.
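
A minimal sketch of a rarest-feature fingerprint, and of its sensitivity to collection updates, follows. The whitespace tokenization, the choice of k = 3 features, the SHA-1 hash, and the toy collection are all assumptions made for illustration; they are not the signature scheme evaluated in this work.

```python
import hashlib
from collections import Counter


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(text.lower().split()))
    return df


def rarest_terms(text, df, k=3):
    """Select the k rarest terms of a document using collection statistics."""
    terms = set(text.lower().split())
    # Rarest first; ties broken alphabetically so the result is deterministic.
    return sorted(terms, key=lambda t: (df.get(t, 0), t))[:k]


def fingerprint(text, df, k=3):
    """Hash the selected terms into a compact signature."""
    selected = " ".join(sorted(rarest_terms(text, df, k)))
    return hashlib.sha1(selected.encode("utf-8")).hexdigest()


# Toy training collection and document, invented for illustration.
collection = [
    "court grants motion to dismiss",
    "court denies motion for appeal",
    "jury awards damages in fraud trial",
]
doc = "court grants motion to dismiss fraud claim"

before = document_frequencies(collection)
# New documents arrive and shift the collection statistics.
after = document_frequencies(
    collection + ["claim filed", "claim dismiss hearing", "fraud claim"]
)

print(rarest_terms(doc, before))
print(rarest_terms(doc, after))
print(fingerprint(doc, before) == fingerprint(doc, after))
```

The document itself is unchanged between the two calls, yet its selected features and therefore its fingerprint differ once the collection statistics shift. Freezing the training collection avoids this, at the cost of the coverage issues noted above.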

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

Publications:


Resource Navigation

Description:

The continued growth of very large data environments such as Westlaw and Dialog, in addition to the World Wide Web, increases the importance of effective and efficient database selection and searching. Current research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which databases with relevant documents are not searched (compromised recall). It also merges documents, often from disparate data sources that users may have discarded before their source selection task proceeded (diluted precision). We examine the impact that early user interaction can have on the process of database selection. After analyzing thousands of real user queries, we show that precision can be significantly increased when queries are categorized by the users themselves, then handled effectively by the system. Such query categorization strategies may eliminate limitations of fully automated query processing approaches. Our system harnesses the WIN search engine, a sibling to INQUERY, run against one or more authority sources when search is required. We compare our approach to one that does not recognize or utilize distinct features associated with user queries. We show that by avoiding a one-size-fits-all approach that restricts the role users can play in information discovery, database selection effectiveness can be appreciably improved.

We also compare standard global IR searching with user-centric localized techniques to address the database selection problem. We conduct a series of experiments to compare the retrieval effectiveness of three separate search modes applied to a hierarchically structured data environment of textual database representations. The data environment is represented as a tree-like directory containing over 15,000 unique databases and over 100,000 total leaf nodes. Our search modes consist of varying degrees of browse and search, from a global search at the root node to a refined search at a sub-node using dynamically calculated inverse document frequencies (idfs) to score candidate databases for probable relevance. Our findings indicate that a browse and search approach that relies upon localized searching from sub-nodes is capable of producing the most effective results.
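
The effect of recalculating idfs at a sub-node can be sketched as follows. The four-entry directory, the path scheme, the log(N/n) idf formula, and the sum-of-idfs score are assumptions for illustration only; the actual experiments use the WIN search engine over a far larger directory.

```python
import math

# Hypothetical directory: path -> textual representation of one database.
DATABASES = {
    "law/tax/federal": "federal tax court cases",
    "law/tax/state": "state tax court rulings",
    "law/family/custody": "custody court cases",
    "news/business": "business tax news",
}


def under(node):
    """Databases below a directory node ('' stands for the root)."""
    prefix = node + "/" if node else ""
    return {
        path: set(text.split())
        for path, text in DATABASES.items()
        if path.startswith(prefix)
    }


def idf(term, candidates):
    """Inverse document frequency computed only over the candidate set."""
    n = sum(1 for terms in candidates.values() if term in terms)
    return math.log(len(candidates) / n) if n else 0.0


def search(query, node=""):
    """Rank databases under `node`, with idfs recalculated for that subtree."""
    candidates = under(node)
    query_terms = query.lower().split()
    scores = {
        path: sum(idf(t, candidates) for t in query_terms if t in terms)
        for path, terms in candidates.items()
    }
    # Highest score first; ties broken by path for a stable ordering.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(search("federal tax"))                  # global search from the root
print(search("federal tax", node="law/tax"))  # localized search at a sub-node
```

In the global search the term "tax" still contributes weight, so unrelated databases such as news/business receive a non-zero score. Within the law/tax sub-node every candidate contains "tax", its local idf drops to zero, and only "federal" discriminates among the candidates.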

Publications:


Open-Ended Research

Component Analysis (Legal Domain)

Description:

Empirical research on basic components of American judicial opinions has only scratched the surface. The lack of a coordinated pool of legal experts and the lack of adequate computational resources are but two reasons for this deficiency. We have undertaken a study to uncover fundamental components of judicial opinions found in American case law. The study was aided by a team of twelve expert attorney-editors with a combined total of 135 years of legal editing experience. The scientific hypothesis underlying the experiment was that after years of working closely with thousands of judicial opinions, expert attorneys would develop a refined and internalized schema of the content and structure of legal cases. In this study participants were permitted to describe both concept-related and format-related components. The resultant components, representing a combination of these two broad categories, are reported on in this paper. Additional experiments are currently under way to further validate and refine this set of components and apply them to new search paradigms.

Publications:



Last updated: Mon Apr 08 22:53:17 CET 2013