
Applied Research

Question Answering in the Legal Domain

Description:

Historically, when legal professionals performed natural language search, they were required to sift through exhaustive lists of results, ranked by probability of relevance, in order to identify materials relevant to their search. The task could be time-consuming and laborious. Over time, we began to see an interest in more focused question answering systems taking the place of traditional information retrieval systems. In the field of AI and Law, Quaresma and Rodrigues were among the first to implement a question answering system for legal documents [13], one that focused on Portuguese legal decisions. More recently, however, developments in deep learning-based approaches for tasks like open domain question answering have resulted in major gains in answer rate performance. They have also been responsible for comparable advances in closed domain question answering in fields such as Legal QA [1]. Such progress has resulted in performance gains for both factoid and non-factoid question answering.

Transformer architectures have delivered impressive performance gains over baselines for standard natural language processing (NLP) tasks. Open-domain language modeling as a pretraining step, followed by fine-tuning on a specific target domain, has delivered state-of-the-art performance for tasks in that domain, including the legal domain. One should thus expect to see significant performance gains in legal question-answer retrieval by utilizing the output of a transformer-based classifier which has been fine-tuned on legal QA pairs.

Publications:

Domingo Huh, Julian Brooke, Elnaz Davoodi, Jack G. Conrad, Systems and Methods for Context Aware Searching, Thomson Reuters, TR Labs, U.S. Patent Number 11,222,027, January 11, 2022.

Andrew Vold and Jack G. Conrad, "Using Transformers to Improve Answer Retrieval for Legal Questions," In Proceedings of the 18th International Conference on Artificial Intelligence and Law (ICAIL 2021) (São Paulo, Brazil) [Online], ACM Press, pp. 245-249, 2021.

Filippo Pompili, Jack G. Conrad and Carter Kolbeck, "Exploiting Search Logs to Aid in Training and Automating Infrastructure for Question Answering in Professional Domains," In Proceedings of the 17th International Conference on Artificial Intelligence and Law (ICAIL 2019) (Montreal, QC), ACM Press, pp. 93-102, 2019.

Document Clustering

Description:

Computational resources for research in legal environments have historically implied remote access to large databases of legal documents such as case law, statutes, law reviews and administrative materials. Today, by contrast, there exists enormous growth in lawyers' electronic work product within these environments, specifically within law firms. Along with this growth has come the need for accelerated knowledge management, that is, automated assistance in organizing, analyzing, retrieving and presenting this content in a useful and distributed manner. In cases where a relevant legal taxonomy is available, together with representative labeled data, automated text classification tools can be applied. In the absence of these resources, document clustering offers an alternative approach to organizing collections, and an adjunct to search.

To explore this approach further, we have conducted sets of successively more complex clustering experiments using primary and secondary law documents as well as actual law firm data. Tests were run to determine the efficiency and effectiveness of a number of essential clustering functions. After examining the performance of traditional or hard clustering applications, we investigated soft clustering (multiple cluster assignments) as well as hierarchical clustering. We show how these latter clustering approaches are effective, in terms of both internal and external quality measures, and useful to legal researchers. Moreover, such techniques can ultimately assist in the automatic or semi-automatic generation of taxonomies for subsequent use by classification programs.
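
The difference between hard and soft assignment can be illustrated with a minimal sketch. It assumes documents and cluster centroids are already available as term-weight vectors and that similarity is measured by cosine; the function names and the `ratio` threshold are hypothetical choices for illustration and are not taken from the experiments described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hard_assign(doc, centroids):
    """Hard clustering: return the index of the single most similar centroid."""
    sims = [cosine(doc, c) for c in centroids]
    return max(range(len(sims)), key=sims.__getitem__)


def soft_assign(doc, centroids, ratio=0.8):
    """Soft clustering: return every cluster whose similarity is at least
    `ratio` times the best similarity (an assumed, tunable threshold)."""
    sims = [cosine(doc, c) for c in centroids]
    best = max(sims)
    if best <= 0:
        return []
    return [i for i, s in enumerate(sims) if s >= ratio * best]
```

Under these assumptions, a document that sits between two topics receives a single label from `hard_assign` but both labels from `soft_assign`, which is the behavior that makes multiple cluster assignments attractive for documents covering several legal issues.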

Publications:

Jack G. Conrad and Michael Bender, "Semi-Supervised Events Clustering in News Retrieval," Proceedings of the First International Workshop on Recent Trends in News Retrieval (NewsIR'16), in conjunction with ECIR 2016 (Padua, Italy), CEUR-WS Online, pp. 21-26, 2016.

Jack G. Conrad and Qiang Lu, "Next Generation Legal Search -- It's Already Here," In VoxPopuLII blog, Legal Information Institute (LII), Cornell University, NY, 28 March 2013.

Qiang Lu and Jack G. Conrad, "Bringing Order to Legal Documents: An Issue-based Recommendation System via Cluster Association," In Proceedings of the Fourth International Conference on Knowledge Engineering and Ontology Development (KEOD 2012) (Barcelona, Spain), SciTePress DL, pp. 76-88, 2012.

Qiang Lu, Jack G. Conrad, Khalid Al-Kofahi, William Keenan, "Legal Document Clustering With Built-in Topic Segmentation," In Proceedings of the 2011 ACM-CIKM Twentieth International Conference on Information and Knowledge Management (CIKM 2011) (Glasgow, Scotland), ACM Press, New York, pp. 383-392, 2011.

Jack G. Conrad, Khalid Al-Kofahi, Ying Zhao and George Karypis, "Effective Document Clustering for Large Heterogeneous Law Firm Collections," In Proceedings of the 10th International Conference on Artificial Intelligence and Law (ICAIL 2005) (Bologna, Italy), ACM Press, New York, pp. 177-187, 2005.

Sentiment Analysis

Description:

Analyzing text with respect to its sentiment can be extremely valuable to an individual who is looking for information about a company, a product, or a service. Beyond such individual needs, companies may also benefit from automatic sentiment analysis by obtaining a timely picture of how their products, services or more generally their name is viewed by their customers. In addition, sentiment analysis may play a role in monitoring a company’s competition. We also note that the Blogosphere is a rapidly expanding environment where consumers go to find or submit opinions that may be ripe for mining.

Current approaches tend to divide the problem space into sub-problems, for example, creating a lexicon of useful features that can help classify sentences (or portions of sentences) into categories of positive, negative or neutral. Existing techniques often try to identify words, phrases and patterns that indicate viewpoints. This has proven difficult, however, since it is not just the presence of a keyword that matters, but its context. For instance, "This is a great decision" conveys clear sentiment, but "The announcement of this decision produced a great amount of media attention" is neutral. We examine the Blawgosphere (legal blogs) in a two-phase process, the first using subjectivity analysis to determine whether a given sentence is neutral or subjective, and the second using polarity analysis to determine whether the resultant subjective sentences are positive or negative in their sentiment. Recently the primary focus of our research has expanded beyond legal blogs to include negative news articles as well.
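
The two-phase process can be sketched as follows. This is a toy illustration only: the lexicons, the quantifier rule, and the function names are invented for the example and stand in for whatever subjectivity and polarity classifiers are actually used. It is meant to show how the phases chain together and why context around a keyword matters.

```python
import re

# Toy lexicons, invented for illustration only.
POSITIVE = {"great", "excellent", "sound"}
NEGATIVE = {"terrible", "flawed", "unjust"}
QUANTITY_WORDS = {"amount", "deal", "number", "many"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def evaluative_terms(sentence):
    """Return opinion words, skipping those used as quantifiers
    (for example "a great amount")."""
    tokens = tokenize(sentence)
    found = []
    for i, tok in enumerate(tokens):
        if tok in POSITIVE or tok in NEGATIVE:
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            if following not in QUANTITY_WORDS:
                found.append(tok)
    return found


def is_subjective(sentence):
    """Phase 1: subjectivity analysis (neutral vs. subjective)."""
    return bool(evaluative_terms(sentence))


def polarity(sentence):
    """Phase 2: polarity analysis, applied only to subjective sentences.
    Ties are arbitrarily treated as negative in this toy version."""
    score = sum(1 if t in POSITIVE else -1 for t in evaluative_terms(sentence))
    return "positive" if score > 0 else "negative"


def classify(sentence):
    """Run phase 1, then phase 2 only if the sentence is subjective."""
    if not is_subjective(sentence):
        return "neutral"
    return polarity(sentence)
```

With these toy rules, `classify("This is a great decision")` yields "positive", while the sentence about "a great amount of media attention" is filtered out as neutral in the first phase and never reaches polarity analysis.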

Publications:

Jack G. Conrad, Jochen L. Leidner, Frank Schilder and Ravi Kondadadi, "Query-based Opinion Summarization for Legal Blog Entries," In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL 2009) (Barcelona, Spain), ACM Press, New York, pp. 167-176, 2009.

Frank Schilder, Jochen L. Leidner, Jack G. Conrad, and Ravikumar Kondadadi, "Polarity Filtering for Sentiment Summarization," Poster presented at the First Text Analysis Conference (TAC08) (Gaithersburg, MD), NIST, Washington, D.C., Nov., 2008.

Frank Schilder, Ravi Kondadadi, Jochen Leidner and Jack G. Conrad, "Thomson Reuters at TAC 2008: Aggressive Filtering with FastSum for Update and Opinion Summarization," In Proceedings of the 2008 Text Analysis Conference (TAC08) (Gaithersburg, MD), NIST, Washington, D.C., Nov. 2008.

Jack G. Conrad, Jochen Leidner, Frank Schilder, "Professional Credibility: Authority on the Web," Proceedings of the Second Workshop on Information Credibility on the Web (CIKM08 Credibility Workshop -- WICOW08) (Napa Valley, CA), ACM Press, New York, pp. 82-85, 2008.

Jack G. Conrad and Frank Schilder, "Opinion Mining in Legal Blogs," In Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007) (Stanford University, Palo Alto, CA), ACM Press, New York, pp. 231-236, 2007.

Duplicate Identification

Description:

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a "fingerprint" of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. When an enterprise proceeds to freeze a training collection in order to stabilize the underlying repository of such features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains. This technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.
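
To make the signature idea concrete, here is a minimal sketch of a fingerprint built from the rarest terms in a document, where rarity is judged by document frequencies from a frozen training collection. The tokenizer, the smoothed inverse document frequency, the value of `k`, and the use of SHA-1 are all assumptions made for illustration; they are not the feature selection criteria used in the cited work.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return df


def fingerprint(text, df, n_docs, k=5):
    """Hash the k rarest terms of `text`, judged by collection statistics."""
    terms = set(tokenize(text))
    # Smoothed IDF (an assumption); terms unseen in training score highest.
    idf = {t: math.log((n_docs + 1) / (df.get(t, 0) + 1)) for t in terms}
    # Sort by descending IDF, breaking ties alphabetically for determinism.
    rarest = sorted(terms, key=lambda t: (-idf[t], t))[:k]
    signature = " ".join(sorted(rarest))
    return hashlib.sha1(signature.encode("utf-8")).hexdigest()
```

Because the selected terms depend on `df`, recomputing the statistics after a collection update can change which terms are chosen and therefore change the fingerprint of an unchanged document. That dependence is the instability the paragraph above describes.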

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents in order to permit the identification of non-identical duplicates. This method has produced promising results following an extensive evaluation using a production-based test collection created by domain experts.

Publications:

Jack G. Conrad and Edward L. Raymond, Jr., "Essential Deduplication Functions for Transactional Databases in Law Firms," In Proceedings of the 11th International Conference on Artificial Intelligence and Law (ICAIL 2007) (Stanford University, Palo Alto, CA), ACM Press, New York, pp. 261-270, 2007.

Jack G. Conrad and Cindy P. Schriber, "Managing Déjà Vu: Collection Building for Identifying Non-Identical Duplicate Documents," Journal of the American Society for Information Science and Technology (JASIST), 57(7), John Wiley & Sons, Hoboken, NJ, pp. 919-930, 2006.

Jack G. Conrad and Cindy P. Schriber, "Constructing a Text Corpus for Inexact Duplicate Detection," In Proceedings of the 27th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2004) (Sheffield, England), ACM Press, New York, pp. 582-583, 2004.

Jack G. Conrad, Xi S. Guo, and Cindy P. Schriber, "Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment," In Proceedings of the 2003 ACM-CIKM Twelfth International Conference on Information and Knowledge Management (CIKM03) (New Orleans, Louisiana), ACM Press, New York, pp. 243-252, 2003.

Resource Navigation

Description:

The continued growth of very large data environments such as Westlaw and Dialog, in addition to the World Wide Web, increases the importance of effective and efficient database selection and searching. Current research focuses largely on completely autonomous and automatic selection, searching, and results merging in distributed environments. This fully automatic approach has significant deficiencies, including reliance upon thresholds below which databases with relevant documents are not searched (compromised recall). It also merges documents, often from disparate data sources that users may have discarded before their source selection task proceeded (diluted precision). We examine the impact that early user interaction can have on the process of database selection. After analyzing thousands of real user queries, we show that precision can be significantly increased when queries are categorized by the users themselves, then handled effectively by the system. Such query categorization strategies may eliminate limitations of fully automated query processing approaches. Our system harnesses the WIN search engine, a sibling to INQUERY, run against one or more authority sources when search is required. We compare our approach to one that does not recognize or utilize distinct features associated with user queries. We show that by avoiding a one-size-fits-all approach that restricts the role users can play in information discovery, database selection effectiveness can be appreciably improved.

We also compare standard global IR searching with user-centric localized techniques to address the database selection problem. We conduct a series of experiments to compare the retrieval effectiveness of three separate search modes applied to a hierarchically structured data environment of textual database representations. The data environment is represented as a tree-like directory containing over 15,000 unique databases and over 100,000 total leaf nodes. Our search modes consist of varying degrees of "browse and search", from a global search at the root node to a refined search at a sub-node using dynamically calculated inverse document frequencies (idfs) to score candidate databases for probable relevance. Our findings indicate that a browse and search approach that relies upon localized searching from sub-nodes is capable of producing the most effective results.
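
The contrast between global and localized searching can be sketched in a few lines. The example below assumes each database is represented by a bag of terms, that the directory is a nested dictionary, and that candidates are scored by term frequency times a smoothed idf recomputed over whichever subtree is being searched. The tree, the scoring formula, and all names are hypothetical and are not the scoring used by the system described above.

```python
import math
from collections import Counter


def leaves(node):
    """Collect (name, term_counts) for every database below a directory node."""
    if "terms" in node:  # leaf: one database representation
        return [(node["name"], Counter(node["terms"]))]
    found = []
    for child in node["children"]:
        found.extend(leaves(child))
    return found


def rank_databases(node, query):
    """Rank databases under `node`, with idf computed over that subtree only."""
    dbs = leaves(node)
    n = len(dbs)
    idf = {}
    for term in set(query):
        df = sum(1 for _, counts in dbs if term in counts)
        if df:
            idf[term] = math.log((n + 1) / df)  # assumed smoothing, always > 0
    scores = {
        name: sum(counts[t] * w for t, w in idf.items())
        for name, counts in dbs
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical two-level directory of database representations.
tax = {"name": "Tax", "children": [
    {"name": "tax-cases", "terms": ["tax", "court", "appeal", "income"]},
    {"name": "tax-statutes", "terms": ["tax", "code", "income", "income"]},
]}
family = {"name": "Family", "children": [
    {"name": "family-cases", "terms": ["custody", "court", "appeal"]},
    {"name": "family-statutes", "terms": ["custody", "code"]},
]}
root = {"name": "root", "children": [tax, family]}

global_ranking = rank_databases(root, ["income", "court"])  # search at the root
local_ranking = rank_databases(tax, ["income", "court"])    # search at a sub-node
```

In this toy directory, "income" occurs in every database under the Tax node, so its localized idf is low and "court" becomes the more discriminating term there. Recomputing idf at the sub-node is what lets a localized search separate databases that look similar from the root.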

Publications:

Jack G. Conrad and Joanne R.S. Claussen, "Early User-System Interaction for Database Selection in Massive Domain-specific Online Environments," Transactions on Information Systems (TOIS), 21(1), ACM Press, New York, pp. 94-131, 2003.

Jack G. Conrad, Xi S. Guo, Peter Jackson, and Monem Meziou, "Database Selection Using Complete Physical and Acquired Logical Collection Resources in a Massive Domain-Specific Operational Environment," In Proceedings of the 28th International Conference on Very Large Databases (VLDB02) (Hong Kong), pp. 71-82, 2002.

Jack G. Conrad, Changwen Yang, and Joanne S. Claussen, "Effective Collection Metasearch in a Hierarchical Environment: Global vs. Localized Retrieval Performance," In Proceedings of the 25th Annual International ACM-SIGIR Conference on Research & Development in Information Retrieval (SIGIR 2002) (Tampere, Finland), Springer-Verlag, London, pp. 371-372, 2002.

Open-Ended Research

Component Analysis (Legal Domain)

Description:

Empirical research on basic components of American judicial opinions has only scratched the surface. Lack of a coordinated pool of legal experts or adequate computational resources are but two reasons responsible for this deficiency. We have undertaken a study to uncover fundamental components of judicial opinions found in American case law. The study was aided by a team of twelve expert attorney-editors with a combined total of 135 years of legal editing experience. The scientific hypothesis underlying the experiment was that after years of working closely with thousands of judicial opinions, expert attorneys would develop a refined and internalized schema of the content and structure of legal cases. In this study participants were permitted to describe both concept-related and format-related components. The resultant components, representing a combination of these two broad categories, are reported on in this paper. Additional experiments are currently under way which further validate and refine this set of components and apply them to new search paradigms.

Publications:

Jack G. Conrad and Daniel P. Dabney, "A Cognitive Approach to Judicial Opinion Structure: Applying Domain Expertise to Component Analysis," In Proceedings of the 8th International Conference on Artificial Intelligence and Law (ICAIL 2001) (St. Louis, Missouri), ACM Press, New York, pp. 1-11, 2001.

Jack G. Conrad and Daniel P. Dabney, "The Structure of Judicial Opinions: Identifying Internal Components and their Relationships," In El Hadi, Maniez, and Pollitt (Eds.), Structures and Relations in Knowledge Organization, Proceedings of the 5th International ISKO Conference (ISKO 1998) (Lille, France), Ergon-Verlag Press, Würzburg, Germany, pp. 413, 1998.

Last updated: Thu Feb 10 21:06:17 CST 2022