Monday, May 18, 2009

Content Analyst: a Latent Semantic Indexing tool for Issue Coding, Concept Categorization and Searching of Litigation documents

By: Joe Thorpe

Content Analyst performs categorization (for issue coding) and searching by concept. It takes a purely mathematical approach that pays close attention to the co-occurrence of words and "phrase concepts" across a large body of documents. Suppose, for example, that your documents contain a number of references to my property management group, the Endurance Management Group (EMG), and that one of the properties it manages for me is called the Thorpe Towers. The software will figure out that relationship. A search for Endurance Management Group will return documents that discuss the Thorpe property even if they make no reference to EMG or Endurance. And if one of those documents involves discussion of a complicated 1031 exchange, you can use it to train a category for 1031, and the category will pick up other documents that discuss the transaction, even when they describe it in other ways without using the term 1031. The following link describes the LSI approach in general terms and does a much better job of explaining how (and why) it works:
http://knowledgesearch.org/lsi/lsa_definition.htm
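
For readers who want to see the co-occurrence idea in action, here is a minimal sketch of generic LSI in Python using NumPy. It is not Content Analyst's implementation, which is not described here. The five toy documents, the choice of two concept dimensions (k = 2), and the helper names to_vector and concept_scores are all assumptions made up for illustration. The sketch builds a term-by-document count matrix, reduces it with a truncated singular value decomposition, and then scores each document against a query in the reduced "concept" space.

```python
import numpy as np

# Toy corpus (hypothetical). Document 2 never mentions "endurance",
# "management" or "group", but it shares "thorpe towers" with documents
# that do.
docs = [
    "endurance management group manages thorpe towers",
    "emg endurance management group lease thorpe towers",
    "thorpe towers lease renewal tenants",
    "quarterly sales forecast marketing budget",
    "marketing budget sales review",
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}


def to_vector(text):
    """Raw term counts over the corpus vocabulary (unknown words ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v


# Term-by-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(d) for d in docs])

# Truncated SVD: keep only the k strongest "concept" dimensions.
k = 2  # assumed value, chosen for this tiny corpus
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]

# Each document expressed in concept space (docs x k).
doc_concepts = (U_k.T @ A).T


def concept_scores(query):
    """Cosine similarity between the query and every document in concept space."""
    q = to_vector(query) @ U_k
    norms = np.linalg.norm(doc_concepts, axis=1) * np.linalg.norm(q)
    return doc_concepts @ q / np.maximum(norms, 1e-12)


query = "endurance management group"
scores = concept_scores(query)
keyword_overlap = [len(set(query.split()) & set(d.split())) for d in docs]

for d, hits, sc in zip(docs, keyword_overlap, scores):
    print(f"keyword hits={hits}  concept score={sc:+.2f}  {d}")
```

In this toy corpus the third document shares no words with the query, so a plain keyword search would miss it, yet it scores well in concept space because "thorpe towers" co-occurs with the query terms elsewhere. Training a category from an example document can be sketched the same way: use the text of the example document as the query and keep the documents that score above a chosen threshold.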

A competing technology to LSI stems from the "Bayesian" family of algorithms. The objective of categorization software using the Bayesian approach is to create a taxonomy and structure the documents accordingly.
This is a linguistics-based approach: the idea is to index the entire corpus of data, let the program generate the common topics it finds, and organize the documents against those topics. It tends to work better when the documents are all articles, each following a well-organized theme.
In business communications (e-mail, etc.), conversations are typically all over the map. Too often the subject line bears no resemblance to the key points in an e-mail thread, nor does the writer put much organization into the content. For documents in litigation databases, we found very little value in the Bayesian methodology.

Our preference for the Content Analyst engine using the LSI method was the culmination of a four-year search for a categorization tool that provides real value in mining and organizing an unstructured document database, such as a large collection of litigation documents. The product would not be as useful in a Concordance or Summation environment, where it is not yet integrated and so would not be interactive during document review. It is, however, supported in the following three hosted review platforms: Relativity (by kCura), iConect, and Eclipse (by IPRO). Each is integrated with the Content Analyst system. All are very well-respected litigation database platforms for online document review, and each provides the flexibility to re-analyze and index throughout the document review process.

International Litigation Services, Inc.

jthorpe@ils-ipp.com