Wednesday, December 9, 2009

Need To Review Non-English Language Documents? Using Latent Semantic Indexing (LSI) and Machine Translation (MT) May Be Your Answer

Posted by: Joe Thorpe on December 9, 2009

The litigation document review process is demanding enough when the files being reviewed involve electronically stored information (ESI). When international parties are involved, foreign language documents may be added to the mix. For many litigation case teams, the first impulse is to begin their document review either by hiring translators or by using machine translation software.

Just finding bilingual people to work on the review team can be problematic, and it tends to be very expensive. Sending boxes of paper to a translation service is even more expensive: costs can range anywhere from 20 to 40 cents per word, and $200-$300 per page is not uncommon.

When the volume of foreign language documents amps up, as it tends to do with ESI, many firms resort to machine translation (MT). The problem with most MT programs is that they use a dictionary-based translation scheme, which tends to translate words within sentences non-contextually. The resulting narrative usually lacks coherence, often missing the point of the document being translated. Even if the proper mix of “trainable” MT tools is used to produce decipherable text, the output frequently contains words evocative of, but not identical to, the key search terms “anticipated” by the native English-speaking writers of the discovery order. Rather than risk missing responsive or relevant documents by searching for just those terms, review teams turn to a page-by-page review of all the machine-translated text. Plainly stated, typical searching and culling methods cannot be trusted, so each page of the machine-translated text must be reviewed.

My company’s projects regularly involve international parties, and many of these require the review of foreign language documents (from paper and ESI collections). We soon realized that we needed a better way to identify responsive documents than eyeballs on each page.

For the last couple of years, our team has successfully employed a "Programmatic Issue Coding" approach using concept categorization and searching tools. Within concept categorization, several vastly different underlying technologies compete for that space. Most are linguistics-based systems that attempt to create a language-based taxonomy. These systems do not index every word in the document collection, but rather use indexed keywords to create a categorization based upon a fixed hierarchy. One may think of a thesaurus as a taxonomy in that it is a multi-level arrangement of English words. Linguistics-based systems “crawl” a document collection searching for keywords (and their frequency of use within a document) and then classify each document into the group of like documents recognized as belonging to that category.
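To make that mechanism concrete, here is a minimal sketch in Python of how such a keyword-driven taxonomy classifier works. The taxonomy categories, keywords and sample text are invented for illustration and do not reflect any particular vendor's product.

```python
# Hypothetical sketch of a linguistics-based classifier: each document is
# assigned to the taxonomy node whose keywords it hits most often.
# The taxonomy, keywords and sample text are invented for illustration.

TAXONOMY = {
    "finance": {"invoice", "payment", "ledger", "audit"},
    "legal":   {"contract", "clause", "liability", "indemnity"},
    "hr":      {"hiring", "salary", "benefits", "termination"},
}

def classify(text):
    """Return the taxonomy category with the most keyword hits, or None."""
    words = set(text.lower().split())          # naive tokenization
    scores = {cat: len(words & kws) for cat, kws in TAXONOMY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("please review the indemnity clause in the contract"))  # legal
```

Note what this toy also makes obvious: a document whose keywords never appear (or appear in another language) simply falls through and is never categorized.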

This approach works pretty well where documents have a distinct, recurring and unifying idea, as articles, compositions and the like do. Email, business correspondence and much of what is found in organizational files aren’t as thematic. Oftentimes the subject line of an email is not even close to the actual subject of the discussion, which has morphed over multiple replies in the same thread.

Most importantly, though, this approach is a non-starter where foreign language documents are concerned, since the linguistics-based concept categorization programs used in US litigation projects are built on English taxonomies.

Fortunately, the concept tools that work better for us with English-language litigation documents also work well with documents in other languages. We decided to use tools built on Latent Semantic Indexing (LSI). The software looks at each word or phrase across the document universe, statistically noting co-occurrences. By recognizing these, the tool can categorize (or search across) documents conceptually. As an example, it would note that the word “riot” frequently occurs in documents with the phrase “public unrest” as well as “tear gas” and “rubber bullets”. Using this tool, a search for “riot” would return documents containing any of the phrases in my example as hits, even absent the word riot. Documents found and identified by the case team as responsive can then be used as exemplars to train the system to find all “similar” documents and create a resulting category.

The LSI algorithm is mathematical – it understands nothing about what a word by itself means – it creates the index from the patterns it sees. It is therefore functional across English text or text in any other language. So if a riot is described conceptually in any language, it will be captured!
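To illustrate the mechanics (not any particular vendor's implementation), here is a toy LSI sketch in Python using numpy: a tiny invented corpus, a term-document count matrix, a truncated SVD, and a concept-space search matching the riot example above.

```python
import numpy as np

# Toy sketch of Latent Semantic Indexing (invented five-document corpus).
# Build a term-document count matrix, take a truncated SVD, and search in
# the reduced "concept" space, where terms that co-occur across documents
# are pulled onto the same latent dimensions.
docs = [
    "riot police fired tear gas",
    "police fired rubber bullets at the unrest",
    "public unrest and rubber bullets",
    "quarterly earnings report released today",
    "company earnings report beat forecast",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep two latent concepts
Uk, sk, doc_vecs = U[:, :k], s[:k], Vt[:k].T  # one row per document

def fold_in(text):
    """Project a query into the k-dimensional concept space."""
    q = np.array([text.split().count(w) for w in vocab], dtype=float)
    return (q @ Uk) / sk

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cosine(fold_in("riot"), d) for d in doc_vecs]
# "public unrest and rubber bullets" scores high even though it never
# uses the word "riot"; the earnings documents score near zero.
```

The query "riot" reaches the unrest documents through chains of shared context ("police/fired" and "rubber bullets/unrest"), which is exactly the language-agnostic co-occurrence behavior described above.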

In our practice, we use the LSI tool as a methodology for finding and categorizing hot documents across multiple languages. In preparation, the case team and the project management team run concept-search queries with the LSI tool against the unstructured text database (all collected files and ESI). Once a sampling of particularly relevant documents is found for each category of interest, we use these documents to train each issue or hot document category. We then run these against the entire document corpus to code the documents to each category.
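A rough sketch of that exemplar-driven coding step follows; the category names and documents are invented, and plain bag-of-words vectors stand in here for the LSI concept space for brevity.

```python
import numpy as np

# Hypothetical sketch of exemplar-driven issue coding: case-team-reviewed
# exemplars define a category centroid, and each remaining document is
# coded to the category whose centroid it most resembles. Categories and
# documents are invented; a real system would compare in LSI concept space.
exemplars = {
    "bribery":  ["cash payment to the official",
                 "payment for the official permit"],
    "shipping": ["cargo vessel delayed at port",
                 "container shipment at the port"],
}
pool = [
    "wire payment sent to the official",
    "vessel held at port for inspection",
]

vocab = sorted({w for ds in exemplars.values() for d in ds for w in d.split()}
               | {w for d in pool for w in d.split()})

def vec(text):
    return np.array([text.split().count(w) for w in vocab], dtype=float)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

centroids = {cat: np.mean([vec(d) for d in ds], axis=0)
             for cat, ds in exemplars.items()}

coded = {doc: max(centroids, key=lambda c: cosine(vec(doc), centroids[c]))
         for doc in pool}
```

In practice a minimum-similarity threshold would route low-scoring documents to an "uncategorized" bucket rather than forcing a match.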

For foreign language documents, we often lead with the LSI-based concept categorization process. Not only do we create categories for documents we believe need to be reviewed, we also create categories for documents that are not responsive. This process typically reduces the collection by 50% or more.

The system also applies language identification tags, flagging each non-English document it finds.
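Language tagging itself can be sketched very simply. A production system would use a trained identifier over much richer features; the tiny stopword lists below are invented purely for illustration.

```python
# Minimal illustration of language tagging by function-word hits.
# The stopword lists are invented and far too small for real use.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "die", "und", "ist", "nicht"},
    "es": {"el", "la", "que", "de", "es"},
}

def tag_language(text):
    """Return the language whose stopwords appear most often, or 'unknown'."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(tag_language("der Vertrag ist nicht unterschrieben"))  # de
```

Tags like these are what let the workflow route only the non-English clusters on to machine translation.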

Only after we recognize clusters of potentially responsive documents do we apply MT, and only to those documents necessary for review. After MT provides an English rendering, we review. From these, the handful of documents believed to be highly relevant can then be human translated so they can be used as evidence.