Tuesday, December 14, 2010

Why Plaintiffs Should e-Discover, What they Should be e-Discovering

By Nick Brestoff

(This article was published in Advocate, the Journal of the Consumer Attorneys Associations for Southern California, Volume 37, No. 12 (December 2010) Copyright © 2010 Consumer Attorneys Association of Los Angeles.

All rights reserved.  Reprinted with permission.)

In the fall of 2008, I tried a case before the Hon. Ann I. Jones.  When I saw her at a recent conference, she didn’t remember me, which I expected, and I took no offense because she at least remembered the case.

But it’s what she said about e-discovery that I remember.  I asked Her Honor if, because of budget cuts, she was seeing more criminal cases in her courtroom.  She told me no, because she was currently sitting in Complex Civil.  I had been there.  Then you must be seeing a lot more Law & Motion regarding e-discovery disputes, I surmised.  “No, not really,” she said.  That shocked me.  I asked why.  She said she didn’t know, but that she suspected that the lawyers were agreeing not to engage in it, probably because of the expense.

This exchange led me to write this article.  How could anyone, especially plaintiffs’ counsel, agree not to engage in e-discovery?  We live in a world of electronically stored information (ESI).  We are swimming in it.  You use it every day, when you write documents and send e-mails.  It is our present, and it is our future, too, isn’t it?

There are two primary characteristics of ESI:  its volume, which is already immense, and the rate at which it is growing, which is exponential.

This much is obvious by now.  Businesses generate mountains of data every day, and their use of ESI is clearly growing each year, and growing fast.  But ESI is not paper.  In fact, ESI is very different:

  • It comes in many forms, e.g., e-mails, voice-mail messages that appear as e-mails, and spreadsheets, to name only a few. 
  • It is known by many file extensions, some of them well known, e.g., .doc, .wpd, .pdf, .jpg and .tif, and some of them not so well known, such as .docx, .pst, .nsf, .pif, and .gif.  
  • It is stored in a wide variety of devices, such as hard drives in desktops and laptops (both in the office and at home), flash drives (including the backup flash drive you probably have), the digital memories in machines that photocopy and scan hard copy documents, and cell phones; it’s in “the cloud,” and so on.

So it is not hard to believe what the academics have been telling us:  ESI is somewhere between 95% and 99% of all the information that we generate.

Is ESI a problem?  No.  It’s not a problem.  It is a blessing.  This particular innovation is doing for the world, and for litigation in particular, what Gutenberg did for the spread of knowledge with his invention of the printing press.  The fact that ESI is now so much of all the information we create is a testament to how broadly it has been accepted.

There are at least three reasons for this:  (1) Personal computers have become almost ubiquitous:  There are over a billion of them in the world today.  (2) The price per unit of memory has “declined by an average of 32 percent per year.”  (Source:  Congressional Budget Office, “The Role of Computer Technology in the Growth of Productivity,” Ch. III, Prices of Computers and Components (May 2002).)  (3) The speed of processing data has gone way up.

What’s the bottom line?  Because information is stored electronically, it is more accessible now than ever before.

You should like the fact that ESI is accessible, because that’s where you’ll find the facts that support your cases, and because there have been both legal and technological advances permitting you to review the information in the defendants’ possession, custody or control.  If data is stored electronically, you can use computer-based tools to access and search that data electronically.  Count your blessings:  Having to learn about and use ESI is a darn sight better than being told, “Of course, you can review those documents you requested; you can find them in, oh, I think it was something like 10,000 boxes.  We put them in a number of conveniently located warehouses for you.  Ready for those addresses?”

But the plaintiff’s bar appears to be ignoring ESI.  In a recent article (Deutchman, L., “E-Discovery Sanctions:  Not for Defendants Only,” Law Technology News (September 16, 2010)), the author wrote:  “the plaintiff[] bar knows far, far less about e-discovery than does the defense bar.”  Why?  There are many reasons:

  • The defense bar has taken notice and educated itself about ESI, and they have been assisted by software vendors who saw that corporate America and its outside counsel could pay their bills.  Why has the defense bar taken the initiative?  In 2005, a jury awarded a single plaintiff in a sex discrimination case $9 million in compensatory damages and $20 million for punitive damages because the defendant spoliated potentially relevant evidence and the court decided to give the jury an “adverse inference instruction.” (Zubulake v. UBS Warburg LLC, 2004 U.S. Dist. LEXIS 13574, 2004 WL 1620866 (S.D.N.Y. July 20, 2004)).  And in that same year, a jury awarded $600 million in compensatory damages and $850 million in punitive damages in a securities fraud case, largely because the defendants failed to properly handle e-discovery and also went to trial facing an adverse inference instruction.  (See Coleman (Parent) Holdings, Inc. v. Morgan Stanley & Co., Inc., 2005 WL 679071 (Fla.Cir.Ct. Mar. 1, 2005) (subsequently reversed on other grounds).)
  • Plaintiff attorneys generally believe they have little digital evidence to produce, and that the process is expensive.  As a result, they have failed to learn how to identify, collect, preserve, and produce digital evidence; but more than that, they have failed to learn how to ask for it or how to analyze the data when it comes in.

Is it too expensive?  Is ESI overwhelming?  Should you agree with defense counsel that neither side will request ESI from the other?

            No.  “Electronic evidence is frequently cheaper and easier to produce than paper evidence because it can be searched automatically, key words can be run for privilege checks, and the production can be made in electronic form obviating the need for mass photocopying.”  Zubulake v. UBS Warburg, 217 F.R.D. 309 (S.D.N.Y. May 13, 2003).  It is a mistake to turn data into the near-paper equivalents of image files like TIFFs and PDFs.  Without further manipulation, such files are not searchable and, to make matters worse, the costs are more than three times greater than with native files.  (See Diane E. Barry, Esq., Madison Spach, Jr., Esq., and Hon. James L. Smith (J., Ret., JAMS ADR), Keeping Up with E-Discovery, National Business Institute, at 4 (September 2010).)

To reiterate:  ESI is a blessing.  Instead of being overwhelmed by tens of thousands of pages (in thousands of boxes of documents), computer-based technologies make it possible to search through much, much more than that.  And quickly too.  Not long ago, our favorite search engine ingested 60,000 documents, consisting of 351,000 pages (which is approximately five gigabytes), in about 45 minutes.  After it “clustered” the documents automatically (with “concept search,” clustering may precede key word search), we were searching and finding relevant documents in about an hour.

The cost?  Twice that volume, say ten gigabytes (about 750,000 pages) of basic Microsoft Office-type ESI, would cost between $6,000 and $10,000 to process.  Remember, the costs a plaintiff faces after receiving a defendant’s production are much lower, because the processing and filtering have already been done, and a plaintiff need only pay for concept searching, hosting, and review.  Dealing with 100 gigabytes would scale up linearly, because the charges are per gigabyte, to about $60,000; but how many cases involve seven and a half million pages?
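
For the arithmetic-minded, here is a back-of-the-envelope sketch in Python using the rates assumed above (roughly 75,000 pages, and $600 to $1,000 in processing costs, per gigabyte).  The figures are illustrative, not a vendor quote:

    # Back-of-the-envelope e-discovery cost estimate, using the
    # illustrative per-gigabyte rates from this article.
    PAGES_PER_GB = 75_000        # rough equivalence for Office-type ESI
    COST_PER_GB = (600, 1_000)   # assumed processing cost range, in dollars

    def estimate(gigabytes: float) -> None:
        pages = gigabytes * PAGES_PER_GB
        low, high = (gigabytes * c for c in COST_PER_GB)
        print(f"{gigabytes:>5.0f} GB ~ {pages:>10,.0f} pages ~ ${low:,.0f} to ${high:,.0f}")

    for gb in (5, 10, 100):      # costs scale linearly with volume
        estimate(gb)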

In fact, now you can see that ESI is a real time-saver.  To get your arms around 750,000 pages in a year, you would have to review about 2,000 pages every single day without stopping.  And what is the value of that time?

This is why plaintiffs’ counsel need not feel overwhelmed when they are on the receiving end of an electronic document production.  ESI is a Treasure Chest.  The keys to open it do exist, and they have come way down in price. 

So, by learning about the tools and how to use them; with some good old-fashioned persistence; and with a healthy dose of curiosity, you can use ESI to win your case.  When defense counsel makes or agrees with a suggestion that both sides not request ESI from each other, you are right to be nervous.  ESI can be expensive, depending on the volume and type, but you must know that any such offer from the defense is self-serving.  If you agree to limit yourself to documents kept as paper, you are acting like the drunk who is looking for his car keys only by searching under the streetlight.  If your vision is limited to less than 5% of the available information, the defense will happily lead you to miss the other 95%.  

ESI offers too much to ignore.  Indeed, there are many reasons to focus on it.

            Let’s skip the problem of learning how to ask for e-discovery.  (Hint:  Always ask for native files with metadata intact.)  Let’s start with the idea of finding the evidence that supports your client’s case when the data arrives.   That’s a key part of your job, to play “Sherlock Holmes.”  Can you do this, even though you may be presented with ten gigabytes of data?  Yes, you can.

In fact, you can find that “smoking gun” e-mail, an e-mail which turns a defense contention into a lie.  A jury will devour such evidence because it was made so contemporaneously with the conduct in question that it takes on the mantle of being the concrete truth.  For a plaintiff, such an e-mail has the potential of destroying the credibility of a key defense witness.  Any piece of evidence that can do that is indeed a powerful weapon.  You want this.

In fact, the entire process can be a friend to a plaintiff’s counsel who understands this new world and is persistent.  Take, for example, the case of Doppes v. Bentley Motors, Inc. (2009) 174 Cal.App.4th 967.  In this “lemon law” case about a Bentley with an oil-wax stink, the appellate court held that the trial court abused its discretion when it denied plaintiff’s request for terminating sanctions against Bentley.  Imagine that — an appellate court ordering the imposition of terminating sanctions.  It happened, and the Supreme Court denied review.

The lesson is in why it happened.  In Doppes, defendant Bentley violated four discovery orders or discovery referee determinations prior to trial, such that the trial court was persuaded to give an adverse inference instruction.  But then, during trial, plaintiff’s counsel discovered impeaching e-mails and the deletion of potentially relevant e-mails, so that Bentley’s discovery violations were found to have been worse than had been previously known.  Still, the trial court would not grant terminating sanctions and instead gave another adverse inference instruction.  On appeal (after jury verdicts in favor of plaintiff), the appellate court affirmed the verdicts on two causes of action, and then ordered terminating sanctions and a default judgment on a cause of action which the jury had rejected — for fraud.  Not only that, the appellate court ordered an increase in the amount of attorney fees the trial court had awarded to plaintiff for having to make the successful discovery motions (from $344,600 to $402,187), with the potential for more on remand.

            But take heed:  Sanctions can be visited upon plaintiffs, too, and that is why it is critical for you to know when your identification, collection, and preservation duties arise, so that you are not caught spoliating potentially relevant evidence.

As you may know, there may be no statutory obligation for parties to preserve evidence, but the law is clear:   parties must preserve all potentially relevant evidence when the facts and circumstances make it reasonable to expect that a dispute will ensue.  (See Cedars-Sinai Medical Center v. Superior Court (1998) 18 Cal.4th 1.)  And while there is no separate tort for destroying (spoliating) evidence in California, our discovery statutes authorize a range of penalties for “misuse of the discovery process.”  (Id. at 12.)

            So, don’t be surprised to learn that your preservation obligation attaches to you before a complaint is filed.  It attaches to you because you are an attorney and you know when your client has authorized you to write and file a complaint.  Once a client engages plaintiffs’ counsel with the intention of moving forward with pre-litigation negotiations or simply filing the complaint, the preservation obligation attaches.  Yes, a plaintiff has an obligation to preserve ESI, and your duty is to explain this obligation to your client; to find out about your client’s IT environment and any document destruction policies and practices; to put a “hold” or “suspension” on these policies and/or practices as to the relevant custodians, subject matter, and time frames; and to do so in writing, and then to monitor and update the “hold” as the case develops.

Do you want defense counsel to ask you for ESI when you did not take reasonable steps to preserve it?  Do you want to be subject to sanctions for spoliation?  No.  But when you follow through on your obligations, you will want to hold the other side accountable.  Defense counsel has the same obligations and they can also attach before they receive the complaint.  The general rule is this:  the preservation duty attaches when litigation can be reasonably anticipated.

So plaintiff’s counsel should never ignore ESI:  it’s where to find most of the potentially relevant evidence, whether the case is large or small.  Here are three examples:  First, suppose you have a car crash.  The injuries might be large or small.  Either way, don’t you want to know which one of the drivers was “texting” just before the crash?  Or suppose you have a train wreck, like the tragic and fatal Chatsworth crash, in which a Metrolink commuter train collided with a Union Pacific freight train.  The fact that the engineer was texting just before the crash was critical evidence, wasn’t it?  Or suppose you are plaintiff’s counsel in a “slip and fall.”  The ESI is in the surveillance tapes, the e-mail messages the plaintiff or any witnesses sent to friends or family, the medical ESI, and the photos or statements posted on social networking websites.

It is for this reason that ESI cannot be ignored.  How can you practice competently (and comply with Rule 3-110 of the Rules of Professional Conduct) if you ignore ESI, where such a high percentage of potentially relevant evidence may reside?

Are we consigned to be swamped by this tsunami of ESI?  No.  Our technologies may have succeeded in making ESI ubiquitous, but it is also true that our technologies make it possible for us to search it.  We can find those needles in this enormous electronic haystack.  We adjusted when published opinions were turned into electronic databases, and we can adjust to ESI, too.

But we must learn to use new tools.  In the mid-1970s we learned to search a case law database with key words, and we are used to doing this.  But a dataset consisting of e-mails and spreadsheets is quite different; it is unstructured and contains “metadata.”  New tools are needed – and they exist.

But, as I said, ESI is different.  Let me dwell on this point for a moment, because we are fooling ourselves into thinking that key words will do the trick.  Not so.  Let me test you.  Here’s the proposition:  Key words using Boolean connectors will find only about 25% of the relevant documents.  True or false?

True!  One of the founders of the “information retrieval” field, M. E. Maron (now professor emeritus, UC Berkeley) reported as long ago as 1985 that attorneys were over-estimating the efficacy of their searches.  The attorneys thought they were identifying 75% of the relevant documents, but they were wrong:  they were finding only about 20%.  More recently, studies show that key word searches are, even today, only a little more successful.  Tomlinson and others reported in 2008 that Boolean searches identified only 22% of the relevant documents, while Oard and others reported in 2009 that Boolean searches pinpointed only 24% of the relevant documents. (These reports come from the Legal Track of the Text Retrieval Conference (TREC), which is administered by the U.S. National Institute of Standards and Technology.)

            Is there an answer?  Yes.  First, recognize that ESI is data.  Agree with defense counsel on the ways they want to receive your data (if they want print-outs, say yes, realizing that as such, paper is not searchable electronically), but seek ESI from the other side in its native form, with metadata intact.  When you receive it (and you will likely receive gigabytes of it), treat it as data.  “Hash” the data (which means to give each document a unique digital fingerprint based on its content, so that exact duplicates can be identified), and process out the exact duplicates (“de-duplication”) and system files.  Then use software tools that go beyond key words and the Boolean search techniques, e.g., “concept search.”  Then ask the concept search engine for “more like these,” and iterate the process.  Next, promote the “clustered,” potentially relevant documents to a database, and then use key words and Boolean connectors.  Cull the data for eyes-on review down to the point where it is manageable.  This is the way to reduce the cost of e-discovery, always remembering that the goal is to find admissible evidence that you can use in deposition, mediation, or trial.
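
To make the hashing and de-duplication steps concrete, here is a minimal Python sketch.  It assumes the collection is simply a folder of loose files; commercial processing tools do far more, but the core idea is the same:

    # Hash-based de-duplication: each file gets a content fingerprint;
    # files with identical fingerprints are exact duplicates, and only
    # one copy moves forward for review.
    import hashlib
    from pathlib import Path

    def fingerprint(path: Path) -> str:
        """Return a SHA-256 hash of the file's bytes."""
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def dedupe(collection_dir: str) -> list[Path]:
        seen: set[str] = set()
        unique: list[Path] = []
        for path in sorted(Path(collection_dir).rglob("*")):
            if path.is_file():
                digest = fingerprint(path)
                if digest not in seen:   # first time we see this content
                    seen.add(digest)
                    unique.append(path)
        return unique                    # exact duplicates culled out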

            But to persuade you that choosing to grapple with ESI is worthwhile, let me approach the issue from four different perspectives.  In the first scenario, a plaintiff fails to preserve ESI and suffers the consequences.  In the second scenario, a plaintiff mishandles requests for ESI and the defendants happily under-produce the documents that the plaintiffs were seeking.  In the third scenario, I present three instances where the plaintiffs discover that the defendants have failed to preserve potentially relevant documents, and they make the most of it.  In the final scenario, I describe how plaintiffs used “concept search” to find the “smoking guns.”

            Scenario 1.  A Plaintiff fails to preserve evidence.  In Medcorp, Inc. v. Pinpoint Technologies, Inc., 2010 WL 2500301 (D.Colo. June 15, 2010), the plaintiff willfully destroyed 43 hard drives containing information relevant to the dispute by failing to “stop the presses” on its ordinary recycling schedule, though without a motive to destroy evidence, because at least some of the information that was lost was re-produced.  Although finding that a terminating sanction would be too severe, the district court was tough nevertheless, and decided to award reasonable attorney fees and costs in connection with the motions to compel and/or for sanctions ($89,365.88), and to issue an “adverse inference” instruction, to the effect that the jury could infer from the evidence that the lost evidence was favorable to the defendant.  Ouch.

            Scenario 2.  Defense counsel snookers a plaintiff.  In a different case, the plaintiff requested documents from the hard drives of 26 employees.  The defendants used de-duplication to narrow the documents to be produced down from 423,835 to 129,000, and then used search terms to narrow the actual production down to 4,000 documents.  The plaintiff objected, and wanted more, but the magistrate dismissed the plaintiff’s objections, stating “To the extent Plaintiff contests the adequacy of the search terms, it has not set forth an alternative search methodology; moreover, no specific challenge to the search terms has been brought and briefed before the Court.”  (In re CV Therapeutics, Inc. Sec. Litig., 2006 WL 2458720 (N.D.Cal. Aug. 22, 2006).)

            Scenario 3.  Plaintiffs uncover a defendant’s spoliation.  Laura Zubulake (pronounced “Zoo-boo-lake”) was a highly compensated executive who worked for UBS Warburg.  In April of 2001, UBS Warburg knew that Zubulake was contemplating a sex discrimination lawsuit.  She served an EEOC complaint in August.  She filed her lawsuit in February of 2002.  UBS Warburg failed to begin preserving documents until August of 2001, after receiving the EEOC complaint, but then botched the process after that, initially giving only oral instructions to key employees telling them not to delete or destroy materials that might be potentially relevant, and failing to mention that the preservation efforts applied to ESI as well as to paper documents.  Then, when a follow-up memorandum was issued, it failed to mention back-up tapes.  To make a longer story shorter, Zubulake discovered that key employees had deleted relevant e-mails, and that e-mails on the back-up tapes were lost because the tapes were overwritten.  The court not only imposed monetary sanctions but finally agreed to issue an adverse inference instruction, which indicated that the jury could infer that the lost evidence would have been favorable to Zubulake and harmful to UBS Warburg.  (See Zubulake v. UBS Warburg, 229 F.R.D. 422, 424 (S.D.N.Y. 2004) (the fifth in a series of seminal pre-trial e-discovery decisions by the Hon. Shira Scheindlin).)  The result:  in a single-plaintiff sex discrimination case, the jury awarded $9 million in compensatory damages and $20 million in punitive damages.

            In the second case, Magana v. Hyundai Motor Am., 2009 WL 4070952 (Wash. Nov. 25, 2009), the plaintiffs won a terminating sanction.  In response to discovery requests, Hyundai’s in-house counsel had searched for responsive documents, but only in its own legal department.  In the end, the trial court found that (1) the parties had not agreed to limit discovery in this way; (2) the defendant falsely responded to plaintiff’s request for production of documents and interrogatories; (3) the plaintiff was substantially prejudiced in preparing for trial; and (4) the potentially relevant evidence was lost forever.  The trial court considered lesser sanctions, but concluded that the only just remedy was the entry of a default judgment, for $8 million.  The appellate court reversed, but the Washington Supreme Court reinstated the trial court’s ruling and, in addition, awarded attorney fees to the plaintiff pertaining to both the trial and appellate proceedings.

Is there a similar case in California?  Yes.  In OZ Optics Limited v. Hakimoglu (2009) 2009 Cal.App. LEXIS 2952, an executive ran a “scrubbing” program on a company laptop prior to handing it over, which a forensic examiner was able to detect.  A $90,000 sanction was ordered.  The trial court refused to give a terminating sanction but only because there was no evidence that a claim or defense had been lost.

Scenario 4.  Plaintiffs find the smoking e-mails.  In a stock option back-dating case, a concept search pointed to documents whose common denominator (pattern) was the phrase “Let it roll.”  Now, technology plus a little curiosity is a powerful combination.  Why would key words associated with “back-dating” surface a cluster of documents related to “Let it roll”?  Remembering that concept search is designed to seek out hidden meanings, the consultants involved in the case called the “Let it roll” group to the attention of the litigators.  Sure enough, when these documents were reviewed, this phrase turned out to be the “go” signal the executives were using to authorize the back-dating.  Unless a power key word searcher made a lucky guess, the “Let it roll” documents – the key needles in a very large haystack – would have gone undetected.  As you might expect, the case (which is confidential) settled.

Most of these e-discovery decisions have been made in federal court cases over the past five to seven years, but the plaintiffs’ bar must not ignore them, despite their strong preference to litigate in state court.  After all, California’s e-discovery statutory changes, effective on June 29, 2009, were modeled on the changes to the Federal Rules of Civil Procedure, which became effective on December 1, 2006.  The federal cases will be influential.       

But in California, e-discovery issues will arise more quickly than in federal court.  You have to start thinking about e-discovery when you are writing the complaint because plaintiffs’ counsel must be prepared to discuss “any” issues relating to the discovery of ESI, pursuant to Rule 3.724(8) and (9) of the California Rules of Court (effective August 14, 2009), at least 30 days before the Case Management Conference.  This deadline means that you must first address the issues with your client within the first 30 to 60 days after the complaint is filed, if not sooner.  And if you (or the other side) come to the “meet and confer” process or CMC unprepared, and so fail to participate in good faith, then you (or the other side) are engaging in a discovery abuse.  (Code of Civil Procedure section 2023.010(i); see Liberty Mutual Fire Ins. v. LCL Administrators (2008) 163 Cal.App.4th 1093, 1104 (repeatedly ignoring “meet and confer” letters is a separate ground for discovery sanctions).)

The lesson is to come prepared with a list of custodians (yours and theirs); the search terms you wish to propose; the time frames you care about; and the format(s) you want the data in when it is produced.

What’s the bottom line?  ESI is here.  Compliance with the rules pertaining to ESI is mandatory.  Show the defendants you know what ESI is about.  And make it work to your advantage.

# # #

  • After graduating with a B.S. in engineering systems from the University of California at Los Angeles (U.C.L.A.), Nick Brestoff earned an M.S. in environmental engineering science from the California Institute of Technology (Caltech) and graduated from the Gould School of Law at the University of Southern California (U.S.C.).  During his litigation career, Mr. Brestoff litigated business tort, employment, environmental, and other civil disputes in state and federal court, winning 8 figures in one federal court case and succeeding in his only trip to the California Supreme Court.  He is currently the Western Regional Director, Discovery Strategy & Management, of International Litigation Services (www.ilsTeam.com).  Mr. Brestoff’s email address is nbrestoff@ilsTeam.com.

GIGO and MEGO in e-Discovery

Posted by Douglas Forrest on Dec 13, 2010

GIGO – Garbage In, Garbage Out – is a seminal axiom of all data processing which applies with full force in the realm of e-discovery.  But, in e-discovery, there is another wrinkle, i.e., valid data that washes out prematurely (or, beyond the scope of this entry, is never collected in the first place).  Yes, I’m talking about what can happen before data is fed into programs such as eCapture or Clearwell, viz., forensics and handling in forensic tools such as EnCase.

Now, before you claim technical incapacity or that the very topic induces MEGO – My Eyes Glaze Over – hear me out.  As to MEGO, just snap out of it; this could be important: what you don’t know can hurt you.  And, with respect to forensic technical expertise (or the lack thereof), passing the EnCE exam is not a prerequisite to gaining valuable insights into current issues in the technical forensic community, an understanding which may stand you in very good stead someday.

It is in furtherance of gaining such insights and understanding that I recommend a few blogs produced by true stalwarts of the forensic community whom I know from my past tenure at Guidance Software.

Geoff Black, formerly a very much hands-on Regional Manager with Guidance’s Professional Services Division and now Director, High Tech Investigations, at a Fortune 100 company, blogs at geoffblack.com.  One recent post addressed new developments in matching digital photos to the specific digital camera that took them (think matching a bullet to the gun that fired it).

Jon Stewart, formerly Director of Development at Guidance, the founder of Lightbox Technologies, Inc., and a programmer’s programmer, blogs both at Lightbox and at codeslack.blogspot.com.  Jon has addressed more squirrelly forensic data anomalies than there are reruns on TBS.

Lance Mueller, formerly Senior Director IT & Corporate Security at Guidance and now a Computer Forensic and Security Consultant as well as a Senior Instructor at the US State Department, publishes a digital forensic blog at forensickb.com, where a recent post presented a decision tree for forensic hard drive imaging with volatile data collection.

Now, while much of the discussion at these blogs is either EnCase-specific, highly technical, or both, even a non-techie reading them can gain a new appreciation of the complexities and danger zones which can lurk behind blanket representations of forensic services.

Tuesday, December 7, 2010

A Strategy to Sample All the ESI You Need

By Nick Brestoff, M.S., J.D.

Reprinted with permission from the December 6, 2010 issue of Law Technology News © 2010 ALM Media Properties, LLC. Further duplication without permission is prohibited. All rights reserved.

I was re-reading the EDRM section on “validation of results” when it hit me. Most of us have been so busy mining the data from the mountain of it that we just received that we have been missing the other mountain of data available to us, the mountain we didn’t ask for. You know the adage: if you don’t ask, you don’t get. So I’m talking about the ESI we didn’t ask for and didn’t get.

I had been reading the last paragraph of the EDRM Search Guide, Section 9.5. You know the one: “Sampling and Quality Control Methodology for Searches.” (See http://edrm.net/resources/guides/edrm-search-guide/validation-of-results.)

“Sampling.” There’s a word that most attorneys don’t grasp; that is, unless they had a statistics class (and remember some of it) or pay close attention to the results of political polls, when the sample size is usually about 900 to 1,200 randomly selected individuals. Amazingly enough, poll results seem to be pretty good estimates for whole counties, states, or the entire nation. The size of the sample matters, but the size of the population doesn’t. (I’ll skip the math.)
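
For the curious, the math I’m skipping is short. A sketch in Python, under the usual simple-random-sample assumptions: the margin of error depends on the size of the sample, not the size of the population:

    # Margin of error for a proportion from a simple random sample.
    # The population size never appears in the formula, which is why
    # a poll of ~1,000 people works for a whole county or nation.
    import math

    def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
        """95% margin of error (z = 1.96) for a sample of size n."""
        return z * math.sqrt(p * (1 - p) / n)

    for n in (900, 1_000, 1_200):
        print(f"n = {n}: +/- {margin_of_error(n):.1%}")
    # n = 1,000 gives roughly +/- 3.1%, for any population size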

The word “sample” is there in the rules. It was added when the Federal Rules of Civil Procedure were amended to provide for the discovery of electronically stored information (ESI). It shows up in the rules governing requests to produce documents, Rule 34(a)(1): “A party may … request … to inspect, copy, test, or sample … (A) … electronically stored information ….” In the case law preceding this amendment to Rule 34, sampling was used in the context of statistical sampling backup tapes to see if they contained potentially relevant information. See Zubulake v. UBS Warburg LLC, 217 F.R.D. 309, 324 (S.D.N.Y. 2003).

Of course, such sampling must be within the scope of Rule 26(b), and that means that the ESI can be “any nonprivileged matter that is relevant to any party’s claim or defense …,” and “need not be admissible at trial if the discovery appears reasonably calculated to lead to the discovery of admissible evidence.” (Italics added.)

So, the rules allow us to use sampling on any ESI that “appears reasonably calculated to lead to the discovery of admissible evidence.” So what? You can’t use sampling on the data you didn’t receive. What light bulb went on?

First, back to the clue. It was the third and last paragraph of Section 9.5 of the Search Guide. It reads, in part: “In general, a sampling effort takes into consideration broad knowledge of the population, and [devises] an unbiased selection [of the sample]. In most cases, the party performing the sample has some knowledge of the population and there is one party with that knowledge. In contrast, most litigations where there is an adversarial relationship between a Requesting Party and a Producing Party, and since only one party has access to the underlying population of documents, agreeing on a sampling strategy is hard. An effective methodology is one that would require no knowledge of the data, but is still able to apply random selection process central to the effectiveness of sampling.” (Italics added.)

Ah ha. “Adversarial relationship.” “Sampling strategy.” Several points hit me at almost the same time:

· the frank recognition of the adversarial relationship;

· when you’re on the side of the Producing Party, you’re the only one with access to the ESI; and

· a sampling strategy is in play, notwithstanding the Sedona Cooperation Proclamation (http://www.thesedonaconference.org/content/tsc_cooperation_proclamation/proclamation.pdf).

When you’re on the side of the Requesting and (eventually) Receiving Party, of course, you’re very busy. You’re likely to be immediately swimming in the ESI you just received. This data has been produced, sans privileged documents, and the task ahead is to search it for documents that support either a claim or a defense. The act of swimming in that ocean of data takes concentration. But that focus may also lead to tunnel vision.

I asked myself to remember what goes on when you’re on the side of the Producing Party. What have you been through when you’re wearing that hat? The answer is that you’ve been through a culling process that stripped out, among other things, exact duplicates (de-duping), system files (de-NISTing), and documents covered by the attorney-client and work product privileges.

But you and others on the e-discovery team may have also created folders with data that was “probably” irrelevant or “not responsive,” such as spam e-mails with Viagra ads. For quality control purposes, sampling may have been done, so that an expert could show that both the process and the sampling protocols were reasonable.

In the end, some judgment had to be exercised to produce the nonprivileged and relevant matter. But that also means that the “probably irrelevant or nonresponsive” data was not produced. I wondered about that “probably.”

And in whose eyes? Does a Requesting Party ever seek to learn the sampling strategy used by the Producing Party? What about the sampling parameters? What if the sampling protocol is loosey-goosey? What if the criterion for sampling by the Producing Party is a confidence level of only 90%, with an error factor of 10%? What if documents were misclassified as not relevant or not responsive when in fact they were relevant or documents which might lead to the discovery of admissible evidence? Wouldn’t you want to know?

Was the Producing Party’s sampling process transparent in any way? If this issue had been raised during the Rule 26(f) “meetings and conferences,” yes; but thinking back on that last paragraph from the EDRM Search Guide, I realized that Requesting Parties almost never ask the Producing Parties to disclose their processes, including the software they’ve used or their sampling protocols.

These considerations led me to think of propounding a second wave of requests, immediately after receiving documents from the initial request. The second wave would ask the Producing Party to exclude the exact duplicates, the system files, and the documents covered by the attorney-client or work product privileges, but then to produce all of the other ESI (in native format) that was collected from the appropriate custodians, during the appropriate timeframes, and regarding the stated issues in the case, but which was not previously produced.

This additional step might involve a second mountain of data, but then you have control of it, and you can search it using your own statistical protocols. In other words, you might treat this data as if it consisted of backup tapes. Most of the data will prove to be not relevant. You could search all of it. But if you sample it first, using a confidence level of 99%, with a 1% error factor, you may find nothing; if so, then perhaps there is nothing to find.
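
Here is a sketch of the standard sample-size arithmetic, again under simple-random-sample assumptions (p = 0.5 is the most conservative choice). Note how much more a 99%/1% protocol demands than a loosey-goosey 90%/10% one:

    # How big a random sample do you need from the "second mountain"?
    import math

    def sample_size(z: float, error: float, p: float = 0.5) -> int:
        """Sample size for a proportion at z-score z and margin `error`."""
        return math.ceil(z**2 * p * (1 - p) / error**2)

    # 99% confidence (z ~ 2.576) with a 1% error factor:
    print(sample_size(2.576, 0.01))   # ~16,590 documents
    # 90% confidence (z ~ 1.645) with a 10% error factor:
    print(sample_size(1.645, 0.10))   # ~68 documents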

But then again, your sampling may turn something up, and then you’ll want to search the “second mountain” more thoroughly. Perhaps in the data that you didn’t receive in the first place you will find the gold that you seek.

Thus, it may be vital to realize that somebody on the other side of the case decided that some amount of ESI was not relevant or not responsive, and so did not produce it. Here are the three easy steps: (1) during the Rule 26(f) process, ask the other side to disclose its processes and statistical sampling protocols; (2) after receiving data from the Producing Party, ask for the ESI that was not produced (not all of it; exclude the duplicates, the system files and the privileged data), and then (3) use your own sampling protocols on that data when you get it.

It takes curiosity and persistence to operate effectively in this new world of e-discovery. And that includes remembering to ask for the ESI you didn’t get.

# # #

Nick Brestoff, M.S., J.D. is the Western Regional Director for Discovery Strategy & Management at International Litigation Services (www.ilsTeam.com), based in Los Angeles. E-mail: nbrestoff@ilsTeam.com. He gratefully acknowledges comments on the draft by e-discovery attorney Helen Marsh.

Monday, November 22, 2010

Client Waives AC Privilege via Emails, Chats & Blogs … Oh My!

by Nick Brestoff, M.S., J.D.

How ironic. You’d think that attorneys at Electronic Frontier Foundation (“EFF”) would draw the line at letting a client talk about her conversations with her counsel to the point where the attorney-client privilege was held to have been waived.

But in Lenz v. Universal Music Group, No. 07-03787-JF (N.D.Cal. Oct. 22, 2010), that’s what happened. Plaintiff Stephanie Lenz, represented by attorneys at EFF, had sued Universal Music Group (UMG), alleging that UMG had harmed her First Amendment free speech rights when UMG issued a notice to YouTube demanding that it “take down” a 29 second video of Lenz’s toddler dancing to Prince’s “Let’s Go Crazy.”

While the lawsuit was pending, Lenz used e-mails, Gmail Chat, and a personal blog to repeatedly discuss her conversations with her attorney about the case. For example, she said that the lawsuit was an opportunity for EFF to “get their teeth into UMG” for sending takedown notices. As a result, U.S. Magistrate Judge Patricia Trumbull granted UMG’s motion to compel further discovery regarding Lenz’s motives for bringing the action.

Then, in her chats, she revealed legal strategies, including EFF’s plan that it was using her case to clarify a ruling in a different case. Judge Trumbull granted UMG’s motion for further discovery on that subject, too.

Finally, in her blog, Lenz spoke of her conversations with counsel pertaining to certain specific factual allegations. You know what happened. The court held that she had voluntarily waived her privilege in this regard as well.

The lesson here: All clients need a lecture about using e-mails, chats, blogs, and any other form of social media to talk about confidential attorney-client communications. As in, “Don’t.” As this case demonstrates, even the EFF now knows that, in litigation, too much freedom can be a dangerous thing.

Nick Brestoff, M.S., J.D. | Western Regional Director | Discovery Strategy & Management
International Litigation Services | www.ilsTEAM.com | nbrestoff@ilsTEAM.com | (213) 674-4334

Thursday, November 4, 2010

e-Discovery: Proportionality, Technology and Practice Standardization

Posted by Douglas Forrest on Nov 4, 2010

 

Principle 6 of the just-released Sedona Conference Commentary on Proportionality in Electronic Discovery provides that:

Technologies to reduce cost and burden should be considered in the proportionality analysis.

While most of the provided commentary (parties should meet and confer, etc.) will be familiar to e-discovery adepts, there is some that is more novel (clue: no supporting footnotes or citations), viz.,

Parties and law firms that are involved in a significant amount of electronic discovery may choose a standard tool that meets their overall needs. The fact that the standard tool is not the best fit for an individual case should not be held against the firm or the party unless it is conspicuously inadequate for the case, as might happen where the volume of information is unusually high. Parties and law firms may have to consider other tools for cases that exceed the capacity of the standard tool. (Italics added.)

A few thoughts:

Except in those still relatively uncommon instances where parties are hosting review platforms themselves, the standard tools chosen by parties that would be relevant here would seem to be those used for identification, preservation and collection. However, the only specific caveats raised relate to the capacity to handle high volumes, which speaks almost exclusively to post-collection processing and review platforms.

The statement that the choice of a standard tool which is not the best fit for a particular case “should not be held against the firm or the party unless it is conspicuously inadequate for the case, as might happen where the volume of information is unusually high” is no safe harbor, but a standard of conspicuous inadequacy could still be rather useful as a bulwark in some cases.

The language could also assist law firms and general counsel offices in making a case not only for selecting and deploying the right standard tools but also, by arguably giving some measure of protection against technological obsolescence, doing so sooner rather than later.

Of course, what is a rule without exceptions? In addition to the capacity caveats, the commentary ends on this note:

While technology may create efficiencies and cost savings, it is not a panacea and there may be circumstances where the costs of technological tools outweigh the benefits of their use.

Sunday, October 10, 2010

E-Discovery Search: The Truth, the Statistical Truth, and Nothing But the Statistical Truth

Posted by Nick Brestoff on Sep 17, 2010

By Nick Brestoff, M.S., J.D.

First published in the ABA E-Discovery & Digital Evidence Journal, Vol. 1, Issue 4 (Autumn 2010)

This article is a call to revisit Rule 26(g)(1) of the Federal Rules of Civil Procedure, which requires attorneys to certify “to the best of the person’s knowledge, information, and belief formed after a reasonable inquiry” that disclosures are “complete and correct.”[1] Given the exponentially growing mountain of electronically stored information (ESI), and the incompleteness and statistical nature of search technologies, which this article will explain, no attorney can honestly so “certify.” One day, this gap, a loophole between the law of yesterday and the technology of today, will cause a monumental waste of judicial, attorney, and client resources.

Most of us know the meaning of a “loophole.” These days, when one seeks a definition, or perhaps an example, we look online and, more often than not, we turn to Wikipedia. According to Wikipedia, “[a] loophole is a weakness or exception that allows a system, such as a law or security, to be circumvented or otherwise avoided. Loopholes are searched for and used strategically in a variety of circumstances, including taxes, elections, politics, the criminal justice system, or in breaches of security.”[2]

Wikipedia mentions the “criminal justice system.” But to this entry we must add our system of “civil justice,” and, in particular, the giant middle of every lawsuit, discovery. As most attorneys are now aware, what used to be thought of as “discovery” is now dominated by e-discovery.

But e-discovery is a hybrid, a confluence of slowly changing laws and rules, on the one hand, and rapidly changing computer-based technologies, on the other. In this dynamic context, which besets every system of justice in the world, loopholes may be expected. Here we explore a rather large disconnect (or loophole) in the U.S. system of justice which comes as a result of the new complexities of e-discovery.

Loopholes

Loopholes can be large or small. In 2005, for example, Wal-Mart proposed a large store in Calvert County, Maryland. Because Calvert County restricted the size of a retail store to 75,000 square feet, Wal-Mart’s executives and attorneys proposed building two separate smaller stores, which, technically speaking, would not have violated the restriction. The plan was controversial, and Wal-Mart later withdrew it.[3] Until Wal-Mart made the proposal, however, this legal loophole went undetected.

One further example will serve to demonstrate that when loopholes are exploited, money – big money — is usually at stake. Ford imports a vehicle called Transit Connect from Turkey, but pieces of its interior are shredded when they arrive in Baltimore to circumvent the 1963 Chicken Tax, which imposes a 25% tariff on imported light trucks. Ford avoids this 25% tariff on its Transit Connects because it does not import these vehicles as light trucks; instead, they are imported as passenger vehicles with rear windows, rear seats and rear seatbelts, and are immediately converted into light trucks when they arrive, by replacing the rear windows with metal panels and by removing the rear seats. This change costs Ford hundreds of dollars, but it saves thousands in taxes.[4]

In the context of e-discovery, lawyers have attempted to exploit what they thought were loopholes right from the start. Examples abound. In one case, for example, when the format for producing ESI was not specified and emails (and only emails) were requested, they were produced, but they were “divorced” from their attachments, which were not produced.[5] In another case, a producing party converted searchable documents into nonsearchable TIFF files before producing the ESI.[6]

These gambits revealed certain weaknesses in the system, and some of them have been addressed. Now, for example, the federal rules provide that when a party is seeking documents from an opposing party, or from a third party pursuant to a subpoena, the requesting party may specify the form or forms of the documents when they are produced.[7] California’s e-discovery statutes also provide that the requesting party may specify the “form or forms in which each type of [ESI] is to be produced,” but, like the federal rules, the requesting party has only an opportunity to specify the forms once.[8]

Even though its growth-rate is prodigious, the hallmark of e-discovery is the immense volume of ESI that must be addressed. In 2003, researchers at UC Berkeley published an update to their study, How Much Information? At that point in time (and now hopelessly outdated), they explained that each year almost 800 megabytes of recorded information was produced per person, and that 92% of that information was stored on computers or a computer-based storage system.[9] Eight hundred megabytes is enough to fill a set of books stacked 30 feet high. Today, if each person generated only 25% more information than in 2003, or 1,000 megabytes, then each person would generate a gigabyte of data per year, and that amount is roughly equivalent to 75,000 pages, if printed.[10] It is easy to imagine that today we generate much more than that. Indeed, it is often said that 98% or 99% of all the information generated today, by everyone in the world, is generated as ESI. Why? Because today the digital universe includes not only servers, desktops, laptops, cell phones, hard drives, flash drives, and photocopy/fax machines,[11] but also data from TV and radio transmissions, telephone calls received as emails, surveillance cameras, datacenters supporting “cloud computing,” and, of course, social networks.[12]

So, in lawsuits, parties and attorneys must often deal not just with gigabytes of data, but with several terabytes of data, and a single terabyte is roughly equivalent to 75 million pages, if printed. Even if a requesting party asks for readily accessible data, meaning data in native format with metadata intact, there is still the problem of how to search through a much, much bigger haystack than lawyers ever faced when, e.g., 10,000 boxes of documents were produced.

Key Words and Boolean Searches

Now, how can anyone get their arms around this much data? They can’t. The volume of data today is far greater than those times when parties attempted to hide the needle in the haystack by producing truckloads, or worse, warehouses full of boxes stuffed with papers. E-discovery expertise is partly the domain of an information technologist and partly the domain of lawyers. The technologist’s approach is to cull the data by removing exact duplicates (de-duping) and system files. Culling will certainly reduce the size of the data set. Now the lawyer’s task is to query that data set with key words and “field” terms, just as they did when searching opinion databases for applicable case law. Because they are familiar with key words, the receiving attorneys include key words describing the subject matter of the dispute and the names of the key players and employees who had “any involvement with the issues raised in the litigation or anticipated litigation.”[13] An oft-used field term is a date or a range of dates.

Indeed, in the context of online legal research, teams of lawyers and other law firm denizens have become “power users” of key words and field terms. It was not always so. Fifty years ago, lawyers relied on their memories and library tables populated with books. Their search technique was non-linear and depended on a more personal skillset. But once the published cases were uploaded and computers could be used to hunt through databases, key words, date ranges, and Boolean connectors (e.g., AND, OR, NOT, term X “within 7 of” term Y, etc.) were deployed. Lawyers have been using this technique for over 35 years.[14]

But now the scope of the data is vastly increased and the problem is different. The problem is different because we are not querying databases of published opinions in which courts use familiar legal terms. In the e-discovery context, we are working within the context of the law, but we are not looking for it. We are trying to find the facts, and we are trying to find them in a mountain of data that is not only enormous, it is contained in numerous places. In this endeavor, opposite sides have different goals, especially because they treat the discovery process as an adversarial adventure, notwithstanding the platitudes spoken about cooperation.

For example, a requesting party may attempt to use key words to over-collect documents. In one recent case, for example, where, pursuant to a stipulated order the defendant had sole discretion to specify search terms, the defendant submitted 400 search terms. Over the producing party’s objections based on cost ($6 million), which the court denied because of the stipulated order, these 400 terms yielded 660,000 documents.[15]

On the other hand, a producing party may attempt to under-produce. They may use key words to narrow the scope of the documents they must produce. In a different case, the plaintiff requested documents from the hard drives of 26 employees. The defendants used de-duplication to narrow the documents to be produced down from 423,835 to 129,000, and then used search terms to narrow the actual production down to 4,000 documents. The plaintiff objected, and wanted more, but the magistrate dismissed the plaintiff’s objections, stating “To the extent Plaintiff contests the adequacy of the search terms, it has not set forth an alternative search methodology; moreover, no specific challenge to the search terms has been brought and briefed before the Court.”[16]

Ah, now there’s a rub. Is there an alternative search methodology? Yes. But before describing it, let’s stay with key words for a moment. The goal, after all, is to use automated, computer-based searches to find as many of the potentially relevant documents as we can. All non-privileged information relevant to a claim or defense must be produced.[17]

But just how successful are key word searches? Test yourself. Here’s the proposition: Key words using Boolean connectors will find only about 25% of the relevant documents. True or false?

True! One of the founders of the “information retrieval” field, M. E. Maron (now professor emeritus, UC Berkeley) reported as long ago as 1985 that attorneys were over-estimating the efficacy of their searches. The attorneys thought they were identifying 75% of the relevant documents, but they were wrong: they were finding only about 20%.[18] More recently, studies show that key word searches are, even today, only a little more successful. Tomlinson and others reported in 2008 that Boolean searches identified only 22% of the relevant documents,[19] while Oard and others reported in 2009 that Boolean searches pinpointed only 24% of the relevant documents.[20] (These reports come from the Legal Track of the Text Retrieval Conference (TREC), which is administered by the U.S. National Institute of Standards and Technology.)
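
The measure behind these percentages is “recall”: of all the relevant documents in a collection, the fraction the search actually retrieved. A toy illustration in Python, with hypothetical document names and counts:

    # Recall: what fraction of the truly relevant documents did the
    # search retrieve?  (The numbers here are hypothetical.)
    def recall(retrieved: set[str], relevant: set[str]) -> float:
        return len(retrieved & relevant) / len(relevant)

    relevant = {f"doc{i}" for i in range(100)}   # 100 relevant docs exist
    retrieved = {f"doc{i}" for i in range(22)}   # Boolean search finds 22
    print(f"recall = {recall(retrieved, relevant):.0%}")   # recall = 22%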

Now for attorneys used to key word searches, these reports are not good news. As previously noted, in the process of “early disclosure” and responding to document requests, an attorney must certify that, “to the best of [their] knowledge . . . formed after a reasonable inquiry,” their response to a document request is “complete and correct.”[21]

Is there an alternative methodology to key words and field terms? Yes. We come to it now: concept search.

Concept Search

What is concept search? Concept search is a way of finding patterns in unstructured data sets. It sounds technical, doesn’t it? Yes, it is. It involves matrix algebra, formulas you don’t want to see (ever), and statistical concepts you don’t want to know about, but will be forced to learn anyway (note: more on this point, later).

Let’s stick with key words for a moment. Key words approach a document collection in a simplistic way; either a document contains the key word (or a variation of it) or it does not contain that word. Let’s say we have only two key words, w1 and w2, for our query, and that we find w1 in document 1, which we’ll call d1, and w2 in document 2, or d2; but we do not find w1 in d2 and we do not find w2 in d1. In the four-square box below, a “1” means that the word in question is present, while a “0” means that the same word is not present:

        d1   d2
w1       1    0
w2       0    1



This simple “picture” is a hypothetical word-document matrix. It is clear that using w1 as “input” will result in d1 as “output,” but not d2. If we use w2 as input, we will get d2, but not d1. But if we are looking for a document with both w1 AND w2, we will get nothing.

But wait. This matrix is too simplistic. It consists of only two key words and only two documents. The documents in our collection, which will likely consist of gigabytes and terabytes of data, are certain to have many more than one word each. Here is the key to understanding what concept search engines do: they find documents through the “co-occurrences” of words that are not used as search terms.

If a picture is worth many words, a bigger matrix should help. You can see what co-occurrence means by looking at the next matrix.


        d1  d2  d3  d4  d5  d6  d7  d8
w1       1   0   0   0   0   0   0   0
w2       0   1   0   0   0   0   0   0
w3       1   1   0   0   0   0   0   0
w4       0   1   0   1   1   1   1   1
w5       0   1   0   0   1   1   1   1
w6       0   1   0   0   0   1   1   1
w7       0   1   0   0   0   0   1   1
w8       0   1   0   0   0   0   0   1





It starts in the upper left hand corner with the simple four square matrix of (w1, w2) and (d1, d2) that we first described. But then this matrix adds more words (w3 through w8) and more documents (d3 through d8).

Let’s begin with w3. It appears in both d1 and d2. When we were considering the four-square matrix, inputting w1 AND w2 did not result in either d1 or d2; it resulted in nothing. In the matrix above, if we input w3, we will get d1 and d2, because it is contained in both documents.

Now look at w4. It is contained in d2 and d4, d5, d6, d7, and d8. Similarly, w5 is in d2 as well as in d5, d6, d7 and d8 (so one less; w5 is not in d4). And so on. Now we can make some observations about our collection (or corpus).

First, note that neither w1 nor w2 appears in any of the other documents, d3 through d8, which is why the w1 and w2 rows contain nothing but “0s” in the columns after d2.

Also, no matter what word we use to query this matrix, will we ever get back d3? No. It has none of the words on the list.

Now let’s look at words w4, w5, w6, w7, and w8. Notice that w4 shows up in d2 and again in d4 through d8. Fine; that word is used frequently. But frequency alone is not the test.

The big idea of concept search is to find documents (as output) that are responsive to a query (using key words as input), based on co-occurrences. As output, we want documents that have key words in them, but also the documents that do not contain any of the key words but which are nevertheless potentially related and, thus, potentially relevant. We are looking for patterns.

In this regard, patterns can be strong or weak. Which document exhibits the strongest pattern? It’s d8. Although d8 does not contain our input key words, w1 or w2, column d8 shares five content words with d2; that is, both d2 and d8 have words w4 through w8 in common. The weakest pattern involves the most documents but the thinnest link: d4 through d8 all share at least one word, w4, with d2, and d4 shares nothing more.

Computers do not understand “patterns.” They go through a process (a series of steps) which eventually leads to a measurable threshold, a cut-off point. To scholars in the field of Information Retrieval, these steps, including the mathematical scissors, are called an “algorithm.” In our simplistic hypothetical, if we want all documents that are potentially relevant, we might choose a cut-off of a single matching co-occurrence, a low threshold. If we want to find a “smoking gun,” we might search again, this time adjusting our process (algorithm) to find only the strongest co-occurrences. In this example, if we demand more than four (4) co-occurrences, the search output would be only d8.
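The cut-off idea can be shown in a few lines of Python, run against the eight-document matrix above (a toy sketch only: real concept-search engines use matrix algebra rather than raw counting, but the threshold logic is the same in spirit):

    # A toy version of the co-occurrence cut-off, run against the 8 x 8
    # hypothetical matrix (real engines use matrix algebra, not raw counts).
    docs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
    rows = {
        "w1": "10000000", "w2": "01000000", "w3": "11000000",
        "w4": "01011111", "w5": "01001111", "w6": "01000111",
        "w7": "01000011", "w8": "01000001",
    }

    def words_in(doc):
        """The set of words a document contains."""
        i = docs.index(doc)
        return {w for w, bits in rows.items() if bits[i] == "1"}

    def related(anchor, threshold):
        """Documents sharing at least `threshold` words with the anchor document."""
        shared = words_in(anchor)
        return {d for d in docs
                if d != anchor and len(words_in(d) & shared) >= threshold}

    print(related("d2", 1))  # low cut-off: d1 and d4 through d8 all qualify
    print(related("d2", 5))  # high cut-off: only d8 shares five words with d2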

See how this works? With concept searching, computers are going through gigabytes and terabytes of data consisting of documents and words, using a strictly mathematical approach.

This search methodology is called Latent Semantic Indexing or LSI. This term is best understood “inside out.” The “Index” part is simple. You have seen indexes before. They are at the end of nearly every book. Indexes indicate which words are on which page. Here, the computer ingests all of the documents and all of the words, and creates an index of each word that is contained in each document. We have just done this with two hypothetical matrices, one with two words and two documents, the other with eight words and eight documents.

What does “Latent” mean? Roughly speaking, it means “hidden.” And “Semantic” means, again roughly, “meaning.”

So the phrase actually describes what we are trying to accomplish: finding the hidden meanings (patterns) in a collection of documents, not only because of the specific words we choose as input, but because of the other words in the documents that contain them, and the “co-occurrence” of those words with words in documents that do not contain our search terms.

Let’s deepen our understanding. As we did with the documents themselves, culling out exact duplicates and system files, let us cull our words. In LSI, we discard articles (like “a” and “an”); prepositions, conjunctions, and common verbs (like know, see, do, be); pronouns (e.g., it, they); common adjectives (like big, late, and high); pointer or frilly words (like thus, therefore, however, and albeit); any words that appear in every document; and any words that appear in only one document. Now we are down to the core words that have semantic value; they have “content.” It is with these words that we form the word-document matrix.

Now we do some “weighting” (think “handicapping”). Some content words appear more than once in a single document. They are given greater weight; the process of giving them more weight is called “local weighting.” Other words show up frequently throughout the entire set; because they are commonplace, they are given less weight. Words that appear in only a small handful of documents may have special significance, so they get greater weight. This is “global weighting.” And there is a scaling step, called “normalization,” which is just like handicapping in golf. Some documents may be long and contain many key words. To keep them from overwhelming the shorter documents, the longer ones are penalized a bit, so that every document has approximately equal significance.
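For readers who want to see these steps in working form, here is a compact sketch in Python using scikit-learn (my choice of toolkit, not the article’s; the vendors mentioned below use their own implementations, and the three “documents” are invented stand-ins). TfidfVectorizer performs the culling, the local and global weighting, and the normalization just described; TruncatedSVD performs the matrix algebra that exposes the latent structure:

    # A compact LSI sketch with scikit-learn (an illustrative toolkit choice).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [  # invented stand-ins for a real collection
        "board approves the option grant at the thursday meeting",
        "let it roll on the thursday grant",
        "lunch menu for the annual company picnic",
    ]

    # Culling (stop words), local weighting (term frequency), global
    # weighting (inverse document frequency), and normalization, in one step.
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(corpus)  # the weighted word-document matrix

    # The matrix algebra: keep only the strongest "concept" dimensions.
    lsi = TruncatedSVD(n_components=2)
    docs_in_concept_space = lsi.fit_transform(X)

    # Documents can now be compared by concept, not by shared key words.
    print(cosine_similarity(docs_in_concept_space))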

Because LSI is mathematical, it is a search engine that “likes” addressing large collections of data. The more words and documents in the set, the better LSI performs at finding documents responsive to a query. And, after a fruitful search puts some documents into a “shopping cart,” a human being can learn from the initial results and iterate the process. With this feedback, the input terms are more focused and the LSI search engine is likely to produce even better results.[22]
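Recall and precision (see note 22) are the two yardsticks for judging those results, and both are simple ratios, easy to compute once a sample has been reviewed. A minimal sketch, with hypothetical document sets:

    # Recall and precision as defined in note 22 (hypothetical sets).
    def recall(retrieved, relevant):
        """Share of all relevant documents that the search actually found."""
        return len(retrieved & relevant) / len(relevant)

    def precision(retrieved, relevant):
        """Share of the retrieved documents that are actually relevant."""
        return len(retrieved & relevant) / len(retrieved)

    relevant = set(range(100))                         # 100 truly relevant documents
    retrieved = set(range(75)) | set(range(900, 925))  # 100 hits; 75 are relevant

    print(recall(retrieved, relevant))     # 0.75
    print(precision(retrieved, relevant))  # 0.75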

LSI was not conceived to address the problem of search in the e-discovery context.[23] But it has found application in the world of e-discovery. Moreover, because many business and governmental endeavors involve more than one language, LSI is useful because it does not pretend to understand anything about the words it is considering. The words are, in a sense, digitized; then LSI creates the word-document matrix, and seeks out the patterns based on statistical co-occurrences. It is therefore as functional with words in Chinese, Korean and Japanese (or Arabic) as it is with words in English. Using LSI, “hot” documents across different languages can be identified. The next step is machine translation, which is not known for precision. So, the step after that is human review. And if certain documents appear to a human to be suitable for use in deposition, in a motion, or at trial, the final step is human translation, so that the translated documents can be certified and offered into evidence.

In the e-discovery context, you have likely seen LSI in action. You just didn’t know what was “under the hood.” Simply put, concept search based on LSI, or a variant of LSI, is now at the heart of programs that are offered by a number of different vendors, each of which has provided different “bells and whistles” to differentiate themselves.[24]

Why is LSI powerful? Because, when LSI is used on unstructured data, such as business communications, LSI returns documents that may be highly relevant that even power key word searching would miss. Here’s an example. In a stock option back-dating case, an LSI-based search returned documents whose common denominator (pattern) was the phrase “Let it roll.” Why return these documents? Remembering that LSI is designed to seek out hidden meanings, the consultants involved in the case called the “Let it roll” group to the attention of the litigators. Sure enough, this phrase turned out to be the “go” signal the executives were using to authorize the back-dating. Unless a power key word searcher made a lucky guess, the “Let it roll” documents – the key needles in a very large haystack – would have gone undetected.

So LSI has proven to be more efficient than key words, even though key words are still used in the queries that are framed. But could you have explained LSI to a court, in case you were challenged by opposing counsel to do so?[25]

It’s All Statistical

Now, finally, we come back around to whether an attorney can honestly sign off on the Rule 26 certification concerning the documents he or she has disclosed or produced. With a new appreciation for what goes into searching a collection for potentially responsive documents, the answer is “no.” We have a loophole. Attorneys are, by rule, being forced to certify to a degree of certainty that just is not there; and they put their licenses on the line when they sign.[26]

Suppose we have collected 100 million documents; how many should be produced? A suitably sized random sample will accurately reflect the number of responsive documents to be produced, no matter how large the set may be.[27] For a confidence level of 95%, with an error of plus or minus 5%, a random sampling of 1,537 documents must be examined. For a confidence level of 99%, with an error of plus or minus 1%, a sampling of 66,358 documents is needed. Thus, “if we have 100 million documents in the unretrieved set, we need to examine only 1,537 documents to determine within 95% confidence that the number of responsive documents in the unretrieved set is within the margin of error. If we find that there are 30 documents that were responsive in the unretrieved set, we can state that we have 95% confidence that the number of responsive documents in the sampled set is between 28 and 32 (rounding up the document count on the high end, rounding down on the low end). Extending that to the 100 million population, approximately 1,951,854 plus or minus 97,593 are responsive in the unretrieved set. [Para.] In the case of a review where errors are expensive (such as a review for privilege), 99% confidence with 1% error condition would require 66,358 samples. If we identify 200 privileged documents in such a sample, you will have 99% confidence that the number of privileged documents in the sample is between 198 and 202 privileged documents. ”[28]
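The sample sizes quoted from the Search Guide match the conservative rule of thumb n = (z / e)², where z is the standard score for the chosen confidence level and e is the margin of error. A short sketch reproduces the figures (this is my reconstruction of the Guide’s arithmetic, not a formula the Guide itself states):

    # Reproducing the quoted sample sizes with n = (z / e)^2, a conservative
    # rule of thumb (my reconstruction of the arithmetic in note 27).
    import math

    def sample_size(z, margin_of_error):
        return math.ceil((z / margin_of_error) ** 2)

    print(sample_size(1.96, 0.05))   # 95% confidence, +/- 5%  ->  1,537
    print(sample_size(2.576, 0.01))  # 99% confidence, +/- 1%  -> 66,358

    # Extending the quoted example: 30 responsive documents in the 1,537 sample.
    estimate = 30 / 1537 * 100_000_000
    print(round(estimate))           # ~1,951,854 responsive documents
    print(round(estimate * 0.05))    # ~97,593, the quoted +/- margin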

Some Proposals and a Grand Conclusion

As previously mentioned, responding attorneys must currently certify that, “to the best of [their] knowledge . . . formed after a reasonable inquiry,” the disclosure or response to a document request is “complete and correct.” But in this digital era, attorneys must face up to understanding some of the math they hoped to avoid (forever) by going to law school, because attorneys are ill-equipped to flatly certify the “completeness” of their disclosures or responses. “[T]he assumption on the part of lawyers that any form of present-day search methodology will fully find ‘all’ or ‘nearly all’ available documents in a large, heterogeneous collection of data is wrong in the extreme.”[29] So how can attorneys vouch for “completeness”? Clearly, attorneys who continue to sign off on Rule 26(g) certifications are over-promising. They are venturing into areas where an expert’s opinion is warranted, if not necessary.[30] If a client is prejudiced when a court agrees, after some future battle over the alleged impropriety of an attorney’s certification, that “completeness” was promised but not achieved, will that attorney have fallen below the standard of care? Having likely ventured beyond his or her competence, will that attorney have violated a rule of professional conduct? Is a malpractice lawsuit in that attorney’s future?

We come now to four concrete proposals for change, and one grand conclusion:

1. Rule 26(g)(1)(A) should be changed to indicate (for example) that, with the assistance of experts, the document production is complete and correct, with a 95% confidence level and an error rate of plus or minus 5%;
2. Attorneys would be wise (as a matter of best practices) to sample for privileged documents, so that they are withheld with a 99% confidence level and an error rate of plus or minus 1%;
3. Malpractice insurers should be actively revising their applications for errors and omissions insurance to force attorneys to disclose the level of their e-discovery competence, and insurers should be monitoring, if not mandating, the continuing education of attorneys in e-discovery matters; and
4. Besides being able to choose the format for the production of ESI, requesting parties should be able to designate the search methodologies used by the responding parties to search for potentially relevant documents. Otherwise, responding parties may use key words and search methodologies that under-produce to the requesting party.
The grand conclusion brings us back to loopholes. In an adversarial system, attorneys will exploit loopholes. And now you know that a large technical loophole besets our system. It besets every judicial system in the world, and we have not yet faced up to it.

We seek the truth. But now that there’s so much data, the best we can say about the truth is this: it’s statistical.

# # #

After graduating with a B.S. in engineering systems from the University of California at Los Angeles (U.C.L.A.), Nick Brestoff earned an M.S. in environmental engineering science from the California Institute of Technology (Caltech) and graduated from the Gould School of Law at the University of Southern California (U.S.C.) in 1975. For the next 35 years, Mr. Brestoff litigated business, employment, environmental, and other civil disputes in state and federal court. He is currently a consultant to businesses and attorneys through International Litigation Services (www.ilsTeam.com). Mr. Brestoff’s email address is nbrestoff@ilsTeam.com. He gratefully acknowledges editorial comments on drafts from Helen Marsh, attorney at law (California), Ken Rashbaum, attorney at law (New York), and Nicolas Nunez, P. Eng. (California).



--------------------------------------------------------------------------------
[1] Rule 26(g)(1)(B) applies the certification to discovery responses, and requires a certification that is “consistent” with the rules, which includes Rule 26(g)(1)(A).

[2] The Wikipedia entry for “Loophole,” as modified on 27 July 2010, was viewed by the author on August 27, 2010.

[3] Paley, Amit R. (May 17, 2005) “Wal-Mart Drops Plan for Side-by-Side Calvert Stores.” The Washington Post. http://www.washingtonpost.com/wp-dyn/content/article/2005/05/16/AR2005051601271.html.

[4] Dolan, Matthew (September 22, 2009) “To Outfox the Chicken Tax, Ford Strips Its Own Vans.” The Wall Street Journal. http://online.wsj.com/article/SB125357990638429655.html.

[5] See PSEG Power N.Y., Inc. v. Alberici Constructors, Inc., No. 1-:05-CV-657 (N.D.N.Y. 2007) (producing party ordered to re-produce ESI at its cost).

[6] See Goodbys Creek, LLC v. Arch Ins. Co., No. 3:07-cv-947-J-34 HTS (M.D.Fla. 2008) (conversion held improper; producing party order to re-produce ESI); L.H. v. Schwarzenegger, 2008 U.S. Dist. LEXIS 86829 (C.D.Cal. 2008) (sanctions were imposed for the untimely (late) production of non-sortable PDFs).

[7] Federal Rules of Civil Procedure, Rules 26(f)(3)(C) [discovery plan] and 34(b)(1)(C) [content of the request].

[8] California Code of Civil Procedure §2031.030(a).

[9] Lyman, Peter and Varian, Hal, How Much Information? (2003); see http://www.sims.berkeley.edu/how-much-info-2003 (reviewed on August 28, 2010).

[10] Ibid.

[11] Keteyian, Armen, “Digital Photocopiers Loaded with Secrets: Your Office Copy Machine Might Digitally Store Thousands of Documents That Get Passed on at Resale,” CBS News (New York, April 15, 2010); See http://www.cbsnews.com/stories/2010/04/19/eveningnews/main6412439.shtml?tag=mncol;txt.

[12] Gantz, et al., The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth Through 2011 (March 2008) (Executive Summary). See http://www.idc.com.

[13] See Pension Comm. Of the Univ. of Montreal Pension Plan v. Banc of Am. Sec., LLC, 2010 WL 184312 (S.D.N.Y. Jan. 15, 2010; as amended May 18, 2010) (Scheindlin, J.)

[14] Here is an example of a “power key word search” using Boolean operators (which were borrowed from computer programming): (successor /5 corporation) /p (toxic or hazardous or chemical or dangerous /5 waste) /p clean! and da(aft 1/1/90). In plain language, this search is for cases where a successor corporation is liable for the cleanup of hazardous (toxic) waste. The sample Boolean search looks for the combination of successor within five words of corporation, in the same paragraph as the combination of toxic or hazardous or chemical or dangerous within five words of waste, within the same paragraph as clean or cleanup or cleans or cleaned or cleaning (the exclamation mark in clean! causes the computer to search for all words with clean as a root). Cases are limited to those dated after January 1, 1990.

[15] See In re Fannie Mae Secs. Litig., 552 F.3d 814, 818-819 (D.C.Cir. 2009).

[16] See In re CV Therapeutics, Inc. Sec. Litig., 2006 WL 2458720 (N.D.Ca. Aug. 22, 2006).

[17] Federal Rules of Civil Procedure, Rule 26(b)(1); see Zubulake v. UBS Warburg LLC, 217 F.R.D. 309, 316 (S.D.N.Y. 2003); SEC v. Collins & Aikman Corp., 256 F.R.D. 403, 417-418 (S.D.N.Y. 2009) (over objections based on cost, SEC ordered to produce emails; parties required to establish a reasonable search protocol).

[18] Blair, David C. and Maron, M. E., “An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System,” 28(3) Comm. of the ACM 289 (1985).

[19] Tomlinson, Stephen, et al., Overview of the 2007 TREC Legal Track (April 30, 2008).

[20] Oard, Douglas W., et al., Overview of the 2008 TREC Legal Track (March 17, 2009).

[21] Federal Rules of Civil Procedure, Rule 26(g)(1)(A).

[22] Two tests are “recall” and “precision.” Recall is the proportion of relevant documents identified out of the total number of relevant documents that exist. If the total number of relevant documents is 100, but a search identified 80, the recall rate is 80%. Precision is the percentage of identified documents that were actually relevant. If 100 documents were identified but only 75% of them were relevant, the precision would be 75%. Using LSI, recall and precision rates just under 90% have been achieved. Source: Content Analyst Company, LLC (“Content Analyst”) in Reston, Virginia (http://contentanalyst.com). Content Analyst is the original patent-holder of LSI.

[23] See Landauer, T. K. and Dumais, S. T., “A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge,” Psychological Review, 104(2), 211-240 (1997).

[24] There are at least three hosted review platforms that have integrated an LSI solution from Content Analyst: Relativity (by kCura), iCONECT, and Eclipse by IPRO. In addition, a variation of LSI called Probabilistic LSI is “under the hood” of Axcelerate by Recommind.

[25] For that matter, could you differentiate LSI from still other computer-based search approaches, including taxonomies, ontologies, and Bayesian classifiers? These topics are beyond the scope of this article.

[26] See Qualcomm, Inc. v. Broadcom Corp., No. 05 Civ. 1958-B, 2008 U.S. Dist. (S.D. Cal. Jan. 7, 2008); and id., Order Declining to Impose Sanctions, Etc. (Document 998; filed Apr. 2, 2010).

[27] Search Guide, Electronic Discovery Reference Model Draft v.1.17 at p. 79 of 83 (May 7, 2009).

[28] Ibid.

[29] Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251, 262 (D.Md. 2008) (Grimm, J.) (quoting from “Information Inflation: Can the Legal System Adapt,” 13 Rich. J. L. & Tech. 10 (2007), at *38, 40). See Mt. Hawley Ins. Co. v. Felman Prod., Inc., 2010 WL 1990555*10 (S.D.W.Va. May 18, 2010) (failure to sample in order to identify and remove privileged documents was “imprudent”).

[30] In several recent cases, courts have made statements supporting the proposition that a certification of completeness of a large document product by an expert should replace certification by an attorney. For example, in United States v. O’Keefe, 537 F.Supp.2d 14, 24 (D.D.C. 2008), the court stated, “Whether search terms or ‘keywords’ will yield the information sought is a complicated question involving the interplay, at least, of the sciences of computer technology, statistics and linguistics . . . . Given this complexity, for lawyers and judges to dare opine that a certain search term or terms would be more likely to produce information than the terms that were used is truly to go where angels fear to tread.” In Equity Analytics, LLC v. Lundin, 248 F.R.D. 331, 333 (D.D.C. 2008), the court stated, “Determining whether a particular search methodology, such as keywords, will or will not be effective certainly requires knowledge beyond the ken of a lay person (and a lay lawyer) . . . .” And in In re Seroquel Prods. Liab. Litig., 244 F.R.D. 650, 660 n. 6, 662 (M.D.Fla. 2007), the court criticized the defendant’s use of keyword search to select ESI for production, in particular because the defendant failed to provide information “as to how it organized its search for relevant material, [or] what steps it took to assure reasonable completeness and quality control,” and noting that “while key word searching is a recognized method to winnow relevant documents from large repositories . . . [c]ommon sense dictates that sampling and other quality assurance techniques must be employed to meet requirements of completeness.” (Emphasis added.)

Frederick Jelinek — A Semantic Giant Passes

Posted by Douglas Forrest on Oct 5, 2010

Frederick Jelinek, who revolutionized language recognition by using statistical theory and probabilities instead of codifying rules, died in his office at Johns Hopkins on September 14 at the age of 77. The approach he pioneered in the context of computer speech recognition, analyzing text databases for word patterns and for the probability of words appearing relative to other words, became the foundation for many applications beyond voice recognition, including, most importantly, automated classification and organization, e.g., predictive coding, in today’s advanced e-discovery tools and systems.

Dr. Jelinek survived the Nazi occupation of Czechoslovakia, where, as the child of a Jewish father and a mother who had converted to Judaism, he was barred from attending school and compelled to study underground. He emigrated to the United States in 1949. After earning three degrees from MIT, Dr. Jelinek taught at MIT, Harvard and Cornell before joining IBM, where he rose from a summer position to heading a team using supercomputers to analyze speech. After retiring from IBM in 1993, he was recruited by Johns Hopkins to head its Center for Language and Speech Processing.

Tuesday, January 12, 2010

Approaches For Triaging Foreign-Language Documents

Posted by: Joe Thorpe January 12, 2010

One of the many complications encountered in litigation involving international parties is dealing with large volumes of foreign-language documents. Typical approaches range from asking one’s international client for translation support, to adding bilingual reviewers to the case team, to using Machine Translation (MT) to translate all of the documents, to outsourcing documents for full translation.

In this post, I will discuss the advantages and limitations of each of these approaches, and add a few more options for your consideration.

In an earlier post, I referred to cross-lingual concept searching and categorization. This critical process should be run in advance of any translation or review in order to reduce the volume of documents (and the costs associated with that effort).

Asking the client to provide staff for foreign-language document review and translation support: this is a very good option if your international client has staff to spare. The client’s employees will already have some unique understanding of their employer’s products and services, industry-specific nomenclature, and perhaps some idea of the issues in question. At least some of these employees will need to be bilingual, with a good command of English, in order to communicate well with the case team. It is less likely that these people will be trained in U.S. law, so their role would be limited to helping the case team identify potentially responsive or relevant documents for the U.S. case team to evaluate.

In the event that the international client cannot provide any (or enough) staff for this function, you may want to consider outsourcing. Bilingual and native speakers can be made available either on site or by remote access to work with the case team. When remote, these people are usually billable by the quarter hour and can be utilized cost-effectively.

Using MT to Translate All of the Documents: the efficacy of machine translation is determined by a wide range of factors. Generally speaking, European languages translate to English far more accurately than Asian and Middle Eastern languages do. If the documents are converted directly from native text (computer-created by a word processor, spreadsheet, presentation software, etc.), the results will be much more readable than if they were scanned documents first converted by OCR. Scans of handwritten documents cannot be recognized by MT at all.

Documents translated by machine will never be confused with documents originally written in English. Sentence structure, grammar and word usage simply will not be right, to say nothing of the idiomatic problems, which abound. That being said, as often as not the reader will get the gist of what is being said in the document, which is certainly useful in helping to decide which documents can be eliminated from the review, and in determining which require further treatment.

Post-edited MT: this can range from lightly post-edited to fully edited, and in my experience costs from $.04 to $.10 a word. Lightly post-edited MT helps tremendously in getting the context of a document, while fully post-edited MT is hard to differentiate from text originally written in English.

Abstracts: these are summaries that can range from a simple title and one-line description, at a cost of approximately $5 per document, to fuller summaries costing $10 to $15 per document. These are particularly useful where documents are handwritten or otherwise not good candidates for MT.

Human Translation: by far the most expensive approach (costs usually range from $.25 to $.35 per word); given the options above, it should be used only for documents expected to be offered as evidence.
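A back-of-the-envelope comparison shows how quickly these options diverge. The per-word and per-document rates are the ones quoted above; the document volume and length are hypothetical:

    # Rough cost comparison using the rates quoted above (volumes hypothetical).
    docs = 10_000
    words = docs * 300  # assume 300 words per document

    print(f"Lightly post-edited MT: ${words * 0.04:>9,.0f}")  # $  120,000
    print(f"Fully post-edited MT:   ${words * 0.10:>9,.0f}")  # $  300,000
    print(f"Simple abstracts:       ${docs * 5:>9,.0f}")      # $   50,000
    print(f"Human translation:      ${words * 0.25:>9,.0f}")  # $  750,000

Numbers like these are why the cross-lingual culling step mentioned above should come first: every document removed before translation is money saved at every later stage.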