Case Study

Our client planned to migrate millions of digital client-matter files to a DMS. Missing metadata would have required thousands of hours of manual classification, so we built text extraction and machine learning models to classify documents efficiently and accurately. Our solution saved the law firm time and money on a successful digital migration.


Our client is one of the largest law firms in the world. This firm, located in the United States, has over 850 attorneys, nearly half of whom are spread worldwide. Their attorneys speak 60 distinct languages and practice law in the United States, United Kingdom, France, Germany, Italy, Hong Kong, the Organization for the Harmonization of Corporate Law in Africa (OHADA), and Saudi Arabia.

The firm had accumulated several decades of client and matter files: millions of digital documents stored in various filing systems. The ability to provide premier commercial advice to help clients achieve their ambitions hinges on vast legal knowledge and industry expertise as well as reliable access to past and present documents. Therefore, to improve client services through enhanced data security and retrieval, the firm decided to move all digital files to a formal Document Management System (DMS).

As the Chief Information Officer (CIO) and his team began planning the migration of these files to a DMS, they discovered that many files were missing client and matter metadata. While the DMS alone would improve security and control, its utility would be limited without client and matter identifiers attached to each file. The information technology (IT) department determined that a manual approach, tagging and classifying documents “by hand,” would be too costly and overwhelming. Instead, they partnered with us to investigate how a machine learning system could efficiently identify the client and matter for each document.


Our initial approach began with a proof of concept (POC). We developed a machine learning system that takes as input a corpus of documents, one folder per client-matter, and produces as output a model that:

  • Probabilistically maps folders (and documents within) to client-matters;
  • Identifies supporting evidence in the documents for the mapping; and
  • Operates at a reasonable and measurable level of accuracy.

The POC took as input a set of folders containing documents in various formats, including Word, PDF, and Excel. Our team used Apache Tika to extract the text content from each document. Then we extracted named entities (i.e., people, places, and organizations) from that content. Each document was then represented as a bag of words and named entities. These bags were compared against the rows of a spreadsheet that contained metadata about every client matter to find the most similar row(s). The top K most similar client matters were surfaced for a final human verification step. We also developed an innovative method for extracting and matching text snippets to make verification much faster and easier.
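The matching step can be illustrated with a minimal Python sketch. It is not the production system: it assumes the text has already been extracted, treats named entities as ordinary word tokens, and uses cosine similarity as one plausible similarity measure (the measure actually used is not stated here). The function names, matter IDs, and sample data are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_matters(doc_text, matters, k=3):
    """Rank client-matter metadata rows by similarity to one document.

    `matters` maps a (hypothetical) matter ID to the text of its
    metadata row. Returns up to k (matter_id, score, shared_terms)
    tuples, best match first. The shared terms act as simple
    supporting evidence for a human reviewer.
    """
    doc_bag = bag_of_words(doc_text)
    ranked = []
    for matter_id, row_text in matters.items():
        row_bag = bag_of_words(row_text)
        shared = sorted(doc_bag.keys() & row_bag.keys())
        ranked.append((matter_id, cosine(doc_bag, row_bag), shared))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Invented sample data, for illustration only.
matters = {
    "1001-0001": "Acme Corporation patent litigation Delaware",
    "1002-0007": "Globex Holdings merger acquisition London",
    "1003-0002": "Initech employment dispute Texas",
}
doc = "Memo regarding the Acme Corporation patent filing in Delaware court"

results = top_k_matters(doc, matters, k=2)
for matter_id, score, shared in results:
    print(matter_id, round(score, 3), shared)
```

Returning the shared terms alongside each score gives the reviewer a reason for every suggestion, which is the role the supporting evidence plays in the verification step.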

How Smart Content and Machine Learning Helped

We built a model with over 80% accuracy in identifying the correct client and matter for a given document. Based on the POC results, our client asked us to build out a production-ready system so they could prepare for complete digital document migration to the new DMS within a fiscal quarter.