Synaptiq.ai

Machine Learning for Document Classification in a Secure DMS


The Problem

Our client is one of the largest law firms in the world. Based in the United States, the firm has over 850 attorneys, nearly half of whom are based in offices around the world. Its attorneys speak 60 distinct languages and practice law in the United States, the United Kingdom, France, Germany, Italy, Hong Kong, Saudi Arabia, and the member states of the Organization for the Harmonization of Business Law in Africa (OHADA).

The firm had accumulated several decades of client and matter files: millions of digital documents stored in a variety of filing systems. The firm's ability to provide premier commercial advice that helps clients achieve their ambitions hinges on vast legal knowledge and industry expertise, as well as reliable access to past and present documents. To improve client service through better data security and retrieval, the firm decided to move all digital files to a formal document management system (DMS).

As the Chief Information Officer (CIO) and his team began planning the migration of these files to a DMS, they discovered that a large number of files were missing client and matter metadata. The DMS alone would improve security and control, but without client and matter identifiers attached to each file, its utility would be limited. The information technology (IT) department determined that a manual approach, tagging and classifying documents “by hand,” would be too costly and overwhelming. Instead, they partnered with us to investigate how a machine learning system could more efficiently identify the client and matter for each document.

The Solution

We began with a proof of concept (POC). We developed a machine learning system that takes as input a corpus of documents, organized as one folder per client-matter, and produces as output a model that:

  • Probabilistically maps folders (and documents within) to client-matters;

  • Identifies supporting evidence in the documents for the mapping; and

  • Operates at a reasonable and measurable level of accuracy.
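
To make the first goal concrete, here is a minimal sketch of how per-document similarity scores might be rolled up into a probability for each candidate client-matter at the folder level. The case study does not say how the probabilities were actually derived; summing the scores and normalizing them to add up to 1 is an assumption made purely for illustration, and the function name and the "A" / "B" identifiers are hypothetical.

```python
from collections import defaultdict


def folder_probabilities(doc_scores):
    """Combine per-document scores into one probability per candidate.

    doc_scores is a list with one dict per document in the folder, each
    mapping a candidate client-matter ID to a non-negative similarity score.
    Assumption: scores are summed across documents and then normalized so
    they add up to 1. The real system may have used a different scheme.
    """
    totals = defaultdict(float)
    for scores in doc_scores:
        for matter_id, score in scores.items():
            totals[matter_id] += score

    grand_total = sum(totals.values())
    if grand_total == 0:
        # No evidence for any candidate, so no mapping can be proposed.
        return {}
    return {matter_id: total / grand_total for matter_id, total in totals.items()}


# Hypothetical folder with two documents and two candidate client-matters.
example_scores = [
    {"A": 0.5, "B": 0.25},
    {"A": 0.25, "B": 0.0},
]
print(folder_probabilities(example_scores))
```

A folder-level rollup like this lets a reviewer confirm one mapping for a whole folder rather than checking every document individually.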

The POC took as input a set of folders containing documents in a variety of formats, including Word, PDF, and Excel. Our team used Apache Tika to extract the text content from each document, then extracted named entities (i.e., people, places, and organizations) from that content. Each document was then represented as a bag of words and named entities. These bags were used to find the most similar row(s) in a spreadsheet that contained metadata about every client-matter. The top K most similar client-matters were surfaced for a final human verification step. We also developed an innovative method for extracting and matching text snippets to make verification much faster and easier.
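
The matching step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the system we built: it starts from text that has already been extracted (the Apache Tika step is skipped), it uses plain word counts without named entities, and it uses cosine similarity, which the description above does not name. The sample rows, field names, and IDs are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (a plain bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matters(document_text, matter_rows, k=3):
    """Return the k metadata rows most similar to the document, best first.

    Each row is a dict of string fields, standing in for one spreadsheet row.
    The result is a list of (score, row) pairs.
    """
    doc_bag = bag_of_words(document_text)
    scored = [
        (cosine(doc_bag, bag_of_words(" ".join(row.values()))), row)
        for row in matter_rows
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical client-matter metadata and one extracted document.
matter_rows = [
    {"client": "Acme Shipping", "matter": "Port lease dispute", "id": "C001-M01"},
    {"client": "Borealis Energy", "matter": "Wind farm financing", "id": "C002-M07"},
    {"client": "Acme Shipping", "matter": "Vessel purchase agreement", "id": "C001-M02"},
]
document_text = (
    "Draft lease for the port terminal between Acme Shipping and the harbor authority."
)

for score, row in top_k_matters(document_text, matter_rows, k=2):
    print(f"{row['id']}: {score:.2f}")
```

In this toy example the port lease matter ranks first because it shares the most terms with the document. A reviewer would then confirm or reject the top candidates, which is the human verification step described above.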

The Results

We built a model with over 80% accuracy in identifying the correct client and matter for a given document. Based on the POC results, our client has asked us to build a production-ready system so they can prepare for a complete digital document migration to the new DMS within a fiscal quarter.
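
For readers who want to run a similar evaluation, one common way to score a ranking system like this is top-K accuracy: the share of documents whose true client-matter appears among the first K suggestions. The sketch below is a generic illustration only. It does not state how the figure above was measured (for example, whether it refers to the top suggestion or the top K), and the function name and labels are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of documents whose true label is among the first k suggestions.

    ranked_predictions holds one best-first list of candidate IDs per document;
    true_labels holds the verified client-matter ID for each document.
    """
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("Need exactly one true label per ranked prediction list.")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical rankings for four documents, with their verified labels.
ranked = [
    ["M1", "M2", "M3"],
    ["M2", "M1", "M4"],
    ["M5", "M6", "M7"],
    ["M3", "M9", "M8"],
]
truth = ["M1", "M1", "M8", "M8"]

print(top_k_accuracy(ranked, truth, k=1))
print(top_k_accuracy(ranked, truth, k=3))
```

Reporting accuracy at several values of K shows how much of the remaining work the human verification step has to absorb.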