Case Study

Read about how we helped the Chief Information Officer at a global law firm by building an application that ingests PDFs, extracts text, and automatically finds answers to questions. We used natural language processing to eliminate the need for attorneys to manually find information in public documents and enter it into an online survey.


A law firm needed to extract information from corporate governance documents and store it in a database. Those documents, generally available as PDFs, include information on roles, rights, and responsibilities of different parties (e.g., the board of directors, shareholders, managers), how decisions are made, methods for monitoring actions and their outcomes that affect stakeholders, executive compensation, and many other vital elements of the structure of the corporation. The existing method was entirely manual, with lawyers reading the documents, filling in a web-based form with the desired information, and then saving the form’s contents to a database. If you’ve ever had the pleasure of reading corporate governance documents, you’ll know that they are long, detailed, and boring, making this an extremely labor-intensive and costly task.


We drew on our knowledge of natural language processing and our past experience with PDF documents to build a system that ingested PDFs, extracted text, and automatically found answers to questions that were previously handled manually. We compared two approaches: a hand-coded approach, which involved writing a regular expression for each question to find its answer in the text, and an approach based on machine learning. The machine learning model was trained on a corpus of documents and the ground truth answers that lawyers had extracted from them in the past. Ultimately, it learned to extract answers more accurately than the hand-coded regular expressions, and adding a new question required no work other than extracting ground truth from historical databases.
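To make the hand-coded baseline concrete, here is a minimal sketch of what one regular expression per question might look like. The question names, patterns, and sample sentence are hypothetical illustrations, not the ones used in the project, and real governance documents would need far more robust patterns.

```python
import re

# Hypothetical question-to-pattern table: one hand-written regular
# expression per question, each with a single capture group for the answer.
PATTERNS = {
    "board_size": re.compile(
        r"board\s+of\s+directors\s+shall\s+consist\s+of\s+(\d+)",
        re.IGNORECASE,
    ),
    "fiscal_year_end": re.compile(
        r"fiscal\s+year\s+ends\s+on\s+([A-Z][a-z]+\s+\d{1,2})",
        re.IGNORECASE,
    ),
}


def extract_answers(text):
    """Return a dict mapping each question to its answer, or None if no match."""
    answers = {}
    for question, pattern in PATTERNS.items():
        match = pattern.search(text)
        answers[question] = match.group(1) if match else None
    return answers


# Made-up sample text standing in for text extracted from a PDF.
sample = (
    "The Board of Directors shall consist of 9 directors. "
    "The fiscal year ends on December 31."
)
print(extract_answers(sample))
```

The sketch also shows why this approach scales poorly: every new question needs a new pattern, and every variation in wording needs another alternative in that pattern.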

How Natural Language Processing Helped

We delivered Python notebooks containing the code required to go from a dataset of governance documents and a set of questions and their answers to accurate trained models that extract answers to the same questions from new documents. That included code for rigorous evaluation, which helped focus the collection of additional training data on the harder questions. This allowed the business unit to build a case for turning our prototype into a production system, freeing lawyers to make more productive use of their skills.
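As an illustration of how evaluation can point to the questions that need more training data, the sketch below computes exact-match accuracy per question and ranks the questions from hardest to easiest. The record format, the exact-match metric, and the function names are assumptions made for this example; the delivered notebooks may have used different metrics.

```python
from collections import defaultdict


def accuracy_by_question(records):
    """Compute exact-match accuracy for each question.

    `records` is an iterable of (question, predicted, truth) tuples, where
    `predicted` may be None when the model returned no answer.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question, predicted, truth in records:
        totals[question] += 1
        # Case-insensitive, whitespace-trimmed exact match (an assumed metric).
        if predicted is not None and predicted.strip().lower() == truth.strip().lower():
            correct[question] += 1
    return {q: correct[q] / totals[q] for q in totals}


def hardest_questions(records, n=3):
    """Return up to `n` questions with the lowest accuracy, hardest first."""
    scores = accuracy_by_question(records)
    return sorted(scores, key=scores.get)[:n]


# Made-up evaluation records for two hypothetical questions.
records = [
    ("board_size", "9", "9"),
    ("board_size", "7", "7"),
    ("fiscal_year_end", "December 31", "December 31"),
    ("fiscal_year_end", None, "June 30"),
]
print(accuracy_by_question(records))
print(hardest_questions(records, n=1))
```

Questions at the top of the ranking are the natural candidates for collecting more labeled examples.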