A law firm needed to extract information from corporate governance documents and store it in a database. Those documents, generally available as PDFs, include information on roles, rights, and responsibilities of different parties (e.g., board of directors, shareholders, managers), how decisions are made, methods for monitoring actions and their outcomes that affect stakeholders, executive compensation, and many other vital elements of the structure of the corporation. The existing method was completely manual, with lawyers reading the documents, filling in a web-based form with the desired information, and then saving the form’s contents to a database. If you’ve ever had the pleasure of reading corporate governance documents, you’ll know that they are long, detailed, and boring, making this an extremely labor-intensive and costly task.

Our Approach
We leveraged our deep knowledge of natural language processing and past experience working with PDF documents to build a system that ingested PDFs, extracted text, and automatically found answers to questions that were previously handled manually. We compared two approaches: a hand-coded one that used regular expressions to find answers to specific questions in the text, and one based on machine learning. The latter was trained on a corpus of documents and ground truth answers that lawyers had extracted in the past, and it learned to extract answers more accurately than the hand-coded regular expressions. Adding a new question required no work beyond extracting its ground truth answers from historical databases.
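To make the hand-coded baseline concrete, here is a minimal sketch of what regular-expression extraction can look like. The question names, patterns, and sample text are invented for illustration and are not taken from the project; a real governance document would need many more patterns per question, which is part of why this approach is costly to maintain.

```python
import re

# Hypothetical questions and patterns, invented for illustration only.
PATTERNS = {
    "board_size": re.compile(
        r"board\s+of\s+directors\s+shall\s+consist\s+of\s+(\d+)\s+(?:members|directors)",
        re.IGNORECASE,
    ),
    "quorum": re.compile(r"quorum\s+(?:shall\s+be|is)\s+([^.]+)\.", re.IGNORECASE),
}


def extract_answers(text, patterns=PATTERNS):
    """Return the first match for each question, or None when no pattern fires."""
    answers = {}
    for question, pattern in patterns.items():
        match = pattern.search(text)
        answers[question] = match.group(1).strip() if match else None
    return answers


# Invented sample text standing in for text extracted from a PDF.
sample = (
    "The Board of Directors shall consist of 9 members. "
    "A quorum shall be a majority of the directors then in office."
)
print(extract_answers(sample))
```

Every new question in this style needs a new hand-written pattern, whereas the learned approach described above only needs historical answers for that question.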
We delivered Python notebooks with the code required to go from a dataset of governance documents and a set of questions and their answers to accurate trained models for extracting answers to the same questions in new documents. The notebooks also included code for rigorous evaluation, which showed where to focus the collection of additional training data for harder questions. This allowed the business unit to build a case for turning our prototype into a production system, freeing lawyers to make more productive use of their skills.
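One simple way to find the harder questions is to score each question separately and sort by accuracy. The sketch below assumes exact-match scoring after light normalization and uses invented data; the function names and the metric are illustrative assumptions, not the evaluation code from the delivered notebooks.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and collapse whitespace; treat a missing answer as empty."""
    return " ".join(str(answer).lower().split()) if answer is not None else ""


def accuracy_by_question(predictions, ground_truth):
    """Exact-match accuracy per question.

    Both arguments are lists of {question: answer} dicts, one per document.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, truth in zip(predictions, ground_truth):
        for question, expected in truth.items():
            total[question] += 1
            if normalize(pred.get(question)) == normalize(expected):
                correct[question] += 1
    return {question: correct[question] / total[question] for question in total}


def hardest_first(scores):
    """Sort questions from lowest to highest accuracy."""
    return sorted(scores.items(), key=lambda item: item[1])


# Invented example data for two documents.
truth = [
    {"board_size": "9", "quorum": "majority"},
    {"board_size": "7", "quorum": "two thirds"},
]
preds = [
    {"board_size": "9", "quorum": "Majority"},
    {"board_size": "7", "quorum": None},
]
scores = accuracy_by_question(preds, truth)
print(hardest_first(scores))
```

Questions at the top of the sorted list are the natural candidates for collecting more ground truth answers.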