A pure mathematician woke up and smelled smoke coming from his kitchen. The fire left simmering under his pot of morning oats had caught onto a dirty paper towel, and a fire was quickly spreading across the counter. Luckily for our protagonist, there was a fire extinguisher fortuitously placed on the very next counter; salvation was nigh. The mathematician looked from the fire to the extinguisher and back, thinking furiously. After a few minutes of careful deliberation, combined with some frantic scribbling with the chalk he always kept handy, he shouted “solution exists!” and went back to sleep, satisfied.
While this joke may not be funny to those who never had to listen to mathematical seminars, the hours of daylight between finding a solution and actually implementing it may be familiar. In addition, said solution may have impressive general applicability but underperform simpler, more targeted solutions on specific domains. In the machine learning world, we call this second tension the “bias/variance tradeoff.” The greater the specificity of a solution, the less it can be generalized; conversely, the more generalizable, the more accuracy you sacrifice.
When applied scientists and engineers have to solve problems in the field, we have to deal with these questions: how implementable is a solution? And how generalizable is a solution? How general should it be? We don’t want to solve general problems with specific solutions, and we may also not want to solve specific problems with general (often worse) solutions. I would like to walk you through how we grappled with this tradeoff while dealing with one of the thorniest problems in PDF data extraction.
Tables in documents often contain some of the most essential information that the document wishes to convey. Yet, while humans can easily detect the shape and substance of a tabular form, doing this in an automated fashion algorithmically is still one of the significant, open problems in this field. While many have tried, a silver bullet has yet to be found. There are a variety of reasons for this, starting with the difficulty of defining what should constitute a table. Does it need to have lines? Must it be organized into clearly demarcated columns and rows? Or can it just be an ordered collection of data? The answer to all of these questions is yes, no, maybe, and sometimes.
Secondly, this is a problem that spans many disciplines. It has the elements of a computer vision problem. It’s essentially a PDF parsing problem. Some solutions may lie in the world of natural language processing and classical computer science. A cursory look at the Google Scholar search results for ‘table extraction’ reveals over 5,000,000 results spanning an overwhelming number of fields.
At Trullion, our business is extracting actionable information from PDF contracts and financial statements, so as you can imagine, solving this challenge is an exceptionally high priority. Yet despite the plethora of published solutions, many were written by the intellectual brethren of our dear scorched, mathematician. There were a wealth of equations, but few actionable approaches. Alternatively, some solutions worked on the Platonic ideal of a table but failed when real life got involved. It soon became clear that we had to figure this out independently.
Here is where we had to make the bias/variance decision: were we going to solve table extraction as a general problem, or were we going to solve financial data extraction from financial PDFs? Based on the variance of the tabular forms found in our data combined with the uniformity of purpose (financial professionals just want their payments), we made a choice to do the latter. Our approach would leverage the problem domain and the ambient context to solve our problem, not the problem.
After months of research, testing, and development, we developed an approach that combined the fields of computer vision, NLP, classical machine learning, old-school computer science, and mathematics to tackle this challenge. But most importantly, we based the solution on the knowledge of our clients’ needs. By doing this, by limiting what we were solving to what needed to be solved, we sacrificed general applicability in pursuit of the actual business goal. While an exhaustive description of our algorithm is beyond the scope of this article, I would like to give a broad overview of our approach as an illustrative example of how we grappled with the bias/variance trade-off.
Before we dive into the details of the algorithm, I would like to preface with a brief overview of two tools that were essential to this task. The first is called Named Entity Recognition, NER. NER is a natural language processing task that attempts to assign a predefined set of labels to spans of text. If we ran the sentence “Homer ate a case of donuts from his favorite Springfield Dunkin Donuts” through our NER model we would get the following spans: “Homer person ate a case of donuts from his favorite Springfield location Dunkin Donuts organization.” Our NER model is more sophisticated and can detect a wide range of labels that we then use to extract the data we need.
The second tool comes from the world of machine learning. When doing data analysis, researchers are often interested in grouping the data into clusters, or groups. A trivial example would be grouping location data into neighborhoods or communities. While many clustering algorithms exist, I would like to talk about the two most relevant to this task: DBscan and Kmeans.
The former algorithm is designed for a use case where the data can be clustered into groups, but you have no idea how many. To find these groupings, the user sets two parameters: the maximum distance (the epsilon) that data points can be from each other and still be part of a cluster, and the minimum size of a cluster. The algorithm returns to you which elements belong to the same cluster. The second algorithm is used when you have an intuition as to the number of clusters you are looking for, but you are searching for which elements belong to which cluster. Thus in this approach, you don’t have to make any assumptions about how close they need to be, or how many need to be in a group for it to be valid.
With this background in mind, our algorithm runs using the following steps:
- Run the contract text through our NER engine and collect the labeled spans, now called entities.
- Collect the dates, monetary, and payment name (Year 1, Month 1, etc) entities and send them for clustering using DBScan. The intuition is that we expect that a ‘table’ will be a cluster of payments and dates close to each other in the text.
- Use the graphical coordinates (where on the page these passages of text are found) of the entities to cluster them into rows and columns. This is done by first using Kmeans to estimate our epsilon, and then DBScan to cluster them.
- Use text parsing and search algorithms to find the table headers and assign the header titles to the corresponding columns.
- Parse the data into our native payment stream format, including the payment term, amount, period, name, and location in the PDF.
As you can see from this algorithm, it was built for ‘payment extraction’ not generic table extraction. We ‘biased’ our algorithm and therefore we limited its ‘variance’, what it could be used for. However, what we lost in generality with regards to what sorts of tables we parse, we gained in what shapes and forms our payment tables could be in. Lines or no lines, square or rectangular, the shape no longer matters. If we can’t identify the rows and columns, we can still return the payments we found, and point the user to its location in an oftentimes voluminous document.
When working in a dual role, as a researcher, and as an engineer, it becomes essential to clearly define, and then constrain the problem that we are trying to solve. We need to be ambitious and creative when searching for answers, but sober and realistic to ensure that it is solved. Ours is the task of Prometheus, as we ascend to the academic clouds searching for the tools addressing the needs of the swamp. The general solution is not always the best, and a bit of bias can go a long way.