What I do at work underwhelms my kids. They understand that I work with A.I. and, until I rob them of their innocence, this means that I build robot soldiers. The sad reality that their father is excited about teaching a computer to classify images makes for some painful dinner conversations. So, to make them feel just a bit worse, I asked them the following question: how can you tell, without reading, if a page contains words or just pictures?
At first, my children questioned my level of literacy, but after a few minutes of trying to describe what a word or a character looks like, the challenge became evident. Without first reading the page (i.e., trying to recognize individual characters), it was unclear to them, and to me, how to make this distinction.
At Trullion, we provide a real-time service that ingests Excel and PDF documents of varying quality and extracts actionable data for our clients. There is a constant struggle to balance the dueling values of performance (how accurate the extraction is) vs. efficiency (how fast it runs). We weigh each improvement by its runtime cost and the value it provides to our users. When we find an area where we can improve speed without sacrificing accuracy, we dance. These moments are rare, but this was one of them.
One of the costliest parts of our workflow, in terms of runtime, is the step where we extract the text from the source document. This is done using OCR (Optical Character Recognition), which attempts to locate and ‘read’ the text in an image. This process can be very time-consuming. Our goal was to distinguish between pages containing mainly text and those containing floor plans, blueprints, or other images, so that we could avoid “OCRing” non-text pages. Yet, without reading, how would the computer know what the page contains?
When the topic of image recognition comes up in conversation amongst data scientists, it inevitably centers on deep learning. The vast majority of the kittens on the internet are classified by various forms of neural networks. Yet deep learning has its costs. Training a neural network requires a large amount of labeled data, and collecting and labeling the data we needed would have been too arduous a task. This was compounded by the difficulty of locating the kinds of images that naturally occur in leases or revenue contracts.
We decided to approach this problem differently. Instead of looking at the image itself, we would look at its Fourier transform. The idea behind this approach (credit to my father, Dr. Michael Agishtein: it pays to call your parents) is to train a simple classifier that looks at the images in frequency space. While a dive into the mechanics of the Fourier transform is beyond the scope of this article, it is an incredibly powerful tool in the field of spectral analysis. This transformation allows one to convert a signal from a messy function into a sum of sine and cosine waves, usually written compactly as complex exponentials with complex coefficients. Each coefficient belongs to one frequency, and it captures the amplitude and the phase of that frequency in the underlying signal.
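To make the role of those coefficients concrete, here is a minimal sketch using NumPy. The toy signal, the sample rate, and the variable names are all made up for illustration; nothing here comes from our production code. It builds a one-dimensional signal out of two waves and then recovers their frequencies, amplitudes, and phases from the transform.

```python
import numpy as np

# Hypothetical toy signal: one second sampled 200 times, made of
# a 5 Hz sine (amplitude 2.0) and a 20 Hz cosine (amplitude 0.5).
sample_rate = 200
t = np.arange(sample_rate) / sample_rate
signal = 2.0 * np.sin(2 * np.pi * 5 * t) + 0.5 * np.cos(2 * np.pi * 20 * t)

# One complex coefficient per frequency (real-input FFT).
coeffs = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Rescale magnitudes so they read as wave amplitudes; the angle is the phase.
amplitudes = 2 * np.abs(coeffs) / len(signal)
phases = np.angle(coeffs)

# Report the two strongest components, ordered by frequency.
strongest = np.sort(np.argsort(amplitudes)[-2:])
for k in strongest:
    print(f"{freqs[k]:.0f} Hz: amplitude {amplitudes[k]:.2f}, phase {phases[k]:+.2f} rad")
```

The two waves that went in are the two components that come out, which is all the intuition the rest of this article needs.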
This insight suggests that we treat the image not as a picture but as a two-dimensional signal that can be decomposed into its Fourier components. Maybe a computer could exploit the differences in how text (many sharp, closely spaced edges) and drawings (long lines and smooth edges) distribute the signal's amplitude across frequencies. This method is also very fast, as the Fourier transform can be computed with a very efficient algorithm called the Fast Fourier Transform, or FFT for short.
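A rough way to see that difference is to measure how much of a page's spectral energy sits at high frequencies. The sketch below is an illustration under loud assumptions: the two "pages" are synthetic stand-ins generated on the spot, and `high_frequency_energy` with its `cutoff` parameter is a hypothetical helper, not the feature we use in production.

```python
import numpy as np

def high_frequency_energy(image, cutoff=0.25):
    """Spectral energy per pixel above `cutoff` cycles per pixel (hypothetical helper)."""
    spectrum = np.fft.fft2(image - image.mean())      # remove the DC offset first
    fy = np.fft.fftfreq(image.shape[0])[:, None]      # vertical frequencies
    fx = np.fft.fftfreq(image.shape[1])[None, :]      # horizontal frequencies
    radius = np.hypot(fy, fx)                         # distance from zero frequency
    power = np.abs(spectrum) ** 2
    return power[radius > cutoff].sum() / image.size

rng = np.random.default_rng(0)
size = 128

# Stand-in for a text page: rows of short, randomly placed dark strokes.
text_like = np.ones((size, size))
for top in range(8, size - 4, 8):
    text_like[top:top + 4, :] = (rng.random(size) < 0.5).astype(float)

# Stand-in for a drawing: smooth shading plus one rectangle outline.
drawing_like = np.tile(np.linspace(0.8, 1.0, size), (size, 1))
drawing_like[20, 20:101] = 0.0
drawing_like[100, 20:101] = 0.0
drawing_like[20:101, 20] = 0.0
drawing_like[20:101, 100] = 0.0

text_energy = high_frequency_energy(text_like)
drawing_energy = high_frequency_energy(drawing_like)
print(f"text-like page:    {text_energy:.1f}")
print(f"drawing-like page: {drawing_energy:.1f}")
```

On these toy pages the text-like one carries far more high-frequency energy. Real scans are messier, which is why a trained classifier makes more sense than a single hand-picked threshold.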
To test this idea, we collected a training set of images, half containing blueprints and the rest containing mostly text. After transforming them into their Fourier representations using FFT, we trained a classifier (an SVM on the 1-D projection, if you’re curious) on the labeled dataset. Our results were outstanding (if we may say so ourselves): the classifier achieved an accuracy of 97%, which was far more than we had prayed for. With a tool this accurate, we could really optimize our processing time.
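For the curious, the overall shape of such a pipeline can be sketched in a few lines with NumPy and scikit-learn. Everything specific here is an assumption made for illustration: the pages are synthetic, `fake_text_page`, `fake_drawing_page`, and `radial_spectrum` are hypothetical helpers, and reading "1-D projection" as a radially averaged spectrum is a guess rather than a description of our actual feature.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fake_text_page(size=64):
    """Synthetic stand-in for a text page: rows of random dark strokes."""
    page = np.ones((size, size))
    for top in range(4, size - 4, 8):
        page[top:top + 4, :] = (rng.random(size) < 0.5).astype(float)
    return page

def fake_drawing_page(size=64):
    """Synthetic stand-in for a blueprint: shading plus a random rectangle outline."""
    page = np.tile(np.linspace(0.8, 1.0, size), (size, 1))
    y0, x0 = rng.integers(4, size // 2, size=2)
    y1, x1 = rng.integers(size // 2, size - 4, size=2)
    page[y0, x0:x1 + 1] = 0.0
    page[y1, x0:x1 + 1] = 0.0
    page[y0:y1 + 1, x0] = 0.0
    page[y0:y1 + 1, x1] = 0.0
    return page

def radial_spectrum(image, n_bins=12):
    """Assumed 1-D projection: mean log-magnitude of the 2-D FFT per frequency ring."""
    log_mag = np.log1p(np.abs(np.fft.fft2(image - image.mean())))
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.hypot(fy, fx)
    ring = np.minimum((radius / radius.max() * n_bins).astype(int), n_bins - 1)
    sums = np.bincount(ring.ravel(), weights=log_mag.ravel(), minlength=n_bins)
    counts = np.bincount(ring.ravel(), minlength=n_bins)
    return sums / np.maximum(counts, 1)

# Balanced synthetic dataset: 1 = text, 0 = drawing.
pages = [fake_text_page() for _ in range(100)] + [fake_drawing_page() for _ in range(100)]
labels = np.array([1] * 100 + [0] * 100)
features = np.array([radial_spectrum(page) for page in pages])

# Shuffle, then hold out a quarter of the pages for evaluation.
order = rng.permutation(len(labels))
train, held_out = order[:150], order[150:]

classifier = SVC(kernel="linear").fit(features[train], labels[train])
accuracy = classifier.score(features[held_out], labels[held_out])
print(f"held-out accuracy on synthetic pages: {accuracy:.2f}")
```

Accuracy on toy pages like these says nothing about real documents; the sketch only shows how little machinery the approach needs once the images are in frequency space.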
When working in the field of A.I., it is easy to develop tunnel vision and see only the methods that have generated recent excitement. The past is forgotten in the pursuit of the future. The Fourier transform is a really old tool, not one that makes headlines anymore. At Trullion, we believe that true innovation borrows from the past to forge the future.
Alexander (Shlomo) Agishtein
A.I. Team Lead at Trullion