Semantic Image Clustering with Neural Networks

From DH 2016 to PixPlot

At the annual Digital Humanities conference in Krakow (2016) I saw a terrific paper given by Benoit Seguin et al. on "Visual Patterns Discovery in Large Databases of Paintings". It was the first time I had considered the possibilities of pre-trained captioning neural networks in making sense of enormous digitized visual collections. Since then, my colleagues and I at the Yale DHLab have worked on our own interpretations of the ideas shared by Benoit and his team, with a focus on a WebGL front-end visualization that scales to tens of thousands of images. The resulting software project, authored by DH Developer Doug Duhaime, is called PixPlot and is available on GitHub for anyone to use.

CNNs as Shortcut to Featurization

PixPlot's fundamental approach remains the same as what Seguin proposed in Poland. We use the penultimate layer of a pre-trained convolutional neural network for image captioning (specifically Google's Inception) to obtain a robust featurization space in 2,048 abstract dimensions of vision.

Pre-trained convolutional neural networks for image captioning deserve more attention than I can give them here, but their goal is to describe images they have never seen before. Such descriptions – even if they are imperfect – can then be used to search through the thousands of cell phone images we carry around with us, or to provide a textual proxy for a sight-impaired user.

The only problem with these large pre-trained captioning networks is that the labels they provide represent the kinds of images users are most likely to take (or encounter) in 2018. There are likely to be labels such as "cup of coffee", "cat", and "automobile" – but you're unlikely to find "parasol" or "steam engine."

Even so, these captioning networks have important uses. In order to arrive at their final textual descriptions, convolutional neural networks pass numbers representing pixel values through a series of complicated layers, each optimized to detect salient aspects of visual detail. Although the final layer of the network contains "cat" and "automobile", the layer before that final captioning layer is doing something more abstract: discriminating round shapes low in the frame (wheels on cars) from certain variegated patterns (fur on cats). It's impossible to really describe what this penultimate layer "sees" – but whatever it sees is, empirically, very effective at distinguishing automobiles from domesticated felines. In that sense, the network has learned to see in a certain number of highly sophisticated ways. And thus it has provided us a shortcut in the traditionally difficult task of featurization – deciding what separates a car from a cat.
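To make "backing up one layer" concrete, here is a minimal sketch in plain Python. Everything in it is invented for illustration: the tiny network, its random weights, the layer sizes, and the names `forward` and `LABELS`. A real network such as Inception has many convolutional layers and 2,048 penultimate units, and this is not PixPlot's actual code. The point is only that the same forward pass yields two things, and we keep the one that comes before the labels.

```python
import math
import random

LABELS = ["cat", "automobile"]          # hypothetical final-layer labels
N_PIXELS, N_FEATURES = 12, 4            # toy sizes; Inception has 2,048 features

rng = random.Random(0)                  # fixed seed so the run is repeatable
# Random stand-ins for the weights a real network would learn during training.
W_hidden = [[rng.uniform(-1, 1) for _ in range(N_PIXELS)] for _ in range(N_FEATURES)]
W_output = [[rng.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(len(LABELS))]


def forward(pixels):
    """Return (penultimate_features, label_probabilities) for one image."""
    # Penultimate layer: weighted sums of pixel values passed through ReLU.
    features = [max(0.0, sum(w * p for w, p in zip(row, pixels))) for row in W_hidden]
    # Final layer: one score per label, turned into probabilities with softmax.
    scores = [sum(w * f for w, f in zip(row, features)) for row in W_output]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return features, probs


image = [rng.random() for _ in range(N_PIXELS)]   # a fake 12-pixel "image"
features, probs = forward(image)
print("label probabilities:", dict(zip(LABELS, probs)))
print("feature vector kept for clustering:", features)
```

The label probabilities are the part a captioning application would use; the feature vector is the part that gets carried forward when the goal is to compare images with one another.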

Better Dimensionality Reduction

These thousands of visual features can be thought of as a feature space – an imaginary place where all images can be placed in relationship to one another. If we had only two dimensions (cars and cats), all the automobiles would appear near each other, and all the pets would cluster on the diagonally opposite side of the graph. (A hypothetical car painted to look like Garfield would appear in the middle somewhere.)
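The two-dimensional thought experiment can be written out directly. In this small sketch the image names and their (car-ness, cat-ness) scores are made-up numbers, not output from any real network; the only operation involved is measuring straight-line distance between points in the feature space.

```python
import math

# Hypothetical 2-D feature vectors: (car-ness, cat-ness), each between 0 and 1.
images = {
    "sedan": (0.95, 0.05),
    "pickup truck": (0.90, 0.10),
    "tabby": (0.05, 0.95),
    "kitten": (0.10, 0.90),
    "Garfield-painted car": (0.50, 0.50),
}


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)


def nearest(name):
    """Name of the image closest to `name` in feature space."""
    others = (n for n in images if n != name)
    return min(others, key=lambda n: distance(images[name], images[n]))


print(nearest("sedan"))    # pickup truck
print(nearest("tabby"))    # kitten

# The Garfield car sits the same distance from the car corner and the cat corner.
garfield = images["Garfield-painted car"]
print(distance(garfield, images["sedan"]), distance(garfield, images["tabby"]))
```

The same distance calculation works unchanged on vectors of any length, which is what makes the jump from two dimensions to thousands possible.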

Now imagine 2,048 dimensions, instead of two. Cool, but hard for humans to visualize. We need to collapse this high-dimensional space down to something that can be shown on a computer screen. There are a number of statistically sophisticated ways of collapsing this space down to X and Y, each trying to do the minimum of violence to the underlying patterns, while still resulting in an image you can see. A decade ago we would have used PCA; a year ago we would have used t-SNE; and for this project we wanted to try UMAP. All these algorithms are trying to accomplish the same thing, but t-SNE and UMAP seek to preserve both local clusters and an interpretable, global shape.
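As a rough illustration of what "collapsing down to X and Y" means, here is a sketch of PCA, the oldest of the three methods mentioned, chosen only because it fits in a few lines of plain Python. It is not what PixPlot uses (that would be UMAP, which needs a dedicated library). The function name `pca_2d`, the power-iteration approach, and the five-dimensional toy data are all assumptions made for the example, and the components it finds are approximate.

```python
import math
import random


def pca_2d(rows, iterations=200, seed=0):
    """Project rows (lists of floats) onto roughly their top two principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix (d x d) of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.uniform(-1, 1) for _ in range(d)]
        for _ in range(iterations):
            # Multiply by the covariance matrix, remove any part that points
            # along components already found, then rescale to unit length.
            v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            for c in components:
                overlap = sum(a * b for a, b in zip(v, c))
                v = [a - overlap * b for a, b in zip(v, c)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        components.append(v)
    # Each row becomes an (x, y) pair: its coordinates along the two components.
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in centered]


# Toy data: two groups of three points in five dimensions.
rng = random.Random(1)
group_a = [[rng.gauss(0, 0.1) for _ in range(5)] for _ in range(3)]       # near the origin
group_b = [[5 + rng.gauss(0, 0.1) for _ in range(5)] for _ in range(3)]   # far from it
points = pca_2d(group_a + group_b)
for x, y in points:
    print(round(x, 2), round(y, 2))
```

In the printed coordinates the two groups land on opposite sides of the X axis, so the separation that existed in five dimensions survives the reduction to two. PCA can only find straight-line directions like this one, which is one reason newer methods such as t-SNE and UMAP are attractive for image features.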

Rich User Experience

Even when we've successfully reduced our neural network-provided high-dimensional space down to two dimensions, we still have the problem of showing lots of images on the screen at the same time. Having a visualization of tens of thousands of objects at once requires some special tricks. Interfaces designed with SVG in tools such as d3.js work well up to a certain threshold, beyond which performance declines. Going further requires a lower-level approach, one with more in common with 3D game design than traditional web programming. Luckily Doug was up to the challenge of learning WebGL, a JavaScript API that gives the browser low-level access to the high-performance graphics hardware in most modern computers. He has a great writeup of visualizing dimensionality reductions using three.js. What's amazing about the result is the fluidity and performance of the user experience: once you learn some navigation tricks, you're able to zoom in and out and focus on specific clusters of similar images with ease. It's easy to forget you're looking at over 27,000 images!

Example: Early 20th-century Photography from Lund

I've assembled 30,000 photographs taken by the Swedish photographer Per Bagge in the southern city of Lund and loaded them into PixPlot. The resulting visualization shows both a global shape:

...as well as tantalizing clusters of specific patterns in how Bagge framed and presented his subjects. Let's take a look:

Children in Chairs

Weddings

Women in White

(Click on each square above to jump to that cluster in the visualization)

It's worthwhile pointing out that each "cluster" of similar photos really exists in a continuum with all other images, to the best of the ability of the dimensionality reduction algorithm to represent that complexity in only two dimensions. For example, in the Bagge archive there's one cluster of buildings, and a nearby cluster of parks:

Between the buildings and the parks, though, you tend to find images that have characteristics of both: buildings surrounded by large trees, and so on. Go further away from the parks cluster, and you'll find more urban buildings with fewer natural features. The relationship isn't exact, because there are thousands of other dimensions at play here (dark sky versus bright sky? people in the picture or not?), but overall the visualization makes a kind of intuitive sense as you navigate around it.

The interpretability of the visualization is actually kind of miraculous, since we're standing on the shoulders of systems designed to work with contemporary color photography rather than photographs from early 20th-century Sweden. Although the convolutional neural network was designed to caption the images of today, backing up one level to the penultimate layer of the network, and performing dimensionality reduction on the resulting 2,048 ways of seeing, results in a natural clustering of images with similar semantic content.

In discussing PixPlot I like to use the analogy of taking all 30,000 photos from the archive out of their boxes and throwing them up in the air. Imagine them landing on the floor of the library in groups of similar imagery, as if a ghost had intervened in the path and direction of each print mid-flight, ensuring it came to rest close to all of the photos with which it shared visual characteristics. That's the promise of harnessing the incredible power of pre-trained captioning networks, and redirecting them to novel uses such as semantic image clustering of photographic archives.