Automating coregistration, Part 2

02/03/2021

I'm writing today's blog post for two reasons. First, the program I created to facilitate automatic registration includes two phases, and I only discussed the first phase in my earlier blog post. Second, the same algorithm I used in that phase has come up again, somewhat unexpectedly, in a very different way. I found the convergence pretty interesting, and I hope you do, too.


What I did before (and why more was needed)
In my first blog post about automatic registration, I discussed using feature detection to match the same feature (basically, a corner) in each of two images. Although that worked rather well, it didn't quite get me to where I wanted to be. When processing those putatively corresponding features (using RANSAC), one must specify a tolerance that roughly corresponds to how many pixels a corner can be away from where it's expected to be and still be considered a match. This tolerance allows for some level of inter-image warping (e.g., by topography and perspective). However, the tolerance also limits the final precision of the alignment, and upon inspecting the results, I suspected more precision was possible.
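For the curious, here is a minimal sketch of what this feature-detection phase can look like in Python with OpenCV. The ORB detector, the match count, and the 5-pixel threshold are illustrative choices for this post, not necessarily the exact settings I used; the key point is the ransacReprojThreshold argument, which is the tolerance described above.

```python
# A minimal sketch of the feature-detection phase: detect corners, match them,
# and estimate a transform with RANSAC. Detector and parameters are illustrative.
import cv2
import numpy as np

img_a = cv2.imread("image_a.tif", cv2.IMREAD_GRAYSCALE)  # trusted reference
img_b = cv2.imread("image_b.tif", cv2.IMREAD_GRAYSCALE)  # target to align

orb = cv2.ORB_create(nfeatures=5000)
kp_a, des_a = orb.detectAndCompute(img_a, None)
kp_b, des_b = orb.detectAndCompute(img_b, None)

# Match descriptors between the two images and keep the strongest matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_b, des_a), key=lambda m: m.distance)[:500]

pts_b = np.float32([kp_b[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
pts_a = np.float32([kp_a[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# ransacReprojThreshold is the tolerance described above: how many pixels a
# matched corner may sit from its predicted position and still count as a match.
H, inlier_mask = cv2.findHomography(pts_b, pts_a, cv2.RANSAC,
                                    ransacReprojThreshold=5.0)
```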


The other option

It turns out that there are basically two broad ways to align images, and I found that using both yielded the best results. One method, as already discussed, is feature detection, which identifies specific features (corners) in one image and matches them to another. This is largely analogous to how a human would manually identify tie points. The other method could be broadly described as "pixel-based" and looks at patterns in pixel values rather than distinct features. Pixel-based approaches have a number of benefits. For example, they're appropriate when there aren't many unique point features but broader patterns may still be recognizable, such as gradational variations in grass color or coverage density in a meadow. At least in my experience, they also offer the greatest potential for precision, especially those that use what is called a "global direct search."

A global direct search essentially means that many, many possible transformations of the (target) Image B are attempted to find the optimal alignment to the (trusted reference) Image A. The downside of a direct search is that it can be very computationally expensive, especially if Image A and Image B start far from alignment. It simply takes a lot of trials for the algorithm to work its way to offsets and distortions that large! For this reason, I found it best to start with feature detection, which, by its nature, handles large initial misalignment efficiently: it takes no more (or less) time to identify and match features whether A and B are well aligned or very far from it. Then I use a global direct search to fine-tune this alignment, potentially at the subpixel level.
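To make the hand-off between the two phases concrete, here is a rough sketch of applying the coarse, feature-based transform before any fine-tuning. It assumes the homography H and images from the sketch above, which are illustrative rather than my exact workflow:

```python
# Phase 1 output applied: warp Image B with the feature-based homography so
# that the subsequent direct search (next section) starts close to alignment.
# H, img_a, and img_b come from the illustrative sketch above.
import cv2

h, w = img_a.shape
coarse_b = cv2.warpPerspective(img_b, H, (w, h))
```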


Enhanced Correlation Coefficient Maximization

The specific pixel-based algorithm I use is called Enhanced Correlation Coefficient (ECC) Maximization. The original paper can be found here. To be honest, I primarily used it because it is implemented in the popular OpenCV library, which is available in Python (and other languages), but it appears to work quite well (which is probably part of the reason it was implemented in that library, of course). A nice tutorial can be found here, and uses the example image below.

Conceptually, the way ECC Maximization works is fairly straightforward. Hopefully you're old enough to remember someone in school who used transparencies, those flimsy sheets of transparent plastic onto which images could be printed or notes marked in ink and then projected onto a screen, as shown below.

Imagine taking a few such transparencies, with similar images on them, and trying to align them. The image below is a rough representation of that situation. You might try rotating or shifting each image until it aligned as closely as possible. In essence, this is the "direct search" that I described earlier. But how would you know when the images were aligned? In some sense, it would be when the images correlate with one another. It might not be that the brightest part of each image aligns, for example; I could have thrown a negative in there! Rather, you're looking for the patterns to align and thus for there to be some sort of correlation between "pixel values."

At a conceptual level, this is what ECC Maximization does. It attempts different combinations of transformation parameters (rotations and translations, as in the transparency analogy, but also rescaling, that is, stretching the image outward from all corners or compressing it inward) until it maximizes the correlation between images, specifically, the ECC value. Because we're considering correlation and not simply matching brightest to brightest and darkest to darkest, this technique works even if the lighting has changed between the images (at least, better than a simpler approach could). You can probably intuit how such a technique could give you subpixel accuracy! You can also see why starting with very misaligned images would require a longer search, especially if the images do not represent the same entire scene. For example, one image might represent just a small part of the other, possibly rotated or rescaled. It's a bit like finding where a puzzle piece fits given only the final image: you don't know how the piece should be rotated or where it should be translated, so it can take a while to place it.
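Practically, here is a minimal sketch of the fine-tuning step using OpenCV's implementation (cv2.findTransformECC). It assumes the coarsely aligned image from the sketch above; the affine motion model and the termination criteria are illustrative choices rather than the exact settings I used:

```python
# A sketch of pixel-based fine-tuning with ECC Maximization, assuming img_a
# (reference) and coarse_b (coarsely aligned target) from the earlier sketches.
import cv2
import numpy as np

warp = np.eye(2, 3, dtype=np.float32)  # start the direct search from identity
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 1000, 1e-7)

# MOTION_AFFINE allows translation, rotation, and rescaling; MOTION_EUCLIDEAN
# or MOTION_HOMOGRAPHY are alternatives, depending on the distortion expected.
ecc_value, warp = cv2.findTransformECC(img_a, coarse_b, warp,
                                       cv2.MOTION_AFFINE, criteria)

# findTransformECC estimates the warp from the reference frame to the target
# frame, so applying it with WARP_INVERSE_MAP brings the target into alignment.
aligned_b = cv2.warpAffine(coarse_b, warp, (w, h),
                           flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
```

Because the warp parameters are continuous values rather than whole-pixel steps, the estimated shifts and rotations are not restricted to integer pixels, which is where the subpixel precision comes from.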


How this all came up again unexpectedly

I recently received favorable reviews for a paper that I'd submitted, but one reviewer asked me to put a particular value on a more rigorous footing. In essence, that value was the error that I estimated for "mapping" lava flow margins. These margins were mapped by carrying a piece of equipment called a differential GNSS rover (think: fancy GPS), and unless the equipment was perfectly vertical, there would be an error in measurement. Obviously, it's hard to keep something like this (stock image below) perfectly vertical while walking, so there was bound to be some error due to tilt. I estimated that tilt contributed about 15 cm of error on average, based on what I saw in the field, but the reviewer wanted me to show this in the data.

The way I did this was to compare a few hundred meters of a margin that was walked by both me and another person. Although I could have simply measured the typical difference between these respective margin traces, that would only be part of the story. It turns out that she and I naturally walked at different distances out from the margin. One doesn't want to knock the equipment up against the flow itself, so one always has to keep a little space between the rover and the margin. However, if you're walking faster, you'll probably keep the rover a little further from the margin than someone who is walking slowly, to avoid collisions. This offset is primarily (at a local level) translational, yet translational error is not relevant to our analysis. So, how could I process these overlapping traces so as to align them translationally as well as possible before measuring the error? I converted the trace and all the area to one side of it to black pixels, and the area on the other side to white pixels. I thus created two binary images, each representing the trace walked by me or by the other mapper. I then used ECC Maximization with a setting that considers only translation. After finding the optimal transform, I applied it back to the target margin trace and then measured the offsets.
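Here is a minimal sketch of that translation-only step, assuming each trace has already been rasterized onto a common grid as a binary image (the file names and settings below are placeholders, not my actual data):

```python
# Translation-only ECC alignment of the two rasterized margin traces.
# "margin_mine.png" and "margin_other.png" are placeholder binary rasters
# (black on one side of the walked line, white on the other).
import cv2
import numpy as np

trace_mine = cv2.imread("margin_mine.png", cv2.IMREAD_GRAYSCALE)    # reference
trace_other = cv2.imread("margin_other.png", cv2.IMREAD_GRAYSCALE)  # target

# Restrict the direct search to translation only, since the local offset
# between the two walked traces is treated as translational.
warp = np.eye(2, 3, dtype=np.float32)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 2000, 1e-7)
ecc_value, warp = cv2.findTransformECC(trace_mine, trace_other, warp,
                                       cv2.MOTION_TRANSLATION, criteria)

# warp[0, 2] and warp[1, 2] hold the estimated x and y shifts in pixels.
# Applying the warp with WARP_INVERSE_MAP shifts the target trace into the
# reference frame, after which the remaining offsets can be measured.
aligned_other = cv2.warpAffine(
    trace_other, warp,
    (trace_mine.shape[1], trace_mine.shape[0]),
    flags=cv2.INTER_NEAREST + cv2.WARP_INVERSE_MAP)
print("Estimated shift (pixels):", warp[0, 2], warp[1, 2])
```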

As it turns out, it worked like a charm, showing that the mean error was not 22 cm (without ECC Maximization) but 18 cm, and the median error was not 18 cm but 12 cm. This more rigorous analysis therefore supports my estimate of about 15 cm as reasonable.

Ethan I. Schaefer