Finding repeated images - Part 3 - Extracting features
Wednesday, July 30th, 2008
At this point we want to extract a set of features to use with the similarity measures discussed in the previous posts. One of the important things we need to keep in mind is the context of the application, namely clip art repeated images. The size of the database is also relevant since a 1% error in a 3 million record database is way more than 1% in 10 thousand record database.
So our application is for relatively smaller databases. Clker.com database is approximately 19 thousand images.
For every picture in the database the feature vector consists of two part. The first part is an unrolled 3×3 RGB resampled version of the image. This would make a vector of 27 numbers between 0 - 255. I can hear all the synics saying “That won’t work dude”, well it somewhat did and the reason is again considering the context of the application.
Just using a 3×3 resampled version and comparing the result with Tanimoto’s coefficient, almost all repeated images were captured. False positives showed as well, and the reason is although the resampled images might end up being similar, yet the original images might not be the same. Having the same number of black and white pixels in the upper left third of the picture, will result in exactly the same intensity in the resampled image regardless of the original pixel distribution.
The false positive rate was pretty small, less than 5% of all images tagged as repeated were wrong. However, it would be nice to even reduce it further. This is done by using Hu moments, thus the feature vector now consists of 27 intensities, and 7 moments. As you might guess the values of the moments are usually very small compared to the intensities. Actually some of the moments are of the order 10^-34. This makes it impossible to use Tanimoto’s coefficient to generate a joint decision using both intensities and moments, and that’s where our
similarity measure kicks in.
Using both Tanimoto’s similarity and
, we were able to pinpoint exactly the repeated images. I didn’t see any false positives, so I guess the false positve rate was way lower than I can measure with my data set.
It is important to realize that we have some assumptions that might not be feasible with other databases including:
- Rotated images are new images: An arrow pointing up is called “an up arrow clipart”, while an arrow pointing right is called “a right arrow clipart” and they are different images.
- Scale is dealt with since we scale all images down to 3×3 RGB
- Database size is relatively small
- Moments are calculated on grayscale version of the image. You can calculate 3 sets of moments one on every channel (R,G,B) but that would result in another 21 features beside the 27 intensity features, and will take more time to process. Keep in mind that this is a web application, it needs to be relatively fast.
In the following posts I will share samples of the results as well as the implementation details.














, the the variance is
, which results in a constant Mahalanobis distance of 0.5.




