corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image

Finding repeated images - part 1 - Vector Similarity Measures

Series Part 1, Part 2, Part 3

I’ve been playing around with spatial matching on clker.com . My goal was to figure out whether an image being submitted already exists or not, and to do that very fast. Titles, tags and all information in the image can change, so basically they are useless when it comes to know whether an image is repeated with high confidence. What is really needed is a set of features, that can be extracted fast enough and stored in the database, and indexed in a practically searchable manner.

So the problem was not only a mathematical one, it is also how to match the math with postgresql and the server capabilities in order to finally work.

The core of the problem is comparison between vectors. Those vectors contain sub vectors of features of different types. So the feature vector can have a sub vector of chroma numbers, and another vector storing shape information. Each sub vector stores information of only one type.

Comparing those vectors has to be done in a smart and fast manner. In this post I will discuss different comparison metrics that can be used for quantities with defined range. For example a quantity like pixel intensity can be from 0 - 255, so it has a defined range. In the following posts I will deal with similarity measures for quantities with unknown or wit no predefined range, i.e. they can be negative, positive across the whole domain of real numbers. I will also show how features are extracted and finally how we perform the search and narrow down the similarity set.

The three candidate similarity measures are Dice coefficient, Jaccard’s similarity (also known as Tanimoto’s similarity), and the overlap coefficient. (See the simetrics package page for more metrics and more background info). Why those three? They are superfast and easy to implement. Again, the simetrics package website shows an experiment results where they tested every coefficient they support, and reported their speed. The experiment result picture, has very bad colors especially with that gray background however with careful inspection one will recognize that those three coefficients are the fastest and actually appear on the lowest chart. In order to find the matching lines, I had to use GIMP’s color selection tool since the lines and tickers are blurry from the JPEG compression. Reasons for those metrics being so fast is the simplicity of implementation, no powers, no logs or square roots.

Given two sets A and B, the three coefficients are defined as follows:

Tanimoto: T=\frac{|A\cap B|}{|A\cup B|}

Dice: D=\frac{2|A\cap B|}{|A|+|B|}

Overlap: O=\frac{A\cap B}{min(|A|,|B|)}

Where the |.| operator is the number of elements of the set inside.

Figure 1: Sets & intersection

Now lets closely examine each of those metrics and see how they react to similarities and differences. Given the areas of A & B are defined as in figure 1 above, we can rewrite them as:

T=\frac{Y}{X+Y+Z}

D=\frac{2Y}{X+2Y+Z}

O=\frac{Y}{min(X+Y,Y+Z)}

CASE 1: |A|=|B|, for simplicity |A|=|B|=1

In this case, A and B have exactly the same number of elements. In realty, and in our application this happens when the number of features extracted from an image are independent on the image.

T=\frac{Y}{2-Y}

D=\frac{2Y}{2}=Y

O=\frac{Y}{min(1,1)}=Y

So O & D are equal.

Figure 2: Comparison between T(Y) & D(Y)

It is very clear from figure 2 that both T(Y) and D(Y) are monotonic, and that T(Y) is more sensitive to differences. This is clear if we start tracing the chart from the end at Y=1, and go left. The T(Y) tends to dip lower for the same amount of change in Y than D(Y). This means that for smaller changes in the image, T(Y) will be more sensitive than D(Y), and thus will help detect smaller changes. Therefore, in our application case we choose to use T(Y) over D(Y) for that purpose.

CASE 2: |A|>|B|, for simplicity |A|=1 & |B|=0.5

This case happens when we are comparing two vectors having different number of elements. This is the only case where the overlap coefficient O gains more significance than the other two. Since this is not our case, we will not continue its analysis.

Conceptual & practical application

Until now, we’ve been discussing the similarity measures conceptually referring to a Venn diagram with sets intersecting. Practically, we don’t have sets like that but rather vectors of numbers. This means that we have to put forward assumptions and derive the similarity measures out of the assumptions. And thus, one might find other derivations for the same measures in other parts of the literature because of the context of their application and because of the assumptions made.

Assumption 1: Both vectors (A & B) being compared have the same number of elements n

Assumption 2: The range of all elements are the same

In order to find the intersection and the union, we consider the following case.

Figure 3: Depiction of two vectors A & B drawn against their domain. One should notice that the vectors as defined here are not continuous functions but rather discrete. The continuous lines drawn in the figure are only used for clarity.

From figure 3, one can deduct that the intersection between A & B can possibly be described using the minimum at any point over the domain, while the union can be described using the maximum.

Thus, we write Tanimoto’s coefficient as:

T=\frac{\sum_{i=1}^nmin(A_i,B_i)}{\sum_{i=1}^nmax(A_i,B_i)}

Technorati Tags: , , , , , , , ,

Tags: , , , , , , , ,

corner image corner image

corner image corner image

2 Responses to “Finding repeated images - part 1 - Vector Similarity Measures”

  1. Clker.com - weblog » Blog Archive » GNUPlot wordpress plugin Says:

    [...] « Finding repeated images - part 1 - Vector Similarity Measures [...]

  2. Clker.com - weblog » Blog Archive » Finding repeated images - Part 2 - More on vector similarity Says:

    [...] repeated images - Part 2 - More on vector similarity Series Part 1, Part [...]

Leave a Reply

corner image corner image
3,533 spam comments
blocked by
Akismet