corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image

Archive for the ‘Internet programming’ Category

Finding repeated images - Part 3 - Extracting features

Wednesday, July 30th, 2008

Series Part 1, Part 2, Part 3

At this point we want to extract a set of features to use with the similarity measures discussed in the previous posts. One of the important things we need to keep in mind is the context of the application, namely clip art repeated images. The size of the database is also relevant since a 1% error in a 3 million record database is way more than 1% in 10 thousand record database.

So our application is for relatively smaller databases. Clker.com database is approximately 19 thousand images.

For every picture in the database the feature vector consists of two part. The first part is an unrolled 3×3 RGB resampled version of the image. This would make a vector of 27 numbers between 0 - 255. I can hear all the synics saying “That won’t work dude”, well it somewhat did and the reason is again considering the context of the application.

Just using a 3×3 resampled version and comparing the result with Tanimoto’s coefficient, almost all repeated images were captured. False positives showed as well, and the reason is although the resampled images might end up being similar, yet the original images might not be the same. Having the same number of black and white pixels in the upper left third of the picture, will result in exactly the same intensity in the resampled image regardless of the original pixel distribution.

The false positive rate was pretty small, less than 5% of all images tagged as repeated were wrong. However, it would be nice to even reduce it further. This is done by using Hu moments, thus the feature vector now consists of 27 intensities, and 7 moments. As you might guess the values of the moments are usually very small compared to the intensities. Actually some of the moments are of the order 10^-34. This makes it impossible to use Tanimoto’s coefficient to generate a joint decision using both intensities and moments, and that’s where our \alpha similarity measure kicks in.

Using both Tanimoto’s similarity and \alpha, we were able to pinpoint exactly the repeated images. I didn’t see any false positives, so I guess the false positve rate was way lower than I can measure with my data set.

It is important to realize that we have some assumptions that might not be feasible with other databases including:

  • Rotated images are new images: An arrow pointing up is called “an up arrow clipart”, while an arrow pointing right is called “a right arrow clipart” and they are different images.
  • Scale is dealt with since we scale all images down to 3×3 RGB
  • Database size is relatively small
  • Moments are calculated on grayscale version of the image. You can calculate 3 sets of moments one on every channel (R,G,B) but that would result in another 21 features beside the 27 intensity features, and will take more time to process. Keep in mind that this is a web application, it needs to be relatively fast.

In the following posts I will share samples of the results as well as the implementation details.

Technorati Tags: , , ,

Finding repeated images - Part 2 - More on vector similarity

Saturday, July 19th, 2008

Series Part 1, Part 2, Part 3

This post took more time to write than the previous ones. Reason is when we write down our ideas, we see where are the problems, and that’s what exactly happened here. After writing what I was doing I figured out mistakes, which could lead to enhancing my current implementation.

Objective: Our objective is to figure out a way to compare elements of two vectors A & B. We have no knowledge about the range occupied by those elements. They can be positive or negative, and can be of different orders of magnitude.

Problem: Tanimoto’s coefficient discussed in Part 1 will not work due to the difference in orders of magnitude. Bigger elements will mask smaller ones.

Possible solution: We should normalize the value of elements to something that we know and is of the same order of magnitude. Mahalanobis distance accomplishes this by dividing the difference between a point and the mean by the variance. An approach similar to this would work, it is just that we have only two points to compare and thus the variance is 0.25 the square of the distance. So if the two elements were a_1, b_1, the the variance is 0.25(a-b)^2, which results in a constant Mahalanobis distance of 0.5.

Old and wrong ways of comparison

So the old and wrong idea was to divide the modulus of the difference by the modulus of the mean. By this way if the two values are similar to each other, the metric is small and if they are different the metric is big and is normalized against their order of magnitude. In order to make the value of this similarity metric between 0 and 1, we do this:

Sim(a_i,b_i)=\left(1+\frac{|a_i-b_i|}{(|a_i+b_i|)/2}\right)^{-1}

Actaully I used this similarity measure, and it seemed to improve my results than before. However, after I plotted it I saw a big flaw.

(more…)

GNUPlot wordpress plugin v1.1

Friday, July 18th, 2008

Plots GNUPlot charts without GNUPlot on your server. This plugin communicates with our custom version of GNUPlot hosted at clker.com, and responds with a PNG chart or errors in case of errors.

Write your GNUPlot code between [ gplot] and [ /gplot] (without spaces). Maximum chart size is 1×1.

To install

  1. Copy the file ( gnuplot plugin ) in you wp-content/plugins directory, and rename to .php instead of .phptxt.
  2. Create wp-content/cache directory, and make sure it is write able to the webserver
  3. Activate the plugin from the plugins tab inside wordpress

Example:

[ gplot]

set size 1,0.7
set dummy u,v
unset key
set parametric
set view 60, 30, 1.1, 1.33
set isosamples 50, 20
set title "Interlocking Tori - PM3D surface with depth sorting"
set urange [ -3.14159 : 3.14159 ] noreverse nowriteback
set vrange [ -3.14159 : 3.14159 ] noreverse nowriteback
set pm3d depthorder
splot cos(u)+.5*cos(u)*cos(v),sin(u)+.5*sin(u)*cos(v),.5*sin(v) with pm3d,\
1+cos(u)+.5*cos(u)*cos(v),.5*sin(v),sin(u)+.5*sin(u)*cos(v) with pm3d

[/ gplot]

would produce this:

… Enjoy

GNUPlot wordpress plugin

Wednesday, July 16th, 2008

Newer version is here

While I was writing the repeated images identification post, I modified the mimetex wordpress plugin to be the GNUPlot wordpress plugin.

The plugin executes GNUPlot over any portion of the text enclosed between [ gplot] and [/ gplot] tags, without the spaces of course.

Example:

[ gplot]

set size 0.75, 0.3

set xrange[0:5]

plot sin(x) title “sin(x)”, sin(2*x) title “sin(2x)”

[/ gplot]

would generate:

Download: Download the GNUPlot plugin for wordpress

Installation:

- Make sure that your server has gnuplot installed

- Create the directory <wordpress>/wp-content/cache, and make sure it is writable by the web server

Enjoy :)

Technorati Tags: , , ,

corner image corner image
3,515 spam comments
blocked by
Akismet