Archive for the ‘development’ Category

Finding repeated images - Part 2 - More on vector similarity

Saturday, July 19th, 2008

Series Part 1, Part 2

This post took longer to write than the previous ones. The reason is that writing our ideas down shows us where the problems are, and that is exactly what happened here. While writing up what I was doing, I found mistakes, and fixing them should improve my current implementation.

Objective: Our objective is to figure out a way to compare elements of two vectors A & B. We have no knowledge about the range occupied by those elements. They can be positive or negative, and can be of different orders of magnitude.

Problem: Tanimoto’s coefficient discussed in Part 1 will not work due to the difference in orders of magnitude. Bigger elements will mask smaller ones.

Possible solution: We should normalize the elements to a known scale, so that they are all of the same order of magnitude. The Mahalanobis distance accomplishes this by dividing the difference between a point and the mean by the standard deviation. A similar approach would work, except that we have only two points to compare, so the variance is 0.25 times the square of the distance between them. If the two elements are a_1 and b_1, the variance is 0.25(a_1-b_1)^2 and the standard deviation is 0.5|a_1-b_1|, which is exactly how far each element lies from their mean. The result is a constant Mahalanobis distance of 1 for every pair, which tells us nothing about how similar the two elements are.
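To see why two points give nothing useful, here is the calculation written out. It assumes the population variance of the pair and division by the standard deviation (the usual one-dimensional form of the Mahalanobis distance). Other conventions change the constant but not the conclusion. It holds for any a_1 not equal to b_1.

```latex
\begin{aligned}
\mu &= \frac{a_1+b_1}{2}, \qquad
\sigma^2 = \frac{(a_1-\mu)^2+(b_1-\mu)^2}{2} = \frac{(a_1-b_1)^2}{4} \\
d(a_1) &= \frac{|a_1-\mu|}{\sigma} = \frac{|a_1-b_1|/2}{|a_1-b_1|/2} = 1
\end{aligned}
```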

Old and wrong ways of comparison

So the old and wrong idea was to divide the modulus of the difference by the modulus of the mean. This way, if the two values are similar, the metric is small; if they are different, the metric is big, and it is normalized against their order of magnitude. To bring the value of this similarity metric between 0 and 1, we do this:

Sim(a_i,b_i)=\left(1+\frac{|a_i-b_i|}{(|a_i+b_i|)/2}\right)^{-1}
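For reference, the formula can be sketched in PHP roughly like this. The function name sim and the handling of the a_i = -b_i case (where the formula divides by zero) are my assumptions for illustration, not part of the original implementation.

```php
<?php
// Hypothetical helper: the name and the zero-mean guard are assumptions.
// Element-wise similarity: 1 / (1 + |a - b| / (|a + b| / 2)).
function sim(float $a, float $b): float
{
    $diff = abs($a - $b);          // modulus of the difference
    $mean = abs($a + $b) / 2.0;    // modulus of the mean

    if ($mean == 0.0) {
        // The formula is undefined when a = -b.
        // Assumption: two zeros are identical (1), anything else scores 0.
        return $diff == 0.0 ? 1.0 : 0.0;
    }
    return 1.0 / (1.0 + $diff / $mean);
}

printf("%.3f\n", sim(10, 12));      // 0.846
printf("%.3f\n", sim(1000, 1200));  // 0.846, same ratio at a larger scale
printf("%.3f\n", sim(5, 5));        // 1.000
```

The first two calls give the same value because only the ratio of the difference to the mean matters, which is the normalization against the order of magnitude described above.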

Actually, I used this similarity measure, and it seemed to improve my results. However, after I plotted it I saw a big flaw.

(more…)

GNUPlot wordpress plugin v1.1

Friday, July 18th, 2008

Plots GNUPlot charts without GNUPlot on your server. This plugin communicates with our custom version of GNUPlot hosted at clker.com, which responds with a PNG chart, or with an error message if something goes wrong.

Write your GNUPlot code between [ gplot] and [ /gplot] (without spaces). Maximum chart size is 1×1.

To install

  1. Copy the file ( gnuplot plugin ) into your wp-content/plugins directory, and rename it to .php instead of .phptxt.
  2. Create the wp-content/cache directory, and make sure it is writable by the web server
  3. Activate the plugin from the plugins tab inside wordpress

Example:

[ gplot]

set size 1,0.7
set dummy u,v
unset key
set parametric
set view 60, 30, 1.1, 1.33
set isosamples 50, 20
set title "Interlocking Tori - PM3D surface with depth sorting"
set urange [ -3.14159 : 3.14159 ] noreverse nowriteback
set vrange [ -3.14159 : 3.14159 ] noreverse nowriteback
set pm3d depthorder
splot cos(u)+.5*cos(u)*cos(v),sin(u)+.5*sin(u)*cos(v),.5*sin(v) with pm3d,\
1+cos(u)+.5*cos(u)*cos(v),.5*sin(v),sin(u)+.5*sin(u)*cos(v) with pm3d

[/ gplot]

would produce this:

… Enjoy

GNUPlot wordpress plugin

Wednesday, July 16th, 2008

While I was writing the repeated images identification post, I modified the mimetex wordpress plugin to be the GNUPlot wordpress plugin.

The plugin executes GNUPlot over any portion of the text enclosed between [ gplot] and [/ gplot] tags, without the spaces of course.

Example:

[ gplot]

set size 0.75, 0.3

set xrange[0:5]

plot sin(x) title "sin(x)", sin(2*x) title "sin(2x)"

[/ gplot]

would generate:

Download: Download the GNUPlot plugin for wordpress

Installation:

- Make sure that your server has gnuplot installed

- Create the directory <wordpress>/wp-content/cache, and make sure it is writable by the web server

Enjoy :)


Finding repeated images - part 1 - Vector Similarity Measures

Wednesday, July 16th, 2008

Series Part 1, Part 2

I’ve been playing around with spatial matching on clker.com. My goal was to figure out whether an image being submitted already exists or not, and to do that very fast. Titles, tags and all other information attached to the image can change, so they are basically useless when it comes to knowing with high confidence whether an image is repeated. What is really needed is a set of features that can be extracted fast enough, stored in the database, and indexed in a practically searchable manner.

(more…)

Caching SQL results with PHP

Monday, May 12th, 2008

I’ve been looking around lately for the best way to cache SQL results in PHP. I found some interesting articles posted in lots of places, but I didn’t find any that exactly matches my needs. The problem I have on hand is basically the same one every growing website faces: decreasing the mean resource usage per page request.

Now - this is my plan A to keep up with the website’s growth without a lot of hardware upgrades. There is a plan B, but I will keep that for a later post.
(more…)

Creating a tar gz on the fly using PHP

Thursday, March 27th, 2008

A while ago, I thought about creating a tar.gz file for every download, so that if someone runs a search, they can then download all the images in the results. After a little bit of research, I found that PHP has a function for gzip. I also knew that the tar format just sticks files one after another, so if I can implement the tar format in PHP then I can gzip all images in the results.

I found this LGPL code that implemented the tar format. I used it (and modified it a little bit) to produce the online tar.gz functions:

// Computes the unsigned checksum of a 512-byte tar header
// to try to ensure a valid file. The 8 checksum bytes at
// offset 148 are counted as spaces.
// PRIVATE ACCESS FUNCTION
function __computeUnsignedChecksum($bytestring)
{
  $unsigned_chksum = 0;
  for($i=0; $i<512; $i++)
    $unsigned_chksum += ord($bytestring[$i]);
  for($i=0; $i<8; $i++)
    $unsigned_chksum -= ord($bytestring[148 + $i]);
  $unsigned_chksum += ord(" ") * 8;

  return $unsigned_chksum;
}

// Generates the TAR section (header + padded data) for one file
// PRIVATE ACCESS FUNCTION
function tarSection($Name, $Data, $information=NULL)
{
  // Use neutral owner fields when no file information is passed
  $information = (array)$information + array(
    "user_id" => 0, "group_id" => 0, "user_name" => "", "group_name" => "");

  // Generate the TAR header for this file (field widths add up to 512)
  $header  = str_pad(substr($Name,0,100),100,chr(0));                               // name (max 100 bytes)
  $header .= str_pad("777",7,"0",STR_PAD_LEFT) . chr(0);                            // mode
  $header .= str_pad(decoct($information["user_id"]),7,"0",STR_PAD_LEFT) . chr(0);  // uid
  $header .= str_pad(decoct($information["group_id"]),7,"0",STR_PAD_LEFT) . chr(0); // gid
  $header .= str_pad(decoct(strlen($Data)),11,"0",STR_PAD_LEFT) . chr(0);           // size
  $header .= str_pad(decoct(time()),11,"0",STR_PAD_LEFT) . chr(0);                  // mtime
  $header .= str_repeat(" ",8);                                                     // checksum placeholder
  $header .= "0";                                                                   // type: regular file
  $header .= str_repeat(chr(0),100);                                                // link name
  $header .= str_pad("ustar",6,chr(32));                                            // magic
  $header .= chr(32) . chr(0);                                                      // version
  $header .= str_pad($information["user_name"],32,chr(0));
  $header .= str_pad($information["group_name"],32,chr(0));
  $header .= str_repeat(chr(0),8);                                                  // device major
  $header .= str_repeat(chr(0),8);                                                  // device minor
  $header .= str_repeat(chr(0),155);                                                // name prefix
  $header .= str_repeat(chr(0),12);                                                 // padding

  // Compute header checksum and write it at offset 148
  $checksum = str_pad(decoct(__computeUnsignedChecksum($header)),6,"0",STR_PAD_LEFT);
  for($i=0; $i<6; $i++) {
    $header[(148 + $i)] = substr($checksum,$i,1);
  }
  $header[154] = chr(0);
  $header[155] = chr(32);

  // Pad file contents to byte count divisible by 512
  $file_contents = str_pad($Data,(int)(ceil(strlen($Data) / 512) * 512),chr(0));

  // Add new tar formatted data to tar file contents
  $tar_file = $header . $file_contents;

  return $tar_file;
}

// Returns one gzip-compressed TAR section for the given file
function targz($Name, $Data)
{
  return gzencode(tarSection($Name,$Data),9);
}

To use these functions, all you have to do is send a header with the MIME type for tar.gz ( application/x-gzip ) using the PHP header function. To add a tar/gz section for a file, read the file into a string using file_get_contents and pass the filename and data to the targz function. Echo what is returned. That’s it!
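The steps above could look roughly like the sketch below. It assumes the targz function from the listing is already defined. The function name sendTarGz, the Content-Disposition header, the default archive name and the closing zero blocks are my additions for illustration, not part of the code I actually ran.

```php
<?php
// Sketch only: assumes targz() from the listing above is already defined.
// sendTarGz, $paths and the default archive name are hypothetical.
function sendTarGz(array $paths, string $archiveName = "results.tar.gz"): void
{
    header("Content-Type: application/x-gzip");
    header('Content-Disposition: attachment; filename="' . $archiveName . '"');

    foreach ($paths as $path) {
        $data = file_get_contents($path);  // whole file as one string
        if ($data === false) {
            continue;                      // skip unreadable files
        }
        // Each call returns one gzip member holding one tar entry
        echo targz(basename($path), $data);
    }

    // Addition not in the original code: a tar archive ends with two
    // 512-byte blocks of zeros, sent here as one last gzip member.
    echo gzencode(str_repeat(chr(0), 1024), 9);
}
```

Because every section is compressed separately, the response is a series of gzip members sent one after another, which decompresses to a single tar stream. Compressing per file keeps memory use low, but it does nothing to reduce the CPU cost.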

So why is it not active on the clker.com website? I actually tried it and found that compression consumes a lot of CPU. In the first 20 minutes I had more than one hundred connections from different users downloading their results, and the CPU was saturated. This basically left no CPU for searching. So use it carefully, and only if you really need that functionality.


Storing images in your sql database versus filesystem

Monday, March 3rd, 2008

Many websites today rely on media, be it images or videos. One reason is that humans are visual creatures: they prefer seeing things to reading long articles.

Images on websites can be divided into two types: images used in the website theme (logos, rounded corners, backgrounds, etc.) and images used as content. The ones discussed here are the content images. Images used in the website theme are accessed frequently, with almost every page view; they are usually very few, and are better managed by storing them in a directory.

(more…)

Running your server is easy, fun but involved

Saturday, March 1st, 2008

I love using Linux to do my work. My favorite use of Linux is my web server, although I recall reading once that Linus never intended the kernel to be used in servers; he was more focused on desktops. I’ve been running my own web server for almost a year now. It runs two websites: mibrahim.net, my real estate website, and clker.com, a soon-to-be online clipart website - we’re halfway there.

The fun part is that everything simply works. You’ll have all the tools you need, from database engines like postgres and mysql, to scripting languages like php and ruby, to different web servers such as apache, lighttpd and others. All the tools you might think of are there and in your own hands. Building your own server is not expensive - around $100 will do it. You don’t need a super quad core machine to produce extremely fast websites, unless you are already getting more than 50 page requests per second, at which point you will need something faster.

The performance bottleneck is rarely the CPU; it’s usually the hard drives’ read or write speed. You can improve on that using fake RAIDs. Almost all Linux distros offer fake RAIDs, and that is the cheapest way to improve the read speed.

Setting up your server is not a hard process. The distributions I recommend are Debian and Ubuntu, because of the very large library of software that comes with each. I believe the full distribution has now grown to more than 11 CDs. I used to run Debian and switched to Ubuntu a year ago; the reason behind the switch is the faster updates I get from Ubuntu, which let me use more recent versions of PHP and the database engines.

The easiest setup is the Ubuntu server CD, which is no different from the desktop CD in terms of binaries. The only differences are that it won’t install the X11 server (GUI) or the desktop environments (gnome or kde), and that the install program itself runs on the console rather than in VGA graphics. I use the server installation, and connect to my server using ssh. I have another old machine that runs Ubuntu as well, and is used to run freenx. That way I keep the server’s memory for the services it runs, and I can add all the GUI programs I want on the old machine.

Since I greatly benefited from running my own web server, I will share my experiences every now and then when I’ve got time to write.

