corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image
corner image corner image

Archive for the ‘linux’ Category

Finding repeated images - Part 3 - Extracting features

Wednesday, July 30th, 2008

Series Part 1, Part 2, Part 3

At this point we want to extract a set of features to use with the similarity measures discussed in the previous posts. One of the important things we need to keep in mind is the context of the application, namely clip art repeated images. The size of the database is also relevant since a 1% error in a 3 million record database is way more than 1% in 10 thousand record database.

So our application is for relatively smaller databases. Clker.com database is approximately 19 thousand images.

For every picture in the database the feature vector consists of two part. The first part is an unrolled 3×3 RGB resampled version of the image. This would make a vector of 27 numbers between 0 - 255. I can hear all the synics saying “That won’t work dude”, well it somewhat did and the reason is again considering the context of the application.

Just using a 3×3 resampled version and comparing the result with Tanimoto’s coefficient, almost all repeated images were captured. False positives showed as well, and the reason is although the resampled images might end up being similar, yet the original images might not be the same. Having the same number of black and white pixels in the upper left third of the picture, will result in exactly the same intensity in the resampled image regardless of the original pixel distribution.

The false positive rate was pretty small, less than 5% of all images tagged as repeated were wrong. However, it would be nice to even reduce it further. This is done by using Hu moments, thus the feature vector now consists of 27 intensities, and 7 moments. As you might guess the values of the moments are usually very small compared to the intensities. Actually some of the moments are of the order 10^-34. This makes it impossible to use Tanimoto’s coefficient to generate a joint decision using both intensities and moments, and that’s where our \alpha similarity measure kicks in.

Using both Tanimoto’s similarity and \alpha, we were able to pinpoint exactly the repeated images. I didn’t see any false positives, so I guess the false positve rate was way lower than I can measure with my data set.

It is important to realize that we have some assumptions that might not be feasible with other databases including:

  • Rotated images are new images: An arrow pointing up is called “an up arrow clipart”, while an arrow pointing right is called “a right arrow clipart” and they are different images.
  • Scale is dealt with since we scale all images down to 3×3 RGB
  • Database size is relatively small
  • Moments are calculated on grayscale version of the image. You can calculate 3 sets of moments one on every channel (R,G,B) but that would result in another 21 features beside the 27 intensity features, and will take more time to process. Keep in mind that this is a web application, it needs to be relatively fast.

In the following posts I will share samples of the results as well as the implementation details.

Technorati Tags: , , ,

Discovering some of Linux Cow Powers: awk and bash

Tuesday, July 22nd, 2008

If you’ve never used linux you should give it a try. The simplest way is to install colinux with xming. If you feel its involved, just burn an ubuntu 8.04 bootable cd. It works out of the CD with almost all the every day used features without installtion, formating or any other extra work.

One of the cow powers that I use linux for, and didn’t find a way to do it using the windows shell is renaming a lot of files. I had around 3000 bmp files, their names were in the following format:

<word>-<volume>-<number>.bmp

and I wanted to rename them to

<word>-<number>.bmp

I couldn’t find a way to do this on windows except with writing a small pogram. On linux the solution is a program called awk, which can tokenize input strings, and write parts of them out. awk is a sophistiacated tool, and I usually use a very small subset of its features.

The solution was finally something similar to this:

find .bmp | awk -F- '{print "mv "$0" "$1"-"$3;}' | bash -v

The first part is running the find program to find all files with extension .bmp. The results are then piped as input to awk, instead of being written to the screen. awk is told that the field separator is ‘-’ using the -F parameter. A one statement awk program says print the words mv to issue commands to rename a file, then $0 for the full string before tokenizing, then the $1 for the first tokenized part, and $3 for the third tokenized part, skipping the second. So awk is being used to write the commands that we need to execute.

Those commands are then piped to bash, the bourne shell with the verbose flag on so that we can see the commands being executed.

A good practice when trying something like that is to pipe the results first to less, instead of bash. less is a program like more, which gives the user the ability to inspect whatever was piped to it and filp forward and backwords. It also includes some nice features like -N to show numbers, and searching for a word using the forward slash /  , with commands very similary to vim.

So the less version of the line whould be this:

find .bmp | awk -F- '{print "mv "$0" "$1"-"$3;}' | less

This is a very handy technique in renaming lots of files, or carrying batch operations on multiple files that I usually use.

Technorati Tags: , , , , , , ,

Finding repeated images - Part 2 - More on vector similarity

Saturday, July 19th, 2008

Series Part 1, Part 2, Part 3

This post took more time to write than the previous ones. Reason is when we write down our ideas, we see where are the problems, and that’s what exactly happened here. After writing what I was doing I figured out mistakes, which could lead to enhancing my current implementation.

Objective: Our objective is to figure out a way to compare elements of two vectors A & B. We have no knowledge about the range occupied by those elements. They can be positive or negative, and can be of different orders of magnitude.

Problem: Tanimoto’s coefficient discussed in Part 1 will not work due to the difference in orders of magnitude. Bigger elements will mask smaller ones.

Possible solution: We should normalize the value of elements to something that we know and is of the same order of magnitude. Mahalanobis distance accomplishes this by dividing the difference between a point and the mean by the variance. An approach similar to this would work, it is just that we have only two points to compare and thus the variance is 0.25 the square of the distance. So if the two elements were a_1, b_1, the the variance is 0.25(a-b)^2, which results in a constant Mahalanobis distance of 0.5.

Old and wrong ways of comparison

So the old and wrong idea was to divide the modulus of the difference by the modulus of the mean. By this way if the two values are similar to each other, the metric is small and if they are different the metric is big and is normalized against their order of magnitude. In order to make the value of this similarity metric between 0 and 1, we do this:

Sim(a_i,b_i)=\left(1+\frac{|a_i-b_i|}{(|a_i+b_i|)/2}\right)^{-1}

Actaully I used this similarity measure, and it seemed to improve my results than before. However, after I plotted it I saw a big flaw.

(more…)

GNUPlot wordpress plugin v1.1

Friday, July 18th, 2008

Plots GNUPlot charts without GNUPlot on your server. This plugin communicates with our custom version of GNUPlot hosted at clker.com, and responds with a PNG chart or errors in case of errors.

Write your GNUPlot code between [ gplot] and [ /gplot] (without spaces). Maximum chart size is 1×1.

To install

  1. Copy the file ( gnuplot plugin ) in you wp-content/plugins directory, and rename to .php instead of .phptxt.
  2. Create wp-content/cache directory, and make sure it is write able to the webserver
  3. Activate the plugin from the plugins tab inside wordpress

Example:

[ gplot]

set size 1,0.7
set dummy u,v
unset key
set parametric
set view 60, 30, 1.1, 1.33
set isosamples 50, 20
set title "Interlocking Tori - PM3D surface with depth sorting"
set urange [ -3.14159 : 3.14159 ] noreverse nowriteback
set vrange [ -3.14159 : 3.14159 ] noreverse nowriteback
set pm3d depthorder
splot cos(u)+.5*cos(u)*cos(v),sin(u)+.5*sin(u)*cos(v),.5*sin(v) with pm3d,\
1+cos(u)+.5*cos(u)*cos(v),.5*sin(v),sin(u)+.5*sin(u)*cos(v) with pm3d

[/ gplot]

would produce this:

… Enjoy

Finding repeated images - part 1 - Vector Similarity Measures

Wednesday, July 16th, 2008

Series Part 1, Part 2, Part 3

I’ve been playing around with spatial matching on clker.com . My goal was to figure out whether an image being submitted already exists or not, and to do that very fast. Titles, tags and all information in the image can change, so basically they are useless when it comes to know whether an image is repeated with high confidence. What is really needed is a set of features, that can be extracted fast enough and stored in the database, and indexed in a practically searchable manner.

(more…)

ionice and daily backups

Wednesday, June 25th, 2008

Anyone running a web server will either use a RAID for information protection, or will run a cron job that backups all the data daily

If you are running a budget server like mine, then most likely you’ll be doing daily backups. At 4 am everyday a backup process runs and stores a dump of the SQL and gzip of all the web folders including the SVN repository on a separate harddrive, which gets unmounted after the backup is done. The process takes around 2 hours to complete, during which the hard drives are all stressed and if the backup scripts were not carefully written they will hog any apache request due to slow disk access.

Because the primary reason of a webserver is to serve pages, I do two things to guarantee that any apache or database process will take precedence in resources when compared to the backup process:
1. The backup process runs with nice -n19 <backup>. This means that the process is running with a CPU idle priority, and will only use the CPU if no other process needs it. Using only nice will guarantee that the CPU is free, but programs like tar and gzip are big disk users and although they might not use a lot of CPU, a very small CPU usage is enough to generate big disk reads/writes. Large disk reads/writes usually will block other processes and keep them waiting for the disk resource, and will thus slow down apache, postgres and mysql and finally all pages accessed during the backup. In simple words, during backup if one tries to browse any of my websites when the backup process is only niced, the pages will load very slow.
2. This leads us to the second precaution, ionice -c2 -n7 nice -n19 <backup>. ionice is a tool that not so many people know that it exists. It basically prioritizes the disk access for different processes. Long tasks that access the disk a lot, and which are not that important or time critical should be given a lower priority compared to those that require fast turn around. ionice has three priority classes given by the parameter -c. Class 1 is idle priority, and only root can run processes in that class. Class 2 is best effort priority, within which there are 0-7 levels given by the -n parameter, 0 being highest priority and 7 being lowest. Class 3 is real time.

ionicing and nicing the backup process makes a huge difference in the speed and response of the webserver during the daily backup, especially if it was long. Try going over its man page. A good post about ionice is here.

Technorati Tags: , , , ,

Gatner “Windows is collapsing”

Friday, April 11th, 2008

So Gartner, one of the largest analysis companies said in a conference that windows is collapsing. Lots of reasons were given including a large code for windows that became un-maintainable and lead Microsoft to start from a more stable release of Windows 2003, and improve on for Vista. This in turn lead to Vista not really delivering meaningful enhancements from the point of view of the end consumer. Also, the fact that applications now are moving on the web and becoming OS agnostic where you can run many available software almost on all operating systems (openoffice for example, inkscape and others based on GNU’s compilers or java).

Other reasons included a hardware demanding OS, which can’t fit on small PDA’s and thus giving the chance for Linux and OS X. I add to this list, the complete prevalence of Linux in embedded OSs in appliances like Tivos, with zero competition from Microsoft.

A big major contribution from my point of view is the reason of making a new release. When a company decides to allocate resources for the development of a software, they need goals. The only goal that is obvious from the release of Vista was to duress people into DRM, using windows position as a must be OS on every PC or  laptop. A very important lesson that you can’t force the whole world into only your own vision, but you have to embrace what people want, make it your vision and may be add more on the side. After all people are the consumer. Unfortunately, that was the second time Microsoft did that, after they already suffered losses on the server side and hosting.

However, I do disagree with Gartner on the result that Microsoft or windows will crash. I believe that Microsoft is trying to embrace the standards and work with others, it’s just they’re not doing enough and lots of people doubt their intentions. My view is Microsoft becoming more like SUN or IBM where they will have to work with people and provide solutions for the problems that people see using ways that the people want.

Technorati Tags: , , , , , , , , , , ,

Creating a tar gz on the fly using PHP

Thursday, March 27th, 2008

A while ago, I thought about creating a tar.gz file for every download, so that if someone runs a search, he/she then can download all the images in the results. After a little bit of research, I found that PHP has a function for gzip. I also knew that the tar format just sticks files after one another, so if I can implement the tar format in PHP then I can gzip all images in the results.

I found this LGPL code that implemented the tar format. I used it (and modified it a little bit) to produce the online tar.gz functions:

  1. // Computes the unsigned Checksum of a file’s header
  2. // to try to ensure valid file
  3. // PRIVATE ACCESS FUNCTION
  4. function __computeUnsignedChecksum($bytestring)
  5. {
  6.   for($i=0; $i<512; $i++)
  7.     $unsigned_chksum += ord($bytestring[$i]);
  8.   for($i=0; $i<8; $i++)
  9.     $unsigned_chksum -= ord($bytestring[148 + $i]);
  10.   $unsigned_chksum += ord(" ") * 8;
  11.  
  12.   return $unsigned_chksum;
  13. }
  14.  
  15. // Generates a TAR file from the processed data
  16. // PRIVATE ACCESS FUNCTION
  17. function tarSection($Name, $Data, $information=NULL)
  18. {
  19.   // Generate the TAR header for this file
  20.  
  21.   $header .= str_pad($Name,100,chr(0));
  22.   $header .= str_pad("777",7,"0",STR_PAD_LEFT) . chr(0);
  23.   $header .= str_pad(decoct($information["user_id"]),7,"0",STR_PAD_LEFT) . chr(0);
  24.   $header .= str_pad(decoct($information["group_id"]),7,"0",STR_PAD_LEFT) . chr(0);
  25.   $header .= str_pad(decoct(strlen($Data)),11,"0",STR_PAD_LEFT) . chr(0);
  26.   $header .= str_pad(decoct(time(0)),11,"0",STR_PAD_LEFT) . chr(0);
  27.   $header .= str_repeat(" ",8);
  28.   $header .= "0";
  29.   $header .= str_repeat(chr(0),100);
  30.   $header .= str_pad("ustar",6,chr(32));
  31.   $header .= chr(32) . chr(0);
  32.   $header .= str_pad($information["user_name"],32,chr(0));
  33.   $header .= str_pad($information["group_name"],32,chr(0));
  34.   $header .= str_repeat(chr(0),8);
  35.   $header .= str_repeat(chr(0),8);
  36.   $header .= str_repeat(chr(0),155);
  37.   $header .= str_repeat(chr(0),12);
  38.  
  39.   // Compute header checksum
  40.   $checksum = str_pad(decoct(__computeUnsignedChecksum($header)),6,"0",STR_PAD_LEFT);
  41.   for($i=0; $i<6; $i++) {
  42.     $header[(148 + $i)] = substr($checksum,$i,1);
  43.   }
  44.   $header[154] = chr(0);
  45.   $header[155] = chr(32);
  46.  
  47.   // Pad file contents to byte count divisible by 512
  48.   $file_contents = str_pad($Data,(ceil(strlen($Data) / 512) * 512),chr(0));
  49.  
  50.   // Add new tar formatted data to tar file contents
  51.   $tar_file = $header . $file_contents;
  52.  
  53.   return $tar_file;
  54. }
  55.  
  56. function targz($Name, $Data)
  57. {
  58.   return gzencode(tarSection($Name,$Data),9);
  59. }
  60.  

To use those functions all you have to do is send a header with the mime type for the tar gz ( application/x-gzip ) using the php header function. To add a tar/gz section for a file, read the file in an array using filegetcontents and pass the filename and data to the targz function. Echo what is returned. That’s it!

So why is it not active on clker.com website? I actually tried it and found that compression consumes a lot of CPU. In the first 20 minute I had more than one hundred connections for different users downloading their results and the CPU was saturated. This basically left no CPU for searching. So use it carefully, and only if you really need that functionality.

Technorati Tags: , , , ,

Running your server is easy, fun but involved

Saturday, March 1st, 2008

I love using Linux to do my work. My best usage of Linux is my web server, although I recall I read once that Linus never intended for the kernel to be used as a server. He was more focused on using the kernel in desktops. I’ve been running my own web server for almost a year now, which runs two websites mibrahim.net my real estate website, and clker.com a to be online clipart website - we’re halfway there.

The fun part is simply everything just works. You’ll have all the tools you need starting from database engines like postgres, mysql to scripting languages like php, ruby with different types of webservers apache, lighthttpd and others. All the tools you might think of are there and under your own hand. Building your own server is not expensive - around $100 will do it. You don’t need a super quad core machine to produce extremely fast websites, unless you are already getting more than 50 page requests per second and at that point you will need something faster.

The performance bottle neck is never the CPU, it’s the hard drives read or write speeds. You can improve on that using fake RAIDs. Almost all Linux distros offer fake RAIDs and that is the cheapest way to improve the read speed.

Setting up your server is not a hard process. The best distributions that I recommend are Debian and Ubuntu. The reason is the very large library of software that comes with each. I believe that now the full distribution has grown more than 11 CDs. I used to run Debian and switched to Ubuntu a year ago and the reason behind the switch is the faster updates I get from Ubuntu, which enables me to use more recent and updated versions of PHP and the database engines.

The easiest setup is using the Ubuntu server CD, which is not any different from the desktop CD in terms of binaries. The only difference is that it won’t install the X11-server (GUI) and the window managers (gnome or kde) and the install program itself runs over the console and not VGA graphics. I use the server installation, and connect to my server using ssh. I have another old machine that runs Ubutu as well, and is used to run freenx. By that way I keep the server’s memory for the services running, and I can add all the GUI programs I want on this old machine.

Since I greatly benefited from running my own web server, I will share my experiences every now and then when I’ve got time to write.

Technorati Tags: , , , , , , ,

corner image corner image
3,081 spam comments
blocked by
Akismet