- marcus welz - http://marcuswelz.com -

Finding Duplicate or Similar Images with Perceptual Hashing in PHP

Posted By marcus On November 22, 2010 @ 6:59 pm In PHP | 2 Comments

A while ago on Reddit someone asked how Tineye [1] works. It's pretty fascinating; you upload a photo (or point it at a URL of an image) and it'll find other locations with similar images — if they've been indexed. Even if those images are in different sizes, or have had minor changes made to them, be it due to compression or because someone added or removed some text. So in a way, it's a fuzzy image based search engine.

Although I'm sure tineye has it's own set of algorithms and custom applications to drive all this, something similar, if crude in implementation, can be achieved with available software.

At the center of all this is an image hashing algorithm. Usually (cryptographic) hashes are designed to detect even the slightest modifications and return a completely different hash. We're looking for the opposite, and libphash [2] delivers:

The phash library implements a "perceptual hashing" algorithm. From their site:

A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.

So the idea is that you feed it an image, and it'll return a hash, and the similar two hashes are to one another, the more identical the images are. The comparison of similarity is done by calculating the hamming distance [3], which in a way is the bit-level version of the levenstein() [4] function that PHP developers may already be familiar with.

Even better, the phash library comes with PHP bindings, and provides a few functions to get you started. For instance, there's the ph_image_hash() function. Simply give it a filename of an image, and it'll return, well, a resource. Now this really puts a damper on the usefulness of that function, since resources are fairly opaque and hard if not impossible to work with.

Fear not, I've made a few changes [5] so that ph_image_hash() returns a plain string with the hexadecimal representation of the hash, which can then be stored in a database, for instance. You can grab phash from my github repository [6].

Alright, you've got a way to get to those hashes, now how does one index them and look them up in a speedy way? Well, this is where it gets interesting, and unfortunately a little bit theoretical.

Ideally, you'll store these hashes (and other meta data) in a database, but not a SQL database. You really want a vantage point tree [7], or better, multiple vantage point tree. Essentially these are binary trees that build up based on the distance of hashes. The idea is that hashes that are similar are close together. So you traverse the tree in order to get close to your match, and then just "look around" in that area and you'll likely find similar results.

The MVP tree area gets pretty academic, and from what I've looked at so far, most of it is theoretical, presented as limited in applicability, but at the same time seems to be exactly what such a fuzzy image search engine would need to be based on. I'm fairly certain companies working on various aspects of image recognition and augmented reality and what not are all messing with this sort of thing, so there's likely very little incentive for them to publicize or advertise their algorithms.

The phash library does include a "MVPTree" library with basic examples of how data is stored. Having something like this built out into a scalable data store with an HTTP interface a la SOLR would be fantastic.

I'd immediately work on a PHP application to index my photos, detect duplicates, etc.


Article printed from marcus welz: http://marcuswelz.com

URL to article: http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/

URLs in this post:

[1] Tineye: http://www.tineye.com/

[2] libphash: http://phash.org/

[3] hamming distance: http://en.wikipedia.org/wiki/Hamming_distance

[4] levenstein(): http://php.net/levenshtein

[5] made a few changes: https://github.com/lucidix/phash/commit/5be2d454c932152e9b2395e21f97a008c6bd8766

[6] grab phash from my github repository: https://github.com/lucidix/phash

[7] vantage point tree: http://en.wikipedia.org/wiki/VP-tree

[8]  : http://www.shareaholic.com/api/share/?title=Finding+Duplicate+or+Similar+Images+with+Perceptual+Hashing+in+PHP&link=http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/&notes=A%20while%20ago%20on%20Reddit%20someone%20asked%20how%20Tineye%20works.%20It%27s%20pretty%20fascinating%3B%20you%20upload%20a%20photo%20%28or%20point%20it%20at%20a%20URL%20of%20an%20image%29%20and%20it%27ll%20find%20other%20locations%20with%20similar%20images%20--%20if%20they%27ve%20been%20indexed.%20Even%20if%20those%20images%20are%20in%20different%20sizes%2C%20or%20have%20had%20minor%20changes%20made%20to%20them%2C%20be%20&short_link=&shortener=none&shortener_key=&v=1&apitype=1&apikey=8afa39428933be41f8afdb8ea21a495c&source=Shareaholic&template=&service=5&tags=&ctype=

[9]  : http://www.shareaholic.com/api/share/?title=Finding+Duplicate+or+Similar+Images+with+Perceptual+Hashing+in+PHP&link=http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/&notes=A%20while%20ago%20on%20Reddit%20someone%20asked%20how%20Tineye%20works.%20It%27s%20pretty%20fascinating%3B%20you%20upload%20a%20photo%20%28or%20point%20it%20at%20a%20URL%20of%20an%20image%29%20and%20it%27ll%20find%20other%20locations%20with%20similar%20images%20--%20if%20they%27ve%20been%20indexed.%20Even%20if%20those%20images%20are%20in%20different%20sizes%2C%20or%20have%20had%20minor%20changes%20made%20to%20them%2C%20be%20&short_link=&shortener=none&shortener_key=&v=1&apitype=1&apikey=8afa39428933be41f8afdb8ea21a495c&source=Shareaholic&template=%24%7Btitle%7D+-+%24%7Bshort_link%7D&service=7&tags=&ctype=

[10]  : http://www.shareaholic.com/api/share/?title=Finding+Duplicate+or+Similar+Images+with+Perceptual+Hashing+in+PHP&link=http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/&notes=A%20while%20ago%20on%20Reddit%20someone%20asked%20how%20Tineye%20works.%20It%27s%20pretty%20fascinating%3B%20you%20upload%20a%20photo%20%28or%20point%20it%20at%20a%20URL%20of%20an%20image%29%20and%20it%27ll%20find%20other%20locations%20with%20similar%20images%20--%20if%20they%27ve%20been%20indexed.%20Even%20if%20those%20images%20are%20in%20different%20sizes%2C%20or%20have%20had%20minor%20changes%20made%20to%20them%2C%20be%20&short_link=&shortener=none&shortener_key=&v=1&apitype=1&apikey=8afa39428933be41f8afdb8ea21a495c&source=Shareaholic&template=&service=40&tags=&ctype=

[11]  : http://www.shareaholic.com/api/share/?title=Finding+Duplicate+or+Similar+Images+with+Perceptual+Hashing+in+PHP&link=http%3A%2F%2Fmarcuswelz.com%2F2010%2F11%2F22%2Ffinding-duplicate-or-similar-images-with-perceptual-hashing-in-php%2F&notes=A%20while%20ago%20on%20Reddit%20someone%20asked%20how%20Tineye%20works.%20It%27s%20pretty%20fascinating%3B%20you%20upload%20a%20photo%20%28or%20point%20it%20at%20a%20URL%20of%20an%20image%29%20and%20it%27ll%20find%20other%20locations%20with%20similar%20images%20--%20if%20they%27ve%20been%20indexed.%20Even%20if%20those%20images%20are%20in%20different%20sizes%2C%20or%20have%20had%20minor%20changes%20made%20to%20them%2C%20be%20&short_link=&shortener=none&shortener_key=&v=1&apitype=1&apikey=8afa39428933be41f8afdb8ea21a495c&source=Shareaholic&template=&service=78&tags=&ctype=

[12]  : http://www.shareaholic.com/api/share/?title=Finding+Duplicate+or+Similar+Images+with+Perceptual+Hashing+in+PHP&link=http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/&notes=A%20while%20ago%20on%20Reddit%20someone%20asked%20how%20Tineye%20works.%20It%27s%20pretty%20fascinating%3B%20you%20upload%20a%20photo%20%28or%20point%20it%20at%20a%20URL%20of%20an%20image%29%20and%20it%27ll%20find%20other%20locations%20with%20similar%20images%20--%20if%20they%27ve%20been%20indexed.%20Even%20if%20those%20images%20are%20in%20different%20sizes%2C%20or%20have%20had%20minor%20changes%20made%20to%20them%2C%20be%20&short_link=&shortener=none&shortener_key=&v=1&apitype=1&apikey=8afa39428933be41f8afdb8ea21a495c&source=Shareaholic&template=&service=88&tags=&ctype=

[13]  : http://marcuswelz.com/2010/11/22/finding-duplicate-or-similar-images-with-perceptual-hashing-in-php/feed

Copyright © 2009 MetaverseDeveloper. All rights reserved.