Finding Duplicate or Similar Images with Perceptual Hashing in PHP
A while ago on Reddit someone asked how TinEye works. It's pretty fascinating: you upload a photo (or point it at the URL of an image) and it'll find other locations with similar images, provided they've been indexed. It works even if those images are in different sizes or have had minor changes made to them, be it due to compression or because someone added or removed some text. So in a way, it's a fuzzy image-based search engine.
Although I'm sure TinEye has its own set of algorithms and custom applications to drive all this, something similar, if crude in implementation, can be achieved with available software.
At the center of all this is an image hashing algorithm. Usually (cryptographic) hashes are designed to detect even the slightest modifications and return a completely different hash. We're looking for the opposite, and libphash delivers:
The phash library implements a "perceptual hashing" algorithm. From their site:
A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar.
So the idea is that you feed it an image and it'll return a hash, and the more similar two hashes are to one another, the more alike the images are. Similarity is measured by calculating the Hamming distance, that is, the number of bit positions in which two hashes differ. In a way it is the bit-level version of the levenshtein() function that PHP developers may already be familiar with.
Even better, the phash library comes with PHP bindings, and provides a few functions to get you started. For instance, there's the ph_image_hash() function. Simply give it a filename of an image, and it'll return, well, a resource. Now this really puts a damper on the usefulness of that function, since resources are fairly opaque and hard if not impossible to work with.
Fear not, I've made a few changes so that ph_image_hash() returns a plain string with the hexadecimal representation of the hash, which can then be stored in a database, for instance. You can grab phash from my github repository.
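Once the hashes are available as hexadecimal strings, comparing two of them takes only a few lines. The following is a minimal sketch, not part of the phash bindings: the function name hammingDistanceHex() is made up for illustration, and it assumes both hashes are hex strings of the same length.

```php
<?php
/**
 * Hamming distance between two equal-length hexadecimal hash strings.
 * Hypothetical helper for illustration; not part of the phash extension.
 */
function hammingDistanceHex(string $a, string $b): int
{
    if (strlen($a) !== strlen($b) || !ctype_xdigit($a) || !ctype_xdigit($b)) {
        throw new InvalidArgumentException('Expected two hex strings of equal length');
    }

    $distance = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        // XOR leaves a 1 bit wherever the two 4-bit digits differ
        $diff = hexdec($a[$i]) ^ hexdec($b[$i]);
        // Count the 1 bits in the result
        $distance += substr_count(decbin($diff), '1');
    }

    return $distance;
}
```

Working one hex digit at a time sidesteps PHP's signed integer limits for 64-bit hashes. A small distance suggests similar images; where to draw the line is something you'd have to tune against your own photos.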
Alright, you've got a way to get to those hashes, now how does one index them and look them up in a speedy way? Well, this is where it gets interesting, and unfortunately a little bit theoretical.
Ideally, you'll store these hashes (and other metadata) in a database, but not a SQL database. You really want a vantage point tree or, better, a multiple vantage point (MVP) tree. Essentially these are binary trees that are built up based on the distance between hashes. The idea is that hashes that are similar are close together. So you traverse the tree in order to get close to your match, and then just "look around" in that area and you'll likely find similar results.
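To make the idea a little less theoretical, here is a rough sketch of a plain (single) vantage point tree held in memory as nested PHP arrays. It is not the MVPTree code that ships with phash, and all function names are made up. It assumes the hashes are hex strings of equal length and always picks the first hash in the list as the vantage point, which a real implementation would choose more carefully.

```php
<?php
// Hamming distance between two equal-length hex strings (hypothetical helper)
function vpDistance(string $a, string $b): int
{
    $d = 0;
    for ($i = 0, $n = strlen($a); $i < $n; $i++) {
        $d += substr_count(decbin(hexdec($a[$i]) ^ hexdec($b[$i])), '1');
    }
    return $d;
}

// Build the tree: each node keeps one hash (the vantage point) and splits the
// remaining hashes into those within the median distance and those beyond it.
function vpBuild(array $hashes): ?array
{
    $hashes = array_values($hashes);
    if (count($hashes) === 0) {
        return null;
    }
    $vantage = array_shift($hashes);
    if (count($hashes) === 0) {
        return array('point' => $vantage, 'radius' => 0, 'inside' => null, 'outside' => null);
    }

    $distances = array();
    foreach ($hashes as $hash) {
        $distances[] = vpDistance($vantage, $hash);
    }
    $sorted = $distances;
    sort($sorted);
    $radius = $sorted[intdiv(count($sorted), 2)]; // median distance

    $inside = array();
    $outside = array();
    foreach ($hashes as $i => $hash) {
        if ($distances[$i] <= $radius) {
            $inside[] = $hash;
        } else {
            $outside[] = $hash;
        }
    }

    return array(
        'point'   => $vantage,
        'radius'  => $radius,
        'inside'  => vpBuild($inside),
        'outside' => vpBuild($outside),
    );
}

// Collect all hashes within $maxDistance of $query, skipping subtrees that
// the triangle inequality rules out.
function vpSearch(?array $node, string $query, int $maxDistance, array &$found = array()): array
{
    if ($node === null) {
        return $found;
    }
    $d = vpDistance($query, $node['point']);
    if ($d <= $maxDistance) {
        $found[] = $node['point'];
    }
    if ($d - $maxDistance <= $node['radius']) {
        vpSearch($node['inside'], $query, $maxDistance, $found);
    }
    if ($d + $maxDistance > $node['radius']) {
        vpSearch($node['outside'], $query, $maxDistance, $found);
    }
    return $found;
}
```

Usage would be along the lines of $tree = vpBuild($allHashes); followed by vpSearch($tree, $queryHash, 5). This only shows the pruning idea; persistence, balancing and the "multiple" part of an MVP tree are left out entirely.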
The MVP tree area gets pretty academic, and from what I've looked at so far, most of it is theoretical, presented as limited in applicability, but at the same time seems to be exactly what such a fuzzy image search engine would need to be based on. I'm fairly certain companies working on various aspects of image recognition and augmented reality and what not are all messing with this sort of thing, so there's likely very little incentive for them to publicize or advertise their algorithms.
The phash library does include a "MVPTree" library with basic examples of how data is stored. Having something like this built out into a scalable data store with an HTTP interface a la SOLR would be fantastic.
I'd immediately work on a PHP application to index my photos, detect duplicates, etc.
Displaying Stacktraces in PHP
During development stuff breaks. And when that happens, it's not always clear what exactly went wrong. Luckily stack traces help narrow things down by showing the execution path that led up to the unfortunate event.
Still, unless you're intimately familiar with the code base, you need to sift through the files to understand what exactly was called on that particular line. To help out with doing this quickly during development of a Zend Framework based application, I wrote a view helper that would format and show a section of code for each line. The resulting stack trace is then easy to follow.
Of course, this is something that should only be used during development as exposing source code anywhere near a production environment isn't recommended.
I've found that it helps me quickly see what actually went wrong. And if necessary, it could be augmented to also display any arguments that were passed along for an even better overview.
<?php
/**
*
* Pretty print a stack trace
*
* Didn't put too much effort into this. After all, running this on a production
* site isn't really recommended.
*
* @author Marcus Welz
*
*/
class Helper_StackTrace extends Zend_View_Helper_Abstract
{
/**
* Retrieve the relevant portion of the PHP source file with syntax highlighting
*
* @param string $fileName The full path and filename to the source file
* @param int $lineNumber The line number which to highlight
* @param int $showLines The number of surrounding lines to include as well
*/
protected function _highlightSource($fileName, $lineNumber, $showLines)
{
$lines = file_get_contents($fileName);
$lines = highlight_string($lines, true);
$lines = explode("<br />", $lines);
$offset = max(0, $lineNumber - ceil($showLines / 2));
$lines = array_slice($lines, $offset, $showLines);
$html = '';
foreach ($lines as $line) {
$offset++;
$line = '<em class="lineno">' . sprintf('%4d', $offset) . ' </em>' . $line . '<br/>';
if ($offset == $lineNumber) {
$html .= '<div style="background: #ffc">' . $line . '</div>';
} else {
$html .= $line;
}
}
return $html;
}
/**
*
* Print the stack Trace
*
* @param Exception $exception Any kind of exception
* @param int $showLines Number of surrounding lines to display (optional; defaults to 10)
*/
public function stackTrace($exception, $showLines = 10)
{
$html = '<style type="text/css">'
. '.stacktrace p { margin: 0; padding: 0; }'
. '.source { border: 1px solid #000; overflow: auto; background: #fff;'
. ' font-family: monospace; font-size: 12px; margin: 0 0 25px 0 }'
. '.lineno { color: #333; }'
. '</style>'
. '<div class="stacktrace">'
. '<p>File: ' . $exception->getFile() . ' Line: ' . $exception->getLine() . '</p>'
. '<div class="source">'
. $this->_highlightSource($exception->getFile(), $exception->getLine(), $showLines)
. '</div>';
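foreach ($exception->getTrace() as $trace) {
// Frames for internal function calls may lack file and line information
if (!isset($trace['file'], $trace['line'])) {
continue;
}
$html .= '<p>File: ' . $trace['file'] . ' Line: ' . $trace['line'] . '</p>'
. '<div class="source">'
. $this->_highlightSource($trace['file'], $trace['line'], 5)
. '</div>';
}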
$html .= '</div>';
return $html;
}
}
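As for displaying the arguments, here is one way that could look. This is a sketch rather than part of the helper above: summarizeTraceArgs() is a made-up function that takes a single frame from Exception::getTrace() and boils its 'args' entry down to a short, HTML-safe string. Note that on PHP 7.4 and later the zend.exception_ignore_args setting can strip arguments from traces, in which case this simply returns an empty string.

```php
<?php
/**
 * Summarise the arguments of one stack frame as a short, HTML-safe string.
 * Hypothetical helper; not part of the original view helper.
 */
function summarizeTraceArgs(array $frame, int $maxLength = 30): string
{
    if (empty($frame['args'])) {
        return '';
    }

    $parts = array();
    foreach ($frame['args'] as $arg) {
        if (is_object($arg)) {
            // Only show the class name, never the object's contents
            $parts[] = get_class($arg);
        } elseif (is_array($arg)) {
            $parts[] = 'array(' . count($arg) . ')';
        } elseif (is_resource($arg)) {
            $parts[] = 'resource';
        } elseif (is_string($arg)) {
            // Truncate long strings so the trace stays readable
            $short = strlen($arg) > $maxLength ? substr($arg, 0, $maxLength) . '...' : $arg;
            $parts[] = "'" . $short . "'";
        } else {
            // ints, floats, booleans and null
            $parts[] = var_export($arg, true);
        }
    }

    return htmlspecialchars(implode(', ', $parts), ENT_QUOTES, 'UTF-8');
}
```

The same caveat applies as for the source listing: arguments can contain passwords and other sensitive values, so this belongs in development only.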
Short URLs with Zend Framework
First up, what's a short URL? A short URL is just that: a URL that is as short as it can possibly be, so that it takes up as few characters as possible when used in a Twitter message. Tweets are limited to 140 characters, which is probably the main reason short URLs are so popular. Each character counts.
Technically, short URLs consist of a short domain name and a simple identifier, usually the numeric primary key in a database table of whatever item the page is supposed to be for. And to make that number even shorter it's typically base 62 encoded.
The digits are represented using the numbers 0-9, lowercase a-z and uppercase A-Z. And although PHP offers a base_convert() function, it's unfortunately useless here, as it only supports bases up to 36 and loses precision on large numbers (it uses floating point math internally). So a replacement is needed.
There are all kinds of base62 encoding and decoding functions out there already. One is bc_base_convert, which requires the bcmath extension. Another one, which I found on pastie while browsing reddit, is a bit more fleshed out and cleaner looking. I've reproduced it here for easy reference:
/**
* @class Integer
* @author Julien Garand (Go On Web)
*
* Can encode and decode integers to/from a string, using a custom alphabet
*/
class Integer
{
// Default alphabet for a "normal" base 62 encoding
static protected $alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
static protected $base = 62;
/**
* Define your custom alphabet here
*/
static public function setAlphabet( $alphabet )
{
// only strings are allowed
if ( !is_string($alphabet) )
{
throw new Exception('Given alphabet is not a string !');
}
self::$base = strlen( $alphabet ); // Our base will be the length of the given alphabet
// We check if alphabet doesn't have doubled characters
if ( strlen( count_chars( $alphabet, 3 ) ) != self::$base )
{
throw new Exception('The following alphabet has doubled characters : '.$alphabet);
}
self::$alphabet = $alphabet; // store it
}
/**
* Basic accessors
*/
static public function getAlphabet() { return self::$alphabet; }
static public function getBase() { return self::$base; }
/**
* Encode an integer according to the defined alphabet
*
* @param integer : Unsigned integer to be encoded
* @return string (or false if failed)
*/
static public function encode( $integer )
{
$integer = (int)$integer; // Be sure to have an integer
// We only accept unsigned integers
if ( $integer < 0 )
{
return false; // or throw new Exception( "($integer) is less than 0 and cannot be converted" );
}
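// Zero never enters the loop below, so encode it explicitly as the first alphabet character
if ( $integer === 0 )
{
return self::$alphabet[0];
}
$string = ''; // our encoded integer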
// while we have to encode
while( $integer )
{
$pos = $integer % self::$base; // get the rest of euclidian division...
$string .= self::$alphabet[ $pos ]; // thats the position of the char in alphabet
$integer = ( $integer - $pos ) / self::$base; // and divide integer (minus just encoded char) by the base
}
return strrev( $string ); // As we started by the unit of our base ( $base ^ 0 ), we have to reverse the string
}
/**
* Decode a string to an integer according to the defined alphabet
*
* @param string : String to be decoded
* @return integer (or false if failed)
*/
static public function decode( $string )
{
$string = (string)$string; // be sure to have a string;
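// check that our string is not empty and only has chars that are in the alphabet
// (strspn counts the leading chars that ARE in the alphabet; it must cover the whole string)
if ( $string === '' || strspn( $string, self::$alphabet ) !== strlen( $string ) )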
{
return false; // or throw new Exception( "($string) is not a string or contains characters that are not in alphabet" );
}
$integer = 0; // our integer to find
$unit = 1; // we start by $base^0
// foreach chars, starting at the end
for( $i = strlen( $string ) -1; $i >= 0; $i -- )
{
$pos = strpos( self::$alphabet, $string[$i] ); // we find it's position in alphabet
$integer += $pos * $unit; // its our number to add, multiplied by the current unit
$unit = $unit * self::$base; // and go to next unit in our base
}
return $integer;
}
}
So now we can convert our simple numbers into the more cryptic looking short url identifiers simply using Integer::encode(). Here's some example conversions:
1 => 1
10 => a
100 => 1C
255 => 47
1000 => g8
10000 => 2Bi
65535 => h31
100000 => q0U
1000000 => 4c92
Instead of http://example.com/1000000, you could end up with http://example.com/4c92. There, three characters saved. That makes a difference, particularly with really short domain names, such as Twitter's own URL shortener: http://t.co.
Working with Routes
So, in Zend Framework the actual page logic starts in controllers and actions, which are essentially classes and methods, respectively. In order to reach a controller and action, request URLs are routed using the router.
The default route is sufficient for most applications. It conveniently maps the first two path segments to controller and action. So http://example.com/photo/view maps to the PhotoController::viewAction(). Also, the action is optional, and if omitted will default to index. Therefore, http://example.com/photo will map to PhotoController::indexAction(). There are other scenarios that are helpful to be familiar with.
Now the easiest way to support short URLs is to add a route that will match any alphanumeric characters and route that to the desired destination. That could look something like this:
$router = Zend_Controller_Front::getInstance()->getRouter();
$router->addRoute('photo', new Zend_Controller_Router_Route(':shortid', array(
'controller' => 'photo',
'action' => 'view'
), array(
'shortid' => '[0-9a-zA-Z]+'
)));
The side effect, however, is that there's no longer a distinction between a short URL such as "http://example.com/1hF" and "http://example.com/photo", since "photo" could be a base62 encoded number (in fact, it would be the number 373,554,054). This can be worked around if you make all other URLs specify both the controller and action explicitly, so you'd use "http://example.com/photo/index" to ensure that the short URL route doesn't match.
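To see the collision concretely, here is a small standalone decoder, independent of the Integer class, that treats any alphanumeric path segment as a base 62 number. The function name base62ToInt() and the constant are made up for this illustration; the alphabet is the same default one used above.

```php
<?php
// Same default alphabet as the Integer class: 0-9, a-z, A-Z
const BASE62_ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

/**
 * Decode a base 62 string into an integer (illustrative sketch only).
 */
function base62ToInt(string $s): int
{
    if ($s === '') {
        throw new InvalidArgumentException('Empty string');
    }

    $n = 0;
    foreach (str_split($s) as $ch) {
        $pos = strpos(BASE62_ALPHABET, $ch);
        if ($pos === false) {
            throw new InvalidArgumentException('Character not in alphabet: ' . $ch);
        }
        // Shift what we have so far one base 62 digit to the left, then add the new digit
        $n = $n * 62 + $pos;
    }

    return $n;
}
```

Calling base62ToInt('photo') gives 373554054 and base62ToInt('4c92') gives 1000000, so as far as the route is concerned both are perfectly valid short IDs.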
Then, in your controller's action, you'd handle the request using the short URL:
class PhotoController extends Zend_Controller_Action
{
public function viewAction()
{
if ($shortId = $this->_getParam('shortid')) {
$id = Integer::decode($shortId);
}
// rest of the logic to view the photo here, using $id.
}
}
This technique may not always apply, however, since you might already have a larger application that has all kinds of links that you can't just change to make this short URL thing work.
Another technique is to modify the short URLs a bit so they're more easily recognizable as such. I did that for one application by sacrificing one extra character. I just prefixed all the short IDs with an upper case "S". So you'd have a URL such as http://example.com/S4c92. This works since normally the URLs are all lower-case anyway:
$routes['twitter-pics'] = new Zend_Controller_Router_Route_Regex(
'(?-i)S([\w\d]+)',
array('controller' => 'photos',
'action' => 'view'),
array('shortid' => 1),
'/%s'
);
Note that this is a regular expression based route. The (?-i) turns off case insensitivity. I still wasn't happy with this approach, because the action still needs to explicitly handle that 'shortid' variable.
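The effect of (?-i) can be checked outside of Zend Framework with a plain preg_match(). The sketch below assumes that the regex route wraps its pattern in '#^' and '$#i', that is, with the case-insensitive modifier switched on; that wrapping is my reading of the router and should be treated as an assumption, and matchShortPath() is a made-up function name.

```php
<?php
/**
 * Return the short ID if $path looks like "S" followed by the ID, false otherwise.
 * The surrounding '#^' ... '$#i' mimics how the route is assumed to wrap the pattern.
 */
function matchShortPath(string $path)
{
    // (?-i) switches case-insensitivity back off for the rest of the pattern,
    // so only an upper case "S" is accepted as the prefix
    if (preg_match('#^(?-i)S([\w\d]+)$#i', $path, $matches)) {
        return $matches[1];
    }

    return false;
}
```

With this, matchShortPath('S4c92') yields '4c92', while 's4c92' and 'photo' are both rejected. Without the (?-i), a lower case path such as 'search' would match as well.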
Using a Custom Route
I wanted everything encapsulated in the route, so I wrote a custom route class.
The interface that Zend provides is rather straightforward:
interface Zend_Controller_Router_Route_Interface {
public function match($path);
public function assemble($data = array(), $reset = false, $encode = false);
public static function getInstance(Zend_Config $config);
}
match checks whether the route matches the path of the request.
assemble is used to build a URL based on the parameters.
getInstance is supposed to accept a configuration and return a new instance of the route. I don't even care about that at the moment.
Here's the finished class:
/**
* Short Route
*
* Provides short URLs
*
* @author Marcus Welz
*
*/
class Td_Controller_Router_ShortRoute implements Zend_Controller_Router_Route_Interface
{
/**
* @var string The URL prefix
*/
protected $_urlPrefix = 'S';
/**
* @var array The parameter as passed to the request
*/
protected $_params = array();
/**
*
* @param string $urlPrefix The prefix of the URL
* @param array $params The parameters as passed to the request
*/
public function __construct($urlPrefix, $params = array())
{
$this->_urlPrefix = $urlPrefix;
$this->_params = $params;
}
/**
* @param string $path The URL such as "/P3"
* @return array|false returns parameters including the id on success, false if no match
*/
public function match($path)
{
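// Pass the delimiter so that a "/" in the prefix gets escaped as well
$prefix = preg_quote($this->_urlPrefix, '/');
// Note: [A-z] would also match the punctuation between "Z" and "a", so spell out both ranges
if (preg_match('/\/' . $prefix . '([A-Za-z0-9]+)$/', $path, $matches)) {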
$params = $this->_params;
$params['id'] = Integer::decode($matches[1]);
return $params;
}
return false;
}
/**
* Assemble a URL using the ID
*
*
* @param array $data 'id' is the only used parameter in the array
* @param bool $reset unused / ignored
* @param bool $encode unused / ignored
*/
public function assemble($data = array(), $reset = false, $encode = false)
{
return $this->_urlPrefix . Integer::encode($data['id']);
}
public static function getInstance(Zend_Config $config)
{
throw new Exception('not implemented');
}
}
Using it is straightforward. First, add it to the router:
Zend_Controller_Front::getInstance()->getRouter()
->addRoute('photo', new Td_Controller_Router_ShortRoute('S', array(
'controller' => 'photo',
'action' => 'view'
)));
Since the conversion between base 62 and base 10 is happening inside the class, the action doesn't have to decode it itself and is thus blissfully unaware of it. Encapsulation successful. And to generate a URL in a view, you'd use the url() view helper:
$this->url(array('id'=> $photo['id']), 'photo')
Good enough for me.
Caching files statically with Zend Framework
I've been using ZF (almost exclusively) since version 0.10 or so in 2006. It's come a long way since then, and the folks involved with it are very skilled and methodical. It's quite fun to see new versions roll out and see the various proposed components on the wiki come to life over time.
I do find the documentation to be a bit lacking at times, however. For instance, I was messing around with the Zend_Cache_Manager last night, and discovered templates for the "page" and "pagetag" caches, which led me to the Cache action helper. It seems that this component is completely undocumented. I found the proposal on the wiki, though, and after I read through the code I played around with it for a new project. And I have to say, it's rather neat.
So, it provides a caching mechanism using Zend_Cache_Backend_Static, which is a cache that will write out static files that can be served by the web server directly, without invoking PHP at all. And the cache action helper lets you invalidate the generated pages easily as well. Let's say you have a forum, and you're caching each thread statically, then when someone adds a reply, you'd bust the cache.
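To illustrate the general idea (and only the idea, since this is not how Zend_Cache_Backend_Static is actually implemented), a static page cache boils down to mapping a request path to a file below the public directory, writing the rendered HTML there, and deleting the file again to invalidate it. All three function names below are made up for this sketch.

```php
<?php
/**
 * Map a request path such as "/thread/42" to a file below $publicDir.
 * Hypothetical helper; the real backend has its own naming rules.
 */
function staticCachePath(string $publicDir, string $requestPath): string
{
    $path = trim($requestPath, '/');
    if ($path === '') {
        $path = 'index';
    }
    // Refuse anything that could escape the cache directory
    if (strpos($path, '..') !== false) {
        throw new InvalidArgumentException('Invalid request path');
    }

    return rtrim($publicDir, '/') . '/' . $path . '.html';
}

// Write the rendered page so the web server can serve it without PHP
function staticCacheSave(string $publicDir, string $requestPath, string $html): void
{
    $file = staticCachePath($publicDir, $requestPath);
    if (!is_dir(dirname($file))) {
        mkdir(dirname($file), 0775, true);
    }
    file_put_contents($file, $html, LOCK_EX);
}

// "Bust" the cache for one page; returns false if nothing was cached
function staticCacheRemove(string $publicDir, string $requestPath): bool
{
    $file = staticCachePath($publicDir, $requestPath);

    return is_file($file) && unlink($file);
}
```

In the forum example, viewing /thread/42 would end up calling something like staticCacheSave(), and posting a reply would call staticCacheRemove() for the same path so the next visitor triggers a fresh render.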
First, you'll want to tell the cache manager about where to store the cached pages. It defaults to "public/" but that's where I put hand-coded pages, and I don't want to just throw automatically generated pages in there as well. So I added the following to my application.ini:
resources.cacheManager.page.backend.options.public_dir = APPLICATION_PATH "/../public/_cached"
And then I created that directory. And made it world-writable.
Next, the logic added to the ThreadController, where I want to control what's getting cached:
<?php
/**
* Forum thread controller
*
* Handles viewing threads
*/
class ThreadController extends Zend_Controller_Action
{
public function init()
{
// the view action is cachable.
$this->_helper->cache(array('view'));
}
/**
* View a forum thread
*
* URL: forums.example.com/thread/<id>
*/
public function viewAction()
{
// get thread id from URL
$threadId = $this->_getParam('id', 0);
// Pull forum posts for this thread from the service tier
$service = new App_Service_Forums();
$posts = $service->findPostsByThreadId($threadId);
// Feed the view
$this->view->posts = $posts;
}
/**
* Reply to a forum thread
*
* URL: forums.example.com/thread/reply/<id>
*/
public function replyAction()
{
// get thread id from URL
$threadId = $this->_getParam('id', 0);
// we want the request
$request = $this->getRequest();
// The form that users will compose the forum reply in
$form = new App_Form_Forum_Reply();
if ($request->isPost() and $form->isValid($request->getPost())) {
// Form was submitted, so process it
// Post a reply to the thread via the service tier
$service = new App_Service_Forums();
$reply = $service->replyToThread($threadId, $form->getValues());
// also clear the cache for the URL
$this->_helper->cache->removePage('/thread/' . $threadId, true);
$this->_redirect('/thread/' . $threadId);
}
$this->view->form = $form;
}
}
Then, there are also a few pitfalls. For one, you have to turn off the front controller's output buffering, otherwise you end up with empty cache files. If you're using an .ini file to drive application configuration, you'll want to add:
resources.frontController.params.disableOutputBuffering = true
And second, you need to tweak your web server to try to serve those cached files first. So my .htaccess file looks like this:
RewriteEngine On
# Serve cached pages if they exist
RewriteRule ^/(.*)/$ /$1 [QSA]
RewriteRule ^$ _cached/index.html [QSA]
RewriteRule ^([^.]+)/$ _cached/$1.html [QSA]
RewriteRule ^([^.]+)$ _cached/$1.html [QSA]
# Hit files, symlinks and directories directly, if they exist.
RewriteCond %{REQUEST_FILENAME} -s [OR]
RewriteCond %{REQUEST_FILENAME} -l [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^.*$ - [NC,L]
# Everything else hits the application.php
RewriteRule ^.*$ /application.php [NC,L]
And I think I just discovered a bug in the Zend_Cache_Backend_Static::removeRecursively() method, which doesn't remove the directory properly.
Still trying to find a way to scale this beyond a single server, since there's no distributed cache backend that supports tagging. Would have to use file based caching with an NFS share or something, and that doesn't seem all that optimal.
Bootstrapping Zend Framework via ErrorDocument 404
Traditionally Zend Framework applications are bootstrapped using mod_rewrite as recommended in the manual and various tutorials. For non-Apache servers such as nginx, similar methods are provided. But it's also possible to use Apache's ErrorDocument configuration to kick off a Zend Framework based application. This comes in handy if the web application requires many Apache aliases and other legacy configurations that make mod_rewrite tedious to use or bad for performance. In such a case Apache is configured to just "serve" per its configuration, and anything that results in a 404 because a file wasn't found will be redirected to the Zend Framework bootstrap file.
In order to get that working all you need in your httpd.conf or .htaccess is:
ErrorDocument 404 /index.php
Now, when you try to hit a URL such as 'http://example.com/hello/world', the file is not found, and Apache's error handler will kick in, and call index.php, which should be the ZF bootstrap file. And as long as you have a HelloController with worldAction() method, it should be invoked as you'd expect. There are, however, a number of pitfalls that need to be taken into consideration otherwise you'll soon run into strange issues.
The Modified Apache Context
Right off the bat, the first problem is that the query string is no longer accessible the regular way. $_GET is empty. So if you call http://example.com/hello/world?foo=bar and you expect to be able to grab that with $this->_getParam('foo') you'll be disappointed, because it's not there. $_GET is blank, and therefore, the Zend_Controller_Request_Abstract class logic to populate the parameters and query string interaction is broken. This is corrected by plugging Zend_Controller_Request_Apache404 into the Front Controller which is supposed to provide the compatibility with Apache 404s.
Somewhere in your bootstrap mechanism you'll want to have:
Zend_Controller_Front::getInstance()->setRequest(new Zend_Controller_Request_Apache404());
This will populate $_GET with data from the REDIRECT_QUERY_STRING server environment variable, which Apache sets up for the error document. Well, it should anyway. Unfortunately as of version 1.10.4 the Zend_Controller_Request_Apache404 class has a bug in it, and it tries to access REDIRECT_QUERYSTRING instead. This has been reported as a bug. As a workaround, you can either modify the class, copy and paste it into a new class, or copy $_SERVER['REDIRECT_QUERY_STRING'] into $_SERVER['REDIRECT_QUERYSTRING'].
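A slightly different workaround than the ones listed is to fill $_GET yourself before the front controller runs. The sketch below is a made-up helper, not part of Zend Framework. Which server variable actually carries the original query string is an assumption here (Apache's documentation for custom error documents describes REDIRECT_-prefixed variables), so dump $_SERVER inside your 404 handler and adjust the candidate list accordingly.

```php
<?php
/**
 * Recover query parameters when $_GET is empty after an ErrorDocument redirect.
 * $candidates lists the server variables to try, in order (an assumption; verify
 * against your own $_SERVER contents).
 */
function recoverQueryParams(
    array $server,
    array $candidates = array('REDIRECT_QUERY_STRING', 'QUERY_STRING')
): array {
    foreach ($candidates as $key) {
        if (isset($server[$key]) && $server[$key] !== '') {
            $params = array();
            // parse_str() decodes "foo=bar&baz=1" into an associative array
            parse_str($server[$key], $params);

            return $params;
        }
    }

    return array();
}

// Typical use early in the bootstrap:
// if (empty($_GET)) { $_GET = recoverQueryParams($_SERVER); }
```

This keeps the framework class untouched, at the cost of one more thing to remember in the bootstrap file.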
GET requests aren't the only issue. POST variables are also not accessible (with Apache 2.0+), and unfortunately Apache provides no alternate environment variable that contains the POST data. That makes sense, since POST bodies may be multi-part and can be a bit more complex. In this case, mod_actions saves the day by providing a way of calling a particular script depending on the HTTP request method. Here, we'll route all POST requests to index.php:
Script POST /index.php
The pitfall here is that you can no longer submit a POST to anything other than the Zend Framework bootstrap file. So if you were planning on running legacy PHP side by side with your ZF application, you'll have to resort to additional Apache configuration in order to make that fly.
Not as immediately apparent is the issue of having all your pages served with status code 404. Remember, the app is bootstrapped using Apache's 404 error handler. The Zend Framework currently doesn't set any headers by default as it assumes the default response code to be 200. So because PHP doesn't set anything, Apache will use its own 404 status code with your regular application body as content, so it'll probably still show up in most browsers. However, automated tools and search engines will freak out and probably not index your site at all. This has also been reported as a bug.
Considering the various issues that I've encountered, it's almost safe to assume that nobody else has really tried bootstrapping a Zend Framework web application using the Apache 404 handler. Otherwise, some of these problems would probably have surfaced and been remedied by now.
