Improving the 404 Search

Posted November 14th, 2005 @ 08:00am by Erik J. Barzeski

I last wrote about the 404 Search in February 2003 (both here and here). Since that time, I've been using the 404 search code quite heavily on every site with MovableType (or any other blogging package).

It's undergone some improvements, however, and the version currently used on The Sand Trap looks like this. An explanation - and a question - follow.

The first three lines grab the filename from the (incorrect) URL (e.g. http://thesandtrap.com/archives/blah/some_url_here.php -> some_url_here.php). If the filename is one commonly requested by Internet viruses that troll for IIS servers, I simply exit quickly. I could put this if clause just before the explode and probably save a little processing time, but I think leaving it where it is makes the code easier to read.
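The original code is PHP and is linked rather than shown, so here is a minimal Python sketch of the same two steps: pull the last path segment out of the requested URL, then bail out if it is a known worm probe. The function names and the blocklist contents are assumptions for illustration; the post does not list the filenames it actually checks.

```python
from urllib.parse import urlsplit

# Illustrative blocklist only (assumed); the real list is not given in the post.
BLOCKED_FILENAMES = {"cmd.exe", "root.exe", "default.ida"}


def requested_filename(url: str) -> str:
    """Return the last path segment of the requested (missing) URL."""
    path = urlsplit(url).path
    return path.split("/")[-1]


def should_ignore(filename: str) -> bool:
    """True when the request looks like a worm probe and deserves no search."""
    return filename.lower() in BLOCKED_FILENAMES
```

A 404 handler built this way would call `should_ignore()` first and return immediately on a hit, so no search request is ever made for those probes.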

Because MovableType's built-in search doesn't know much about filenames, I strip the file extension and save it for later. Then I create the variable for the basic search URL. The extension is important if the filename is .jpg, .gif, or .png - my staff members and I routinely re-use stock shots of people like Tiger Woods, Annika Sorenstam, etc., so occasionally we look for prior use so that we can properly attribute the image's copyright. If we're looking for an image, I add the extension back to the search term; otherwise, I strip the underscores and replace them with spaces ("%20" is a space in a URL) because The Sand Trap's filenames are "dirified" versions of the article, so "Bobby Jones Wins Again" becomes "bobby_jones_wins_again."
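As a rough Python sketch of that branch (the real code is PHP; `search_term` and `IMAGE_EXTENSIONS` are names made up here): image filenames are searched for verbatim, while everything else loses its extension and has its underscores turned into URL-encoded spaces.

```python
import os
from urllib.parse import quote

# Extensions for which the full filename, extension included, is the search term.
IMAGE_EXTENSIONS = {".jpg", ".gif", ".png"}


def search_term(filename: str) -> str:
    """Build the URL-encoded search term for a missing filename."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() in IMAGE_EXTENSIONS:
        # Looking for prior use of an image: keep the extension.
        return quote(filename)
    # Article filenames are "dirified" titles: underscores stand in for spaces.
    return quote(stem.replace("_", " "))
```

The returned string would then be appended to the basic search URL, which is not reproduced here because the post does not give it.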

Along the way, I quickly compose an email to myself (it helps me to monitor missing files, as these searches are only triggered via a 404).

Finally, I implode the results of the file() call (a byproduct of which is that the search is actually performed) to create one big string. I match the string against something that appears in links to the matching articles: class="search result". If only one match is found, I forward the user directly to that article. If more than one match is found, I print the search results.
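The forward-or-print decision can be sketched in Python as follows. This assumes the result links are ordinary `<a>` tags carrying both the `class="search result"` marker and a double-quoted `href`; the exact markup of the search template is not shown in the post, and the function names are hypothetical.

```python
import re

RESULT_MARKER = 'class="search result"'
ANCHOR_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)
HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def result_links(results_html: str) -> list:
    """Collect the href of every anchor tag that carries the result marker."""
    links = []
    for tag in ANCHOR_TAG.findall(results_html):
        if RESULT_MARKER in tag:
            match = HREF.search(tag)
            if match:
                links.append(match.group(1))
    return links


def choose_action(results_html: str) -> tuple:
    """Redirect on exactly one hit; otherwise show the results page as-is."""
    links = result_links(results_html)
    if len(links) == 1:
        return ("redirect", links[0])
    return ("show_results", results_html)
```

Zero matches fall through to `show_results` in this sketch, on the assumption that the search page's own "no results" message is the most useful thing to display.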

This search function has served me well (I routinely type http://thesandtrap.com/search term here to perform quick searches), and the "automatic forwarding" works in certain cases to shuffle people along. For example:

http://thesandtrap.com/archives/clubs/titleist_introduces_735_cm_combo_irons.php

Unfortunately, I've noticed a trend lately for some search results not to work as one would hope. Take this, for example:

http://thesandtrap.com/archives/clubs/sneak_peek_at_taylormades

I think what is happening is that someone sends the link in an email and, due to line-wrapping in email clients, the recipient doesn't get the full URL:

http://thesandtrap.com/archives/clubs/sneak_peek_at_taylormades_2006_new_products.php

What's worse, because the article does not include "taylormades" - the title includes "TaylorMade's" with an apostrophe - the article that should appear does not appear in the search results. What's even worse, since we link to the "sneak peek" article in a review, the review thus contains "taylormades" (as part of the URL in the link), and so the reader is forwarded there.

Ideally, there would be some fast way to look at filenames and, if something is close, to send the user there. But that seems altogether too tedious: files are located in several directories and people don't always include every directory (e.g. searches for /clubs/blah_blah_blah are common, but the file actually resides in /archives/clubs/ - another directory down). This may not be too big a problem, as every article is inside of /archives/xxxxx/file.php where xxxxx is the category.
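Because every article lives exactly one level below /archives/, one possible approach is a single prefix glob across the category directories. The Python sketch below is untested against the real site; `find_by_prefix` and its parameters are hypothetical names, and it assumes the truncated request is a prefix of the real filename and that articles end in .php.

```python
import glob
import os


def find_by_prefix(archives_root: str, truncated_name: str) -> list:
    """Return article files in any category whose name starts with truncated_name."""
    if not truncated_name:
        # An empty prefix would match every article; treat it as no match.
        return []
    # glob.escape keeps characters like * or [ in the request from acting as wildcards.
    pattern = os.path.join(
        glob.escape(archives_root), "*", glob.escape(truncated_name) + "*.php"
    )
    return sorted(glob.glob(pattern))
```

Redirecting only when the list has exactly one entry would avoid sending the reader to the wrong article when a short prefix matches several files.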

Does anyone have any suggestions? The "chopped off filename" happens frequently enough that I'd like to resolve it, but I'm not sure how to do so quickly and without searching through every folder inside of archives. I figure the code to forward the user to the proper URL would fit in best where I have commented "TAKE USER TO TRUNCATED FILENAME IF EXISTS," but I'm open to suggestions.

Anyone have any? Feel free to comment on other portions of the code, but please don't be upset if I don't pay attention to those comments: I'm primarily interested in getting the last bit working.

P.S. The search code I use here on nslog.com has a few other pieces in it. You can view that source here. It has the distinct advantage of allowing me to type "http://nslog.com/123" to visit the entry with the entry ID 123. Before I discovered the "dirify" (now "basename" in MovableType 3.2), my articles existed at http://nslog.com/000123.php and the code you see in this modified version forwards the user to the new URL for me.
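The entry-ID shortcut comes down to recognizing a purely numeric request, with or without zero padding and a .php extension. A small Python sketch of that recognition step (the pattern and function name are assumptions, and the lookup from ID to the new URL is left out because the post does not describe it):

```python
import re
from typing import Optional

# Matches "123" as well as the old zero-padded form "000123.php".
ENTRY_ID = re.compile(r"^0*(\d+)(?:\.php)?$")


def entry_id_from(filename: str) -> Optional[int]:
    """Return the numeric entry ID if the filename is one, else None."""
    match = ENTRY_ID.match(filename)
    return int(match.group(1)) if match else None
```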


4 Responses to "Improving the 404 Search"

Bud Landry
Posted 15 Nov 2005 at 2:41pm #

From the surfer's point of view, a truncated URL will either pull up a complete URL I have typed before, and that my browser can often autocomplete, or send me a 404 message.

The most natural response of a surfer to a 404 message is to backspace to the last slash in the URL, or back to the domain name if need be, and hope that the next rung up the ladder in the hierarchy actually resolves to an HTML file, an actual page, ideally with links to the article one was trying to reach with the broken URL. That might call for more work from the webmaster.

The most common workaround I see would be for the URL to always include a level for its date of publication.

For example:

http://thesandtrap.com/archives/clubs/sneak_peak_at_taylormades_2006_new_products.php

which, by the way, wrapped in this comment box, could be called

http://thesandtrap.com/archives/clubs/2005/11/11/sneak_peak_at_taylormades_2006_new_products.php

Most surfers looking for that article might backtrack to the date, and hope the article is among others posted on that date.

Erik J. Barzeski
Posted 15 Nov 2005 at 3:20pm #

Bud, that doesn't really solve the problem at all, nor is it something I'll do for several reasons. One of those is that good URLs don't change, so The Sand Trap is not getting a new URL format. Plus, it'd further strain MovableType: if there's no gain, I'll always take the faster (re)build setup. Plus, users can already "back up" a level to find the article.

The system works pretty well now and does a pretty good job of helping the user. What it doesn't do is match filenames. I just helped Judi write something that does this, though, so perhaps I'll go that route later on… Basically, it searches for basenames instead of the entry ID (as the NSLog variation does).

A View from Home
Posted 16 Nov 2005 at 12:39pm #

No more useless “file not found”

I launched this blog in January 2003 and since then have posted over 1100 entries. It's been a learning experience. I've changed the blog's file structure a few times. If you blog and you use a blogging content management system...

Erik J. Barzeski
Posted 16 Nov 2005 at 2:14pm #

That way didn't work because it's difficult to determine the primary category. Obviously it can be done, but I don't feel like writing the JOINs and whatnot to get it done at this point.
