Introduction

Did you ever notice on big sites like www.microsoft.com that if you reach a page that doesn't exist, they don't just say "Sorry, 404."? They give you a list of pages similar to the one you requested. This is obviously very nice for your users to have, and it's easy enough to integrate into your site. This article provides source code and explains the algorithm used to accomplish this feature.

Note: the real benefit of the approach outlined here is the semi-intelligent string comparison.

Overview

When communicating via HTTP, a server is required to respond to a request, such as a web browser's request for an HTML document (web page), with a numeric response code, sometimes followed by an email-like MIME message. In the code 404, the first "4" indicates a client error, such as a mistyped URL; the following two digits identify the specific error encountered. HTTP's use of three-digit codes is similar to the use of such codes in earlier protocols such as FTP and NNTP. Each response code has an associated string of English text that must also be present; response code 404's associated string is "Not Found". When sending a 404 response, web servers usually include in the response message a short HTML document that mentions both the numeric code and this string. On a large number of such servers these messages can be customized to display a page more helpful than the default; in Apache, for example, this can be achieved by placing a .htaccess file on the web server or by editing httpd.conf. Creating humorous 404 pages has become popular, and websites have been created for the sole purpose of linking to numerous amusing 404 error pages. Internet Explorer, however, will not display custom pages unless they are larger than 512 bytes, opting instead to display a "friendly" error page. A 404 error is often returned when pages have been moved or deleted.
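As a sketch of the Apache customization mentioned above, a single ErrorDocument directive is enough (the path /404.aspx here is just an illustration; substitute your own error page):

```apache
# In .htaccess or httpd.conf: serve a custom page for 404 responses
ErrorDocument 404 /404.aspx
```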
In the first case, a better response is to return a 301 Moved Permanently, which can be configured in most server configuration files or through URL rewriting; in the second case, a 410 Gone should be returned. Because these two options require special server configuration, most websites do not make use of them. The popularity of the World Wide Web has led to the use of "404" as a neologism denoting a missing thing or person.

False 404 errors

Some websites report a "not found" error by returning a standard web page with a "200 OK" response code; this is called a soft 404. Soft 404s are problematic for automated methods of discovering whether a link is broken. A heuristic for identifying soft 404s was given by Bar-Yossef et al.[1] In July 2004, the UK telecom provider BT Group implemented the Cleanfeed content-blocking system, which returns a 404 error to any request for content identified as illegal by the Internet Watch Foundation. Governments that censor the Internet also often return a fake 404 error when a user tries to access a blocked website.
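In Apache, for instance, the 301 and 410 responses described above can be configured with Redirect directives (the paths here are hypothetical, purely for illustration):

```apache
# 301 Moved Permanently: point the old URL at its new location
Redirect permanent /old-page.html /new-page.html

# 410 Gone: tell clients the resource was deliberately removed
Redirect gone /deleted-page.html
```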
Background
The need for this grew out of a client of mine who was changing content management systems; every URL on the site changed, so all the search engine results came up with 404 pages. This was obviously a big inconvenience, so I put this together to help users find their way through the new site when arriving from a search engine.

Requirements

Your web site must be set up so that 404 requests get redirected to a .NET .aspx page. You must also have some way of getting an array of all the page URLs on your site that you want to compare 404 requests against. If you have a content management system, there is probably a structure of all the pages stored in XML or in a JavaScript array (for DHTML menus or the like), or you could write your own query to get the pages from a database. If you do not use a content management system, you could hard-code a string array variable in the 404 page's code-behind containing the page names, or think up some way of dynamically reading all the .aspx or .html pages from the file system.

When the 404 page is accessed, you need to know which page was requested. Using web.config, you can set up 404 error codes to go to /404.aspx, where the requested page is tagged onto the querystring. The source code here assumes you use this approach, but you can obviously change it to suit your own needs; simply change the GetRequestedUrl() function.

Why regular expressions are not enough

To compare strings, you can use System.String.IndexOf, or you can use regular expressions to match similarities, but all of these methods are very unforgiving of slight discrepancies between strings. In the example URL above, the page name is december15-isercworkshopontesting.html, but under the new content management system the URL is december 15 - iserc workshop - software testing.html, which is different enough to make traditional string comparison techniques fall down.
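The web.config setup described above can be sketched as a minimal fragment like this; with customErrors switched on, ASP.NET redirects 404s to the named page and appends the originally requested path in the aspxerrorpath querystring parameter:

```xml
<!-- web.config: route 404s to a custom page; ASP.NET appends
     the requested path as ?aspxerrorpath=... -->
<configuration>
  <system.web>
    <customErrors mode="On">
      <error statusCode="404" redirect="404.aspx" />
    </customErrors>
  </system.web>
</configuration>
```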
So, I looked around for a fuzzy string comparison routine, and came across an algorithm written by a guy called Levenshtein. His algorithm figures out how different two strings are, based on how many character additions, deletions and modifications are necessary to change one string into the other. This is called the "edit distance", i.e., how far you have to go to make two strings match. This is very useful because it takes into account slight differences in spacing, punctuation and spelling. I found this algorithm here, where Lasse Johansen kindly ported it to C#. The algorithm is explained at that site, and it is well worth a read to see how it is done.

Code summary

    private void Page_Load(object sender, System.EventArgs e)
    {
        GetRequestedUrl();
        SetupSiteUrls();
        ComputeResults();
        BindList();
    }
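To illustrate the idea, here is a standard dynamic-programming implementation of the Levenshtein edit distance, plus a helper that ranks candidate site URLs by their distance to the requested page. This is a sketch, not Lasse Johansen's port itself, and the class and method names are my own:

```csharp
using System;
using System.Linq;

static class FuzzyMatch
{
    // Levenshtein edit distance: the minimum number of single-character
    // insertions, deletions and substitutions needed to turn s into t.
    public static int Levenshtein(string s, string t)
    {
        int[,] d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;

        for (int i = 1; i <= s.Length; i++)
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,    // deletion
                             d[i, j - 1] + 1),   // insertion
                    d[i - 1, j - 1] + cost);     // substitution
            }
        return d[s.Length, t.Length];
    }

    // Rank the site's URLs from closest to furthest from the requested page.
    public static string[] ClosestPages(string requested, string[] siteUrls)
    {
        return siteUrls
            .OrderBy(u => Levenshtein(requested.ToLower(), u.ToLower()))
            .ToArray();
    }
}
```

In the Page_Load summary above, ComputeResults() could use a routine like ClosestPages() to sort the site's URLs by distance to the requested page, and BindList() could then bind the best few matches to a list control.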
Application

- A better user interface
- Browsing the site without dead-end errors
- More helpful results, with greater precision