Friday, February 11, 2005

Kids, don't try this at home ... you'll only make it more difficult to do your homework.

As a maintainer of several web sites and as someone who depends on the web as a tool of the trade, I take abusers seriously. Search engines have a tough job indexing the web (after all, software can't read), just so we can all find relevant content when we visit our favorite search engine and type a phrase of interest. We've all experienced it: in a vain attempt to make up for that limitation, search engines often index pages badly, injecting a lot of noise into search results and frustrating Joe-Q-Surfer to no end. The seasoned among us learn to ignore this noise, but it is still there, boolean logic aside. An accomplished webmaster might painstakingly march their home page up to the top of the proverbial charts by carefully crafting the titles, headings, meta-tags and link text, not to mention providing valued content, just so we can get to it when we need to, only to find their web site might as well have a target painted on its URL.


While the search engine authors devise ever-more clever methods for connecting phrases to pages and thwarting the posers, there are those out there who are working even harder to undermine the whole thing. You kinda expect this from those cheesy doorway sites that use all the tricks they can think of to attract visitors, unaccomplished typists and unsuspecting web lurkers alike, with the most common of typos and misspelled URLs. But how's this for a shrewd little trick? Find the top ranking page for a search phrase, for example "dog training", then copy a key phrase out of that page, say "Labrador Retriever training tips". Next, generate a URL with all those words in it, like "http://www.somesite.com/dog-training-labrador-retriever-tips.jsp". Finally, host that URL on a web server that coughs up a page containing all the same words in titles, hidden text and alt text, and tosses in a lot of links under those words to the snake oil du jour, which, by the way, typically has absolutely nothing to do with dogs.

It's hard to imagine a mechanism that would usurp the place of the top ranking page more easily. Clever, yes, I'll give them that. How does this happen? Simply put, it goes something like this: a targeted search engine might rate the content of a page by trying to determine its meaning (I know, I know, software can't read), applying some simple, and sometimes some very complex, rules. For example, a rule might assign greater rank to phrases in titles, on the assumption that a phrase in a title is more important than one lower down on the page, so the page must be about that topic, right? The text in or under a hyperlink might also be an indicator of that phrase's importance. In practice, of course, the rules used are much more imaginative than this, but you get the idea. Engines apply these ranking rules to determine the correlation between phrases and pages, and thereby determine what you see when you type that phrase while searching the web. I often think a return to some much simpler principles here would help this situation immensely, but that is another topic I suppose.
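To make the idea concrete, here is a rough sketch of the kind of rule described above. It is a toy of my own, not how any real engine works: the weights, the field names ("title", "links", "body") and the two sample pages are all made up for illustration. It simply counts how often a phrase shows up in the title, the link text and the body, and gives the title and link hits more credit.

```python
import re

# Hypothetical weights, chosen only for illustration; no real engine is
# known to use these numbers.
WEIGHTS = {"title": 5.0, "link": 3.0, "body": 1.0}


def count_phrase(phrase, text):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def score(page, phrase):
    """Weighted count of the phrase in a page's title, link text and body."""
    total = WEIGHTS["title"] * count_phrase(phrase, page["title"])
    total += WEIGHTS["link"] * sum(count_phrase(phrase, t) for t in page["links"])
    total += WEIGHTS["body"] * count_phrase(phrase, page["body"])
    return total


# Two made-up pages: one with real content, one stuffed with the phrase.
genuine = {
    "title": "Dog Training Basics",
    "links": ["Labrador Retriever training tips"],
    "body": "Practical dog training advice. Good dog training takes patience.",
}
stuffed = {
    "title": "Dog training - dog training tips",
    "links": ["dog training", "dog training", "dog training"],
    "body": "dog training dog training dog training",
}

print(score(genuine, "dog training"))  # 7.0
print(score(stuffed, "dog training"))  # 22.0
```

Under these assumed weights the stuffed page wins by a mile, even though it says nothing useful, which hints at why rules this simple are so easy to game.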

Back to our example above -- the creators of the masquerading page have reverse engineered the rules the engine in question uses to index the web, and have auto-generated a near-perfect rule match, effectively knocking the real page off the podium. Once the 'trick' page itself is indexed by the same engine, it is all but guaranteed to take the web searcher away from the page they are really looking for. You can bet the end result is ever more inaccurate searches, making it increasingly difficult to locate what you're looking for in one or two clicks.


These guys are not only breaking every principle of copyright law, they are wasting our time, and are no better than the e-mail spammers that siphon your e-mail address off the internet, then, with unbelievable audacity, hawk their wares in a seemingly never-ending torrent of bandwidth-robbing prattle with YOUR address in the "from" field. I mean really! ... what if your eighty-year-old dad got that tip, supposedly from you, about those "Pen1s enlarg3ment p1lls" he can order online? The lowest form of pond scum. Flame off.