Jean-Marc

JM's Blog

Web, Society, Technology, and Innovation


June 04, 2005

Spam: you probably know about email address harvesting, what about email gleaning?

Most people are aware that e-mail addresses posted on the web are harvested by spammers, using software that crawls the web from page to page, following links and retrieving everything that looks like an email address. This happens on any website or newsgroup, including forums and, of course, blogs.
The one thing spammers need to decide is where on the web to start collecting addresses. Here they proceed like anyone else looking for information: they use either a directory or a search engine. Starting the crawl from a large web directory such as the Open Directory and its derivatives lets them target certain categories of victims (personal sites, small businesses, or universities), which makes their address lists more valuable. Using search engines lets them shortlist sites likely to contain up-to-date contact information (e.g. by searching for “contact 2005”). The search results can then be used as pre-processed input for further automated email address extraction.

More surprising is the fact that some people out there spend their days manually gleaning email addresses from the web. They mostly connect from Internet cafés in places like Ivory Coast or Nigeria and use tools such as Google, Yahoo or search engine aggregators to look for email addresses with queries like “contact john 2005” or “email me 2005”. Look for “2005” in your web server log and chances are you will find evidence of this happening on your site.
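As a rough illustration of that log check, here is a minimal sketch that flags requests whose referrer looks like a search query combining a contact-style word with the year. It assumes log lines in the Apache "combined" format; the function name `findGleaningHits` and the keyword list are made up for this example, so adapt both to your own server.

```javascript
// Hypothetical helper, not part of any library: returns the log lines whose
// referrer contains a query string with both the given year and a
// contact-style keyword.
function findGleaningHits(logLines, year) {
  // Assumed keyword list; "mail" also covers "email" and "e-mail".
  const keywords = ["contact", "mail"];
  const hits = [];
  for (const line of logLines) {
    // Assumption: Apache "combined" format, whose quoted fields are
    // "request", "referrer" and "user-agent", in that order.
    const quoted = line.match(/"([^"]*)"/g);
    if (!quoted || quoted.length < 3) continue;
    const referrer = quoted[quoted.length - 2].slice(1, -1);
    const queryStart = referrer.indexOf("?");
    if (queryStart === -1) continue; // no query string, so not a search result page
    const rawQuery = referrer.slice(queryStart + 1).replace(/\+/g, " ");
    let query;
    try {
      query = decodeURIComponent(rawQuery).toLowerCase();
    } catch (e) {
      query = rawQuery.toLowerCase(); // malformed escapes: fall back to the raw text
    }
    if (query.includes(String(year)) && keywords.some((k) => query.includes(k))) {
      hits.push(line);
    }
  }
  return hits;
}
```

Only the referrer is inspected, so the year in the request timestamp does not trigger a match. A hit is a hint rather than proof: an ordinary visitor may well have typed the same query.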
There are ways to limit automated email harvesting without sacrificing too much web usability (e.g. by using encoded email links). There are also ways to make manual email address collection harder: one simple measure is to remove the hard-coded year from the copyright notice of your contact page, so that it no longer appears as literal text for a search on “2005” to match, and replace it with a short script that writes the current year when the page loads:

<script type="text/javascript"><!--
document.write((new Date()).getFullYear())
//--></script>
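For the automated case, the encoded email links mentioned above can take several forms. One common variant, sketched here as an assumption rather than as the only way to do it, writes every character of the `mailto:` link as a numeric HTML entity. Browsers decode the entities transparently, while a harvester that simply pattern-matches the raw page source sees nothing resembling an address. The names `toEntities`, `encodedMailtoLink` and `naivePattern` are invented for this example.

```javascript
// Converts each character of a string into its numeric HTML entity,
// e.g. "@" becomes "&#64;".
function toEntities(text) {
  return Array.from(text, (ch) => "&#" + ch.codePointAt(0) + ";").join("");
}

// Builds a mailto link in which neither the href nor the visible label
// contains the address as plain text.
function encodedMailtoLink(address) {
  return '<a href="' + toEntities("mailto:" + address) + '">' +
    toEntities(address) + "</a>";
}

// A deliberately naive pattern, standing in for the kind of matching
// a simple harvester might apply to raw page source (an assumption).
const naivePattern = /[\w.+-]+@[\w-]+\.[\w.-]+/;
```

The generated markup can be pasted into the page as static HTML, or written out with `document.write(encodedMailtoLink("you@example.org"))` in the same way as the year script. This only stops harvesters that do not decode entities, so treat it as a way to raise the bar, not as a guarantee.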

Copyright ©2007 Syronex