Since writing my SpamKit Plugin I have been keeping a keen eye on the comment/trackback spam subject and have guinea pig’d my ideas on my own blog. Recently I noticed a distinct change in the sophistication of comment-spammers.
The early comment-spammers were using very basic HTTP clients, mostly without thinking about what’s going on ‘under the hood’. As such their spam-messages would come through with easily filtered HTTP “User-Agent” headers like “PEAR HTTP_Request class ( http://pear.php.net/ )” and “libwww-perl/5.803”. Over a period of a few months these – what I call 1st generation – bots began to dwindle in numbers, replaced by slightly more sophisticated clients which loosely emulated real browsers.
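As a rough illustration of how little it takes to filter these 1st generation bots, here is a minimal Python sketch of a User-Agent check. It is not SpamKit's code; the function name and the marker list are assumptions made for the example, with the two markers taken from the headers quoted above.

```python
# Substrings taken from the User-Agent headers quoted in this post. The list
# is illustrative only and is not SpamKit's actual rule set.
SUSPICIOUS_AGENT_MARKERS = ("pear http_request", "libwww-perl")


def is_suspicious_user_agent(user_agent):
    """Return True for a missing/empty header or one naming a known HTTP library."""
    # Treating an empty header as suspicious is an assumption of this sketch.
    if user_agent is None or not user_agent.strip():
        return True
    lowered = user_agent.lower()
    return any(marker in lowered for marker in SUSPICIOUS_AGENT_MARKERS)
```

A check like this only catches clients that do not bother to disguise themselves, which is exactly why it stopped being enough.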
These 2nd generation bots were still very primitive. Apart from changing the “User-Agent” and adding a few other headers they were still pretty basic, and would repeatedly attempt to post comments on a number of posts within a few seconds. This activity is also easily filtered, since not even a superhuman blog-fiend could comment on your top ten posts in less than 10 seconds.
All the attempts so far have been very basic; beginners in Perl / PHP could probably pull them off easily, and they are just as easily filtered out.
Over the Christmas period I observed some very unusual activity: a ‘spam attack’ coming from dozens of source IP addresses, coordinated within a few minutes. I initially spotted it because the “User-Agent” header was completely empty, which stands out a bit. After some investigation and further attacks I became pretty confident this wasn’t a fluke or a coincidence of independent spammers.
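The "too many comments in too few seconds" test can be sketched as a sliding window per IP address. This is only an illustration in Python, not how any particular plugin does it; the class name and the thresholds (2 posts per 10 seconds) are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class CommentRateFilter:
    """Flag an IP that attempts more than max_posts comments within window_seconds."""

    def __init__(self, max_posts=2, window_seconds=10.0):
        # Both thresholds are hypothetical defaults for this sketch.
        self.max_posts = max_posts
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # ip -> timestamps of recent attempts

    def allow(self, ip, now):
        """Record an attempt at time `now` (in seconds) and say whether to accept it."""
        stamps = self._recent[ip]
        # Drop attempts that have fallen out of the sliding window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        # Rejected attempts are recorded too, so a hammering bot stays blocked.
        stamps.append(now)
        return len(stamps) <= self.max_posts
```

With the defaults, a third attempt from the same address inside ten seconds is refused, while a human commenting once every few minutes never notices the filter.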
I knocked up a quick WordPress plug-in to capture as much info about these suspicious requests as possible. Here is one of the first attacks.
03/02/2006 20:37:44 212.0.XXX.XXX GET /
03/02/2006 20:38:14 201.242.XXX.XXX GET /category/wordpress/plugins/
03/02/2006 20:39:54 210.183.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/
03/02/2006 20:40:25 200.122.XXX.XXX GET /category/java/jakarta-velocity/
03/02/2006 20:40:37 62.23.XXX.XXX GET /2006/02/02/sitecom-cn-502-usb-bluetooth-dongle-works-on-linux/
03/02/2006 20:40:55 68.96.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/
03/02/2006 20:41:18 70.88.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:41:20 70.88.XXX.XXX GET /category/thoughts/
03/02/2006 20:41:44 200.21.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:41:48 200.21.XXX.XXX GET /2006/01/25/ti-7x21-flashmedia-sd-host-controller-104c-8033/
03/02/2006 20:42:16 61.145.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:42:24 217.113.XXX.XXX GET /category/flash/
03/02/2006 20:42:48 212.251.XXX.XXX GET /category/internet/
03/02/2006 20:43:04 205.180.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:43:22 82.76.XXX.XXX GET /keywords/
03/02/2006 20:43:56 218.248.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/#postcomment
03/02/2006 20:44:13 206.191.XXX.XXX GET /2006/02/02/search-term-highlighter-plugin-0-0/%23postcomment
03/02/2006 20:44:14 206.191.XXX.XXX GET /category/tools/
03/02/2006 20:44:15 206.191.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:44:38 62.23.XXX.XXX GET /category/wordpress/plugins/search-term-highlighter/
03/02/2006 20:45:33 82.76.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:45:34 82.76.XXX.XXX GET /category/tools/
03/02/2006 20:45:35 82.76.XXX.XXX POST /wp-comments-post.php
03/02/2006 20:45:48 203.162.XXX.XXX POST /wp-comments-post.php
In this particular instance, the attack was over a ten-minute period. The first request was an HTTP GET on the root of my blog “/”, almost definitely used to feed the other bots with URLs. Next, other clients in the botnet continued to spider my blog in parallel, building a list of URLs to try later. Lastly came the first of the attempts to post a comment.
If you examine the sequence of requests, the bots are posting a comment, then coming back to check if it was successful. Analysis of later attacks even found other bots in the group checking if the comment posted by a peer bot was successful. The participating hosts are located all over the world but the majority are in North America and Asia.
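The post-then-check pattern can be picked out of a log like the one above mechanically. The Python sketch below assumes the log format shown (date, time, IP, method, path, separated by spaces), assumes the dates are day/month/year, and treats the masked address as the client identifier; the function names and the 10-second gap are made up for the example.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split one log line into (timestamp, ip, method, path)."""
    date, time, ip, method, path = line.split(maxsplit=4)
    # Assumption: the log uses day/month/year.
    stamp = datetime.strptime(f"{date} {time}", "%d/%m/%Y %H:%M:%S")
    return stamp, ip, method, path


def post_then_check(lines, max_gap=timedelta(seconds=10)):
    """Return (ip, gap_seconds) for each POST followed by a GET from the same IP."""
    last_post = {}  # ip -> time of its most recent unmatched POST
    hits = []
    for line in lines:
        stamp, ip, method, _path = parse_line(line)
        if method == "POST":
            last_post[ip] = stamp
        elif method == "GET" and ip in last_post:
            gap = stamp - last_post.pop(ip)
            if gap <= max_gap:
                hits.append((ip, int(gap.total_seconds())))
    return hits


SAMPLE = [
    "03/02/2006 20:41:18 70.88.XXX.XXX POST /wp-comments-post.php",
    "03/02/2006 20:41:20 70.88.XXX.XXX GET /category/thoughts/",
    "03/02/2006 20:43:04 205.180.XXX.XXX POST /wp-comments-post.php",
    "03/02/2006 20:43:22 82.76.XXX.XXX GET /keywords/",
]
```

Run over `SAMPLE`, `post_then_check` reports only the 70.88.XXX.XXX client, which came back two seconds after posting. It would not catch the later variation where a different bot in the group does the checking.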
This obviously demonstrates a very high level of sophistication. Initially I presumed that there was a single client application running requests in parallel over a group of HTTP proxies. After tracing down the locations & owners of each of the participants in the attacks I concluded it was infeasible that they all happened to have open proxies being abused in this way. A large proportion of the machines being used are actually web servers which have probably been exploited and are running IRC-controlled Trojan software.
Backing this up is the pace at which these attacks are evolving. The first few were very primitive, without even an HTTP “User-Agent” header; however, this was very quickly amended. The most recent attack I observed (1st March 2006) showed even more improvements: each client was almost indistinguishable from a normal visitor, providing full ‘Internet Explorer’-like headers of accepted MIME types, charsets and languages, and even including valid HTTP “Referer” headers and cookies.
Thankfully, all their time seems to be invested in improving the client software; the actual content of the comment was practically identical.
My SpamKit Plugin has so far easily handled each of these situations. It uses Gerry’s “Time Based Tokens”, which are auto-generated and written into a hidden form field. Any incoming comment without a token, or with an invalid token, can be held for moderation, with zero impact on real visitors writing comments. Unlike other solutions, it does not require the user to type in a random key from an image (the ‘captcha’ technique), nor does it rely on JavaScript support in the browser. Until these spam bots reach a level of sophistication where they parse out HTML forms, including hidden values, and post them back, the current version of SpamKit will still be an effective solution.
However there is one major drawback with SpamKit: pingbacks/trackbacks are machine-generated, so they will not have a “Time Based Token” and will be held for moderation as if they were spam. The problem with this is that spammers are also increasingly using the pingback/trackback mechanism to get their comments through the net. A lot of thought and discussion on this subject with Gerry led to one potential solution: scoring & validation of the URL the pingback/trackback is supposedly from.
In early examples of trackback spam the URL given pointed straight to some advertising-based web page. Something like this lends itself to easy detection and filtering, as the content when examined would score highly for spam keywords like ‘Viagra’. However these attacks have also evolved; the most recent point to real web pages or blogs that contain obfuscated JavaScript redirection code, redirecting real visitors’ browsers but avoiding any page content detection techniques. In some cases the code has been inserted into bulletin boards or guestbooks which allow unfiltered HTML.
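To give a feel for the idea, here is a minimal Python sketch of a time-based token. It is not Gerry's implementation or SpamKit's code: the HMAC construction, the secret key, the one-hour lifetime and the function names are all assumptions made for the example.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"   # hypothetical server-side secret
MAX_AGE_SECONDS = 3600      # hypothetical token lifetime


def _sign(issued):
    """Sign the issue time so a client cannot forge or alter it."""
    return hmac.new(SECRET_KEY, str(issued).encode(), hashlib.sha256).hexdigest()


def make_token(now=None):
    """Build the value to write into the hidden form field."""
    issued = int(time.time() if now is None else now)
    return f"{issued}:{_sign(issued)}"


def is_valid_token(token, now=None):
    """Accept only a correctly signed token that is not older than MAX_AGE_SECONDS."""
    current = time.time() if now is None else now
    try:
        issued_text, signature = token.split(":", 1)
        issued = int(issued_text)
    except (AttributeError, ValueError):
        return False  # missing or malformed token
    if not hmac.compare_digest(signature.encode(), _sign(issued).encode()):
        return False  # signature does not match
    return 0 <= current - issued <= MAX_AGE_SECONDS
```

A comment handler would call `make_token()` when rendering the form and `is_valid_token()` on submission, holding the comment for moderation when the check fails. A bot that fetches the form and echoes the hidden field back would pass, which is the limitation described above.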
An example page with obfuscated JavaScript redirection (warning, this will redirect you to mp3search.ru): http://zigfrid.blog.kataweb.it/il_mio_weblog/
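Scoring the page a trackback claims to come from might look something like the Python sketch below. It works on HTML already fetched into a string. Everything in it is an assumption for illustration: the function name, the keyword list (only ‘viagra’ comes from this post) and especially the list of script fragments, which is a crude guess at what redirection code tends to contain and will not see through serious obfuscation.

```python
import re

# Only 'viagra' is taken from this post; a real list would be much longer.
SPAM_KEYWORDS = ("viagra",)

# Hypothetical fragments that often show up in script-based redirects.
REDIRECT_HINTS = (
    "window.location",
    "document.location",
    "location.href",
    "location.replace",
    "unescape(",
    "eval(",
)

SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def score_source_page(html):
    """Count spam keywords in the visible text and list redirect hints in scripts."""
    scripts = " ".join(SCRIPT_RE.findall(html)).lower()
    # Remove script blocks first, then the remaining tags, to get the visible text.
    visible = TAG_RE.sub(" ", SCRIPT_RE.sub(" ", html)).lower()
    keyword_hits = sum(visible.count(word) for word in SPAM_KEYWORDS)
    redirect_hints = [hint for hint in REDIRECT_HINTS if hint in scripts]
    return {"keyword_hits": keyword_hits, "redirect_hints": redirect_hints}
```

The keyword count covers the early, blatant pages; the hint list is an attempt to at least notice the newer ones, where the visible content is clean and the redirect hides in a script.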
So, what measures can be taken to stop spam?
Personally I don’t think you will ever get rid of spam. You have a pretty good chance of eradicating all but the most sophisticated of spammers, but you’ll never stop 100% of spam. The best methodology is to constantly evolve your defences at the same rate as, or faster than, the opposition. For starters, Gerry & I are constantly dreaming up new ways we can enhance SpamKit. Recent updates include encoding the original source IP address in the “Time Based Token”, which becomes invalid if submitted from a different address. Other works in progress include hardcore validation of the email address submitted (does the domain exist? does it have a mail exchanger (MX) record?), content validation, keyword searching and probabilities of the content being spam. Progress will be reported here and on Gerry’s site.
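Binding a token to the requesting address can be sketched by mixing the IP into whatever gets signed. The Python below is an illustration under assumptions, not the actual SpamKit update: the key, the `issued|ip` message layout and the function names are invented for the example, and the age check is left out to keep it short.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret


def _sign(issued, ip):
    """Sign the issue time together with the address the form was served to."""
    message = f"{issued}|{ip}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_bound_token(issued, ip):
    """Token for the hidden form field, tied to the requesting IP."""
    return f"{issued}:{_sign(issued, ip)}"


def matches_sender(token, ip):
    """True only if the token was issued to the address now submitting it."""
    issued_text, _, signature = token.partition(":")
    try:
        issued = int(issued_text)
    except ValueError:
        return False  # malformed token
    return hmac.compare_digest(signature.encode(), _sign(issued, ip).encode())
```

A token harvested by one machine in a botnet and posted from another then fails the check, at the cost of also rejecting real visitors whose address changes between loading the form and submitting it.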
In the long term spammers are going to have clients that pretty much replicate real users down to the delays & randomness between requests. Countermeasures are going to have to be just as sophisticated, evaluating content and even executing JavaScript as if they were also real clients.