Thiemo Mättig

How to fight R*ferer Spam

Please note that this is a very bad idea: fetching every new r*ferer page causes heavy traffic!

1. Let's handle friendly R*ferers

What happens when one weblog entry r*fers to another? Imagine two weblogs. First, we write something interesting in "our" weblog. The entry gets a permanent link: http://our.domain/our_weblog/15. Then a "friendly" weblog writes about our entry. That entry also gets a permanent link: http://friendly.domain/friendly_weblog/210.

Now a visitor of the friendly weblog follows the link that points to our weblog. This triggers an HTTP GET request which may look like this.

Hello http://our.domain
I want your page /our_weblog/15
I'm coming from http://friendly.domain/friendly_weblog/210

On our page /our_weblog/15 we add http://friendly.domain/friendly_weblog/210 to our list of recent r*ferers. Everything is fine.
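In PHP, the three parts of such a request end up in the $_SERVER array. The following is only a small sketch to show the mapping. The function name describeRequest is made up for this example. Note that the real header and the array key are spelled without the asterisk.

```php
<?php
// Hypothetical helper: pulls the three parts of the request described
// above out of a $_SERVER-like array.
function describeRequest(array $server): array
{
    return [
        'host'    => $server['HTTP_HOST'] ?? '',       // "Hello http://our.domain"
        'path'    => $server['REQUEST_URI'] ?? '/',    // "I want your ..."
        'referer' => $server['HTTP_REFERER'] ?? null,  // "I'm coming from ..."
    ];
}

// Example values taken from the request shown above.
$request = describeRequest([
    'HTTP_HOST'    => 'our.domain',
    'REQUEST_URI'  => '/our_weblog/15',
    'HTTP_REFERER' => 'http://friendly.domain/friendly_weblog/210',
]);
```

In a real script we would pass $_SERVER itself. The r*ferer is optional, so the sketch returns null when the browser didn't send one.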

Sometimes, if the link is still on the main page of the friendly weblog, the request may look like this.

Hello http://our.domain
I want your page /our_weblog/15
I'm coming from http://friendly.domain/

We don't know exactly which entry the r*ferer came from. Adding http://friendly.domain/ to our recent r*ferers at /our_weblog/15 doesn't tell our readers much. Still, it's a valid r*ferer, so let's add it.

Let's assume the friendly weblog didn't use our permanent link.

Hello http://our.domain
I want your page /
I'm coming from http://friendly.domain/friendly_weblog/210

We know where it comes from (http://friendly.domain/friendly_weblog/210). But we don't know what it wants on our site (it should be /our_weblog/15, but it is just /). We are not interested in r*ferers to our main page (/). So let's ignore this.
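The three cases above boil down to one simple rule that runs before any spam check. Here is a minimal sketch of it. The function name isWorthRecording is an assumption of mine, not part of any existing script.

```php
<?php
// Hypothetical helper: decides whether a r*ferer is worth recording at all.
function isWorthRecording(string $requestUri, ?string $referer): bool
{
    // No r*ferer sent: nothing to record.
    if ($referer === null || $referer === '') {
        return false;
    }
    // Requests for the main page (/) are ignored, whoever sent them.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path) || $path === '' || $path === '/') {
        return false;
    }
    // Anything else is a candidate, even if the r*ferer is only
    // the main page of the other weblog.
    return true;
}
```

With the examples above, the first two requests are recorded and the third one is ignored.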

2. Fight R*ferer Spam

Here's what r*ferer spam usually looks like.

Hello http://our.domain
I want your page /
I'm coming from http://spam.domain/spam.html

Because we are not interested in r*ferers to our main page (/), we can ignore this.

Let's assume they are a bit more clever.

Hello http://our.domain
I want your page /our_weblog/15
I'm coming from http://spam.domain/spam.html

That's the tricky part. What's the difference from a real r*ferer? At first glance, nothing. The HTTP GET request looks exactly the same as the one from the friendly weblog.

The difference is: http://friendly.domain/friendly_weblog/210 contains a link to our weblog entry (for example <a href="http://our.domain/our_weblog/15">), http://spam.domain/spam.html does not. If it does contain a link and it is spam, who cares? That's OK because the spammer really links to us. More PageRank for us. ;-)

Of course, we could maintain a black list by hand for this. But that isn't automated. An automated version may look like this.

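// The real header name is spelled without the asterisk.
// The header is optional, so it may be missing.
$referer = $_SERVER['HTTP_REFERER'] ?? '';
// Only fetch real web addresses, never local paths.
if (preg_match('{^https?://}i', $referer) &&
    ! isAlreadyInOurRecentReferersList($referer) &&
    ! isInBlackList($referer))
{
    // Fetch the referring page; this returns false on failure.
    $data = @file_get_contents($referer);
    $target = $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    if ($data !== false && strpos($data, $target) !== false)
    {
        addToOurRecentReferersList($referer);
    }
    else
    {
        addToOurBlackList($referer);
    }
}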

It's a good idea to add proper error checking around file_get_contents(), because the request could fail or hang, or to use fsockopen() instead. But that's not what I want to talk about.
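For those who want it anyway, here is one way such a more careful fetch might look. It's a sketch under my own assumptions: the function name fetchRefererPage, the five second timeout, and the size limit are arbitrary choices, not tested recommendations.

```php
<?php
// Hypothetical helper: fetches a page with a timeout and a size limit.
// Returns the page body, or null if the URL is not http(s) or the fetch fails.
function fetchRefererPage(string $url, int $timeoutSeconds = 5, int $maxBytes = 200000): ?string
{
    // Only fetch real web addresses; never local files or other wrappers.
    if (!preg_match('{^https?://}i', $url)) {
        return null;
    }
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds, 'max_redirects' => 3],
    ]);
    // Arguments: URL, use_include_path, context, offset, maximum length.
    $data = @file_get_contents($url, false, $context, 0, $maxBytes);
    return $data === false ? null : $data;
}
```

The scheme check matters because the r*ferer is sent by the client. Without it, a request could make our script read a local file instead of a web page.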

Unfortunately, the spammers found another way to spam our r*ferers. http://spam.domain/spam.html may indeed contain our address, but not as <a href="http://our.domain/our_weblog/15">. It's <img src="http://our.domain/our_weblog/15"> instead, so a simple substring search is fooled. No problem. Let's improve our script a bit.

    ...
    // Match only real anchors: <a ... href="http(s)://our.domain/our_weblog/15
    $pattern =
        '{<a\b[^>]*\shref\s*=[\s"\']*https?://' .
        preg_quote($_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']) .
        '}is';
    if (preg_match($pattern, $data))
    ...

This also detects <A TARGET = "_blank" HREF = "http://our.domain/our_weblog/15"> for example.
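To try the pattern without a web server, it can be wrapped in a small function. This is just a sketch: the name containsLinkTo is made up, and the https?:// part is my addition, so that links using https are accepted as well.

```php
<?php
// Hypothetical helper wrapping the pattern: true only if $html contains
// an <a href> pointing at the given host and path.
function containsLinkTo(string $html, string $host, string $requestUri): bool
{
    $pattern =
        '{<a\b[^>]*\shref\s*=[\s"\']*https?://' .
        preg_quote($host . $requestUri) .
        '}is';
    return preg_match($pattern, $html) === 1;
}

// The three examples from the text.
$friendly = '<a href="http://our.domain/our_weblog/15">nice entry</a>';
$image    = '<img src="http://our.domain/our_weblog/15">';
$upper    = '<A TARGET = "_blank" HREF = "http://our.domain/our_weblog/15">';
```

Called with 'our.domain' and '/our_weblog/15', the function accepts $friendly and $upper and rejects $image.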

Conclusion: If the r*ferer page (http://spam.domain/spam.html) does not contain a real link to our page, it's spam and it's added to our black list. If http://spam.domain/spam.html hits our r*ferers again, it's already known as spam. No second network connection is needed. It's simply ignored.
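The whole flow can be put together in one place. The sketch below keeps both lists in plain arrays and takes the fetching function as a parameter, so it can be tried without any network access. All names in it (checkReferer, $fakeFetch and so on) are assumptions for this illustration.

```php
<?php
// Hypothetical, in-memory version of the whole check. $fetch is any callable
// that returns the page body as a string, or null on failure.
function checkReferer(string $referer, string $target, array &$recent, array &$blackList, callable $fetch): string
{
    if (in_array($referer, $recent, true)) {
        return 'known';            // already verified, nothing to do
    }
    if (in_array($referer, $blackList, true)) {
        return 'ignored';          // already known as spam, no network access
    }
    $html = $fetch($referer);
    $pattern = '{<a\b[^>]*\shref\s*=[\s"\']*https?://' . preg_quote($target) . '}is';
    if (is_string($html) && preg_match($pattern, $html) === 1) {
        $recent[] = $referer;
        return 'added';
    }
    $blackList[] = $referer;
    return 'blacklisted';
}

// Usage with a fake fetcher that counts its calls instead of using the network.
$calls = 0;
$fakeFetch = function (string $url) use (&$calls): ?string {
    $calls++;
    return '<img src="http://our.domain/our_weblog/15">';
};
$recent = [];
$blackList = [];
$spam   = 'http://spam.domain/spam.html';
$target = 'our.domain/our_weblog/15';
$first  = checkReferer($spam, $target, $recent, $blackList, $fakeFetch);
$second = checkReferer($spam, $target, $recent, $blackList, $fakeFetch);
```

The first call fetches the page and black-lists it. The second call returns before the fetcher is touched, so $calls stays at 1.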

3. What about unintended R*ferer Spam?

What happens when a browser has a bug and sends incorrect r*ferers?

Hello http://our.domain
I want your page /our_weblog/15
I'm coming from http://unintended.domain/unintended.html

This looks like spam. In fact, it is spam. It's not intended, of course, but it's not a real r*ferer. We try to open http://unintended.domain/unintended.html to find a link like <a href="http://our.domain/our_weblog/15">, but there is none.

Spam. Ignored.

This is also added to our black list. Sad, but true. There is no way to distinguish unintended from real r*ferer spam. Check your black list from time to time. The only way to automate this a bit is to use a white list. This could be generated from our recent r*ferers because they have been checked and verified.
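How such a white list is built and used is left open here. One possible reading, sketched below, is to collect the host names of all verified r*ferers and to skip black-listing when a failed check comes from one of those hosts. The function names buildWhiteList and isWhiteListed are hypothetical.

```php
<?php
// Hypothetical helper: builds a white list of host names from r*ferers
// that have already passed the link check.
function buildWhiteList(array $recentReferers): array
{
    $hosts = [];
    foreach ($recentReferers as $referer) {
        $host = parse_url($referer, PHP_URL_HOST);
        if (is_string($host) && $host !== '') {
            $hosts[strtolower($host)] = true;  // array keys remove duplicates
        }
    }
    return array_keys($hosts);
}

// Hypothetical helper: true if the r*ferer's host is on the white list.
function isWhiteListed(string $referer, array $whiteList): bool
{
    $host = parse_url($referer, PHP_URL_HOST);
    return is_string($host) && in_array(strtolower($host), $whiteList, true);
}

$whiteList = buildWhiteList([
    'http://friendly.domain/friendly_weblog/210',
    'http://Friendly.domain/',
]);
```

A r*ferer from friendly.domain that fails the link check would then just be dropped instead of landing on the black list. A r*ferer from unintended.domain would still be black-listed, so the manual check from time to time stays necessary.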

Now, go and implement this.