How to fight R*ferer Spam
Please note that this is a very bad idea because it causes heavy traffic!
1. Let's handle friendly R*ferers
What happens when a weblog entry r*fers to another weblog entry?
Imagine two weblogs.
First, we write something interesting in "our" weblog.
The entry gets a permanent link: http://our.domain/our_weblog/15.
Then, a "friendly" weblog writes about our writing.
It also gets a permanent link: http://friendly.domain/friendly_weblog/210.
Now, a visitor of the friendly weblog follows the link that points to our weblog. This triggers an HTTP GET request which may look like this.
Hello http://our.domain I want your site /our_weblog/15 I'm coming from http://friendly.domain/friendly_weblog/210
On our site /our_weblog/15 we add
http://friendly.domain/friendly_weblog/210
to our list of recent r*ferers. Everything is fine.
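The "Hello ... I want ..." lines are shorthand. On the wire, the same request is a request line plus a few headers, and the address the visitor came from travels in the r*ferer header. Here is a minimal sketch of reading both values from a raw request. parseRequest() is a hypothetical helper written for this illustration; inside a PHP page the same values are already available as $_SERVER['REQUEST_URI'] and $_SERVER['HTTP_REFERER'].

```php
<?php
// Hypothetical helper (not part of the original script): pull the requested
// path and the Referer header out of a raw HTTP request.
function parseRequest(string $rawRequest): array
{
    $lines = preg_split('/\r\n|\n/', trim($rawRequest));

    // The request line looks like "GET /our_weblog/15 HTTP/1.1".
    $parts = explode(' ', $lines[0]);
    $result = ['path' => $parts[1] ?? null, 'referer' => null];

    foreach (array_slice($lines, 1) as $line) {
        // Header names are case-insensitive, so compare without case.
        if (stripos($line, 'Referer:') === 0) {
            $result['referer'] = trim(substr($line, strlen('Referer:')));
        }
    }
    return $result;
}

$raw = "GET /our_weblog/15 HTTP/1.1\r\n" .
       "Host: our.domain\r\n" .
       "Referer: http://friendly.domain/friendly_weblog/210\r\n";

$request = parseRequest($raw);
// $request['path'] is the entry the visitor wants,
// $request['referer'] is the page the visitor says they came from.
```

Note that the browser fills in this header, so the value is only a claim. That is exactly what the spammers in section 2 exploit.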
Sometimes, if the link is still on the main site of the friendly weblog, the request may look like this.
Hello http://our.domain I want your site /our_weblog/15 I'm coming from http://friendly.domain/
We don't know exactly which entry of the friendly weblog the link came from.
We could add http://friendly.domain/
to our recent r*ferers at /our_weblog/15
but this doesn't make much sense. Anyway, it's a valid r*ferer,
so let's add it.
Let's assume the friendly weblog didn't use our permanent link.
Hello http://our.domain I want your site / I'm coming from http://friendly.domain/friendly_weblog/210
We know where it comes from (http://friendly.domain/friendly_weblog/210).
But we don't know what it wants on our site
(it should be /our_weblog/15, but it is just /).
We are not interested in r*ferers to our main site (/).
So let's ignore this.
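The three cases above boil down to one small rule: record a r*ferer only when one was sent and the request is for something other than our main site (/). A minimal sketch of that rule follows. shouldRecordReferer() is a hypothetical name, and treating a missing r*ferer as "nothing to record" is an assumption the text does not spell out.

```php
<?php
// Hypothetical helper: decide whether a request's referer is worth recording.
function shouldRecordReferer(string $requestUri, ?string $referer): bool
{
    // Assumption: no referer sent (typed URL, bookmark) means nothing to record.
    if ($referer === null || $referer === '') {
        return false;
    }

    // Look only at the path, so "/?page=2" still counts as the main site.
    $path = parse_url($requestUri, PHP_URL_PATH);

    // Referers to the main site (/) are ignored, as described above.
    if (!is_string($path) || $path === '' || $path === '/') {
        return false;
    }
    return true;
}

// Case 1: permanent link to permanent link, record it.
$case1 = shouldRecordReferer('/our_weblog/15', 'http://friendly.domain/friendly_weblog/210');
// Case 2: link from the friendly main site, still a valid referer.
$case2 = shouldRecordReferer('/our_weblog/15', 'http://friendly.domain/');
// Case 3: link to our main site, ignore it.
$case3 = shouldRecordReferer('/', 'http://friendly.domain/friendly_weblog/210');
```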
2. Fight R*ferer Spam
Here's what r*ferer spam usually looks like.
Hello http://our.domain I want your site / I'm coming from http://spam.domain/spam.html
Because we are not interested in r*ferers to our main site (/)
we could ignore this.
Let's assume they are a bit more clever.
Hello http://our.domain I want your site /our_weblog/15 I'm coming from http://spam.domain/spam.html
That's the tricky part. How does this differ from a real r*ferer? At first glance, not at all. The HTTP GET request looks exactly the same as the one from the friendly weblog.
The difference is:
http://friendly.domain/friendly_weblog/210
contains a link to our weblog entry
(for example <a href="http://our.domain/our_weblog/15">),
http://spam.domain/spam.html does not.
If it does contain a link and it is spam, who cares?
That's OK because the spammer really links to us. More PageRank for us. ;-)
Of course, we may use a black list for this. But that can't be automated. The automated part may look like this.
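$referer = $_SERVER['HTTP_REFERER'] ?? '';
if ($referer !== '' &&
    ! isAlreadyInOurRecentReferersList($referer) &&
    ! isInBlackList($referer))
{
    // Fetch the referring page (this is the expensive network request).
    $data = file_get_contents($referer);
    // Does the page mention the URL of the entry being requested?
    if ($data !== false &&
        strpos($data, $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']) !== false)
    {
        addToOurRecentReferersList($referer);
    }
    else
    {
        addToOurBlackList($referer);
    }
}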
It's a good idea to add some error checking around file_get_contents()
because it can fail, or to use fsockopen() instead.
But that's not what I want to talk about.
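Still, for anyone who wants a starting point, here is one way the error checking could look. It is a sketch under assumptions, not a hardened fetcher: fetchPage() and isHttpUrl() are hypothetical helpers, and the timeout, redirect limit, size limit and user agent string are arbitrary example values.

```php
<?php
// Hypothetical helper: fetch a page with a timeout and a size limit.
// Returns null when the fetch fails for any reason.
function fetchPage(string $url, int $timeoutSeconds = 5, int $maxBytes = 200000): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'timeout'       => $timeoutSeconds,            // give up on slow servers
            'max_redirects' => 3,                          // do not follow long redirect chains
            'user_agent'    => 'our-weblog-referer-check', // assumed name
        ],
    ]);

    // @ suppresses the PHP warning; failure is reported through the return value.
    $data = @file_get_contents($url, false, $context, 0, $maxBytes);
    return $data === false ? null : $data;
}

// Hypothetical helper: the referer is attacker-controlled input, so only
// plain http(s) URLs should ever be handed to fetchPage().
function isHttpUrl(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https'], true);
}
```

The scheme check matters because file_get_contents() also opens local files and other stream wrappers, and the r*ferer value comes straight from the visitor.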
Unfortunately the spammers found another way to spam our r*ferers.
In fact, http://spam.domain/spam.html
may contain a link to our site, but not
<a href="http://our.domain/our_weblog/15">,
it's
<img src="http://our.domain/our_weblog/15">
instead.
No problem. Let's improve our script a bit.
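...
// Match only real <a ... href="..."> links to the requested entry,
// ignoring case and allowing spaces and other attributes.
$pattern =
    '{<a\b[^>]*\shref\s*=[\s"\']*http://' .
    preg_quote($_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']) .
    '}is';
if (preg_match($pattern, $data))
...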
This also detects <A TARGET = "_blank" HREF = "http://our.domain/our_weblog/15">
for example.
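To try the pattern without a live request, it can be wrapped in a small function. containsLinkTo() is a hypothetical name, and $host and $uri stand in for $_SERVER['HTTP_HOST'] and $_SERVER['REQUEST_URI']. The pattern itself is unchanged.

```php
<?php
// Hypothetical wrapper around the pattern from the script above.
function containsLinkTo(string $html, string $host, string $uri): bool
{
    $pattern =
        '{<a\b[^>]*\shref\s*=[\s"\']*http://' .
        preg_quote($host . $uri) .
        '}is';
    return preg_match($pattern, $html) === 1;
}

$host = 'our.domain';
$uri  = '/our_weblog/15';

$realLink  = '<a href="http://our.domain/our_weblog/15">nice entry</a>';
$upperCase = '<A TARGET = "_blank" HREF = "http://our.domain/our_weblog/15">';
$imageSpam = '<img src="http://our.domain/our_weblog/15">';

var_dump(containsLinkTo($realLink, $host, $uri));  // bool(true)
var_dump(containsLinkTo($upperCase, $host, $uri)); // bool(true)
var_dump(containsLinkTo($imageSpam, $host, $uri)); // bool(false)
```

Two limits are worth knowing. The pattern matches a prefix, so a link to /our_weblog/150 also counts as a link to /our_weblog/15, and a link written with https:// is not detected because the pattern looks for http:// only.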
Conclusion:
If the r*ferer site (http://spam.domain/spam.html)
does not contain a real link to our
site, it's spam and it's added to our blacklist.
If http://spam.domain/spam.html
hits our r*ferers again, it's already known as spam.
Because of this, no second network connection is needed.
It's simply ignored.
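The whole flow, including the "no second connection" part, can be sketched with in-memory lists. Everything here is an assumption made for illustration: the RefererChecker class, its status strings, and the two callables that replace the real page fetch and link check so the sketch runs without network access. A real weblog would store both lists in a file or database.

```php
<?php
// Sketch of the check-once flow with in-memory lists.
class RefererChecker
{
    public array $recent = [];     // verified referers
    public array $blackList = [];  // referers known as spam
    public int $fetchCount = 0;    // how many network fetches were needed
    private $fetcher;
    private $linkCheck;

    public function __construct(callable $fetcher, callable $linkCheck)
    {
        $this->fetcher = $fetcher;      // url => page content or null
        $this->linkCheck = $linkCheck;  // page content => bool
    }

    public function handle(string $referer): string
    {
        if (isset($this->recent[$referer])) {
            return 'known referer';
        }
        if (isset($this->blackList[$referer])) {
            return 'known spam';        // ignored, no fetch
        }

        $this->fetchCount++;
        $data = ($this->fetcher)($referer);
        if ($data !== null && ($this->linkCheck)($data)) {
            $this->recent[$referer] = true;
            return 'added';
        }
        $this->blackList[$referer] = true;
        return 'blacklisted';
    }
}

// Fake "web" so the example needs no network.
$pages = [
    'http://friendly.domain/friendly_weblog/210' => '<a href="http://our.domain/our_weblog/15">',
    'http://spam.domain/spam.html'               => '<p>buy stuff</p>',
];
$checker = new RefererChecker(
    fn(string $url) => $pages[$url] ?? null,
    fn(string $html) => stripos($html, 'href="http://our.domain/our_weblog/15"') !== false
);

$first  = $checker->handle('http://spam.domain/spam.html'); // fetched and blacklisted
$second = $checker->handle('http://spam.domain/spam.html'); // answered from the black list
```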
3. What about unintended R*ferer Spam?
What happens when a browser has a bug and sends incorrect r*ferers?
Hello http://our.domain I want your site /our_weblog/15 I'm coming from http://unintended.domain/unintended.html
This looks like spam.
In fact, it is spam.
It's not intended, of course, but it's not a real r*ferer either.
We are trying to open
http://unintended.domain/unintended.html
to find a link like
<a href="http://our.domain/our_weblog/15">,
but there is nothing.
Spam. Ignored.
This is also added to our black list. Sad, but true. There is no way to distinguish unintended from real r*ferer spam. Check your black list from time to time. The only way to automate this a bit is to use a white list. This could be generated from our recent r*ferers because they have already been checked and verified.
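One possible reading of the white list idea: collect the hosts of all r*ferers that already passed the link check, and when a later r*ferer from one of those hosts fails the check, drop it quietly instead of blacklisting it. That policy is an assumption, not something spelled out above, and buildWhiteList() and isWhiteListed() are hypothetical helpers.

```php
<?php
// Hypothetical helper: derive a white list of hosts from referers that
// already passed the link check.
function buildWhiteList(array $recentReferers): array
{
    $hosts = [];
    foreach ($recentReferers as $referer) {
        $host = parse_url($referer, PHP_URL_HOST);
        if (is_string($host) && $host !== '') {
            $hosts[strtolower($host)] = true; // array key removes duplicates
        }
    }
    return array_keys($hosts);
}

// Hypothetical helper: is this referer's host on the white list?
function isWhiteListed(string $referer, array $whiteList): bool
{
    $host = parse_url($referer, PHP_URL_HOST);
    return is_string($host) && in_array(strtolower($host), $whiteList, true);
}

$whiteList = buildWhiteList([
    'http://friendly.domain/friendly_weblog/210',
    'http://friendly.domain/',
]);

// A failed check from a white-listed host would be skipped, not blacklisted.
$skip = isWhiteListed('http://friendly.domain/friendly_weblog/211', $whiteList);
```

Matching on the host alone is a trade-off: it is forgiving toward friendly weblogs, but a spammer on a shared host would benefit too, so checking the black list by hand from time to time is still needed.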
Now, go and implement this.