Blacklisting
Posted October 21st, 2004 @ 01:37pm by Erik J. Barzeski
I'm working on my Blacklist today. Lowering the URL limit to 3 seems to have had some good effects, so I'm leaving that alone. Right now I'm consolidating some blacklist strings: replacing four or five "health-insurance-from-us" type domains with the URLPattern "\binsurance\b", and adding patterns like "\bsex\b" and "\blipitor\b".
I have a list of about 1000 entries - what's a good way to find more of these patterns so that I can condense several strings into one entry? I replaced 28 entries with "\bpoker\b" but have basically just been scrolling up and down the list to try to find common words.
Surely there's some software that can analyze a list and present some choices, no?
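Lacking that, even a few lines of Python would do the basic counting. A rough sketch, assuming the blacklist can be exported as plain text with one entry per line (the filename blacklist.txt is made up):

    import re
    from collections import Counter

    # Count how often each word appears across all blacklist entries.
    # Entries like "health-insurance-from-us.com" are split on non-letters.
    counts = Counter()
    with open("blacklist.txt") as f:  # hypothetical filename
        for line in f:
            for word in re.split(r"[^a-z]+", line.strip().lower()):
                if len(word) > 2:  # fragments like "com" and "www" will surface too; skim past them
                    counts[word] += 1

    # Words shared by many entries are candidates for a single
    # \bword\b URLPattern that replaces them all.
    for word, n in counts.most_common(30):
        print(f"{n:4d}  {word}")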
Update: I'm down to about 100 items, having simply deleted any URL that had been hit fewer than ten times. I'm sure this will result in a slight spike in comment spam in the coming month, but my three-URL limit may help as well.
The top spammy domain: us.com with over 1500 comment spam attempts.
Posted 21 Oct 2004 at 2:39pm #
MT-Blacklist maintains a list of comment spam keywords at
http://www.jayallen.org/comment_spam/blacklist.txt
It's generally updated several times a week, but the software's author just took a job at Six Apart, so he's in the middle of moving and the list isn't as up to date as usual. I presume there will be even better integration of MT-Blacklist and MT in the near future.
Posted 21 Oct 2004 at 3:17pm #
Ken, thanks, I know. That's not really what I seek, though.
Posted 21 Oct 2004 at 4:23pm #
Please don't blacklist "sex." My domain is TooMuchSexy.org, and all...
Posted 21 Oct 2004 at 6:59pm #
Have you considered leveraging DNSBL in your efforts? While it's intended for mail, the kind of host that's spamming comments here is usually also sending out email spam, and DNSBL is a quick, simple protocol (really, really simple) that would be easy to integrate as a plugin.
If you do leverage it, I recommend DJB's dnscache to make things much nicer for the rest of us with regard to speed. Of course, if you're already running a locally caching BIND, you won't need it.
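For reference, a DNSBL query is just an ordinary DNS A lookup: reverse the octets of the commenter's IP and append the list's zone. A minimal sketch in Python, using sbl-xbl.spamhaus.org purely as an example zone (check any list's usage policy before pointing real traffic at it):

    import socket

    def dnsbl_listed(ip, zone="sbl-xbl.spamhaus.org"):
        # Reverse the octets and query under the zone:
        # 1.2.3.4 -> 4.3.2.1.sbl-xbl.spamhaus.org
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)  # resolves (usually to 127.0.0.x) if listed
            return True
        except socket.gaierror:          # NXDOMAIN: not listed
            return False

    # 127.0.0.2 is the conventional "always listed" test address.
    if dnsbl_listed("127.0.0.2"):
        print("listed -- hold the comment for moderation")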
Posted 22 Oct 2004 at 10:34am #
Etan, I didn't blacklist "sex." I blacklisted "\bsex\b" - which is different.
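(The \b anchors are regex word boundaries, so the pattern matches "sex" only as a whole word. A quick illustration in Python; the first domain is invented:)

    import re

    pattern = re.compile(r"\bsex\b")

    print(bool(pattern.search("buy-sex-pills.example")))  # True: hyphens mark word boundaries
    print(bool(pattern.search("toomuchsexy.org")))        # False: no boundary around "sex" in "sexy"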
Posted 22 Oct 2004 at 6:43pm #
I blogged about this a couple of days ago. I just up and decided to be very aggressive and outright kill the most common spam words from referring URLs.
I manually went through the list and cherry-picked the terms I felt were most common. Programmatically, I suppose you could scan Jay Allen's list for dictionary words, tally each word's occurrences in a hashmap, and then dump the map sorted by occurrence. I was thinking of doing something like that, but I took the lazy route.
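A rough sketch of that approach, assuming a Unix word list at /usr/share/dict/words and a local copy of the list saved as blacklist.txt (both paths are assumptions):

    import re
    from collections import Counter

    # Load a set of real English words (the path varies by system).
    with open("/usr/share/dict/words") as f:
        dictionary = {w.strip().lower() for w in f if len(w.strip()) > 3}

    # Tally occurrences of dictionary words across the blacklist entries.
    counts = Counter()
    with open("blacklist.txt") as f:  # local copy of the blacklist
        for line in f:
            for word in re.split(r"[^a-z]+", line.lower()):
                if word in dictionary:
                    counts[word] += 1

    # Dump the map sorted by occurrence, most frequent first.
    for word, n in counts.most_common():
        print(n, word)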
I had no false positives from my database of 1,000 or so legit comments, but your weblog is more popular with a more diverse audience so YMMV.