I'm working on my Blacklist today. Lowering the URL limit to 3 seems to have had some good effects, so I am leaving that alone. Right now I'm replacing some blacklist strings: swapping four or five "health-insurance-from-us" type domains for the URLPattern "\binsurance\b", and adding some patterns like "\bsex\b" and "\blipitor\b".
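
For anyone unsure what the "\b" buys you, here is a minimal sketch in Python. It assumes the blacklist's regex engine treats "\b" the way Python's `re` module does (a boundary between a letter, digit, or underscore and anything else). The hostnames and the `is_blocked` helper are made up for illustration and are not part of MT-Blacklist.

```python
import re

# One word-boundary pattern standing in for several full-domain entries.
pattern = re.compile(r"\binsurance\b", re.IGNORECASE)

def is_blocked(url: str) -> bool:
    """Return True if the pattern matches anywhere in the URL (hypothetical helper)."""
    return pattern.search(url) is not None

# Hypothetical hostnames, invented for this example.
hosts = [
    "health-insurance-from-us.example",  # hyphens are non-word characters, so \b matches
    "cheap-insurance-quotes.example",    # same here
    "reinsurancenews.example",           # "insurance" is buried inside a longer word
    "health_insurance.example",          # underscore counts as a word character
]

for host in hosts:
    print(host, is_blocked(host))
```

The first two hosts are caught and the last two are not, which is the whole point of the boundaries: one entry covers the hyphenated spam domains without also hitting every longer word that happens to contain the string.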

I have a list of about 1000 entries - what's a good way to find more of these patterns so that I can condense several strings into one entry? I replaced 28 entries with "\bpoker\b" but have basically just been scrolling up and down the list to try to find common words.

Surely there's some software that can analyze a list and present some choices, no?

Update: I'm down to about 100 items, having simply deleted any URL that had been hit fewer than ten times. I'm sure this will result in a slight spike in comment spam in the coming month, but my three-URL limit may help as well.

The top spammy domain: with over 1500 comment spam attempts.

  1. MT-Blacklist maintains a list of comment spam keywords at

    It's generally updated several times a week, but the software's author just took a job at Six Apart and is moving, so the list is not as up to date as usual. I presume there will be even better integration of MT-Blacklist and MT in the near future.

  2. Ken, thanks, I know. That's not really what I seek, though.

  3. Please don't blacklist "sex." My domain is, and all...

  4. Have you considered leveraging DNSBL in your exploits? While intended for mail, the type that's bound to spam here is also sending out email, and DNSBL is a quick, simple protocol (really, really simple) that could be easy to integrate as a plugin.

    If you do leverage it, I recommend DJB's dnscache to make things much nicer for the rest of us with regards to speed. Of course, if you're already using a locally caching BIND, you won't need it.

  5. Etan, I didn't blacklist "sex." I blacklisted "\bsex\b" - which is different.

  6. I blogged about this a couple of days ago. I just up and decided to be very aggressive and straight up kill the most common spam words from referring URLs.

    I manually went through the list and cherry-picked the terms I felt were most common. I suppose you could do it programmatically: scan Jay Allen's list for dictionary words, dump each word into a hashmap with a word -> occurrence count mapping, then dump the map sorted by occurrence. I was thinking of doing something like that but I took the lazy route.

    I had no false positives from my database of 1,000 or so legit comments, but your weblog is more popular with a more diverse audience so YMMV.
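
The scan suggested in the last comment could look roughly like the sketch below. It is only an approximation under some assumptions: it splits entries on anything that is not a letter instead of checking against a real dictionary, and the function names, the stop list, the thresholds, and the sample entries are all hypothetical.

```python
import re
from collections import Counter

# Tokens that show up in almost every URL and are useless as patterns (assumed list).
STOP = {"http", "https", "www", "com", "net", "org", "info", "biz"}

def common_tokens(entries, min_count=2, min_len=4):
    """Count how many blacklist entries contain each word-like token.

    Returns (token, count) pairs, most frequent first, keeping only tokens
    that appear in at least min_count entries.
    """
    counts = Counter()
    for entry in entries:
        # A set, so a token repeated inside one entry is counted once.
        tokens = set(re.findall(r"[a-z]+", entry.lower()))
        counts.update(t for t in tokens if len(t) >= min_len and t not in STOP)
    return [(t, n) for t, n in counts.most_common() if n >= min_count]

def suggest_pattern(token):
    """Wrap a token in word boundaries, in the style of the patterns in the post."""
    return r"\b" + re.escape(token) + r"\b"

# Made-up entries, for illustration only.
entries = [
    "texas-holdem-poker",
    "poker-rooms-online",
    "online-poker-4u",
    "cheap-lipitor",
]

for token, n in common_tokens(entries):
    print(n, token, suggest_pattern(token))
```

On the sample entries this puts "poker" at the top with three hits. Anything the script suggests still wants a look by eye, and a test against a pile of legit comments, before it goes into the blacklist.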