Simple Detection of Comment Spam in Rails

It’s always nice to get some feedback, or to let users get in touch via a simple contact form. However, it didn’t take long before spammers started hitting those forms too. It was quite interesting to see the kind of messages we started receiving. Most of those submissions read like stories, or snippets from an email to a friend. They didn’t contain any of the expected keywords for fake watches or erectile dysfunction enhancers. Many didn’t even contain any links.

So what were these messages? My personal guess was that they were some kind of reconnaissance attempt. The bots first send innocent messages to various online forms. Then, I imagine, they crawl the site some more to see whether those submissions appear anywhere. If and when they do, the bots hit those forms hard with the real spam content. In any case, this was all speculation that I didn’t really care to prove right or wrong. I just wanted to get rid of this junk. Fast.

Best approach?

The immediate suggestion was, of course, to introduce a captcha. Captchas easily stop bots, and users are quite familiar with them and have learned to accept them as a necessary evil. There must be plenty of gems or plugins for Rails that we could use, so this seemed like a no-brainer. Except for the fact that captchas are not fun. If someone just wants to leave a comment, having to solve a captcha might mean it isn’t worth the effort. It’s just an (annoying) extra step in the way. So how can we block those bots without a captcha?

Existing solutions

Searching online for ways to fight Rails comment spam mostly pointed us towards services like Akismet, Bayesian filters, or one guy who built his own scorecard-based system. These are all fine solutions, but they felt a little too heavy. They would either consume a fair amount of resources filtering content and training the system over time, or rely on external providers, which means adding latency, dealing with failures, etc.

Cookies for comments

Having used WordPress quite extensively, I’m well familiar with comment spam. It’s something you just have to accept if you have a blog. One of the simplest yet most effective WordPress plugins for fighting comment spam is Cookies for Comments. The plugin takes a slightly different approach to detecting bot spam. It simply adds a stylesheet to the page that, when retrieved by the browser, sets a cookie. When a comment is posted, this cookie is checked. If it doesn’t exist, the comment is marked as spam. This pretty much stops all bot spam, which must make up 99.9% of all spam. Neat.
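
To make the idea concrete, here is a rough, framework-free sketch of that kind of cookie check in Ruby. It is not the plugin's code (the plugin is written in PHP). The class name, the cookie name and the plain Hash standing in for a cookie jar are all made up for illustration.

```ruby
require 'securerandom'

# Hypothetical model of a "cookie set by a stylesheet" check.
# A plain Hash stands in for the visitor's cookie jar.
class CookieCheck
  COOKIE_NAME = 'comment_token' # illustrative name, not the plugin's

  def initialize
    @token = SecureRandom.hex(16)
  end

  # Called when the stylesheet is requested. A browser fetches it
  # automatically while rendering the page; simple bots usually don't.
  def serve_stylesheet(cookies)
    cookies[COOKIE_NAME] = @token
    '/* empty stylesheet */'
  end

  # Called when a comment is posted: no valid cookie means spam.
  def spam?(cookies)
    cookies[COOKIE_NAME] != @token
  end
end

check = CookieCheck.new

browser_cookies = {}
check.serve_stylesheet(browser_cookies) # the browser fetched the CSS
bot_cookies = {}                        # the bot posted the form directly

puts check.spam?(browser_cookies) # false
puts check.spam?(bot_cookies)     # true
```

A real implementation would also need to decide how long the token stays valid and how it is rotated. The sketch ignores both.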

I was looking for something similar for Rails, but sadly couldn’t find anything. I was thinking of writing something myself, but I also wanted a quick fix: something we could plug in quickly to get rid of those annoying comments. I am assuming that, since this contact form isn’t linked to any real comments, no human spammer will ever bother to submit anything. Those contact submissions never appear on the site. Bots, however, will try anything. And stopping bots should be easier.

Timestamp for comments

Ok, so this is clearly somewhat inferior to the WordPress plugin, but surprisingly (or not so surprisingly), it worked quite well. At least as a quick fix. The principle is similar, but simplified:

When the user (or bot) requests any page on the site, we set a timestamp inside the session. The timestamp is only set if it doesn’t already exist, i.e. only the first time the user accesses the site. When a form submission is made, in our case on the contact form, we check how long it took the user to submit it. If the form was submitted too fast, we can quite reliably assume it came from a bot. No real user will access the site and fill in the contact form within 5 (or probably even 30) seconds. Checking the log files also showed that most of these spam submissions were indeed made within a few seconds of the first request coming in.

In our application controller I’ve added a very simple before_filter:

    before_filter :anti_spam

    def anti_spam
      session['antispam_timestamp'] ||= Time.now
    end

Then in our contact controller we can check the timestamp:

    class ContactController < ApplicationController
      skip_before_filter :authenticate_user!

      def create
        # Flag the submission as spam if it arrived sooner after the
        # visitor's first request than the configured threshold allows.
        contact_spam = false
        time_to_comment = Time.now - session['antispam_timestamp']
        if time_to_comment < config.antispam_threshold
          # This exact message is what the log watcher later matches on.
          logger.warn("potential spam detected for IP #{request.env['REMOTE_ADDR']}. Antispam threshold not reached (took #{time_to_comment.to_i}s).")
          contact_spam = true
        end

        .
        .
        .
      end
    end

Setting the antispam_threshold to 30 seconds seems reasonable. We might adjust it if we get some false positives, though.
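
For experimenting with different thresholds outside of Rails, the same check can be sketched in plain Ruby. This is only an approximation of the controller logic: a Hash stands in for the session, and the method names and the constant are made up for illustration. It also stores the timestamp as integer epoch seconds rather than a Time object. That is a deliberate change, because some session serializers (JSON, for example) hand a Time back as a String, which would break the subtraction.

```ruby
# Framework-free sketch of the timing check. A Hash stands in for the
# Rails session; all names here are illustrative assumptions.
ANTISPAM_THRESHOLD = 30 # seconds; tune against your own logs

# Record the first visit once, as epoch seconds (serializer-friendly).
def mark_first_visit(session, now = Time.now)
  session['antispam_timestamp'] ||= now.to_i
end

# True when the form came back sooner than the threshold allows.
# A missing timestamp counts as spam: the client never kept a session.
def too_fast?(session, now = Time.now, threshold = ANTISPAM_THRESHOLD)
  first_seen = session['antispam_timestamp']
  return true if first_seen.nil?

  (now.to_i - first_seen) < threshold
end

start = Time.at(1_000_000)
session = {}
mark_first_visit(session, start)
mark_first_visit(session, start + 20) # later requests do not reset it

puts too_fast?(session, start + 2)  # true  (bot-like)
puts too_fast?(session, start + 45) # false (plausibly human)
```

Treating a missing timestamp as spam is a design choice in this sketch: a client that never sends the session cookie back looks brand new on every request, which is how many simple bots behave.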

Handling spam

In our case, we simply ignored the spam message without returning any error. You can choose to show an error message or even an error page instead. However, to make it more fun, we actually took action on this log message, effectively blocking the spammer's IP address completely.

Fail2ban

fail2ban (http://www.fail2ban.org/wiki/index.php/Main_Page) is a fantastic little tool that reads your log files and takes action. Since I was using it anyway, particularly for blocking SSH brute-force attempts, adding a rule to it was very straightforward. In this case, we matched the warning message in the Rails log file and used the IP address to put a block in place for a few minutes:

    failregex = .*potential spam detected for IP <HOST>\. Antispam threshold not reached \(took [0-9]+s\)
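
Since a pattern that doesn't match the log line fails silently, it is worth checking the two against each other. fail2ban ships a fail2ban-regex command for testing a filter against a real log file. As a quick sanity check, the rough Ruby sketch below does something similar in miniature. The pattern in it is written against the log message exactly as the controller above emits it, and HOST_PATTERN is a simplified, IPv4-only stand-in for fail2ban's <HOST> tag, which matches more than that. The method name is made up for illustration.

```ruby
# Sketch: check a fail2ban-style pattern against the controller's log line.
# HOST_PATTERN is a simplified IPv4-only stand-in for fail2ban's <HOST> tag.
HOST_PATTERN = '(?<host>\d{1,3}(?:\.\d{1,3}){3})'

FAILREGEX = '.*potential spam detected for IP <HOST>\. ' \
            'Antispam threshold not reached \(took [0-9]+s\)'

# Expand <HOST> and return the captured address, or nil if nothing matches.
def banned_host(line, failregex = FAILREGEX)
  # Block form of sub inserts the pattern literally, backslashes included.
  pattern = Regexp.new(failregex.sub('<HOST>') { HOST_PATTERN })
  match = pattern.match(line)
  match && match[:host]
end

line = 'potential spam detected for IP 203.0.113.7. ' \
       'Antispam threshold not reached (took 12s).'

puts banned_host(line).inspect                       # "203.0.113.7"
puts banned_host('Completed 200 OK in 12ms').inspect # nil
```

Note the `[0-9]+`: with a 30 second threshold the logged duration can have two digits, so a single `[0-9]` would miss those lines.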

So far, we haven't had any spam getting through, and no sign of any false positives either. Most of the spam submissions we see arrive less than 2 seconds after the first request. It might not be the best solution, and we might enhance it later on (taking the next step to make it work like the WordPress plugin, or in other directions). As a quick and simple fix, it seems to work well, and users don't have to solve any annoying captchas. At least for now.

EDIT: all code snippets on this post are licensed under MIT

Copyright (C) 2012 Yoav Aner

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation the
rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit
persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the
Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

UPDATE: Thanks to Marc Anguera Insa (markets on github) for bringing the invisible captcha gem to my attention. It uses several anti-bot techniques to fight comment spam. I have not tried it myself, so cannot vouch for it, but it definitely looks interesting.

10 Responses to “Simple Detection of Comment Spam in Rails”

  1. Yoni

    Very interesting.
    I guess that if your approach spreads, spammers will learn to get around it too, but until then, it is very clever.
    Another question: why not use 5 seconds, which blocks most of the spam and doesn’t give you false positives?
    Thanks for your article.
    Yoni

  2. Yoav Aner

    Thanks Yoni,

    Spammers can easily find ways around my approach. It’s really not meant to be particularly clever or new to be honest. However, most spammers/bots aren’t that clever either, and just try the ‘shotgun’ approach of posting any form they can find… so at least this approach is useful against that very basic type of spam.

    As for tuning it to, say, 5 seconds, this question is always valid. Why 5 and not 3? Why 3 and not 15? It’s always a balance and there’s no right or wrong value. Currently 30 seconds seems fine and I don’t get any false positives. If I see more false positives I can decrease the value. If I get more spam, perhaps increase it… As with most things, YMMV.

  3. Regis

    Hi!

    It’s a really clever solution. To avoid false positives, would it be possible to add a JavaScript function that shows the “Submit” button only once 30 seconds have passed? (If JavaScript is enabled, of course.)

    Thanks for your article,
    Regis.

  4. Yoav Aner

    Thanks Regis. There are lots of other enhancements that can be added to tackle bot spam (and using javascript is certainly one of them). However in this case it was simply figuring out the simplest solution with a minimal effort that would actually work, and at least so far it seems to work just fine. We tried to check our logs for some false positives, but found none, so there wasn’t much point adding more to it at this stage.

  5. Regis

    Ok, that’s great! :)
    I’ll also share what I know: another easy solution is the “Honeypot”. It’s quite simple. Most of the time, spammers fill in all the fields in a form, regardless of whether they need to be filled in or not…

    So! Just add a Honeypot field, that you can name whatever you want. In the model you can add an “attr_accessor :honeypot” or even put it in “attr_accessible” and add a column in the data’s table to store it.

    Next, when someone submits the form, and before storing it in the database or sending the mail, just test whether this honeypot attribute is blank. If it’s not, then it’s almost certainly a bot posting the form, and you can do whatever you want: don’t store it, store it in a different table, redirect…

    In the view, of course, hide the honeypot field, you can wrap it in a non-visible container or even make it a “hidden_field”, it still works most of the time.

    The Honeypot thing is also quite effective and I’m not the only one to use it (I’ve read it on another blog, don’t remember which one). And most important, it’s totally transparent for users and has 0 false positive! :)

  6. Yoav Aner

    Hi Colin,

    It might be a good addition, but I’m not sure I have the time to allocate to it right now. Feel free to use my code snippets to create a PR. I didn’t explicitly provide a license to those snippets, but I’m happy to MIT license them, so this code can be easily re-used etc.

    Cheers
    Yoav

  7. markets

    Hi Yoav,

    Finally we released a new version of https://github.com/markets/invisible_captcha with timestamps. So now, it provides: honeypots and timestamp.

    Could you please consider adding a link to the gem in the post? I think it would be really useful for future readers. We also added a link back here in the docs.

    Thanks!
