12.22.08

On the time domain, with regard to spam

Posted in Spam at 3:21 pm by Craig

One year of mail on my server

So. There has, off and on, been a debate about whether the day and time at which an email arrives at your system from the outside world can help determine whether the message is spam. The argument has never really been settled, and rules for this kind of identification have generally not been folded into projects such as SpamAssassin. Until now I have been tentatively on the side of those who favor putting such rules in. Now I’m fully convinced that this must be a useful test.

The chart above shows the volume of email traffic in and out of my web server that made it past filtering, plotted over the last 12 months. Notice the regular weekly spikes in traffic; weekdays see a lot more email than weekends. This is sure to surprise nobody. Note that spam, viruses, etc. have been filtered out of this stream; this is only “real” mail and false negatives. Ignore April; I had logging problems that month.

Now.

Malmail detections in email over the previous 12 months

Notice the complete lack of weekly spikes in that data. ["Rejected", by the way, means I rejected the incoming SMTP connection before it even started speaking protocol at me, i.e. based just on the IP address. Almost all of those will be spam sources listed in XBL or SORBS's SOCKS registry.]

So, we have massive swings in real email volume during the week, but no such swings in malmail (spam, viruses, etc.). Therefore, the ratio of malmail to real mail is clearly affected by time of day and day of week. If mail arrives outside of “normal email” hours, it is much more likely to be malmail: the malmail rate stays about the same while the real-mail rate drops. A rule which learns which days/hours are good vs. bad and scores mail accordingly would surely be useful for identifying and filtering out malmail.
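As a rough illustration of what a rule that "learns which days/hours are good vs bad" might look like, here is a minimal Python sketch. It is not a SpamAssassin plugin and does not use any SpamAssassin API; the class name `TimeSlotScorer`, its methods, and the smoothing constant are all hypothetical, and the counts in the usage example are made up purely for demonstration. The idea is simply to count real mail and malmail per (weekday, hour) slot and report how much worse or better than average a slot looks, as a log-odds value.

```python
import math
from collections import defaultdict
from datetime import datetime


class TimeSlotScorer:
    """Hypothetical sketch: learn per-(weekday, hour) mail counts and
    score arrival times by how malmail-heavy their slot has been."""

    def __init__(self, smoothing=1.0):
        # Additive smoothing (an assumption) keeps sparse slots from
        # producing infinite or wildly large scores.
        self.smoothing = smoothing
        self.ham = defaultdict(int)  # real-mail count per slot
        self.mal = defaultdict(int)  # malmail count per slot

    @staticmethod
    def slot(arrived):
        # weekday(): Monday is 0, Sunday is 6; hour is 0..23.
        return (arrived.weekday(), arrived.hour)

    def learn(self, arrived, is_malmail):
        counts = self.mal if is_malmail else self.ham
        counts[self.slot(arrived)] += 1

    def score(self, arrived):
        """Log of (slot malmail odds / overall malmail odds).

        Positive: the slot is more malmail-heavy than average.
        Negative: less. Slots with no training data score 0.0.
        """
        s = self.slot(arrived)
        mal, ham = self.mal.get(s, 0), self.ham.get(s, 0)
        if mal + ham == 0:
            return 0.0
        a = self.smoothing
        slot_odds = (mal + a) / (ham + a)
        base_odds = (sum(self.mal.values()) + a) / (sum(self.ham.values()) + a)
        return math.log(slot_odds / base_odds)


# Usage with invented counts: malmail arrives at a steady rate,
# real mail only shows up during the weekday afternoon.
scorer = TimeSlotScorer()
sunday_3am = datetime(2008, 12, 21, 3, 0)
tuesday_2pm = datetime(2008, 12, 23, 14, 0)
for _ in range(10):
    scorer.learn(sunday_3am, is_malmail=True)
    scorer.learn(tuesday_2pm, is_malmail=True)
for _ in range(40):
    scorer.learn(tuesday_2pm, is_malmail=False)

night_score = scorer.score(sunday_3am)   # positive: worse than average
day_score = scorer.score(tuesday_2pm)    # negative: better than average
```

In a scoring filter, a value like this would presumably be scaled and capped before being added to a message's total, so that arrival time nudges the verdict rather than deciding it.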

Secondly, I’ve noticed, anecdotally, a pattern in the “missed” spam which makes it to my inbox: it arrives in batches. When I’m going through and selecting/deleting based on subject lines in my inbox preview pane, I am almost always shift-selecting to delete multiple adjacent messages, not ctrl- or cmd-selecting to pick out non-adjacent messages. I’ve noticed that, generally, I’ll be more or less selecting all messages for a timeslot like 10pm-7am; basically, for my personal email, everything that arrives in that window is spam. Not always, but very often. Occasionally I’ll get something from a European who doesn’t sleep at the right times of day. Sometimes something really crazy will happen and a friend will email me from Korea, where time is really fucked up. They might even have like one of those on-the-half-hour timezones over there. Anyway, clearly it’s not a hard and fast rule, but the SpamAssassin umbrella/plugin system with scoring is designed *precisely* for dealing with indications as opposed to hard-and-fast rules.
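To make the "10pm-7am" observation concrete, here is a small Python sketch of a soft indicator for that window. Everything in it is an assumption for illustration: the function names, the fixed UTC-8 recipient offset, and the 0.5 score are hypothetical, and none of it is SpamAssassin code. It takes an RFC 2822 timestamp (ideally the one your own server stamped on receipt, since the sender's Date header can be forged), converts it to the recipient's local time, and checks a window that wraps past midnight.

```python
from datetime import time, timedelta, timezone
from email.utils import parsedate_to_datetime

# Hypothetical values, chosen only for illustration.
RECIPIENT_TZ = timezone(timedelta(hours=-8))  # fixed offset, ignores DST
QUIET_WINDOW_SCORE = 0.5                      # small nudge, not a verdict


def in_quiet_window(timestamp, local_tz=RECIPIENT_TZ,
                    start=time(22, 0), end=time(7, 0)):
    """True if the timestamp falls in [start, end) in the recipient's
    local time. Handles windows that wrap past midnight."""
    arrived = parsedate_to_datetime(timestamp)
    # Note: a timestamp with a "-0000" offset parses as a naive datetime,
    # which astimezone() interprets as system local time.
    local = arrived.astimezone(local_tz).time()
    if start <= end:
        return start <= local < end
    return local >= start or local < end


def quiet_window_score(timestamp, local_tz=RECIPIENT_TZ):
    """Return a small positive score for quiet-hours mail, else 0.0."""
    return QUIET_WINDOW_SCORE if in_quiet_window(timestamp, local_tz) else 0.0


# 07:30 UTC is 23:30 the previous evening at UTC-8: inside the window.
late_night = in_quiet_window("Mon, 22 Dec 2008 07:30:00 +0000")
# 20:00 UTC is 12:00 noon at UTC-8: outside the window.
midday = in_quiet_window("Mon, 22 Dec 2008 20:00:00 +0000")
```

Keeping the score small is the point: the friend in Korea still gets through on the strength of everything else about the message, while a borderline spam that lands at 3am gets pushed over the line.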

So, after a multi-year hiatus (partly due to a non-compete agreement which prevented me from doing some work in the field), I think it’s time to get a little back into anti-spam hacking. No doubt there are some thoughts and previous work on this out there on the lazyweb from which I can benefit. The claim that “it can’t be used to identify spam” is just not true, though. There might be some rules needed to protect against exceptions, but it’s obvious from the graphs above that it can, and very likely should, be used to help identify spam.

1 Comment »

  1. jmason said,

    December 23, 2008 at 2:35 am

    hooray! welcome back ;) I’d say that non-compete has long expired by now…
