Spam comes from a wide variety of sources. It is simply impossible to dispose of all spam without discarding useful messages.
The simplest approach to filtering spam is filtering, at the mail server or when you sort through incoming mail. If you get 200 spam messages per day from ‘random-address@example.org’, you block ‘example.org’. If you get 200 messages about ‘VIAGRA’, you discard all messages with ‘VIAGRA’ in the message. If you get lots of spam from Bulgaria, for example, you try to filter all mail from Bulgarian IPs.
This, unfortunately, is a great way to discard legitimate e-mail. The risks of blocking a whole country (Bulgaria, Norway, Nigeria, China, etc.) or even a continent (Asia, Africa, Europe, etc.) from contacting you should be obvious, so don’t do it if you have the choice.
In another instance, the very informative and useful RISKS digest has been blocked by overzealous mail filters because it contained words that were common in spam messages. Nevertheless, in isolated cases, with great care, direct filtering of mail can be useful.
Another approach to filtering e-mail is the distributed spam processing, for instance DCC implements such a system. In essence, N systems around the world agree that a machine X in Ghana, Estonia, or California is sending out spam e-mail, and these N systems enter X or the spam e-mail from X into a database. The criteria for spam detection vary—it may be the number of messages sent, the content of the messages, and so on. When a user of the distributed processing system wants to find out if a message is spam, he consults one of those N systems.
Distributed spam processing works very well against spammers that send a large number of messages at once, but it requires the user to set up fairly complicated checks. There are commercial and free distributed spam processing systems. Distributed spam processing has its risks as well. For instance legitimate e-mail senders have been accused of sending spam, and their web sites and mailing lists have been shut down for some time because of the incident.
The statistical approach to spam filtering is also popular. It is based on a statistical analysis of previous spam messages. Usually the analysis is a simple word frequency count, with perhaps pairs of words or 3-word combinations thrown into the mix. Statistical analysis of spam works very well in most of the cases, but it can classify legitimate e-mail as spam in some cases. It takes time to run the analysis, the full message must be analyzed, and the user has to store the database of spam analysis. Statistical analysis on the server is gaining popularity. This has the advantage of letting the user Just Read Mail, but has the disadvantage that it’s harder to tell the server that it has misclassified mail.
Fighting spam is not easy, no matter what anyone says. There is no magic switch that will distinguish Viagra ads from Mom’s e-mails. Even people are having a hard time telling spam apart from non-spam, because spammers are actively looking to fool us into thinking they are Mom, essentially. Spamming is irritating, irresponsible, and idiotic behavior from a bunch of people who think the world owes them a favor. We hope the following sections will help you in fighting the spam plague.