Making ServWise

Howto: Reduce Spam using SpamAssassin Bayesian filters in cPanel

May 14th, 2007

Introduction

While this article demonstrates how to train and use Bayesian filters on cPanel hosted sites, the technique applies wherever SpamAssassin and fetchmail are installed.

Requirements

  1. cPanel based host
  2. SpamAssassin
  3. Access to the cron scheduler within cPanel
  4. Access to the file manager within cPanel
  5. Fetchmail must be installed and usable on your hosted server

Audience

This article assumes a working knowledge of cPanel, file editing, and email protocols. It is useful for anyone wanting to increase the accuracy of SpamAssassin’s spam detection.

SpamAssassin

SpamAssassin inspects emails being delivered to your email account, looking for signs that the email is spam. It has a number of signatures that it looks for, and each of these can be given a weighting, which results in the email having a spam “score”. If the score reaches a threshold, the email is considered spam. There are a number of actions that can be taken if the email is designated as spam, one of which is to alter the Subject line of the email to reflect that it is spam, so that a rule can be defined in your email client to automatically delete the email or move it to a junk email folder.

Bayesian Filters

Bayesian filters are a method that SpamAssassin can use to learn more about the kinds of email you receive, and so enhance its ability to identify which emails are spam and which are not. This also helps reduce the “false-positive” count, where an email is marked as spam when it should not have been.

These filters work by identifying key elements of emails that are common to spam, and those that are common to non-spam email. To achieve this, the filters must be given samples of both to learn from: at a minimum, 200 spam emails and 200 non-spam emails. Unfortunately, for many, obtaining 200 spam emails is a trivial task.

The Setup

This article assumes that you will be using a separate email address for spam training. This can make things easier to manage, especially if you use POP3 as your primary means for downloading email.

The basic steps are as follows:

  1. Enable and configure SpamAssassin
  2. Create Spam email account with a non-spam email folder
  3. Configure .fetchmailrc to allow SpamAssassin to download emails
  4. Configure cron to automatically train SpamAssassin
  5. Copy 200 spam emails to Spam inbox, and 200 non-spam emails to non-Spam folder

Enable SpamAssassin

  1. Navigate to the Mail / SpamAssassin menu in cPanel
  2. Click Enable SpamAssassin. This will enable SpamAssassin on any incoming email. Note that this alone will already identify a good deal of spam; the remainder of the article shows how to increase SpamAssassin’s accuracy.
  3. Click Configure SpamAssassin
  4. Ensure the configuration is as follows:

add_header: all

This adds headers to emails that will enable you to see why an email has been classed as spam or non-spam

report_safe: 0

This makes changes to the email “in situ”. If you would prefer that the email is left unaltered, set this option to “1”. A new email will be created with the spam details in the content, with the original unaltered email as an attachment

required_score: 5

SpamAssassin uses a series of criteria to determine if an email is spam, and each is given a score. The email will be classed as spam if the total score is 5 or above

rewrite_header subject: **SPAM** _HITS_ (_BAYES_)

This will change the subject line of the email. The email subject line will be prefixed with the above line, with _HITS_ replaced with the spam score for the email, and _BAYES_ replaced with the Bayesian score

score BAYES_99: 5

This states that any email rated 99% or above as spam by Bayesian analysis has five “spam” points added to its score. Note that this alone brings the score of the email up to the 5 point threshold defined above, and so automatically marks the email as spam

  5. Save your settings
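With add_header set to all, the reasons for a verdict can be read straight from the message headers. The following is a minimal sketch, assuming you have shell access and a message saved to a file. The sample message and its header values are made up for illustration; X-Spam-Flag and X-Spam-Status are the header names SpamAssassin normally writes, although long headers may be folded over several lines in real mail.

```shell
# Made-up sample message standing in for one saved from your mail client.
msg=$(mktemp)
cat > "$msg" <<'EOF'
From: someone@example.com
Subject: Cheap watches
X-Spam-Flag: YES
X-Spam-Status: Yes, score=10.2 required=5.0 tests=BAYES_99,HTML_MESSAGE

Buy now!
EOF

# Read only the header block (everything up to the first blank line)
# and keep the lines that SpamAssassin added.
spam_headers=$(sed '/^$/q' "$msg" | grep -i '^X-Spam-')
printf '%s\n' "$spam_headers"
rm -f "$msg"
```

The tests= list names the rules that fired, so a BAYES_99 entry there means the Bayesian filter contributed to the score.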

Create Spam Account

This email address will be used to train SpamAssassin; it is not intended that this account be used for sending or receiving normal emails.

  1. Navigate to Mail / Manage/Add/Remove Accounts
  2. Click Add Account
  3. Create a new account with an appropriate name, such as spam@mydomain.com, and define a password. SpamAssassin will empty this account on a regular basis, so it is not necessary to define a quota.

  4. Click Back and you should see a list of email accounts; click Read Webmail for the spam account you just created. Enter the password for the account, and you will be taken to the webmail selection page. Choose your preferred webmail client.

  5. Create a new folder called “nonspam”.

Configure Fetchmail

SpamAssassin uses fetchmail to read the emails from the spam account to train its Bayesian filters.

  1. Navigate to File Manager in cPanel

  2. You will be presented with a list of folders, followed by a list of files. At the bottom of the folder list, click Create New File

  3. On the right hand side, type .fetchmailrc in the dialog box, and leave the document type as Text Document

  4. The .fetchmailrc file should appear in the file list in the left hand panel. Click the filename and then select Edit File from the menu on the right hand side.

  5. Complete the file as follows, using the username and password you selected for your spam account above:

poll 127.0.0.1

with protocol imap

username spam+mydomain.com password spampassword is spam

     

    Note that the username is the email address selected, with the @ sign replaced with a +

  6. Save the file
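If you have shell access rather than only the File Manager, the same file can be created from the command line. This is a sketch under that assumption: it writes into a temporary directory standing in for your home directory, and the account name and password are the placeholders used above. Fetchmail generally refuses to use a run control file that other users can read, so the sketch also restricts the permissions.

```shell
# Stand-in for your home directory; in practice the file lives at ~/.fetchmailrc
demo_home=$(mktemp -d)
rc="$demo_home/.fetchmailrc"

cat > "$rc" <<'EOF'
poll 127.0.0.1
with protocol imap
username spam+mydomain.com password spampassword is spam
EOF

# Readable and writable by the owner only, since the file holds a password
chmod 600 "$rc"
perms=$(ls -l "$rc" | cut -c1-10)
echo "$perms"
```

In the File Manager, the equivalent is to set the file's permissions to 600 after saving it.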

 

Configure cron to Automatically Train SpamAssassin

Cron is a task scheduler, and so can carry out tasks at specific times of the day. We will configure cron to run the SpamAssassin training process for spam and non-spam once a day.

  1. Navigate to Cron jobs in cPanel

  2. Click Advanced

  3. Each time cron runs, it can send an email to an administrator to report the results of each task. Enter an appropriate email address in the box provided

  4. First we want SpamAssassin to train for non-spam. Set the time that you want this task to run by entering the minutes in the first box, and the hour in the second box. So to have the non-spam training task run at 1:30am, enter 30 in the minute column, and 1 in the hour column. The command to run the non-spam training task is:

fetchmail -v -u spam+mydomain.com -a -n --folder nonspam 127.0.0.1 -m 'sa-learn --ham --siteconfigpath /usr/share/spamassassin'

    This command says: fetch the email from the “nonspam” folder of the specified email account, pass it to the sa-learn command, and learn that this email is “ham” (non-spam).

     

  5. The next entry will tell SpamAssassin to train for spam. Enter a later time in the minute and hour fields, such as 35 for minute, and 1 for hour, to have the command run at 1:35am. The command to train for spam is:

fetchmail -v -u spam+mydomain.com -a -n 127.0.0.1 -m 'sa-learn --spam --siteconfigpath /usr/share/spamassassin'

    As no folder is specified, the spam emails will be assumed to be in the inbox of the spam account

The cron screen should now list two entries: the non-spam training task at 1:30am and the spam training task at 1:35am.
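Expressed in standard crontab notation (minute, hour, day of month, month, day of week, then the command), the two jobs would look roughly like the output of the sketch below. It only prints the lines and installs nothing; the account name and configuration path are the ones used in this article and may differ on your host.

```shell
user='spam+mydomain.com'
sa_conf='/usr/share/spamassassin'

# Non-spam (ham) training at 1:30am, spam training at 1:35am, every day
ham_job="30 1 * * * fetchmail -v -u $user -a -n --folder nonspam 127.0.0.1 -m 'sa-learn --ham --siteconfigpath $sa_conf'"
spam_job="35 1 * * * fetchmail -v -u $user -a -n 127.0.0.1 -m 'sa-learn --spam --siteconfigpath $sa_conf'"

printf '%s\n' "$ham_job" "$spam_job"
```

Running the ham job first means that a message accidentally left in both places is learned as non-spam before the spam pass runs.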

Provide Training Emails

As noted before, SpamAssassin requires 200 spam and 200 non-spam emails before it will start applying the Bayesian filters to emails.

To provide the emails, you must first access the spam email account in your preferred email client. Follow the instructions for your email client to access an IMAP email account, using the account name and password you selected above. When you have created the account and accessed it, you should see the Inbox and nonspam folders. Copy 200 spam emails into the inbox, and 200 non-spam emails into the nonspam folder. The next time the cron jobs are scheduled to execute, SpamAssassin will create a database from those emails that it will use to determine whether any future emails are spam.

 

Ongoing Management

Any emails that are identified as spam will be marked as such with **SPAM** in the subject line. You can use a rule or filter in your email client to automatically deal with these emails, perhaps by putting them into a folder the contents of which are automatically deleted after a defined time period.

SpamAssassin gets better with the more data it has to work with, and so from time to time you will get false positives, where an email is classed as spam that should not have been. If this happens, copy the email to the nonspam folder of the spam account and SpamAssassin will update its “nonspam” database. Likewise, some spam will not be identified, and these should be copied to the inbox of the spam account.

 

Troubleshooting

Most issues that may be encountered with using this method to train SpamAssassin can be diagnosed by reviewing the email received from the cron job. They are usually related to the fetchmail process executed through cron:

  1. The SpamAssassin global configuration file is not at /usr/share/spamassassin. Check with your hosting provider as to its location

  2. Fetchmail is not installed. Ask your hosting provider if it can be installed.

  3. Emails do not have a Bayesian spam score. This is usually because SpamAssassin has not been provided with enough emails to build a profile of what is and isn’t spam. Check that the inbox and nonspam folders are being emptied each day, and that you have provided 200 samples of each type of email
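If you can run commands on the host (from a shell, or temporarily through cron so the output arrives by email), sa-learn --dump magic reports how many messages the Bayesian database has learned. The sketch below parses that report; the sample text is illustrative rather than captured from a real server, and check_bayes_counts is a hypothetical helper name.

```shell
# Reads `sa-learn --dump magic` output on stdin and reports whether both
# counts have reached the 200-message minimum.
check_bayes_counts() {
  awk '
    $NF == "nspam" { nspam = $3 + 0 }
    $NF == "nham"  { nham  = $3 + 0 }
    END {
      printf "spam=%d ham=%d\n", nspam, nham
      exit !(nspam >= 200 && nham >= 200)
    }'
}

# Illustrative sample; on a real host use: sa-learn --dump magic | check_bayes_counts
sample='0.000          0          3          0  non-token data: bayes db version
0.000          0        225          0  non-token data: nspam
0.000          0        180          0  non-token data: nham'

if printf '%s\n' "$sample" | check_bayes_counts; then
  echo "Enough samples for Bayesian scoring"
else
  echo "More training email needed"
fi
```

With the sample figures the ham count is below 200, so the sketch reports that more training email is needed.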

This article also appears in our knowledgebase, along with many other cPanel tips

 

Sabayon Stabiliser 2

April 27th, 2007

This article is a follow-on from my previous go at putting together a technique for encouraging Sabayon to move away from using packages that are masked for testing, and to use stable packages instead. I posted a link to the original at the http://www.sabayonlinux.org forums, and voxiac pointed out the obvious: if you want to get to a stable environment, then the goal should be to change the ACCEPT_KEYWORDS from accepting all test masked packages to using stables. The rest can follow from there. So this is how to do it.

Note: All references to ~amd64 should be changed to reflect your architecture – probably x86 if not amd64

First things first, edit /etc/make.conf and either comment out, remove, or change the ACCEPT_KEYWORDS line to:

ACCEPT_KEYWORDS=amd64

If you do this, you must do the rest, otherwise you could be in trouble next time you get to emerge something. This line tells emerge that all packages must be stable, unless explicitly stated in /etc/portage/package.keywords

Next we need to get the list of packages that are currently masked as test packages, and create a file that we can use to add to the package.keywords to let emerge know that they can remain as test packages.

We use three commands piped together to do this. You can try these out stage by stage so that you can see exactly what is going on. The first part is:

equery -i -N list

This shows a list of installed packages, such as:

[I--] [ ~] x11-wm/twm-1.0.3 (0)

The second column will contain a tilde ‘~’ if the package is masked. These are the packages we need, so we will pipe the output into grep to pull out only those lines that contain a tilde:

equery -i -N list | grep \~

The output will now only contain the masked packages. We need to manipulate each line a little to pull out just the package name, and then add the prefix and suffix we need ready for the package.keywords file:

equery -i -N list | grep \~ | sed 's/.* \(.*\) (.*/=\1 ~amd64/'

This transforms the above line:

[I--] [ ~] x11-wm/twm-1.0.3 (0)

into:

=x11-wm/twm-1.0.3 ~amd64

This means “permit use of a masked package for x11-wm/twm, for version 1.0.3 only.”
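The grep and sed stages can be tried on the sample line alone, without running equery. A minimal sketch, using the line shown above as input:

```shell
# Sample line in the format equery prints; no Gentoo tools are needed here
sample='[I--] [ ~] x11-wm/twm-1.0.3 (0)'

# Keep lines with a tilde, then capture the package atom that sits between
# the last "] " and the trailing " (0)" and wrap it for package.keywords
entry=$(printf '%s\n' "$sample" | grep '~' | sed 's/.* \(.*\) (.*/=\1 ~amd64/')
echo "$entry"
```

The leading greedy .* swallows both bracketed columns, which is why only the package name and version end up in \1.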

We just need to output the results of that to a file, and append it to /etc/portage/package.keywords

Output to a file:

equery -i -N list | grep \~ | sed 's/.* \(.*\) (.*/=\1 ~amd64/' > testpackages

Append to the end of /etc/portage/package.keywords (take a backup first):

cp /etc/portage/package.keywords /etc/portage/package.keywords.back

cat testpackages >> /etc/portage/package.keywords

These last two commands need to be run with root privileges. If you followed the previous article’s approach, you will have a bunch of old entries in there from the previous method. These will need to be removed first.

Once you have done this, you can effectively leave it alone. As new packages are released into the stable tree, they will supersede the ones you have installed, and the package.keywords entries will be ignored. voxiac suggested a cron job that would clean up the package.keywords file over time. I’ll have a go at putting something together down the line, depending on what level of hassle the cleanup is.
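If the append is ever run twice, package.keywords ends up with duplicate lines. One way to avoid that, sketched here on temporary stand-in files rather than the real /etc/portage/package.keywords, is to append only the lines that are not already present:

```shell
work=$(mktemp -d)
keywords="$work/package.keywords"   # stand-in for /etc/portage/package.keywords
printf '%s\n' '=x11-wm/twm-1.0.3 ~amd64' > "$keywords"
printf '%s\n' '=x11-wm/twm-1.0.3 ~amd64' \
              '=media-video/vlc-0.8.6-r1 ~amd64' > "$work/testpackages"

# Take the backup first, as above
cp "$keywords" "$keywords.back"

# -F fixed strings, -x whole-line match, -v invert, -f read patterns from file:
# prints only the testpackages lines that package.keywords does not contain yet
grep -vxFf "$keywords.back" "$work/testpackages" >> "$keywords"

line_count=$(grep -c '' "$keywords")
cat "$keywords"
```

The patterns are read from the backup copy so that grep never reads and appends to the same file at once.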

By the way, this command will give you a count of the current test packages you have installed:

equery -i -N list | grep \~ | wc -l

Versus total installed:

equery -i -N list | grep \/ | wc -l

My count is:

Total: 1347, of which 852 are unstable.
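The two counts can also be turned into a percentage in one go. A small sketch on three made-up lines standing in for the equery output (grep -c counts matching lines, so wc -l is not needed):

```shell
# Made-up lines in equery's format; replace with real `equery -i -N list` output
sample='[I--] [ ~] x11-wm/twm-1.0.3 (0)
[I--] [  ] sys-apps/sed-4.1.5 (0)
[I--] [ ~] media-video/vlc-0.8.6-r1 (0)'

unstable=$(printf '%s\n' "$sample" | grep -c '~')
total=$(printf '%s\n' "$sample" | grep -c '/')
percent=$((unstable * 100 / total))   # integer division, rounds down
echo "Total: $total, unstable: $unstable ($percent%)"
```

Applied to my figures, 852 out of 1347 works out to about 63% unstable.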

Quite a way to go….

Sabayon Stabiliser

April 21st, 2007


The following talks about how you can gradually move Sabayon from the testing tree to the stable tree of portage.

Sabayon. Very cool. However, it is a bit disturbing that it bases everything on the testing tree of portage.

In make.conf, there is a line:

ACCEPT_KEYWORDS=~amd64

That last part may read ~x86 if you are on a 32-bit platform. In Gentoo speak, a package is “masked” in the portage tree (the software repository) until it has completed testing. The ~x86 or ~amd64 keyword identifies a package as being in testing. Once testing is complete, the keyword changes to “x86” or “amd64”, i.e. without the tilde.

The line above is essentially saying that whenever you install a package, it should install the latest testing version.

This can lead to problems, as the packages in testing haven’t yet been confirmed as ready for release; that is, they haven’t been QAed to work with everything else you may have installed.

So just remove the line from make.conf right? Wrong. The dependencies that packages have on each other are far too complex for this to work. Another strategy would be to work out groups of packages that work together, and roll them back to stable using /etc/portage/package.keywords and /etc/portage/package.mask

Forget it. Portage is the tangled web we weave.

The following approach works on this principle: you accept that you currently have a bunch of testing packages installed, but everything seems to be working. You then take advantage of the fact that lots of people are volunteering their time to ensure that these packages are suitable for release, or fixing them so that they are. In effect you say “OK, no package can be upgraded from now on until it is in the release tree”.

So look at this:

equery -i list | sed 's/\(.*\)/>\1 -~amd64 amd64/' > packages

This command will get a list of all your currently installed packages, add a “>” to the beginning of each line, add “ -~amd64 amd64” to the end of each line, and output them all to a file called “packages”.

So if the package was this:

media-video/vlc-0.8.6-r1

Now it looks like this:

>media-video/vlc-0.8.6-r1 -~amd64 amd64

If this line were put into /etc/portage/package.keywords, it would be saying “remove the ~amd64 mask from this package, and add the release mask, if the version number is greater than the one shown”. In other words, don’t install any more test versions for this package.

So if the file you generated is added to the end of /etc/portage/package.keywords, then any updates to these packages will not be installed unless they are in the release tree. So now, all you need to do is wait until enough packages are in release that you feel confident you can remove the ACCEPT_KEYWORDS line from make.conf.
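The transformation can be tried on the sample package name without running equery. This sketch uses sed’s basic-regex group syntax, \( \) with \1, to wrap each line; it is one way to produce the line shown above, not necessarily the exact command that was run.

```shell
# Sample package atom from the example above
pkg='media-video/vlc-0.8.6-r1'

# Capture the whole line, then emit it with ">" in front and the keywords after
line=$(printf '%s\n' "$pkg" | sed 's/\(.*\)/>\1 -~amd64 amd64/')
echo "$line"
```

Note that sed writes back-references as \1; the $1 form belongs to Perl and would be copied into the output literally.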

Note that for the above file, the first line of the “packages” file will not be valid. Remove the first line. You can append the file to package.keywords with (as root):

cat packages >> /etc/portage/package.keywords

Note that this is not a fire and forget solution. There are likely to be dependency tweaks you need to make as you go along, and you will have to enter a line in package.keywords for any new packages you install, but this at least will stop “emerge world” from ensuring your system is never stable.

Windows the Boot

March 27th, 2007

Yeah, Windows sucks. It is truly awful. But a necessary evil for myself and many others.

One of the suckiest things is how it refuses to accept that there can be in existence any other operating systems in the world but itself.

So I have a PC, and had a Windows and Linux install running together for some time, with grub in the MBR handling the booting of both Linux and Windows. Or at least passing booting off to ntldr to bring Windows up. No problem.

But then I wanted to reinstall Windows in another partition. It is always good to do a reinstall of Windows into a fresh partition so that you can copy over bits and pieces of the old install as you need them (notably stuff in the Application Data folder in your profile). So it was time. My Windows installation had gotten slow and bloated after only six months, as I randomly install software to make the experience more palatable. My new strategy was to run everything from Linux primarily, and do Windows stuff from a windows VM on my server, and just use native Windows for games.

I did try and use VM Server as a way of reinstalling Windows into the partition without having to leave Linux, by creating a partition based disk for the vm. VM Server reports that the partition table is invalid if you do this on my install. There are a few people who have experienced this before, but none seemed quite the same as mine, and I didn’t get to the bottom of it.

So it was time for a native install. So a PC fairly useless while Windows gets on with copying itself. Except it didn’t work. This was new – most of the problems you get with Windows I have encountered one time or another but this one I hadn’t seen.

The CD would boot, and do the “Checking Hardware configuration” message that precedes the blue install screen, but it never got to the blue. It just cleared the screen and hung.

So.. a bit of a googling (sorry Google for using your name as a verb, but seriously, it is one now) and I discover that Windows XP install will not work if there is a boot sector that it doesn’t know. Ie, if there is one present, and it isn’t a Windows one. Either that or the partition tables aren’t aligned the way it expects.

So a bit of monkeying around later – reorganising the partitions in a way that I did not want, and having to boot the disk with no hard drives connected, I finally got Windows installed. Of course my boot sector is now trashed with the Windows one instead of grub.

But clearly putting grub back is only going to give me problems next time I want to wipe windows.

The good news is that you can boot Linux – essentially chain grub – from NTLDR – the Windows boot manager. There are a stack of articles about it around, and from my experience GRLDR doesn’t work – this is the grub4dos boot manager that should be able to read and process a standard menu.lst file. The documentation is erratic, and says that it must be on the boot drive, but cannot be on ntfs. Not sure how you resolve that if your boot drive is ntfs.

So the other way is to get a copy of the grubbed boot sector, make a file out of it, put it in the root of your boot drive, and just add it to boot.ini.

You need to understand which is your boot drive. In my case grub was installed in the MBR of the primary disk /dev/hda. After all this stuffing around, that had been wiped, so I reinstalled it into another drive:


# grub
grub> root (hd0,7)
grub> setup (hd0,7)

This will install grub into /dev/hda8 (the eighth partition of the first ide drive – grub itself numbers partitions from zero). If your first drive is SATA, then this may refer to /dev/sda8 instead.

Now make a copy of the first sector (512 bytes) of that partition:


dd if=/dev/hda8 of=linux.boot bs=512 count=1
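Before copying the file over to Windows, it is worth checking that it really is one sector long and carries the usual 55 AA boot signature in its last two bytes. The sketch below does that against a fake four-sector file standing in for /dev/hda8, so nothing on a real disk is read or written; on a real system, point the od line at linux.boot.

```shell
# Fake 4-sector "partition" standing in for /dev/hda8
part=$(mktemp)
boot=$(mktemp)
dd if=/dev/zero of="$part" bs=512 count=4 2>/dev/null
# Plant the 0x55 0xAA signature at byte offset 510, where a boot sector keeps it
printf '\125\252' | dd of="$part" bs=1 seek=510 conv=notrunc 2>/dev/null

# Same dd invocation as above: one block of 512 bytes
dd if="$part" of="$boot" bs=512 count=1 2>/dev/null

size=$(( $(wc -c < "$boot") ))
# Dump the two bytes at offset 510 as hex and strip the padding
sig=$(od -An -tx1 -j510 -N2 "$boot" | tr -d ' \n')
echo "size=$size signature=$sig"
rm -f "$part" "$boot"
```

A size other than 512, or a signature other than 55aa, suggests the copy was taken from the wrong device or that grub was not installed into that partition.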

Copy this to your primary Windows partition – wherever boot.ini resides (usually c:\). Then edit boot.ini and put in the line:


c:\linux.boot="Linux"

Reboot, and this will appear in the standard boot menu for Windows.