A guide to Content Control in Pegasus Mail and Mercury/32.

This document was last updated 26-December-2003.



Content Control rule language

Mercury's content control rule language has been designed to be simple and flexible: it is based on the use of regular expressions, which describe patterns of text within the message.

A rule set consists of a sequence of tests applied sequentially to the message.

 
The types of test

Body and subject tests:
These tests look for content in either the subject field or the body of the mail message. There are two types of test - a substring test, using the CONTAINS operator, and a regular expression test, using the MATCHES operator. The substring test simply looks for a group of characters anywhere in the specified location, while a regular expression test looks for more complex patterns of characters. See below for more information on the difference between substring and regular expression tests.

IF SUBJECT CONTAINS "string" WEIGHT x
IF SUBJECT MATCHES "regular_expression" WEIGHT x
IF BODY CONTAINS "string" WEIGHT x
IF BODY MATCHES "regular_expression" WEIGHT x
If you want to test for a string or a pattern in either the body or the subject field, you can use the CONTENT test instead - this checks in both places automatically:
IF CONTENT CONTAINS "string" WEIGHT x
IF CONTENT MATCHES "regular_expression" WEIGHT x
Header tests:
These tests check specific headers or groups of headers in the mail message. The SENDER test looks in the "From", "Sender", "Resent-From" and "Reply-to" fields of the message, while the RECIPIENT test looks in the "To", "CC", "BCC" and "Resent-To" fields. The HEADER test allows you to check any single header in the message: if the header does not exist, the test does not trigger. Finally, the EXISTS test allows you to check whether or not a specific header exists in the message.
IF SENDER CONTAINS "string" WEIGHT x
IF SENDER MATCHES "regular_expression" WEIGHT x
IF RECIPIENT CONTAINS "string" WEIGHT x
IF RECIPIENT MATCHES "regular_expression" WEIGHT x
IF HEADER "headername" CONTAINS "string" WEIGHT x
IF HEADER "headername" MATCHES "regular_expression" WEIGHT x
IF EXISTS "headername" WEIGHT x
Wordlist tests - HAS and HASALL:
There are also two more specialized tests you can use to look for groups of words in a message:
IF xx HAS "wordlist" WEIGHT x
IF xx HASALL "wordlist" WEIGHT x
(Note that "xx" can be "subject", "sender", "recipient", "header" or "body") Both of these tests accept a list of words separated by commas as their parameter. The HAS test will succeed if the message contains any of the words in the list, while the HASALL test will succeed if the message contains all the words in the list, in any order.
Example: to detect a message containing "viagra", "prescription" and "erectile":
IF BODY HASALL "Viagra, prescription, erectile" weight 50
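
The difference between HAS and HASALL can be modelled in a few lines of Python. This is only an illustrative sketch, not Mercury's implementation: the function names are hypothetical, and it assumes that words are compared case-insensitively and as whole words, which the description above does not spell out.

```python
import re

def parse_wordlist(wordlist):
    # Split the comma-separated parameter and trim the spaces around each word.
    return [w.strip().lower() for w in wordlist.split(",") if w.strip()]

def words_in(text):
    # Assumption: a "word" is a run of letters, digits or apostrophes.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def has(text, wordlist):
    # Succeeds if ANY word in the list is present.
    present = words_in(text)
    return any(w in present for w in parse_wordlist(wordlist))

def hasall(text, wordlist):
    # Succeeds only if EVERY word in the list is present, in any order.
    present = words_in(text)
    return all(w in present for w in parse_wordlist(wordlist))

body = "No prescription needed: VIAGRA treats erectile problems."
print(has(body, "Viagra, prescription, erectile"))     # True
print(hasall(body, "Viagra, prescription, erectile"))  # True
print(hasall("Viagra, no prescription", "Viagra, prescription, erectile"))  # False
```

In this model the order of the words in the list and in the message is irrelevant, which mirrors the "in any order" wording above.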
Specialized tests
Mercury has a number of specialized tests that are specifically designed for detecting spam (unsolicited commercial e-mail); these tests examine special characteristics of the message that could not otherwise be easily detected using standard regular expressions. Specialized testing can trigger on things like Lazy HTML content (messages containing links to graphics instead of the graphics themselves), excessive use of HTML comments and other telltale signs of spam. The specialized tests are described in the next part of this guide, "Rule language (Specialized tests)".

Negating and linking tests (NOT, AND and OR operators):
You can negate a test by using IFNOT instead of IF: similarly, you can link multiple tests together by using AND, ANDNOT, OR or ORNOT instead of IF in each test following the first.

Substring matching vs Regular expressions:
Any test that uses the CONTAINS keyword performs a simple string search instead of a regular expression match: this is a little faster and a little easier to understand than the regular-expression based versions of the rules. Note that CONTAINS tests are completely literal - no regular expression matching of any kind occurs. CONTAINS tests are always case-insensitive - so, the strings "foo" and "FOO" are identical as far as a CONTAINS test is concerned.

Detecting obfuscated text:
A common trick in spam is to embed unusual characters in words that commonly trigger anti-spam routines - like "vi@gra", or "pen1s"; indeed, this technique is now becoming so pervasive that Mercury includes a special keyword just to handle it. When defining HAS, HASALL or CONTAINS rules, you can add the keyword OBFUSCATED (you can abbreviate this to OB if you wish) before the WEIGHT keyword in the rule - like this:

IF SUBJECT CONTAINS "viagra" OBFUSCATED WEIGHT 51
This rule will detect any of the following words in the subject line of a message:
"viagra", "v-i-a-g-r-a", "vi@gra", "V 1 -@- G R A" or even "_v$1&@(G*r*A".
If you want to test for a phrase when using the OBFUSCATED keyword, you must enter the phrase in the rule without spaces: so, if you wanted to check for any obfuscated version of the phrase "increase the length of", you would have to enter it like this:
IF CONTENT CONTAINS "increasethelengthof" OB WEIGHT 51
Note that you cannot use the OBFUSCATED keyword on a MATCHES test - if you do, Mercury will simply ignore the keyword and match using the expression you provide.

CAUTION: You should exercise a certain amount of caution when using obfuscated tests, because there is a slightly increased risk of false positive matching (i.e., two adjacent words which, while harmless on their own, combine to form a trigger word).
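
To see why phrases must be entered without spaces, and where the false positives come from, it can help to model obfuscated matching in Python. The sketch below is an approximation under stated assumptions, not Mercury's actual algorithm: it lowercases the text, replaces two look-alike characters using a hypothetical table (chosen only because it is enough for the sample words above), discards everything that is not a letter, and then does a plain substring search.

```python
# Hypothetical look-alike table; Mercury's real substitutions are not documented here.
LOOKALIKES = {"@": "a", "1": "i"}

def normalize(text):
    # Lowercase, map look-alike characters, then keep letters only.
    kept = []
    for ch in text.lower():
        ch = LOOKALIKES.get(ch, ch)
        if ch.isalpha():
            kept.append(ch)
    return "".join(kept)

def contains_obfuscated(text, phrase):
    # The phrase is used as typed (apart from case), so it must not contain spaces.
    return phrase.lower() in normalize(text)

for sample in ["viagra", "v-i-a-g-r-a", "vi@gra", "V 1 -@- G R A", "_v$1&@(G*r*A"]:
    print(contains_obfuscated(sample, "viagra"))   # True for every sample

# Spaces are gone from the normalized text, so a phrase with spaces can never match.
print(contains_obfuscated("Increase the length of", "increase the length of"))  # False
print(contains_obfuscated("Increase the length of", "increasethelengthof"))     # True

# False positive: "via" followed by "gravel" runs together as "viagravel".
print(contains_obfuscated("Delivered via gravel road", "viagra"))               # True
```

The last line shows the risk described in the caution above: once spaces and punctuation are ignored, word boundaries no longer protect harmless text.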

Tags
Any rule can have a Tag, or a name used to describe it: the tag is used by Mercury when you have told it to construct a diagnostic header for messages, and is useful when the test that the rule is performing is either very verbose or very obscure, or when the actual text of the rule may contain offensive material.

Example: IF BODY HAS "Fuck, Shit" Weight 100 Tag "Offensive language"
In this example, when Mercury prepares the "X-CC-Diagnostic" header in the message, it will format it as:
Offensive language (100) instead of Body Has "Fuck, Shit" (100), which may be offensive to some people.

Tags are optional, and can appear instead of or after a WEIGHT statement. The name parameter to a Tag statement must always appear in double-quote marks, as shown in the example above.

General layout

The rule language itself is not case-sensitive, so the following two tests are both valid and mean the same thing:

If Sender contains "foobar" weight 80
IF SENDER CONTAINS "foobar" WEIGHT 80
Furthermore, whitespace is ignored, so you can lay out your tests in whatever way you feel is clearest: as an example, the following is a completely syntactically valid test:
If
		sender
			contains
			"Foobar"
		Weight 80
The only restriction is that neither a string nor a keyword can cross a line boundary; so, the following test is invalid:
	If sender con
		tains "foobar" Weight 80
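
One way to picture these layout rules is as a tokenizer that reads the rule file one line at a time. The following Python sketch is purely illustrative and is not how Mercury is known to parse rules: the `tokenize` function and its regular expression are assumptions made for this example.

```python
import re

# A token is either a double-quoted string or a run of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\S+')

def tokenize(rule_text):
    tokens = []
    for line in rule_text.splitlines():        # tokens are collected per line,
        for tok in TOKEN.findall(line):        # so none can span a line break
            if tok.startswith('"'):
                tokens.append(tok)             # strings are kept exactly as typed
            else:
                tokens.append(tok.upper())     # keywords are not case-sensitive
    return tokens

one_line = 'If Sender contains "foobar" weight 80'
spread_out = 'IF\n\t\tsender\n\t\t\tCONTAINS\n\t\t\t"foobar"\n\t\tWeight 80'
broken = 'If sender con\n\t\ttains "foobar" Weight 80'

print(tokenize(one_line) == tokenize(spread_out))  # True
print("CONTAINS" in tokenize(broken))              # False
```

Case and whitespace differences disappear once the rule is reduced to tokens, while the keyword split across two lines becomes two unrelated tokens, "CON" and "TAINS".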
Examples:

1: To detect a message where the sender's address contains "spam.com":

IF SENDER CONTAINS "spam.com" WEIGHT 50

2: To detect a message where the sender's address contains "spam.com" and the body of the message contains the word "viagra":

IF SENDER CONTAINS "spam.com"
AND BODY CONTAINS "viagra" WEIGHT 50 tag "Viagra ad"

3: To detect a message where the sender's address contains "spam.com" and either the subject field or the message body contains the word "viagra", allowing for possible obfuscation of the text:

IF SENDER CONTAINS "spam.com"
AND CONTENT CONTAINS "viagra" OBFUSCATED WEIGHT 50

4: To detect a message where the sender's address contains "spam.com", the message has no "Date" header, and the Subject or the Body contains "viagra":

IF SENDER CONTAINS "spam.com"
ANDNOT EXISTS "Date"
AND CONTENT CONTAINS "viagra" WEIGHT 50

Making the most of regular expressions

The CONTAINS test does a simple string search, looking for the exact text you provide anywhere in the message. Often, however, you may want to look for patterns of text rather than exact strings: you can do this by using a MATCHES test instead of a CONTAINS test, because MATCHES tests use a special pattern-matching mechanism called a regular expression to describe the general form of text you want to find.

Using regular expressions, you can detect extremely complex patterns of text within the messages you filter. Mercury's regular expression engine uses what is known as a metasyntax to describe the pattern you want to match: in the metasyntax, certain characters have special meanings that Mercury applies to the text it is testing. The following special characters are recognized in your expressions:

*	Match any number of any characters
?	Match any single character
+	Match one or more occurrences of the last character
[ ]	Encloses a group of characters to match. Ranges can be specified in the group using '-'.
/w	Match zero or more whitespace characters
/W	Match one or more whitespace characters
/c	Toggle case sensitivity (case-insensitive by default)
/s	Toggle whitespace stripping (ignore all whitespace in the source)
/b	Match a start-of-word boundary (including start-of-line)
/B	Match an end-of-word boundary (including end-of-line)
/x	Toggle "ignore-non-alpha" mode (ignore non-alphanumeric characters)
/X	Toggle "ignore-spam" mode (ignore non-alphanumeric characters except @ and |)
You can use any number of metacharacters in an expression - so, for example, to detect all users at any system within the domain "spam.com", you could use the regular expression
*@*.spam.com

The set operation is especially powerful, particularly when combined with the repeat occurrence operator: so, to detect a message where the subject line ends in a group of three or more digits (a common indicator of a spam message) you would use this expression:

Subject:*[0-9][0-9][0-9]+

In this expression, we use the "*" operator to match the general text within the subject line, then we use the set "[0-9]" three times to force a minimum of three digits, and a "+" operator to detect any extra digits following the third one. Because there is no "*" at the end of the expression, the digits must therefore be the last characters on the line - if there is any text following them, the expression will fail.

Normally, Mercury compares all text on a case-insensitive basis - that means that it will regard "hello" and "HELLO" as being the same. In some cases, though, the case of the text you're matching can be important, so the "/c" operator allows you to toggle Mercury between case insensitive and case-sensitive comparisons. So, to detect the string "FREE!" anywhere within the subject line of a message, you would use this expression:

Subject:/c*FREE!*

This expression will only succeed if the string "FREE!" appears in uppercase characters.

Important note: matching anywhere within the target text
Mercury's regular expression parser is designed to start at the beginning of the text it is evaluating and to stop matching at the end. As a result, if you want to find a regular expression anywhere within the text you are examining, you need to start and end the expression with an asterisk operator (*).
To illustrate why this is necessary, consider the following three regular expressions:

Wearing a fedora hat
Wearing a fedora hat*
*Wearing a fedora hat
The first will only match if the target text consists only of the string "Wearing a fedora hat": if there is text before or after the string, the match will fail. The second will match only if the text starts with the string "Wearing a fedora hat". If there is any text before the string, the match will fail, but the "*" at the end ensures that any text following the string will not prevent a match. The last example will match only if the text ends with "Wearing a fedora hat" - again, the "*" at the start of the expression will match anything prior to the string.

If you want to find the expression anywhere it occurs in the target text, you need to enter it as

*wearing a fedora hat*
If you forget to add the leading and trailing * operators, the rule will typically not work, and this error can be quite difficult to spot when you're simply reading the source file.
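
The whole-text matching described in this section can be tried out with a small Python model. This is a rough sketch, not Mercury's parser: the hypothetical `to_python_regex` function translates only the `*`, `?`, `+`, `[ ]` and `/c` items into Python's `re` syntax (the other `/` switches are not handled and would be treated as literal text), and `re.fullmatch` stands in for the rule that an expression must account for the entire target text.

```python
import re

def to_python_regex(pattern):
    parts = []               # translated pieces, one per pattern item
    case_sensitive = False   # matching is case-insensitive by default
    repeatable = False       # True when the previous piece may take a "+"
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("/c", i):
            case_sensitive = not case_sensitive   # /c toggles case sensitivity
            i += 2
            continue
        if ch == "*":
            parts.append(".*")                    # any number of any characters
            repeatable = False
        elif ch == "?":
            parts.append(".")                     # any single character
            repeatable = True
        elif ch == "+" and repeatable:
            parts[-1] += "+"                      # one or more of the previous item
            repeatable = False
        else:
            end = pattern.find("]", i) if ch == "[" else -1
            if end != -1:
                atom = pattern[i:end + 1]         # character set, passed through
                i = end
            else:
                atom = re.escape(ch)              # ordinary literal character
            parts.append(atom if case_sensitive else "(?i:" + atom + ")")
            repeatable = True
        i += 1
    return "".join(parts)

def matches(pattern, text):
    # fullmatch: the expression must cover the text from start to end.
    return re.fullmatch(to_python_regex(pattern), text, re.DOTALL) is not None

text = "He was wearing a fedora hat today"
print(matches("Wearing a fedora hat", text))     # False: text before and after
print(matches("Wearing a fedora hat*", text))    # False: text before
print(matches("*Wearing a fedora hat", text))    # False: text after
print(matches("*wearing a fedora hat*", text))   # True
```

Only the last form finds the phrase in the middle of the text, which is the behavior the note above warns about. The same helper can be used to experiment with the other expressions in this section, such as `Subject:*[0-9][0-9][0-9]+` or `Subject:/c*FREE!*`.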