Foreign language spam

From MozillaZine Knowledge Base

This article discusses several ways to identify foreign language spam using message filters. It focuses on Russian and Chinese since they are the most common cases, but the same techniques can be used with any foreign language. Several of the techniques rely upon custom headers. You can add a custom header by selecting "Customize" from the bottom of the leftmost list box (the one that starts with "Subject") when creating a message filter. If you add one you have to use the custom header in that message filter, but afterwards it can also be used in any other message filter (in any account).

Content-Type: header

The Content-Type header may identify the character set. For example, Content-Type: text/plain; charset=koi8-r indicates a plain text message using the KOI8-R character set encoding, which is designed to cover Russian and Bulgarian using the Cyrillic alphabet. If the header does identify the character set, you can add a custom header for Content-Type and test whether "Content-Type" "contains" "koi8-r". Use "View -> Message Source" (or Ctrl+U) to see the message source.

Russian uses the Cyrillic alphabet. Some commonly used character sets for Russian spam are KOI8-R, KOI8-U, ISO-8859-5, and Windows-1251.

There are a couple of problems with this approach:

  • The sender doesn't have to identify the character set.
  • The message might use Unicode (UTF-8) or Windows-1251 (which adds Cyrillic alphabet characters to the 7-bit ASCII character set). These character sets are sometimes also used for messages that contain only 7-bit ASCII characters.
  • If the Content-Type line wraps, Thunderbird won't read all of it. This is a bug; it does read multiple lines for "Received", "To" and "CC". You used to be able to work around this by testing whether the "Body" contained that string, but "Body" was recently redefined to mean just the message body rather than the entire message including all headers. Unfortunately, no alias such as "Headers" (to test all of the headers) or "All" (to test the entire message) was added when that change was made.
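
If you want to check the declared character set outside of Thunderbird, for example on message source saved from "View -> Message Source", the following is a minimal Python sketch using the standard library email package. It is not something Thunderbird itself runs. The names SUSPECT_CHARSETS, declared_charsets and looks_russian are made up for this illustration, and the list of character sets is simply the one named in this article. The parser joins a wrapped Content-Type header before reading the charset, so the wrapping problem described above does not apply to it.

```python
from email import message_from_string

# Charset labels to flag; these are the Russian character sets named in this article.
SUSPECT_CHARSETS = {"koi8-r", "koi8-u", "iso-8859-5", "windows-1251"}


def declared_charsets(raw_message):
    """Return the lower-cased charsets declared by any part of the message."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():  # visits the message itself and every MIME part
        charset = part.get_content_charset()  # None when no charset is declared
        if charset:
            found.add(charset.lower())
    return found


def looks_russian(raw_message):
    """True when any part declares one of the suspect charsets."""
    return bool(declared_charsets(raw_message) & SUSPECT_CHARSETS)


# A message whose Content-Type header wraps onto a second line.
sample = (
    "From: someone@example.com\n"
    "Subject: test\n"
    "Content-Type: text/plain;\n"
    "\tcharset=KOI8-R\n"
    "\n"
    "body text\n"
)
print(looks_russian(sample))  # True
```

As with the message filter, this only helps when the sender actually declares the character set.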

Wikipedia has lists of popular, Cyrillic, Big5 (Chinese), and GB (Chinese) character sets.

Foreign letters

Look in Wikipedia to find the lower case version of the most common vowels for a foreign language, and then test whether a message contains any of them that are not used in English. This table lists the letters of the Cyrillic alphabet and identifies how they are used in various languages. This article identifies the Russian vowels.

Russian uses а, э, ы, у, о, я, е, ё, ю, и as vowels. You could create a message filter set to "Matches any of the following" that tests whether "Body" "contains" "и", "Body" "contains" "ё" and so forth until you have covered all of the vowels. Since English also uses the look-alike letters "a", "e", "o", and "y", don't test for those four. The reason for "Matches any of the following" is to logically OR the tests: you want the action to take place if any of those letters is found.
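
The same OR logic can be written down as a short Python sketch, which may help when trying the idea on saved message text before building the filter. The names RUSSIAN_ONLY_VOWELS and contains_russian_vowel are made up for this illustration. Lower-casing the body first is an assumption of the sketch, not a statement about how Thunderbird compares text.

```python
# Russian vowels that have no look-alike among the English letters
# (the same six letters used in the sample filter in this article).
RUSSIAN_ONLY_VOWELS = "эыяёюи"


def contains_russian_vowel(body):
    """True if any of the vowels occurs in the body (a logical OR of the tests)."""
    lowered = body.lower()  # assumption: compare against lower case letters only
    return any(vowel in lowered for vowel in RUSSIAN_ONLY_VOWELS)


print(contains_russian_vowel("Привет"))        # True
print(contains_russian_vowel("Hello, world"))  # False
```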

Foreign words

Choose several very common words such as "and" and "to", and use BabelFish to convert each word from English to the other language. For example, supposedly "and" is "и" and "to" is "до" in Russian. You could create a message filter set to "Matches any of the following" that tests whether "Body" "contains" "и" or "Body" "contains" "до".

It can be hard to know which source to believe, though. This online dictionary says "and" is "соединительный союз и". If you convert that using BabelFish, it says that means "connecting union and".
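
One thing to keep in mind is that a "contains" test for a one-letter word such as "и" will, as far as I can tell, match any text containing that letter, which makes it behave like the vowel test. A script run outside Thunderbird can instead compare whole words. The following is a minimal Python sketch of that idea; the names COMMON_RUSSIAN_WORDS and contains_common_word are made up for this illustration, and the word list is just the two example words above.

```python
import re

# The two example words from the text above; extend with other common words.
COMMON_RUSSIAN_WORDS = {"и", "до"}


def contains_common_word(body):
    """True if any whole word in the body is one of the common words."""
    words = re.findall(r"\w+", body.lower())  # \w matches Cyrillic letters too
    return any(word in COMMON_RUSSIAN_WORDS for word in words)


print(contains_common_word("Кошки и собаки"))  # True: "и" appears as a word
print(contains_common_word("Привет"))          # False: "и" is only part of a word
```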

What country the sender's SMTP server is in

Thunderbird doesn't provide any information on what country the sender's SMTP server is in. However, some SpamAssassin implementations are configured to identify what country the sender's domain is in. For example:

X-Spam-source: IP='202.108.255.197', Host='smtpr2.tom.com', Country='CN', FromHeader='com', MailFrom='com'

You could add a custom header for X-Spam-source and test whether "X-Spam-source" "doesn't contain" "US". Or create a message filter set to "Matches any of the following" that tests whether "X-Spam-source" "contains" "CN" or "X-Spam-source" "contains" "RU". See this web page for a list of internet domain abbreviations.

There is a Country lookup extension, but it appears to be designed just to show you a colored flag icon, and doesn't set a custom header that you can test. The same is true of the Display Mail Route extension.
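
To make the structure of that header clearer, here is a minimal Python sketch that splits the example X-Spam-source value into its named fields. It assumes the Name='value' layout shown in the example above; other SpamAssassin configurations may format the header differently. The names FIELD_PATTERN and parse_spam_source are made up for this illustration.

```python
import re

# Matches pairs such as Country='CN' in the example header shown above.
FIELD_PATTERN = re.compile(r"(\w+)='([^']*)'")


def parse_spam_source(header_value):
    """Return the Name='value' pairs of the header as a dictionary."""
    return dict(FIELD_PATTERN.findall(header_value))


value = (
    "IP='202.108.255.197', Host='smtpr2.tom.com', Country='CN', "
    "FromHeader='com', MailFrom='com'"
)
fields = parse_spam_source(value)
print(fields["Country"])  # CN
```

Since the country code sits in its own field, a filter that tests for the longer string Country='CN' is narrower than one that tests for CN alone, which could also match other parts of the line such as the host name.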

Sample filters

This is a copy of a "msgFilterRules.dat" file that contains a sample message filter for removing Russian spam using vowels, and another for removing Russian/Chinese spam based on character sets. mailbox://nobody@Local%20Folders/Junk looks bizarre, but that's just how Thunderbird encodes the Junk folder within Local Folders. If you don't have any message filters defined for an account, you could copy and paste it into a "msgFilterRules.dat" file in your account directory (the directory named after your account's mail server) in your profile.

Windows-1251 wasn't tested since it wasn't clear how many English messages might have it. If you get a lot of Russian spam that uses it you might want to add it. There are a number of variants of Big5 but since all of them and HKSCS seem to use Big5 as part of their name it seemed safe to just test for Big5.


version="8"
logging="no"
name="Remove russian spam using vowels"
enabled="yes"
type="1"
action="JunkScore"
actionValue="100"
action="Move to folder"
actionValue="mailbox://nobody@Local%20Folders/Junk"
action="Mark read"
condition="OR (body,contains,э) OR (body,contains,ы) OR (body,contains,я) OR (body,contains,ё) OR (body,contains,ю) OR (body,contains,и)"
name="Remove russian/chinese spam using Content-Type header"
enabled="yes"
type="1"
action="JunkScore"
actionValue="100"
action="Move to folder"
actionValue="mailbox://nobody@Local%20Folders/Junk"
action="Mark read"
condition="OR (\"Content-Type\",contains,KOI8-R) OR (\"Content-Type\",contains,KOI8-U) OR (\"Content-Type\",contains,ISO-8859-5) OR (\"Content-Type\",contains,Big5) OR (\"Content-Type\",contains,GB2312) OR (\"Content-Type\",contains,GB18030) OR (\"Content-Type\",contains,GBK)"
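
If you want to test for a longer list of character sets, typing the condition line by hand gets error prone. The following is a minimal Python sketch that builds a condition line in the same layout as the sample above; it only mirrors what the sample file shows and is not based on a specification of the file format. The name build_condition is made up for this illustration.

```python
def build_condition(header, values):
    """Build a condition line that ORs a "contains" test for each value."""
    # The custom header name is wrapped in escaped quotes, as in the sample file.
    terms = [f'OR (\\"{header}\\",contains,{value})' for value in values]
    return 'condition="' + " ".join(terms) + '"'


print(build_condition("Content-Type", ["KOI8-R", "KOI8-U", "Big5"]))
```

Back up the existing "msgFilterRules.dat" file before pasting a generated line into it.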
