Our Anti-spam is Ready! - Accuspam

TonyT · Post by **TonyT** » Wed Jul 21, 2004 6:31 am

After months of more research, fine tuning the programming and testing, our promised anti-spam solution is now up and running.
Check it out:
http://www.accuspam.com
"AccuSpam™ is the only anti-spam in the world which blocks 100% of spam and never fails to show you the non-spam."

Roody · Post by **Roody** » Wed Jul 21, 2004 6:55 am

Nice work Tony

fastchevy · Post by **fastchevy** » Wed Jul 21, 2004 9:04 am

How's it working?
We're just implementing an internet appliance for spam from CyberTrust as well.
(Iron Mountain) We are in the beginning stages of fine tuning.

Norm · Post by **Norm** » Wed Jul 21, 2004 9:12 am

I will be sending everyone who asks how to keep spam out of their mail to the site.

Good stuff Tony

Prey521 · Post by **Prey521** » Wed Jul 21, 2004 9:47 am

TT, can you better explain to me the secret email account option? Thanks!

Paft · Post by **Paft** » Wed Jul 21, 2004 1:00 pm

Sounds like a very complex bayesian filtering system, Tony.

Nice work.

TonyT · Post by **TonyT** » Wed Jul 21, 2004 4:32 pm

Paft wrote:Sounds like a very complex bayesian filtering system, Tony.
Nice work.

Nope! Bayesian does not work well for anti-spam.

Prey:
re Secret Email Address used in paid version:
see http://accuspam.com/faq.php#as_configure
and click on "Click here to create the SECRET email account correctly."

Cypher · Post by **Cypher** » Wed Jul 21, 2004 4:37 pm

Looks great Tony. I'll be signing up for this one and will put the word out like Norm said.

YARDofSTUF · Post by **YARDofSTUF** » Wed Jul 21, 2004 4:37 pm

Site menu still looks wrong in Firefox, like when you first started posting this stuff, but it's manageable and yes, good work Tony!!!

Paft · Post by **Paft** » Wed Jul 21, 2004 4:59 pm

TonyT wrote:Nope! Bayesian does not work well for anti-spam.

With a sample size as large as yours, it would work superbly. And I use Bayesian filtering for my mail: 99.4% accuracy, and my sample size is probably thousands of messages smaller than yours!

I'm not sure why you'd say it doesn't work well for anti spam.

AceFireball · Post by **AceFireball** » Wed Jul 21, 2004 5:42 pm

YARDofSTUF wrote:Site menu still looks wrong in firefox

Same for me, Great job tho

I am signed up and will be testing it

chimdogger · Post by **chimdogger** » Wed Jul 21, 2004 6:52 pm

I signed up. I will let you know how it goes. Thanks for the linkage / invite.

TonyT · Post by **TonyT** » Wed Jul 21, 2004 7:23 pm

Paft

I'm not sure why you'd say it doesn't work well for anti spam.

The word "well" is a relative term. 99.4% may be "well" for a home user, but for a business it means that 0.6% of spam can get through (which is not too bad), and it also means that legitimate messages can get lost. In a large business, a single lost or deleted message can cost thousands of dollars. An email account or network with a HUGE volume of traffic means that many hours get wasted every week by employees who must take the time to cull their inboxes, weeding out that 0.6% of spam.

A brief description of Bayesian (this was once on a page at the AccuSpam site but has been removed)

Bayesian (Statistical) Phrase/Word Content Filtering
(used by SpamAssassin, SpamCop, Sophos, etc)
1. Easily subvertable by future spam which uses random phrases/words or Markov chains
2. Cannot detect email viruses
3. Misses spam with no or very few phrases/words
4. Misses spam with clever methods of hiding phrases/words
5. Misses some spam that have most phrases/words in common with your legitimate email
6. Occasionally blocks legitimate email which contains some phrases/words in common with spam. This research paper (requires Adobe Acrobat to view) shows in section 6.3 on page 9 that Bayesian is inferior to using no filter at all ("baseline") for those who place a high cost on erroneous blocking of non-spam email.
7. Requires initial training on large sample of your legitimate and spam emails
8. Requires periodic re-training as spam content mutates

Paft · Post by **Paft** » Wed Jul 21, 2004 7:33 pm

TonyT wrote:Paft

A brief description of Bayesian (this was once on a page at the AccuSpam site but has been removed)

Bayesian (Statistical) Phrase/Word Content Filtering
(used by SpamAssassin, SpamCop, Sophos, etc)
1. Easily subvertable by future spam which uses random phrases/words or Markov chains
2. Cannot detect email viruses
3. Misses spam with no or very few phrases/words
4. Misses spam with clever methods of hiding phrases/words
5. Misses some spam that have most phrases/words in common with your legitimate email
6. Occasionally blocks legitimate email which contains some phrases/words in common with spam. This research paper (requires Adobe Acrobat to view) shows in section 6.3 on page 9 that Bayesian is inferior to using no filter at all ("baseline") for those who place a high cost on erroneous blocking of non-spam email.
7. Requires initial training on large sample of your legitimate and spam emails
8. Requires periodic re-training as spam content mutates

1.) True
2.) True - No spam filtering method can, however. Only in conjunction with an antivirus can this be done.
3.) Depends on configuration. Can be configured to throw out mail with low word counts.
4.) Most Bayesian filters can be configured to scan sections of emails so that they can detect the broken URLs, and read HTML/CSS (the font/image tricks) as the messages come in. Granted there are some tricks that can't be detected, but I'd love to see your algorithm to see how you get by it.
5.) How does yours avoid that?
6.) See 5
7.) Yes it does. As does anything requiring statistical analysis.
8.) Which is constant with you flagging messages as junk or not junk in the case of errors.

I'm not doubting your software, I'm just wondering what the algorithm is you're using to bypass the Bayesian failures.

TonyT · Post by **TonyT** » Wed Jul 21, 2004 7:52 pm

Paft wrote:1.) True
2.) True - No spam filtering method can, however. Only in conjunction with an antivirus can this be done.
3.) Depends on configuration. Can be configured to throw out mail with low word counts.
4.) Most Bayesian filters can be configured to scan sections of emails so that they can detect the broken URLs, and read HTML/CSS (the font/image tricks) as the messages come in. Granted there are some tricks that can't be detected, but I'd love to see your algorithm to see how you get by it.
5.) How does yours avoid that?
6.) See 5
7.) Yes it does. As does anything requiring statistical analysis.
8.) Which is constant with you flagging messages as junk or not junk in the case of errors.
I'm not doubting your software, I'm just wondering what the algorithm is you're using to bypass the Bayesian failures.

I understand your curiosity!

response to above:
2. We use NO antivirus in conjunction, none is needed. Email viruses are MIME TYPES and these can be detected immediately.
3. Yes, but it would also throw out legit mail too.
8. The short answer is that AccuSpam is "self-learning" and "self-training", and user response is only asked for to speed up this training cycle, as SOME bulk mail is legit, as in the user's subscriptions.

I cannot divulge what I know of the algorithm, but I can say that it is built upon this definition of what spam is:
Spam = UBE = Unauthorized Bulk Email
and that some mathematical calculations have to do with overall volume of mail being sent based upon certain criteria.

I am certain that the developer will respond in this thread later to answer any technical questions you may have. (he is presently in a time zone 12 hrs offset from me)

Paft · Post by **Paft** » Wed Jul 21, 2004 8:02 pm

TonyT wrote:I understand your curiosity!

response to above:
2. We use NO antivirus in conjunction, none is needed. Email viruses are MIME TYPES and these can be detected immediately.
3. Yes, but it would also throw out legit mail too.
8. The short answer is that AccuSpam is "self-learning" and "self-training", and user response is only asked for to speed up this training cycle, as SOME bulk mail is legit, as in the user's subscriptions.

I cannot divulge what I know of the algorithm, but I can say that it is built upon this definition of what spam is:
Spam = UBE = Unauthorized Bulk Email
and that some mathematical calculations have to do with overall volume of mail being sent based upon certain criteria.

I am certain that the developer will respond in this thread later to answer any technical questions you may have. (he is presently in a time zone 12 hrs offset from me)

Ah, nice catch. Yes, you could very well check for MIME types in the application/xxxxxx family.
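A rough sketch of that kind of check, using Python's standard `email` package. This is only an illustration of scanning for `application/*` parts; the function name and the sample messages are made up here, nothing in it is AccuSpam's code, and a real filter would need to be more selective, since harmless attachments such as PDFs are `application/*` too.

```python
from email import message_from_string


def has_application_part(raw_message: str) -> bool:
    """Return True if any MIME part of the message has main type 'application'."""
    msg = message_from_string(raw_message)
    # walk() visits the message itself and every nested MIME part.
    return any(part.get_content_maintype() == "application" for part in msg.walk())


# Made-up sample: a multipart message carrying an executable-looking attachment.
WITH_ATTACHMENT = (
    "From: a@example.com\n"
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See attachment.\n"
    "--XYZ\n"
    'Content-Type: application/octet-stream; name="run.exe"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "TVqQAAMAAAAEAAAA\n"
    "--XYZ--\n"
)

# Made-up sample: a plain text message with no attachments.
PLAIN_TEXT = "From: a@example.com\nSubject: hi\n\nJust text.\n"

if __name__ == "__main__":
    print(has_application_part(WITH_ATTACHMENT))
    print(has_application_part(PLAIN_TEXT))
```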

*grins* Self-learning and self-training. So you're definitely on a ruleset of some kind. Token/response system?

I'd definitely love to talk to the developer.

Hell, I'd sign any NDA or anything else you'd want me to sign if that's what would be needed to even get a description of this algorithm.

*interest piqued*

TonyT · Post by **TonyT** » Wed Jul 21, 2004 8:09 pm

The developer ( Shelby Moore) just emailed me this:
tt

1. It is working very well so far. We need another 24 hours or so to see if any bugs pop up. A notable stat thus far is that 93% of incoming email is spam (for our users), and AccuSpam is of course blocking 100% of it. Note that the statistical aspect of AccuSpam is not going to kick in until we have many more users signed up. However, AccuSpam works more accurately than other anti-spam even without the statistical effect.

2. It is definitely NOT bayesian, although we could employ bayesian in future as a way of grouping those more likely to be spam in the daily summary emails. However, we would never use bayesian as the determinant of which email to delete, because bayesian has one of the worst false positive rates of any anti-spam.

This research paper, http://eric.univ-lyon2.fr/~pkdd2000/Download/WS4_01.pdf (requires Adobe Acrobat to view), shows in section 6.3 on page 9 that Bayesian is inferior to using no filter at all (baseline) for those who place a high cost on erroneous blocking of non-spam email!

One of the biggest mistakes to make when comparing anti-spam systems is to focus only on the false positive accuracy (i.e. the spam detection rate) and not also focus on the false negative accuracy (i.e. the non-spam misclassification). If you have to spend all your time browsing the spam folder to find misplaced non-spam, then it is the same WORK as using no filter at all. Nothing has been gained with Bayesian over no filter at all, unless you do not care about losing an occasional non-spam email.

Spammers can use "reverse Bayesian" techniques to tweak their spam to pass through your Bayesian filter. For example, they can use a genetic algorithm to generate many different spams and then place an <img> in the email which pings their server, so the algorithm can get statistical data on which spams are passing through the filter. From that, they can completely rebuild your bayesian statistics. There are other techniques as well. Simply morphing content a lot and sending a lot of different content can subvert Bayesian.

The worst possibility which I warned Paul Graham about when he first proposed Bayesian for anti-spam to the world in 2002 (and which he mostly ignored):

Spam that learns to not be statistically identified:
http://ixazon.dynip.com/pipermail/nilsi ... 00041.html

Is that as spammers morph spam to look more and more like your non-spam, the false positive inaccuracy (the misclassification of non-spam) increases drastically. And worse, it then becomes much more time-consuming to browse the spam folder than before Bayesian, because the spam now looks very similar to the non-spam.

Note in that prophetic post, I predicted that spammers would end up using "reverse bayesian" techniques.

3. Unlike most other anti-spam, AccuSpam never looks at the content of the email, only the headers. This is a good indicator that AccuSpam is going to mistakenly delete an important email just because it has the word "penis" or "drug" in it. Using Bayesian to group the probable spam in the daily summaries would make AccuSpam less efficient and less private (although no human is involved). There are easier and better ways for us to do this grouping, and we will be implementing this in the coming weeks. But this is grouping only, not the algorithm we use to detect and delete spam. One way to do grouping better than Bayesian is to leverage the best real-time blacklists. Again, we would not use these to delete email (as blacklists are known to cause false positives), only to try to group the spam from the non-spam in the daily summaries.

Here is what the daily summary says, which should give you some insight what we are writing about here and also explain how the statistics works. Realize if you read all of this below, that you won't be seeing these daily summaries often, once there are many users:

Daily Summary Of Possible Spam

READ CAREFULLY PLEASE!

Please click Reply and send this entire email back,
and type an "X" in the [] boxes for only the
emails below, which you wish to be delivered to
your Inbox.

When you reply, emails without an "X" or "D",
are PERMANENTLY DELETED and can not be recovered.

Occasionally NON-SPAM EMAILS WILL APPEAR BELOW
from senders who never emailed you before, so
make sure you scan all email subjects below
before replying.

Since you joined AccuSpam:
?? spams deleted (most automatically)
?? emails delivered (most automatically)
??% of your email has been spam
100% of this spam has been blocked from your Inbox

When you reply with an "X" in the [] box, then you
will never see this message again for that sender, the
sender is added to your Approved Senders list, and all
future emails from that sender will be automatically
delivered to your Inbox.

To deliver an email below, but NOT add the sender to your
Approved Senders list and NOT auto-deliver all future
emails from the sender, then type a "D" instead
of an "X" in the [] box.

(Then a list of potential spams which were not detected via deliverable and reputation statistics)

When you reply, you are helping AccuSpam statistically
detect spam. Your reply is deleting your spam and the
spam of the other AccuSpam users. Also the replies of
other AccuSpam users is deleting your spam before this
message is sent to you, thus reducing the number of spam
subjects you must review in this message. As the number
of AccuSpam users grow, the frequency of these messages
and the number of spam subjects in them will reduce
eventually to almost never. Thus you are required to reply.
Statistically even erroneous or malicious replies of other
users can never delete your non-spam.

To illustrate the rationale for replying, assume the number
of new deliverable, unforged spam senders per day to be
10,000. Thus with 10,000 AccuSpam users, each user will
only have to review 10 or less spam subjects per day. That
takes into account an approximate factor of 10 for
statistical safety. Then with 1 million AccuSpam users (i.e.
only 1/10th of 1% of all email users), each AccuSpam user
would only have to review 1 spam subject every 10 days.
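The arithmetic in that last paragraph of the summary can be checked in a few lines. This is only a sketch of one reading of the numbers: it assumes each new spam sender has to be reviewed by about 10 users in total (the "factor of 10 for statistical safety") and that the reviews are spread evenly over all users. The function and parameter names are made up for illustration.

```python
def subjects_per_user_per_day(new_senders_per_day: int,
                              users: int,
                              safety_factor: int = 10) -> float:
    """Average number of spam subjects each user reviews per day.

    Assumption: every new sender needs `safety_factor` independent reviews,
    shared evenly among all users.
    """
    return new_senders_per_day * safety_factor / users


if __name__ == "__main__":
    small = subjects_per_user_per_day(10_000, 10_000)
    large = subjects_per_user_per_day(10_000, 1_000_000)
    print(f"10,000 users:    {small:g} subjects per user per day")
    print(f"1,000,000 users: {large:g} per day, i.e. one every {1 / large:g} days")
```

With the figures from the summary this gives 10 subjects per day at 10,000 users and one subject every 10 days at 1 million users, matching the text.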

accuspam · Post by **accuspam** » Wed Jul 21, 2004 9:25 pm

First a few corrections:

> I wrote:
> This is a good indicator that AccuSpam is going to
> mistakenly delete an important email just because
> it has the word "penis" or "drug" in it.

Typo. I meant "...NOT going to mistakenly...".

> Tony wrote:
> 2. ...Email viruses are MIME TYPES and these can be
> detected immediately.

Note AccuSpam does not need to look at Mime types,
because it is blocking 100% of spam (email viruses are
spam). The only ways you could receive a spam using
AccuSpam are explained here:

http://www.accuspam.com/faq.php#as_spam

> Tony wrote:
> 8. The short answer is that Accuspam is "self-learning"
> and "self-training"...

That could be a misleading statement. AccuSpam is not
training itself, but it may seem like it is because the
statistics employed have a way of reducing the effort
for users to near nothing, so it seems like it is automatic.
But in reality, the statistics are done by the users, it just
doesn't take much effort per user, because spam is sent
in such large quantities (this all assuming we will have
many AccuSpam users to spread the work on to).

And note that the paid version can leverage the free users
so a paid user never has to answer the daily summary:

http://www.accuspam.com/faq.php#as_paid

I will write more about the statistics in next post.

Kind Regards,
Shelby Moore III

CEO 3Dize, Inc. (coolpage.com)
CEO DownloadFAST.com, Inc.
founder and main programmer of AccuSpam.com (AntiViotic.com)
main programmer of Cool Page* (1998-), Art-O-Matic* (1996-8), WordUp* (1986-90), TurboJet (1988)
contributing programmer to DownloadFAST.com* (2001-2), Corel Painter* (1993-5), Corel ArtDabbler, EOS PhotoModeler (1996), FONTZ! (1988)

shelby@coolpage.com

* denotes major involvement in massive multi-year R&D projects with millions of characters (1000s of pages) of code

accuspam · Post by **accuspam** » Wed Jul 21, 2004 9:42 pm

> 2.) True - No spam filtering method can, however.
> Only in conjunction with an antivirus can this be done.

Any anti-spam which blocks 100% can also block 100% email viruses.

Also remember 100% blocking is not enough to compare anti-spam systems. I laugh when I see the ad for the C/R system (mailmoper.com I believe) which is offering to pay $1 for every spam you receive, but they would never dare pay you $1 for each non-spam you will not receive!

So when comparing anti-spam, don't forget to compare the false positive rate also. AccuSpam is 0% false positive! It is the only one in the industry!

> 3.) Depends on configuration. Can be configured
> to throw out mail with low word counts.

Some non-spam would be thrown out too.

That nasty little false positive rate issue is the Achilles' heel of all other anti-spam (except BrightMail.com, which we feel is our best competitor, but they don't do 100% blocking).

> 4.) Most Bayesian filters can be configured to scan
> sections of emails so that they can detect the broken
> URLs, and read HTML/CSS (the font/image tricks) as
> the messages come in. Granted there are some tricks
> that can't be detected, but I'd love to see your
> algorithm to see how you get by it.

One of the tricks that cannot be detected by Bayesian is to put all the letters of a word in a table, so that they are just letters to Bayesian but visually they lay out to be a word.

And these tricks are always increasing in variety. Content filters are just "hacks" or "heuristics" in my opinion. BrightMail uses spam problems in conjunction with humans to constantly adjust these heuristics, so they are in effect using the economy-of-scale of the fact that spam is sent in large quantities. But bayesian in itself does not gain much from the fact that spam is sent in large quantities. More on that below...

See below for explanation of why AccuSpam is not subverted by these tricks.

> 5.) How does yours avoid that?

Simple. AccuSpam never looks at content. So it can not be tricked by content

> 6.) See 5

Ditto.

>7.) Yes it does. As does anything requiring statistical analysis.
>8.) Which is constant with you flagging messages as junk or not junk in the case of errors.

Correct, except the difference is that with AccuSpam, only a few users have to flag a message as spam and then all other users do not have to. Where "few" is defined by the statistical "significance" or "accuracy" we desire (sigma). As the number of AccuSpam users increases, then when the user is asked for this input, they will only be looking at a very few spams to classify and very infrequently. Unlike a Bayesian system where you are constantly having to look at ALL the spam in your junk folder for non-spam.

When we reach 1 million users for AccuSpam (e.g. 1/300 of # of BrightMail users), then we expect that each user will only be asked about a few spams maybe once a month.

> I'm not doubting your software, I'm just
> wondering what the algorithm is you're
> using to bypass the Bayesian failures.

Simple. We do not look at content. We look at senders and senders' domains, and we run real-time statistics from that. We also detect undeliverables and forgeries using the auto-response, which deletes a large % of the spam without asking users (maybe that is what Tony meant by self-training, but these deletes are not input into the statistics portion). The auto-response is necessary to force the spammers to reveal their true addresses. See the "How It Works" section:

http://www.accuspam.com/accuspam.php#how

Bayesian makes a statistical assumption that the words in an email occur independently of one another, given whether the email is spam or non-spam. That is why they call it "naive" Bayesian. Without that assumption, the Bayesian stats cannot be computed (realistically). This is fundamentally why Bayesian for anti-spam is error-prone: that mathematical assumption is not entirely true. See Paul Graham's web site for some discussion of this math, or better, Google "naive Bayesian spam".
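For readers who want to see where the "naive" step enters, here is a minimal sketch of the combining formula given in Paul Graham's "A Plan for Spam". It multiplies the per-word spam probabilities together as though each word were independent evidence. The function name and the per-word probabilities below are made-up illustrations, not data from any real filter.

```python
from math import prod


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities the 'naive' way.

    Each value is an estimate of P(spam | word) and is assumed to lie strictly
    between 0 and 1. Multiplying them treats the words as independent
    evidence, which is the simplifying assumption under discussion.
    """
    spam_score = prod(word_probs)
    ham_score = prod(1.0 - p for p in word_probs)
    return spam_score / (spam_score + ham_score)


if __name__ == "__main__":
    # Two strongly "spammy" words reinforce each other.
    print(combined_spam_probability([0.9, 0.9]))
    # One spammy word and one equally "hammy" word cancel out.
    print(combined_spam_probability([0.9, 0.1]))
```

Because the scores are plain products, a spammer who pads a message with words the filter associates with non-spam pulls the combined value down, which is the weakness described in this thread.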

Whereas AccuSpam is not measuring the statistics of content. It is measuring the statistics of the opinions of which senders and domains are spam as compared to all other senders and domains. Thus we can choose any statistical accuracy we want. If we set a sigma of say 10, then we get 99.9999+% accuracy (did not take the time to calculate that exactly...just for illustration of point).
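To put rough numbers on "sigma", here is a small sketch that converts a sigma threshold into an error probability. It assumes the statistic being tested is normally distributed and that only one tail counts as an error; both are assumptions made for illustration, since the thread does not say which distribution AccuSpam uses.

```python
from math import erfc, sqrt


def one_sided_tail(sigma: float) -> float:
    """Probability that a standard normal value exceeds `sigma`."""
    return 0.5 * erfc(sigma / sqrt(2.0))


if __name__ == "__main__":
    for s in (3.0, 4.75, 10.0):
        print(f"sigma = {s:5.2f} -> error probability about {one_sided_tail(s):.3g}")
```

Under those assumptions, an error rate of about 1 in 1 million corresponds to roughly 4.75 sigma, and a sigma of 10 is far stricter than 99.9999%.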

There is no equivalent way to dial in Bayesian. One can trade false negative rate for false positive rate in any statistical method for anti-spam, but with Bayesian you can never get anywhere near a 0% false positive rate, because the underlying math assumption is not that accurate.

So the next line of thought is to compare AccuSpam to other (than bayesian) statistical anti-spam, especially those that use data from many users.

http://www.cloudmark.com/products/spamnet/features/

I think CloudMark uses Vipul's Razor as its statistical network:

http://razor.sourceforge.net/

This is confirmed in the FAQ:

http://razor.sourceforge.net/docs/doc.p ... t&name=FAQ

Some clear differences from AccuSpam:

1. Email is delivered before the recipient can be asked if a message is spam (the recipient is not asked whether to deliver email from a non-approved sender). Thus the system is not 100% effective.

2. Blacklisting is by content signatures, not sender address. Thus false positives result.

3. The statistics cannot be as accurate, because AccuSpam is sampling a huge quantity of domains as a baseline. Razor does not measure by sender (domain or address), so its statistics have to be baselined according to a recipient trust metric. A key fact of statistics is that sampling error decreases as sample size increases.

Note that the DCC anti-spam system does not use statistical methods (standard deviation, confidence intervals, etc) and it suffers from many problems such as #2 and either #1 or whitelist management problems of Challenge Response systems:

http://www.rhyolite.com/anti-spam/dcc/

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Wed Jul 21, 2004 9:49 pm

>> BrightMail uses spam problems in conjunction...

Typo. Meant "BrightMail uses millions of spam probes in conjunction..."

Paft · Post by **Paft** » Wed Jul 21, 2004 9:51 pm

So, basically, you're stating that because you do not look at email content, you are immune from spammers.

So how do you block spoofed From addresses (to a valid domain), people who run off from free email sites (hotmail, yahoo, etc), people who buy domains just to run an SMTP server from; and still manage to detect the perhaps .001% of users who actually WANT to get mail from, say, xxxhotb4b3s.com?

Statistically, if you're basing your assumptions on NHST (Null Hypothesis Significance Testing), you HAVE TO HAVE some sort of error level: Type 1 or Type 2 errors (false positives [calling non-spam spam] or false negatives [calling spam non-spam]). If your alpha level is, say, .00001%, and your null hypothesis is that "this mail is spam", then you have an INCREDIBLY SMALL chance to correct a mistake if your software calls a mail spam (say the return email hits their server during a major network outage).

How do you avoid situations like that?

accuspam · Post by **accuspam** » Wed Jul 21, 2004 9:57 pm

A common misconception is something like "but spammers will just change their address or domain".

Again I urge the reader to read more carefully the "How It Works" section and focus on the words "undeliverable" and "forged". By detecting "undeliverable" and "forged", it becomes prohibitively expensive for the spammer to change addresses and domains often enough to escape real-time statistical detection:

http://www.accuspam.com/accuspam.php#how

As a side benefit, corporations will be able to license AccuSpam to ensure that no spammer can forge their address when emailing AccuSpam users. That option will be coming to our web site soon. There is nothing the users have to change. It is already built in when they sign up.

Any more questions? If yes, then send me an email to:

shelby@coolpage.com

to alert me that you made a post here for me to answer. That is how confident I am in AccuSpam. I am willing to post my personal email address in a public forum! Spammers PLEASE SEND ME SPAM! :-) ;-)

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Wed Jul 21, 2004 10:21 pm

> So how do you block spoofed From addresses (to a valid domain...

Spoofed addresses are either undeliverable or forged. We detect both.

> ...you have an INCREDIBLY SMALL chance to correct a mistake if
> your software calls a mail spam...

First, the detection of undeliverables is robust, because SMTP is robust. I am sure you have seen those emails: "Warning: message not delivered in 4 hours...will keep trying".

Second, the undeliverables do not feed into the statistics.

Third, whatever confidence level we select for our hypothesis is the error rate we will see. So yes, 0% would be more correctly stated as 0.0001%, or 1 in 1 million. Yes, AccuSpam could lose 1 in 1 million non-spam emails, which is one reason why we offer an "undo" in the paid version:

http://www.accuspam.com/faq.php#as_paid

But who really cares about 1 in 1 million? Compare that to Bayesian, Challenge Response, real-time blacklists, or heuristic filters, all of which are more like 1 in 1000. That is the difference between losing 1 non-spam in 1000 days (for AccuSpam) and every day (for the other anti-spam).

As I said, I only know of BrightMail which can compare to our false positive rate.

As for the false negative error rate in our hypothesis, all undetected spam is presented in the "daily" (frequency will be more like monthly as number of users grows) summary, so never delivered to Inbox without permission of user.

The bottom line is that for the user the perception will be:

1. AccuSpam detects and deletes most (soon to be 99+%) of my spam automatically, and NEVER (100% blocking) is spam delivered to my Inbox.

2. Occasionally (perhaps monthly) I get an email from AccuSpam asking whether a few spams are spam or not. I send it back with my answers. Minimal and infrequent effort.

3. "Never" (1 in 1 million) do I lose non-spam or have to go browse a spam folder.

So in essence you get 100% protection, lose 0% non-spam (1 in 1 million), and do not have the daily effort or hassle of browsing spam (as the number of AccuSpam users increases).

-Shelby Moore
http://AccuSpam.com

Paft · Post by **Paft** » Wed Jul 21, 2004 10:25 pm

*nods* Gotcha. Nice system.

accuspam · Post by **accuspam** » Wed Jul 21, 2004 10:31 pm

Correct another typo I made in previous post

"One of the biggest mistakes to make when comparing anti-spam systems is to focus only on the false **NEGATIVE** accuracy (i.e. the spam detection rate) and not also focus on the false **POSITIVE** accuracy (i.e. the non-spam misclassification)."

accuspam · Post by **accuspam** » Thu Jul 22, 2004 3:33 am

As you know, we just released AccuSpam less than 24 hours ago and it usually takes at least 24 hours to discover any post-release bugs. That is normal and expected.

I just fixed two bugs, neither of which I expect to have lost any email. The worst case was some small corruptions in emails. These are fixed now, but emails which arrived before the fix cannot be repaired. So when you process your next daily summary you may see a few of these corruptions, but they are usually not worrisome (e.g. a "=" at the end of each line). Also, some attachments could have been lost, but not entire emails.

By tomorrow this fix will have fully propagated and you should not see any more problems.

For the more inquisitive minds, the specific bugs were:

1. We were failing to propagate some of the critical MIME encoding headers. Changing to a case-insensitive search fixed this.

2. We were failing to propagate the "Date:" header, so all the dates were getting changed on delivered emails. This should be fixed as well.

3. An obscure case in the bounce detection logic was fixed/improved. The case where a sender replies to the AccuSpam confirmation, but changes the subject significantly (e.g. more than "Re: " or "Re[4]: ") but also returns the body (even if changed) thus it does not look like an auto-response. This obscure case almost never happens, but just to be safe we added a check whether the sender has changed, since most bounces use "postmaster@", "mailer-daemon@", etc.. I actually found this because I lost some important email when our host support department replied (not auto-response but human reply) but mangled the subject with a ticket #.
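For the curious, the three fixes above can be pictured roughly as follows. This is an illustrative sketch only, not the AccuSpam source: the helper names, the header list, and the tuple-based header representation are all assumptions made for the example.

```python
# Hypothetical set of headers to carry over unchanged when re-delivering.
CRITICAL_HEADERS = ("mime-version", "content-type",
                    "content-transfer-encoding", "date")


def copy_critical_headers(original_headers):
    """Keep MIME and Date headers from the original message.

    `original_headers` is a list of (name, value) pairs as received.
    Header names are case-insensitive, so "CONTENT-TYPE" and "Content-Type"
    must both match; comparing lowercased names is the fix for bug 1, and
    including "date" in the list covers bug 2.
    """
    return [(name, value) for name, value in original_headers
            if name.lower() in CRITICAL_HEADERS]


def sender_changed(confirmation_recipient: str, reply_sender: str) -> bool:
    """Extra bounce check from bug 3 (assumed form).

    A real bounce usually comes from a different address such as
    postmaster@ or mailer-daemon@, while a human reply comes from the
    address the confirmation was sent to, even if the subject was mangled.
    """
    return confirmation_recipient.strip().lower() != reply_sender.strip().lower()


if __name__ == "__main__":
    headers = [("CONTENT-TYPE", "text/plain"),
               ("Date", "Thu, 22 Jul 2004 03:00:00 +0000"),
               ("X-Mailer", "example")]
    print(copy_critical_headers(headers))
    print(sender_changed("bob@example.com", "mailer-daemon@example.com"))
```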

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Thu Jul 22, 2004 10:54 pm

Major improvement coming!

Note these algorithms are AccuSpam's inventions and mentioning them here in no way gives rights to others to use these algorithms without a license from AccuSpam.

From the initial usership, we see that much of the spam in the Daily Summaries is coming from the same domains (as expected) but different senders (part before the @ changes) which is also as expected.

What we just realized is that we don't need to have the user manually blacklist those domains (as we had planned to), and we do not have to wait for many, many users of AccuSpam to do the global domain blacklisting (that was what we planned).

We can simply apply our domain blocking statistics per user. So as a user builds up data about domains, that user's domains which are always sending spam will get statistically blocked (with a confidence that ensures 1 in 1 million false positives).
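One way to picture per-user domain blocking is as a simple hypothesis test. The sketch below is NOT the AccuSpam algorithm, which is not disclosed here; it is a made-up illustration with a made-up model. It assumes that a legitimate domain would have each of its messages marked as spam with probability at most 0.5, and it blocks a domain only once an unbroken run of spam marks would be less likely than 1 in 1 million under that assumption.

```python
from collections import defaultdict

ALPHA = 1e-6           # target false positive probability (1 in 1 million)
LEGIT_SPAM_RATE = 0.5  # hypothetical: chance a legit domain's mail is marked spam


class UserDomainStats:
    """Per-user tally of how mail from each sender domain was classified."""

    def __init__(self):
        self.spam = defaultdict(int)
        self.legit = defaultdict(int)

    def record(self, sender: str, is_spam: bool) -> None:
        # Only the domain (the part after the last "@") is tracked.
        domain = sender.rsplit("@", 1)[-1].lower()
        if is_spam:
            self.spam[domain] += 1
        else:
            self.legit[domain] += 1

    def should_block(self, domain: str) -> bool:
        domain = domain.lower()
        # Any legitimate mail from the domain rules out automatic blocking.
        if self.legit.get(domain, 0) > 0:
            return False
        # Chance of seeing this many spam marks in a row from a legit domain.
        return LEGIT_SPAM_RATE ** self.spam.get(domain, 0) < ALPHA


if __name__ == "__main__":
    stats = UserDomainStats()
    for i in range(20):
        stats.record(f"user{i}@bulk.example", is_spam=True)
    print(stats.should_block("bulk.example"))
```

With these made-up parameters, a domain is blocked after 20 consecutive spam marks and never while the user has accepted any mail from it. A different assumed rate or threshold changes that count.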

This should drastically reduce the # of spam subjects in the Daily Summaries.

We hope to implement it next week.

We ALSO will still retain the (as originally planned) global statistical blocking of domains and senders, but this won't be able to kick in until we have many 1000s of users.

-Shelby Moore
http://AccuSpam.com

Brk · Post by **Brk** » Fri Jul 23, 2004 12:12 am

I feel like a dunce. Even after reading all of this, I still don't understand how it all works. As it is, I am preparing a huge export of my address book and typing in every possible e-mail address I might need to let through

accuspam · Post by **accuspam** » Fri Jul 23, 2004 9:43 pm

Why do you feel like you need to know how it works?

Are you experiencing some issues with using it?

It is not absolutely necessary to add your Address Book to the Approved Senders list (although more efficient), as these will get added when you scan your Daily Summaries for legit email and put an "x" in the [] boxes for those. Even if you add your Address Book, you still need to scan the Daily Summaries because you might get legit email from new senders that are not in your Address Book yet.

The main issue we are working on right now, is that AccuSpam users are seeing too many spam subjects in their Daily Summaries. The spam is being blocked, but it is too much to wade through every day to find the legit email. On average, 60% of spam is being automatically deleted and the other 40% is showing in the Daily Summaries.

The reason is that there are not enough AccuSpam users yet for the global statistical methods to kick in. When they do, then we predict AccuSpam users will see spam subjects in their Daily Summaries very rarely.

While we are waiting for the # of AccuSpam users to increase, we need to provide better than 60% performance in terms of the Daily Summaries. Realize even now, 100% of spam is blocked from the Inbox. I am just referring to the % of spam subjects in the Daily Summaries.

Within a few days or less, we will implement some extra filters to deal with the 40%:

1. We will implement the very accurate (1 in 1 million false positive) PER USER domain blocking as per previous post I made in this thread yesterday. This may delete a significant portion of the 40%. Maybe 39% of it (did not yet run the data to see how much would be caught)

2. We noticed spammers are spoofing the AccuSpam confirmation messages and these are showing up as spam in the Daily Summaries. We will detect those automatically (may have that done today).

3. We may add a bayesian filter or other SpamAssassin filter to RANK the subjects in the Daily Summaries, but never delete from the Daily Summary using those inexact type of filters.

Be patient as you use AccuSpam, knowing that we are continually observing the results and finding ways to improve it.

Also please send feedback here or via email to <support@accuspam.com> about any issues you want to bring to our attention so that we can know about things that need to be fixed or improved.

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Fri Jul 23, 2004 9:54 pm

Add to my previous post (on page 2 of this thread), that all the effort AccuSpam users are expending now to scan their Daily Summaries is being recorded and is not wasted effort.

When we flip a switch to turn on the PER USER domain blocking, all that effort will be rewarded by an instant reduction of the # of spam subjects seen in the Daily Summaries.

Keep using AccuSpam and keep scanning your Daily Summaries. You will be rewarded very soon. You are already being rewarded in terms of 60% removal of spam subjects from Daily Summaries and 100% blocking from Inbox.

It will only get better very soon.

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Fri Jul 23, 2004 10:43 pm

For the potential users of anti-spam, perception is apparently all that matters in marketing.

Let me tell a short story about something unrelated to anti-spam and unrelated to AccuSpam in order to make my point. It relates to the security of passwords. It is a well known fact among us computer scientists that using a password that is composed of recognizable words is very risky, because all a hacker has to do is run through all combinations of a dictionary of known words, proper names, and word fragments. This is much faster than running through all combinations of all digits. For example, if we have a dictionary of 10,000 words and we try all 1 and 2 word combinations as passwords, then we get (10,000 + 10,000 x 10,000) = 100 million combinations to try. A computer which can try a million combinations a second will only take 100 seconds to hack any possible password made of 1 or 2 words. Compare that to passwords which don't use words, but use random digits, e.g. "j24vshvg5s7g4b2G". Given 26 letters in the lowercase alphabet, 26 in uppercase, and 10 numeric digits, all combinations for a 16 digit password would be: (26+26+10)^26 = 47672401706823533450263330816 combinations. The same computer would take 47672401706823533450263 seconds to try all combinations, which happens to be 1511681941489838 years!!!!!

So which password do you think is more secure, the one that takes 100 seconds or the one that takes 1511681941489838 years to crack?
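For anyone who wants to check the arithmetic, here is a small sketch of the two search spaces. It uses the exponent 16 (one factor per character of a 16-character password, as clarified in the follow-up post in this thread), and it assumes the same guessing rate of one million tries per second and a 365-day year. The variable and function names are just for illustration.

```python
GUESSES_PER_SECOND = 1_000_000          # same rate as assumed in the post
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year


def seconds_to_exhaust(combinations, guesses_per_second=GUESSES_PER_SECOND):
    """Worst-case time to try every combination once."""
    return combinations / guesses_per_second


# Passwords made of 1 or 2 words from a 10,000 word dictionary.
dictionary_words = 10_000
dictionary_space = dictionary_words + dictionary_words ** 2

# 16 random characters drawn from lowercase, uppercase and digits.
alphabet = 26 + 26 + 10
random_space = alphabet ** 16

print("dictionary combinations:", dictionary_space)
print("seconds for dictionary :", seconds_to_exhaust(dictionary_space))
print("random combinations    :", random_space)
print("years for random       :", seconds_to_exhaust(random_space) / SECONDS_PER_YEAR)
```

The dictionary case comes out at roughly 100 seconds and the random case at roughly 1.5 quadrillion years, in line with the figures above.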

However, users still prefer to use passwords that contain recognizable words, because they can remember them more easily. Alas, the user is not currently hacked, so they are under a false sense of security.

Okay so now let me relate this to anti-spam and AccuSpam.

The current state of anti-spam is that many (most?) users are currently using an anti-spam system based on Bayesian and/or heuristic rules (guesses). For example, SpamAssassin is a very popular product installed by many ISPs. Many users are quite satisfied with their current results, just as they are satisfied with their current recognizable word passwords. Just as we were all satisfied with two-digit years in all our programs that wrapped back to 0 at the year 2000. Remember the massive effort it took to fix that before year 2000?

The problem is that all a spammer has to do is change a few things and these Bayesian and heuristic filters can go awry. For a particular user, maybe their pattern of use is such that they haven't noticed a problem yet, but the fundamental problem is lurking and will happen eventually. For example, I know a "security expert" sysadmin who swears SpamAssassin has near 0% false positive rate for him (even though the published stat is 0.5% for SpamAssassin as used in McAfee SpamKiller, which is a horrible 1 in 200 non-spam emails lost), and it could be that his use of email terminology in his non-spam is very different from the spam he is currently receiving. Also he has the knowledge and time to tweak SpamAssassin to his personal use. But once a spammer sends him email with "sysadmin" words, e.g. "server", "downtime", "pager", "linux", etc., then either his non-spam will get flagged as spam or his spam will not get caught.

Thus products like SpamAssassin will constantly require tweaking of the heuristic rules (guesses) to keep up with the changes in the "quirks" of spam that the rules detect. I bet the anti-spam companies such as Norton and McAfee would love for users to get locked into monthly updates of rules. How convenient as a way to charge for upgrades!

Whereas, AccuSpam is based on the principle of deterministic statistics. We want to solve the problem once and for all. We target the unmorphable aspect of spam, i.e. that spam is email sent in large quantities and undesired by the majority of recipients. That is the most exact and agreeable definition of spam I know of. Whereas, those Bayesian and heuristic anti-spam products are defining spam to be "bad content" or "bad headers" or "bad relay server" or a zillion other things which have some correlation to but are not what spam is.

So don't be surprised if you use recognizable word passwords that you will get hacked one day and by the same line of logic, if you use a Bayesian or heuristic anti-spam (not AccuSpam), then don't be surprised if you lose important email or get swamped in spams or email viruses one day. We can apply the same logic to anti-virus software which is also heuristic or based on previously seen viruses, not on future unknown viruses.

AccuSpam is deterministic. That is the bottom line.

For now, we have a little bit of a marketing problem because all the user cares about is the performance on the first day they use a product, not the ongoing performance. But we can say right now that AccuSpam prevents 100% of spam from reaching the Inbox. That is sure. And we can say, if you scan your Daily Summaries then you will never lose important email. And we can say that the Daily Summaries will contain less and less spam subjects as we progress...

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Fri Jul 23, 2004 10:58 pm

There I go again. In my haste, I made another typo.

Correction:

16 digit password would be: (26+26+10)^16

Note the "^16" instead of the "^26".

Also let me explain that "^16" is shorthand for "raised to the power of 16".

Thus the long way of writing it is:

(26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) X (26+26+10) = 47672401706823533450263330816

The reason is that you have 16 digits (places to put the 26+26+10 possibilities).

Also remember this password example has NOTHING to do with how AccuSpam works. It was merely an unrelated example to point out that users do not care about security. All they care is that passwords are easy to remember.

-Shelby Moore
http://AccuSpam.com

Bouncer · Post by **Bouncer** » Fri Jul 23, 2004 11:21 pm

accuspam wrote:Why do you feel like you need to know how it works?

-Shelby Moore

Shelby, this is a (somewhat) technical site. People are curious about inner workings or they wouldn't be here. They also feel more confident about a system whether it's QPSK based signalling oriented, or software oriented when they have what they consider to be a reasonable understanding of the material.

Regards,
-Bouncer-

Bouncer · Post by **Bouncer** » Fri Jul 23, 2004 11:35 pm

I'm going to ask a simple question.

Is Accuspam based (wholly or partially) on probability vectors as related to header info?

Regards,
-Bouncer-

accuspam · Post by **accuspam** » Fri Jul 23, 2004 11:59 pm

Good news! I just ran some queries on AccuSpam data, and very soon AccuSpam users will see their Daily Summaries drop to near 0!

It is very simple. The vast majority of unspoofed spam is coming from same domains over and over. Remember AccuSpam deletes the spoofed spam so 60% is already gone. The other 40% has been going into the Daily Summaries but none of it (100% protection) has been going into the Inbox.

Now we will be able to delete the 40% from the Daily Summaries so it will be easier to find any legit email from new senders (not have to scan so many subjects).

So for the new AccuSpam user, they will only have to disapprove (return the Daily Summary) for spam senders a few times, then AccuSpam will statistically recognize that a domain is a spam domain FOR THAT USER ONLY.

The ability for users to help other users delete spam will not kick in statistically until there are closer to 100,000+ AccuSpam users, but it may not even be necessary. Apparently the improvement we will make this week as described above, will be enough to clean up the Daily Summaries very effectively.

For the more technically inquisitive, here is an excerpt of the query I did. With Tony's permission, user #20 is TonyT (you all know him here as a Senior member of SpeedGuides). Tony currently has 98 emails waiting to be summarized in his next Daily Summary, from 64 unique senders. And from the 2nd query, we see that all 98 have been from domains which were disapproved more than 3 times already by Tony (and probably not approved, although I did not do that query yet but it will be part of the statistical calculation).

mysql> SELECT UserId, COUNT( DISTINCT ConfirmId), COUNT( DISTINCT SenderId ) FROM confirm WHERE Status='Confirm'
-> GROUP BY UserId ORDER BY UserId;
+--------+----------------------------+----------------------------+
| UserId | COUNT( DISTINCT ConfirmId) | COUNT( DISTINCT SenderId ) |
+--------+----------------------------+----------------------------+
...
| 20 | 98 | 64 |
...
+--------+----------------------------+----------------------------+
1700 rows in set (0.07 sec)

mysql> SELECT s.Tld, c.UserId, COUNT( DISTINCT c.ConfirmId ), COUNT( DISTINCT s.SenderId ) FROM confirm AS c
-> LEFT JOIN disapproved AS d USING( UserId )
-> LEFT JOIN sender AS s USING( SenderId )
-> WHERE Status='Confirm'
-> GROUP BY s.Tld, c.UserId HAVING COUNT( DISTINCT s.SenderId ) > 3;
+-------------------------+--------+-------------------------------+------------------------------+
| Tld | UserId | COUNT( DISTINCT c.ConfirmId ) | COUNT( DISTINCT s.SenderId ) |
+-------------------------+--------+-------------------------------+------------------------------+
| 3kinmy.com | 20 | 98 | 4 |
| 8it0cop.com | 20 | 98 | 5 |
| aa05.com | 20 | 98 | 4 |
| aagregory.com | 20 | 98 | 7 |
| absoff10.com | 20 | 98 | 5 |
| adknowledge2.com | 20 | 98 | 7 |
| adknowledgemail.com | 20 | 98 | 4 |
| blmngp.com | 20 | 98 | 4 |
| boolever.com | 20 | 98 | 4 |
| centerunionac.com | 20 | 98 | 5 |
| centerunionaj.com | 20 | 98 | 5 |
| choosemymail.com | 20 | 98 | 4 |
| classoffers.com | 20 | 98 | 6 |
| crysholef.com | 20 | 98 | 4 |
| crysholgh.com | 20 | 98 | 4 |
| estrategics.com | 20 | 98 | 4 |
| estratfg.com | 20 | 98 | 5 |
| golunga.com | 20 | 98 | 11 |
| great-dealz.net | 20 | 98 | 4 |
| greatestdealsaround.com | 20 | 98 | 4 |
| hh02.com | 20 | 98 | 4 |
| hotmail.com | 20 | 98 | 4 |
| ofcsvr.com | 20 | 98 | 4 |
| ofmsvr.com | 20 | 98 | 11 |
| pactkl.com | 20 | 98 | 4 |
| pclght11.com | 20 | 98 | 4 |
| pfiftyonemustang.com | 20 | 98 | 4 |
| rashpie.com | 20 | 98 | 12 |
| smilepop.com | 20 | 98 | 4 |
| squibkl.com | 20 | 98 | 4 |
| squibnetworks.com | 20 | 98 | 4 |
| trlcnt12.com | 20 | 98 | 4 |
| uu02.com | 20 | 98 | 4 |
| velvettooth.net | 20 | 98 | 4 |
| vmadmin.com | 20 | 98 | 11 |
| winsomab.com | 20 | 98 | 5 |
| wzone03.com | 20 | 98 | 5 |
+-------------------------+--------+-------------------------------+------------------------------+

accuspam · Post by **accuspam** » Sat Jul 24, 2004 12:12 am

> Is Accuspam based (wholly or partially) on
> probability vectors as related to header info?

Only the From: header, and the probabilities are determined by the statistically significant opinions of AccuSpam users as to which unspoofed senders' addresses and domains are sending ONLY spam, where ONLY is statistically significant to the accuracy we have chosen. And this is calculated in a way that any one user can not influence the statistics (maliciously or erroneously).

For the rare (let's use 1 in 1 million) falsely flagged non-spam, the senders whose email is deleted by AccuSpam get a challenge response, which allows them to escape the blacklist into the Daily Summary and then possibly become an Approved Sender. (This is unlike senders blocked into the Daily Summaries, who get confirmations that do not require any additional action from the sender.)

Thus the statistics are self-correcting, even in the worst possible rare imaginable scenario.

-Shelby Moore
http://AccuSpam.com

TonyT · Post by **TonyT** » Sat Jul 24, 2004 12:14 pm

Burke wrote:I feel like a dunce. Even after reading all of this, I still don't understand how it all works. As it is, I am preparing a huge export of my address book and typing in every possible e-mail address I might need to let through

Burke:
This is how it works: (simplified)
1. You login to accuspam using your email info:
username:
mail server:
password:
email address]. If you wish to receive any of these flagged messages you put the appropriate character in between the brackets and simply Reply to the Daily Summary message. (send back to accuspam) If you do not place a character in between the brackets the message gets deleted and accuspam "learns" based on user input. As the number of users grows, the system gets "smarter" and the statistical spam blocking improves.

When someone not in your Approved Senders list sends you a message, he receives an auto response telling him that his message was received. He does NOT have to reply to this auto response. (no hassle) But this auto response has an option for him to reply, and if he does, it forces faster delivery of his message. (much like using your mail client to mark a message as 'Priority')

The Daily Summary can contain messages that you DO want to receive. In this case, you put an X in between the brackets and send the message back to accuspam. And this user will be auto added to your Approved Senders list. By using this method (instead of adding entire address book right away), accuspam will 'learn' faster about spam and all of your wanted messages will still arrive in your mail client's inbox. Someone whom you desire to receive messages from will only ever have ONE auto response sent to him, unless you fail to approve him via the daily summary. (adding his address to the Approved Senders list also stops the auto response sent to him)

accuspam · Post by **accuspam** » Sat Jul 24, 2004 7:30 pm

TonyT, thanks for the excellent summary of how to use AccuSpam in previous post.

For those who want more details on why "naive" Bayesian is mathematically flawed, I dug up this old email I wrote. Note there may be some errors in here, because I did not take the time today to study what I had written then and see if I had corrected myself later. Anyway, this will at least stimulate the thought process about why Bayesian can fall apart if the "probabilities of features (i.e. words) are independent" is not true, i.e. if spam and non-spam start sharing the same word sets.

I also publicly posted a portion of this email at the bottom of this public post:

http://forum.icann.org/lists/stld-rfp-m ... 00061.html

Subject: ELABORATION on Probability Theory of: Multiplicative principle for anti-spam
Cc:

1. The math assumption, "P(a | b) = P(a) * P(b)", in the forwarded email below is derived as follows, where "|" is the intersection of two independent events. If we are not sure they are independent and we assume they are, then this is called "naive":

P(a | b) = P(b ! a) * P(a), where "!" is conditional probability, i.e. "if"

P(a | b) = P(b) * P(a), because P(b ! a) = P(b) if a and b are independent events.

2. Incidentally, P(a & b) = P(a) + P(b) - P(a | b), where "&" is the union of two events (they need not be mutually exclusive). The derivation is:

P(a & ~a) = 1 = P(a) + P(~a), where "~" is complement

P(a & b) = P(a) + P(b | ~a)

P(b) = P(a | b) + P(b | ~a)

P(b | ~a) = P(b) - P(a | b)

3. So the probability of spam being caught by the intersection and union of two filters (events), is P(a | b) and P(a & b), as shown above. However, if a spam is caught by the intersection of two filters, then the probability (confidence) that the caught spam is really spam is:

P(a @ b) = P(a) * P(b) / [P(a) * P(b) + (1 - P(a)) * (1 - P(b))]

when the "a priori" probability of any email being spam is 0.5 and assume that P(a @ b) = P(~a @ ~b), i.e. that probability that caught spam is really spam is equal to probability that not caught spam is really not spam.

This is intuitively correct because P(0.5 @ 0.5) = 0.5. That is to say that the intersection of two filters which each catch, say, 50% of spam catches less spam (25%), but the probability that the caught spam is really spam does not change, because the filters had an equal probability of catching spam and not catching spam.

The derivation is:

http://www.mathpages.com/home/kmath267.htm

Thus:

P(0.95 @ 0.95) = 0.95 * 0.95 / (0.95 * 0.95 + 0.05 * 0.05) = 0.997 = 99.7%

Thus although caught spam decreases by using intersection of two filters, the probability that the caught spam is really spam increases.
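As a quick check of the numbers, here is a small sketch of the P(a @ b) formula from #3. The function name is made up for illustration, and the formula carries the same assumptions stated above (a 0.5 prior and independent filters).

```python
def combined_confidence(p_a, p_b):
    """Confidence that mail flagged by BOTH filters is really spam.

    Implements P(a @ b) = P(a)*P(b) / [P(a)*P(b) + (1-P(a))*(1-P(b))],
    assuming a 0.5 prior and independent filters.
    """
    both_right = p_a * p_b
    both_wrong = (1 - p_a) * (1 - p_b)
    return both_right / (both_right + both_wrong)


print(combined_confidence(0.5, 0.5))               # two coin-flip filters tell us nothing new
print(round(combined_confidence(0.95, 0.95), 3))   # two 95% filters agreeing
```

Two filters at 0.5 stay at 0.5, and two filters at 0.95 come out at about 0.997, matching the figures above.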

However, note that if we have a measured false positive rate of 0.01%, then the predicted 0.3% (100% - 99.7%) is incorrect, so it means either or both of the assumptions in #3 are not true. This is very interesting, because Paul Graham's "Naive Bayesian" makes these assumptions in 2 different equations:

i) http://www.paulgraham.com/naivebayes.html

Same eq as #3

ii) http://www.paulgraham.com/spam.html

Here Paul Graham assumes:

P(a ! b) = P(b ! a) / [P(b ! a) + P(b ! ~a)], thus assuming P(a) = P(~a)

Because the Bayesian equation is:

P(a ! b) = P(b ! a) * P(a) / [P(b ! a) * P(a) + P(b ! ~a) * P(~a)]

Paul admits these assumptions:

http://www.paulgraham.com/better.html

"Probabilities in this algorithm are calculated using a degenerate case of Bayes' Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam."
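To see what the equal-prior assumption does, here is a small sketch comparing the full Bayesian equation with the simplified one. The function names and the example word rates (20% of spam, 1% of non-spam) are invented purely for illustration.

```python
def bayes_posterior(p_b_given_a, p_b_given_not_a, prior_a=0.5):
    """Full form: P(a ! b) = P(b ! a)*P(a) / [P(b ! a)*P(a) + P(b ! ~a)*P(~a)]."""
    numerator = p_b_given_a * prior_a
    return numerator / (numerator + p_b_given_not_a * (1 - prior_a))


def equal_prior_posterior(p_b_given_a, p_b_given_not_a):
    """Simplified form that silently assumes P(a) = P(~a)."""
    return p_b_given_a / (p_b_given_a + p_b_given_not_a)


# Hypothetical word that shows up in 20% of spam and 1% of non-spam.
in_spam, in_ham = 0.20, 0.01

print(equal_prior_posterior(in_spam, in_ham))         # simplified form
print(bayes_posterior(in_spam, in_ham, prior_a=0.5))  # full form, same prior
print(bayes_posterior(in_spam, in_ham, prior_a=0.9))  # full form, 90% of mail is spam
```

With a prior of 0.5 the two forms agree. Once the real share of spam differs from 50%, the simplified form no longer matches the full equation.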

>Date: Thu, 15 Apr 2004 00:12:22 +0800
>From: Shelby Moore <shelby@coolpage.com>
>Subject: Multiplicative principle for anti-spam
>
>An idea to contemplate is that if you take two spam filters that are 95%
>effective, and have 1% false positive rate, then if you only delete the spam
>which is caught by *BOTH* filters, then the effectiveness is 0.95 * 0.95 =
>90.25% and the false positive rate is 0.01 * 0.01 = 0.01%. So 90 out of 100
>spams are caught and only 1 in 10,000 legit emails are caught.
>
>So one idea for anti-spam is to apply multiple highly effective filters to
>reduce the false positive rate.
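The multiplicative principle in the quoted email can be sketched like this. The function name is hypothetical, and the result only holds if the filters make their mistakes independently, which is exactly the "naive" assumption discussed above.

```python
import math


def require_all_filters(catch_rates, false_positive_rates):
    """Rates when mail is deleted only if EVERY filter flags it.

    Assumes the filters are statistically independent, so the rates multiply.
    """
    return math.prod(catch_rates), math.prod(false_positive_rates)


catch, false_positive = require_all_filters([0.95, 0.95], [0.01, 0.01])
print(f"spam caught     : {catch:.2%}")
print(f"false positives : {false_positive:.2%}")
```

If the two filters tend to trip on the same legit emails, the real false positive rate will be higher than this product suggests.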

-Shelby Moore
http://AccuSpam.com

accuspam · Post by **accuspam** » Sat Jul 24, 2004 9:28 pm

Correction to previous post:

> Good news! I just ran some queries on AccuSpam data, and
> very soon AccuSpam users will see their Daily Summaries
> drop to near 0!

There was a mistake in the query I used. It now looks like perhaps 25% or more of the spam subjects in the Daily Summaries can be eliminated with PER USER domain statistical blocking.

It could be greater, possibly even much greater, as data builds.

We will know more when we implement that next week.

Also we still have the GLOBAL domain and sender statistical blocking to look forward to as the # of AccuSpam users increase.

Once again, I want to reiterate that 100% of spam is blocked from the Inbox (paid version, or 99% for free version). We are currently seeing around 60% deleted immediately. Then about 40% is summarized in the Daily Summary email. We think this 40% can be reduced by at least 25% with simple PER USER domain stats blocking. To be implemented next week. So then you would have 70% deleted immediately and 30% summarized. However it could be better than that. We will know next week. And on the horizon, as the # of AccuSpam users increases we will see that move towards something like 99.9% deleted immediately and only 0.1% summarized, due to the GLOBAL (users helping other users) statistical blocking.

I will be away for next 3 days or so. Direct any issues to Tony please.

-Shelby Moore
http://AccuSpam

Paft · Post by **Paft** » Sat Jul 24, 2004 9:42 pm

Basically a global blacklist based on user input and "known" spammer domains.

Sweet.