
Google probably lets a small amount of known spam through for data gathering. See this quote from Google's "Rules of Machine Learning" [1] (a great resource, by the way):

> Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.

> In a filtering task, examples which are marked as negative are not shown to the user. Suppose you have a filter that blocks 75% of the negative examples at serving. You might be tempted to draw additional training data from the instances shown to users. For example, if a user marks an email as spam that your filter let through, you might want to learn from that.

> But this approach introduces sampling bias. You can gather cleaner data if instead during serving you label 1% of all traffic as "held out", and send all held out examples to the user. Now your filter is blocking at least 74% of the negative examples. These held out examples can become your training data.

> Note that if your filter is blocking 95% of the negative examples or more, this approach becomes less viable. Even so, if you wish to measure serving performance, you can make an even tinier sample (say 0.1% or 0.001%). Ten thousand examples is enough to estimate performance quite accurately.

[1] https://developers.google.com/machine-learning/guides/rules-...
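Here's a rough sketch of what that held-out sampling might look like at serving time. This is just an illustration of the idea in the quote, not Gmail's actual pipeline; `deliver_to_inbox`, `move_to_spam_folder`, and the toy keyword filter are all hypothetical stand-ins, and only the 1% holdout rate comes from the quote:

    import random

    HOLDOUT_RATE = 0.01  # the 1% figure from the quote

    def deliver_to_inbox(email):      # hypothetical delivery hook
        print("inbox:", email)

    def move_to_spam_folder(email):   # hypothetical spam-folder hook
        print("spam folder:", email)

    def serve_email(email, spam_filter, training_log):
        """Route one email, holding out a slice of traffic for unbiased labels."""
        if random.random() < HOLDOUT_RATE:
            # Held out: bypass the filter entirely, so the user's
            # spam/not-spam action yields a label free of the filter's
            # sampling bias.
            deliver_to_inbox(email)
            training_log.append(("held_out", email))
        elif spam_filter(email) == "spam":
            # Filtered traffic is never shown to the user, so user
            # reports can only ever cover what the filter missed --
            # that's the bias the holdout avoids.
            move_to_spam_folder(email)
        else:
            deliver_to_inbox(email)

    # Toy usage: a trivial keyword "filter" standing in for the real model.
    log = []
    for msg in ["cheap pills now", "meeting at 3pm", "win a prize"]:
        serve_email(msg, lambda e: "spam" if "pills" in e or "prize" in e else "ham", log)

Note the arithmetic behind the quote's numbers: with a 1% holdout, a filter that previously blocked 75% of negatives now blocks 0.99 × 75% ≈ 74.25% of them, which is where the "at least 74%" figure comes from.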



I don't think that explains the very obvious crap that gets through: for instance, several near-duplicate spams in a row, each of which I manually reported.



