105

Some spam messages fresh from my Wordpress filter:

Asking questions are in fact pleasant thing if you are not understanding something totally, except this article gives good understanding yet.

and

Thanks for any other informative blog. Where else may I am getting that kind of information written in such an ideal means? I’ve a project that I’m simply now working on, and I have been on the look out for such info.

Is it just that basically all blog spam comes from non-English speaking countries, or is there some kind of tactical decision being made about the language? I ask because when I first saw it, I thought perhaps they were being genuine but inarticulate.

Lucas
  • 39
    Google translate from Russian? – Adi Jun 13 '13 at 17:42
  • 3
    Related: why do so many big websites (Salon, Wired, ...) have such terrible anti-spam comment filters? Filtering out "my wife/girlfriend just made $XX dollars" would eliminate a ton of spam. – Jerry Asher Jun 13 '13 at 20:27
  • 20
    @LarsH: It is definitely a security issue. Security is by definition the protection of a valuable resource from exploitation by hostile attackers. My blog comments are a valuable resource and I assure you it is under constant attack by hostile parties. – Eric Lippert Jun 13 '13 at 23:00
  • 6
    @AJHenderson Couldn't you make the same argument for, say, a desktop computer? The computer is there to do calculations, connect to the internet, provide data access, etc. Someone who breaks in is then just using it. The thing is, we just don't want them to do that, so we have security systems to stop them. Stopping people from doing things you don't want done to you, or to your property, seems like the very definition of security to me. A spam filter fits this: it stops people from posting whatever they want, because the owner of the site doesn't want it. – Lucas Jun 14 '13 at 03:45
  • 3
    To make another parallel, let's say I rob a bank: that's a security issue. On the other hand, if I use the bank to store my illegal money, it isn't business the bank wants (hopefully), but I'm not breaking their security. I'm misusing their service, but I'm using it as it was designed, simply not for the reasons it was intended. – AJ Henderson Jun 14 '13 at 03:50
  • 1
    @Lucas - it's really just a semantics thing and enough of the community seems to be on-board with it being ok by either association (since they are still closely associated in either case) or a difference in view from me. I'll just point out that many vendors themselves have Anti-Spam & Security products where they see it as two distinct things as well. Otherwise it would just be a feature of a good Security product with no specific mention of anti-spam needed in the category. But I do acknowledge there is room for both viewpoints and the difference is minor. – AJ Henderson Jun 14 '13 at 13:19
  • 27
    The question title should have been something like "Why blog spam written so bad always?" – Tobias Kienzler Jun 14 '13 at 14:30
  • two words: fuzzy logic. fuzzy logic is hard if not impossible to efficiently predict. – jokoon Jun 15 '13 at 17:38

10 Answers

140

The spammers are automatically generating new comments by taking existing comments and running them through a thesaurus program that replaces words with synonyms or related parts of speech. The result is a sentence which makes sense, but has word choices that no native speaker would ever make:

Where else may I am getting ...

is clearly not something a native speaker would write, but

Where else could she be getting...

is, and can be transformed by a simple substitution of pronouns and synonyms into the spam text.

This way, even if anti-spam forces have a huge database of known-spam comments, the spammers can generate infinitely many new ones that are plausibly English.

I long suspected this was the case but I recently got proof. I now occasionally get comment spam containing the entire substitution script; it'll be something like:

I can't [believe/understand/comprehend] the [great/superior/amazing] [content/information/data]...

Since the spammers were likely non-English speakers to begin with, they didn't notice they were sending the script rather than the output.

If you examine a large enough corpus of spam, you can pretty easily figure out what algorithms they're using. It would be an interesting challenge in reverse engineering to write a program that deduces the algorithms used from the corpus.
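As a small step in that direction, here is a minimal sketch of what a defender could do once a template has leaked: compile it into a regular expression that matches every variant it can produce. It assumes the bracket-and-slash syntax of the fragment quoted above; `template_to_regex` is a hypothetical helper, not part of any spam filter, and real templates may use a different syntax.

```python
import re

def template_to_regex(template: str) -> re.Pattern:
    """Compile a template like "the [great/superior] post" into a regex
    that matches any sentence the template could generate."""
    parts = []
    pos = 0
    for match in re.finditer(r"\[([^\[\]]*)\]", template):
        # Literal text before the bracket is matched verbatim.
        parts.append(re.escape(template[pos:match.start()]))
        # Each bracket becomes an alternation over its options.
        options = [re.escape(option) for option in match.group(1).split("/")]
        parts.append("(?:" + "|".join(options) + ")")
        pos = match.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts), re.IGNORECASE)

# The fragment quoted in the answer, without the trailing ellipsis.
TEMPLATE = (
    "I can't [believe/understand/comprehend] "
    "the [great/superior/amazing] [content/information/data]"
)
matcher = template_to_regex(TEMPLATE)

print(bool(matcher.fullmatch("I can't comprehend the superior data")))   # True
print(bool(matcher.fullmatch("I can't believe how good this post is")))  # False
```

This only recognizes output of a template you already hold; deducing the template from a corpus of its outputs is the harder problem described above.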

I ask because when I first saw it, I thought perhaps they were being genuine but inarticulate.

They fooled you once. It probably won't happen again!

Commenter TildalWave points out:

none of the sample spam messages OP posted actually endorse any products, or are otherwise promoting any other cause.

Well let me give you an example: here's a comment that arrived a few minutes ago on my blog:

user name:  cuisinart compact toaster review
user url:   toasterovenpicks.com
user email: jeffryshuler@2-mail.com
user IP:    37.59.34.218 
Comment contents:
One in particular clue for that bride and groom essential their
own absolutely new everything, actually a surname burned which has a mode,
which render nearly girl thankful recognizing their refreshing surname
therefore distinctively printed.

The product is promoted in the user's metadata, not in the content of the comment. The content is just an attempt to get past the spam filter. (I suspect that in this case the text is not a mutation of an existing text but rather generated by a Markov process over a corpus of documents about wedding planning.)
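To illustrate the kind of generator suspected here, the following is a minimal sketch of a first-order, word-level Markov chain. The corpus, the function names, and the choice of a first-order chain are all assumptions made for illustration; nothing is known about what this particular spammer actually ran.

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)

def generate(chain: dict, start: str, length: int, rng: random.Random) -> str:
    """Walk the chain from `start`, picking a random observed follower each step."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: the word only appeared at the end of the corpus
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Hypothetical stand-in for "a corpus of documents about wedding planning".
CORPUS = (
    "the bride and groom chose the venue and the groom thanked the guests "
    "while the bride signed the card with her new surname"
)

chain = build_chain(CORPUS)
sample = generate(chain, "the", 12, random.Random(0))
print(sample)
```

Every adjacent word pair in the output occurs somewhere in the corpus, which is why such text looks locally plausible while meaning nothing as a whole.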

Obviously anti-spam forces are on to this one too, which is why this was in my spam filter. My spam filter (Akismet) on average lets through one spam for every 705 submitted. Again, that's what spammers are going for; they know that 99.9% of their work will never be seen by anyone. They're trying to randomly explore the space of false negatives in spam filters, a space which is getting quite small indeed.
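The arithmetic behind those figures can be checked quickly. The sketch below takes only the 1-in-705 figure from the text; the one-million-attempt volume is a hypothetical number chosen for illustration.

```python
SUBMITTED_PER_MISS = 705  # from the answer: one spam gets through per 705 submitted

miss_rate = 1 / SUBMITTED_PER_MISS
block_rate = 1 - miss_rate

print(f"let through: {miss_rate:.3%}")   # let through: 0.142%
print(f"blocked:     {block_rate:.2%}")  # blocked:     99.86%

def expected_delivered(attempts: int) -> float:
    """Expected number of comments that slip past the filter."""
    return attempts * miss_rate

# Hypothetical volume: a million attempts still lands roughly 1,400 comments.
print(round(expected_delivered(1_000_000)))
```

So the roughly 99.9% figure is consistent with the filter's measured rate, and at spam volumes even that small remainder is worth the spammer's effort.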

Eric Lippert
  • 1
    Well, they didn't fool me once, but I certainly gave it far too much consideration. – Lucas Jun 13 '13 at 21:10
  • 13
    @TildalWave: The sentences become ungrammatical when local substitutions break context-sensitive rules. Substituting "is" for "am", "are", "were", "was", "been" or "being" is almost always going to make an ungrammatical or bizarre-sounding sentence. And even the "normal" rules for inflections and agreements in English are pretty bizarre and easy to get wrong. – Eric Lippert Jun 13 '13 at 21:33
  • @TildalWave: As for what is so hard about it -- it's not that hard. Remember, spammers are looking to deliver what, one message in a thousand? Ten thousand? If they have a cheap way to fool a filter one time in a thousand that's return on investment right there. – Eric Lippert Jun 13 '13 at 21:35
  • 1
    @TildalWave, that's an old story: link spam. The payload is in a URL embedded in the spammer's username, or something like that. –  Jun 13 '13 at 22:51
  • 24
    @TildalWave: First, you seem to be taking this awfully seriously. It's a StackExchange question. Lighten up, and if you don't like this answer, write a better one. What you'll "accept" is not particularly of concern to me; my answers do not come with a service level agreement. Second, of course the OP omits details. OPs always omit details. Since the OP has a WordPress blog, as do I, I've seen about 100000 spams just like his. Third, lots of web sites strip out the user metadata. Fourth, don't think of spammers as *smart*. They are throwing a billion spams a day and hoping a few stick. – Eric Lippert Jun 13 '13 at 23:10
  • 2
    @TildalWave (and Eric) I definitely *do* get the link-to-product type, though they are not really what I was interested in. Of those that are badly written, the ones with a payload constitute a large minority. Usually, it is an unresolvable host name and some randomly generated email address. All in all, most spam is of the link-promoting variety, including the two I posted. But many do not have a link at all. – Lucas Jun 13 '13 at 23:58
  • 3
    Great answer. Thanks for the insight from your blog. Makes for an interesting read. Glad I don't have to worry about my sites getting hit so hard yet. – AJ Henderson Jun 14 '13 at 03:19
  • 3
    The unresolvable hostnames are ones that were up and serving some unsolicited content at one point but have since gone down (some of these go up and then down again very quickly). As for messages with no links: it's pretty trivial to strip links out of a message, and some people don't block spam comments but just strip the links from them. This results in lots of semi-authentic-looking comments that have no obvious reason to be spam, but they still are. – Ardesco Jun 14 '13 at 07:43
  • 20
    I suddenly have the strangest urge to buy a toaster.... – Mansfield Jun 14 '13 at 12:12
  • @TildalWave , link spam may not contain text even remotely related to their products. Some are simply trying to establish an association between a popular site and their link farms. They understand this association can help raise their Google page rank. There is a whole "artificial web" of sites that don't serve any actual people, but the search engine spiders can't tell the difference. Essentially, they are leeching the reputation of the blogs they spam. – John Deters Jun 15 '13 at 16:47
  • 3
    You mentioned that you sometimes get comment spam containing the entire substitution script. Here’s [a full example of such a script](https://gist.github.com/shanselman/5422230). – Rory O'Kane Jun 15 '13 at 18:06
  • Receiving the entire substitution script is just too funny. +1! – Andrew Grimm Jun 16 '13 at 09:45
28

The language may have a little to do with a sig like TildalWave was talking about.

A little harmless spamdexing.

I've been getting a few comments like the first example on my blog. While it looks harmless, they're actually spamdexing (a little bit of "black hat SEO") by trying to associate their user account (and, by extension, their website links) with the keywords in the blog (like Xander was saying, it's marketing). When someone clicks on the link, it counts as a positive hit from the blog. If a blog generates enough positive hits for a keyword search, the link gets a +1 bump from the search engines in relevance for those keywords. Most of the search engines have caught on to this and try to prevent it with relevance matching in their formulas.

The downside is that if a user comes to your site for something off-topic because of this spam and leaves (bounces), the search engines will penalize your ranking overall (because of lack of substance) as well as your ranking for the page with the off-topic content. While spamdexing doesn't have a lot to do with IT security (unless they use an infected site as their own URL), it does hurt the [social] performance of the site overall if enough spammers do this and knock your site down in the rankings.

In regard to the second example, it contains a hook for a two-post spam operation (commonly found in forums). The first poster will create an account and post a question that looks like a legitimate concern.

... Where else may I am getting that kind of information written in such an ideal means? ...

A short while later (within 20 minutes or so, up to even a couple of days) another poster (from the same country usually, if not the same IP range) will create a new account and post the answer, which contains the link in relevance to the original poster's question. Since most board moderators won't delete what looks like a real discussion, their spam fools someone again... it's still spamdexing though. A better-crafted marketing-style example might be:

I found a great resource for [keywords here] at [http://www.example.com/]. You should take a look since they have a lot of information related to [more keywords]. It should help you out.

Another trick is to use a signature image that is a transparent GIF only 1 pixel by 1 pixel, wrapped in an <a> tag. This creates a link to some other website anywhere the poster has typed out their gibberish content. Just because you can't see it doesn't mean it's not there.
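A moderator could scan submitted markup for that pattern. The following is a minimal sketch using Python's standard `html.parser`; the class name and the sample signature are hypothetical, and it only looks at explicit `width`/`height` attributes, so it would miss images sized through CSS or images whose file is 1x1 without saying so in the markup.

```python
from html.parser import HTMLParser

class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags that wrap an <img> declared as 1x1."""

    def __init__(self):
        super().__init__()
        self._open_links = []   # hrefs of the <a> tags currently open
        self.hidden_links = []  # hrefs found wrapping a 1x1 image

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            self._open_links.append(attributes.get("href"))
        elif tag == "img" and self._open_links:
            if attributes.get("width") == "1" and attributes.get("height") == "1":
                self.hidden_links.append(self._open_links[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self._open_links:
            self._open_links.pop()

# Hypothetical forum signature of the kind described above.
SIGNATURE = (
    '<p>Nice post!</p>'
    '<a href="http://www.example.com/">'
    '<img src="dot.gif" width="1" height="1"></a>'
)

finder = HiddenLinkFinder()
finder.feed(SIGNATURE)
print(finder.hidden_links)  # ['http://www.example.com/']
```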

Not so harmless Spam Threats impact Server Security

Some of the worst examples of spam will actually contain a link to an infected site, or they'll install a JavaScript keylogger. (I've seen the SVG hack used in signature lines to inject malicious script.) The keylogger is the one you'll need to watch out for, because it can capture the username and password of the blog/site admin or another user with elevated privileges when they log in on the same page to delete the spam (or of any user creating an account there). In the best case, if the compromised user has enough access to see other users, the attacker will download the list of users' e-mail addresses and send spam e-mail to a market-targeted list.

Innocent new users can have their credentials stolen, and since most people use the same passwords and the same e-mail address everywhere, now their accounts elsewhere can be compromised. (Facebook, LinkedIn, etc)

Worst case scenario: because most developers of CMS systems don't expect someone with "skillz" to get into the (trusted) backend via one of these methods, they're not doing things like checking all of the admin forms for XSS or MySQL injection (I've caught a few of my developers cutting corners this way). From XSS to SQL injection, the damage then depends on the security of the box, the limitations on the user accounts (don't run Apache as root), and the read/write access. Since the attacker would be in the CMS, you can assume they can likely write anything to the box they want. Delete the database, infect the site with a backdoor... now it's an IT security issue.
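The two checks mentioned above come down to keeping user input out of the SQL text and escaping it on output. Here is a minimal sketch of both, using Python's built-in `sqlite3` as a stand-in for the MySQL backend the answer refers to; the table, function names, and hostile strings are hypothetical.

```python
import html
import sqlite3

# In-memory database standing in for the CMS backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")

def save_comment(author: str, body: str) -> None:
    # Placeholders keep user input as data; it is never spliced into the SQL text.
    conn.execute("INSERT INTO comments (author, body) VALUES (?, ?)", (author, body))

def render_comment(author: str, body: str) -> str:
    # Escape on output so stored markup is displayed as text, not executed.
    return f"<p><b>{html.escape(author)}</b>: {html.escape(body)}</p>"

hostile_author = "x'); DROP TABLE comments; --"
hostile_body = "<script>alert(1)</script>"
save_comment(hostile_author, hostile_body)

rows = conn.execute("SELECT author, body FROM comments").fetchall()
rendered = render_comment(*rows[0])
print(rendered)
```

The hostile author string is stored verbatim as one row instead of altering the query, and the script tag is rendered as inert text. The same treatment applies to admin-only forms, which is the corner the answer says tends to get cut.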

AbsoluteƵERØ
20

The company I used to work for used to do "spinning", which, as one of the answers above mentioned, is programmatically doing thesaurus search-and-replace on the text. However, we would do it in multiple, complex layers.

  1. We actually employed real, American writers to write the original copy.
  2. Those original writers would mark up their own document using a special syntax that we created, marking words, word groupings, phrases, and entire sentences, including the synonyms that they felt were appropriate for each case. This meant synonyms for entire phrases that could be exchanged without changing meaning. They would do this in a text editing software we created that would provide them with auto-complete suggestions.
  3. Each time a writer would mark up their document, we would store all of their synonyms and phrases in a dictionary and use them to add suggestions to the writer for their next assignment.
  4. Hit GO on the machine, and spin out hundreds/thousands of variations.
  5. Divvy out blocks of variations to our SEO team in the Philippines whose sole job was to find high PR blogs, forums and other websites too dumb to block us.

Interestingly, we never automated the actual posting part, since that was the easiest thing for machines to spot. A real human was posting that trash.
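To give a sense of how step 4 reaches hundreds or thousands of variations, here is a minimal sketch of the combinatorics. The slot contents are hypothetical, and the flat list of slots is a simplification: the nested markup syntax described in step 2 is not reproduced here.

```python
import itertools
import math

# Hypothetical marked-up copy: each slot lists interchangeable words or phrases.
SLOTS = [
    ["Thanks for", "I appreciate"],
    ["this", "the"],
    ["helpful", "informative", "detailed"],
    ["post", "article", "write-up"],
]

# The number of variations is the product of the option counts per slot.
variation_count = math.prod(len(options) for options in SLOTS)
print(variation_count)  # 36

# Enumerating them is a Cartesian product over the slots.
variations = [" ".join(choice) for choice in itertools.product(*SLOTS)]
print(variations[0])   # Thanks for this helpful post
```

Four short slots already give 36 distinct sentences; a full paragraph with a few dozen marked-up words and phrases multiplies out far beyond what a list of known-spam strings can cover.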

Ah, the good old days of ruining the internet for everyone.

Dan Gayle
  • 7
    Cool. Well, totally not cool. But thanks for sharing it. – Lucas Jun 14 '13 at 04:17
  • 3
    Why did you people do that? To make money? How can you make money by spamming? China pays you to ruin the internet for everyone? – Chani Jun 14 '13 at 05:42
  • 14
    @RitwikG: The way you make money on it is: the owners of CrappyToasterOvens.com call you up and say **We want to be the #1 Google hit when someone searches for "toaster oven wedding present". Make it happen.** So that's your job. How are you going to do it? Google looks for *popular pages that link to other web sites with keywords*, so you think OK, I'll put a million comments on a million blogs with the words "toaster oven wedding present" and a link to the site, and *some* of them will be popular blogs. – Eric Lippert Jun 14 '13 at 16:11
  • 1
    It seems to me that if you did that for enough documents, you could start doing some predictive processing to determine likely candidates for the syntax. Essentially, the knowledge base would not only maintain collections of the various synonymous elements but also how certain elements would often be arranged (in other words, building some sort of predictive parse tree through machine learning). Using that for generation probably wouldn't give optimal results, but I feel it could be useful for suggesting markups for the written documents. – JAB Jun 14 '13 at 17:02
  • 1
    @EricLippert +1 for toaster oven reference ;) – Lucas Jun 14 '13 at 17:31
17

I don't know if in your case the text you reported was the entire comment (what would then be its purpose, either as a genuine comment or as spam/scam?).

In case it was not – and when the spam needs to work as a prelude to future interaction – then writing it in poor English might be done on purpose, as a "check" for a victim who is dumb enough not to immediately recognise the scam and is hence worth investing time in.

Source: Why do Nigerian Scammers Say They are from Nigeria? by Cormac Herley, Microsoft Research.

Alberto Santini
  • 3
    +1 for mentioning the Herley paper. All of the explanations above assume huge amounts about spammers that can't often all be true. – Bruce Ediger Jun 14 '13 at 01:48
10

Maybe this won't answer the OP's question but those spams are not meant to make anybody buy anything.

The point is to create the maximum number of comments with links to particular pages or sites that spammers want to improve their PageRank. Those sites are where the real work of seducing potential buyers (or hacking computers of potential victims, or both) will take place.

That's why almost every spam has at least one link. And when it doesn't, it's generally a specially crafted comment ("A brilliant article," "Thank you for sharing this" ...) whose goal is to get approved, so that the bot later gains direct access without passing through the moderation queue. In some CMSs and forums, once a user reaches a minimum number of approved messages, the account is 'tagged' as trusted and its posts no longer need approval.
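The trust mechanism being exploited can be sketched in a few lines. The threshold value and all names below are hypothetical; actual CMSs and forums implement this rule in their own ways.

```python
from dataclasses import dataclass

TRUST_THRESHOLD = 3  # hypothetical: approved comments needed before moderation is skipped

@dataclass
class Commenter:
    name: str
    approved_count: int = 0

def needs_moderation(commenter: Commenter) -> bool:
    """A commenter is moderated until enough of their comments were approved."""
    return commenter.approved_count < TRUST_THRESHOLD

def approve(commenter: Commenter) -> None:
    commenter.approved_count += 1

bot = Commenter("hypothetical-bot")
history = []
for _ in range(4):
    history.append(needs_moderation(bot))
    approve(bot)  # a moderator waves through a bland "Thank you for sharing this"

print(history)  # [True, True, True, False]
```

After three harmless-looking approvals the fourth comment, the one carrying the link, is published without review.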

So spam is not meant for humans but for machines (search engines), and spammers need to produce as much of it as they can to influence search engines. So they do not waste time on the content, since no human will read it, and concentrate on mechanisms that generate a lot of messages faster and more simply.

In a word, you are not the target; you are just collateral damage.

ahmed
9

It is probably a combination of the two. If they use language that doesn't properly make grammatical sense, there is more likelihood someone might misinterpret it as actual feedback on a post since they'll try to fill in the blanks in a way that makes sense. Ultimately most of this kind of spam is trying to spread links around the web to try and impact search rankings.

In order to get links to stay up, they need their comments to look genuine to make them harder to easily pull out from genuine comments. They make generic sounding responses that "could" plausibly be valid in the hope that they will be left active.

In other situations, this is the result of trying to insert keywords into the comment so as to increase the association of the link with those keywords.

AJ Henderson
6

In addition to the fine answers posted above there is a strong sampling bias to your question.

You only recognize poorly crafted spam blog posts as blog spam. You never recognize the really well crafted blog spam as blog spam. Hence it seems that all blog spam is poorly crafted.

AmIRight?

AllInOne
  • 7
    If I spend the time to write thoughtful, grammatically correct, relevant, useful, etc. posts to blogs which I happen to add links purely for ulterior motives - is that blog spam? – emory Jun 13 '13 at 21:55
  • 6
    @emory Nope, that's marketing. :-) – Xander Jun 13 '13 at 22:00
  • 1
    @Xander then it is a definition problem not sampling bias. If my self serving blog posts are poorly crafted then they are blog spam; if they are well crafted then they are marketing. Blog spam is poorly crafted by definition. – emory Jun 13 '13 at 22:21
  • 1
    Actually no. If I had a really successful blog, then maybe you could say that, but as it is, it's pretty easy to tell the spam from the non-spam (do I know them, no, well it's probably spam). – Lucas Jun 13 '13 at 22:21
  • 13
    Your answer reminds me of this (profane) xkcd comic: http://xkcd.com/810/ – Eric Lippert Jun 13 '13 at 23:02
  • @emory Yup, I agree. – Xander Jun 14 '13 at 00:03
4

Quite often blog spammers use content spinners. They replace words with synonyms, which should work in theory, but in reality it makes the comment look like it was written by a 4-year-old, or by someone who does not have English as a first language.

Most content spinners share a common syntax (example from Eric Lippert's answer):

I can't [believe/understand/comprehend] the [great/superior/amazing] [content/information/data]...

This means the content spinner will choose one random word from each bracket to build the sentence. This way you get a large variety of similar comments without exact duplicates, which makes it a bit harder for anti-spam plugins to identify similar content if they use a checksum like MD5 to compare comments with previous spam.
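A minimal sketch of why the checksum comparison fails, assuming the bracket-and-slash syntax shown above: `spin`, `md5_hex`, and the one-entry blocklist are hypothetical, not how any particular anti-spam plugin works.

```python
import hashlib
import random
import re

TEMPLATE = (
    "I can't [believe/understand/comprehend] "
    "the [great/superior/amazing] [content/information/data]..."
)

def spin(template: str, rng: random.Random) -> str:
    """Replace each [a/b/c] group with one randomly chosen option."""
    return re.sub(
        r"\[([^\[\]]*)\]",
        lambda match: rng.choice(match.group(1).split("/")),
        template,
    )

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Hypothetical blocklist holding the checksum of one previously seen spam.
known_spam = "I can't believe the great content..."
blocklist = {md5_hex(known_spam)}

rng = random.Random(42)
variant = spin(TEMPLATE, rng)
while variant == known_spam:  # make sure we compare against a different wording
    variant = spin(TEMPLATE, rng)

print(md5_hex(known_spam) in blocklist)  # True: an exact duplicate is caught
print(md5_hex(variant) in blocklist)     # False: a reworded variant is not
```

Changing a single word changes the whole checksum, so exact-match hashing only catches verbatim repeats.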

iHaveacomputer
4

They may be going off templates like this: https://gist.github.com/shanselman/5422230 , which was recently accidentally posted to Scott Hanselman's site: http://www.hanselman.com/blog/ExposedABlogCommentSpammersSourceTemplate.aspx

As others have mentioned, all that needs to be done is to write a script to pull a word at random out of the bracketed lists.

1

Put simply, you need to be aware of SEO (Search Engine Optimization). It has two major types of technique: 1) black hat and 2) white hat.

White hat does things the genuine, authentic way.

Black hat is where your problem starts. Black hat practitioners create a number of usernames and passwords, or a list of open blogs, and keep posting content built around the keywords they need, so that it brings inbound clicks to their site.

As the first answer says, they use smart software that partially understands language and creates a paragraph on the basis of given keywords.

So it will make some sense, but will not make sense entirely... :)

I hope this makes sense in the context of your question.

MarmiK