Researchers at Microsoft have developed a tool to scrub search engines of major spam Web sites that pollute results.
The Redmond, Wash., software giant’s Cybersecurity and Systems Management Research Group has taken the wraps off Strider Search Defender, an experimental project that automates the discovery of search spammers through non-content analysis.
Microsoft Research has embarked on a new project to automatically seek out search engine spam before it can be used to defraud advertisers on MSN, Yahoo and Google. Called Strider Search Defender, the tool combines two other projects from MSR: Strider Honey Monkey and URL Tracer.
The tool, “Strider Search Defender,” is designed to dig out Web pages that are a front for spam Web sites, according to a paper published by Microsoft researchers. These Web pages typically reside on blogging sites and other services that provide free Web space, the researchers said.
Spammers soil the Web with countless links to their spam fronts in order to gain a higher ranking in search engines. "By cleaning up Web search, hopefully we can discourage spammers from cluttering the Web with spam," Yi-Min Wang, principal researcher at Microsoft, said in an interview.
The effort is led by researcher Yi-Min Wang and focuses on a major problem now plaguing the Web: blog spam. The basic premise of Strider Search Defender is that spammers rely on what Wang calls “doorway pages,” which are pages set up at reputable hosts and blog services. The doorway pages pull ads from a “target page” operated by the spammer.
Instead of reading the actual content of a page to see if it could be classified as spam, Microsoft is taking a context-based approach that analyzes URL redirection. Because many sites use redirection to serve different pages to search engines and to human visitors, content analysis can be fooled, and this method could prove more effective.
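The idea of judging a page by where it sends the visitor, rather than by what it says, can be illustrated with a minimal Python sketch. It assumes the chain of URLs a browser passed through has already been recorded; the function name `leaves_original_host` and the example URLs are hypothetical and do not come from Microsoft's tool.

```python
from typing import Sequence
from urllib.parse import urlsplit


def leaves_original_host(redirect_chain: Sequence[str]) -> bool:
    """Return True if a visit that started on one host ended on another.

    redirect_chain is the ordered list of URLs a browser passed through,
    starting with the page that was requested (an assumed input format).
    """
    hosts = [urlsplit(url).hostname for url in redirect_chain]
    # A single-entry chain means no redirection happened at all.
    return len(hosts) > 1 and hosts[-1] != hosts[0]


# Hypothetical doorway page that forwards the visitor to another host.
print(leaves_original_host([
    "http://doorway.blog.example/post",
    "http://target.example/ads",
]))
```

A cross-host redirect is not proof of spam on its own, since many legitimate sites redirect as well; in this sketch it is only a signal that would be combined with others.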
In addition, Wang notes that large-scale spammers create hundreds or thousands of “doorway pages” that either redirect to or retrieve ads from a single domain. Finding the target pages that are connected to a large number of doorways makes it possible to stop an entire spam operation in a single pass.
According to data from Automattic Kismet, a tool that helps bloggers thwart comment spammers, a whopping 93 percent of all blog comments are spam. With Strider Search Defender, Wang’s team is taking a context-based approach that uses URL-redirection analysis to pinpoint spammers.
“For the spammers to be successful, they have to post millions of fake comments on message boards and blogs. That is the only way to get picked up by search engines. If we can find a way to pinpoint them before they get indexed by search engines, the problem is solved,” Wang said.
“They want to be found by search engines, that is why they are spamming. Well, now we are finding you,” he added.
Microsoft’s tool does not find spam the traditional way, by looking at the site’s content. Instead, it turns the spammers’ activities against them by using search engines to find links to potential spam pages. These links are often posted as comments on blogs, in online discussion forums and in guestbooks, a practice also called "comment spam."
Search Defender starts with a list of confirmed spam Web addresses. A "Spam Hunter" part of the tool runs those addresses through search engines to find pages that link to the spam sites, using the "link:" query tag. Additional spam URLs found on those sites are, in turn, run through the Spam Hunter, resulting in a long list of potential spam sites.
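The expansion step described above can be sketched in Python as a breadth-first search over links. This is only an illustration under stated assumptions, not Microsoft's implementation: `find_linking_pages` is a hypothetical callback standing in for a "link:" search query, `extract_spam_urls` is a hypothetical callback that pulls further suspect URLs out of a page, and `max_urls` is an invented safety cap.

```python
from collections import deque
from typing import Callable, Iterable, Set


def spam_hunter(seed_urls: Iterable[str],
                find_linking_pages: Callable[[str], Iterable[str]],
                extract_spam_urls: Callable[[str], Iterable[str]],
                max_urls: int = 1000) -> Set[str]:
    """Grow a seed list of confirmed spam URLs into a set of suspects."""
    suspects = set(seed_urls)
    queue = deque(suspects)
    # max_urls is a soft cap, checked once per URL taken off the queue.
    while queue and len(suspects) < max_urls:
        url = queue.popleft()
        # Pages that link to a known spam URL (stand-in for a "link:" query).
        for page in find_linking_pages(url):
            # Other URLs posted on the same page become new suspects.
            for found in extract_spam_urls(page):
                if found not in suspects:
                    suspects.add(found)
                    queue.append(found)
    return suspects


# Toy data instead of a live search engine (all names are made up).
link_index = {"spam-a.example": ["blog1"], "spam-b.example": ["blog2"]}
page_urls = {
    "blog1": ["spam-a.example", "spam-b.example"],
    "blog2": ["spam-b.example", "spam-c.example"],
}
found = spam_hunter(
    ["spam-a.example"],
    lambda url: link_index.get(url, []),
    lambda page: page_urls.get(page, []),
)
print(sorted(found))
```

Passing the search and extraction steps in as callbacks keeps the sketch independent of any particular search engine, which the article does not specify at the API level.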
Next, that list is fed into Strider URL Tracer to find which domains are associated with a high volume of doorway pages. False positives are reduced by checking the URLs against a whitelist of legitimate ad and Web analytics providers that was compiled through the Strider Honey Monkey project.
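A rough Python sketch of that ranking step follows. It assumes the tracing has already produced, for each suspect page, the list of URLs the page redirected to or loaded content from; the `traces` input format, the `min_doorways` threshold and the function name are all hypothetical. Grouping by host name is also a simplification, since a real tool would more likely group by registered domain.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlsplit


def rank_target_domains(traces: Dict[str, Iterable[str]],
                        whitelist: Set[str],
                        min_doorways: int = 2) -> List[Tuple[str, int]]:
    """Rank third-party hosts by how many doorway pages point at them."""
    doorways_by_host = defaultdict(set)
    for doorway, third_party_urls in traces.items():
        doorway_host = urlsplit(doorway).hostname
        for url in third_party_urls:
            host = urlsplit(url).hostname
            # Skip unparsable URLs, same-host resources, and known-good
            # ad or analytics providers (the whitelist).
            if host is None or host == doorway_host or host in whitelist:
                continue
            doorways_by_host[host].add(doorway)
    ranked = [(host, len(pages))
              for host, pages in doorways_by_host.items()
              if len(pages) >= min_doorways]
    # Most-connected hosts first; ties broken alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Made-up traces: two doorways share one target, one host is whitelisted.
traces = {
    "http://a.blog.example/": ["http://target.example/ads.js",
                               "http://analytics.example/t.js"],
    "http://b.blog.example/": ["http://target.example/redirect"],
    "http://c.blog.example/": ["http://other.example/x"],
}
print(rank_target_domains(traces, whitelist={"analytics.example"}))
```

Counting distinct doorway pages per host, rather than raw hits, keeps one page with many embedded resources from inflating a host's score.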
"We use search engines to find them," Wang said. "Spammers are basically telling us: Here are my spam URLs."
Spammers use various online services to host spam fronts, including free Web hosting providers such as Tripod, Angelfire and Yahoo’s Geocities, Microsoft said. Blogging services are also often abused; Google’s Blogger at blogspot.com is especially popular, according to the research report.
"Our preliminary investigation shows that spam blogs hosted on blogspot.com appear to be particularly widely spammed and effective against search engines," the report said.
The problem is tied to the use of spam blogs, or splogs, to earn money from pay-per-click advertising programs offered by Google, Yahoo and MSN. Content on fake blogs often contains text stolen from legitimate Web sites and includes an unusually high number of links to sites associated with the splog creator. The sole purpose is to boost the search engine rank of the affiliated sites and cash in on ad impressions from unsuspecting surfers.
In one scenario, Wang said the Spam Hunter collected more than 17,000 BlogSpot URLs and fed them into the URL Tracer. The group was able to identify the top 25 target-page domains that are behind the Google-hosted splogs. The top six are particularly active, Wang said, identifying them as s-e-arch.com, speedsearcher.net, abcsearcher.com, eash.info, paysefeed.net and veryfastsearch.com, which collectively were responsible for approximately 45 percent of the BlogSpot URLs.
Microsoft’s researchers are working with the MSN Search team to see how the search service could be cleaned up, Wang said. Additionally, he called on the Web community, especially the operators of blog and free hosting sites, to cooperate to combat the Web spam problem.
“In the end it is all about protecting the search engines. Because if the spam does not show up in any search engine result the spammer will not receive traffic,” Wang said.