A Bully By Any Other Name
From the New York Times:
“Shopping online in late July, Clarabelle Rodriguez typed the name of her favorite eyeglass brand into Google’s search bar.
In moments, she found the perfect frames — made by a French company called Lafont — on a Web site that looked snazzy and stood at the top of the search results. Not the tippy-top, where the paid ads are found, but under those, on Google’s version of the gold-medal podium, where the most relevant and popular site is displayed.
Ms. Rodriguez placed an order for both the Lafonts and a set of doctor-prescribed Ciba Vision contact lenses on that site, DecorMyEyes.com. The total cost was $361.97.
It was the start of what Ms. Rodriguez would later describe as one of the most maddening and miserable experiences of her life…” [continue reading]
For those without the patience to read the seven-page article: the owner of DecorMyEyes.com, Vitaly Borker, deliberately mischarged his clients and then bullied any who complained. Constantly dishing out vulgar threats, Borker committed wire fraud, impersonation, and stalking as part of his business strategy.
Of course, every angry customer went directly to online forums and business review sites to complain. Unbeknownst to them, this was exactly what Borker hoped for. Every review posted about DecorMyEyes.com, no matter how negative, was another backlink to boost the site’s PageRank. After enough negative press, sunglasses product pages from DecorMyEyes.com showed up even higher in search results than the websites of their designers!
Shortly after the New York Times article stirred up a serious conversation about DecorMyEyes’s business practices, the police opened an investigation, and Borker was promptly arrested.
The Honest Algorithm
This story illustrates a compelling point. Once Google released the information behind their ranking algorithm (of course, it happened before Google was even a company), people could take advantage of it! A staggeringly large number of services exist to boost your PageRank (129 million results on Google search). And so the intended way to get a high PageRank (others like your content and link to it) is superseded by some method of manufacturing backlinks.
Unfortunately for the rest of us, PageRank seems to be an honest algorithm. It only judges pages appropriately when the pages aren’t competing to be judged. Once Google became a popular option for search, rankings could make or break a start-up tech company. Exploiting PageRank to boost business was only the natural response, but of course this undermines the assumptions of the algorithm.
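To make the exploit concrete, here is a minimal sketch of how manufactured backlinks shift the ranking. Everything in it is an assumption made for illustration: the tiny four-page web, the page names, the twenty complaint pages, and the damping factor of 0.85. It uses a plain power-iteration PageRank, not whatever Google actually runs.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict mapping page -> list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# A made-up web: a designer's site with organic links, and a shop nobody links to.
honest_web = {
    "designer": ["blog"],
    "blog": ["designer", "news"],
    "news": ["designer"],
    "shop": [],
}

# The same web after 20 hypothetical complaint pages each link to the shop.
gamed_web = dict(honest_web)
for i in range(20):
    gamed_web[f"complaint{i}"] = ["shop"]

before = pagerank(honest_web)
after = pagerank(gamed_web)
print("shop outranks designer before:", before["shop"] > before["designer"])
print("shop outranks designer after: ", after["shop"] > after["designer"])
```

In this toy web the shop sits below the designer until the complaint pages appear, and above it afterward. The algorithm only counts the links; it has no way of knowing that every one of them was written in anger.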
It is obvious at this point that while PageRank may be a critical component of Google’s overall ranking algorithm, it is certainly not the only factor Google considers. It’s plausible that Google has many alternative ranking criteria which trump PageRank. Presumably, Google was forced to come up with these criteria specifically to combat sites like DecorMyEyes.com and to identify manufactured links.
And so this opens the floor for discussion: what alternative ranking systems would you consider? Can you think of easy ways to identify these maliciously manufactured links? This seems in general to be a hard problem, unless the pages providing the additional links are blatantly obvious.
Other than a potential discussion in the comments, that wraps up the series on PageRank. We hope the readers have enjoyed it!
Page Rank Series
An Introduction
A First Attempt
The Final Product
Why It Doesn’t Work Anymore
Thanks for this fascinating article. I doubt that there exists an algorithm to combat sites like DecorMyEyes.com that is based solely on the structure of the Internet. It’s going to take some natural language wizardry that can characterize the nature of a site that links to another and weight correspondingly. A link from a forum that’s critical of the site could be weakened.
In this case, knowing how often a link from one site to another is clicked might help. Not many people visit a site that has been slammed in a review. But I have no idea how available this information is.
I believe that’s one of the ways Google has compensated for this: doing NLP on the content of the posting website. It’s a seemingly daunting task: how do you know which text refers to the website in question?
I think one good way to stop this problem would be to standardize some sort of metadata associated with the link tag that allows the person serving the webpage to be in control of their “vote.” Of course it wouldn’t be realistic to assume all pages would update like this, but prominent review sites like Yelp certainly have it in their power to make this change, and these kinds of sites are the source of the problem in the first place. For pages that don’t include this extra metadata, the default can be a positive vote as usual.
Bear with me here if I’m completely missing the mark. Why don’t you solve the problem of the quality of each link by putting each link through the same equation that the original page undergoes? Furthermore, you could rank each page, along with its links, on a logical level like Bruce stated, by how many clicks each site averaged along with duration, and give them a combined base score up to 10/100 or whatever is the most logical number. Now I understand that if this were to happen to sites such as DecorMyEyes, the influx would potentially create lag or a slower search time. Wouldn’t tweaking and further testing the code eventually lead to a faster system? I don’t have much mathematical or computer science knowledge but I’m just curious.
I do see a problem, however: sites that are meant to trap you and force you to click OK until the code is finished could then skyrocket to the top of search results. The only way I see it truly working is if code, math, and languages somehow combined and could understand each other.
For the first remark, the equation doesn’t apply to links because links aren’t connected to each other in the same fashion. The model simply doesn’t apply. As to the second, search engines have access to click-through rates, and (while they probably shouldn’t) it’s conceivable that they have access to duration. The problem is that these metrics don’t correlate with the quality of the webpage, and this effectively leaves the quality of a webpage to the arbitrary habits of users. For instance, I might visit http://www.officialwebsiteofamerica.com/ and then decide to play with my cat for an hour, forget my computer is on, and then leave it on all night. Certainly that doesn’t mean it’s a good website. This kind of thing happens all the time (just count how many tabs you have open on average), and while I bet most search engines incorporate it somehow, it is likely too variable to rely on.
I really thought this series provided a great explanation on the math behind search engines. I really enjoyed the incorporation of advanced mathematical analysis without bogging the article down with complicated terminology and notation. I always thought that the algorithms behind search engines were impossible to understand for regular folks like me, but the author did a great job elaborating on them. I especially liked the last page where the author described how rankings have flaws and can be exploited for corporate gain.
Well, with or without PageRank, a website must have strong dofollow backlinks from high-authority sites to get a good rank in search engines.