Built-in randomness in Google’s ranking algo, even more so for AdWords

April 26, 2008

Google started out by building a search engine based on automated algorithms. Clean, comfortable, don’t have to deal with people – just what the doctor ordered for nerds.

However, people are smarter than algorithms and any bogus AI, so they figured it out. They are the cats in this game of cat-and-mouse.

What’s poor Google to do?

They can apply human input (as they already do now). But it’s not really scalable, it’s unreliable, it’s messy – and webmasters, marketers and spammers will figure this out, too.

One thing people cannot figure out is randomness.

So they put randomness into the algorithm, retrofitting the most critical and most vulnerable elements of their ranking algo with it. (In other words, the important yet easy-to-reverse-engineer parts.)
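To make the idea concrete, here is a minimal sketch of what such score jitter could look like. It is purely illustrative – the function, the noise model and the numbers are my own assumptions, not anything Google has disclosed.

    import random

    def rank_with_jitter(scored_pages, jitter=0.05, seed=None):
        """Order pages by score after adding a small, bounded random perturbation.

        scored_pages: list of (url, score) pairs coming out of the 'real' algorithm.
        jitter: maximum relative noise applied to each score (5% here).
        The noise makes it hard to attribute a movement in the rankings to any
        single tweak, which is exactly the anti-reverse-engineering point.
        """
        rng = random.Random(seed)
        noisy = [(url, score * (1 + rng.uniform(-jitter, jitter)))
                 for url, score in scored_pages]
        return sorted(noisy, key=lambda item: item[1], reverse=True)

    # Two near-tied pages may swap places from one evaluation to the next:
    print(rank_with_jitter([("a.example", 0.82), ("b.example", 0.80), ("c.example", 0.40)]))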

And this also explains neatly why the quality score for AdWords behaves in such irrational and unfathomable ways at times.

It has even more built-in randomness than organic rankings. It needs to. Why?

Because, compared to the organic SERPs, the feedback advertisers receive here is more instantaneous and more structured, the scope for experimentation is greater (and the incentive to experiment is more obvious), so the potential for gaming the system is much larger.

However, randomness built into the organic results is not much of a problem in terms of relevance. Searchers won’t notice, or care about, anything odd as long as the SERP remains acceptably relevant.

At the same time, AdWords data are closely monitored and analyzed, sliced and diced, so the randomness is more observable. It’s more disturbing and it hurts budgets.

In the end, it has the potential to alienate the very people Google makes a (very good) living from.

Lack of “Similar pages” info – a signal of the Google Sandbox?

February 21, 2008

When you look at a Google SERP, the results usually have a “Cached” and a “Similar pages” link.

If you click “Similar pages”, it brings up pages from other domains that are supposed to be related to that particular page, but the selection is based on a logic that’s hard to fathom. Some of those “related” pages are not even similar in topic, while some others are just very loosely related, through a third party etc.

It’s a thoroughly useless feature in Google’s search engine.

But not every result has “Similar pages”. Sometimes you click and it says: “Your search – related:www.example.com – did not match any documents”.

That’s usually the case with relatively fresh pages, but also with sites that are several months old.

There is a possibility that it’s part of the Google sandbox phenomenon.

By the sandbox phenomenon, I mean the following scenario:

Your site and its pages are indexed. They may receive lots of traffic from Google. They may rank for a lot of queries. But somehow there are a couple of queries you just can’t rank for, even though the pages shown in the top ten have less content, a worse technical layout and an inferior link profile compared to yours.

Then one fine day, something happens and you start to rank for stuff you should have ranked for almost from day one. You’re out of the sandbox. Your site is finally inducted into Google’s trusted circle.

It would be interesting to check whether there is a correlation between sites getting out of the sandbox, and gaining their “Similar pages” lists.
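A systematic way to eyeball this would be to generate the “related:” query URL for each domain you are tracking and note which ones return the “did not match any documents” message. A tiny sketch (the domains are placeholders, and the checking itself stays manual):

    from urllib.parse import quote_plus

    def related_check_urls(domains):
        """Build the Google 'related:' query URL for each domain, for manual checking.

        A result page saying "did not match any documents" means the domain
        has no "Similar pages" list yet.
        """
        return {d: "http://www.google.com/search?q=" + quote_plus("related:" + d)
                for d in domains}

    for domain, url in related_check_urls(["example.com", "example.org"]).items():
        print(domain, "->", url)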

Why do I think they could be related? The time frame for both seems to be on the same scale: 6, 12 or 18 months, depending on the topic of the site.

Just an idea.

How Google might reassign incoming links for duplicate content

January 14, 2008

Maybe it’s a glitch, maybe it’s a new approach from Google to police duplicate (and possibly stolen) content: all links pointing to the duplicate being credited to the original.

I.e. the incoming links of the thief’s site are counted as incoming links for the victim’s site, at least for the particular page where the stolen content resides.

This approach would indeed have a couple of advantages.

1. Compensation for the victim; targeted penalty for the thief

Any links the crook gets for the stolen content will be attributed to the original, even the links that point to that page from his own site. It should hurt. And it’s still better than the idiocy of blanket penalties (like minus six, minus thirty and so on).

2. An alert for the victim

When such unknown links show up in the External links report of Webmaster Tools, a quick investigation lets the owner easily identify the sites that stole his content.

3. An inherent logic

Any link the stolen content receives is a vote. Those votes should honor the creator of the content even if given (perhaps in good faith, by third parties) to the duplicate.

The problem is, this approach only works when the original piece can be identified with certainty. Which is where Google and other search engines still seem to have difficulties.
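For what it’s worth, the bookkeeping itself would be trivial. A minimal sketch of the reassignment, assuming the duplicate-to-original mapping is already known (the names and data structures are invented for illustration):

    def reassign_duplicate_links(inbound_links, duplicate_of):
        """Credit every link pointing at a duplicate page to the original page.

        inbound_links: dict mapping a page URL to the set of URLs linking to it.
        duplicate_of:  dict mapping a duplicate URL to the URL judged original.
        Returns a new inbound-link map in which the duplicates keep no credit.
        """
        credited = {page: set(links) for page, links in inbound_links.items()}
        for dupe, original in duplicate_of.items():
            moved = credited.pop(dupe, set())
            credited.setdefault(original, set()).update(moved)
        return credited

    links = {
        "victim.example/article": {"blog-a.example"},
        "thief.example/copy": {"blog-b.example", "thief.example/home"},
    }
    # The copy's two inbound links (including the thief's own internal link)
    # are now credited to the victim's page:
    print(reassign_duplicate_links(links, {"thief.example/copy": "victim.example/article"}))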

Reference sites add trash to Google’s index

January 10, 2008

Like Merriam-Webster and Reference.com.

A possible long term bet on nofollowed links

December 21, 2007

It’s interesting to think about how to beat Wikipedia for “SEO” in Google. Or in fact, how to beat any seemingly invincible competitor in the rankings.

A not too creative but probably effective method would be to build a great non-profit resource. A site that provides consistently high-quality SEO stuff, including news, opinions and education, but without a single ad, affiliate link or self-promotion, would become a top favorite to link to within 1-2 years.

Sure, it would be a long-term investment with uncertain benefits. You wouldn’t be able to monetize it later, unless you really wanted to antagonize everybody in the industry. You would, however, be able to build a strong reputation on it, even with the handicap of starting a decade after most SEO big guns.

Another, quite risky method would be to build your castle mostly on nofollowed links.

“Buy” on the cheap

In the current online environment, nofollowed links are looked down on as a poor cousin to clean links. A low quality, low value commodity. Counterfeit money in an economy where PageRank is the de facto currency.

By expressly encouraging people to link to your site with “rel=nofollow”, you’d basically say, Please help me out with something that’s worthless anyway. It should be much easier than asking for something of value.

Plus, you’d have instant trust, because with no PageRank flow involved, you apparently make it impossible for yourself to bait and switch.

Why do it, then?

The tables might turn. Just as domain names were cheap after the dot-com crash and a great investment in hindsight, links with an obscure little attribute in the code might some day regain their value.

There could be several reasons for this.

  • Shifting search market positions, with one or more nofollow-ignoring search engines taking business away from the current leader.
  • Google backtracking on nofollow if and when they finally get back to the search engine business from their current empire building spree, and realize they can actually solve their problems (paid links, PR flow etc.) on their own, without coded hints from intimidated webmasters.
  • Google waiving the enforcement of nofollow for inbound links if the resulting omission of a site clearly hurts relevancy.

This last point is based on the assumption that you can successfully go against Google if you build something they can’t afford not to rank.

Online outlaws

In the wider scheme of things, “rel=nofollow” may even be part of an alternative online reality. Imagine a directory that only lists sites that robots.txt-d Google away.

That directory would offer something truly unique. Exclusive stuff you just won’t find in Google. And if the directory itself also joins the “outlaws”, with a clever marketing campaign it could receive a huge amount of nofollowed links as a show of solidarity.
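Verifying eligibility for such a directory would be straightforward, by the way. A sketch using only the Python standard library (the domain names are placeholders):

    from urllib.robotparser import RobotFileParser

    def blocks_googlebot(domain):
        """Return True if the site's robots.txt forbids Googlebot from crawling it."""
        parser = RobotFileParser("https://" + domain + "/robots.txt")
        parser.read()  # fetches and parses the live robots.txt
        return not parser.can_fetch("Googlebot", "https://" + domain + "/")

    # Only sites that shut Googlebot out would qualify for the listing:
    candidates = ["example.com", "example.org"]
    print([d for d in candidates if blocks_googlebot(d)])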

Even so, in times of peace (when Google plays nice), that directory would perhaps be just an interesting footnote. But in times of upheaval, it would be the standard bearer of revolution and with some luck, it could even gain traction enough to change the rules.

Google reconsideration request now without self-incrimination

December 17, 2007

As indicated some weeks ago, Google has finally dropped the “I believe this site has violated Google’s quality guidelines in the past” part from the Reconsideration request form of Webmaster Tools.

So now you don’t have to acknowledge your guilt even when you believe otherwise. A small step forward. (Or back towards the do-no-evil territory Google left behind long ago).

[Added: Though some semantic issues remain.]

Being relevant to site vs. being relevant to site owner

December 7, 2007

It has become something of a paradigm that good links should mostly come from related, topically relevant sites. Sounds OK in theory.

The problem is that the majority of people have at most one or two sites in their possession or editorial control, but they have dozens (or hundreds) of interests, hobbies, affiliations and opinions.

Someone may be a movie buff and have a website dedicated to this. At the same time, he may also have a friend starting a new business, compassion for abandoned animals, strong political views, or enthusiasm for a useful web service.

Shall he link to his friend’s company site; the local animal shelter; a presidential candidate’s home page; or that particular web service? And all this from a movie site? (Well, he doesn’t have any other…)

Or shall he create content for each and every topic he wants to put up links about? Really? Just so that search engines accept those as relevant links instead of treating them as suspect, worthless, semi-spammy stuff?

It’s his site. He has full editorial control, he is not paid for those links, he actually understands and endorses them.

Still, how the hell would a search engine tell those links apart from non-relevant, paid-for links?

Matt Cutts’ manipulative piece on paid reviews

December 2, 2007

In the early part of the film “The Patriot” (starring Mel Gibson), we are presented with an absolute idyll of country life brutally destroyed by mindless monstrosity. This extreme contrast between good and evil invariably makes anybody watching the movie gasp in horror. Which makes not for good cinema but for a cynical exploitation of movie-goers’ emotions.

The recent post of Matt Cutts on paid reviews reeks of this method of emotional manipulation.

His example to make a point is intentionally heart-wrenching and serious: a relative with a brain tumor. Then he creates an extreme contrast by setting this life-or-death issue against the poorest and most pathetic of paid reviews he could find on the net.

And then he asks the rhetorical question: “would you prefer that radiosurgery overview article from the Mayo Clinic, or from a site which appears to be promoting a specific manufacturer of medical equipment via paid posts?”

I wonder whether such demagoguery could actually lower Matt Cutts’ standing as a well-respected voice in the search arena.

In fact, his wife might have been right in questioning him about staying with Google beyond the originally planned time span.

Were he to leave now, he would mostly be remembered as the face of Google at a time when, on the whole, we still loved Google. If he stays too long, he risks going down with the ship. Not necessarily financially, but in reputational terms, definitely.

When you’re a big name in SEO, there is no such thing as a stupid question

November 30, 2007

Being a “superstar” among SEOs, Matt Cutts does not seem to be overly active on forums and in comments. Oh, yes, one or two appearances by such an important person are already a fantastic compliment to any blog or website. Unless you happen to be another big name. Because then no irrelevant, inconsequential and accidental question is irrelevant, inconsequential and accidental enough for Matt to ignore.

  • Hey, why am I second for a keyword combination, when I should be first? That’s a really disconcerting relevancy problem in Google!
  • Why did my PageRank fall? It is supposed to go up and only up!
  • Why does the datacenter I am currently using show only 893 results instead of 1000? Who knows what I miss out on…
  • Why do I see a strange list here and here? Oh, because I forgot about a script…

There are possibly a couple hundred more important and more pressing questions regarding Google than those. But Matt Cutts, happy to help out whiners as long as they are his buddies and pose easily answerable questions, makes no fewer than a dozen appearances there.

Thank you, Matt. We can count on you wherever there is some low-hanging fruit. Anything you can solve or ridicule easily. Like the idiots who could not configure their cloaking scripts. Morons with painfully primitive keyword stuffing and hiding techniques. Or non-controversial stuff you can nicely explain on a blogspot.com PR blog.

And the rest, you’re corporate enough to pass by.

Build a little goodwill when submitting a link

November 29, 2007

Doing a link campaign the disciplined and organized way is definitely the way to go. You can enhance it a little bit further, however:

Try to find a broken link on the page you are submitting to, and mention this in your e-mail.

Instead of just saying, “Please add my link”, you might say “Please add my link. BTW, the link pointing to xxx is broken”. Being helpful like this can somewhat increase the chance of your link being accepted.

On the other hand, it serves as a control, too. If you don’t get any reply to your submission, and the broken link is still there, it’s “No response” in your Excel table. With no reply, but the broken link fixed, it’s “Link denied”.

(To make it more efficient, concentrate on checking http://www.example.com/subthis/~subthat/jsmith.html kind of links first. Those, especially if added several years ago, are much more likely to be broken than http://www.example.com type links.)
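If you process a lot of prospect pages, the broken-link hunt can be semi-automated. A rough sketch using only the Python standard library (the page URL is a placeholder; a real version would also follow redirects and throttle its requests):

    from html.parser import HTMLParser
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError

    class LinkCollector(HTMLParser):
        """Collect the href of every outbound <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value and value.startswith("http"):
                        self.links.append(value)

    def find_broken_links(page_url):
        """Return the outbound links on page_url that no longer resolve."""
        html = urlopen(Request(page_url, headers={"User-Agent": "link-check"})).read()
        collector = LinkCollector()
        collector.feed(html.decode("utf-8", errors="replace"))
        broken = []
        for link in collector.links:
            try:
                urlopen(Request(link, method="HEAD"), timeout=10)
            except (HTTPError, URLError):
                broken.append(link)
        return broken

    print(find_broken_links("http://www.example.com/links.html"))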

Malware, link injection…

November 29, 2007

As if to underscore the web spam issue just mentioned in the previous post, some news on SEL:

Search spam, using techniques that manipulate the search results, is becoming more dangerous each and every day. […] As the search landscape becomes more competitive, it is natural for some players to take more extreme measures to achieve their goals.

No startup’s gonna beat Google (only Yahoo and Microsoft can)

November 29, 2007

Try as they might, even with more sensible marketing instead of hype, there is only a very very small likelihood that any shiny new search engine can take on Google. And that’s because in the last couple of years, an insurmountable barrier to entry has materialized.

The barrier is perhaps not frighteningly high if we talk about “entry to the market”. Another garage firm conquering the world – that would be sooooo exciting. However, what really matters is “entry to the big league”, and, perhaps a bit counter-intuitively, that big league is insulated from newcomers by the intractable issue of webspam.

For the three major players, webspam is probably the single biggest challenge technologically. But they can take consolation from this: if something is so serious even they struggle with it, it is definitely going to be lethal to smaller fish.

Not immediately after they start out, of course. Nobody cares about new search engines in the beginning. But with growing awareness among users, spam will increase, too. Aided by automation, it will actually grow faster than awareness and renown. So any new SE will be crippled before it can reach the critical mass needed to pose a challenge to G.

What does this mean? That the age factor is not only important for domains and sites. It will be important for search engines as well. To put it simply, only those that were already on the scene before the onslaught of webspam began in earnest can succeed.

Aaron Wall said about MSN being relatively late to the party:

By the time Microsoft got in the search game the web graph was polluted with spammy and bought links.

While this may be true, they probably still arrived in time to stand at least a chance. Yahoo has an even better position in case Google trips up. For more recent entrants, the odds are extremely long.

The new danger of links looking paid-linkish

November 27, 2007

The increasingly high-stakes competition for rankings provokes increasingly nervous responses from search engines, especially Google. As mentioned in a previous post, this ultimately means less safety for any site.

In an online world where the (supposedly) good guy is becoming more and more trigger-happy, there is a growing chance that you can hurt yourself unwittingly, compromising your site’s prospects without any deliberate malevolence.

Case in point (not surprisingly): paid links. Many times they differ from normal links only in their intent. Intent is hard for outsiders to recognize, be they humans or bots, so Google has no choice but to make assumptions that are bound to be inaccurate.

Now let’s see a couple of techniques that just became more dangerous to use because of this.

1. Cross-linking your online properties

Promoting one of your sites on a topically unrelated site of yours is virtually indistinguishable from a paid link. You may put a nofollow on that link, but why should you? It’s your site and you definitely vouch for it. It’s an editorial link: it points to something you endorse. Still, it can make you look like a spammer.

2. Having a “Link to us” page

Such pages usually suggest ways to link to the site, as well as offering banners along with their embed code. Many people who like the site will find those codes convenient to use (after all, that’s their purpose). However, a bunch of such uniform links to your site will look suspicious, no matter how organic and non-paid they are.

3. Selling/buying banner ads the old way

Obviously, this is less of a problem for people reading SEO/SEM news. Most webmasters, however, may still buy and sell banner ads without giving a thought to nofollowing links behind them. But then, although they believe they’ve only traded banners, in Google’s view they’ve just traded paid links.

What do the three issues listed above have in common? They are all perfectly legitimate marketing, yet they have become risky to employ because an industry leader could not sort out its problems on its own. Instead, it used its muscle to shift the responsibility. Now that’s a slippery slope.

A penalized site in Google Webmaster Tools

November 23, 2007

Google’s Webmaster Tools is supposedly a communication channel between webmasters and the search engine. As Vanessa Fox said back when it was just called Sitemaps: “Our goal is two-way communication between Google and webmasters.”

Although anyone who uses the tool is probably already aware that this goal was not achieved, a rundown of how a penalized site is represented in Webmaster Tools is perhaps interesting.

The site:

  • lost its rankings several months ago (outside top 30 for its own name);
  • lost its visible PageRank later;
  • is still included in the index and receives traffic from Google.

Message Center

Empty, ever since it appeared. Probably the small inconvenience of being penalized out of sight was not deemed important enough to communicate about.

Top search queries / Top clicked queries

These tables are the only segment of Webmaster Tools that actually mirrors the woeful state of the penalized site. Mostly uninspiring queries with positions #5-10, and a couple of relatively exciting ones, unfortunately with positions #40-70.

Crawl stats

In spite of visible PR being comprehensively zeroed out, PageRank distribution is diligently reported to this very day, as is “Your page with the highest PageRank”. This last data point can be out of sync with toolbar PR for an extended period even in the case of unpenalized sites.

Pages with internal links

An interesting aspect is that not a single page that was created (and internally linked to) after the penalty kicked in shows up in this list. Basically, Webmaster Tools does not acknowledge the existence of any new pages since that point, even though those pages got crawled and indexed. Helpful, indeed.


Sitemaps

Re-submitted sitemaps still get downloaded instantly and quite regularly thereafter.

Set crawl rate

Googlebot activity is as strong as ever. Not that it mattered even before the penalty, anyway. Google’s cache offers 2-8-week-old copies, although “Number of pages crawled per day” implies that, on average, the whole site is re-crawled every 2-3 days.

Other data

Pages like “What Googlebot sees”, “Pages with external links”, “Diagnostics” etc. are regularly updated.

Business as usual, according to Webmaster Tools

The complete lack of communication about a penalty is especially frustrating when a site owner does not even know why the site was pushed back in Google. Almost every bit of information that comes from the console suggests all is OK, when the opposite is painfully evident in the SERPs and traffic charts.

Google obviously won’t want to educate spammers, but being in Webmaster Tools is a form of cooperation. So why not give webmasters at least a hint? Is it a technical issue? Is it suspicious-looking link patterns?

Without the slightest help from Google, an honestly operated site that happened to be the “statistically acceptable amount of collateral damage” during an algorithmic spambusting initiative might be left out in the cold forever.

Will search ever become more relevant?

November 20, 2007

If we put aside public relations and self-delusion originating from search engines themselves, what is the probability of search engine results getting more relevant in the coming years?

Not much.

Even as they are telling us the best is yet to come and search will only get better, the “best” may in fact be behind us.

The end of the nineties and the beginning of the new millennium were perfect for Google. They already had huge “firepower” (their most potent algorithms, their most important ideas) while the enemy was puny: a pathetic bunch of metataggers, keywordstuffers and invisibletexters. And not that many people realized the potential of search.

Now the same battlefield is a trillion-dollar business opportunity. No matter how many PhDs they line up, ten times as many in the other camp will probe their algorithms, look for loopholes, use brute force and devise sophisticated assault tactics.

All search engines can realistically hope for is to keep up in this arms race. Of course, “keeping up” will come to us in the guise of exciting buzzwords, feverishly discussed new patents and proudly touted innovative approaches, but for all this, the SERP will not be a nicer, cleaner, more useful and relevant place than it is now. Not any more.

Maybe it will in highly visible areas, where hand-editing is feasible, and in low-value areas as well. But wherever the race is for real money (and that’s a sharply increasing segment), it will only get messier.

Which, in turn, is going to make search engines more desperate. And being more desperate will inevitably result in less ethics, less easily applied simple truths, less straightforward optimization techniques – less safety for any site.

The current Google uproar over paid links is just the first stage of this new era.

Google Analytics throws useful data out the window

November 19, 2007

A strange neglect causes an irretrievable data loss in Google Analytics.

One of the most important and exciting metrics in web statistics is knowing what keywords brought visitors to your site. For this, an analytics application needs to recognize search engine domains and extract the query from their referrer strings.

Example URL: http://www.google.com/search?q=viral+marketing

All that should be noted in this string is that:

1. it is a search engine domain name
2. it includes keywords after “q=”
3. keywords are separated by a “+” character

It’s that simple. But for some reason, the Analytics team does not find the time to keep this hugely complicated “AI” regularly updated in their software.
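For comparison, here is roughly what that recognition amounts to in code – a minimal sketch, with a deliberately tiny lookup table standing in for the real list of engines and parameters:

    from urllib.parse import urlparse, parse_qs

    # A few search engines and the query parameter each one uses.
    SEARCH_ENGINES = {
        "www.google.com": "q",
        "search.yahoo.com": "p",
        "search.live.com": "q",
    }

    def keywords_from_referrer(referrer):
        """Return the search keywords in a referrer URL, or None for plain referrals."""
        parsed = urlparse(referrer)
        param = SEARCH_ENGINES.get(parsed.netloc)
        if param is None:
            return None  # not in the table: Analytics would label it "referral"
        values = parse_qs(parsed.query).get(param, [])
        return values[0] if values else None

    print(keywords_from_referrer("http://www.google.com/search?q=viral+marketing"))
    # prints: viral marketing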

When you see the label “organic” next to a referring site, it shows Google Analytics recognizes that site as a search engine and extracts keywords from related strings.

Google, Yahoo and Live are recognized as search engines

They even know about some of their partner sites:

Traffic from the search function of CNN is recognized as search engine traffic

But when the label says “referral”, the referring string is not considered to belong to a search engine, so everything after a “?” gets discarded. (Which is stupid in itself, but that is another matter.)

Interestingly, Google doesn’t recognize many of its own domains as belonging to a search engine.

Google domain names not recognized as belonging to a search engine

So what exactly those German, Spanish and French users searched for will remain a mystery. Googlr.com is also their domain for catching mistyped traffic, but the Analytics staff weren’t told about this, either. And you won’t know what image searches led people to your site.

Google Image Search is not recognized as search, either

International websites fare even worse because country-specific search engines are comprehensively ignored. A site that derives a lot of its traffic from local search engines will never see a significant amount of its keyword data in Analytics.

100% indexing and free site search should make a dent in Google’s domination

November 17, 2007

Newsweek has run a cover story on Google, saying ‘Once hailed as a world beater, the internet colossus now has real rivals all over the world’ (Searching for the Best Engine, Nov. 5, 2007).

Real rivals? A curious statement, more like wishful thinking, just at a time when Google is transforming into a truly omnipresent giant with no serious competitors.

New, upcoming search engines seem to be more about self-hyped technological twists smartly sold to investors and easily swayed journalists than about being useful search alternatives. If only the same smartness showed in selling their services to the public.

One idea that somehow never occurred to any of them is promoting a general-purpose search engine on the back of free, thorough, up-to-date site search.

Many sites use Google as an internal search results provider even though this is a less than perfect solution because most websites are not fully included in Google’s index. In other words, those sites willingly integrate a site search solution that may or may not bring up a certain page.

When there is a niche like that, any aspiring new search engine should offer webmasters an “Add our search box” option with the following guarantee:

If you add our search box to your site, we will regularly crawl and index all of your pages, so your site can have a 100% reliable, constantly updated site search function and will also be fully represented in our general index.

This would be a marketing move that Google could not neutralize. They just wouldn’t be able to offer the same deal, as they’d perceive it as an all-too-easy way into their sacrosanct index in return for something they almost take for granted. After all, Google already has its search box on hundreds of thousands of sites.

But for a new player, whose main objective should be building awareness, such a guarantee would constitute a fairly cheap marketing tool. (And the chance to give us something useful instead of fancy features mainstream users don’t give a damn about.)

[I was wrong. Google did manage to find a middle road to offer a usable site search without providing an easy way into their index, called On-Demand Indexing:

“This feature allows site owners to update search results on their website On-Demand, by adding them to a special, separate index for their own site. Pages indexed with On-Demand Indexing do not impact the ranking or indexing of pages on Google.com.”]

Managing traffic source proportions in Google Analytics

November 14, 2007

It is known that Google actually uses Analytics data in rankings.

(Eric Enge: Right, indeed. What about the toolbar data, and Google analytics data? Matt Cutts: Well, I have made a promise that my Webspam team wouldn’t go to the Google Analytics group and get their data and use it. Search quality or other parts of Google might use it… – Eric Enge Interviews Google’s Matt Cutts)

And if they do use it, they certainly look for quality signals in it, both negative and positive, that may influence how they assign a value to and rank that particular site.

One such signal might be the distribution of traffic sources for that site. Does traffic come from several different types of sources? Does it come mainly from search engines? Mainly from Google?

If the latter is the case, it may create a ceiling for that site in terms of rankings.

After all, why would Google choose to promote a site even further when it already receives most of its traffic from them? A percentage like the one below basically says, If it wasn’t for us (i.e. Google), you wouldn’t even be in the game.

Google with a high share in All Traffic Sources

This is pure speculation, obviously. But it certainly won’t hurt to “dilute” that high concentration and push Google’s share down.
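Monitoring that concentration is easy enough once you export the Analytics figures. A quick sketch with made-up numbers:

    def source_shares(visits_by_source):
        """Return each traffic source's share of total visits, as a percentage."""
        total = sum(visits_by_source.values())
        return {source: round(100 * count / total, 1)
                for source, count in visits_by_source.items()}

    visits = {"google / organic": 6400, "direct / (none)": 1200,
              "yahoo / organic": 500, "other referrals": 900}
    print(source_shares(visits))
    # {'google / organic': 71.1, 'direct / (none)': 13.3, 'yahoo / organic': 5.6, 'other referrals': 10.0}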

One effective way to dilute it is to advertise the site offline. That would bring in direct traffic, and direct traffic is probably considered a positive signal. Why? Because it includes bookmarks and type-in traffic, which usually go to authority sites, quality content and strong brands.

Funnily enough, even the Live.com referral spam issue may be helpful in this respect, as it reduces Google’s share of the traffic source pie. With the added benefit (for Microsoft) that it not only obfuscates statistics for webmasters, it does the same for Google’s own Analytics data.