The rel="canonical" tag has been a godsend for many of us SEOs. In case you have been having a real life instead of devoting your time to search, here is a quick breakdown of how it is used:
You have two pages (blue-widgets.html and pink-widgets.html). Both of them are identical other than the colour.
These two pages are duplicates as far as Googlebot is concerned. While duplicate pages don't actually incur a penalty, it is unclear which should be indexed (and only one will be). The result is generally that they swap places in the SERPs while steadily falling in rank. It's not a penalty, but it sure feels like one if you don't know what's happening.
Adding a rel="canonical" tag to both pages, pointing to just one of them, tells Google which one you would prefer to be indexed. So the solution in this case is a tag reading:
<link rel="canonical" href="http://www.example.com/pink-widgets.html" />
and placing it in the <head> section of both pages.
Now, when search engines crawl either page, they know to index only the pink-widgets.html page. The result is better ranking and no jumping about swapping places.
Rel="canonical" doesn't always work
I’m not talking about improper implementation. I’m assuming that everything is as it should be.
I had a case recently where a site had its products as top-level orphans – nearly. The products quite logically belonged in multiple categories, which would result in a situation like the one in diagram 1 below:
One solution is to place the product as a top-level orphan, as shown in diagram 2 below:
The other solution is to choose a single category path and use canonical tags on the product pages to indicate that the product on that path is the one to be indexed.
Still doesn't sound like a problem, right? The company in question had implemented the top-level orphan method in diagram 2, which would be my preferred method. There would have been no problem had they not then decided to add breadcrumbs.
Why were breadcrumbs a problem?
All of a sudden they needed to pass the user's route information from cat-1 or cat-2 to prod.html. The most reliable way to do that was to add the information to the URL, so now we had a structure that looked like diagram 1, but with a URL of "home/prod1.html" on the listing from cat-1 and "home/prod2.html" on the listing from cat-2 – two URLs serving the same product.
This requires us to use canonical tags to prevent duplicate content issues. You would implement canonical tags just like you would for diagram 1.
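The mapping can be sketched roughly like this – a minimal sketch, assuming each category route gets its own numbered copy of the product URL (prod1.html from cat-1, prod2.html from cat-2) and the canonical version is the un-numbered orphan URL (the names and URL scheme are illustrative, not the site's actual code):

```python
import re

# Assumption: route-specific variants differ only by a trailing digit
# before ".html"; the canonical page is the un-numbered orphan URL.
ROUTE_SUFFIX = re.compile(r"\d+(?=\.html$)")

def canonical_url(url):
    """Map any route-specific variant of a product URL to its canonical form."""
    return ROUTE_SUFFIX.sub("", url, count=1)

def canonical_tag(url):
    """The <link> element to place in the <head> of every variant."""
    return '<link rel="canonical" href="%s" />' % canonical_url(url)
```

Every variant then emits the same tag, e.g. `canonical_tag("http://www.example.com/prod2.html")` yields a link pointing at `http://www.example.com/prod.html`.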
Still no problem right?
Why you can’t rely on canonical tags
Canonical tags are a “serving suggestion” to search engines. They promise to try to honour them, but reserve the right to override them. This is great if you make a mess of implementing them. Google in particular should notice and then ignore the tags.
However, in this case there was no error in the implementation of the canonical tags, and Google for the most part honoured them. Sometimes, however, they didn't.
The result was inconsistent: pages were bouncing around all over the place. It was hard to gauge the extent of the problem too, because as one canonical fixed itself, another would break. The only way to find them was to check which URL was indexed, and with a lot of products that's a difficult and frustrating task.
The problem was compounded by the products fitting into not just two, but multiple categories. In fact, after crawling the site it became apparent that 75% of the site was product pages with canonical tags pointing to a different product page.
It would appear that the sheer volume of canonical tags was the problem. I haven’t found this happening on any other site so I’m not sure if it is volume or percentage of site that is the problem. Maybe a bit of both.
What it does mean is that, where possible, you are better off implementing the model in diagram 2 rather than relying on canonical tags. Canonical tags should be a last resort rather than the first resort I so often see.
There are a few possible solutions. None of them are ideal.
One possible solution would be to remove breadcrumbs from just the product pages. I’ve seen that done, but I don’t like it.
Another way would be to pass the information in a session. Users whose browsers block the cookies that sessions typically rely on will get nothing, though.
You could use the HTTP referer. This too can be problematic: browsers and some anti-virus software can prevent it from being passed. Also, if you have a site that is part HTTPS and part HTTP (which you really shouldn't, by the way), the referer information will not pass from an HTTPS page to an HTTP page. Navigation within an HTTP site, or between HTTPS pages, is fine; it is the HTTPS-to-HTTP step that strips the referer.
Finally you could change the type of breadcrumb to attribute crumbs instead of location crumbs. This wouldn’t have worked in this case though.
What we did
In our case, using the HTTP referer was the easiest fix, since the coding didn't have to change radically from what we started with. We prevented the loss of a visible breadcrumb for users whose browsers block the referer by putting in a default crumb matching the original canonical.
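The fallback described above can be sketched like this – the category paths and labels here are assumptions for illustration, not the site's real structure: build the crumb trail from the referer when it arrives, and fall back to a default trail matching the canonical path when it doesn't.

```python
from urllib.parse import urlparse

# Illustrative only: hypothetical category paths and labels.
CATEGORIES = {"/cat-1/": "Category 1", "/cat-2/": "Category 2"}
DEFAULT_CRUMB = ["Home", "Category 1"]  # matches the original canonical path

def breadcrumb(referer=None):
    """Build the crumb trail from the HTTP referer, falling back to the
    default (canonical) trail when the header is missing or stripped."""
    if referer:
        path = urlparse(referer).path
        for prefix, label in CATEGORIES.items():
            if path.startswith(prefix):
                return ["Home", label, "Product"]
    return DEFAULT_CRUMB + ["Product"]
```

A visitor arriving from cat-2 sees "Home > Category 2 > Product", while one whose browser strips the referer still sees the default "Home > Category 1 > Product" trail.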
The result is a site that is only a quarter of the size, so it’s a good housekeeping exercise as well as making it impossible for Google to get confused about which product page to index.