Shopping websites often let users search for items and filter them by color, size, price, type, and so on. Typically, each filter query returns a page whose URL has a parameter such as “color=5” appended to it (as in ww.samplewebshop.com/skirts/?color=5).
This means that if Google indexes that page, it might rank for “color 5”, which is pretty useless. It also means that the website might be generating millions of URLs from different filter combinations, which makes your website hard to crawl. Google will only crawl a portion of your site, not all of it, depending on factors such as your site’s PageRank. If it spends that effort on pages that bring you no traffic, Google’s crawl allocation (its crawl budget) for your site is wasted. Another problem is that your filters may combine the same filter queries in different orders, producing different URLs for the same content. Google will treat these as duplicate content, which can split ranking signals across the duplicates and hurt your site’s visibility.
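To make the duplicate-URL problem concrete, here is a minimal Python sketch that puts filter parameters into one fixed order, so that two URLs with the same filters collapse to a single form. The function name normalize_filter_url and the size parameter are assumptions made for illustration; they are not taken from any particular shop platform.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_filter_url(url: str) -> str:
    """Return the URL with its query parameters sorted by name, then value.

    Hypothetical helper: two filter URLs that differ only in parameter
    order normalize to the same string.
    """
    parts = urlsplit(url)
    # parse_qsl keeps repeated parameters; sorting gives one stable order.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )


# The same two filters applied in a different order.
first = normalize_filter_url("/skirts/?color=5&size=2")
second = normalize_filter_url("/skirts/?size=2&color=5")
print(first == second)
```

Generating links in one normalized order is only part of the fix, since any variants that already exist still have to be dealt with as described below.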
How to Fix It
Before you start, determine which pages are important to you. Which pages do you want Google to index? The answer depends on the list of keywords that your site is trying to rank for. After that, cross-check all of the query attributes in your database against your principal keyword list.
Once you see which of your keywords are related to the query attributes, your next task is to determine which of your core keyword pages are being reached through organic searches that include the query attributes in your database.
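The cross-check described above could be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the function name, the idea that you have exported a plain list of organic search queries, and the simple word matching, which a real keyword analysis would likely refine.

```python
def match_attributes_to_keywords(search_queries, core_keywords, attribute_values):
    """Map each core keyword to the filter values searched alongside it.

    Hypothetical helper: a filter value counts as a match when it appears
    as a whole word in a query that also contains the core keyword.
    """
    matches = {}
    for query in search_queries:
        text = query.lower()
        words = set(text.split())
        for keyword in core_keywords:
            if keyword.lower() not in text:
                continue
            for value in attribute_values:
                if value.lower() in words:
                    matches.setdefault(keyword, set()).add(value)
    return matches


# Made-up sample data, not real search statistics.
queries = ["pink affordable skirts", "affordable skirts", "blue jeans"]
result = match_attributes_to_keywords(
    queries, ["affordable skirts"], ["pink", "blue"]
)
print(result)
```

Each keyword and filter value pair that comes out of a check like this is a candidate for its own indexable landing page.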
If, for example, you find that your core keyword “affordable skirts” is being searched together with the term “pink”, your site needs a landing page for “pink affordable skirts”, and Google should obviously index that page. You also need to change the URL to make it more descriptive. Instead of the code “color=5”, it’s better to use “pink” as the filter part of the URL (which might now look like ww.samplewebshop.com/skirts/pink).
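As an illustration of that URL change, the following Python sketch turns a coded filter URL into a descriptive path. The COLOR_NAMES lookup table and the function name are hypothetical; a real shop would read the mapping from its own attribute database.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical mapping from internal color codes to readable names.
COLOR_NAMES = {"5": "pink"}


def friendly_filter_path(url: str):
    """Return a descriptive path for a single-color filter URL, else None."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Only rewrite when color is the sole filter and its code is known.
    if set(params) != {"color"} or len(params["color"]) != 1:
        return None
    name = COLOR_NAMES.get(params["color"][0])
    if name is None:
        return None
    return parts.path.rstrip("/") + "/" + name


print(friendly_filter_path("/skirts/?color=5"))
```

Returning None for every other combination keeps the rewrite limited to the pages you have decided are worth a landing page.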
You also need to tweak a few things to make sure this page is among those that Google’s bots index. Once you can see which URLs you want to rank and which ones you don’t, this step has more direction. The next problem is knowing whether the pages you do not want indexed have already been indexed. Hopefully they haven’t been, in which case you can simply block them in your site’s robots.txt file using pattern matching (robots.txt supports the * and $ wildcards rather than full regular expressions).
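As a rough sketch of the robots.txt approach, the Python below generates Disallow rules for filter parameters. It relies on the * wildcard, which Google honors in robots.txt; full regular expressions are not part of the format. The function name and the choice to block every crawler with User-agent: * are assumptions for illustration.

```python
def build_robots_txt(blocked_params):
    """Build robots.txt text that blocks URLs carrying the given parameters.

    Hypothetical helper. Two rules per parameter: one for when it is the
    first query parameter (?name=) and one for when it follows another
    parameter (&name=).
    """
    lines = ["User-agent: *"]
    for name in blocked_params:
        lines.append(f"Disallow: /*?{name}=")
        lines.append(f"Disallow: /*&{name}=")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(["color"])
print(robots_txt)
```

With these rules, a coded URL such as /skirts/?color=5 is blocked, while a descriptive landing page such as /skirts/pink stays crawlable because it carries no query parameter.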
But that won’t do any good if the page has already been crawled and indexed. If so, your other course of action is to use the rel=canonical tag. This is perhaps not the most elegant solution, as it means some pages will have the tag while others won’t. But it will do for now, since the pages are already indexed. Add the tag to the pages that you don’t want in Google’s index, pointing each one to the URL you do want indexed. Google has to be able to crawl a page to see the tag, so do not block those same pages in robots.txt.
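To illustrate, the canonical tag is a single link element placed in the head of each unwanted page. This Python sketch builds it; the function name is hypothetical, and the https scheme and www host in the example URL are assumptions based on the sample shop used in this article.

```python
from html import escape


def canonical_link_tag(preferred_url: str) -> str:
    """Return the link element that names the preferred URL for a page."""
    # Escape the URL so characters such as & and " are safe in an attribute.
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'


# A filtered page such as /skirts/?color=5 could point at the main category.
tag = canonical_link_tag("https://www.samplewebshop.com/skirts/")
print(tag)
```

Which URL each filtered page should point to is a judgment call: usually it is the unfiltered category page or the descriptive landing page that shows the same products.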