Robots.txt & meta robots: two techniques that appear to accomplish a similar task, but in practice they behave very differently!
Robots.txt was my very first introduction to blocking pages from Google: it required a relatively small change to implement and it was very easy to explain when working with third-party developers. It's too easy to use lazily, though, and robots.txt became known as the "lazy option". The apparent knee-jerk reaction? Meta robots "noindex, follow" everything — something made much easier by plugins like Yoast (for the WordPress users out there).
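For reference, the two mechanisms look like this (example.com and the /filter/ path are placeholders, not recommendations):

```
# robots.txt -- stops compliant crawlers from fetching matching URLs at all
User-agent: *
Disallow: /filter/
```

```html
<!-- meta robots, in the <head> of each page: the page can still be
     crawled, but is kept out of the index while links are followed -->
<meta name="robots" content="noindex, follow">
```

The crucial difference: robots.txt controls *crawling*, meta robots controls *indexing* — which is exactly the distinction the rest of this post hangs on.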
The result? Neither technique being used where it was actually the right choice. Nightmare!
The problem? Many people I've worked with never really understood what problem they actually needed to solve, and therefore couldn't choose the right solution.
Choosing correctly, however, can be done really easily.
Crawl or Index Issue?
When deciding how to block or limit Google (or other search engines) on your website, this is truly the most essential question to ask. Do you have a problem with crawl budget? Or with being indexed?
Put even more simply: is Google seeing too little of your website, or too much?
I've got to be somewhat careful here, as I'm oversimplifying a lot; in reality, identifying & fixing these problems is a fairly tricky process.
This is made worse by the fact that these issues normally affect very large websites. To rectify the situation you have to make changes that impact a whole lot of pages, which means they take longer to show an effect. And it's worse news still if you get something wrong: putting it right again will take even longer!
You’ve been warned.
Is it a Crawl Budget Issue?
Much has been written about this over the past 12-18 months — Dawn Anderson or Barry Adams are great people to start with if you want a more comprehensive introduction. In a nutshell, you have a crawl budget issue when your website is x7-x10 larger than what is being crawled daily. Easy enough to describe, but:
- The vast majority of us don't understand this as well as we should
- Spotting your own crawl budget issues is quite hard
- Selling the idea of crawl budget to a client is easy… until they start demanding specifics
Saying that, they think you're spinning them a tall tale when you start talking about crawlers and robots. You're just the daft one then
— Dawn Anderson (@dawnieando) August 28, 2017
The good news? If you have under 1,000 pages you most likely don't have an issue — even Google should be able to manage that.
Spotting Crawl Budget Issues
Spotting whether you have a crawl budget issue is comparatively straightforward: start a crawl with your preferred software and look at where it gets stuck. If you've ever used Screaming Frog or similar, you'll see this most commonly with layered navigation, filter/sorting options and some event/calendar functionality.
Log files are one of the most sure-fire ways to determine whether you have a problem or not — you just need a log file analyser, Deep Crawl/Botify or some grep skills. The full process is beyond the scope of this post (read here if you're curious), but as above, you're looking for where Google spends its time when it doesn't need to.
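As a minimal sketch of the idea (the log lines below are fabricated, and real access log formats vary), you can count which paths Googlebot requests most often — crawl traps like filter URLs tend to dominate the top of the list:

```python
from collections import Counter

def top_googlebot_paths(log_lines, n=5):
    """Count the most-requested paths among Googlebot hits.

    Assumes common/combined log format, where the request is the
    first quoted section: "METHOD /path HTTP/1.x".
    """
    counts = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        try:
            request = line.split('"')[1]   # e.g. 'GET /shoes?sort=price HTTP/1.1'
            path = request.split()[1]
        except IndexError:
            continue                       # skip malformed lines
        counts[path] += 1
    return counts.most_common(n)

# Example with fake log lines:
logs = [
    '66.249.66.1 - - [28/Aug/2017] "GET /shoes?sort=price HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [28/Aug/2017] "GET /shoes?sort=price HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [28/Aug/2017] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(top_googlebot_paths(logs))  # [('/shoes?sort=price', 2)]
```

A dedicated log file analyser will do this properly (verifying Googlebot by reverse DNS, for instance), but this is the core of what you're looking for.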
For a "quick & dirty" way to work out whether you need to be worrying about crawl budget, Joost has you covered:
- Look at the average pages crawled per day in GSC Crawl Stats
- Take your site's total page count & divide it by the average crawled per day
- If you get 10, you have x10 more pages than Google is crawling daily: a fairly big crawl problem
So in the example above:
9,781 / 1,466 = 6.6 — almost x7 the daily crawl.
As a rough guide: if your site is over 1,000 pages and you're scoring over x5, you're going to want to investigate further.
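That quick check is trivial to script. A sketch, with the numbers from the example above hard-coded (yours would come from GSC and your crawler):

```python
def crawl_budget_ratio(total_pages, avg_crawled_per_day):
    """Joost's quick-and-dirty check: how many times bigger the
    site is than Google's average daily crawl."""
    return total_pages / avg_crawled_per_day

total_pages = 9781            # pages found by your own crawl
avg_crawled_per_day = 1466    # from GSC Crawl Stats

ratio = crawl_budget_ratio(total_pages, avg_crawled_per_day)
print(round(ratio, 1))        # 6.7 -- almost x7 the daily crawl

# the rough guide from above: >1,000 pages and a ratio over x5
if total_pages > 1000 and ratio > 5:
    print("Worth investigating further")
```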
Do You Have Index Issues?
This one is altogether easier to diagnose, using search operators (site:, inurl:, intitle: etc.) to see where essential, top-level duplication might be occurring.
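For example (example.com is a placeholder), queries along these lines surface duplication quickly:

```
site:example.com                       roughly how many pages are indexed
site:example.com inurl:filter          indexed filter/facet URLs
site:example.com intitle:"red shoes"   the same title duplicated at scale
```

If the first query returns far more results than the pages you actually want ranking, that's your first hint of index bloat.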
Most decent SEO audit tools can also get you close to this — when I'm feeling lazy I typically turn to Siteliner. The real key question is: is the site bigger than it needs to be? If the answer is yes and Google's index is more bloated than it should be, you've got an index problem!
Crawl/Index Issues Aren’t Mutually Exclusive
Having made this relatively easy so far, here's the curve-ball: you can have crawl & index issues at the same time.
Index bloat at scale can certainly lead to crawl problems (Google crawling more than it should), while if you have significant crawl problems, fixing other site problems, like duplicate content, will take longer and be harder to measure the effects of.
Scale is the real killer here: the larger the site, the more one will drive the other. Personally, I will fix index issues (largely duplicate/thin content) before crawl issues. The logic being that I'd rather bring the perceived level of content quality up than make Google crawl poor-quality content faster.
The above presumes you can't fix both at once, and comes with a whole big collection of "ifs" and "buts". It also ignores other methods of handling such problems: canonical tags, nofollow, GSC parameter handling & 301 redirects being the most likely candidates.
Picking the "Right Fix"
If you've just got a crawl problem, blocking the pages with robots.txt is the simple answer.
Likewise, if you've got a substantial index problem, "noindex, follow" via meta robots is going to be the best way to remove unwanted pages from Google without affecting PageRank flow.
In a practical situation, this means that if, for example, we needed to fix layered navigation that was causing crawl problems AND index problems, I'd initially use meta robots "noindex, follow" to make sure those pages were dropped from the index. Once those pages have been dropped, you can then block them with robots.txt (but not before).
Don't use robots.txt to block pages that you're noindexing with meta robots: if Google can't crawl the page, it won't take the robots directive into account.
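That two-phase sequence looks like this in practice (the /filter/ path is a placeholder). Phase one, on the layered-navigation pages, while they are still crawlable:

```html
<meta name="robots" content="noindex, follow">
```

Then, only after those URLs have dropped out of the index:

```
# Phase 2: now safe to stop Google wasting crawl budget on them
User-agent: *
Disallow: /filter/
```

The ordering matters because Google has to be able to crawl a page to see its noindex directive in the first place.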
A less frequently used method is to do both together, from within robots.txt. Using "noindex:" alongside "disallow:" rules should ensure the pages are noindexed and then blocked from the crawl. Some think this is less reliable; others disagree & have seen good results.
The question is the durability of this approach: John Mueller has previously advised against it. Various tests confirm that it works (such as this one), but I'd test/use it in moderation, and have a "plan B"!
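For completeness, the pattern looks like this (the /filter/ path is a placeholder; note that "Noindex:" has never been an officially documented robots.txt directive, which is exactly why its durability is in question):

```
# Unofficial directive -- works in some tests, but unsupported
User-agent: *
Noindex: /filter/
Disallow: /filter/
```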
There must be a much better way
I've been working on a flow chart to help simplify the above even further — click to expand, download it and have a look. Start at "do you have a duplication problem?" and follow the path until you reach a circle: the most relevant action for your circumstances.
We've done some fairly extensive testing with this, but every website is different, so if you've got any feedback — please let me know!