By David Hunter, CEO of Epic Web Studios and ASAPmaps in Erie, PA. He also co-founded dbaPlatform, a local SEO software.
Suppose you’ve just composed the most objectively useful, engaging and brilliant web content ever. Now suppose that content remained unseen and unheard of, never once appearing in search results. While that may seem unconscionable, it’s exactly why you cannot overlook website indexing.
Search engines like Google love delivering the good stuff just as much as you love discovering it, but they cannot serve users results that haven’t been indexed first. Search engines constantly add to their colossal libraries of indexed URLs by deploying scouts called “spiders,” or “web crawlers,” to find new content.
How Web Crawlers Index Content
Even for spiders, the web is a lot to navigate, so they rely on links to guide their way, pointing them from page to page. In particular, they've got their eyes on new URLs, sites that have undergone changes and dead links. As the web crawlers come across new or recently altered pages, they render them much like a web browser would, seeing what you see.
However, whereas you might skim the content quickly for the information you need, the crawlers are much more thorough. They scan the page from top to bottom, creating an index entry for every unique word. Thus it's possible that a single web page could be referenced in hundreds (if not thousands) of index entries!
Getting To Know Your Crawlers
At any given time, there may be hundreds of different spiders crawling the internet, some good and some bad (e.g., those looking to scrape email directories or collect private information for spamming purposes). But there are a handful you want to be particularly aware of.
• Googlebot (Google)
• Bingbot (Bing)
• Slurp (Yahoo)
• Facebot (Facebook external links)
• Alexa crawler (a.k.a. ia_archiver, the crawler of Amazon's Alexa Internet)
Give Crawlers Guidelines With Robots.txt And Meta Directives
There may be situations where you do not want certain pages indexed, such as:
• Those that would not make quality landing pages from search (e.g., a “thank you” page for form submissions, a promo code reveal page)
• Those intended for internal use only (testing or staging purposes)
• Those containing private or personal information
What’s more, Googlebot and other prominent spiders have crawl budgets built into their programming — they’ll only crawl so many URLs on your site before moving on (although it should be noted that crawl budgets are massive compared to what they once were).
So as a site administrator, not only do you want to lay down some rules, you also want to set some priorities (crawl budget optimization). There are two primary ways you can do this: robots.txt files and meta directives.
A robots.txt file tells web crawlers where they should and should not go on your website — although not all of them will listen. To access it, just add /robots.txt to the end of your URL (if nothing pops up, you don’t have one). The basic syntax of a robots.txt instruction is very simple:
1. User-agent: [insert the name of the user-agent (i.e., the crawler/spider/bot you want to address here) — to address all of them, use an asterisk (*)]
2. Disallow: [insert the URL string you'd rather the crawler not visit — a standalone forward slash (/) tells crawlers not to crawl your site at all]
“Disallow” is the most common instruction you’ll give in robots.txt, but you can also suggest a “Crawl-delay” (the number of seconds you want the crawler to wait between requests — honored by some crawlers, though Googlebot ignores it), “Allow” an exception within a disallowed URL string (supported by Googlebot and some other major crawlers) or submit an XML “Sitemap” containing your website’s most crucial URLs — a key to crawl budget optimization.
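Putting those pieces together, a simple robots.txt file might look like the sketch below. The directory names and sitemap URL are hypothetical placeholders, not recommendations for any particular site:

```
# Keep all crawlers out of the staging area and thank-you page
User-agent: *
Disallow: /staging/
Disallow: /thank-you/
Crawl-delay: 10

# Carve out one exception for Googlebot within the blocked directory
User-agent: Googlebot
Allow: /staging/preview-page/

# Point crawlers to the sitemap of your most important URLs
Sitemap: https://www.example.com/sitemap.xml
```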
Robot meta directives (a.k.a. meta robots tags) tell web crawlers what they can and cannot do in regard to indexing — although, again, malicious bots may disregard them. Because a directive is written into the code of a web page, it’s more a demand than a suggestion. Using various parameters, website administrators can fine-tune whether or not (or for how long) a page is indexed, whether its links are followed, whether a search engine can pull snippets and more.
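As an illustration, these directives live in a page's <head> section. The examples below are a sketch of common parameters; the right combination depends on your page:

```html
<!-- Tell all crawlers not to index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- Or address one crawler by name: let Googlebot index the page
     but prevent it from showing a snippet in search results -->
<meta name="googlebot" content="nosnippet">
```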
Is Your Site Getting Indexed?
These are the most common reasons why your site might not be getting indexed:
• Your robots.txt file or meta tags are blocking the crawlers.
• It’s brand new — for example, Googlebot can take anywhere from weeks to months to index a new site, depending on the size.
• It’s not linked to from anywhere else on the web.
• The site’s navigation is difficult to follow.
• Your site has been flagged for black hat SEO tactics.
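If you suspect the first culprit, you can test a robots.txt rule set against a specific crawler using Python's standard-library parser. This is a minimal sketch — the rules and URLs are hypothetical examples, not pulled from any real site:

```python
# Check whether a given crawler may fetch a URL under a robots.txt rule set,
# using Python's built-in urllib.robotparser (no third-party packages needed).
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; in practice you could fetch your live
# file with parser.set_url("https://yoursite.com/robots.txt"); parser.read()
robots_txt = """\
User-agent: *
Disallow: /thank-you/
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A normal blog post is crawlable...
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
# ...but the "thank you" page is blocked for all user-agents.
print(parser.can_fetch("Googlebot", "https://example.com/thank-you/"))  # False
```

If a page you expect to rank returns False here, your robots.txt is the reason it isn't being crawled.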
How To Make Your Website More Crawlable
Here are some ways to make indexing work better for your site.
Since links are the crawler’s primary mode of transit, ensure your site has clear navigation pathways. If you want something to be indexed, it absolutely must be linked to from somewhere else on the site — at a bare minimum the main navigation menu, but ideally from other relevant, related pages throughout the site.
Submit a sitemap.
Link your sitemap in the robots.txt file, and submit it through Google Search Console. From the Search Console control panel, site owners can get very specific about how they want Googlebot to crawl their pages. Depending on the size of your website, you can have your CMS generate your sitemap for you, build it manually or have it generated automatically by third-party software.
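For reference, an XML sitemap is just a list of <url> entries following the sitemaps.org protocol. A minimal sketch with hypothetical URLs:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2023-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/services/</loc>
  </url>
</urlset>
```

Only the <loc> tag is required for each entry; optional tags like <lastmod> help crawlers prioritize recently updated pages.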
How To Check Indexed Pages
To see the pages Google has already indexed, simply query “site:[your domain name]” — this will list your indexed URLs in search results. It’s a good way to spot anything important that’s missing — or anything unnecessary that slipped in. Check it every so often after changes are made to ensure Google is seeing exactly what you want it to see.