Now that you’ve built your ship all you have to do is hoist the flags and set sail into the Google Kingdom.
However, to get there faster, you’ll need to equip your vessel with a robots.txt file, through which you can send instructions and directions to crawlers.
If you don’t know what it is, or you want to know why it’s so important and how to use it, rest assured: I will guide you to this other treasure.
Robots.txt file: definition
As you already know, in the realm of Google, crawlers, small vessels in the service of the search engine, sail day and night to reach new ships like yours.
Once they reach a new site, they take time to scan it, then index it, and then respond appropriately to users’ questions.
So, search engines perform two main jobs:
- Crawling the web – to discover new or updated content;
- Indexing – placing that content within Google’s index so that it can be offered to users searching for information.
After boarding your website, and before crawling it, the crawler will search the hold of your ship for a robots.txt file. Once found, it will read the file before moving on to the pages.
The robots.txt file, which is part of the Robots Exclusion Protocol, contains rules set by webmasters on how the site should be crawled: the information it holds tells crawlers what they should or should not scan.
In other words, it’s a file whose directives block crawling: imagine coming up against a ship flying a pirate flag; any living soul with half a brain would go no further and change course immediately!
On the contrary, if the robots.txt file does not contain any directive, the small boat will proceed to scan all the pages and other information on the site.
When to use a robots.txt file
Because robots.txt files affect how search engines see your site, and thus how they can present it to users, they are closely related to SEO.
The file is mainly used to manage crawler traffic to your site and, depending on the type of file, to keep content out of Google:
- a web page;
- a multimedia file;
- a resource file.
This process will help improve crawling and, in turn, indexing of your site: so put down the rum if you want your sails to be spotted by adventurers!
The file is used for web pages as it allows you to:
- manage crawl traffic to prevent the server from being overloaded with requests from the crawler;
- prevent crawling of pages that are similar or have duplicate content in order to avoid a penalty from search engines;
- avoid crawling unimportant pages of your site;
- better manage the crawl budget, which is the number of URLs Googlebot can and will crawl.
Also to manage traffic, you can use the file to specify the location of your site’s sitemap, a map that tells the small boats which route to follow and where to land during the crawl.
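A single line in the file is enough to point crawlers to the sitemap (the domain below is just an illustration):

```
# The Sitemap directive takes a full URL (hypothetical domain)
Sitemap: https://veliero.com/sitemap.xml
```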
Keep in mind that even if search engine bots respect an instruction not to crawl specific web pages, those pages may still be indexed if other sites link to them. Is that clear?
On the other hand, it is also possible that bots ignore the file entirely. For these reasons, when you want to keep a page with sensitive data out of the index, I recommend using password protection or a noindex directive.
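A noindex directive, for instance, is a short tag placed in the page’s head (a sketch; the page it protects is hypothetical):

```
<!-- Keeps this page out of the search index, even if other sites link to it -->
<meta name="robots" content="noindex">
```

For non-HTML files, the server can send the equivalent `X-Robots-Tag: noindex` HTTP header instead.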
For all the buccaneers! Please do not use the robots.txt file to hide a page containing confidential information because the robots.txt file is publicly accessible.
So, the file can be used to give some rules to crawlers but not to hide your pages from the Google Kingdom.
You can also use the robots.txt file to manage crawl traffic and to prevent image, video, and audio files from appearing in search results. Keep in mind, however, that it will not prevent other users or pages from linking to your media files.
You can use the robots.txt file to manage crawl traffic and block resource files, such as unimportant images, scripts, or style files, but only if you believe the pages will not be significantly affected by loading without those resources.
However, if the absence of these resources makes the page harder for the crawler to understand, you should not block them; otherwise the search engine cannot properly analyze the pages that depend on them.
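As a sketch, such rules might look like this (the folder names are assumptions for illustration):

```
User-agent: *
# Block purely decorative images that add nothing to the page content
Disallow: /assets/decorative/
# Keep stylesheets crawlable so the pages can be rendered correctly
Allow: /assets/css/
```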
Remember how the crawlers change course at the sight of a pirate flag? Make sure you don’t block the whole site, or any content you want crawled and indexed.
Finally, if there are no sections of your site where you want to control small-craft access, you may not need a robots.txt file at all.
Some tips on how to write a robots.txt file
Before you start writing the file, I suggest you rummage through the hold of your ship and see whether one already exists. To do this, just type your main domain followed by /robots.txt, for example: latuanave.it/robots.txt.
At this point, if you have discovered that you don’t have a robots.txt file, or you want to modify yours, creating one is a simple process. You just need to follow a few of my tips.
First of all, you can create it with any program capable of producing a valid text file, such as Notepad, one of the online generators, or the Yoast SEO plugin directly from WordPress.
If you feel like you’re sinking among these words, don’t panic: you can ask for help from a crew of pirates who have been sailing these seas for a long time and who will know how to show you the right way forward.
Contact us and we will help you face this journey.
When you create the file for your website, make sure you name it exactly robots.txt, otherwise bots won’t be able to recognize it.
Therefore, use only lowercase letters and do not add any characters or symbols.
The codes to use
A robots.txt file consists of one or more rules: each rule blocks (or allows) access for a given crawler to a specified file path.
The codes that can be used within the file to provide instructions to search engines are basic and are divided into:
- user-agent – indicates the bots to which the instructions are addressed (all of them, or specific bots);
- disallow – introduces the list of pages or sections that crawlers should not visit;
- allow – indicates a directory or page that the mentioned crawler should crawl;
- sitemap – must be a full URL and indicates the location of the website’s sitemap.
In short, these rules permit (allow) or prohibit (disallow) the crawling of specific pages by certain (or all) crawlers.
I’m going to show you an example now, so be very careful!
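Based on the description below, the file would look something like this (a sketch; the exact sitemap URL is an assumption):

```
# Googlebot may not crawl /tuastiva/ or any of its subdirectories
User-agent: Googlebot
Disallow: /tuastiva/

# All other crawlers may access the entire site
User-agent: *
Allow: /

# Full URL of the sitemap (assumed location)
Sitemap: https://veliero.com/sitemap.xml
```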
In this robots.txt, my crew and I asked the Googlebot not to scan the https://veliero.com/tuastiva/ folder and any subdirectories to prevent everyone from discovering our secret rum stashes.
Next, it is specified that all other crawlers (*) are allowed access to the entire site (/).
Finally, the exact location of the sitemap.xml file is specified.
In summary, the robots.txt file will be very important for your ship if you want to manage crawler traffic to it and exclude a page or media file from Google, all to improve crawling and, in turn, site indexing.
However, to accomplish this you need to take into account the suggestions I’ve given you: stay the course in these safe waters and your ship and crew will be grateful!