About Robots.txt Generator

What is a robots.txt file?

Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).

Let’s say a search engine is about to visit a site. Before it visits the target page, it will check the robots.txt for instructions. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.

Basic format:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
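
To make the basic format concrete, here is a minimal sketch that feeds a two-line rule set to Python's standard-library urllib.robotparser and asks whether two URLs may be crawled. The crawler name "ExampleBot", the example.com URLs, and the /private/ path are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules in the basic format: one User-agent line, one Disallow line.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a made-up crawler name; "*" applies to every crawler.
blocked = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
open_page = parser.can_fetch("ExampleBot", "https://www.example.com/blog/post.html")

print("private page crawlable:", blocked)
print("blog page crawlable:", open_page)
```

The URL under /private/ is refused because its path starts with the disallowed string, while the blog URL matches no rule and stays crawlable.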

Creating a robots.txt file

You can create a new robots.txt file with any plain text editor.

If you already have a robots.txt file, delete its contents but keep the file itself.

First, you’ll need to become familiar with some of the syntax used in a robots.txt file.

Google has a nice explanation of some basic robots.txt terms.

How does robots.txt work?

Search engines have two main jobs:

  1. Crawling the web to discover content;
  2. Indexing that content so that it can be served up to searchers who are looking for information.

To crawl sites, search engines follow links to get from one site to another — ultimately, crawling across many billions of links and websites. This crawling behavior is sometimes known as “spidering.”

After arriving at a website but before spidering it, the search crawler will look for a robots.txt file. If it finds one, the crawler will read that file first before continuing through the site. Because the robots.txt file contains information about how the search engine should crawl, the information found there will instruct further crawler action on this particular site. If the robots.txt file does not contain any directives that disallow a user-agent’s activity (or if the site doesn’t have a robots.txt file), the crawler will proceed to crawl other information on the site.
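
The decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine is implemented: the function name crawlable, the crawler name "ExampleBot", and the example.com URLs are all hypothetical, and passing None stands in for a site that has no robots.txt file.

```python
from urllib.robotparser import RobotFileParser

def crawlable(robots_lines, user_agent, urls):
    """Return the URLs a crawler may visit.

    robots_lines is the robots.txt content as a list of lines,
    or None when the site has no robots.txt file (hypothetical convention).
    """
    if robots_lines is None:
        # No robots.txt: nothing is restricted.
        return list(urls)
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return [url for url in urls if parser.can_fetch(user_agent, url)]

urls = ["https://www.example.com/", "https://www.example.com/search?q=shoes"]

no_file = crawlable(None, "ExampleBot", urls)
no_rules = crawlable(["User-agent: *", "Disallow:"], "ExampleBot", urls)
with_rules = crawlable(["User-agent: *", "Disallow: /search"], "ExampleBot", urls)

print(no_file)
print(no_rules)
print(with_rules)
```

A missing file and an empty Disallow line both leave every URL crawlable; only the third case, which disallows /search, filters the list.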

Technical robots.txt syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots file. They include:

  • User-agent: The specific web crawler to which you’re giving crawl instructions (usually a search engine). A list of most user agents can be found here.

  • Disallow: The command used to tell a user-agent not to crawl a particular URL. Each URL needs its own "Disallow:" line.

  • Allow (supported by Googlebot and other major crawlers; defined in RFC 9309): The command to tell a crawler it can access a page or subfolder even though its parent page or subfolder may be disallowed.

  • Crawl-delay: How many seconds a crawler should wait before loading and crawling page content. Note that Googlebot does not acknowledge this command, but crawl rate can be set in Google Search Console.

  • Sitemap: Used to call out the location of any XML sitemap(s) associated with this URL. Note this command is only supported by Google, Ask, Bing, and Yahoo.
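
As an illustration of the five terms together, the sketch below parses a made-up robots.txt with Python's urllib.robotparser and reads back each setting. The paths, the sitemap URL, and the crawler name "ExampleBot" are assumptions. The Allow line is placed before the Disallow line because this particular parser checks rules in file order; other crawlers may resolve such conflicts differently. The site_maps() call needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file using all five terms described above.
rules = """\
User-agent: *
Allow: /archive/public/
Disallow: /archive/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Allow opens a subfolder whose parent folder is disallowed.
allowed_sub = parser.can_fetch("ExampleBot", "/archive/public/index.html")
blocked_parent = parser.can_fetch("ExampleBot", "/archive/2019.html")

# Crawl-delay and Sitemap are exposed through dedicated accessors.
delay = parser.crawl_delay("ExampleBot")
sitemaps = parser.site_maps()

print(allowed_sub, blocked_parent, delay, sitemaps)
```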

Why do you need robots.txt?

Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.

Some common use cases include:

  • Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
  • Keeping entire sections of a website private (for instance, your engineering team’s staging site)
  • Keeping internal search results pages from showing up on a public SERP
  • Specifying the location of sitemap(s)
  • Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
  • Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once

If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.

How do you make a robots.txt file with the Google robots file generator?

A robots.txt file is easy to make, but if you haven’t made one before, the following steps will save you time.

  1. When you land on the robots.txt generator page, you will see several options. Not all of them are mandatory, but choose carefully. The first row contains the default values for all robots and an optional crawl-delay. Leave them as they are if you don’t want to change them.
  2. The second row is for your sitemap. Make sure you have one, and don’t forget to mention it in the robots.txt file.
  3. Next, choose whether each search engine’s bots may crawl your site. The first block covers the search engine bots, the second covers images (if you want to allow them to be indexed), and the third covers the mobile version of the website.
  4. The last option is for disallowing, where you restrict crawlers from indexing areas of the site. Make sure to add the forward slash before filling the field with the address of the directory or page.
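
The file such a form produces can be approximated with a short Python function. This is a rough sketch, not the generator's actual code: the function name build_robots_txt, its parameters, the bot name "ExampleImageBot", and the paths are all hypothetical, and the real tool's output layout may differ.

```python
def build_robots_txt(default_allow=True, crawl_delay=None, sitemap=None,
                     blocked_agents=(), disallowed_paths=()):
    """Assemble robots.txt text from form-style options (hypothetical helper)."""
    lines = ["User-agent: *"]
    if not default_allow:
        # Refuse everything for all robots.
        lines.append("Disallow: /")
    elif disallowed_paths:
        for path in disallowed_paths:
            # Step 4: every restricted path must start with a forward slash.
            if not path.startswith("/"):
                path = "/" + path
            lines.append(f"Disallow: {path}")
    else:
        # An empty Disallow value restricts nothing.
        lines.append("Disallow:")
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    # Step 3: bots the user chose to refuse get their own group.
    for agent in blocked_agents:
        lines += ["", f"User-agent: {agent}", "Disallow: /"]
    # Step 2: the sitemap location goes on its own line.
    if sitemap:
        lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"

text = build_robots_txt(
    crawl_delay=10,
    sitemap="https://www.example.com/sitemap.xml",
    blocked_agents=["ExampleImageBot"],
    disallowed_paths=["cgi-bin/", "/tmp/"],
)
print(text)
```

Adding the missing slash inside the function mirrors the advice in step 4, so a path typed as cgi-bin/ still ends up as a valid rule.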