Robots.txt and SEO: The Detailed Guide

Robots.txt is a file placed on a site to control crawling. Sites with a large number of pages benefit from improved crawlability, but an incorrect description can drastically reduce traffic to the site, so always verify the file with a tool such as the robots.txt tester. For sites with a small number of pages, crawl control is often unnecessary, so there is no need to force it.

In this article, we provide a detailed guide to robots.txt and SEO.

What is robots.txt?

A robots.txt is a text file placed in the root directory of your site that allows you to deny crawling of specific content.

You can control how crawlers access each page by writing rules that prohibit crawling in the robots.txt file and placing it in the root directory of your domain.

In other words, you can direct search engine crawlers to the important content of your site.

In general, crawling is considered good, so some people may wonder, “Isn’t it better to crawl all the content on a web page?”

However, having member-only content, shopping carts, and duplicate pages that are unavoidably generated by the system crawled may actually hurt the SEO of the entire site.

The SEO effect of setting robots.txt

With robots.txt, you can block crawling of content that does not need to be crawled and efficiently direct crawlers only to the pages you do want crawled.

As a result, the content you want evaluated will be crawled more frequently, which helps it gain SEO evaluation more quickly. Prioritize crawling of the content that is important to your site so it is indexed efficiently.

Robots.txt is a file that controls content that you don’t want to be crawled on your site. It is said that if this is set properly, important content will be crawled preferentially, and it will also have a positive effect on the SEO of the entire site.

If you're not using robots.txt, it's likely that you're letting search engines crawl pages you don't need, degrading the overall quality of your site. By introducing robots.txt, you can restrict crawling of useless pages, optimize crawling, and prioritize crawling of important content within the site.

As a result, it is said to be effective for SEO of the entire website in a short period of time.

If you would like to learn more about the impact of robots.txt on SEO and search engine rankings, read our article Technical SEO – A Detailed Guide. Also, if you would like to learn technical SEO in depth, we offer three digital marketing courses: the Post Graduate Program in Digital Marketing with Gen AI, the Professional Diploma in Digital Marketing with Gen AI, and the Performance Marketing Course with Gen AI. We are ranked among the top 5 online digital marketing courses in India with a 100% placement guarantee.

Difference from noindex

A setting that is easily confused with robots.txt is noindex. Noindex is a setting that prevents search engines from indexing a page (storing its information in their database), and it is applied by writing a meta tag in the HTML code or by sending an HTTP header.

The two therefore differ greatly in both purpose and setting method.

Item | Robots.txt | Noindex
Format | Text file | Meta element or HTTP header
Subject | Can be set for the entire site | Set on individual pages
Purpose | Deny crawling | Deny indexing

Also, because noindex rejects indexing, the page will not be displayed in search results. Robots.txt, however, only refuses crawling, so be aware that a blocked page may still appear in search results.

How to create robots.txt

Robots.txt is a very powerful specification, and an incorrect description can cause serious problems for your website.

Even if you create good-quality content, it would be a waste if it were not crawled properly because of a mistaken robots.txt.

To avoid such a situation, it is important to know how to write an appropriate robots.txt.

Now I will explain how to write robots.txt correctly.
You don't need any special tools to create a robots.txt file; anyone can do it with Notepad. The main description items are explained below in the order "User-agent", "Disallow", "Allow", and "Sitemap". Only User-agent is required and must always be filled in.
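As a rough sketch, a minimal robots.txt using all four items might look like the following; the directory, page, and sitemap paths here are placeholders for illustration, not values from this article:

User-agent: *
Disallow: /private/
Allow: /private/public-page.html
Sitemap: https://example.com/sitemap.xml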

Purpose of robots.txt

Robots.txt's main purpose is to deny crawling, but it also has other functions, such as optimizing crawlability and submitting XML sitemaps. It is also an essential alternative to noindex, which can only be used on HTML pages.

Here, we will explain in detail the purpose of robots.txt. 

  1. Don’t crawl a particular page

The main purpose of robots.txt is to deny crawling of specific content, and the target can be set at various levels, such as per page or per directory.

For example, it can be used when the following content exists.

  • Unfinished pages
  • Pages that require login
  • Pages that are open only to members

If you have pages like these, you probably don't want them to appear in search results, or they were created without SEO in mind at all. Having such unwanted content crawled can be counterproductive for SEO.

Therefore, by using robots.txt to refuse crawling, you can prevent situations where the site's evaluation is unnecessarily lowered.
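As a sketch, blocking a hypothetical members-only directory and a hypothetical unfinished page could look like this (both paths are placeholders):

User-agent: *
# Hypothetical members-only area
Disallow: /members/
# Hypothetical unfinished page
Disallow: /draft-page.html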

  2. Do not crawl image or video files

There are many opportunities to use images and videos when running a site, but because image and video files are not HTML, a noindex meta tag cannot be placed on them.

However, with robots.txt, you can set crawl denial even for non-HTML files. Therefore, it can be said that there are many opportunities to use it as an alternative when noindex cannot be used.

However, depending on the system you are using, an image page may be generated automatically or created separately, in which case noindex can be used in the same way as on regular pages.
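For example, a sketch that blocks only Google's image crawler from a hypothetical image directory, while leaving other crawlers unaffected, could look like this:

User-agent: Googlebot-Image
# Hypothetical directory of images that should not be crawled
Disallow: /images/private/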

  3. Crawlability optimization

Robots.txt also helps optimize crawlability by encouraging crawls to important content.

This is not a problem for sites with a small amount of content, but if there are a large number of pages, such as on an e-commerce (EC) site, the crawler cannot reach every page.

Some pages are important, yet they go uncrawled simply because of the sheer number of pages.

If an uncrawled page plays an important role, such as a page that attracts visitors or a page that leads to inquiries, that is a major loss for the site.

Therefore, it is effective to use robots.txt to prevent unnecessary pages from being crawled and ensure that important pages are crawled. In addition, when robots.txt optimizes crawling, the crawl frequency for the entire site tends to rise, increasing the total number of crawls.
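As a sketch, a site could keep crawlers away from a hypothetical internal search-results directory so that the crawl budget is spent on important content instead:

User-agent: *
# Hypothetical internal search results that generate endless URL variations
Disallow: /search/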

  4. Presenting an XML sitemap

robots.txt can contain the URL of an XML sitemap and can thus be used to submit the sitemap to search engines. You can submit a sitemap from Google Search Console for Google or Bing Webmaster Tools for Bing, but some search engines do not provide such a tool.

Therefore, robots.txt is a very convenient way to efficiently communicate the crawl status and site structure.

How to write robots.txt

When setting up robots.txt, you enter the relevant content under predetermined items. There are four main description items; for specific sample code, please check Google Search Central.

Here, I will explain how to write robots.txt.

  1. User-Agent

User-Agent is a descriptor used to specify the crawler you want to control.

  • All crawlers: * (asterisk mark)
  • Google’s crawler: Googlebot
  • Crawler for smartphones: Googlebot
  • AdSense crawlers: Mediapartners-Google
  • Google Image Search Crawler: Googlebot-Image

The basic approach is to enter "*" to address all crawlers. If you want to deny crawling by Google specifically, enter Google's crawler name, "Googlebot".
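For example, a sketch that gives one rule to all crawlers and a separate, more permissive rule to the AdSense crawler could look like this; each User-agent line starts a new group followed by its own rules (the path is a placeholder):

User-agent: *
# Hypothetical directory blocked for everyone else
Disallow: /private/

User-agent: Mediapartners-Google
# An empty Disallow allows this crawler to access everything
Disallow: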

  2. Disallow

Disallow is the description item used to specify pages or directories for which you want to deny crawling. By entering the URL path, you can refuse crawling in a targeted way.

  • Whole site: “Disallow: /”
  • Directory specification: “Disallow: /abc9999/”
  • Page specification: "Disallow: /abc9999.html"
  • Replace "abc9999" with your own URL path

Disallow is the description item you will use in the most situations, so remember how to enter it.

  3. Allow

Allow is the description item for permitting crawling, and it plays the opposite role of Disallow. Normally, however, crawling is permitted even without an Allow entry, so there are few situations where it is needed.

Basically, Disallow is entered first, and Allow is used to permit crawling of only specific pages or directories within the disallowed area.

Specifically, the situation is as follows.

  • User-agent: *
  • Disallow: /sample/
  • Allow: /sample/abc9999.html

In the above case, crawling of the directory "sample" is denied, but the page "abc9999.html" inside it is still permitted to be crawled.

  4. Sitemap

As the name implies, Sitemap is a description item used to send a sitemap.

Input is optional, but entering a Sitemap tends to speed up crawling. Therefore, if you want to improve crawlability, we recommend entering it.

  • Sitemap: http://abc9999.com/sitemap.xml
  • Replace "abc9999.com" with your own sitemap path

If there are multiple sitemap paths, enter each one on a new line.
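For instance, a site with separate sitemaps for pages and images (hypothetical URLs) would simply list one Sitemap line per file:

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-images.xml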

How to set robots.txt

To actually set up robots.txt, use one of the following methods:

  • Using plugins
  • Upload directly

If you are using a WordPress site, we recommend using a plugin that can be easily set up.

Here, I will explain how to set robots.txt. 

  1. Using plugins

If it is a WordPress site, you can easily set robots.txt by using a plugin called “All in One SEO Pack”.

It can be activated by the following process.

  • Download and activate the All in One SEO Pack plugin.
  • Open the "Robots.txt" setting screen from the WordPress administration screen.
  • Enable the relevant features (available in the free version).

Once you have set this up, the following will appear at the bottom of the "Create Robots.txt file" section.

  • User-agent: *
  • Disallow: /wp/wp-admin/
  • Allow: /wp/wp-admin/admin-ajax.php
  • Sitemap: https://sample.com/sitemap.xml

All you have to do is edit it referring to the above “how to write” and you’re done. 

  2. Upload directly

A method common to all sites is to upload the file directly to the top of the site's directory (the root).

  • File format: “UTF-8” encoded plain text
  • File size: Maximum 500KB

It may also be placed at the root of a subdomain, but be aware that a robots.txt placed in a subdirectory will not be found.

How to check robots.txt

You can also check the robots.txt file directly, but we recommend using a tool, because manual checks can overlook errors. The "robots.txt tester" is a free tool provided by Google, and you can easily check for errors simply by entering a URL.

Here, we will explain how to use the “robots.txt tester” to check robots.txt. 

  1. Check syntax

Checking the syntax is a method for determining whether the contents of the robots.txt file are grammatically correct.

You can check the syntax by the following method.

  • Access the "robots.txt tester"
  • Enter the corresponding URL path in the URL input field at the bottom of the screen and click “Test”
  • Test results are displayed

Before testing, make sure your site is reflected properly in the tool. If your site is not reflected, the robots.txt file is not installed correctly, so install the file again before testing.

  2. Syntax correction

After checking the test results with the “robots.txt tester”, check if there are any errors.

If an error occurs, first fix it in the "robots.txt tester". Click on the error and enter the characters directly to change the syntax.

If the syntax is incorrect again, an error will occur, so it is important to correct the content until there are no errors.

However, even if you make corrections in the “robots.txt tester”, the actual robots.txt file will not be changed. Therefore, after checking the error content, correct the actual file.

Perform the test again according to the above flow, and if no error occurs, the confirmation is completed. 

Precautions when setting robots.txt

Robots.txt can be said to be a relatively simple setting, as you only need to enter it according to the description items.

However, the following items are easy to make mistakes, so be careful not to deviate from the purpose when using them.

  • Do not use for index rejection purposes
  • Do not use for the purpose of preventing duplicate content
  • It does not restrict user access

I will explain each of them below.

  1. Do not use for index rejection purposes

A common mistake is to use robots.txt to deny indexing.

Since robots.txt only refuses crawling, you must use noindex to reject indexing. At first glance the two seem to have the same effect, but if you use robots.txt for this purpose, the page may still be displayed in the search results, only without a description, so be careful.

It also affects the evaluation of the entire site, so it is important to make settings according to the original purpose. 

  2. Do not use for the purpose of preventing duplicate content

For the same reason as the index rejection above, do not use robots.txt as a countermeasure against duplicate content.

As the amount of content on your site increases, duplicate content becomes more likely. It is tempting to think you can eliminate duplicates by refusing to crawl them, but if such a page is indexed anyway, the search engine will still recognize it as duplicate content.

Since robots.txt cannot fully deal with this, handle duplicate content with noindex or URL normalization (canonicalization) instead.

  3. It does not restrict user access

The last misuse of robots.txt is trying to restrict user access. As with indexing, it is easy to assume, mistakenly, that blocked content is completely hidden from search engines and users.

However, robots.txt does not restrict user access. If the URL is published anywhere on the Internet, users can still reach the page even though crawling is refused.

Separate settings are required to restrict access, so be careful not to mistake the effect you get. 

Purpose of setting robots.txt

Robots.txt has four main purposes:

  1. Prevent the crawler from crawling specific pages and directories

There are pages on a site that you do not want many people to see, such as unfinished pages in a staging environment, login-required pages, and member-only content.
A staging page may still be in the middle of being created, and login-required pages are often built without much awareness of SEO. Above all, you do not want these pages to appear in search results.
So use robots.txt when you want to block crawling of pages that you do not want search engines to see or index.

* If you definitely do not want a page indexed, protect it with a password or set noindex rather than relying on robots.txt alone.

  2. Optimization of crawlability

On e-commerce (EC) sites and the like, a large number of similar pages may be generated by filter conditions, and the page count can balloon. Under these circumstances, search engines may not be able to crawl all pages.

Robots.txt is also used to control crawling of such pages, which are necessary on the site but are not expected to bring in search traffic. By controlling them, you can increase the crawl frequency of the pages you want search engines to evaluate, or get the entire site crawled. In terms of SEO benefits, this is the most noticeable one.
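As a sketch, an EC site could use wildcards to keep crawlers away from hypothetical sort and filter parameters while the normal category and product pages remain crawlable (the parameter names are placeholders):

User-agent: *
# Hypothetical faceted-navigation parameters
Disallow: /*?sort=
Disallow: /*?color=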

  3. Presentation of the XML sitemap

The URL of the XML sitemap can be included in robots.txt and shown to search engines.
It is possible to submit an XML sitemap from Google Search Console or Bing Webmaster Tools, but some search engines do not offer such a means, so it is worth including the sitemap when installing robots.txt.

  4. Prevent images, videos, and audio files from being displayed in search results

Images and videos are not HTML, so a noindex meta tag cannot be placed on them.
You can use robots.txt to prevent them from appearing in search results.
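For example, to keep a hypothetical set of video files out of Google's crawl (and therefore out of its search results), a sketch could combine a wildcard with the end-of-URL marker $:

User-agent: Googlebot
# Hypothetical video files; $ marks the end of the URL
Disallow: /*.mp4$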

How to set robots.txt

  1. Installation method

Be sure to install robots.txt in the root directory of the operating site.
Example 1: http://www.example.com/robots.txt
Example 2: https://www.seohacks.net/robots.txt
Example 3: https://sub.seohacks.net/robots.txt

It can also be applied to subdomains, as in Example 3.
Also, be sure to name the file "robots.txt" when installing it. Any other file name will not be recognized.

  2. Description method

The idea of how to write a crawl control in robots.txt is very simple.

  • Which robot
  • Which page / directory / file
  • May crawl / Do not crawl

These are the three things you write.

1. Which robot

First, specify the robot that controls the crawl.

Google alone has several crawlers, including:

  • googlebot (applies to all)
  • googlebot-news (applies to news, news images)
  • googlebot-image
  • googlebot-video

There are various robots like these, and of course other search engines have their own robots as well, so set this according to your purpose.
If you want to give instructions to all robots, use * (the wildcard). Use the wildcard unless you have a reason to target a specific crawler.

2. Which page / directory / file

Specify the pages that match the purpose you defined when setting up robots.txt at the beginning.
Wildcards such as * and $ can be used when specifying paths, so use them to match the URLs you need.

3. May crawl/Do not crawl

If you allow crawl

allow: [page path]

Example 1: allow: /
Example 2: allow: /pasta

If you do not allow crawl

disallow: [page path]

Example 1: disallow: /
Example 2: disallow: /pasta

That is how the directives are written.

Both an Allow and a Disallow rule may apply to the same URL; in that case, the more specific rule wins.

Example 1
User-agent: *
Disallow: /
Allow: /blog

Crawling of all URLs is blocked, but /blog is specifically permitted with Allow, so in this case crawling of all URLs other than those under /blog is blocked.

Example 2
User-agent: *
Disallow: /aaaaa
Allow: /aaaaa/bbbbb
Allow: /aaaaa/ccccc

Crawling of all URLs under /aaaaa is blocked, but Allow specifically permits /aaaaa/bbbbb and /aaaaa/ccccc, so in this case crawling of URLs under /aaaaa other than /aaaaa/bbbbb and /aaaaa/ccccc is blocked.

Use a robots.txt tester

Once you have uploaded robots.txt to the production environment, you can use the "robots.txt tester".

However, uploading straight to production carries some risk, so we recommend uploading a file that you have already confirmed to be problem-free.

In a case like the one above, where crawling is allowed for all user agents, there will be no problem if this is the first time you are publishing the file.

The robots.txt tester lets you edit the file on screen, so it's a good place to test changes. Note that edits made here are not reflected in your actual robots.txt file; when the test is complete, click the submit button and download the edited file.

When testing, you can specify a robot.

In this example, "googlebot-news" is specified in robots.txt while "googlebot" is selected as the test robot, so crawling is not blocked but allowed.

When you're done, click the Submit button to download the edited robots.txt.
After uploading the file back to the root directory, click "2 Check uploaded version" to confirm that the file has been uploaded correctly. You can also access the file's URL directly to check.
If everything looks correct, finally notify Google of the latest version of robots.txt with "3 Request an update from Google". That completes the process.

Notes on robots.txt

Here are some common caveats when using robots.txt.

  1. Do not use for index control

Blocking crawling of a specified page or directory certainly makes it harder for that page to be indexed, but it may still be indexed if it is discovered through external links or other means.
In that case the page information cannot be obtained (crawled), so it will appear in the search results in a stripped-down form, typically without a description.

Therefore, if you do not want a page indexed, set noindex instead of using robots.txt.
Also note that noindex cannot be detected while the page is blocked, so remove the block before setting it.

  2. Users can continue to access

Only crawlers can be blocked with robots.txt; users can still access the page without any problem.
Therefore, setting it in order to prohibit user access to members-only content or paid pages is not effective, so be careful.

  3. Sites with a small number of pages need not pay much attention to it

As mentioned at the beginning, crawl problems are less likely to occur on sites with a small number of pages, and robots.txt is not required there, so it is fine to give it a lower priority on such sites.

Summary

In this article, "Robots.txt and SEO: The Detailed Guide", we have explained robots.txt, from basic knowledge to how to write it and how to set it up. It is easy to confuse it with noindex, but be aware that rejecting indexing and rejecting crawling have very different effects.

In addition, misusing it can undo the intended effect, so it is important to use it according to its purpose.

As you can see, robots.txt is a fairly technical area, and in practice you may find yourself wondering what to do. If you have any problems, please feel free to comment below. Also, visit NIDE for various courses, such as the online digital marketing course, performance marketing, and social media marketing, with 100% guaranteed placement.

 
