robots.txt

How Search Engines Work

  • The rules crawlers follow are defined by the Robots Exclusion Protocol.
  • A robot called a crawler travels the web and collects information about sites.
  • An indexer analyzes the information the crawler has collected.
  • Each search engine then ranks the indexed data with its own algorithm and returns search results.
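The "collect" step above can be sketched in a few lines: a crawler fetches a page and extracts its outgoing links, which become both raw material for the indexer and the next URLs to visit. This is a minimal illustration using Python's standard-library HTML parser; the page content is a hardcoded stand-in for a real fetch.

```python
from html.parser import HTMLParser

# Minimal sketch of one crawl step: collect the outgoing links from a page.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hardcoded page standing in for the HTML a real crawler would download.
page = '<html><body><a href="/about">About</a> <a href="/blog">Blog</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/about', '/blog']
```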

What Is robots.txt?

  • robots.txt is a text file that tells crawlers which pages they may or may not crawl.
  • It is placed in the root (top-level) directory of the domain.
  • robots.txt is only a request, not an enforcement mechanism, so crawlers have no absolute obligation to follow it.
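A crawler that does follow the file can check it with Python's standard-library `urllib.robotparser`. Below is a small sketch; the rules and URLs are illustrative, and in practice the file would be fetched from the site's root rather than parsed from a string.

```python
from urllib import robotparser

# Illustrative rules; a real crawler would fetch https://example.com/robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A well-behaved crawler consults can_fetch() before requesting a URL.
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
```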

robots.txt Format

  • User-agent: the name of the crawler the rules apply to
  • Allow: grants access to a path; not part of the original standard, but supported by major crawlers such as Googlebot
  • Disallow: blocks access to a path
  • Crawl-delay: seconds to wait between visits (ignored by some crawlers, including Googlebot)
  • Sitemap: the URL of the site's sitemap
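The directives above can be combined in a single file. A hypothetical example (paths, delay, and sitemap URL are illustrative):

User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml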

robots.txt Examples

Allow all search bots to access all documents

User-agent: *
Allow: /

* means all robots, and / means all directories.

Block all search bots from all documents

User-agent: *
Disallow: /

Allow access to a specific directory

User-agent: Googlebot
Allow: /foo/bar/

Block access to a specific directory

User-agent: Googlebot
Disallow: /foo/bar/

Allow only Googlebot and block all others

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
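The per-agent resolution of this last example can be verified with `urllib.robotparser`: a crawler first looks for a group matching its own name and falls back to the `*` group only if none matches. The bot names and URL below are illustrative.

```python
from urllib import robotparser

# The "allow only Googlebot, block all others" example from above.
rules = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Googlebot matches its own group and is allowed everywhere;
# any other bot falls back to the * group and is blocked.
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True
print(rp.can_fetch("OtherBot", "https://example.com/page.html"))   # False
```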

Expose Only Part of a Homepage Directory to Search Engines

User-agent: *
Disallow: /conection/
Disallow: /my_conection/

Block Part of a Homepage Directory from Search Engines

User-agent: *
Disallow: /my_page/

Site Load and Performance Perspective

If crawler visits are increasing server load, robots.txt can be used to exclude large amounts of unimportant content from crawling. This reduces site load and lets crawlers spend their limited crawl budget on the content that matters.

Separating important from unimportant content in this way benefits both SEO and site load.

Unimportant content may include:

  • Pages that do not need to be indexed by search engines
  • Low-value content pages
  • Multiple pages with identical content
  • Landing pages for ads placed on the site
  • Pages you want to make available only to limited people
  • Management system files
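A file covering several of the cases above might look like this (all paths are hypothetical):

User-agent: *
Disallow: /tmp/
Disallow: /print-version/
Disallow: /ads-landing/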

Security Perspective

Listing "pages to crawl" or "pages not to crawl" in robots.txt reveals that those pages exist. robots.txt itself is publicly readable, so anyone can open it and see the paths of management system files or pages intended only for a limited audience, even if those pages never appear in search results.

Using robots.txt can reduce the risk of such pages appearing in search results, but if security-sensitive paths are written into it, the file itself becomes a map for attackers, and a security risk arises.

Therefore, security-sensitive management files and pages meant only for specific people must be protected by reliable access restrictions such as login authentication or IP address restrictions; robots.txt is not an access-control mechanism.
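As a contrast with robots.txt, a real access restriction actually rejects unauthenticated requests. Below is a minimal sketch of server-side HTTP Basic Authentication checking; the credentials are hypothetical, and a production system would store hashed passwords and use constant-time comparison rather than this simplified check.

```python
import base64

# Hypothetical credentials for illustration only.
VALID_USER, VALID_PASS = "admin", "s3cret"

def is_authorized(auth_header):
    """Return True if an HTTP Basic Authorization header carries valid credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[len("Basic "):]).decode("utf-8")
    except Exception:
        return False  # Malformed base64 or encoding: reject.
    user, _, password = decoded.partition(":")
    return user == VALID_USER and password == VALID_PASS

# A browser sends "Authorization: Basic base64(user:password)".
token = base64.b64encode(b"admin:s3cret").decode()
print(is_authorized("Basic " + token))  # True
print(is_authorized(None))              # False
```

Unlike a robots.txt Disallow line, a request that fails this check never reaches the protected content at all.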
