robots.txt
How Search Engines Work
- A robot called a crawler travels the web and collects information about sites.
- An indexer analyzes the information the crawler collects.
- Based on the analyzed data, each search engine returns search results according to its own ranking algorithm.
What Is robots.txt?
robots.txt, defined by the Robots Exclusion Protocol, is a text file that tells crawlers which pages to crawl or not to crawl.
- It is published in the top-level (root) directory of the domain.
- robots.txt is only a recommendation, so crawlers are under no absolute obligation to follow it.
robots.txt Format
- User-agent: the name of the search bot the rules apply to
- Allow: access permission setting (originally a Google extension, now widely supported)
- Disallow: access blocking setting
- Crawl-delay: delay before the next visit, in seconds (not honored by all crawlers)
- Sitemap: the location of the site's sitemap
robots.txt Examples
Allow all search bots to access all documents
User-agent: *
Allow: /
* means all robots, and / means all directories.
Block all search bots from all documents
User-agent: *
Disallow: /
Allow access to a specific directory
User-agent: Googlebot
Allow: /foo/bar/
Block access to a specific directory
User-agent: Googlebot
Disallow: /foo/bar/
Allow only Googlebot and block all others
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Expose Only Part of a Homepage Directory to Search Engines
User-agent: *
Disallow: /connection/
Disallow: /my_connection/
Block Part of a Homepage Directory from Search Engines
User-agent: *
Disallow: /my_page/
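The "allow only Googlebot and block all others" example above can be sanity-checked with Python's standard urllib.robotparser; the page path in this sketch is a hypothetical example:

```python
# Check how a crawler interprets the "allow only Googlebot" rules
# using Python's standard library robots.txt parser.
from urllib import robotparser

rules = """
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot may fetch any page; every other bot falls under the * block.
print(rp.can_fetch("Googlebot", "/foo/bar/page.html"))  # True
print(rp.can_fetch("OtherBot", "/foo/bar/page.html"))   # False
```

The same parser can be pointed at a live site with set_url() and read() instead of parse(), which is how well-behaved crawlers typically consume robots.txt.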
Site Load and Performance Perspective
If crawler visits increase site load, large amounts of unimportant content can be excluded from crawling with robots.txt, reducing site load and improving crawl efficiency for the content that matters.
Separating important content from unimportant content in this way benefits both SEO and site performance.
Unimportant content may include:
- Pages that do not need to be indexed by search engines
- Low-value content pages
- Multiple pages with identical content
- Landing pages for ads placed on the site
- Pages you want to make available only to limited people
- Management system files
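As a sketch, the unimportant areas listed above could all be excluded in a single robots.txt (every directory name here is a hypothetical example):

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /ads-landing/
Sitemap: https://example.com/sitemap.xml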
Security Perspective
If “pages to crawl” or “pages not to crawl” are listed in robots.txt, content intended only for a limited audience can be exposed, because robots.txt itself is a public file.
Listing management system files or limited-disclosure pages in robots.txt may keep them out of search results, but anyone can read robots.txt and discover those paths. In other words, management-related files and pages meant only for limited people can be revealed by the very file that hides them from search engines.
Using robots.txt reduces the risk of appearing in search results, but listing security-sensitive content in robots.txt creates a security risk of its own.
Therefore, security-sensitive management files and pages meant only for specific limited people must have reliable access restrictions such as login authentication or IP address restrictions.
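To illustrate the exposure risk, this small sketch extracts every Disallow path from a robots.txt file, exactly as a curious visitor or attacker could; the rules string is a hypothetical example:

```python
# robots.txt is public: anyone can list the paths it tries to "hide".
rules = """
User-agent: *
Disallow: /admin/
Disallow: /secret-report/
"""

hidden_paths = [
    line.split(":", 1)[1].strip()        # keep only the path after "Disallow:"
    for line in rules.splitlines()
    if line.lower().startswith("disallow:")
]
print(hidden_paths)  # ['/admin/', '/secret-report/']
```

This is why sensitive paths belong behind authentication or IP restrictions rather than in robots.txt alone.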