SEO Glossary

SEO has its own set of tools and terms.

Crawler / Search Engine Bot

a program that scans the internet looking for pages to index.

Crawling is about discovery. A crawler finds URLs from different sources:

- Pages linked from other (indexed) pages.
  1. For example, when the homepage is read, every link on it is followed too, and the process repeats recursively.
  2. Pages can be linked from external sites, too.
- Manual suggestions via online tools.
- The `sitemap.xml` file.
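
The recursive link-following described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular search engine works: the `PAGES` dictionary is a hypothetical in-memory site standing in for real HTTP fetches, and `LinkExtractor` and `discover` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" (URL -> HTML). A real crawler would fetch over HTTP.
PAGES = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a>',
    "https://example.com/blog/post-1": "<p>No links here</p>",
    "https://example.com/orphan": "<p>Nothing links to this page</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def discover(seed_urls):
    """Follow links breadth-first from the seeds, visiting each URL once."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue:
        url = queue.popleft()
        extractor = LinkExtractor()
        extractor.feed(PAGES.get(url, ""))
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links against the page URL
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen


found = discover(["https://example.com/"])
print(sorted(found))
```

Starting from the homepage alone, the traversal never reaches `/orphan` because no page links to it. Such a page has to be suggested some other way, for example as an extra seed URL.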

Indexing

the act of collecting and storing a page in a database for search purposes.

A page can only be found in search results if it has previously been indexed by a search engine.

Even if crawlers discover a page, a robots `meta` tag can instruct search engines not to include that page in their search indexes.
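
As a rough illustration, the following sketch checks whether a page opts out of indexing through a robots `meta` tag carrying the `noindex` directive. The sample `PAGE` markup and the names `RobotsMetaParser` and `is_indexable` are assumptions made for this example; real search engines apply more rules than this.

```python
from html.parser import HTMLParser

# Hypothetical page that asks search engines not to index it.
PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
</head><body>Internal search results</body></html>"""


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_indexable(html):
    """Return False when the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


print(is_indexable(PAGE))
```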

`robots.txt`

a file hosted on the root of a market’s website which contains rules for disallowing crawlers from fetching specific URLs or folders.

If a page is already indexed, disallowing it in the `robots.txt` file will not remove it from the index: the rule only stops crawlers from fetching the page.

To remove it from the index, the page should contain a robots `meta` tag with the `noindex` directive, and it must stay crawlable so that search engines can read the tag.

The `robots.txt` file also links to the sitemap, through a `Sitemap:` line.
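
To make the rules and the sitemap link concrete, here is a small sketch that parses a `robots.txt` body with Python's standard `urllib.robotparser` module. The file contents, the `example.com` URLs, and the `ExampleBot` user agent are hypothetical; `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow rules match on URL path prefixes and only control crawling.
blocked = parser.can_fetch("ExampleBot", "https://example.com/checkout/cart")
allowed = parser.can_fetch("ExampleBot", "https://example.com/products/shoes")

# Sitemap lines are independent of the User-agent groups.
sitemaps = parser.site_maps()

print(blocked, allowed, sitemaps)
```

Here `blocked` is `False` because the path falls under `/checkout/`, while `allowed` is `True` because no rule matches the product URL.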

`sitemap.xml`

a file hosted on the root of a market’s website which contains a list of URLs that a search engine bot should index. The purpose of this file is to suggest URLs to the crawler, nothing more.

Sitemaps are an important part of website optimization because they give search engines a way to discover pages on a site. It isn't always possible to internally link every page, especially on a large website. With a sitemap, you can help Google discover important pages, even those that have been orphaned.
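
A sitemap is a plain XML list of URLs. The sketch below generates one with Python's standard `xml.etree.ElementTree`, using the namespace defined by the sitemaps.org protocol. The `build_sitemap` helper and the listed URLs are hypothetical, and optional elements such as `lastmod` are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as UTF-8 bytes) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical URLs; the last one is an orphaned page with no internal links.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/products/shoes",
    "https://example.com/orphan-landing-page",
])
print(sitemap_xml.decode("utf-8"))
```

Listing the orphaned page here gives the crawler a way to find it even though no other page links to it.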

Consider robots and sitemap as the yin and yang ☯️ of URL crawling.
They are not strict opposites, but they serve opposing purposes:
the robots file disallows URLs and the sitemap suggests them.