Googlebot: What is it? A spider or crawling robot

Googlebot is Google's web crawling robot (sometimes also called a "spider"). Crawling is the process by which Googlebot discovers new and updated pages and adds them to the Google index.

We use a huge number of computers to fetch (or "crawl") billions of web pages. Googlebot uses an algorithmic crawling process: computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

The Googlebot crawling process starts with a list of web page URLs generated from previous crawls, expanded with sitemap data provided by webmasters. As Googlebot visits each of these websites, it detects links (SRC and HREF attributes) on their pages and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are detected and used to update the Google index.
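As a rough illustration of that frontier-expansion step, here is a minimal sketch in Python using only the standard library. The seed URL and the sample HTML are hypothetical; a real crawler would also fetch pages, resolve relative URLs, and deduplicate.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href/src targets from a page, as in a crawler's discovery step."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Both HREF (links) and SRC (images, scripts) attributes are discovered
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

# Frontier update: newly discovered links are appended to the list of pages to crawl
frontier = ["https://example.com/"]  # hypothetical seed URL
parser = LinkCollector()
parser.feed('<a href="/about">About</a> <img src="/logo.png">')
frontier.extend(parser.links)
```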

How Googlebot accesses your site

On average, Googlebot does not access most sites more than once every few seconds. However, due to network delays, this rate may appear slightly higher over brief periods. In general, Googlebot downloads only a single copy of each page at a time. If you see Googlebot download the same page several times, this is most likely because the crawler was stopped and restarted.

Googlebot is designed to be distributed across many machines to improve performance and scale as the Web grows. In addition, to reduce bandwidth usage, many crawlers run on machines located near the sites they index. Therefore, your logs may show visits from several machines at google.com, in all cases with Googlebot as the "user-agent". Our goal is to crawl as many pages of your site as possible on each visit without overwhelming your server's bandwidth.

Blocking Googlebot access to your site content

It is practically impossible to keep a web server secret by not publishing links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL can appear in the Referer header, and the other web server can store it and publish it in its referral logs. In addition, the Web contains a large number of outdated and broken links. Whenever someone publishes an incorrect link to your site, or fails to update links to reflect changes on your server, Googlebot will try to crawl that incorrect link.

You have several options to prevent Googlebot from crawling the content of your site, including the use of the robots.txt file to block access to files and directories on your server.
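For example, a minimal robots.txt that blocks Googlebot from a directory and a single file might look like the following sketch (the paths shown are hypothetical placeholders):

```
# Block Googlebot from one directory and one file
# (/private/ and /page.html are hypothetical paths)
User-agent: Googlebot
Disallow: /private/
Disallow: /page.html
```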

Googlebot may take some time to detect changes after you create the robots.txt file. If Googlebot continues to crawl content blocked by robots.txt, verify that the file is in the correct location. The robots.txt file must be located in the root directory of the server (for example, www.mihost.com/robots.txt); placing it in a subdirectory has no effect.

If you only want to avoid "file not found" error messages in your web server log, create an empty file named robots.txt. To prevent Googlebot from following any links on a page of your site, use the nofollow meta tag. To prevent Googlebot from following a specific link, add the rel="nofollow" attribute to that link.
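In HTML, those two options look like this (the link target is a placeholder):

```html
<!-- Page-level: ask crawlers not to follow any link on this page -->
<meta name="robots" content="nofollow">

<!-- Link-level: ask crawlers not to follow this one link -->
<a href="https://example.com/page" rel="nofollow">Example link</a>
```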

Here are some other suggestions:

  • Check that your robots.txt file works correctly. The robots.txt testing tool in Google Webmaster Tools lets you see exactly how Googlebot will interpret the content of your robots.txt file. The Google user-agent to test with is, aptly, Googlebot.
  • The Fetch as Googlebot tool in Google Webmaster Tools lets you see your site exactly as Googlebot sees it. This can be very useful for troubleshooting problems with your site's content or its visibility in search results.

How to make sure your site can be crawled

Googlebot finds sites by following links from page to page. The Crawl Errors page in Webmaster Tools lists the problems Googlebot detected when crawling your site. We recommend reviewing these crawl errors regularly to identify problems with your site.

If you are running an AJAX application with content that you want to appear in search results, we recommend reviewing our proposal on making AJAX-based content crawlable and indexable.

If your robots.txt file works correctly but the site receives no traffic, your content may simply be ranking poorly in the search results pages.

Problems related to spammers and other user-agents

The IP addresses that Googlebot uses change from time to time. The best way to identify Googlebot accesses is by the user-agent (Googlebot). To verify that a robot accessing your server really is Googlebot, perform a reverse DNS lookup.
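A sketch of that verification in Python, using only the standard library socket module: the reverse lookup on the IP should return a hostname ending in googlebot.com or google.com, and a forward lookup on that hostname should return the original IP. (The helper names here are our own, not part of any official API.)

```python
import socket

def is_googlebot_hostname(hostname):
    # Googlebot's reverse-DNS names end in googlebot.com or google.com
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Return True if `ip` passes the reverse-then-forward DNS check."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except socket.herror:
        return False
    if not is_googlebot_hostname(hostname):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips
```

The forward lookup matters: anyone who controls their own reverse DNS can make an IP resolve to a googlebot.com-looking name, but only Google can make that name resolve back to the same IP.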

Googlebot, like the crawlers of other reputable search engines, respects robots.txt directives, but some spammers and other malicious users may not.

Google also has other user-agents, such as Feedfetcher (user-agent Feedfetcher-Google). Because Feedfetcher requests come from explicit actions by users who have added feeds to their Google homepage or Google Reader, rather than from automated crawlers, Feedfetcher does not follow robots.txt guidelines. To prevent Feedfetcher from crawling your site, configure your server to return a 404, 410, or other error status to the Feedfetcher-Google user-agent.
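As one possible sketch, an nginx server could return 404 to that user-agent with a rule like the following (this is an illustrative fragment, not the only way; details vary by server software):

```nginx
# Return 404 to Google's Feedfetcher (place inside a server block)
if ($http_user_agent ~* "Feedfetcher-Google") {
    return 404;
}
```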

Professor at the University of Guadalajara

Hugo Delgado, Web Developer and Designer in Puerto Vallarta

Professional in Web Development and SEO Positioning for more than 10 continuous years, with more than 200 certificates and recognitions across an academic and professional career, including diplomas certified by Google.
