
Screaming Frog: The Full Guide

Screaming Frog is a powerful SEO Spider that performs in-depth on-site SEO analysis. In this guide, we will look at some of the main features that are most useful during an SEO analysis. The free version of Screaming Frog allows you to analyze up to 500 URLs.

Crawling

Screaming Frog allows you to crawl a specific website, subdomain or directory.

Crawling: Subdomain

In the paid version, the SEO Spider allows you to select the “Crawl All Subdomains” option if the website has more than one subdomain. If you only need to crawl one subdomain, simply add its URL in the appropriate box.

Among the most commonly used features is monitoring the status codes returned by a website (40x, 50x, 200 and 30x).

Crawling: Subfolder

By default, Screaming Frog crawls a directory simply by entering its address in the bar, as shown in the image below.

If you need to perform more advanced crawling, you can use a wildcard, which tells the SEO Spider to crawl all pages that precede and/or follow it. The path to use this feature is:

Spider > Include, then add the desired syntax in the box that appears. For example, with the syntax https://www.bytekmarketing.com/about/.* the spider only crawls the sections of the website in the “About Us” branch, i.e. all the resources that come after the wildcard character. Starting the crawl will extract all the child URLs of the “About Us” section, for example: https://www.bytekmarketing.com/about/roberto-paolucci or https://www.bytekmarketing.com/about/mario-rossi.
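As a quick sanity check, an include pattern can be tested outside the tool before starting the crawl. Below is a minimal Python sketch using the same regular expression; the URLs are illustrative:

import re

# The same pattern entered in the Include box (illustrative URLs below).
pattern = re.compile(r"https://www\.bytekmarketing\.com/about/.*")

urls = [
    "https://www.bytekmarketing.com/about/roberto-paolucci",
    "https://www.bytekmarketing.com/about/mario-rossi",
    "https://www.bytekmarketing.com/blog/seo-news",
]

for url in urls:
    status = "crawled" if pattern.fullmatch(url) else "skipped"
    print(f"{status}: {url}")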

This option is particularly useful with large websites, where we may not have the resources to work on very large amounts of data. Keep in mind that in most cases the crawl data will have to be processed in Excel, so the starting point should be data that is easy to work with using VLOOKUP, filters and charts.

Crawling: List of URLs

From the “Mode” tab you can select the crawling mode. If you want to crawl a set of URLs, the mode to set is “List”, which lets you import an Excel file with a column containing the list of URLs.

The other option for scanning a list of URLs is copy and paste: copy the list of URLs from an external source (Excel, CSV, TXT or an HTML page) and click “Paste”.

N.B. Each URL must include the http or https protocol and the www, so the correct structure of each URL is: http://www.test.it.
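If the list comes from a file that contains bare hostnames, the protocol can be added before pasting. A minimal Python sketch, assuming a plain-text file called urls.txt with one URL per line; both the file name and the https:// prefix are assumptions to adapt:

from urllib.parse import urlparse

def normalize(url: str) -> str:
    """Add a protocol to bare hostnames so Screaming Frog accepts them in List mode."""
    url = url.strip()
    if not urlparse(url).scheme:
        url = "https://" + url
    return url

with open("urls.txt") as handle:
    urls = [normalize(line) for line in handle if line.strip()]

print("\n".join(urls))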

Crawling: Large Website

When you need to analyze a large website and it is not enough to crawl only HTML and images (from an SEO perspective it is often useful to also check the status codes of CSS and JS files, to make sure that search engine spiders can render pages correctly), you can work on the settings:

1. Configuration > System > Memory and allocate more memory, for example 4 GB;
2. Set storage to the database instead of RAM.

If even with these two configurations it is not possible to analyze a large website, the only remaining options are:

1. Crawl the website by branches, one or more branches at a time, using:

  • the wildcard character;
  • the Include/Exclude option;
  • a custom robots.txt;
  • the navigation depth (crawl depth);
  • query string parameters.

2. Exclude images, CSS, JS and other non-HTML resources from the crawl.

From an SEO perspective, it is essential to perform a single crawl because it gives you a complete view of the website: for example, the From URL / To URL pairs for 301s and 404s, or the distribution of internal links.

N.B. Screaming Frog may time out or, in general, be unable to analyze resources (or be very slow) even on small websites; in this case the problem may be related to other factors, such as hosting performance or the fact that the IP address from which we launched Screaming Frog has been blocked by the website owner (or by the dedicated IT team).

Our IP address can be banned by a provider because Screaming Frog’s activity looks very similar to an attack (e.g. a DoS attack) aimed at exhausting server resources and causing 50x errors.

Saving the Crawl

After finishing crawling the website there are multiple export options:

  • Save the Screaming Frog crawl file:
    • Having the saved crawl allows you to review the crawl data without having to run it again. This is especially useful for large websites, or for collaborating with colleagues and sharing the file.
  • Save only the tabs you need;
  • Export all pages to a single Excel file;
  • Bulk export, very useful for obtaining, for example, the full internal link distribution (see the sketch after this list):
    • All inlinks (for internal linking analysis);
    • All outlinks;
    • All anchor text;
    • All images;
    • Schema.org structured data.

The image below shows how to export schema.org structured data.
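As an example of working on a bulk export outside the tool, the sketch below summarizes the internal link distribution starting from the “All Inlinks” export. It assumes the export was saved as all_inlinks.csv and that the file has Source and Destination columns; check the header of your own export, as names may vary by version:

import pandas as pd

# Load the "All Inlinks" bulk export (file name is an assumption).
inlinks = pd.read_csv("all_inlinks.csv")

# Count how many internal links each destination URL receives.
distribution = (
    inlinks.groupby("Destination")
    .size()
    .sort_values(ascending=False)
    .rename("inlinks")
)

print(distribution.head(20))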

Configuration Files

Screaming Frog allows you to export a configuration file that can be reused for future projects/customers. It is particularly useful if you perform SEO analysis for similar clients (similar website structure) and have configured advanced filters or special extraction options (filters, exclude/include or wildcard).

The configuration file is also useful if custom scripts have been programmed, for example in Python or from the command line, to automate purely mechanical operations. For example, if we need to perform a series of purely technical SEO audits and the output requires the same data, it would make no sense to re-configure Screaming Frog for each website.
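As a sketch of this kind of automation, the snippet below launches headless crawls for several websites while reusing the same saved configuration file. The paths, the site list and the CLI flags (--crawl, --headless, --config, --output-folder) follow the Screaming Frog command-line interface but may differ by version and operating system, so treat them as assumptions to verify against the official documentation:

import subprocess

CONFIG = "technical-audit.seospiderconfig"  # configuration exported once, reused for every audit
SITES = [
    "https://www.example-client-a.com",
    "https://www.example-client-b.com",
]

for site in SITES:
    # Run a headless crawl with the shared configuration; results go to ./exports.
    subprocess.run(
        [
            "screamingfrogseospider",
            "--crawl", site,
            "--headless",
            "--config", CONFIG,
            "--output-folder", "exports",
        ],
        check=True,
    )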

File Robots.txt

Screaming Frog is “robots.txt compliant”, so it is able to follow the directives in robots.txt exactly as Google does. Through the configuration options it is possible to:

  • ignore the robots.txt;
  • see the URLs blocked by the robots.txt;
  • use a custom robots.txt.

The last option can come in handy before the go-live of a website, to test whether the directives in the robots.txt file are correct.
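For a quick pre-launch check outside the tool, the directives can also be tested with Python's standard library. A minimal sketch; the rules and URLs below are illustrative:

from urllib import robotparser

# Illustrative robots.txt directives to be verified before go-live.
rules = """
User-agent: *
Disallow: /checkout/
Allow: /
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

for url in [
    "https://www.example.com/about/",
    "https://www.example.com/checkout/cart",
]:
    allowed = parser.can_fetch("*", url)
    print(f"{'allowed' if allowed else 'blocked'}: {url}")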

Cookies

By default, Screaming Frog does not accept cookies, just like search engine spiders. This option is often underestimated or ignored but, for some websites, it is of fundamental importance, because accepting cookies can unlock features and additional code that provide extremely useful SEO and performance information.

For example, by accepting cookies you can trigger a small JavaScript snippet that adds code to the HTML of the page… and if this code creates SEO problems, how can we verify it? Screaming Frog helps us in this case, as shown in the image below.

Creating a Sitemap

One of the best ways to create a sitemap is to use an SEO tool like Screaming Frog. WordPress plugins such as Yoast SEO are also fine, but they can suffer from update and compatibility problems; for example, it may happen that URLs in the sitemap return status code 404.

It is recommended to generate a sitemap that contains only canonical URLs with status code 200. For large websites, it is also recommended to create a sitemap for each type of content (PDF, images and HTML pages) and one for each branch of the information architecture.

Having specific sitemaps allows the search engine to better analyze URLs and file types, and gives you full control: you can easily compare the URLs in Google's index (site: operator) with the individual sitemaps.

Please note that a single sitemap file can contain at most 50,000 URLs. For details on the standard see: https://www.sitemaps.org/protocol.html
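If the list of canonical, status-200 URLs is already available (for example from an export), a sitemap can also be built directly. A minimal Python sketch that splits the list at the 50,000-URL limit of the protocol; file names and URLs are illustrative:

import xml.etree.ElementTree as ET

XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LIMIT = 50_000  # maximum URLs per sitemap file, per the protocol

def write_sitemaps(urls, prefix="sitemap"):
    # Write sitemap-1.xml, sitemap-2.xml, ... each holding at most LIMIT URLs.
    for part, start in enumerate(range(0, len(urls), LIMIT), start=1):
        urlset = ET.Element("urlset", xmlns=XMLNS)
        for url in urls[start:start + LIMIT]:
            loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
            loc.text = url
        tree = ET.ElementTree(urlset)
        tree.write(f"{prefix}-{part}.xml", encoding="utf-8", xml_declaration=True)

write_sitemaps([
    "https://www.example.com/",
    "https://www.example.com/about/",
])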

To generate a sitemap with Screaming Frog, follow the path below:

Sitemaps (top bar) > XML Sitemap or Images Sitemap

Among the Screaming Frog options you can decide what to include:

  • Pages, based on:
    • status code;
    • noindex pages;
    • canonicalised URLs;
    • paginated URLs;
    • PDFs.
  • Last Modified;
  • Priority;
  • Change Frequency;
  • Images:
    • include/exclude images;
    • include noindex images;
    • include relevant images based on the number of links they receive;
    • include images hosted on a CDN.

For large websites, e.g. e-commerce, product photos are often uploaded to a subdomain or to external hosting for a variety of reasons, such as:

  • avoiding the absorption of resources allocated to the CMS;
  • ease of management, since scripts can be created just for the images to improve their performance;
  • management of the cron jobs that synchronize the physical warehouse with the e-commerce.

Views: Graphs and Diagrams

With regard to the structure of the website, and the information architecture in particular, the “Visualisations” section is useful because it provides a graphical view of the website structure, as diagrams or graphs.

During an internal linking analysis this section is fundamental, but it is recommended to integrate it with mind-mapping programs such as XMind and with tools like https://rawgraphs.io/.

Configuration Options

The configuration options of the SEO Spider are collected and organized in tabs. In this section we will look at the main tabs without going into detail on every individual option.

Basic Tab

  • Images;
  • CSS;
  • JavaScript;
  • SWF;
  • External links;
  • Link outside of the start folder;
  • Follow internal or external nofollow;
  • Crawl all subdomains;
  • Crawl outside of the start folder;
  • Crawl canonical;
  • Extraction of hreflang;
  • Crawl of links inside the sitemap;
  • Extraction and crawl of AMP links.

Limits Tab

This tab is particularly useful for analyzing very large websites, but not only. From this section you can set:

  • the total crawl limit, expressed as a number of URLs;
  • the crawl depth, expressed as a number of directories;
  • the limit on the number of query strings;
  • the limit on the number of 301 redirects to follow (to avoid redirect chains, which are harmful in terms of resource use and therefore crawl budget);
  • the maximum length of URLs to follow, 2,000 characters by default;
  • the maximum weight of pages to analyze.

Advanced Tab

  • Allow cookies;
  • Pause on high memory usage;
  • Always follow redirects;
  • Always follow canonicals;
  • Respect noindex;
  • Respect canonical;
  • Respect Next/Prev;
  • Extract images from img srcset Attribute;
  • Respect HSTS Policy;
  • Respect self-referencing meta refresh;
  • Response timeout;
  • 5xx Response Retries;
  • Store HTML;
  • Store rendered HTML;
  • Extract Microdata;
  • Extract RDFa;
  • Schema.org Validation;
  • Google Validation.

Top Tabs

In the main menu at the top of the tool there is a series of buttons (tabs) that open different sections; let's look at them in detail.

Internal

The Internal tab combines all the data extracted during the crawl and shown in the other tabs (excluding the External, Hreflang and Custom tabs). Its usefulness lies in giving an overview and in the possibility to export the data and work on it externally, for example in Excel, Data Studio or mind-mapping tools.

External

This tab shows information related to URLs outside the domain.

Protocols

From this section you can see information related to HTTP and HTTPS protocols of both external and internal URLs. This tab is useful to verify, for example, the correct migration to HTTPS.

Response code

This tab provides information on response codes, both internal and external.

Page Titles

This tab provides information related to page titles, in particular:

  • duplicate titles;
  • missing titles;
  • titles shorter than 35 characters;
  • titles longer than 65 characters;
  • titles identical to the H1;
  • multiple titles.

Meta description

Provides information on meta descriptions: their length (minimum and maximum from an SEO perspective) and whether they are duplicated or missing.

H1

It provides information about the H1 heading tag, for example whether it is identical to the title, because very often (especially in e-commerce) products have an H1 identical to the title. This issue can be solved, for example, by concatenating the product variant to the current H1 so as to obtain an original tag.

H2

Information on the length and originality of H2 tags.

Images

The data provided in this tab relates to the weight of each image, the number of internal links it receives and its indexability status. Remember that, from an SEO perspective, an image should be treated like an HTML page because, if well optimized, it can bring in organic traffic, for example through image search.

Canonical

This tab shows the list of canonicalised resources.

Pagination

It provides information about pagination and paginated resources, in particular the use of the rel="next" and rel="prev" tags.

Hreflang

This tab provides information on the use of the hreflang tag for the correct setup of a multi-language, or multi-language and multi-country, website.

SEO audits for multi-language websites require additional effort, beyond the complexity of analyzing multiple markets.

Custom

The Custom tab allows you to check the URLs obtained through the use of custom filters and extractions.

Analytics and Search Console

Through this tab you can integrate your Google Analytics and Google Search Console accounts.

Conclusions

This is a basic guide to the SEO Spider, meant to show its potential and areas of use. To date, Screaming Frog is one of the best tools for conducting technical SEO analysis. It would certainly be useful to complement this guide with real case studies from SEO audits carried out for our clients, to make it more enjoyable to follow.
