Extracting Index Coverage Report Stats in Bulk from Search Console using Node.js

In my previous article I talked about how to get indexing information in bulk from Google Search Console using the URL Inspection Tool and Node.js. That tool is great for gathering information about specific URLs on your site. However, Google also gives site owners a more holistic view of the indexing status of their sites with the Index Coverage Report.

Index Coverage Report Search Console

You can check Google’s own documentation and video tutorial to understand the data this section provides in more detail, but at a high level the key data points are:

  1. The number of pages that Google has indexed.
  2. The number of pages that Google has found but has not indexed (either because of an error or because they were purposefully excluded).
  3. How big your site is from Google’s point of view (Valid + Excluded + Errors).

Right now there are four main categories (Errors, Valid with warning, Valid and Excluded), subdivided into 29 subcategories. Each subcategory provides an additional level of classification to help site owners and SEOs understand why their URLs belong in the main category. Not all subcategories will be visible, only the ones that apply to your site.

Details table Index Coverage Report Search Console

Unfortunately, the export option on the Index Coverage Report view (pictured above) only gives you the top-level numbers per report. If you want to know and export which URLs are inside each report, you have to click on each report and export them one by one.

Export individual coverage report detail

Extracting the data this way is very manual and time-consuming. Hence, I decided to automate it with Node.js and add a few more features to it.

Installing and running the script

Make sure that you have Node.js installed on your machine. At the time of writing this post I’m using version 14.16.0. The script uses syntax that is only available from version 14 onwards, so double check that you are on that version or above.

# Check Node version
node -v
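
If you would rather have the check done in code, here is a minimal sketch that compares the major version against 14. The helper name isSupportedNode and the constant MIN_MAJOR are only illustrations and are not part of the script.

```javascript
// Minimum major version, taken from the requirement stated in this post.
const MIN_MAJOR = 14;

// Hypothetical helper: returns true when the version string is new enough.
// Accepts "14.16.0" as well as "v14.16.0" (the format printed by `node -v`).
function isSupportedNode(version, minMajor = MIN_MAJOR) {
  const major = Number.parseInt(version.replace(/^v/, '').split('.')[0], 10);
  return Number.isInteger(major) && major >= minMajor;
}

// process.versions.node holds the running version without the leading "v".
if (!isSupportedNode(process.versions.node)) {
  console.error(`Node ${MIN_MAJOR}+ is required, found ${process.versions.node}`);
  process.exitCode = 1;
}
```

Running this with an older Node version prints the error and sets a non-zero exit code; on version 14 or above it prints nothing.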

Download the script using git, GitHub’s CLI, or by downloading the code from GitHub directly.

# Git
git clone https://github.com/jlhernando/index-coverage-extractor.git

# GitHub’s CLI
gh repo clone https://github.com/jlhernando/index-coverage-extractor

Then install the necessary modules to run the script by typing npm install in your terminal.

npm install

In order to extract the coverage data from your website/property, update the credentials.js file with your Search Console user and password.

After that, type npm start in your terminal to run the script.

npm start

The script logs the processing in the console so you are aware of what is happening.

Index Coverage Extractor in action

Like in the URL Inspection Automator, the script uses Playwright and runs in headless mode. If you want to see the browser automation in action, simply change the launch option to headless: false in the index.js file and save it before running the script.

change headless mode to see the browser automation in action
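
As a rough illustration of that toggle, the sketch below assumes the script launches Chromium through Playwright's chromium.launch; the real index.js may be organized differently, and the names launchOptions and openBrowser are mine.

```javascript
// Options passed to Playwright when launching the browser.
// headless: true hides the browser window (the default behaviour described above);
// headless: false opens a visible window so you can watch each step.
const launchOptions = { headless: false };

// Hypothetical wrapper, not the script's actual code.
async function openBrowser(options = launchOptions) {
  // Loaded lazily so this snippet can be read and loaded without Playwright installed.
  const { chromium } = require('playwright');
  return chromium.launch(options);
}

// Usage (requires `npm install playwright`):
// openBrowser().then(async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://search.google.com/search-console');
//   await browser.close();
// });
```

Switching the value back to true restores the headless behaviour.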

The output

The script will create a "coverage.csv" file and a "summary.csv" file.

The "coverage.csv" will contain all the URLs that have been extracted from each individual coverage report.

Coverage report detail csv
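
To give an idea of how per-report URL lists can be flattened into a single CSV, here is a small sketch. The column names (report, url), the sample data and the function names are assumptions for illustration; the actual file produced by the script may use different headers.

```javascript
// Hypothetical input: report name -> URLs extracted from that report.
const extractedReports = {
  'Submitted and indexed': ['https://example.com/', 'https://example.com/about'],
  'Crawled - currently not indexed': ['https://example.com/tag/a,b'],
};

// Quote a field when it contains a comma, a double quote or a line break.
function csvField(value) {
  const text = String(value);
  return /[",\n\r]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Flatten the per-report lists into one CSV row per URL.
function toCoverageCsv(reports) {
  const lines = ['report,url'];
  for (const [report, urls] of Object.entries(reports)) {
    for (const url of urls) {
      lines.push([report, url].map(csvField).join(','));
    }
  }
  return lines.join('\n');
}

const coverageCsv = toCoverageCsv(extractedReports);
console.log(coverageCsv);
```

Quoting matters here because URLs can legitimately contain commas, which would otherwise shift the columns.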

The "summary.csv" will contain the number of URLs per report that have been extracted, the total number that GSC reports in the user interface (either the same or higher) and an "extraction ratio", which is the number of URLs extracted divided by the total number of URLs reported by GSC.

Coverage report summary csv

I believe this extra data point is useful because GSC has an export limit of 1000 rows per report. Hence, the "extraction ratio" gives you an accurate idea of how many URLs you have been able to extract versus how many you are missing from that report.
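
The arithmetic behind the ratio is simple, and a short sketch makes the effect of the export limit concrete. The function and field names below are hypothetical and the report figures are made-up sample numbers, not real data.

```javascript
// Row limit per exported report, as mentioned above.
const EXPORT_LIMIT = 1000;

// Hypothetical helper that builds one summary row for a report.
function summarizeReport(name, extracted, reported) {
  // An empty report has nothing to miss, so treat its ratio as 1.
  const ratio = reported > 0 ? extracted / reported : 1;
  return {
    report: name,
    extracted,
    reported,
    ratio: Number(ratio.toFixed(2)),
    missing: Math.max(reported - extracted, 0),
    hitLimit: extracted >= EXPORT_LIMIT,
  };
}

// A report that fits under the limit is extracted in full.
const smallReport = summarizeReport('Not found (404)', 320, 320);
// A report with 2,500 URLs is cut at 1,000 rows: ratio 0.4, 1,500 URLs missing.
const largeReport = summarizeReport('Excluded by noindex tag', 1000, 2500);

console.log(smallReport, largeReport);
```

A ratio of 1 means you have every URL in the report; anything lower tells you what share of the report you are actually looking at.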

Future updates

There are a few features that I think would be really nice to have in future releases. For example, extracting the "update" date and adapting the script to run as a Google Cloud Function that stores the data in BigQuery only if the date is different from the previous date stored.

I know other SEOs are already doing this kind of extraction using Google Sheets, which is very cool, so I might give that a try. In the meantime, I hope that you find this script useful, and if you have any thoughts, hit me up on Twitter.
