Icon by Freepik from Flaticon. Licensed under CC 3.0 BY.

Having dead URLs in your sitemap.xml file is a surefire way to tank your website’s search rankings. The curl command can be used to request the URL in every <loc> element of the file and flag any broken links.

This command was adapted from “Analyzing XML Sitemap Files with Bash”, which goes into much greater detail.

Output sitemap.xml Content

The first step is to write the content of the sitemap.xml file to stdout.

If the file resides on a publicly accessible web server, use the curl command to fetch and output the content:

curl blog.atj.me/sitemap.xml

If the file exists on the local file system, use the cat command to output the content:

cat ./public/sitemap.xml
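
The two cases can be folded into one small helper. This is only a sketch: the function name sitemap_content is hypothetical, and it assumes that any argument starting with http:// or https:// is a remote sitemap while everything else is a local path.

```shell
# Hypothetical helper: print a sitemap from a URL or a local path.
# Assumption: arguments starting with http:// or https:// are remote;
# anything else is treated as a file on disk.
sitemap_content() {
    case "$1" in
        http://*|https://*) curl -s "$1" ;;   # -s hides curl's progress meter
        *)                  cat "$1" ;;
    esac
}
```

With that defined, `sitemap_content http://blog.atj.me/sitemap.xml` and `sitemap_content ./public/sitemap.xml` both write the XML to stdout, so the rest of the pipeline does not need to care where the file lives.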

Parse sitemap.xml URLs

Use grep to match lines containing “loc” and use sed to extract the URL:

curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g'

This should output a list of URLs, one per line, provided each <loc> element sits on its own line in the file.
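
To see the extraction without touching the network, it can be run against a small inline sample. The sketch below uses made-up example.com URLs and a hypothetical extract_locs function. It is a slight variation on the command above: it matches the literal <loc> tag rather than any line containing “loc”, and it strips indentation around the element. Like the original, it assumes one <loc> element per line.

```shell
# Hypothetical helper: read sitemap XML on stdin, print one URL per line.
# Assumes each <loc> element is on its own line.
extract_locs() {
    grep '<loc>' | sed 's|.*<loc>\(.*\)</loc>.*|\1|'
}

# Sample input with made-up URLs, for illustration only.
sample_sitemap() {
    cat <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2020-01-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
  </url>
</urlset>
EOF
}

sample_sitemap | extract_locs
# prints:
# http://example.com/
# http://example.com/about/
```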

Check sitemap.xml URLs

Running the command below sends a request to every URL in the sitemap.xml file. Be careful not to DoS your own server.

Use the xargs command to invoke curl for each URL in sequence. The -s flag silences curl's progress meter, -o /dev/null discards the response body, and -w prints the HTTP status code of the response followed by the requested URL.

curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}
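
One way to keep the request rate gentle is to swap xargs for a loop that pauses between requests. The following is a sketch rather than part of the original command: the function name check_urls and the DELAY variable (seconds between requests, defaulting to 1) are assumptions made for illustration.

```shell
# Hypothetical helper: read URLs on stdin, print "<status> <url>" for each.
# DELAY is an assumed knob: seconds to wait between requests (default 1).
check_urls() {
    # read with the default IFS trims leading and trailing whitespace,
    # so indented URLs from the sed step are handled.
    while read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
        sleep "${DELAY:-1}"
    done
}
```

It reads the same input as the xargs stage, so it can be dropped in after the sed step, for example `... | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | DELAY=2 check_urls`.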

Filter Output

For sites with a lot of pages, it may be useful to suppress the output for successful requests. To show only the responses whose status code is not 200, filter on the start of each line, so that URLs that merely contain “200” are not hidden by accident:

curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {} | grep -v '^200 '
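
The filter is easy to check offline by feeding it a few hand-written result lines. The sketch below compares only the status field, using a hypothetical only_failures function and made-up URLs, so a broken page whose address happens to contain “200” still shows up.

```shell
# Hypothetical helper: keep only result lines whose status code is not 200.
# Input format: "<status> <url>", as printed by the curl -w format above.
only_failures() {
    awk '$1 != 200'
}

# Made-up results for illustration. curl reports 000 when it received
# no HTTP response at all (for example, a DNS or connection failure).
printf '%s\n' \
    '200 http://example.com/' \
    '404 http://example.com/post-200/' \
    '200 http://example.com/page-200/' \
    '000 http://example.invalid/' | only_failures
# prints:
# 404 http://example.com/post-200/
# 000 http://example.invalid/
```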

Makefile

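The “$” character must be escaped as “$$” if using this command in a Makefile, because make would otherwise treat it as the start of a variable reference. The recipe line must also be indented with a tab, not spaces:

crawl:
	curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}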