Dead URLs in your sitemap.xml file are a quick way to hurt your website’s search rankings.
The curl command can be used to check every <loc> element defined in the file to find any broken links.
This command was adapted from Analyzing XML Sitemap Files with Bash, which goes into much greater detail.
Output sitemap.xml Content
The first step is to write the content of the sitemap.xml file to stdout.
If the file resides on a publicly accessible web server, use the curl command to fetch and output the content:
curl blog.atj.me/sitemap.xml
If the file exists on the local file system, use the cat command to output the content:
cat ./public/sitemap.xml
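Either way, the commands in the next step assume the file has the standard sitemap shape, with each URL wrapped in a <loc> element (the paths here are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.atj.me/</loc>
  </url>
  <url>
    <loc>https://blog.atj.me/posts/example-post/</loc>
  </url>
</urlset>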
Parse sitemap.xml URLs
Use grep to match lines containing “loc” and sed to extract the URL:
curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g'
This should output a list of URLs.
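For the sample file above, the output would look something like:

https://blog.atj.me/
https://blog.atj.me/posts/example-post/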
Check sitemap.xml URLs
Running this command will send a request to each URL in the sitemap.xml file, so be careful not to DoS yourself. Use the xargs command to invoke curl for each URL in sequence. Running curl with the following arguments will output the HTTP status code of the response and the requested URL (but not the content of the response).
curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}
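Each line of output pairs a status code with the URL that produced it; for the sample list above it might look like:

200 https://blog.atj.me/
200 https://blog.atj.me/posts/example-post/

If the sitemap contains many URLs, one rough way to throttle the crawl is to sleep between requests via a subshell (the one-second delay here is an arbitrary choice):

curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} sh -c 'curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" "$1"; sleep 1' _ {}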
Filter Output
For sites with a lot of pages, it may be useful to suppress the output for successful requests. To show only responses with a status code other than 200, filter the output with grep -v. Anchoring the pattern to the start of the line ensures that URLs which happen to contain “200” are not filtered out:
curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {} | grep -v '^200'
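With the sample URLs, a broken link would show up as a lone failure line, for example:

404 https://blog.atj.me/posts/example-post/

Empty output means every URL in the sitemap returned 200.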
Makefile
The “$” character must be escaped as “$$” when using this command in a Makefile (note that the recipe line must be indented with a tab):
crawl:
	curl blog.atj.me/sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$$|\1|g' | xargs -I {} curl -s -o /dev/null -w "%{http_code} %{url_effective}\n" {}
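The check can then be run with:

make crawl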