generate a list of a site's URLs using wget

June 12, 2018 - Reading time: ~1 minute

You can use wget to generate a list of the URLs on a website.

Spider example.com, writing the URLs to urls.txt and filtering out common asset files (CSS, JS, images, etc.):

wget --spider -r http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt
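
Broken down, the same pipeline reads like this (an identical command, just commented; the lines starting with "--" are wget's per-request log lines):

wget --spider -r http://www.example.com 2>&1 |   # wget logs to stderr, so merge it into stdout
  grep '^--' |                                   # keep only the request lines ("--<timestamp>--  <URL>")
  awk '{ print $3 }' |                           # the URL is the third whitespace-separated field
  grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt   # drop asset extensions, save the rest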

Note that the resulting list contains duplicate URLs.
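
If you only need each URL once, a sort -u pass removes the duplicates (a minimal follow-up step, assuming the list was saved to urls.txt as above):

sort -u urls.txt > urls-unique.txt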

If you mirror instead of spider, you seem to get a more comprehensive list without duplicates:

wget -m http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

This downloads every page of the site into a directory named after the domain (www.example.com in this example).
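
If you want the URL list but not a local copy of the site, one option is recursive retrieval with wget's --delete-after option, which removes each file after it has been downloaded (a sketch using -r rather than -m, since timestamped mirroring serves no purpose once the files are deleted):

wget -r --delete-after http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt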