how to download specific files from some url path with wget

July 11, 2018 - Reading time: ~1 minute

wget -r -l1 --no-parent -A ".deb" http://www.shinken-monitoring.org/pub/debian/

-r recurse
-l1 limit recursion to a maximum depth of 1
--no-parent ignore links to a higher directory
-A ".deb" accept only files with this suffix (a glob pattern like "*.deb" also works)


generate a list of a site's URLs using wget

June 12, 2018 - Reading time: ~1 minute

You can use wget to generate a list of the URLs on a website.

Spider example.com, writing URLs to urls.txt and filtering out common asset files (css, js, images, etc.):

wget --spider -r http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

Note that the resulting list contains duplicate URLs.
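If the duplicates bother you, sort -u strips them after the fact. A minimal sketch, using a stand-in urls.txt rather than real spider output:

```shell
# Sample list with a duplicate entry (stand-in for a real urls.txt).
printf 'http://www.example.com/\nhttp://www.example.com/about\nhttp://www.example.com/\n' > urls.txt

# sort -u sorts the list and drops exact duplicate lines.
sort -u urls.txt > urls-unique.txt

cat urls-unique.txt
```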

If you mirror instead of spider, you seem to get a more comprehensive list without duplicates:

wget -m http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > urls.txt

This will download all pages of the site into a directory with the same name as the domain.
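Once the mirror finishes you can inventory what was fetched with find. A sketch, using a faked-up www.example.com directory in place of a real mirror run:

```shell
# Stand-in for a mirrored site (wget -m creates a directory named
# after the domain, with the site's path structure underneath).
mkdir -p www.example.com/blog
touch www.example.com/index.html www.example.com/blog/post1.html

# List every HTML page that was downloaded.
find www.example.com -name '*.html' | sort
```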


download all links from a site and save to a text file

June 11, 2016 - Reading time: ~1 minute

wget is not designed for this. You can, however, parse its output to get what you want:

$ wget http://aligajani.com -O - 2>/dev/null | grep -oP 'href="\Khttp:.+?"' | sed 's/"//' | grep -v facebook > file.txt
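You can sanity-check the grep/sed stages of that pipeline locally by feeding them a snippet of HTML instead of live wget output. A sketch, assuming GNU grep (for -P):

```shell
# Sample HTML standing in for a downloaded page.
html='<a href="http://example.com/a">a</a> <a href="http://facebook.com/x">f</a> <a href="http://example.com/b">b</a>'

# Same stages as above: pull out the href="http..." values (one per
# line), strip the trailing quote, and drop facebook links.
links=$(printf '%s\n' "$html" | grep -oP 'href="\Khttp:.+?"' | sed 's/"//' | grep -v facebook)

printf '%s\n' "$links"
# → http://example.com/a
# → http://example.com/b
```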

You could use lynx for this:

lynx -dump -listonly http://aligajani.com | grep -v facebook.com > file.txt

This command dumps only the links of a single page. To download a site recursively with wget:

wget -r -p -k http://website

or

wget -r -p -k --wait=SECONDS http://website

The second form is for websites that may flag you for downloading too quickly; aggressive crawling can also degrade the site's service, so use the --wait variant in most circumstances to be courteous. Everything is saved in a folder named after the website, inside whatever directory your terminal was in when you ran the command.