June 3, 2011

Extracting all the links from a website using wget

Today I found myself needing to extract all the page links from a website to ensure that when we restructured the site, all the old links were redirected to the new page locations and there were no nasty 404s.

So here I present my "Quick and dirty website link extractor", complete with gratuitous command piping and ready to run on any Linux box with the appropriate programs installed:

MYSITE='http://example.com'; wget -nv -r --spider "$MYSITE" 2>&1 | egrep ' URL:' | awk '{print $3}' | sed "s@URL:${MYSITE}@@g"
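If you'd rather read that pipeline one stage at a time, here it is written out as a small script with each step commented. It's just a sketch of the one-liner above, and example.com is still a placeholder for your own site:

#!/bin/sh
MYSITE='http://example.com'

# wget -nv -r --spider : crawl the whole site without saving anything,
#                        printing one terse log line per URL (on stderr,
#                        hence the 2>&1 redirect into the pipe).
# egrep ' URL:'        : keep only the log lines that report a URL.
# awk '{print $3}'     : the third field of those lines is "URL:<address>".
# sed                  : strip the "URL:http://example.com" prefix,
#                        leaving just the page path.
wget -nv -r --spider "$MYSITE" 2>&1 |
  egrep ' URL:' |
  awk '{print $3}' |
  sed "s@URL:${MYSITE}@@g"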

Obviously you'll need to replace example.com with your own ...
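And since the whole point was to catch broken links after the restructure, you can feed the extracted paths back into something like curl and flag anything that still returns a 404. This bit isn't part of the one-liner above; the old-paths.txt file name and the use of curl are just one way of doing it:

#!/bin/sh
MYSITE='http://example.com'

# Read the list of paths produced by the extractor (saved here as
# old-paths.txt, one path per line) and report any that still 404.
while read -r path; do
  # -s silences progress, -o /dev/null discards the body,
  # -w '%{http_code}' prints only the HTTP status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' "${MYSITE}${path}")
  if [ "$status" = "404" ]; then
    echo "Still broken: ${MYSITE}${path}"
  fi
done < old-paths.txt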

— Andrew

