Extracting all the links from a website using wget

Today I found myself needing to extract all the page links from a website, to make sure that when we restructured the site, every old link was redirected to its new page location and there were no nasty 404s.

So here I present my "Quick and dirty website link extractor". Complete with gratuitous command piping, it's ready to run on any Linux box with the appropriate programs installed:

MYSITE='http://example.com'; wget -nv -r --spider "$MYSITE" 2>&1 | grep ' URL:' | awk '{print $3}' | sed "s@URL:${MYSITE}@@g"

Obviously you'll need to replace example.com with your own site address and wait patiently; it only shows output once the spidering is finished.
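For anyone who'd rather see the pipeline one stage at a time, here's the same thing unpacked. The wget invocation is unchanged; the two log lines below are made-up samples in wget's -nv log format, echoed in so each stage can be seen working without spidering a live site:

```shell
#!/bin/sh
# Same pipeline as the one-liner, one stage per line. The two log lines
# are hypothetical samples of wget -nv output; in real use you would pipe
#   wget -nv -r --spider "$MYSITE" 2>&1
# into the chain instead of echoing them.
MYSITE='http://example.com'
sample_log='2019-01-01 10:00:00 URL:http://example.com/about/ [1024] -> "about" [1]
2019-01-01 10:00:01 URL:http://example.com/contact/ [512] -> "contact" [1]'

echo "$sample_log" |
  grep ' URL:' |              # keep only lines that record a fetched URL
  awk '{print $3}' |          # third field is "URL:<full address>"
  sed "s@URL:${MYSITE}@@g"    # strip the prefix, leaving just the path
```

Run against the sample log above, this prints /about/ and /contact/, one per line.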


It's actually pretty scary what you can do with wget and a little bit of imagination.

Ok, so that's it. And I don't want to hear any comments from the Perl monks with WWW:Mech one-liners! :)

— Andrew
