Extracting all the links from a website using wget

Today I found myself needing to extract all the page links from a website to ensure that when we restructured the site, all the old links were redirected to the new page locations and there were no nasty 404s.

So here I present my "Quick and dirty website link extractor". Complete with gratuitous command piping, ready to run on any Linux box with the appropriate programs installed:

MYSITE='http://example.com';wget -nv -r --spider $MYSITE 2>&1 | egrep ' URL:' | awk '{print $3}' | sed "s@URL:${MYSITE}@@g"
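
If that one-liner is a bit much to read, here's the same thing spread out with comments so you can see what each stage is doing (just a sketch using the same tools, nothing new):

MYSITE='http://example.com'

wget -nv -r --spider "$MYSITE" 2>&1 |   # wget logs to stderr, so pull it into the pipe
  egrep ' URL:' |                       # keep only the log lines that mention a URL
  awk '{print $3}' |                    # the URL: field is the third column
  sed "s@URL:${MYSITE}@@g"              # strip the prefix, leaving just the path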

Obviously you'll need to replace example.com with your own site address and wait patiently. It will only show output when the spidering is finished:

#--snip
/media/blog/article/python-cygwin-install/select-5.png
/media/blog/article/python-cygwin-install/select-6.png
/media/blog/article/python-cygwin-install/7.png
/media/blog/article/python-cygwin-install/8.png
/media/blog/article/python-cygwin-install/9.png
/media/blog/article/python-cygwin-install/10.png
/feeds/tags/Django/
/feeds/tags/Python/
/feeds/tags/Windows/
#-- end snip
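
And since the whole point was to make sure the old links still resolve after the restructure, here's a rough follow-up sketch. It assumes you've dumped the extractor's output into a file called old-links.txt and that NEWSITE points at the rebuilt site (both names are just placeholders), then asks curl for the final status code of each path:

NEWSITE='http://example.com'

while read -r path; do
  # -s: quiet, -L: follow redirects, -o /dev/null: throw away the body,
  # -w '%{http_code}': print only the final HTTP status code
  code=$(curl -s -L -o /dev/null -w '%{http_code}' "${NEWSITE}${path}")
  [ "$code" = "404" ] && echo "BROKEN: ${path}"
done < old-links.txt

Anything it prints is a path that still needs a redirect.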

It's actually pretty scary what you can do with wget and a little bit of imagination.

Ok, so that's it. And I don't want to hear any comments from the Perl monks with WWW::Mechanize one-liners! :)

— Andrew

