Friday, June 25, 2010

Wget: downloading URLs matching a regular expression

Hi,
I want to download URLs recursively, starting from http://code.google.com/apis/maps/, but only those URLs that match this pattern:
http://code.google.com/apis/maps/*

I tried
   wget -r -D http://code.google.com/apis/maps/ http://code.google.com/apis/maps/
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.

Ans:
Hm... no Linux... OK, here is another alternative: w3mir. It is Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

http://www.langfeldt.net/w3mir/

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

get and install winzip from http://www.winzip.com/
get and install ActivePerl (now Build 509) from http://www.activeperl.com/
get nmake.exe from ftp://ftp.microsoft.com/Softlib/MSLFILES/nmake15.exe

After installing the tools above, run the following in the unpacked w3mir directory:
   perl makefile.pl
   nmake
   nmake install

After that, w3mir will be installed in the default path of your Perl installation. To check that it works, run:

   w3mir -h

Here is a sample configuration file, w3mir.cfg, for your problem:

# Mirror the Google Maps API documentation recursively
Options: recurse
#
# Start URL for the mirror
URL: http://code.google.com/apis/maps/documentation/
# Only fetch URLs matching this Perl regular expression (see the note below)
Fetch-RE: m/flash/
# Local directory to mirror into
cd: d:\mirror
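
As I understand it, Fetch-RE limits which of the discovered URLs w3mir actually fetches; the m/flash/ pattern above restricts the mirror to URLs containing "flash". If you want everything under http://code.google.com/apis/maps/ as in the original question, something like the following should do it (a sketch only, assuming w3mir hands the expression to Perl unchanged):

   Fetch-RE: m/apis\/maps/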

Then run w3mir like this: 

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
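
For completeness, if you would rather stick with wget: the -D option expects a comma-separated list of domains (e.g. code.google.com), not a URL, which is most likely why only index.html was downloaded. Restricting recursion to a path is normally done with --no-parent and, if needed, --include-directories. An untested sketch for this case:

   wget -r -np --include-directories=/apis/maps http://code.google.com/apis/maps/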
