Friday, June 25, 2010

Wget : downloading urls matching a regular expression

I want to download URLs recursively,
starting from :,
but I want to download only those URLs which
match this pattern :*

I tried wget -r -D
but it downloads only index.html and stops.

I tried a few other options, but they didn't work as intended either.

Ans : 
Hm... no Linux.... O.K., here is another alternative: w3mir. It's Perl-based and not restricted to Linux. I actually tried it on Windows and it works as expected.

Download w3mir, unpack it, and read the file INSTALL.w32. Basically, these are the steps to "install" it on Windows:

get and install winzip from
get and install ActivePerl (now Build 509) from
get nmake.exe from

After installing the tools above, run this in the unpacked w3mir directory:
  nmake install

After that, w3mir will be installed in the default path of your Perl installation. To check that it runs, print its help text:

   w3mir -h

Here is a sample configuration file for your problem, w3mir.cfg:

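# Retrieve all of janl's home pages:
Options: recurse
# This is the two argument form of URL:.  It fetches the first into the second
# Fetch only URLs that match this Perl regular expression
Fetch-RE: m/flash/
# Change to this directory before retrieving
cd: d:\mirror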

Then run w3mir like this: 

   mkdir d:\mirror
   w3mir -cfgfile w3mir.cfg
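
To make the idea behind Fetch-RE concrete, here is a minimal sketch in Python of a recursive crawl that follows a link only when a regular expression matches its URL. It is not how w3mir is implemented, just an illustration of the filtering. The names crawl, LinkParser, fetch and SITE, and the example.com URLs, are all made up for this sketch; the pages live in an in-memory dictionary so it runs without network access.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, pattern, fetch):
    """Return the URLs visited, in the order they were fetched.

    The start page is always fetched; any other link is followed only
    if the regular expression `pattern` matches its absolute URL.
    `fetch` is a hypothetical callable: URL in, HTML text (or None) out.
    """
    wanted = re.compile(pattern)
    seen = set()
    visited = []
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue  # page not available, nothing to parse
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen and wanted.search(link):
                queue.append(link)
    return visited


# Made-up site held in memory (an assumption for illustration only).
SITE = {
    "http://example.com/": '<a href="/flash/a.html">a</a> <a href="/other/b.html">b</a>',
    "http://example.com/flash/a.html": '<a href="c.html">c</a>',
    "http://example.com/flash/c.html": "",
}

print(crawl("http://example.com/", r"flash", SITE.get))
```

To try it against a real site, fetch could be replaced with a small function built on urllib.request.urlopen that returns the decoded page body; the filtering logic stays the same.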
