Tuesday, October 16, 2007

extract links from a webpage

this is useful if you decide to make a web crawler and don't want to bother with an html parser. Read the body of the page into a string, then run a regex over it to extract all the links.

supported link types: <img src=...> and <a href=...>
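
here is a minimal sketch of what such a regex could look like in Python. This is not necessarily the original pattern from this post: the names LINK_RE and extract_links are made up for illustration, and the pattern assumes reasonably well-formed tags. It handles double-quoted, single-quoted and unquoted attribute values, but it will be fooled by things like links inside html comments or scripts.

```python
import re

# Hypothetical pattern (an assumption, not the post's original regex):
# captures the href value of <a ...> tags and the src value of <img ...> tags.
# Group 1 = double-quoted value, group 2 = single-quoted, group 3 = unquoted.
LINK_RE = re.compile(
    r"""<(?:a\b[^>]*?\bhref|img\b[^>]*?\bsrc)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))""",
    re.IGNORECASE,
)


def extract_links(body):
    """Return every matched href/src value in the page body, in document order."""
    links = []
    for match in LINK_RE.finditer(body):
        # Exactly one of the three groups takes part in any given match.
        links.append(match.group(1) or match.group(2) or match.group(3) or "")
    return links
```

calling extract_links(body) on the page string gives back a plain list of url strings, absolute and relative mixed together.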


an absolute url has the form "http://something.com/blah"
a relative url has the form "/something/path.blee" and has to be resolved against the url of the page it was found on
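
one way to tell the two apart and turn a relative url into an absolute one is the standard library's urllib.parse. A small sketch, assuming you know the url of the page the link came from; the helper names is_absolute and to_absolute are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(link):
    """Treat a link as absolute if it carries its own scheme (http, https, ...)."""
    return bool(urlparse(link).scheme)


def to_absolute(page_url, link):
    """Resolve a link against the url of the page it was found on.

    Links that are already absolute come back unchanged.
    """
    return urljoin(page_url, link)
```

for example, resolving "/something/path.blee" against a page at "http://something.com/dir/page.html" gives "http://something.com/something/path.blee".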

now you can figure out what to do with these...
