I have come to realize that, even though I love to do everything in regex, extracting links with regex is not really the best idea.
A better idea is to use HTMLParser to extract the links.
This code snippet extracts links from a page body and, given the URL where the body was found, resolves the relative links. It is tailored for use by a crawler.
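As a rough illustration of the relative-link resolution step only (this is not the HTMLParser-based code itself), here is a minimal sketch that uses nothing but java.net.URI. It assumes the raw href values have already been pulled out of the page by whatever parser you use. The class name LinkResolver and the method resolveAll are made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HTMLParser: resolves raw href values
// against the URL of the page they were found on.
public class LinkResolver {

    public static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> resolved = new ArrayList<String>();
        for (String href : hrefs) {
            String trimmed = href.trim();
            // Skip empty hrefs and in-page anchors; there is nothing to fetch.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Skip schemes a crawler cannot fetch over HTTP.
            String lower = trimmed.toLowerCase();
            if (lower.startsWith("javascript:") || lower.startsWith("mailto:")) {
                continue;
            }
            try {
                // Absolute hrefs come back unchanged; relative ones are
                // resolved against the page URL.
                URI target = base.resolve(new URI(trimmed));
                resolved.add(target.normalize().toString());
            } catch (URISyntaxException e) {
                // Malformed href (e.g. unescaped spaces): skip it in this sketch.
            }
        }
        return resolved;
    }
}
```

For a page found at http://example.com/a/b.html, the href c.html would resolve to http://example.com/a/c.html and /d to http://example.com/d. java.net.URI follows RFC 2396, so some edge cases may resolve differently than they would in a browser.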
Thanks for your post, it helped me a lot.
Btw, after browsing HTMLParser's javadocs, I found a simpler way to do it:
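import java.net.URL;
import java.util.ArrayList;
import java.util.Collection;
import org.htmlparser.beans.LinkBean;

// this.url is the URL of the page to extract links from
Collection<String> result = new ArrayList<String>();
LinkBean linkBean = new LinkBean();
linkBean.setURL(this.url.toString());
URL[] links = linkBean.getLinks();
for (URL link : links) {
    result.add(link.toString());
}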
Thanks Sephi,
Does your technique resolve relative links, or deal with JavaScript links?
If anything, though, it makes the loop simpler. Thanks.
I've been using the LinkBean approach in production for a while and it works great.
Btw, how do I use the above code?
What are byte[] body and the charset?
Hello friends...
Where can we get tutorials for HTMLParser? Googling is not helping me right now, as I am new to both Java and HTMLParser.