Note to self: extract links using htmlparser

Friday, November 09, 2007

extract links using htmlparser

I have come to realize that, even though I love to do everything in regex, extracting links with regex is not really the best idea.

A better idea is to use HTMLParser to extract the links.

This code snippet extracts links from a page body and, given the URL where the body was found, resolves the relative links. It is tailored for use by a crawler.
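
The relative-link step described above can be sketched with the JDK alone, independent of HTMLParser. This is only a minimal sketch and not the post's own code: LinkResolver and resolveLinks are hypothetical names, and it assumes the raw href values have already been pulled out of the page by the parser. It also assumes a crawler only wants http and https targets, so javascript: and mailto: links are dropped.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HTMLParser.
class LinkResolver {

    /**
     * Resolves each raw href against the URL of the page it was found on.
     * Hrefs that fail to parse, or that do not end up as http(s) URLs
     * (for example javascript: or mailto:), are skipped.
     */
    static List<String> resolveLinks(String pageUrl, List<String> hrefs) {
        List<String> resolved = new ArrayList<>();
        URI base;
        try {
            base = new URI(pageUrl);
        } catch (URISyntaxException e) {
            return resolved; // unusable base URL: nothing can be resolved
        }
        for (String href : hrefs) {
            try {
                // resolve() leaves absolute hrefs alone and merges relative ones with the base
                URI absolute = base.resolve(new URI(href.trim())).normalize();
                String scheme = absolute.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    resolved.add(absolute.toString());
                }
            } catch (URISyntaxException e) {
                // malformed href (for example one containing a space): skip it
            }
        }
        return resolved;
    }
}
```

Calling LinkResolver.resolveLinks with the page URL and the hrefs found on that page returns absolute URLs that a crawler can queue directly. Skipping malformed hrefs instead of failing is a design choice that suits crawling, where one bad link should not abort the whole page.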

5 comments:

Anonymous, 9/17/2008 5:52 AM
Thanks for your post it helped me a lot.
Btw, after browsing HTMLParser's javadocs, I found a simpler way to do it:

import java.net.URL;
import java.util.ArrayList;
import java.util.Collection;
import org.htmlparser.beans.LinkBean;

Collection<String> result = new ArrayList<String>();

// LinkBean fetches the page at the given URL and collects its links
LinkBean linkBean = new LinkBean();
linkBean.setURL(this.url.toString());
URL[] links = linkBean.getLinks();
for (URL link : links) {
    result.add(link.toString());
}
Mikhail Koryak, 1/24/2009 11:00 AM
Thanks Sephi,
Does your technique resolve relative links, or deal with JavaScript links?
If anything though, it makes the loop simpler. Thanks
Jay, 6/17/2011 11:37 AM
I've been using the LinkBean approach in production for a while and it works great.
Anonymous, 10/02/2011 5:04 AM
Btw, how do I use the above code?
What are byte[] body and the charset?
Hem, 4/22/2012 2:39 AM
Hello friends...

Where can we get tutorials for HTMLParser? Googling is not helping me right now, as I am new to both Java and HTMLParser.