Friday, November 09, 2007

extract links using htmlparser

I have come to realize that even though I love to do everything in regex, extracting links with regex is not really the best idea.

A better idea is to use HTMLParser to extract the links.

This code snippet extracts links from a page body and, given the URL where the body was found, resolves the relative links against it. It is tailored for use by a crawler.
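
For anyone who only wants to see the relative-link step on its own, here is a minimal sketch using nothing but the JDK's java.net.URI. It is not the HTMLParser-based code from this post: the class name LinkResolver and the method resolveAll are made up for illustration, and it assumes the href values have already been pulled out of the page by whatever parser you use.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HTMLParser: resolves raw href values
// against the URL of the page they were found on.
class LinkResolver {

    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null) {
                continue;
            }
            String h = href.trim();
            // Skip empty values and same-page anchors; a crawler gains nothing from them.
            if (h.isEmpty() || h.startsWith("#")) {
                continue;
            }
            try {
                // Relative references are resolved against the page URL;
                // absolute ones come back unchanged.
                URI resolved = base.resolve(new URI(h));
                String scheme = resolved.getScheme();
                // Keep only http(s); this drops javascript: and mailto: links.
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    out.add(resolved.toString());
                }
            } catch (URISyntaxException e) {
                // Malformed href (for example, one containing a raw space): skip it.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "c.html", "/d", "http://other.org/x", "javascript:void(0)", "#top");
        for (String link : resolveAll("http://example.com/a/b.html", hrefs)) {
            System.out.println(link);
        }
    }
}
```

Filtering on the scheme after resolving is a design choice that suits a crawler: anything that is not fetchable over HTTP gets dropped in one place, whether it started out relative or absolute.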

5 comments:

  1. Thanks for your post, it helped me a lot.
    Btw, after browsing HTMLParser's javadocs, I found a simpler way to do it:

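    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collection;
    import org.htmlparser.beans.LinkBean;

    Collection<String> result = new ArrayList<String>();

    // Point the bean at the page URL, then collect every link it reports
    LinkBean linkBean = new LinkBean();
    linkBean.setURL(this.url.toString());
    URL[] links = linkBean.getLinks();
    for (URL link : links) {
        result.add(link.toString());
    }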

  2. Thanks Sephi,
    does your technique resolve relative links, or deal with JavaScript links?
    If anything though, it makes the loop simpler. Thanks

  3. I've been using the LinkBean approach in production for a while and it works great.

  4. Btw, how do I use the above code?
    What are byte[] body and the charset?

  5. Hello friends...

    Where can we find tutorials for HTMLParser? Googling is not helping me right now, as I am new to both Java and HTMLParser.
