Note to self: extract links using htmlparser

Friday, November 09, 2007

extract links using htmlparser

I have come to realize that even though I love to do everything in regex, extracting links with regex is not really the best idea.

The better idea is to use htmlparser to extract the links.

This code snippet extracts links from a page body and, given the URL where the body was found, resolves any relative links against it. It is tailored to be used by a crawler.

import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Collection;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// "log" is assumed to be a logger field on the enclosing class.
public static Collection<String> retrieveLinks(String url, byte[] body, String charSet) {
    Collection<String> result = new ArrayList<String>();
    URI uriLink;
    try {
        uriLink = new URI(url); // url of the body
        String st = new String(body, charSet);
        Parser parser = new Parser();
        parser.setInputHTML(st);
        NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int i = 0; i < list.size(); i++) {
            LinkTag extracted = (LinkTag) list.elementAt(i);
            if (!extracted.isHTTPLikeLink())
                continue; // ignore mailto / javascript and other weird protocol links
            String extractedLink = extracted.extractLink().replaceAll("&amp;", "&"); // we need to unescape these
            // trim before escaping spaces: URI will barf on a link that looks like "foo.com "
            extractedLink = extractedLink.trim();
            // URI class doesn't like spaces in URLs, but content creators don't care.. :P
            extractedLink = extractedLink.replaceAll(" ", "%20");
            if (extractedLink.length() == 0)
                continue; // i am betting that a link that looks like "" is user error. with our url scheme, it most certainly is..
            if (extractedLink.startsWith("#"))
                continue; // skip all anchors, they are useless to a crawler
            if (extractedLink.matches("(?i)^javascript:.*"))
                continue; // HTMLParser thinks anything but lower case 'javascript:' is a non-js link
            URI uriRelative = new URI(extractedLink);
            URI resolved = uriLink.resolve(uriRelative);
            // linkfilter will return true if crawling from
            // url --> resolvedUrl
            // is allowed by some rules
            String link = resolved.toString();
            // here we should check if the link is acceptable.
            result.add(link);
        }
    } catch (URISyntaxException e) {
        // TODO: this is a bad link, log this!
        // note: the catch sits outside the loop, so one bad link ends extraction for the page
        log.info("Bad Link Syntax on page:[" + url + "] BAD LINK:" + e.getMessage());
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ParserException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}
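To see what the cleanup and resolve steps do on their own, here is a minimal sketch that uses only java.net.URI and no htmlparser at all. The class name LinkResolver and the method resolveLink are made up for illustration, and the javascript/mailto regex is only a rough stand-in for isHTTPLikeLink(), not a replacement for it.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of htmlparser: cleans one raw href and
// resolves it against the URL of the page it was found on.
class LinkResolver {

    /** Returns the absolute link, or null if a crawler should skip it. */
    static String resolveLink(String pageUrl, String rawHref) {
        String href = rawHref.trim();
        if (href.isEmpty() || href.startsWith("#")) {
            return null; // empty links and in-page anchors are useless to a crawler
        }
        if (href.matches("(?i)^(javascript|mailto):.*")) {
            return null; // rough stand-in for LinkTag.isHTTPLikeLink()
        }
        // unescape HTML-escaped ampersands, then escape spaces for URI
        href = href.replace("&amp;", "&").replace(" ", "%20");
        try {
            return new URI(pageUrl).resolve(new URI(href)).toString();
        } catch (URISyntaxException e) {
            return null; // bad link syntax: skip this link only
        }
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b.html";
        System.out.println(resolveLink(page, "c.html"));              // http://example.com/a/c.html
        System.out.println(resolveLink(page, "/d e.html"));           // http://example.com/d%20e.html
        System.out.println(resolveLink(page, "../x?p=1&amp;q=2"));    // http://example.com/x?p=1&q=2
        System.out.println(resolveLink(page, "JavaScript:void(0)")); // null
    }
}
```

Catching URISyntaxException per link, as this sketch does, means one malformed href does not stop the rest of the page from being processed.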

5 comments:

Anonymous, 9/17/2008 5:52 AM
Thanks for your post, it helped me a lot.
Btw, after browsing HTMLParser's javadocs, I found a simpler way to do it:

import org.htmlparser.beans.LinkBean;

Collection<String> result = new ArrayList<String>();

LinkBean linkBean = new LinkBean();
linkBean.setURL(this.url.toString());
URL[] links = linkBean.getLinks();
for (URL link : links)
    result.add(link.toString());

Mikhail Koryak, 1/24/2009 11:00 AM
Thanks Sephi,
does your technique resolve relative links, or deal with javascript links?
If anything though, it makes the loop simpler. Thanks

Jay, 6/17/2011 11:37 AM
I've been using the LinkBean approach in production for a while and it works great.

Anonymous, 10/02/2011 5:04 AM
Btw, how do I use the above code?
What are byte[] body and charSet?

Hem, 4/22/2012 2:39 AM
Hello friends...

Where can we get tutorials for HTMLParser? Googling is not helping me right now, as I am new to both Java and HTMLParser.