July 31, 2009

HTMLParser


The HTMLParser Java library helps you parse HTML-based web pages. It can be used for scraping information from a website. One possible use case: you want to expose the information on a website as a web service. HTMLParser provides command line tools, rules you define yourself in Java, and an intuitive user interface for defining parsing rules.

The following example defines a filter that searches developers-blog.org for "h1" tags with the class attribute "title" and the value "entry".

The results of the filter are div tags with all of the blog entry information.
The generated Java code:
import org.htmlparser.*;
import org.htmlparser.filters.*;
import org.htmlparser.beans.*;
import org.htmlparser.util.*;

public class DevelopersFilter
{
    public static void main (String args[])
    {
        // Match tags by name (here: H1).
        TagNameFilter filter0 = new TagNameFilter ();
        filter0.setName ("H1");
        // Match tags whose "class" attribute has the value "title".
        HasAttributeFilter filter1 = new HasAttributeFilter ();
        filter1.setAttributeName ("class");
        filter1.setAttributeValue ("title");
        // Combine both filters: a node must satisfy the name AND the attribute filter.
        NodeFilter[] array0 = new NodeFilter[2];
        array0[0] = filter0;
        array0[1] = filter1;
        AndFilter filter2 = new AndFilter ();
        filter2.setPredicates (array0);
        // Hand the combined filter to the bean that fetches and filters the page.
        NodeFilter[] array1 = new NodeFilter[1];
        array1[0] = filter2;
        FilterBean bean = new FilterBean ();
        bean.setFilters (array1);
        if (0 != args.length)
        {
            // The first argument is the URL of the page to parse.
            bean.setURL (args[0]);
            System.out.println (bean.getNodes ().toHtml ());
        }
        else
            System.out.println ("Usage: java -classpath .:htmlparser.jar DevelopersFilter <url>");
    }
}
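
To make the role of the AndFilter in the generated code easier to see, here is a minimal sketch of the same idea using only the Java standard library. It is not HTMLParser code: the Tag class and the tagName, hasAttribute and select helpers are hypothetical names made up for this illustration, and the case-insensitive name comparison is an assumption of the sketch. It only shows that a combined filter accepts a node when every one of its predicates accepts it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class AndFilterSketch
{
    // Hypothetical stand-in for a parsed tag: a name plus its attributes.
    public static final class Tag
    {
        public final String name;
        public final Map<String, String> attributes;

        public Tag (String name, Map<String, String> attributes)
        {
            this.name = name;
            this.attributes = attributes;
        }
    }

    // Plays the role of the tag name filter; this sketch ignores case.
    public static Predicate<Tag> tagName (String name)
    {
        return tag -> tag.name.equalsIgnoreCase (name);
    }

    // Plays the role of the attribute filter: the attribute must exist with exactly this value.
    public static Predicate<Tag> hasAttribute (String attribute, String value)
    {
        return tag -> value.equals (tag.attributes.get (attribute));
    }

    // Keeps only the tags accepted by the given filter.
    public static List<Tag> select (List<Tag> tags, Predicate<Tag> filter)
    {
        return tags.stream ().filter (filter).collect (Collectors.toList ());
    }

    public static void main (String[] args)
    {
        List<Tag> tags = Arrays.asList (
            new Tag ("h1", Collections.singletonMap ("class", "title")),
            new Tag ("h1", Collections.singletonMap ("class", "subtitle")),
            new Tag ("div", Collections.singletonMap ("class", "title")));

        // Both conditions must hold, like the two predicates of the AndFilter above.
        Predicate<Tag> filter = tagName ("H1").and (hasAttribute ("class", "title"));

        for (Tag tag : select (tags, filter))
            System.out.println (tag.name + " " + tag.attributes);
    }
}
```

Only the first tag in the sample list passes both checks; the second has the wrong class value and the third has the wrong tag name.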
Regards
Rafael Sobek

Posted by rafael.sobek at 7:21 AM in Uncategorized

 
