Well, to parse HTML I used JTidy.
JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document being processed, which lets you use JTidy as a DOM parser for real-world HTML.
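Since the DOM that JTidy returns is a standard `org.w3c.dom.Document`, the code that reads it needs nothing JTidy-specific. Here is a rough sketch of pulling the links out of a page. The `LinkExtractor` class and `collectLinks` method are names I made up for illustration. The JTidy call sits in a comment, because it needs the JTidy jar on the classpath; the JDK's XML parser stands in for it so the snippet runs as-is on well-formed input.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of JTidy.
class LinkExtractor {

    // Collects the href value of every <a> element that has one.
    static List<String> collectLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            String href = anchor.getAttribute("href");
            if (href != null && !href.isEmpty()) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<a href=\"http://jtidy.sourceforge.net/\">JTidy</a>"
                + "</body></html>";
        InputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));

        // With the JTidy jar on the classpath, the DOM would come from JTidy instead,
        // which also copes with malformed HTML:
        //   org.w3c.tidy.Tidy tidy = new org.w3c.tidy.Tidy();
        //   tidy.setQuiet(true);
        //   Document doc = tidy.parseDOM(in, null);
        // Stand-in used here: the JDK parser, which only accepts well-formed markup.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

        System.out.println(collectLinks(doc));
    }
}
```

Once you switch to the `Tidy.parseDOM` line, the same `collectLinks` method should work on messy real-world pages too, as long as you stick to basic DOM calls like the ones above.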
JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.
You can check here for more: http://jtidy.sourceforge.net/