Extracting HTML fragments in Java

I have text that may contain HTML islands.
Example:
qwwdeadaskdfdaskjfhbsdfkfSet attribute valuesgfkjgfkjrgjgjgjgjgroggjrog <b>jsoup</b>sdflkjsdfsfklsfklfjsfkljsfljsfJsoup.parse(String html)skgjdfgkjdfgkldfjgdfkgljdfg
How can I extract those HTML fragments?

Java supports both DOM and SAX parsing for XML, but both require the document to be well-formed, so your example would not parse. There is a project called NekoHTML (http://nekohtml.sourceforge.net/) that supports scanning HTML that is not well-formed.

I do exactly what you are asking (find HTML fragments in a chunk of text) by wrapping an enclosing tag around the text and then using a javax.xml.parsers.DocumentBuilder to create a DOM tree.
The basic idea (omitting much) is just
String fragment = "<wrap_node>" + orig_text + "</wrap_node>";
// parse(String) treats its argument as a URI, so pass the text through an InputSource
Document d = builder.parse(new InputSource(new StringReader(fragment)));
If the tags aren't well-formed (missing end tags, improper nesting, etc.), this fails with a parse error. That suits me, because I want to reject anything malformed.
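A fuller sketch of the wrap-and-parse idea, using only the JDK's built-in XML APIs. The class and method names (FragmentExtractor, extractFragments) are made up for illustration and are not from the answer above. It returns each element found directly under the wrapper, serialized back to markup, and throws on malformed input. Note that a bare & or < in the surrounding text will also make the parse fail.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class FragmentExtractor {

    // Wraps the text in a single root element, parses it as XML, and
    // returns the markup of every element that is a direct child of the wrapper.
    static List<String> extractFragments(String origText) throws Exception {
        String wrapped = "<wrap_node>" + origText + "</wrap_node>";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(String) would treat the argument as a URI, hence the InputSource.
        Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

        // Identity transformer used to turn DOM nodes back into text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        List<String> fragments = new ArrayList<>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                StringWriter out = new StringWriter();
                serializer.transform(new DOMSource(child), new StreamResult(out));
                fragments.add(out.toString());
            }
        }
        return fragments;
    }
}
```

Calling FragmentExtractor.extractFragments("abc <b>jsoup</b> def") yields a one-element list holding the b element, while text such as "a <b>unclosed" is rejected with an exception.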

Related

Java - Extract html information from string

All of the guides out there tell me how to remove the HTML tags from the text to extract the text between them. What I am after is the data inside the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?

I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware trying to parse it as "standard" XML as XML-parsing is strict by nature and will fail if the page does not conform to XML markup specs (which few HTML pages do).

You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.

You can use a library like jerichoHTML, which enables you to search for HTML tags as well as their attributes, or you can build a DOM on your own.

Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
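As a minimal sketch of that DOM route, assuming the input happens to be well-formed like the FONT example in the question (real-world HTML often is not, which is where the lenient parsers mentioned above come in). FontSizeReader and readAttribute are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class FontSizeReader {

    // Parses a single well-formed element and returns the value of one of
    // its attributes. Element.getAttribute returns "" if the attribute is absent.
    static String readAttribute(String markup, String attributeName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        Element root = doc.getDocumentElement();
        return root.getAttribute(attributeName);
    }
}
```

With the string from the question, FontSizeReader.readAttribute("<FONT SIZE=\"5\">Hello World</FONT>", "SIZE") returns "5". XML names are case-sensitive, so asking for "size" returns an empty string.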

How to configure nekohtml parser to properly close the anchor tag?

I'm using the NekoHTML parser to parse my HTML. Sometimes, by mistake, an anchor tag in the content is written like this:
<a href="http://abc.com">abc</a>
After parsing it through NekoHTML, I want the content to be corrected like this:
abc
Please help me configure NekoHTML to achieve this.
Update:
I tried the setting
parser.setFeature( "http://cyberneko.org/html/features/balance-tags", true );
but it made no difference. It does not give the result I expected; it returns the same HTML content I passed in.

You need to set the balance-tags feature, which specifies whether the NekoHTML parser should attempt to balance the tags in the parsed document.
config.setFeature( "http://cyberneko.org/html/features/balance-tags", true );
from the docs:
Balancing the tags fixes up many common mistakes by adding missing parent elements, automatically closing elements with optional end tags, and correcting unbalanced inline element tags. In order to process HTML documents as XML, this feature should not be turned off. This feature is provided as a performance enhancement for applications that only care about the appearance of specific elements, attributes, and/or content regardless of the document's ill-formed structure.

Parse Ampersand in XML with Java's DOM XML API

I am trying to parse an XML document with the Java DOM API (not SAX). Whenever the parser encounters an ampersand (&) while parsing a text node, it errors out. I am guessing that this is solvable by 1) escaping, 2) encoding, or 3) using a different parser.
I am reading an XML document that I don't have any control over, so I cannot know in advance where the ampersand will appear each time I read it.
The answers I have seen to similar questions advise replacing the entity when parsing the XML, but I am not sure how I can do that, since the document doesn't even parse once it hits the ampersand.
Any help will be appreciated.

As noted, the XML is malformed (oops!): all occurrences of & in XML (other than the token introducing a character entity [?]) must be encoded as &amp;.
Some solutions (which are basically just as described in the post!):
Fix the XML (at the source, or in a hack-it-up phase), or;
Parse it with the "appropriate" tool (e.g. a "forgiving" HTML parser)
For the "hack-it-up" approach, consider a separate input stream (see Working with Filter Streams) that executes as a filter prior to the actual DOM parser: whenever an & is encountered (that is not part of a character entity) it "fixes it" by inserting &amp; into the stream. Of course, if the XML source didn't get basic encoding correct...
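A rough sketch of the hack-it-up idea. To keep it short, it pre-processes the whole document as a String with a regular expression rather than implementing a true filter stream, so it is a simpler stand-in for the stream approach, not the same thing. AmpersandFixer and its methods are hypothetical names. It treats any & that is not followed by something shaped like name; or #digits; or #xhex; as bare. It will also rewrite ampersands inside CDATA sections, and undeclared named entities such as &nbsp; will still fail to parse.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class AmpersandFixer {

    // Matches an '&' that does NOT begin something shaped like an
    // entity reference (&name;) or character reference (&#38; / &#x26;).
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Replaces every bare ampersand with the escaped form.
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Repairs the input, parses it with the standard DOM parser, and
    // returns the text content of the root element.
    static String parseRootText(String brokenXml) throws Exception {
        String repaired = escapeBareAmpersands(brokenXml);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(repaired)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

For example, escapeBareAmpersands("<a>Tom & Jerry &lt; R&D</a>") leaves &lt; alone and escapes the other two ampersands, after which the DOM parser accepts the document.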
Happy coding.

"I am reading an XML document that I don't have any control over."
No, you are reading a non-XML document. The reason you get an error is that XML parsers are required to give you an error when you read something that isn't XML.
The XML culture is that responsibility for producing well-formed XML rests with the sender. You need to change whatever produces this data to do it properly. Otherwise, you might as well forget XML and its benefits, and move back to the chaotic world of privately-agreed protocols and custom parsers.

Is it possible to use Apache Digester to filter dynamic xml leaf tags?

I've used Apache digester before and loved the branch based searching of xml tags.
Specifying a tag as
h\a\b\
is very intuitive.
Now I want to do an XML filtering project, but Apache Digester doesn't seem like it will work, simply because there is no way to get at the underlying XML tags. As the FAQ says:
How do I get some xml nested within a tag as a literal string?
It is frequently asked how some XML (esp. XHTML) nested within a document can be extracted as a string, eg to extract the contents of the "body" tag below as a string:
...some xml code...
If you can modify the above to wrap the desired text as a CDATA section then things are easy; Digester will simply treat that CDATA block as a single string:
...some xml code...
If this can't be done then you need to use a NodeCreateRule to create a DOM node representing the body tag and its children, then serialise that DOM node back to text.
Remember that Digester is just a layer on top of a standard XML parser, and standard XML parsers have no option to just stop parsing input at a specific element - unless it knows that the contents of that element is a block of characters (CDATA).
Is there something that uses the same pattern system that I can use to filter XML? My idea is to take the patterns given by the user, blacklist them, and copy everything else.
Or maybe there is a way to find the location of a match in Apache Digester (the location in the XML, not just the displayed text). That would be enough for me to copy the other text by keeping a copy of it around and skipping the matches.
Edit: I've since found out that XPath looks almost OK for doing this, but all the applications I found were for selecting something, not removing it. Do you have an example of this?

Never mind, managed to do it with XPath.
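The poster did not share their solution, so the following is only one plausible way to do blacklist-style filtering with XPath and the JDK's DOM APIs: select the unwanted nodes, detach each from its parent, and serialize what is left. XPathFilter and removeMatches are hypothetical names, and the sketch only removes nodes that have a parent (elements, text, comments), not attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class XPathFilter {

    // Removes every node matched by the XPath expression and returns
    // the remaining document as a string.
    static String removeMatches(String xml, String blacklistExpr) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList matches = (NodeList) xpath.evaluate(blacklistExpr, doc, XPathConstants.NODESET);

        for (int i = 0; i < matches.getLength(); i++) {
            Node node = matches.item(i);
            Node parent = node.getParentNode();
            if (parent != null) {          // attribute nodes have no parent; skip them
                parent.removeChild(node);
            }
        }

        // Serialize the filtered tree back to text.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

For instance, removeMatches("<h><a><b>secret</b><c>keep</c></a></h>", "/h/a/b") drops the b element and keeps the rest of the tree.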

Xpath error related to "nbsp"

I got this error when parsing my HTML page using XPath. I am also using HtmlCleaner.
If it is not clear, I can post my Java code and HTML as well.

The original input is HTML and you're treating it as XML. XML has fewer predefined entities than HTML has. Either use an HTML parser, or declare the entity in your XML parser, or textually replace &nbsp; with &#160; in the original input.
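A small sketch of the textual-replacement option, using only the JDK's XML and XPath APIs rather than HtmlCleaner. NbspFixer and evaluate are hypothetical names. The &nbsp; entity is swapped for the numeric character reference &#160;, which every XML parser understands without a declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: names are illustrative only.
class NbspFixer {

    // Replaces the HTML-only entity with a numeric character reference,
    // parses the result as XML, and evaluates the XPath expression as a string.
    static String evaluate(String markup, String xpathExpr) throws Exception {
        String fixed = markup.replace("&nbsp;", "&#160;");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(fixed)));
        return XPathFactory.newInstance().newXPath().evaluate(xpathExpr, doc);
    }
}
```

With this in place, NbspFixer.evaluate("<p>a&nbsp;b</p>", "/p") returns the text with a real non-breaking space (U+00A0) between the two letters instead of failing on an undeclared entity.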
