How do I strip all attributes from HTML tags in a string, except "alt" and "src" using Java?
And further.. how do I get the content from all "src" attributes in the string?
:)
You can:
Implement a SAX parser;
Build a document with a DOM parser, walk it, prune it, and then convert back to HTML; or
Use an identity transform in XSLT (assuming your HTML is in XHTML format or can be converted to that with, say, JTidy) with some additional cases to remove attributes you don't want.
Whatever you do, don't try and do it with regular expressions.
OK, solved this somehow.
Used the HTMLCleaner library to parse the input data to a valid format.
Then I used a DOM parser to iterate over everything and strip all disallowed tags and attributes.
(and some minor ugly hacks;) )
This was kind of a lot of work.
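For reference, a minimal sketch of the pruning step (assuming the markup has already been cleaned to well-formed XML, e.g. with HTMLCleaner or JTidy, and parsed into an org.w3c.dom.Document). It strips every attribute except "alt" and "src", collecting the "src" values on the way:

import java.util.ArrayList;
import java.util.List;

import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public static List<String> stripAttributes(Node node, List<String> srcValues) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element element = (Element) node;
        NamedNodeMap attributes = element.getAttributes();
        // Collect names first: removing while iterating a live NamedNodeMap is unsafe.
        List<String> toRemove = new ArrayList<String>();
        for (int i = 0; i < attributes.getLength(); i++) {
            String name = attributes.item(i).getNodeName();
            if (name.equals("src")) {
                srcValues.add(attributes.item(i).getNodeValue());
            }
            if (!name.equals("src") && !name.equals("alt")) {
                toRemove.add(name);
            }
        }
        for (String name : toRemove) {
            element.removeAttribute(name);
        }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        stripAttributes(children.item(i), srcValues);
    }
    return srcValues;
}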
Related
I am using Lucene to index my data in Java. But when I retrieve the indexed terms, tag names like "html" appear among them (html is treated as a term, not a tag, and Lucene doesn't remove it).
Is there any code or library, for example an analyzer like the English analyzer, that can remove the unwanted HTML tags?
If you want to remove HTML tags before indexing in Lucene, you might use PatternReplaceCharFilter. It applies a regular expression and replaces each match with a replacement string.
You could create a char filter like this:
CharFilter cf = new PatternReplaceCharFilter(Pattern.compile("<[^>]*>"), "", reader);
This will replace all HTML tags with the empty string, so they are removed before indexing.
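For context, here is a minimal sketch of wiring that char filter into an Analyzer via initReader(). The tokenizer choice is arbitrary, and the exact signatures assume a Lucene 5.x-style API, so check the javadoc for your version:

import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;

public class TagStrippingAnalyzer extends Analyzer {
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Strip anything that looks like a tag before the tokenizer sees it.
        return new PatternReplaceCharFilter(TAGS, "", reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new WhitespaceTokenizer());
    }
}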
I am being fed an XML document with metadata about online resources that I need to parse. Among the various metadata items is a collection of tags, which are comma-delimited. Here is an example:
<tags>Research skills, Searching, evaluating and referencing</tags>
The issue is that one of these "tags" contains a comma in it. The comma within the tag is encoded, but the commas intended to delimit tags are not. I am (currently) using the getText() method on org.dom4j.Node to read the text content of the <tags> element, which returns a String.
The problem is that I am not able, as far as I'm aware, to differentiate the encoded comma from the unencoded ones in the String I receive.
Short of writing my own XML parser, is there another way to access the text content of this node in a more "raw" state? (viz. a state where the encoded comma is still encoded.)
When you use dom4j or DOM, all the entities are already resolved, so you would need to go back to the parsing step to catch character references.
SAX is a lower-level interface and, via its LexicalHandler interface, can notify you when the parser encounters entity references, but it does not report character references. So it seems you would really need to write your own parser, or patch an existing one.
But in the end it would be best if you can change the schema of your document:
<tags>
<tag>Research skills</tag>
<tag>Searching, evaluating and referencing</tag>
</tags>
In your current document character references are used to act as metadata. XML elements are a better way to express that.
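With that structure, reading the tags becomes trivial; a minimal dom4j sketch (the file name is made up, and dom4j 2.x generics are assumed):

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

SAXReader reader = new SAXReader();
Document doc = reader.read("tags.xml");
for (Element tag : doc.getRootElement().elements("tag")) {
    System.out.println(tag.getText()); // a comma inside a tag is now just text
}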
Using LexEv from http://andrewjwelch.com/lexev/ and putting xercesImpl.jar from Apache Xerces on the class path, I am able to compile and run this short sample using dom4j:
LexEv lexEv = new LexEv();               // SAX filter that reports character references
SAXReader reader = new SAXReader(lexEv); // let dom4j parse through the filter
Document doc = reader.read("input1.xml");
System.out.println(doc.getRootElement().asXML());
If input1.xml contains your sample XML snippet, the output is:
<tags xmlns:lexev="http://andrewjwelch.com/lexev">Research skills, Searching<lexev:char-ref name="#44">,</lexev:char-ref> evaluating and referencing</tags>
So that way you could get a representation of your input where a pure character and a character reference can be distinguished.
As far as I know, every XML processing framework (except vtd-xml) resolves entities during parsing.
You can only distinguish a character from its entity-encoded counterpart in vtd-xml, using VTDNav's toRawString() method.
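For illustration, a minimal vtd-xml sketch against the sample document above (the file name is made up); toString() resolves the character reference, toRawString() leaves it as written:

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

VTDGen vg = new VTDGen();
if (!vg.parseFile("input1.xml", false)) {
    throw new IllegalStateException("parse failed");
}
VTDNav vn = vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/tags");
if (ap.evalXPath() != -1) {
    int t = vn.getText();                  // token index of the element's text
    System.out.println(vn.toString(t));    // "&#44;" comes back as ","
    System.out.println(vn.toRawString(t)); // "&#44;" stays as "&#44;"
}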
I read elements with CDATA sections from an RSS feed which I need to convert to valid XML. The content in the CDATA sections is mostly valid XHTML, but sometimes characters like ampersands appear in attributes (URLs).
I can use .replaceAll("&", "&amp;") to solve this, but thinking a bit further ahead, other invalid characters may show up in attributes or text.
The CMS into which I'm importing the element won't accept CDATA sections without setting up another configuration for the content, so my question is: is there any simple way to escape the string, only for attributes and text?
I'm using the JDOM library to manipulate the XML after the import.
Edit: I've checked out Apache's StringEscapeUtils, but it escapes the whole string. I need something that will only escape attribute values and text inside elements.
Apache Commons provides handy functions for this: StringEscapeUtils
When you use JDOM it will automatically escape any content that needs it. Is your CMS loaded with the output of JDOM, or are you using some other library to populate the CMS?
In essence, if you have valid XML input, and you use JDOM (something from org.jdom2.output.*) to output the data, then you will always have good output. So, what are you doing that produces broken output?
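To illustrate, a minimal JDOM2 sketch (the element name and URL are made up):

import org.jdom2.Element;
import org.jdom2.output.XMLOutputter;

Element link = new Element("link");
link.setAttribute("href", "http://example.com/?a=1&b=2");
link.setText("fish & chips <cheap>");
// The outputter escapes attribute values and text automatically:
// <link href="http://example.com/?a=1&amp;b=2">fish &amp; chips &lt;cheap&gt;</link>
System.out.println(new XMLOutputter().outputString(link));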
Rolf
I'd like to query an HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.
But I'd also like to make modifications to the original source string based on the results of the queries.
Is there a Java HTML parser around that retains indexes to the original source string, so I can locate a node and modify the correct part of the original string?
Cheers.
It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.
While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.
Jericho records begin and end positions for every element, and even for parts of an element like the tag name or an attribute name, so you can edit the document yourself with that information. But where Jericho really shines is the OutputDocument class, which lets you specify replacements directly: you call the appropriate method with the Jericho elements that match your query, instead of explicitly calling getBegin() and getEnd() on them and passing the offsets to some replacement method.
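A minimal sketch of both styles on a made-up snippet:

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;

Source source = new Source("<p>Hello <b>world</b>!</p>");
Element bold = source.getAllElements("b").get(0);

// Every element knows its exact offsets in the original string...
System.out.println(bold.getBegin() + ".." + bold.getEnd()); // 9..21

// ...but OutputDocument saves you the index bookkeeping.
OutputDocument out = new OutputDocument(source);
out.replace(bold, "<i>world</i>");
System.out.println(out.toString()); // <p>Hello <i>world</i>!</p>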
We use the Jericho HTML parser to do the parsing and HTMLCleaner to do the actual clean-up.
We had problems with Jericho's behavior within a server app (memory management, logging) that we fixed; the original developer didn't think our issues were important enough to merge into the main code branch. Our fork is on GitHub.
We also made fixes to HTMLCleaner.
I don't know about the "retain indexes to the original text" part, but Jericho is a very good HTML parser library.
Here is an example of how to remove every span from an HTML string:
import java.util.List;

import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.Tag;

public static String removeSpans(String html) {
    Source source = new Source(html);
    source.fullSequentialParse();
    OutputDocument outputDocument = new OutputDocument(source);
    List<Tag> tags = source.getAllTags();
    for (Tag tag : tags) {
        String tagname = tag.getName().toLowerCase();
        if (tagname.equals("span")) {
            // Remove the <span>/</span> tags themselves; their content is kept.
            outputDocument.remove(tag);
        }
    }
    return outputDocument.toString();
}
I guess you could use HTML Parser.
You can get indexes into the original page using getStartPosition() and getEndPosition() from class Node.
As others have suggested, you probably want to render the DOM. This basically just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like JTidy. Then you have easy access to the document and can modify it as required. I would suggest DOM4J; it has a good API and XPath support too.
Re your "indexing" requirement: during your traversal/querying of the document, you can cache in a list or map any elements or nodes whose text you wish to modify at a later point, as sketched below.
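A minimal dom4j sketch of that pattern (the XPath is made up, and dom4j 2.x generics are assumed):

import java.util.List;

import org.dom4j.Document;
import org.dom4j.Node;

// Select once, keep the references, rewrite the text later.
List<Node> hits = document.selectNodes("//span[@class='note']");
for (Node hit : hits) {
    hit.setText(hit.getText().trim());
}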
This works great:
http://jtidy.sourceforge.net/
Example:
import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();              // obtain a new Tidy instance
tidy.setXHTML(true);                 // set desired config options using Tidy setters
                                     // (equivalent to command-line options)
tidy.parse(inputStream, System.out); // parse 'inputStream', write cleaned output to stdout
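And if the goal is to query the result as XML, Tidy can hand you a DOM directly; a minimal sketch (the file name is made up):

import java.io.FileInputStream;
import java.io.InputStream;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

InputStream in = new FileInputStream("page.html");
Tidy tidy = new Tidy();
tidy.setXHTML(true);
Document dom = tidy.parseDOM(in, null); // null: don't echo the cleaned markup
// 'dom' can now be queried with javax.xml.xpath like any other DOM.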
For crawling the DOM, I recommend using JDOM; it's way faster than simple XML.
http://www.jdom.org/
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// (Note: this snippet uses the standard W3C DOM API.)
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();

Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root");
root.appendChild(text);
doc.appendChild(root);
As far as implementation is concerned, I would make a new document and add nodes to it from the source.
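A minimal sketch of that approach with the W3C DOM; 'sourceElement' (from the parsed source) and 'newDoc' (the new target document) are made-up names:

import org.w3c.dom.Document;
import org.w3c.dom.Node;

// importNode() copies a node from another document; 'true' copies its subtree too.
Node imported = newDoc.importNode(sourceElement, true);
newDoc.getDocumentElement().appendChild(imported);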
You could try ANTLR with an HTML grammar.
You could take (at least) two approaches: try to use it as an actual HTML parser, and then pull out the indexes into the original string that you are interested in.
Or, use its built-in support for in-place transformations on source text, where you define the transformations you want to perform as part of the grammar.
Sorry for asking about almost the same issue again, but now I would like to
write a dom4j Document which contains tags looking like this:
<Field>\r\n some text</Field>
to an XML file, but the \r\n should be escaped to &#13;&#10;.
org.dom4j.Document.asXML()
does not work.
Assuming you mean that's a CRLF sequence in the text node (and not merely a literal backslash-r-backslash-n), you won't be able to persuade an XML serialiser to write it as &#13;&#10;, because XML says you don't have to. The document is absolutely equivalent in XML terms, whether you escape it or not. The only place you need to escape the CRLF sequence as &#13;&#10; is in an attribute value.
If you really must produce this output, you would have to write your own XML serialiser that follows special rules for escaping control codes. But if you are doing this because an external tool can't read an XML element with CRLF sequences in it, you should concentrate on fixing that tool: if it can't deal with newlines in text content, it's broken and not a proper XML parser.
Walk the tree, applying String.replace to the Text nodes.
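A minimal dom4j sketch of this, where 'document' is your org.dom4j.Document (dom4j 2.x generics assumed). Note that you can't inject a literal "&#13;" this way, because the serialiser would escape the ampersand itself, so the replacement has to be ordinary text:

import java.util.List;

import org.dom4j.Document;
import org.dom4j.Node;

List<Node> textNodes = document.selectNodes("//text()");
for (Node node : textNodes) {
    node.setText(node.getText().replace("\r\n", " ")); // substitute the CRLF
}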