I am still having problems with HTMLEditorKit and HTMLDocument in Java. I can set the inner HTML of an element, but I cannot get it. Is there some way to get the underlying HTML code of an element?
My problem is that the HTML support is quite poor and badly written. The API does not provide basic, expected functions. I need to change the colspan or rowspan attribute of a <td>. The Java developers have closed off the straightforward way: the attribute set of an element is immutable. A workaround could be to take the code of the element (e.g. <td colspan="2">Hi <u>world</u></td>) and replace it with new content (e.g. <td colspan="3">Hi <u>world</u></td>). This way seems to be closed too. (Bonus question: What is HTMLEditorKit good for?)
You can get the HTML of the selected Element. Use the write() method of the kit, passing it the offsets of the Element. Note that the output will include the surrounding tags ("<html>", "<body>", etc.).
Thanks for the hint, Stanislav. Here is my solution:
/**
* The method gets inner HTML of given element. If the element is named <code>p-implied</code>
* or <code>content</code>, it returns null.
* @param e element
* @param d document containing given element
* @return the inner HTML of a HTML tag or null, if e is not a valid HTML tag
* @throws IOException
* @throws BadLocationException
*/
public String getInnerHtmlOfTag(Element e, Document d) throws IOException, BadLocationException {
if (e.getName().equals("p-implied") || e.getName().equals("content"))
return null;
CharArrayWriter caw = new CharArrayWriter();
int i;
final String startTag = "<" + e.getName();
final String endTag = "</" + e.getName() + ">";
final int startTagLength = startTag.length();
final int endTagLength = endTag.length();
write(caw, d, e.getStartOffset(), e.getEndOffset() - e.getStartOffset());
//we have the element but wrapped as full standalone HTML code beginning with HTML start tag
//thus we need unpack our element
StringBuffer str = new StringBuffer(caw.toString());
while (str.length() >= startTagLength) {
if (str.charAt(0) != '<')
str.deleteCharAt(0);
else if (!str.substring(0, startTagLength).equals(startTag))
str.delete(0, startTagLength);
else
break;
}
//we've found the beginning of the tag
for (i = 0; i < str.length(); i++) { //skip it...
if (str.charAt(i) == '>')
break; //we've found end position of our start tag
}
str.delete(0, i + 1); //...and eat it
//skip the content
for (i = 0; i < str.length(); i++) {
if (str.charAt(i) == '<' && i + endTagLength <= str.length() && str.substring(i, i + endTagLength).equals(endTag))
break; //we've found the end position of inner HTML of our tag
}
str.delete(i, str.length()); //now just remove all from i position to the end
return str.toString().trim();
}
This method can easily be modified to return the outer HTML (the code including the entire tag).
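As a rough sketch of that modification, assuming the string produced by write() is already in hand: the helper below returns the first start tag of the given name through the first end tag of that name. The class and method names (OuterHtml, extractOuterHtml) are hypothetical and not part of Swing. Because it stops at the first closing tag, nested elements with the same name are not handled.

```java
final class OuterHtml {
    /**
     * Returns the first <tagName ...>...</tagName> span found in html,
     * or null if no such span exists. Hypothetical helper for illustration.
     */
    static String extractOuterHtml(String html, String tagName) {
        String open = "<" + tagName;
        String close = "</" + tagName + ">";
        int start = html.indexOf(open);
        // Skip look-alike tags, e.g. "<table" when searching for "<t".
        while (start >= 0) {
            int after = start + open.length();
            if (after < html.length()) {
                char c = html.charAt(after);
                if (c == '>' || c == '/' || Character.isWhitespace(c)) {
                    break;
                }
            }
            start = html.indexOf(open, start + 1);
        }
        if (start < 0) {
            return null;
        }
        int end = html.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end + close.length());
    }
}
```

The result could then be edited (for example, changing colspan) and written back with the kit's usual insertion methods.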
I am trying to extract text between particular tags and attributes. For now, I tried to extract for tags. I am reading a ".gexf" file which has XML data inside. Then I am saving this data as a string. Then I am trying to extract text between "nodes" tag. Here is my code so far:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
private static String filePath = "src/babel.gexf";
public String readFile(String filePath) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(filePath));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append("\n");
line = br.readLine();
}
return sb.toString();
} finally {
br.close();
}
}
public void getNodesContent(String content) throws IOException {
final Pattern pattern = Pattern.compile("<nodes>(\\w+)</nodes>", Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(content);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
}
public static void main(String [] args) throws IOException {
Main m = new Main();
String result = m.readFile(filePath);
m.getNodesContent(result);
}
}
In the code above, I don't get any result. When I try it with sample string like "My string", I get the result. Link of the gexf (since it is too long, I had to upload it) file:
https://files.fm/u/qag5ykrx
I don't think placing the entire file contents into a single string is such a great idea, but I suppose that depends on the amount of content within the file. If there is a lot of content, I would read it in a little differently. It would have been nice to see a fictitious example of what the file contains.
I suppose you can try this little method. The heart of it utilizes a regular expression (RegEx) along with Pattern/Matcher to retrieve the desired substring from between tags.
It is important to read the documentation that comes with the method:
/**
* This method will retrieve a string contained between string tags. You
* specify what the starting and ending tags are within the startTag and
* endTag parameters. It is you who determines what the start and end tags
* are to be which can be any strings.<br><br>
*
* @param inputString (String) Any string to process.<br>
*
* @param startTag (String) The Start Tag or String. Data content retrieved
* will be directly after this tag.<br><br>
*
* The supplied Start Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p~*~>" and of course
* your End Tag can be "</p>". The "<p~*~>" would be the same as supplying
* "<p style=\"padding-left:40px;\">". Anything between the characters <p and
* the supplied close chevron (>) is taken into consideration. This allows for
* contents extraction regardless of what HTML attributes are attached to the
* tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
* Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param endTag (String) The End Tag or String. Data content retrieval will
* end just before this Tag is reached.<br>
*
* The supplied End Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p style=\"padding-left:40px;\">"
* and your End Tag can be "</~*~>". The "</~*~>" would be the same as supplying
* "</p>". Anything between the characters </ and the supplied close chevron (>)
* is taken into consideration. This allows for contents extraction regardless of what the
* HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param trimFoundData (Optional - Boolean - Default is true) By default
* all retrieved data is trimmed of leading and trailing white-spaces. If
* you do not want this then supply false to this optional parameter.
*
* @return (1D String Array) If there is more than one pair of Start and End
* Tags contained within the supplied input String then each set is placed
* into the Array separately.<br>
*
* @throws IllegalArgumentException if any supplied method String argument
* is null or empty ("").
*/
public static String[] getBetweenTags(String inputString, String startTag,
String endTag, boolean... trimFoundData) {
if (inputString == null || inputString.equals("") || startTag == null ||
startTag.equals("") || endTag == null || endTag.equals("")) {
throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
+ "A supplied method argument contains Null (\"\")!\n"
+ "Supplied Method Arguments:\n"
+ "==========================\n"
+ "inputString = \"" + inputString + "\"\n"
+ "startTag = \"" + startTag + "\"\n"
+ "endTag = \"" + endTag + "\"\n");
}
List<String> list = new ArrayList<>();
boolean trimFound = true;
if (trimFoundData.length > 0) {
trimFound = trimFoundData[0];
}
Matcher matcher;
if (startTag.contains("~*~") || endTag.contains("~*~")) {
startTag = startTag.replace("~*~", ".*?");
endTag = endTag.replace("~*~", ".*?");
Pattern pattern = Pattern.compile("(?iu)" + startTag + "(.*?)" + endTag);
matcher = pattern.matcher(inputString);
} else {
String regexString = Pattern.quote(startTag) + "(?s)(.*?)" + Pattern.quote(endTag);
Pattern pattern = Pattern.compile("(?iu)" + regexString);
matcher = pattern.matcher(inputString);
}
while (matcher.find()) {
String match = matcher.group(1);
if (trimFound) {
match = match.trim();
}
list.add(match);
}
return list.toArray(new String[list.size()]);
}
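One reason the pattern in the question finds nothing is that \w+ matches only word characters, so it cannot span the whitespace and child tags that sit between <nodes> and </nodes>. Since a .gexf file is XML, an alternative to regular expressions is the XML parser that ships with the JDK. The sketch below assumes the usual GEXF layout, in which each <node> element carries a label attribute; the actual file was not shown, so the element and attribute names are assumptions, and GexfNodes and nodeLabels are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class GexfNodes {
    /** Collects the "label" attribute of every <node> element in the XML text. */
    static List<String> nodeLabels(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> labels = new ArrayList<>();
            // Assumed layout: <nodes><node id="..." label="..."/></nodes>
            NodeList nodes = doc.getElementsByTagName("node");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                labels.add(node.getAttribute("label"));
            }
            return labels;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

The string returned by readFile() in the question could be passed straight to nodeLabels().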
Without a sample of the file I can only suggest so much. What I can tell you is that you can get the substring of that text using a tag search loop. Here is an example:
String s = "<a>test</a><b>list</b><a>class</a>";
char[] c = s.toCharArray(); // convert once instead of on every access
int start = 0, end = 0;
for (int i = 0; i < c.length - 2; i++) {
    // opening <a> tag
    if (c[i] == '<' && c[i+1] == 'a' && c[i+2] == '>') {
        start = i + 3;
        // begin at start so that contents shorter than three characters are not skipped
        for (int j = start; j < c.length - 3; j++) {
            // closing </a> tag
            if (c[j] == '<' && c[j+1] == '/' && c[j+2] == 'a' && c[j+3] == '>') {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}
The above code searches the string s for the opening a tag, starts where it found it, and continues until it finds the closing a tag. It then uses those two positions to create a substring of s, which is the text between the two tags. You can stack as many of these tag searches as you want. Here is an example of a two-tag search:
String s = "<a>test</a><b>list</b><a>class</a>";
char[] c = s.toCharArray(); // convert once instead of on every access
int start = 0, end = 0;
for (int i = 0; i < c.length - 2; i++) {
    // opening <a> or <b> tag
    if ((c[i] == '<' && c[i+1] == 'a' && c[i+2] == '>') ||
        (c[i] == '<' && c[i+1] == 'b' && c[i+2] == '>')) {
        start = i + 3;
        // begin at start so that contents shorter than three characters are not skipped
        for (int j = start; j < c.length - 3; j++) {
            // closing </a> or </b> tag
            if ((c[j] == '<' && c[j+1] == '/' && c[j+2] == 'a' && c[j+3] == '>') ||
                (c[j] == '<' && c[j+1] == '/' && c[j+2] == 'b' && c[j+3] == '>')) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}
The only difference is that I've added clauses to the if statements to also get the text between b tags. This system is extremely versatile and I think you'll find an abundance of uses for it.
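For comparison, the same tag search can be sketched with String.indexOf instead of character-by-character comparisons, which lets the tags be any length. TagSearch and between are hypothetical names chosen for this illustration. Like the loops above, it does not handle nested tags.

```java
import java.util.ArrayList;
import java.util.List;

final class TagSearch {
    /** Returns every substring of s that lies between an open tag and the next close tag. */
    static List<String> between(String s, String open, String close) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start < 0) {
                break; // no further opening tag
            }
            start += open.length();
            int end = s.indexOf(close, start);
            if (end < 0) {
                break; // opening tag without a closing tag
            }
            out.add(s.substring(start, end));
            from = end + close.length();
        }
        return out;
    }
}
```

Calling between(s, "<a>", "</a>") on the sample string from the examples above collects the text of both a elements in order.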
I have some HTML that looks like
<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>
I need to extract the HTML from the comment to a closing dl tag. The closing dl is the first one after the comment (not sure if there could be more after, but never is one before). The HTML between the two is variable in length and content and doesn't have any good identifiers.
I see that comments themselves can be selected using #comment nodes, but how would I get the HTML starting from a comment and ending with an HTML close tag as I've described?
Here's what I've come up with, which works, but is obviously not the most efficient.
String myDirectoryPath = "D:\\Path";
File dir = new File(myDirectoryPath);
Document myDoc;
Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
for (File child : dir.listFiles()) {
System.out.println(child.getAbsolutePath());
File file = new File(child.getAbsolutePath());
String charSet = "UTF-8";
String innerHtml = Jsoup.parse(file,charSet).select("body").html();
Matcher m = p.matcher(innerHtml);
if (m.find()) {
Document doc = Jsoup.parse(m.group(1));
String myText = doc.text();
try {
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)));
out.println(myText);
out.close();
} catch (IOException e) {
// error
}
}
}
If you want to use a regex, something simple like this may do:
# "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"
<!-- \s* start \s* content \s* -->
([\S\s]*?)
</ \s* dl \s* >
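A minimal sketch of applying this pattern with java.util.regex follows. CommentToDl and extract are hypothetical names; the pattern itself is the one shown above. The captured group holds everything after the comment up to, but not including, the first closing dl tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentToDl {
    private static final Pattern START_TO_DL = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    /** Returns the HTML between the comment and the first closing dl tag, or null if absent. */
    static String extract(String html) {
        Matcher m = START_TO_DL.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

If the closing tag itself is wanted in the result, m.group(0) could be used and the leading comment stripped afterwards.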
Here's some example code - it may need further improvements - depending on what you want to do.
final String html = "<p>abc</p>" // Additional tag before the comment
+ "<!-- start content -->\n"
+ "<p>Blah...</p>\n"
+ "<dl><dd>blah</dd></dl>"
+ "<p>def</p>"; // Additional tag after the comment
// Since it's not a full Html document (header / body), you may use a XmlParser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for( Node node : doc.childNodes() ) // Iterate over all elements in the document
{
if( node.nodeName().equals("#comment") ) // if it's a comment we do something
{
// Some output for testing ...
System.out.println("=== Comment =======");
System.out.println(node.toString().trim()); // 'toString().trim()' is only there to beautify the output
System.out.println("=== Childs ========");
// Get the siblings of the comment; the nodes of interest are the ones that follow it
final List<Node> childNodes = node.siblingNodes();
// Start- and endindex for the sublist - this is used to skip tags before the actual comment node
final int startIdx = node.siblingIndex(); // Start index - start after (!) the comment node
final int endIdx = childNodes.size(); // End index - the last following node
// Iterate over all nodes, following after the comment
for( Node child : childNodes.subList(startIdx, endIdx) )
{
/*
* Do whatever you have to do with the nodes here ...
* In this example, they are only used as Element's (Html Tags)
*/
if( child instanceof Element )
{
Element element = (Element) child;
/*
* Do something with your elements / nodes here ...
*
* You can skip e.g. 'p'-tag by checking tagnames.
*/
System.out.println(element);
// Stop after processing 'dl'-tag (= closing 'dl'-tag)
if( element.tagName().equals("dl") )
{
System.out.println("=== END ===========");
break;
}
}
}
}
}
For easier understanding, the code is very detailed; you can shorten it at some points.
And finally, here's the output of this example:
=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
<dd>
blah
</dd>
</dl>
=== END ===========
By the way, to get the text of the comment, just cast it to Comment:
String commentText = ((Comment) node).getData();
JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}
Try to use something like
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove first closing word-tag
This should work, though you may have to adjust something.
var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But you have to keep in mind: .replace(" ",foo) only replaces the space once. For multiple replaces you have to use .replace(/\s+/g,foo)
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the first is the better solution.
JSFiddle example: http://jsfiddle.net/c6ftq/4/
inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
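To show what this one-liner does, here is a small sketch that wraps it in a method (WordWrapper and wrapWords are hypothetical names). The lookbehind skips a word that directly follows < or </, and the possessive \w++ together with the lookahead skips a word that runs up to a closing >. It only copes with tags that have no attributes: in something like <p class="x">, the attribute name and value would be wrapped as well.

```java
final class WordWrapper {
    /** Wraps each run of word characters outside simple tags in <word>...</word>. */
    static String wrapWords(String input) {
        return input.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }
}
```

For input with attributes or entities, walking the text nodes of a parsed document is likely a safer route than a regular expression.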
You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}
I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
And I need to build:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text returns just all content of the div. Element.ownText - everything that is not inside children elements. Both are wrong. Iterating through children ignores text nodes.
Is there a way to iterate over the contents of an element so that text nodes are received as well? E.g.
Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>
Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
public static void main(String[] args) throws IOException {
String str = "<div>" +
" Some text <b>with tags</b> might go here." +
" <p>Also there are paragraphs</p>" +
" More text can go without paragraphs<br/>" +
"</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
int i = 0;
for (Node node : div.childNodes()) {
i++;
System.out.println(String.format("%d %s %s",
i,
node.getClass().getSimpleName(),
node.toString()));
}
}
Result:
1 TextNode
Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
for (Element el : doc.select("body").select("*")) {
for (TextNode node : el.textNodes()) {
System.out.println(node.text());
}
}
Assuming you want text only (no tags) my solution is below.
Output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs
public static void main(String[] args) throws IOException {
String str =
"<div>"
+ " Some text <b>with tags</b> might go here."
+ " <p>Also there are paragraphs.</p>"
+ " More text can go without paragraphs<br/>"
+ "</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
StringBuilder builder = new StringBuilder();
stripTags(builder, div.childNodes());
System.out.println("Text without tags: " + builder.toString());
}
/**
* Strip tags from a List of type <code>Node</code>
* @param builder StringBuilder : input and output
* @param nodesList List of type <code>Node</code>
*/
public static void stripTags (StringBuilder builder, List<Node> nodesList) {
for (Node node : nodesList) {
String nodeName = node.nodeName();
if (nodeName.equalsIgnoreCase("#text")) {
builder.append(node.toString());
} else {
// recurse
stripTags(builder, node.childNodes());
}
}
}
You can use TextNode for this purpose:
List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for(TextNode txNode:bodyTextNode){
html+=txNode.text();
}
String k= <html>
<a target="_blank" href="http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&amp;id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&amp;aa=">number S.O.I93(E), dated the 30th March, 1999
</html>
I'm getting this HTML in a String and I want to remove the anchor tag so that the link is removed as well.
I just want to display it as text, not as a link.
How can I do this? I have tried many things without success, so please send me code for this.
I am creating an app for Android, and I am getting this issue in a WebView.
Use jsoup and Jsoup.parse().
You can use the following example (I don't remember where I found it, but it works), which uses a replace method to modify the string before showing it:
k = replace ( k, "<a target=\"_blank\" href=", "");
String replace(String _text, String _searchStr, String _replacementStr) {
// String buffer to store str
StringBuffer sb = new StringBuffer();
// Search for search
int searchStringPos = _text.indexOf(_searchStr);
int startPos = 0;
int searchStringLength = _searchStr.length();
// Iterate to add string
while (searchStringPos != -1) {
sb.append(_text.substring(startPos, searchStringPos)).append(_replacementStr);
startPos = searchStringPos + searchStringLength;
searchStringPos = _text.indexOf(_searchStr, startPos);
}
// Create string
sb.append(_text.substring(startPos,_text.length()));
return sb.toString();
}
To replace the whole opening tag with an empty string:
k = replace ( k, "<a target=\"_blank\" href=\"http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&aa=\">", "");
No escaping is needed for the slash.
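If the exact URL is not known in advance, a regular expression can drop any opening and closing anchor tag while keeping the link text. This is only a sketch under the assumption that no attribute value contains a > character; AnchorStripper and stripAnchors are hypothetical names, and the URL in the usage note is a placeholder.

```java
final class AnchorStripper {
    /** Removes <a ...> and </a> tags (case-insensitive) but keeps the text between them. */
    static String stripAnchors(String html) {
        return html
                .replaceAll("(?i)<a\\b[^>]*>", "") // opening tag with any attributes
                .replaceAll("(?i)</a\\s*>", "");   // closing tag, if present
    }
}
```

For example, stripAnchors("<html><a href=\"http://example.com/\">some text</a></html>") leaves only the html wrapper around the link text. The \b after the a keeps tags such as <abbr> from being removed.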