XPath parsing failing with dom4j for text function - java

My input xml is
String xml= "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<disks-array>\n" +
"<array-item>\n" +
" <value>\n" +
"<scsi>\n" +
"<bus>0</bus>\n" +
"<unit>0</unit>\n" +
"</scsi>\n" +
"<backing>\n" +
"<vmdk_file>[909_TCUP_02] u999orcat017t/u999orcat017t.vmdk</vmdk_file>\n" +
"<type>VMDK_FILE</type>\n" +
"</backing>\n" +
"<label>Hard disk 1</label>\n" +
"<type>SCSI</type>\n" +
"<capacity>107374182400</capacity>\n" +
"</value>\n" +
"<key>2000</key>\n" +
"</array-item>\n" +
"</disks-array>"
and the XPath filter is
"//array-item[contains(./value/backing/vmdk_file/text(),'u999orcat017t/u999orcat017t.vmdk')]"
Here is my parsing and filtering code
Document doc = DocumentHelper.parseText(xml);
XPath xp = DocumentHelper.createXPath(xpathQuery);
// evaluate the xpath
Object xpResult = xp.evaluate(doc);
Ideally it should return the array-item elements whose value/backing/vmdk_file text contains the given text. However, it gives me an empty string.
I am using dom4j 1.6.1 and jaxen 1.1.1.
What is going wrong?

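Finally, after debugging for many hours, I figured out the root cause of the incorrect result. dom4j breaks the text value into multiple text nodes instead of a single node. In XPath 1.0, contains() converts a node-set argument to the string value of its first node only, so text() yields just the first fragment and the match fails.
This turns out to be a bug in the dom4j library, which is still open: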
https://github.com/dom4j/dom4j/issues/21
The fix is to call document.normalize() to settle text nodes.
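As a complementary sketch (not the fix confirmed above), the filter can also be written so that it does not depend on how the text is split: passing the element itself to contains() uses the element's string value, which in XPath 1.0 is the concatenation of all its descendant text nodes. The example below uses the JDK's built-in javax.xml.xpath API rather than dom4j, and the class name VmdkFilter, the SAMPLE constant, and the countMatches helper are made up for illustration. Whether the same expression behaves identically under dom4j with jaxen 1.1.1 is an assumption that should be checked.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class VmdkFilter {

    // Trimmed-down version of the XML from the question.
    static final String SAMPLE =
            "<disks-array><array-item><value><backing>"
          + "<vmdk_file>[909_TCUP_02] u999orcat017t/u999orcat017t.vmdk</vmdk_file>"
          + "<type>VMDK_FILE</type></backing></value><key>2000</key>"
          + "</array-item></disks-array>";

    // Counts array-item elements whose vmdk_file string value contains the needle.
    // Assumption: the needle contains no single quote, since it is spliced
    // directly into the XPath string literal.
    static int countMatches(String xml, String needle) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        // contains() receives the element, not text(), so the whole
        // string value of vmdk_file is searched.
        String query = "//array-item[contains(value/backing/vmdk_file, '" + needle + "')]";
        NodeList hits = (NodeList) xp.evaluate(
                query, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches(SAMPLE, "u999orcat017t/u999orcat017t.vmdk"));
    }
}
```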

Related

jsoup output in angularjs web

doc = Jsoup.connect("https://www.example.com/p/laptop-aksesoris").get();
Element element = doc.select("div.product-card a").first();
Elements elements = element.getElementsByAttribute("href");
for (Element e:elements) {
System.out.println("url: " + e.attr("href"));
System.out.println("text: " + e.text());
}
output:
url: {{model.url}}
text: {{model.name}} {{model.price}} {{e.title}} Grosir {{wp.quantity_min}} - {{wp.quantity_max}} ≥ {{wp.quantity_min}} {{wp.price}} PO
This site uses AngularJS (version 1) for the front end.
When I try a site without AngularJS, it works well.
Questions:
What is happening?
How can I fix it?
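AFAIK jsoup just parses the HTML and its DOM elements. It does not execute JavaScript, so it only sees the raw AngularJS template placeholders (the {{...}} expressions in your output) and cannot parse any dynamically rendered elements.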

How to remove space from excel reading from Java

I'm importing Excel data into a Java program. One column should have its whitespace removed in case the user mistypes a value.
Example value: "12341 "
I've used
replaceAll("\\s+", "");
replaceAll(" ", "");
StringUtils.trim(stringValue);
However, it still returns "12341 " with length 6. It didn't remove the unnecessary whitespace.
EDIT
Complete code, with the return values assigned:
stringArray[x] = stringArray[x].replaceAll("\\s+", "");
stringArray[x] = stringArray[x].replaceAll(" ", "");
stringArray[x] = StringUtils.trim(stringArray[x]);
This should work:
stringValue = stringValue.replaceAll(" ", "");
You need to use the returned value.
I faced the same problem and resolved it by using
text = text.replaceAll("[^\\x00-\\x7F]", "");
Be aware that this removes every non-ASCII character, including any special characters you may want to keep.
Reference:
https://howtodoinjava.com/regex/java-clean-ascii-text-non-printable-chars/
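One possible explanation for the symptom above, offered as an assumption since the question does not show the actual character: the trailing "space" may be a no-break space (U+00A0), which often appears in spreadsheet and web data. By default, Java's \s does not match it, and trim() only strips characters up to U+0020. The sketch below (the class and method names are made up for illustration) removes both ASCII whitespace and Unicode space separators while keeping other non-ASCII characters intact.

```java
public class CellCleaner {

    // Removes ASCII whitespace (\s) plus Unicode separator characters (\p{Z}),
    // which include U+00A0 (no-break space).
    static String stripAllSpaces(String value) {
        return value.replaceAll("[\\s\\p{Z}]+", "");
    }

    public static void main(String[] args) {
        // Assumed input: digits followed by a no-break space.
        String cell = "12341\u00A0";

        System.out.println(cell.trim().length());                  // 6: trim() leaves U+00A0 in place
        System.out.println(cell.replaceAll("\\s+", "").length());  // 6: \s does not match U+00A0
        System.out.println(stripAllSpaces(cell).length());         // 5
    }
}
```

If the length is still wrong after this, printing the code point of the last character, for example with (int) value.charAt(value.length() - 1), would show what is actually there.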

Java regular expression to remove empty xml nodes and childrens completely

I am struggling to find the best solution. Below is my XML:
<Dbtr>
<Nm>John doe</Nm>
<Id>
<OrgId>
<Othr>
<Id/>
</Othr>
</OrgId>
</Id>
</Dbtr>
This should be replaced with the following:
<Dbtr>
<Nm>John doe</Nm>
</Dbtr>
So all the empty nodes, and parents left without any values, should be removed.
I am using the following expression, but it does not do what I want:
docStr = docStr.replaceAll("<(\\w+)></\\1>|<\\w+/>", "");
Any help would be really appreciated.
Edit:
I am creating this XML (not parsing it). It will be sent to a clearing house, which will reject the message because of these empty tags. I do not control how the XML is created: I only provide the values from the database, and as you can see, some of them are empty. The code that writes the XML (which I cannot change) emits the tag first and then the value; all I can control is not writing "null".
My best bet now is to take the output XML as it is, strip the empty tags with some regex logic, and end up with XML that can pass schema validation.
String xml = ""
+ "<Dbtr>"
+ " <Nm>John doe</Nm>"
+ " <Id>"
+ " <OrgId>"
+ " <Othr>"
+ " <Id/>"
+ " </Othr>"
+ " </OrgId>"
+ " </Id>"
+ "</Dbtr>";
while (true) {
String repl = xml.replaceAll("<(\\w+)>\\s*</\\1>|<\\w+/>", "");
if (repl.length() == xml.length())
break;
xml = repl;
}
System.out.println(xml);
// -> <Dbtr> <Nm>John doe</Nm> </Dbtr>
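If a regex proves too fragile (for example with attributes or namespace prefixes), a DOM-based pass is one alternative to consider. The following is a minimal sketch using only the JDK's XML APIs. The class name EmptyNodePruner and its methods are hypothetical, and it assumes that an element counts as "empty" when it has no attributes, no remaining child elements, and only blank text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EmptyNodePruner {

    // Depth-first: prune each child's subtree first, then drop the child
    // itself if nothing meaningful is left inside it.
    static void prune(Element parent) {
        Node child = parent.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child instanceof Element) {
                Element e = (Element) child;
                prune(e);
                boolean hasChildElements = e.getElementsByTagName("*").getLength() > 0;
                boolean blankText = e.getTextContent().trim().isEmpty();
                if (!hasChildElements && !e.hasAttributes() && blankText) {
                    parent.removeChild(e);
                }
            }
            child = next;
        }
    }

    // Parses the XML, prunes empty elements below the root, and serializes it back.
    static String clean(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        prune(doc.getDocumentElement());

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Dbtr><Nm>John doe</Nm><Id><OrgId><Othr><Id/></Othr></OrgId></Id></Dbtr>";
        System.out.println(clean(xml));
    }
}
```

The root element is never removed in this sketch, and whitespace-only text between elements is left as it was.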

How to fetch an URL containing non-ASCII characters (ą, ś ...) with Jsoup?

I am using jsoup to parse some Polish sites, but I have a problem with special characters like "ą" and "ś" in the URL (!). For example, example.com/kąt is read as example.com/k
Every query without these special characters works perfectly.
I have tried Document doc = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url), but it does not work.
Any other tips?
You want to encode your URL before passing it to Jsoup.
SAMPLE CODE
String url = "http://sjp.pl/maść";
System.out.println("BEFORE " + url);
String encodedURL = URI.create(url).toASCIIString();
System.out.println("AFTER " + encodedURL);
System.out.println("Title: " + Jsoup.connect(encodedURL).get().title());
OUTPUT
BEFORE http://sjp.pl/maść
AFTER http://sjp.pl/ma%C5%9B%C4%87
Title: maść - Słownik SJP
French locale
Jsoup 1.8.3
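To check the encoding step on its own, without a network call, the same URI.create(...).toASCIIString() call can be applied to the URL from the question. This is only a sketch: the class name AsciiUrl and the helper toAsciiUrl are made up, the http:// scheme is an assumption (the question gives the address without one), and the Polish letter is written as a \u escape so the example does not depend on the source file's encoding.

```java
import java.net.URI;

public class AsciiUrl {

    // Percent-encodes non-ASCII characters as UTF-8 bytes, leaving the
    // ASCII parts of the URL untouched.
    static String toAsciiUrl(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        // "\u0105" is the letter ą
        System.out.println(toAsciiUrl("http://example.com/k\u0105t"));
        // prints http://example.com/k%C4%85t
    }
}
```

Note that URI.create throws IllegalArgumentException for characters that are illegal in a URI, such as a literal space, so input from users may need extra handling.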

Using WebView.Load() to read dynamic HTML with references to JS libraries?

I'm trying to write a class that will return a string of HTML with Javascript for use in a WebView. Returning HTML and Javascript works well, but I'm having issues loading libraries with the returned HTML. As an example, this works well:
public static String helloWorld() {
String html;
html = "<!DOCTYPE html>" +
"<html>" +
"<head>" +
"<title>" +
"</title>" +
"</head>" +
"<body>" +
"<script>" +
"document.write(\"Hello, World\");" +
"</script>" +
"</body>" +
"</html>";
return html;
}
However, this does not:
public static String helloWorld() {
String html;
html = "<!DOCTYPE html>" +
"<html>" +
"<head>" +
"<title>" +
"</title>" +
"<script src=\"../assets/src/external_script.js\"></script>" +
"</head>" +
"<body>" +
"<script>" +
"external_script_function();" +
"</script>" +
"</body>" +
"</html>";
return html;
}
I'm guessing that the filepath for the external_script.js import is incorrect? Using WebView.LoadUrl(myLocalHtmlFile); works when using the other JS file. How would I go about making this work properly? Or alternatively, is there a better way to achieve similar results?
You need to feed the assets folder to the WebView as the base URL. Like this:
MyWebView.loadDataWithBaseURL("file:///android_asset/", helloWorld(), "text/html", "UTF-8", null);
And then the <script> tag would go like this:
<script src="src/external_script.js"></script>
Assuming the library resides under assets/src. And yes, you need to enable JavaScript like triad is suggesting.
Since you write "assets", I suppose you are trying to load a file that you added in Eclipse (or whatever IDE you use) and that is bundled with your app. It gets stored inside your APK, and AFAIK it does not get "unzipped" to the working directory of your app.
This means that you will either have to place it manually on the SD card or in internal memory, or load it internally and dynamically inject it into the final HTML before you send it to your WebView.
Personally, I would go for the second method because it makes less of a mess.
