Java regular expression to remove empty XML nodes and their children completely


I am struggling to find the best solution. Below is my XML:
<Dbtr>
<Nm>John doe</Nm>
<Id>
<OrgId>
<Othr>
<Id/>
</Othr>
</OrgId>
</Id>
</Dbtr>
This should be replaced with the following:
<Dbtr>
<Nm>John doe</Nm>
</Dbtr>
So all empty nodes, along with any children that carry no values, should be left out.
I am using the following expression, but it does not do what I want:
docStr = docStr.replaceAll("<(\\w+)></\\1>|<\\w+/>", "");
Any help would be really appreciated.
Edit:
I am creating this XML (not parsing it), and it will be sent to a clearing house, which will reject the message because of these empty tags. How the XML is created is out of my hands: I only provide the values from the database, and as you can see some of them are empty. The code that writes the XML (which I have no control over) writes out the tag first and then the value; all I can control is not writing "null".
My best bet for now is to take the output XML as it is and apply some regex logic to produce XML without empty tags that can pass schema validation.

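String xml = ""
        + "<Dbtr>"
        + " <Nm>John doe</Nm>"
        + " <Id>"
        + " <OrgId>"
        + " <Othr>"
        + " <Id/>"
        + " </Othr>"
        + " </OrgId>"
        + " </Id>"
        + "</Dbtr>";
// A single pass only removes the innermost empty elements, which leaves
// their parents empty in turn. Repeat until a pass changes nothing.
while (true) {
    // Matches <tag></tag> (optionally containing only whitespace) or <tag/>.
    String repl = xml.replaceAll("<(\\w+)>\\s*</\\1>|<\\w+/>", "");
    if (repl.equals(xml))
        break;
    xml = repl;
}
System.out.println(xml);
// -> <Dbtr> <Nm>John doe</Nm> </Dbtr>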
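The regex loop above only recognises tags without attributes or namespace prefixes. If the generated XML can be parsed after the fact, a DOM-based pass is a possible alternative. The following is only a sketch using the JDK's built-in XML APIs; the class name EmptyElementPruner and its methods are made up for illustration, and it assumes that an element with attributes should be kept even when it has no content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; names are illustrative and not from the original post.
class EmptyElementPruner {

    // Removes every descendant element that ends up with no attributes,
    // no child elements and no non-blank text.
    // Returns true if 'node' itself is empty after pruning.
    static boolean prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE && prune(child)) {
                node.removeChild(child);
            }
            child = next;
        }
        if (node.hasAttributes()) {
            return false; // assumption: attributes count as content
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.ELEMENT_NODE) {
                return false;
            }
            if ((type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE)
                    && !c.getNodeValue().trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Parses the XML string, prunes empty elements and serialises it again.
    static String clean(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement()); // the root element itself is always kept
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not clean XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Dbtr><Nm>John doe</Nm><Id><OrgId><Othr><Id/></Othr></OrgId></Id></Dbtr>";
        System.out.println(clean(xml));
    }
}
```

Whitespace-only text nodes between elements are left in place, so indentation from the original document may remain where an element was removed.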

Related

XPath parsing failing with dom4j for text function

My input XML is:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
        "<disks-array>\n" +
        "<array-item>\n" +
        " <value>\n" +
        "<scsi>\n" +
        "<bus>0</bus>\n" +
        "<unit>0</unit>\n" +
        "</scsi>\n" +
        "<backing>\n" +
        "<vmdk_file>[909_TCUP_02] u999orcat017t/u999orcat017t.vmdk</vmdk_file>\n" +
        "<type>VMDK_FILE</type>\n" +
        "</backing>\n" +
        "<label>Hard disk 1</label>\n" +
        "<type>SCSI</type>\n" +
        "<capacity>107374182400</capacity>\n" +
        "</value>\n" +
        "<key>2000</key>\n" +
        "</array-item>\n" +
        "</disks-array>";
and the XPath filter is
"//array-item[contains(./value/backing/vmdk_file/text(),'u999orcat017t/u999orcat017t.vmdk')]"
Here is my parsing and filtering code
Document doc = DocumentHelper.parseText(xml);
XPath xp = DocumentHelper.createXPath(xpathQuery);
// evaluate the xpath
Object xpResult = xp.evaluate(doc);
Ideally it should return the array items whose value/backing/vmdk_file text contains the given text. However, it gives me an empty string.
I am using dom4j 1.6.1 and jaxen 1.1.1.
What is going wrong?

After debugging for many hours, I finally figured out the root cause of the incorrect result: the text value is split into multiple text nodes instead of being a single node.
This turns out to be a bug in the dom4j library that is still open:
https://github.com/dom4j/dom4j/issues/21
The fix is to call document.normalize() to merge the adjacent text nodes.
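To illustrate why split text nodes matter: in XPath 1.0, contains() converts a node-set argument to the string value of its first node only, so text() sees just the first fragment when the text is split. The sketch below shows the splitting and the effect of normalize() using the JDK's own org.w3c.dom API rather than dom4j, so it only demonstrates the mechanism, not the dom4j bug itself. The class name SplitTextNodeDemo is made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class; uses the JDK DOM, not dom4j.
class SplitTextNodeDemo {

    // Builds a <vmdk_file> element whose text is deliberately split
    // across two adjacent text nodes.
    static Element buildSplitElement() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element file = doc.createElement("vmdk_file");
            doc.appendChild(file);
            file.appendChild(doc.createTextNode("[909_TCUP_02] u999orcat017t/"));
            file.appendChild(doc.createTextNode("u999orcat017t.vmdk"));
            return file;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element file = buildSplitElement();
        System.out.println(file.getChildNodes().getLength());    // 2
        System.out.println(file.getFirstChild().getNodeValue()); // [909_TCUP_02] u999orcat017t/

        file.normalize(); // merges adjacent text nodes into one

        System.out.println(file.getChildNodes().getLength());    // 1
        System.out.println(file.getFirstChild().getNodeValue()); // [909_TCUP_02] u999orcat017t/u999orcat017t.vmdk
    }
}
```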

How to remove space from excel reading from Java

I'm importing Excel data into a Java program. One column should have its whitespace removed in case the user mistypes a value.
Example value: "12341 "
I've used
replaceAll("\\s+", "");
replaceAll(" ", "");
StringUtils.trim(stringValue);
However, it still returns "12341 " with length 6. The unnecessary whitespace is not removed.
EDIT
Complete code for the replacement:
stringArray[x] = stringArray[x].replaceAll("\\s+", "");
stringArray[x] = stringArray[x].replaceAll(" ", "");
stringArray[x] = StringUtils.trim(stringArray[x]);

This should work:
stringValue = stringValue.replaceAll(" ", "");
You need to use the returned value.

I faced the same problem and resolved it by using
text = text.replaceAll("[^\\x00-\\x7F]", "");
Be aware that this also removes every other non-ASCII character, including any special characters you may want to keep.
Reference:
https://howtodoinjava.com/regex/java-clean-ascii-text-non-printable-chars/
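One possible explanation, which the question does not confirm, is that the trailing character is a non-breaking space (U+00A0) rather than an ordinary space. By default Java's \s does not match it, and StringUtils.trim only strips characters up to code point 32. A narrower fix than stripping all non-ASCII characters would be to also match Unicode separators. The class and method names below are made up for illustration.

```java
// Hypothetical helper; assumes the stray character is a Unicode separator such as U+00A0.
class WhitespaceCleaner {

    // Removes ASCII whitespace (\s) and all Unicode separator characters (\p{Z}),
    // which includes the non-breaking space.
    static String stripAllSpaces(String value) {
        return value.replaceAll("[\\s\\p{Z}]+", "");
    }

    public static void main(String[] args) {
        String cell = "12341\u00A0"; // made-up cell value ending in a non-breaking space

        System.out.println(cell.replaceAll("\\s+", "").length()); // 6
        System.out.println(stripAllSpaces(cell).length());        // 5
    }
}
```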

java / jsoup - retrieve language

I use jsoup to crawl content from specific websites.
For example, meta tags:
String meta_description = doc.select("meta[name=description]").first().attr("content");
I also need to crawl the language. This is what I do:
String meta_language = doc.select("http-equiv").first().attr("content");
But this is thrown:
java.lang.NullPointerException
Could anybody help me with this?
Greetings!

Try this (http-equiv is an attribute of the meta tag, not a value of its name attribute, so select on the attribute itself):
String meta_language = doc.select("meta[http-equiv=content-language]").get(0).attr("content");
System.out.println("Meta language : " + meta_language);
To read other meta content, such as the keyword list, you can use this:
//get meta keyword content
String keywords = doc.select("meta[name=keywords]").first().attr("content");
System.out.println("Meta keyword : " + keywords);

How can I adjust this regex to filter out "

I got the following regex working to search for video links in a page:
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately, it does not stop at the end of the link if another match follows right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after applying the regex, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!

Here is how you can do something similar with the jsoup parser.
// Needs: java.io.File, java.net.URL, java.util.Scanner, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z"); // read the whole file as one token
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of a page, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]"); // find all anchors with an href attribute
for (Element el : elements) {
    URL url = new URL(el.attr("href"));
    if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
        System.out.println("url: " + url);
        //System.out.println("file: " + url.getPath());
        System.out.println("file name: "
                + new File(url.getPath()).getName());
        System.out.println("------");
    }
}

I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b

If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href" followed by either a double-quote or single-quote character, then capture everything up to (and including) the next character which matches the starting quote character. Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.

Your regex is greedy. Limit its greediness:
(http(s?):/)(/[^/]+?)\\S+.\\.(?:avi|flv|mp4)
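For reference, here is how a quote-aware pattern along the lines of the answers above might be applied with java.util.regex. This is only a sketch: the class name VideoLinkFinder and the sample URLs are made up, and the character class additionally excludes whitespace, single quotes and angle brackets, which goes slightly beyond the pattern suggested earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and sample URLs are illustrative only.
class VideoLinkFinder {

    // Lazily matches up to the first video extension and never crosses
    // a quote, whitespace or angle bracket, so it stops at the end of the href.
    private static final Pattern VIDEO_URL =
            Pattern.compile("\\bhttps?://[^\\s\"'<>]+?\\.(?:avi|flv|mp4)\\b");

    static List<String> find(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = VIDEO_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(find(html)); // [http://somevideo.flv]
    }
}
```

Because the match is lazy, a URL such as .../clip.avi.mp4 would be cut off at .avi, so this is not a general-purpose URL matcher.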

Find anchors in string and wrap those with a header

Hi, I'm pretty new to Java and can't figure out a nice solution to my problem. I would like to add a header, in this case an H2, around the anchors in the following string.
Let's say li.getContent().toString() contains the following:
TEST<div class="description">lorem ipsum</div>
With the following code:
for (Element li : liS) {
out.append("<div class=\"result\">\n" +
"\t\t\t<ul class=\"resultset\">\t\n" +
"\t\t\t\t<li><div class=\"item\"> " +
li.getContent().toString() + "</div></li>\n" +
"\t\t\t</ul>\n" + "\t\t</div>");
}
What I want is for li.getContent().toString() to look like this, with the H2 header added:
<h2>TEST</h2><div class="description">lorem ipsum</div>
Is there some kind of wrap function with which I can find the anchor and wrap it in a header?
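One possible approach, sketched here with plain String.replaceAll, is to capture each anchor element and re-emit it inside an H2. The class name AnchorWrapper and the sample href are made up for illustration, and a regex like this only copes with simple, non-nested anchors; an HTML parser would be more robust for real pages.

```java
// Hypothetical helper; assumes simple, well-formed, non-nested <a> elements.
class AnchorWrapper {

    // Wraps every <a ...>...</a> in <h2>...</h2>.
    // (?i) ignores tag case, (?s) lets '.' span line breaks,
    // and \b keeps tags such as <abbr> from matching.
    static String wrapAnchors(String html) {
        return html.replaceAll("(?is)(<a\\b[^>]*>.*?</a>)", "<h2>$1</h2>");
    }

    public static void main(String[] args) {
        // The href value is invented for this example.
        String content = "<a href=\"#\">TEST</a><div class=\"description\">lorem ipsum</div>";
        System.out.println(wrapAnchors(content));
        // <h2><a href="#">TEST</a></h2><div class="description">lorem ipsum</div>
    }
}
```

In the loop from the question, the call would go around the content, for example wrapAnchors(li.getContent().toString()).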
