Java STX CDATA parsing - java

I am trying to anonymize an XML export of Confluence.
I found their export cleaner jar:
https://confluence.atlassian.com/doc/content-anonymizer-for-data-backups-134795.html
I have modified clean.stx to remove all users like this:
<stx:template match="object[@class='ConfluenceUserImpl']/property[@name='name']/text() | object[@class='ConfluenceUserImpl']/property[@name='lowerName']/text() | object[@class='ConfluenceUserImpl']/id[@name='key']/text() | property[@class='ConfluenceUserImpl']/id[@name='key']/text()">
<stx:value-of select="translate(., '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')"/>
</stx:template>
I also need to modify the CDATA, using a regex or similar, to remove user mentions in the body of a Confluence page.
The CDATA looks like this, for example:
<property name="body">
<![CDATA[
<p>
<ac:link>
<ri:user ri:userkey="8a8300716489cc7d016489ce009a0000" />
</ac:link>
</p>
]]>
</property>
Here I only need to replace the value of ri:userkey with xxx or similar.
How can I do this?

Never mind,
I now use Joost, the Java implementation of STX, in a version newer than the one Atlassian ships in their jar:
http://joost.sourceforge.net/
I can use replace() here and use stx:cdata to disable escaping:
<stx:template match="property[@name='body']/cdata()">
<stx:cdata>
<stx:value-of select="replace(., '(ri:userkey=).*?\s', '$1&quot;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&quot; ')" />
</stx:cdata>
</stx:template>
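For anyone who wants to try the replacement outside of STX first, here is a minimal sketch of the same idea in plain Java. It is not part of the Atlassian tool or of Joost; the class and method names are made up for illustration, and the pattern is a slight variation that matches the quoted value instead of relying on trailing whitespace.

```java
import java.util.regex.Pattern;

class UserKeyMasker {
    // Matches ri:userkey="..." and captures the attribute name plus '=' as group 1.
    private static final Pattern USER_KEY = Pattern.compile("(ri:userkey=)\"[^\"]*\"");

    /** Replaces every ri:userkey value in the given body with 32 'x' characters. */
    static String mask(String body) {
        String replacement = "$1\"" + "x".repeat(32) + "\"";
        return USER_KEY.matcher(body).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String body = "<ac:link><ri:user ri:userkey=\"8a8300716489cc7d016489ce009a0000\" /></ac:link>";
        System.out.println(mask(body));
    }
}
```

Matching on the quotes keeps the rest of the tag untouched even when the attribute is immediately followed by "/>" with no space.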

Related

Search query in XML file from java

My project manager told me to move all the queries into an XML file (he even wrote it for me). When the user selects the description "Flusso VLT mensile" in the JSP, they can click search, update or download. The download works now, but I still need to get the filename. He told me to work with JAXB, but I don't think that is necessary.
<flow-monitor>
<menu1>
<item id="7" type="simple">
<connection name="VALSAP" />
<description value="Flusso VLT mensile" />
<filename value="flussoVltmensile" />
<select><![CDATA[
SELECT * FROM vlt_sap WHERE stato=7
]]>
</select>
<update>
<![CDATA[update vlt_sap set stato = 0 where stato =7]]>
</update>
</item>
<item id="11" type="simple">
<connection name="VALSAP" />
<description value="Flusso REPNORM BERSANI" />
<filename value="flussoRepnormBersani" />
<select><![CDATA[
select * from repnorm_bersani_sap where stato = 99
]]>
</select>
<update>
<![CDATA[update repnorm_bersani_sap set stato=0 where stato = 99]]>
</update>
</item>
</menu1>
</flow-monitor>
In Java I should read this XML and, depending on the <description value="..."> the user selected, execute the query inside the matching item. Is there a way to read these values easily without writing a lot of if statements?
Does anybody know a good and easy way to achieve this?
Thanks
There are a few ways to read the XML file and extract the information you need without using a lot of if statements. One approach is to use an XML parsing library such as JAXB or SAX, and create Java classes to represent the XML elements.
In JAXB, you can use the javax.xml.bind.Unmarshaller class to unmarshal the XML file into a Java object, which you can then traverse to extract the information you need.
Start by creating Java classes based on the XML structure, such as FlowMonitor, Menu1, Item and Connection, and use annotations to map the XML elements to the fields.
Then, you can use the unmarshaller.unmarshal() method to parse the XML file and create an instance of the FlowMonitor class, which will contain all the information from the XML file.
Once you have the FlowMonitor object you can loop through the items, and get the description and filename by calling getDescriptionValue() and getFilenameValue() of the item object....
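If mapping classes feel like too much for a lookup this small, an alternative to JAXB is the XPath API that ships with the JDK. The following is only a sketch of that swapped-in technique, not the JAXB approach described above; FlowQueryLookup and its method names are hypothetical, and it assumes the XML is already available as a string.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FlowQueryLookup {
    // Evaluates a path relative to the <item> whose <description value="..."> matches.
    private static String eval(String xml, String description, String path) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The expression only uses $desc, so the resolver returns the description for any name.
        xpath.setXPathVariableResolver(name -> description);
        try {
            return xpath.evaluate("//item[description/@value=$desc]/" + path,
                    new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate XPath for " + path, e);
        }
    }

    /** Returns the SQL inside <select> for the given description, or "" if none matches. */
    static String selectQuery(String xml, String description) {
        return eval(xml, description, "select");
    }

    /** Returns the value attribute of <filename> for the given description. */
    static String fileName(String xml, String description) {
        return eval(xml, description, "filename/@value");
    }

    public static void main(String[] args) {
        String xml = "<flow-monitor><menu1><item id=\"7\" type=\"simple\">"
                + "<description value=\"Flusso VLT mensile\"/>"
                + "<filename value=\"flussoVltmensile\"/>"
                + "<select><![CDATA[\nSELECT * FROM vlt_sap WHERE stato=7\n]]></select>"
                + "</item></menu1></flow-monitor>";
        System.out.println(selectQuery(xml, "Flusso VLT mensile"));
        System.out.println(fileName(xml, "Flusso VLT mensile"));
    }
}
```

The description is passed as an XPath variable rather than concatenated into the expression, so quotes in the description cannot break the query.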

Using regex in web harvest xml

I'm using Web-Harvest to scrape an e-commerce site. I'm iterating over the search pages and writing each product's details to the output XML. Now I want to apply a regular expression to the anchor (a) tag while scraping, to extract a particular string, i.e.,
let $linktoprod :=data($item//a[@class="fk-anchor-link"]/@href)
The line above returns the href value of each product's anchor tag. For the first product the value returned is,
/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa
Now I want to use a regular expression like /([^/\?]+)\? to get the string between the last / and the ?, i.e.,
itmdaqmvzyy23hz5
in the output xml.
If anyone has an idea how to do this, please help.
Thank you.
Updated -
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<function name="download-multipage-list">
<return>
<while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
<empty>
<var-def name="content">
<html-to-xml>
<http url="${pageUrl}"/>
</html-to-xml>
</var-def>
<var-def name="nextLinkUrl">
<xpath expression="${nextXPath}">
<var name="content"/>
</xpath>
</var-def>
<var-def name="pageUrl">
<template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
</var-def>
</empty>
<xpath expression="${itemXPath}">
<var name="content"/>
</xpath>
</while>
</return>
</function>
<var-def name="products">
<call name="download-multipage-list">
<call-param name="pageUrl">http://www.flipkart.com/watches/pr?sid=reh%2Cr18</call-param>
<call-param name="nextXPath">//a[starts-with(., 'Next')]/@href</call-param>
<call-param name="itemXPath">//div[@class="product browse-product "]</call-param>
<call-param name="pids"></call-param>
<call-param name="maxloops">5</call-param>
</call>
</var-def>
<var-def name="scrappedContent">
<!-- iterates over all collected products and extract desired data -->
<![CDATA[ <catalog> ]]>
<loop item="item" index="i">
<list><var name="products"/></list>
<body>
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA[
declare variable $item as node() external;
let $linktoprod :=data($item//a[@class="fk-anchor-link"]/@href)
let $name := data($item//div[#class="title"])
return
<product>
<link>{$linktoprod}</link>
<title>{normalize-space($name)}</title>
</product>
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </catalog> ]]>
</var-def>
</config>
My config XML is shown above. Where should the regexp block go in it? I want the regexp to be applied to
linktoprod, and the regexp result to end up in the link tag of the output XML. Any guidance is appreciated.
Thank you.
I don't know about web harvest, but if it supports a non greedy quantifier, you can use this pattern
/([^/]+?)\?
According to Web Harvest User manual - regexp you must insert something like this
<regexp>
<regexp-pattern>/([^/]+?)\?</regexp-pattern>
<regexp-source>
/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa
</regexp-source>
<regexp-result>
<template>Last URL part is "${_1}"</template>
</regexp-result>
</regexp>
In the <regexp-source> part you must insert your URL or variable to search for. Guessing from the manual and your config xml it might be something like
<regexp-source>
<var>scrappedContent</var>
</regexp-source>
or
<regexp-source>
${linktoprod}
</regexp-source>
I think you must experiment a bit.
Try this regex:
/([^/]+)\?
You might need to strip the leading / and trailing ?.
To illustrate that the regex works, this is its result in JavaScript:
var s = "/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa"
console.log(s.match(/\/([^/]+)\?/g)); // /itmdaqmvzyy23hz5?
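To avoid stripping the leading / and trailing ? by hand, read capture group 1 instead of the whole match. Here is a small sketch of that in plain Java (not Web-Harvest configuration); the class and method names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductIdExtractor {
    // A '/', then one or more characters that are neither '/' nor '?', then a literal '?'.
    private static final Pattern LAST_SEGMENT = Pattern.compile("/([^/?]+)\\?");

    /** Returns the path segment directly before the query string, if there is one. */
    static Optional<String> extract(String href) {
        Matcher m = LAST_SEGMENT.matcher(href);
        // group(1) is only the part inside the parentheses, without '/' and '?'.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String href = "/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5"
                + "?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa";
        System.out.println(extract(href).orElse("(no match)")); // itmdaqmvzyy23hz5
    }
}
```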

Doxygen doesn't parse tags normally. It parses tags like <Code>,<value> and \s\p into <computeroutput></computeroutput>

I am trying to generate XML using Doxygen from Java source code. Doxygen doesn't handle tags like
<code>, <value> and \s\p correctly. It generates XML with incorrect values.
For example:
<code>0x0</code> tag is converted into <computeroutput>0x0</computeroutput>.
<para>
<computeroutput>This is code tag</computeroutput>
<value2>test value4</value2> </meta> </meta> <gid>000001</gid> <read>1</read>
</parameter> </component> </algebra>
</para>
The same happens for other tags like <value> and \s\p.
Why does this happen?
Please let me know which other tags produce the same output
and how to resolve it.
"correctly" is a bit of a misnomer when referring to xml, unless it weren't structured correctly, but I think you're referring to the tags.
If you don't like the output from doxygen why not write an xslt to make it whatever you want? I'm sure there are many doxygen.xml --> myflavor.xml transforms out there that you could use as a starting point.

How to put a newline in ant property

This doesn't seem to work:
<property name="foo" value="\n bar \n"/>
I use the property value in the body of an e-mail message (which is sent as plain text):
<mail ...>
<message>some text${foo}</message>
and I get literal "\n" in the e-mail output.
These all work for me:
<property name="foo" value="bar${line.separator}bazz"/>
<property name="foo">bar
bazz2</property>
<property name="foo" value="bar
bazz"/>
You want ${line.separator}. See this post for an example. Also, the Ant echo task manual page has an example using ${line.separator}.
By using ${line.separator} you're simply using a Java system property. You can read up on the list of system properties here, and here is Ant's manual page on Properties.
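For reference, this is the same system property read from plain Java. It is only a small sketch, not Ant code, and the class and method names are made up for illustration.

```java
class LineSeparatorDemo {
    /** Joins two strings with the platform line separator, as ${line.separator} does in Ant. */
    static String join(String first, String second) {
        // "line.separator" is a standard Java system property ("\n" on Unix, "\r\n" on Windows).
        return first + System.getProperty("line.separator") + second;
    }

    public static void main(String[] args) {
        System.out.println(join("bar", "bazz"));
    }
}
```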

parsing XML that contain XML in elements, Can this be done

I have a 'complex item' that is in XML, and a 'workitem' (also in XML) that contains lots of other info. I would like the workitem to contain a string that holds the complex item's XML.
for example:
<inouts name="ClaimType" type="complex" value="<xml string here>"/>
However, with SAX and other Java parsers I cannot get it to process this line: the parser doesn't like the < or the " characters in the string. I have tried escaping, and converting the " to '.
Is there any way around this at all? Or will I have to come up with another solution?
Thanks
I think you'll find that the XML you're dealing with won't parse with most parsers because it isn't well-formed. If you have control over the XML, you'll at a bare minimum need to escape the attribute so it's something like:
<inouts name="ClaimType" type="complex" value="&lt;xml string here&gt;" />
Then, once you've extracted the attribute you can possibly re-parse it to treat it as XML.
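A minimal sketch of that two-step parse with the JDK's DOM parser is below. The class name, the method names and the sample claim/amount elements are assumptions for illustration only; the one thing relied on is that the parser unescapes &lt; and &gt; when it returns the attribute value.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NestedXmlAttribute {
    // Parses an XML string into a DOM document, wrapping checked exceptions.
    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Reads the escaped "value" attribute of the root element and parses it as XML. */
    static Document innerDocument(String outerXml) {
        // getAttribute returns the unescaped text, e.g. "<claim>...</claim>".
        String value = parse(outerXml).getDocumentElement().getAttribute("value");
        return parse(value);
    }

    public static void main(String[] args) {
        String outer = "<inouts name=\"ClaimType\" type=\"complex\" "
                + "value=\"&lt;claim&gt;&lt;amount&gt;42&lt;/amount&gt;&lt;/claim&gt;\"/>";
        Document inner = innerDocument(outer);
        System.out.println(inner.getDocumentElement().getNodeName());
    }
}
```

If the attribute value is not guaranteed to be well-formed, the second parse will throw, so callers may want to handle that case separately.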
Alternatively, you can take one of the approaches above (using CDATA sections) with some re-factoring of your XML.
If you don't have control over your XML, you could try using the TagSoup library to parse it to see how you go. (Disclaimer: I've only used TagSoup for HTML, I have no idea how it'd go with non-HTML content)
(The tag soup site actually appears down ATM, but you should be able to find enough doco on the web, and downloads via the maven repository)
Possibly the easiest solution would be to use a CDATA section. You could convert your example to look like this:
<inouts name="ClaimType" type="complex">
<![CDATA[
<xml string here>
]]>
</inouts>
If you have more than one attribute you want to store complex strings for, you could use multiple child elements with different names:
<inouts name="ClaimType" type="complex">
<value1>
<![CDATA[
<xml string here>
]]>
</value1>
<value2>
<![CDATA[
<xml string here>
]]>
</value2>
</inouts>
Or multiple value elements with an identifying id:
<inouts name="ClaimType" type="complex">
<value id="complexString1">
<![CDATA[
<xml string here>
]]>
</value>
<value id="complexString2">
<![CDATA[
<xml string here>
]]>
</value>
</inouts>
CDATA section or escaping
NB There is a big difference between escaping and encoding, which some other posters have referred to. Be careful of confusing the two.
I'm not sure how it works for attributes, and if escaping (< as &lt; and > as &gt;) does not work, then I don't know.
If it were an inner tag: you could use the Xml Any mechanism (never used it myself) or declare it in a CDATA section.
You are doing it wrong: http://www.doingitwrong.com/
If inouts/@value really is tree-structured (i.e. XML) then it shouldn't be an attribute, it should be a child element:
<inout name="ClaimType" type="complex">
<value>
<some-arbitrary>
<xml-stuff/>
</some-arbitrary>
</value>
</inout>
If it is not, in fact, guaranteed to be well-formed XML, but just sort of looks like it because you put some pointy brackets in it, then you should ask yourself if there isn't some better way to solve this problem. That failing, use <![CDATA[ as some have already suggested.
