Using regex in web harvest xml - java

I'm using Web-Harvest to scrape an e-commerce site. I iterate over the search pages and write each product's details to the output XML. Now I want to apply a regular expression to the href of the anchor (a) tag while scraping and extract a particular string, i.e.,
let $linktoprod := data($item//a[@class="fk-anchor-link"]/@href)
The line above returns the href value of each product's anchor tag. For the first product, the value returned is:
/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa
Now I want to use a regular expression like /([^/\?]+)\? to get the string between the last / and the ?, i.e.,
itmdaqmvzyy23hz5
in the output xml.
If anyone has an idea how to do this, please help.
Thank you.
Updated -
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<function name="download-multipage-list">
<return>
<while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
<empty>
<var-def name="content">
<html-to-xml>
<http url="${pageUrl}"/>
</html-to-xml>
</var-def>
<var-def name="nextLinkUrl">
<xpath expression="${nextXPath}">
<var name="content"/>
</xpath>
</var-def>
<var-def name="pageUrl">
<template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
</var-def>
</empty>
<xpath expression="${itemXPath}">
<var name="content"/>
</xpath>
</while>
</return>
</function>
<var-def name="products">
<call name="download-multipage-list">
<call-param name="pageUrl">http://www.flipkart.com/watches/pr?sid=reh%2Cr18</call-param>
<call-param name="nextXPath">//a[starts-with(., 'Next')]/@href</call-param>
<call-param name="itemXPath">//div[@class="product browse-product "]</call-param>
<call-param name="pids"></call-param>
<call-param name="maxloops">5</call-param>
</call>
</var-def>
<var-def name="scrappedContent">
<!-- iterates over all collected products and extract desired data -->
<![CDATA[ <catalog> ]]>
<loop item="item" index="i">
<list><var name="products"/></list>
<body>
<xquery>
<xq-param name="item" type="node()"><var name="item"/></xq-param>
<xq-expression><![CDATA[
declare variable $item as node() external;
let $linktoprod := data($item//a[@class="fk-anchor-link"]/@href)
let $name := data($item//div[@class="title"])
return
<product>
<link>{$linktoprod}</link>
<title>{normalize-space($name)}</title>
</product>
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </catalog> ]]>
</var-def>
</config>
My config XML is shown above. Where should I put the regexp block in it? I want the regexp to be applied to linktoprod, so that the regexp result ends up in the link tag of the output XML. Any guidance is appreciated.
Thank you.

I don't know Web-Harvest, but if it supports non-greedy quantifiers, you can use this pattern:
/([^/]+?)\?
According to the Web-Harvest user manual (regexp), you need to insert something like this:
<regexp>
<regexp-pattern>/([^/]+?)\?</regexp-pattern>
<regexp-source>
/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&amp;ref=6df83d8f-f61f-4648-b846-403938ae92fa
</regexp-source>
<regexp-result>
<template>Last URL part is "${_1}"</template>
</regexp-result>
</regexp>
In the <regexp-source> part you insert the URL or the variable to search. Guessing from the manual and your config XML, it might be something like
<regexp-source>
<var name="scrappedContent"/>
</regexp-source>
or
<regexp-source>
${linktoprod}
</regexp-source>
You will probably need to experiment a bit.

Try this regex:
/([^/]+)\?
The full match includes the leading / and the trailing ?, so you might need to strip them (or read capture group 1 instead).
To illustrate that the regex works, this is its result in JavaScript:
var s = "/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa";
console.log(s.match(/\/([^/]+)\?/g)); // ["/itmdaqmvzyy23hz5?"]
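Since the question is tagged Java, the same extraction can be sketched with java.util.regex, outside of Web-Harvest. This is only a minimal illustration: the class and method names (ProductIdExtractor, extract) are hypothetical, and the pattern is the asker's own with the capture group read directly, so nothing has to be stripped afterwards.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Web-Harvest.
final class ProductIdExtractor {
    // Captures the path segment that is immediately followed by '?'.
    private static final Pattern LAST_SEGMENT = Pattern.compile("/([^/?]+)\\?");

    static Optional<String> extract(String href) {
        Matcher m = LAST_SEGMENT.matcher(href);
        // group(1) is the text between the '/' and the '?', without either delimiter.
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String href = "/casio-sheen-analog-watch-women/p/itmdaqmvzyy23hz5"
                + "?pid=WATDAQMVVNQEM9CX&ref=6df83d8f-f61f-4648-b846-403938ae92fa";
        System.out.println(extract(href).orElse("(no match)")); // prints itmdaqmvzyy23hz5
    }
}
```

Returning an Optional makes the "no query string" case explicit instead of yielding null.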

Related

How to clean URL for the XSS issue before using it [duplicate]

I'm totally stuck on this, I'm trying to escape a single quote in a JSP. I have some data that I'm outputting directly into a JS string and the single quotes seem to be causing issues.
Here is my code:
<dsp:droplet name="/atg/dynamo/droplet/ForEach">
<dsp:param value="${CommerceItems}" name="array" />
<dsp:param name="elementName" value="CommerceItem" />
<dsp:oparam name="outputStart">
var itemNameList ='
</dsp:oparam>
<dsp:oparam name="output">
<dsp:getvalueof id="Desc" param="CommerceItem.auxiliaryData.productRef.displayName">
${fn:replace(Desc, "'", "\\/'")}
</dsp:getvalueof>
</dsp:oparam>
<dsp:oparam name="outputEnd">';</dsp:oparam>
</dsp:droplet>
And here is the output that I'm getting:
var itemNameList ='
Weyland Estate Santa Barbara Pinot Noir
Raymond \/'Prodigal\/' North Coast Cabernet Sauvignon
Chateau Haute Tuque';
But this is wrong. I just need /'Prodigal'/ or no single quotes at all!
EDIT: Or do I actually need to escape the quotes with a \ (backslash)?
The forward slash is not an escape character. That's the backslash.
${fn:replace(Desc, "'", "\\'")}
(Yes, the backslash appears twice, because it is also an escape character in Java!)
However, you need to replace not only ' with \', but also \n (newlines) with \\n. The string is being printed across multiple lines, which also makes it an invalid JS string literal. Your final result must basically look like this:
var itemNameList = ''
+ '\nWeyland Estate Santa Barbara Pinot Noir'
+ '\nRaymond \'Prodigal\' North Coast Cabernet Sauvignon'
+ '\nChateau Haute Tuque';
(Please note that the syntax highlighter agrees with my version here, but not with yours.)
There are, however, many more special characters that need to be escaped. They are all covered by Apache Commons Lang StringEscapeUtils#escapeEcmaScript(). The easiest approach is to create a custom EL function that calls exactly that method. If you have not done so yet, download the Commons Lang 3 JAR (commons-lang3) and drop it in /WEB-INF/lib. Then create a /WEB-INF/functions.tld file as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<taglib
xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-jsptaglibrary_2_1.xsd"
version="2.1">
<display-name>Custom Functions</display-name>
<tlib-version>1.0</tlib-version>
<uri>http://example.com/functions</uri>
<function>
<name>escapeJS</name>
<function-class>org.apache.commons.lang3.StringEscapeUtils</function-class>
<function-signature>java.lang.String escapeEcmaScript(java.lang.String)</function-signature>
</function>
</taglib>
So that you can use it as follows:
<%@taglib prefix="util" uri="http://example.com/functions" %>
...
${util:escapeJS(Desc)}
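To make the escaping rules described above concrete, here is a small plain-JDK sketch of what such an escaper has to do for quotes, backslashes, slashes and line breaks. It is an illustration only, not the Commons Lang implementation, and the class name JsStringEscaper is hypothetical; in a real application the library method is the safer choice.

```java
// Hypothetical helper illustrating the escaping rules; not the Commons Lang code.
final class JsStringEscaper {
    /** Escapes a value so it can be placed inside a quoted JavaScript string literal. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\'': out.append("\\'"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '/':  out.append("\\/"); break; // avoids a literal "</script>" in the output
                default:
                    if (c < 0x20 || c == 0x2028 || c == 0x2029) {
                        // Remaining control characters and JS line separators become hex escapes.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Raymond \'Prodigal\' North Coast Cabernet Sauvignon
        System.out.println(escape("Raymond 'Prodigal' North Coast Cabernet Sauvignon"));
    }
}
```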

XmlUnit ignore order of elements when comparing XML files

I have the two following xml files:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pricebooks xmlns="http://www.blablabla.com">
<pricebook>
<header pricebook-id="my-id">
<currency>GBP</currency>
<display-name xml:lang="x-default">display name</display-name>
<description>my description 1</description>
</header>
<price-tables>
<price-table product-id="id1" mode="mode1">
<amount quantity="1">30.00</amount>
</price-table>
<price-table product-id="id2" mode="mode2">
<amount quantity="1">60.00</amount>
</price-table>
</price-tables>
</pricebook>
</pricebooks>
and
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pricebooks xmlns="http://www.blablabla.com">
<pricebook>
<header pricebook-id="my-id">
<currency>GBP</currency>
<display-name xml:lang="x-default">display name</display-name>
<description>my description 1</description>
</header>
<price-tables>
<price-table product-id="id2" mode="mode2">
<amount quantity="1">60.00</amount>
</price-table>
<price-table product-id="id1" mode="mode1">
<amount quantity="1">30.00</amount>
</price-table>
</price-tables>
</pricebook>
</pricebooks>
I'm trying to compare them while ignoring the order of the price-table elements, so for my purposes the two are equal. I'm using
<dependency>
<groupId>org.xmlunit</groupId>
<artifactId>xmlunit-core</artifactId>
<version>2.5.0</version>
</dependency>
with the following code, but I can't make it work. It complains that the attribute values id1 and id2 are different.
Diff myDiffSimilar = DiffBuilder
.compare(expected)
.withTest(actual)
.checkForSimilar()
.ignoreWhitespace()
.ignoreComments()
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.byNameAndText))
.build();
assertFalse(myDiffSimilar.hasDifferences());
I have also tried to change the node matcher as follows:
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.conditionalBuilder()
.whenElementIsNamed("price-tables")
.thenUse(ElementSelectors.byXPath("./price-table", ElementSelectors.byNameAndText))
.elseUse(ElementSelectors.byName)
.build()))
Thank you for your help.
I don't see any nested text inside your price-table elements at all, so byNameAndText matches on element names only. Those are the same for all price-table elements, so it does not do what you want.
In your example there is no ambiguity for price-tables, as there is only one anyway, so the byXPath approach looks wrong. At least in your snippet, XMLUnit should do fine with byName for everything except the price-table elements.
I'm not sure whether product-id alone identifies your price-table elements or whether it is the combination of all attributes. Either byNameAndAttributes("product-id") or its byNameAndAllAttributes cousin should work.
If it is only product-id, then byNameAndAttributes("product-id") becomes byName for all elements that don't have a product-id attribute at all. In this special case byNameAndAttributes("product-id") alone will work for your whole document as far as we can see it, more or less by accident.
If you need more complex rules for elements other than price-table, or you want to make things more explicit, then
ElementSelectors.conditionalBuilder()
.whenElementIsNamed("price-table")
.thenUse(ElementSelectors.byNameAndAttributes("product-id"))
// more special cases
.elseUse(ElementSelectors.byName)
.build()
is the better choice.
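To see why keying on product-id removes the ordering problem, the following plain-JDK sketch pairs price-table elements by that attribute instead of by position. It is not XMLUnit and not how XMLUnit is implemented; the class name PriceTableComparer is hypothetical, and for brevity it only compares the mode attribute and the trimmed text of each price-table.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of matching elements by attribute rather than by position.
final class PriceTableComparer {
    /** Maps each price-table's product-id to its mode and trimmed text content. */
    static Map<String, String> index(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches price-table elements regardless of their namespace.
            NodeList tables = doc.getElementsByTagNameNS("*", "price-table");
            Map<String, String> byId = new TreeMap<>();
            for (int i = 0; i < tables.getLength(); i++) {
                Element table = (Element) tables.item(i);
                byId.put(table.getAttribute("product-id"),
                        table.getAttribute("mode") + "|" + table.getTextContent().trim());
            }
            return byId;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** True when both documents contain the same price-tables, in any order. */
    static boolean sameIgnoringOrder(String expected, String actual) {
        return index(expected).equals(index(actual));
    }
}
```

Because the entries are keyed by product-id, swapping the two price-table elements leaves the maps equal, while a changed amount still shows up as a difference.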

Java STX CDATA parsing

I am trying to anonymize an XML export of Confluence.
I found their export cleaner JAR:
https://confluence.atlassian.com/doc/content-anonymizer-for-data-backups-134795.html
I have modified clean.stx to remove all users like this:
<stx:template match="object[@class='ConfluenceUserImpl']/property[@name='name']/text() | object[@class='ConfluenceUserImpl']/property[@name='lowerName']/text() | object[@class='ConfluenceUserImpl']/id[@name='key']/text() | property[@class='ConfluenceUserImpl']/id[@name='key']/text()">
<stx:value-of select="translate(., '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')"/>
</stx:template>
I also need to modify the CDATA, using a regex or something similar, in order to remove user mentions from the body of a Confluence page.
The CDATA looks like this, for example:
<property name="body">
<![CDATA[
<p>
<ac:link>
<ri:user ri:userkey="8a8300716489cc7d016489ce009a0000" />
</ac:link>
</p>
]]>
</property>
Here I only need to replace the value of ri:userkey with xxx or similar.
How can I do this?
Never mind.
I now use Joost, the Java implementation of STX, which is newer than the one Atlassian uses in their JAR:
http://joost.sourceforge.net/
There I can use replace(), and stx:cdata to disable escaping:
<stx:template match="property[@name='body']/cdata()">
<stx:cdata>
<stx:value-of select="replace(., '(ri:userkey=).*?\s', '$1&quot;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&quot; ')" />
</stx:cdata>
</stx:template>
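For readers who want to try the replacement outside of STX, the same idea can be sketched with plain java.util.regex. The class name UserKeyAnonymizer is hypothetical, and the pattern here is a slightly stricter variant of the one above: it matches the quoted attribute value itself rather than everything up to the next whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical helper; mirrors the STX replace() above with a stricter pattern.
final class UserKeyAnonymizer {
    // Group 1 keeps 'ri:userkey='; the quoted value after it is what gets replaced.
    private static final Pattern USER_KEY = Pattern.compile("(ri:userkey=)\"[^\"]*\"");

    static String anonymize(String body) {
        return USER_KEY.matcher(body).replaceAll("$1\"xxx\"");
    }

    public static void main(String[] args) {
        String body = "<ac:link><ri:user ri:userkey=\"8a8300716489cc7d016489ce009a0000\" /></ac:link>";
        System.out.println(anonymize(body)); // prints <ac:link><ri:user ri:userkey="xxx" /></ac:link>
    }
}
```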

How to use xpath in camel when the outermost element has an xmlns attribute?

I am having some trouble using XPath to extract the "Payload" children below with Apache Camel. I use the XPath below in my route for both example documents. The first example returns SomeElement and SomeOtherElement as expected, but for the second one the expression does not seem to match anything at all.
xpath("//Payload/*")
This example xml parses just fine.
<Message>
<Payload>
<SomeElement />
<SomeOtherElement />
</Payload>
</Message>
This example xml does not parse.
<Message xmlns="http://www.fake.com/Message/1">
<Payload>
<SomeElement />
<SomeOtherElement />
</Payload>
</Message>
I found a similar question about XML and XPath, but it deals with C# and is not a Camel solution.
Any idea how to solve this using Apache Camel?
Your second example XML specifies a default namespace, xmlns="http://www.fake.com/Message/1", so your XPath expression will not match, as it specifies no namespace.
See http://camel.apache.org/xpath.html#XPath-Namespaces on how to specify a namespace.
You would need something like
Namespaces ns = new Namespaces("fk", "http://www.fake.com/Message/1");
xpath("//fk:Payload/*", ns)
I'm not familiar with Apache Camel; this was just the result of some quick googling.
An alternative may be to change your XPath to something like
xpath("//*[local-name()='Payload']/*")
Good luck.
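The namespace behaviour described above is not specific to Camel and can be reproduced with the JDK's own javax.xml.xpath API. The sketch below is only an illustration under that assumption: the class name PayloadXPathDemo and the helper count are hypothetical, and the prefix fk is bound to the namespace by hand through a NamespaceContext.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses only JDK APIs, not Camel.
final class PayloadXPathDemo {
    static final String NS = "http://www.fake.com/Message/1";

    /** Counts the nodes selected by the expression, with prefix "fk" bound to NS. */
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespaces are ignored by the parser
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "fk".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return NS.equals(uri) ? "fk" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return NS.equals(uri)
                            ? Collections.singleton("fk").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Message xmlns=\"" + NS + "\"><Payload>"
                + "<SomeElement/><SomeOtherElement/></Payload></Message>";
        System.out.println(count(xml, "//Payload/*"));                   // unprefixed name: no match
        System.out.println(count(xml, "//fk:Payload/*"));                // prefixed name: both children
        System.out.println(count(xml, "//*[local-name()='Payload']/*")); // namespace-agnostic: both children
    }
}
```

The prefixed form is usually preferable, because local-name() also matches Payload elements from any other namespace.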

How to add a prefix for the attribute while marshalling

I would like to add a prefix to an attribute while marshalling with Castor.
I would like to get a result like the one below:
<ThesaurusConcept dc:identifier="C268">
<ScopeNote xml:lang="en">
<LexicalValue>index heading is Atomic absorption spectroscopy</LexicalValue>
</ScopeNote>
</ThesaurusConcept>
but I am getting
<ThesaurusConcept identifier="C621">
<ScopeNote lang="en">
<LexicalValue>index heading is Atomic absorption spectroscopy</LexicalValue>
</ScopeNote>
</ThesaurusConcept>
I found the answer to my question.
We need to add the following to the mapping.xml file:
<mapping xmlns:dc="http://purl.org/dc/elements/1.1/">
<bind-xml name="dc:identifier" node="attribute" ></bind-xml>
We also need to set the namespace mapping with the following code:
Marshaller casreactmp = new Marshaller(handler);
casreactmp.setNamespaceMapping("dc", "http://purl.org/dc/elements/1.1/");
