XPath string modification using regex, Java - java

Say I have an XPath string like /Results/Bill[Item[id]]/id. I need to add namespace information to the path, so that the path is transformed to this: /*:Results/*:Bill[*:Item[*:id]]/*:id.
I was thinking of using a regex to do this, something like "prepend "*:" to any alphanumeric character that is not preceded by another alphanumeric character". However, I don't have much regex knowledge and don't know what regex this would correspond to (I'm planning to use Java's replaceAll() function once I have the regex). Also, can anyone think of a counterexample where my idea wouldn't work? I'll only be performing the replacement on XPath strings with simple predicates (i.e. no and, or etc. between the square brackets).

You might get a regex solution to work with some kind of subset of XPath expressions, but you will never get it to work with all XPath expressions. The XPath grammar is just too complicated.
(The most obvious bugs in your initial proposal are that it fails on variable names like $var, function names like count(..) and axis names like parent::* or @code. You could solve that by checking for the relevant punctuation before or after the symbol. Checking for text inside comments or string literals is a bit trickier. But distinguishing "div" as an element name from "div" as an operator is way beyond what a regex approach can do: it needs a full context-sensitive parser.)
Better suggestion: use a tool that gives you a parse tree for the XPath expression, modify that parse tree, and then re-serialize the modified tree into XPath syntax.
See for example what can be done with Gunther Rademacher's Rex tool, or with the W3C XQuery parser applets (both easily found with google).
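For the restricted subset described in the question (simple paths and simple predicates), the "check the surrounding punctuation" idea can be sketched as follows. This is only an illustration, not a general XPath rewriter: the class and method names are made up for the example, and the pattern simply skips names preceded by $, @ or : and names followed by ( or ::.

```java
import java.util.regex.Pattern;

class XPathWildcardPrefixer {
    // Hypothetical helper: matches a name that is
    //  - not preceded by a name character, '$' (variable), '@' (attribute),
    //    ':' (already prefixed, or after an axis) or '.',
    //  - not followed by '(' (function or node test) or '::' (axis name).
    // The possessive quantifier (*+) stops the engine from backtracking
    // into a shorter name just to satisfy the lookahead.
    private static final Pattern NAME = Pattern.compile(
            "(?<![\\w$@:.-])([A-Za-z_][\\w.-]*+)(?!\\s*(?:\\(|::))");

    static String addWildcardPrefix(String xpath) {
        return NAME.matcher(xpath).replaceAll("*:$1");
    }

    public static void main(String[] args) {
        // prints /*:Results/*:Bill[*:Item[*:id]]/*:id
        System.out.println(addWildcardPrefix("/Results/Bill[Item[id]]/id"));
    }
}
```

The limits mentioned above still apply: this sketch would also rewrite operators such as div or and, and any name that appears inside a string literal or comment, and it leaves the name after an axis (parent::x) untouched.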

Related

Different result between Javascript and Java regular expression matches

Now I am trying to match some patterns from a String containing elasticsearch's structured bulk requests. Here is an example:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]}, update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}, delete {[event_20191208][_doc][sjdos]}, update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My goal is to match every separate request out of the bulk requests string, i.e to get strings like:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]},
update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]},
delete {[event_20191208][_doc][sjdos]},
update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
And my pattern expression is [a-z]+\s\{.+?\}[,\w\t\r\n]+? which works fine on a Javascript based regular expression online tester like below:
However, when I copied this pattern expression to my Java code, the output was not what I expected. It was like this:
So I realized there are some differences between the JavaScript and Java regular expression engines, but after a lot of coding and googling I still cannot figure out how to change my expression so that it works in Java.
I would be grateful if someone could give me some help or a hint.
After a short nap, I had an epiphany. I was a fool this morning...
The workaround is easy to implement: Elasticsearch has already overridden toString() well for us.
At first glance, I wouldn't suggest using regex right away. It looks like those lines follow some kind of pattern that you could parse and split up first.
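One way to "parse and split up first" without any regex is to walk the string once and cut it at commas that sit outside every brace and bracket. The sketch below is an illustration under an assumption: braces, brackets and commas never occur inside quoted values in the document bodies. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BulkRequestSplitter {
    // Splits the bulk string at commas with nesting depth 0.
    // Assumes '{', '}', '[', ']' are balanced and never appear inside
    // quoted JSON values (this sketch does not track string literals).
    static List<String> splitTopLevel(String bulk) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < bulk.length(); i++) {
            char c = bulk.charAt(i);
            if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(bulk.substring(start, i).trim());
                start = i + 1;
            }
        }
        String tail = bulk.substring(start).trim();
        if (!tail.isEmpty()) {
            parts.add(tail);
        }
        return parts;
    }
}
```

Unlike the pattern in the question, this returns each request without the trailing comma.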
After that, if you're talking about regex, I'd try:
Taking a look at the java regex format: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
How about using an online java regex tool instead?

how to strip the text from xpath expression

I have the following xpath expression
//pre//strong[@class='messageText']
which gives me this output:
"this output O/13-1405 is valid for this scenario"
What I need is only "O/13-1405". The solution should be generic, because every test produces different text.
If this pattern is fixed (or you can find a fixed and distinct pattern directly around the wanted substring), you can use substring-before and substring-after to cut down the result:
substring-after(substring-before(//pre//strong[@class='messageText'], ' is valid for this scenario'), 'this output ')
If the output always has exactly the same length, you could also use substring($text, $start, $length).
If you need to grep out of arbitrary text and need regular expressions, you either need to use Java code or embed another XPath processor which supports at least XPath 2.0 (Saxon can be embedded quite easily).
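As a rough illustration of the substring-before / substring-after approach, the expression can be evaluated with the XPath 1.0 engine that ships with the JDK (javax.xml.xpath). The sketch assumes the page is available as well-formed XML and that the surrounding text is fixed; the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class MessageCodeExtractor {
    // XPath 1.0 only: cut away the fixed text before and after the value.
    private static final String EXPRESSION =
            "substring-after(substring-before(//pre//strong[@class='messageText'], "
            + "' is valid for this scenario'), 'this output ')";

    // Assumes 'xml' is well-formed; real HTML usually needs a tolerant parser first.
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // evaluate(String, Object) returns the result converted to a string.
            return xpath.evaluate(EXPRESSION, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```

If the surrounding text is not fixed, the same string could instead be fetched with the plain node expression and post-processed in Java with java.util.regex.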

How to extract Substring from a String in Java

I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to access whatever is between src=" and ">. I have written code like the following:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and = or space between " and > and in this case my code will not work so I was thinking to use Regex. But I am not able to come up with any Regular expression. Do you have any suggestions?
This will work, but you should look into Regular Expressions, they provide a powerful way to spot patterns and extract text accordingly.
If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0]
Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with the String.split() instead of the 5 lines of code you were using.
You can also try this regex: src\s*[=]\s*"(.*)"\s*>
Let's break it down:
src matches the literal text src
\s* matches zero or more whitespace characters
[=] matches the equals sign
"(.*)" captures the text between the double quotes in group 1
\s*> allows optional whitespace before the closing >
Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers
If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
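In Java, the extended expression might be used roughly like this. The class and method names are illustrative, and the sketch returns only the first src attribute it finds:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcAttributeExtractor {
    // src, optional whitespace around '=', then a double-quoted value
    // without embedded double quotes; the value is captured in group 1.
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    // Returns the value of the first src attribute, or null if none is found.
    static String extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

Because the pattern stops at the closing quote, whitespace between " and > no longer matters.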
Just a word of warning: HTML isn't a regular language and thus it can't be parsed using regular expressions. For simple cases like this it is ok but don't fall into the trap and think you can parse more complex html structures.

Java regex to retain specific closing tags

I'm trying to write a regex to remove all but a handful of closing xml tags.
The code seems simple enough:
String stringToParse = "<body><xml>some stuff</xml></body>";
Pattern pattern = Pattern.compile("</[^(a|em|li)]*?>");
Matcher matcher = pattern.matcher(stringToParse);
stringToParse = matcher.replaceAll("");
However, when this runs, it skips the "xml" closing tag. It seems to skip any tag where there is a matching character in the compiled group (a|em|li), i.e. if I remove the "l" from "li", it works.
I would expect this to return the following string: "<body><xml>some stuff" (I am doing additional parsing to remove the opening tags but keeping it simple for the example).
You probably shouldn't use regex for this task, but let's see what happens...
Your problem is that you are using a negated character class, and inside a character class you can't write complex expressions, only characters. You could try a negative lookahead instead:
"</(?!a|em|li).*?>"
But this won't handle a number of cases correctly:
Comments containing things that look like tags.
Tags as strings in attributes.
Tags that start with a, em, or li but are actually other tags.
Capital letters.
etc...
You can probably fix these problems, but you need to consider whether or not it is worth it, or if it would be better to look for a solution based on a proper HTML parser.
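As an illustration of fixing two of those cases (tags that merely start with a, em or li, and capital letters), the lookahead can be made to require the end of the tag right after the name. This is a sketch only, with hypothetical class and method names; it does nothing about comments or tags inside attribute values.

```java
import java.util.regex.Pattern;

class ClosingTagFilter {
    // Matches any closing tag unless the tag name is exactly a, em or li.
    // The "\\s*>" inside the lookahead keeps </abbr> from being treated as </a>.
    private static final Pattern UNWANTED_CLOSING_TAG = Pattern.compile(
            "</(?!(?:a|em|li)\\s*>)[^>]*>", Pattern.CASE_INSENSITIVE);

    static String strip(String markup) {
        return UNWANTED_CLOSING_TAG.matcher(markup).replaceAll("");
    }
}
```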
I would really use a proper parser for this (e.g. JTidy). You can't parse XML/HTML using regular expressions as it's not regular, and no end of edge cases abound. I would rather use the XML parsing available in the standard JDK (JAXP) or a suitable 3rd party library (see above) and configure your output accordingly.
See this answer for more passionate info re. parsing XML/HTML via regexps.
You cannot use an alternation inside a character class. A character class always matches a single character.
You likely want to use a negative lookahead or lookbehind instead:
"</(?!a|em|li).*?>"

How do I write regular expression in Java that takes into account the context of the string I'm looking for?

I want to parse a HTML code and create objects from their text representation in table. I have several columns and I want to save context of certain columns on every row.
Now, I have the HTML code and I understand I should use Pattern and Matcher to get those strings, but I don't know how to write required regular expression.
This is a row I will be parsing:
<tr><td>Delirium</td><td>65...</tr>
So, I want to extract Delirium from that string. How do I write a regular expression that says "get me the string that is between htm"> and </a></td>"?
This is a common question on SO and the answer is always the same: regular expressions are a poor and limited tool for parsing HTML because HTML is not a regular language.
You should be using an HTML parser, for example HTML Parser.
If you're curious what I mean by "regular language", have a look at JMD, Markdown and a Brief Overview of Parsing and Compilers. Basically a regular expression is a DFA (deterministic finite automaton or deterministic finite state machine). HTML requires a PDA (pushdown automaton) to parse. A PDA is a DFA with a stack. It's how it handles recursive elements.
htm">(.+)</a></td>
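This matches one or more characters (the .+ part) between htm"> and </a></td>. The parentheses around .+ form a capturing group, so after a successful Matcher.find() the text in between is available through Matcher.group(1). Note that .+ is greedy; if a row can contain several links, use .+? so the match stops at the first </a></td>.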
http://www.regular-expressions.info/java.html
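A minimal sketch of that pattern with Pattern and Matcher is shown below. The row in the question is truncated, so the markup used in the comment is a hypothetical reconstruction, and the class and method names are invented for the example. The reluctant .+? is used so that several links in one row are reported separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTextExtractor {
    // Captures the text between htm"> and </a></td> in group 1.
    private static final Pattern LINK_TEXT =
            Pattern.compile("htm\">(.+?)</a></td>");

    // Returns every captured link text in the order it appears.
    // Hypothetical input: <tr><td><a href="delirium.htm">Delirium</a></td>...</tr>
    static List<String> findAll(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }
}
```

This only works as long as the markup keeps exactly this shape, which is the reason the other answer recommends an HTML parser.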
