how to strip the text from xpath expression - java

I have the following XPath expression:
//pre//strong[@class='messageText']
which gives me this output:
"this output O/13-1405 is valid for this scenario"
What I need is only "O/13-1405". The solution should be generic, since every test produces different output in this text.

If this pattern is fixed (or you can find a fixed and distinct pattern directly around the wanted substring), you can use substring-before and substring-after to cut down the result:
substring-after(substring-before(//pre//strong[@class='messageText'], ' is valid for this scenario'), 'this output ')
If the output always has exactly the same length, you could also use substring($text, $start, $length).
If you need to grep out of arbitrary text and need regular expressions, you either need to use Java code or embed another XPath processor which supports at least XPath 2.0 (Saxon can be embedded quite easily).
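As a rough sketch of both routes from Java, the following uses the JDK's built-in javax.xml.xpath (XPath 1.0) to evaluate the substring expression, and falls back to java.util.regex for the pattern-based case. The class name, the sample markup, and the code pattern (one capital letter, a slash, digits, a hyphen, digits) are assumptions made for illustration; adjust the pattern to whatever your real codes look like.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class MessageCodeExtractor {
    // Assumed shape of the wanted token, e.g. "O/13-1405" (hypothetical pattern).
    private static final Pattern CODE = Pattern.compile("[A-Z]/\\d+-\\d+");

    // Evaluates an XPath 1.0 expression against well-formed XML and returns its string value.
    static String evaluate(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    // Returns the first token matching CODE, or null if there is none.
    static String extractCode(String text) {
        Matcher m = CODE.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String xml = "<html><body><pre><strong class=\"messageText\">"
                + "this output O/13-1405 is valid for this scenario"
                + "</strong></pre></body></html>";

        // Route 1: fixed surrounding text, pure XPath 1.0.
        String viaXPath = evaluate(
                "substring-after(substring-before(//pre//strong[@class='messageText'], "
                + "' is valid for this scenario'), 'this output ')", xml);
        System.out.println(viaXPath);

        // Route 2: arbitrary surrounding text, regex applied in Java.
        String viaRegex = extractCode(evaluate("//pre//strong[@class='messageText']", xml));
        System.out.println(viaRegex);
    }
}
```

Both calls print O/13-1405 for the sample markup. The regex route keeps working when the text around the code changes, as long as the code itself keeps the assumed shape.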

Related

How to write a regular expression that can matches the whole function #Prompt (…) whatever written inside () even if it contains another ()

For example I want replace any prompt function in an SQL query
I have used this expression
Query = Query.replaceAll("#prompt\\s*\\(.*?\\)", "(1)");
This expression works in this example
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
and the output is (1)
but it does not work on this example
#Prompt('Groups','A','[Lookup] Price Group (2)\Nested Group',,,)
The output is (1) \Nested Group',,,) which is not valid
Sadly, as pointed out by Joe C in a comment, what you are trying to do cannot be done with a regular expression when parentheses can nest to arbitrary depth. The reason is that regular expressions are not capable of "counting". You need a stack machine for that, or a context-free language parser.
However, you also suggest that the 'prompted' content is always inside single quotes. I assume below the standard Java regexp library. Other regexp libraries might need translation...
"#Prompt\\('[^']*'(\\s*,\\s*(('[^']*')|([^',)]*)))*\\)"
So, you are searching within prompt for blocks of single-quoted text. The search assumes that each internal bit of content is enclosed in single quotes.
Verify at https://regex101.com/r/nByy0Y/1 (I made a couple fixes). Note that at regex101.com, it will treat the double back-slash as intending a literal back-slash. What you want instead is just to quote the parenthesis so that you want a literal parenthesis.
Because you are using the lazy quantifier '?', the match stops at the first ')'. Removing it lets the match run greedily to the last ')', like this:
#prompt\(.*\)
But if the query can contain more parens after the prompt in question, this will cause problems.
Assuming the additional parens will always be in quotes, you can do this:
#prompt\((('([^'])*',*)*|(.*,*)*)\)
Here it is looking for items wrapped in single quotes OR text without parens, which should capture all of the single-quoted elements, null params, or unquoted text params.
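The "counting" that the first answer says a regular expression cannot do is easy to do by hand. The sketch below scans the query, and for each #Prompt it tracks parenthesis depth while ignoring anything inside single quotes. The class and method names are made up for this example, and it assumes quotes are never escaped inside a quoted argument.

```java
class PromptReplacer {
    private static final String MARKER = "#Prompt";

    // Replaces every #Prompt(...) call (matched case-insensitively) with the given text.
    static String replacePrompts(String sql, String replacement) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < sql.length()) {
            if (sql.regionMatches(true, i, MARKER, 0, MARKER.length())) {
                int end = findCallEnd(sql, i + MARKER.length());
                if (end >= 0) {
                    out.append(replacement);
                    i = end;          // skip the whole call
                    continue;
                }
            }
            out.append(sql.charAt(i++));
        }
        return out.toString();
    }

    // Returns the index just past the ')' that closes the call whose '(' is the
    // first non-whitespace character at or after 'from', or -1 if there is none.
    static int findCallEnd(String s, int from) {
        int i = from;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        if (i >= s.length() || s.charAt(i) != '(') {
            return -1;
        }
        int depth = 0;
        boolean inQuote = false;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\'') {
                inQuote = !inQuote;               // parens inside quotes do not count
            } else if (!inQuote && c == '(') {
                depth++;
            } else if (!inQuote && c == ')' && --depth == 0) {
                return i + 1;
            }
        }
        return -1;                                // unbalanced: leave the text untouched
    }

    public static void main(String[] args) {
        String query = "WHERE g = #Prompt('Groups','A','[Lookup] Price Group (2)\\Nested Group',,,) AND x = 1";
        System.out.println(replacePrompts(query, "(1)"));
    }
}
```

For the problem query from the question this prints WHERE g = (1) AND x = 1, and it also copes with unquoted nested parentheses, which neither regex above can do in general.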

XPath string modification using regex, Java

Say I have an XPath string like /Results/Bill[Item[id]]/id. I need to add namespace information to the path, so that the path is transformed to this: /*:Results/*:Bill[*:Item[*:id]]/*:id.
I was thinking of using a regex to do this, something like "prepend "*:" to any alphanumeric character that is not preceded by another alphanumeric character". However, I don't have very much regex knowledge and don't know what regex this would correspond to (I'm planning to use Java's replaceAll() function once I have the regex). Also, can anyone think of a counterexample where my idea wouldn't work? I'll just be performing the replacement operation on XPath strings with simple predicates (i.e. no and, or etc. in between the square brackets).
You might get a regex solution to work with some kind of subset of XPath expressions, but you will never get it to work with all XPath expressions. The XPath grammar is just too complicated.
(The most obvious bugs in your initial proposal are that it fails on variable names like $var, function names like count(..) and axis names like parent::* or @code. You could solve that by checking for the relevant punctuation before or after the symbol. Checking for text inside comments or string literals is a bit trickier. But distinguishing "div" as an element name from "div" as an operator is way beyond what a regex approach can do: it needs a full context-sensitive parser.)
Better suggestion: use a tool that gives you a parse tree for the XPath expression, modify that parse tree, and then re-serialize the modified tree into XPath syntax.
See for example what can be done with Gunther Rademacher's Rex tool, or with the W3C XQuery parser applets (both easily found with google).
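For the restricted subset described in the question (plain element names, simple predicates), the asker's idea can be sketched with a zero-width regex. This is only an illustration of that idea and of why the answer warns against it; the class name is made up, and the second call in main shows one of the failure cases listed above.

```java
class NaiveXPathPrefixer {
    // Zero-width position: not preceded by a name character, followed by a name-start character.
    private static final String NAME_START = "(?<![A-Za-z0-9_.-])(?=[A-Za-z_])";

    // Inserts "*:" in front of every name-like token. Only safe for paths that
    // contain nothing but element names, '/', '[' and ']'.
    static String addWildcardPrefix(String path) {
        return path.replaceAll(NAME_START, "*:");
    }

    public static void main(String[] args) {
        // Works for the simple case from the question.
        System.out.println(addWildcardPrefix("/Results/Bill[Item[id]]/id"));
        // Breaks on function names: the function "count" gets a prefix too.
        System.out.println(addWildcardPrefix("count(//item)"));
    }
}
```

The first call prints /*:Results/*:Bill[*:Item[*:id]]/*:id, while the second prints *:count(//*:item), which is no longer a valid call to count(). Variables, axes, attributes, and string literals go wrong in similar ways, which is why a real parser is the safer route.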

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want, and it will output This Extended. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can have huge performance bottle-necks as a result.
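To make the difference concrete, here is a small sketch comparing the greedy and negated-class patterns, plus a plain string scan that applies the replacement only between the second <p> and the next </p>, as the question asks. The class and method names are invented for this example, and the scan assumes a literal <p> with no attributes; for real HTML the DOM-parser advice above still applies.

```java
class EntityStripDemo {
    // Greedy: swallows everything from the first '&' to the last ';'.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: each match stops at the first ';' after its '&'.
    static String stripEntities(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    // Strips entities only inside the second <p>...</p> block (naive string scan).
    static String stripInSecondParagraph(String html) {
        int first = html.indexOf("<p>");
        int second = first < 0 ? -1 : html.indexOf("<p>", first + 3);
        if (second < 0) {
            return html;                       // fewer than two <p> tags: nothing to do
        }
        int close = html.indexOf("</p>", second);
        if (close < 0) {
            close = html.length();             // unclosed paragraph: go to end of input
        }
        return html.substring(0, second)
                + stripEntities(html.substring(second, close))
                + html.substring(close);
    }

    public static void main(String[] args) {
        String text = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(text));    // This Extended
        System.out.println(stripEntities(text));  // This String Extended
        System.out.println(stripInSecondParagraph("<p>a&amp;b</p><p>c&nbsp;d</p><p>e&lt;f</p>"));
    }
}
```

The last call leaves the first and third paragraphs alone and prints <p>a&amp;b</p><p>cd</p><p>e&lt;f</p>.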
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

How to extract Substring from a String in Java

I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to access whatever is between src=" and ">. I have developed a code something like below:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and = or space between " and > and in this case my code will not work so I was thinking to use Regex. But I am not able to come up with any Regular expression. Do you have any suggestions?
This will work, but you should look into Regular Expressions, they provide a powerful way to spot patterns and extract text accordingly.
If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0]
Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with the String.split() instead of the 5 lines of code you were using.
You can also try this regex: src\s*[=]\s*"(.*)"\s*>
Let's break it down:
src matches the literal text src
\s* allows zero or more whitespace characters
[=] matches the equals sign
"(.*)" captures the text between the opening quote and the last " that is followed by optional whitespace and >
Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers
If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
Just a word of warning: HTML isn't a regular language and thus it can't be parsed using regular expressions. For simple cases like this it is ok but don't fall into the trap and think you can parse more complex html structures.
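Used from Java, the whitespace-tolerant expression above might look like the following sketch. The class and method names are placeholders; the pattern is the one from the answer, and the same caveat applies: it assumes the value contains no escaped double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcExtractor {
    // src, optional whitespace around '=', then the quoted value captured in group 1.
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    // Returns the value of the first src attribute found, or null if there is none.
    static String extractSrc(String html) {
        Matcher m = SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<script language=\"JavaScript\" type=\"text/javascript\" "
                + "src = \"http://dns.net/adj/myhost.com/index;size=5x10\" ></script>";
        System.out.println(extractSrc(tag));
    }
}
```

Because the capture stops at the closing quote, a space between " and > no longer matters, which addresses the second worry from the question as well.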

Java Inner Text (getTextContent()) Problem

I'm trying to do some parsing in Java. I'm using Cobra HTML Parser to get the HTML into a DOM, then I'm using XPath to get the nodes I want. When I get down to the desired level I call node.getTextContent(), but this gives me a string like
"\n\n\nValue\n-\nValue\n\n\n"
Is there a built in way to get rid of the line breaks? I would like to do a RegEx like
(?:\s*([^-]+)\s*-\s*([^-]+)\s*)
on the inner text and would really prefer not to have to deal with the possible different white space symbols in between the text.
Example Input:
Value
-
Value
Thanks
You can use String.replaceAll().
String trimmed = original_string.replaceAll("\n", "");
The first argument is a regular expression: you could replace all contiguous blocks of whitespace in the original string with replaceAll("\\s+", "") for instance.
I'm not totally sure I understood the question correctly, but the simplest way to remove all the whitespace would be:
String s = node.getTextContent().replaceAll("\\s","");
If you just want to get rid of the leading/trailing whitespace, use trim().
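If the goal is to run the "value - value" regex from the question without caring which whitespace characters separate the parts, one option is to collapse every whitespace run to a single space first and then match. The sketch below shows that; the class and method names are made up for illustration, and it assumes neither value contains a hyphen, as the question's own pattern does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnerTextCleaner {
    // Two hyphen-free values separated by a hyphen, with optional spaces around it.
    private static final Pattern PAIR = Pattern.compile("([^-]+?)\\s*-\\s*([^-]+)");

    // Collapses every run of whitespace (spaces, tabs, line breaks) to one space and trims the ends.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    // Returns the two values, or an empty array if the text does not have the expected shape.
    static String[] splitPair(String raw) {
        Matcher m = PAIR.matcher(normalize(raw));
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String innerText = "\n\n\nValue\n-\nValue\n\n\n";
        System.out.println(normalize(innerText));          // Value - Value
        String[] parts = splitPair(innerText);
        System.out.println(parts[0] + " | " + parts[1]);   // Value | Value
    }
}
```

Replacing whitespace with a single space rather than an empty string keeps multi-word values readable, which removing all whitespace would not.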
