Groovy Regular Expressions - java

I want to match the inner part of the string between the strong tags, where it is guaranteed that the string inside the strong tags starts with Price Range:. This text should not appear anywhere else in the string, but the <p> and <strong> tags certainly do. How can I match this with Groovy?
<p><strong>Price Range: $61,000-$99,500</strong></p>
I tried:
def string = "<p><strong>Price Range: \$61,000-\$181,500</strong></p>strong";
string = string.replace(/Price.*strong/, "Replaced");
Just to see if I could get something to work, but I can't seem to get anything working that is more than a single word, which of course isn't particularly useful since I don't need regex for that.

Found the problems.
def string = "<p><strong>Price Range: \$61,000-\$181,500</strong>?</p>strong";
string = string.replaceFirst(~/<strong>Price Range.*<\/strong>/, "Replaced");
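This includes the strong tags, but it is good enough for my purpose. I needed replaceFirst instead of replace, because replace treats its argument as literal text rather than as a pattern, and I used ~ at the start to turn the slashy string into a regex Pattern.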
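For comparison, here is a minimal Java sketch of the same idea that keeps the strong tags and touches only the text between them. The class and method names are made up for illustration, and it assumes the tag has no attributes and the price text contains no nested tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PriceRangeExample {
    // Group 1 is the text between the tags; it must start with "Price Range:".
    private static final Pattern PRICE =
            Pattern.compile("<strong>(Price Range:.*?)</strong>");

    // Returns the inner text, or null when no matching tag is present.
    static String innerText(String html) {
        Matcher m = PRICE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    // Replaces only the inner text and re-emits the surrounding tags.
    static String replaceInner(String html, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return PRICE.matcher(html)
                .replaceFirst("<strong>" + Matcher.quoteReplacement(replacement) + "</strong>");
    }

    public static void main(String[] args) {
        String html = "<p><strong>Price Range: $61,000-$181,500</strong></p>strong";
        System.out.println(innerText(html));
        System.out.println(replaceInner(html, "Replaced"));
    }
}
```

The reluctant .*? stops at the first closing tag, so the trailing word strong in the sample string is not swallowed.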

Is this what you're trying to do?
http://regexr.com?2t9jp

Related

Regex: Ignoring numbers

I am trying to write a regex that tries to match on a specific string, but ignores all numbers in the target string - So my regex could be 'MyDog', but it should match MyDog, as well as My11Dog and MyDog1 etc. I could write something like
M\d*y\d*D\d*o\d*g\d*
But that is pretty painful. Any ideas out there? I am using Java, and cannot change what is in the string, because I need to retrieve it as is.
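A regular expression can do this, but why not let Java do the work and strip the digits from a copy of the string first? (I don't write Java myself.) The original string s1 is left untouched, so you can still retrieve it as is.
String s1 = "0My1D2og3";
String s2 = s1.replaceAll("\\d", "");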
if (s2.equals("MyDog")) {
// Do something
}
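If a single regex is still preferred, the digit-tolerant pattern does not have to be typed by hand. The following is a rough sketch that builds it from the target word; the class and method names are hypothetical, and it assumes digits may appear before, between, and after the letters.

```java
import java.util.regex.Pattern;

class DigitTolerantMatch {
    // Turns "MyDog" into the equivalent of \d*M\d*y\d*D\d*o\d*g\d*.
    static Pattern ignoringDigits(String word) {
        StringBuilder sb = new StringBuilder("\\d*");
        for (char c : word.toCharArray()) {
            // Quote each character so regex metacharacters in the word stay literal.
            sb.append(Pattern.quote(String.valueOf(c))).append("\\d*");
        }
        return Pattern.compile(sb.toString());
    }

    // True when the candidate equals the word once all digits are ignored.
    static boolean matches(String word, String candidate) {
        return ignoringDigits(word).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("MyDog", "My11Dog"));
        System.out.println(matches("MyDog", "MyCat"));
    }
}
```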

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want, and it will output This Extended. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can become a serious performance bottleneck as a result.
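To see the difference side by side, here is a small sketch with made-up method names. The strict variant is an extra assumption that is not part of the answer: it only accepts word characters (optionally preceded by #) between & and ;, so a stray ampersand in ordinary text is left alone.

```java
class EntityStripper {
    // Greedy: runs from the first '&' to the LAST ';' in the string.
    static String greedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: each match stops at the first ';' after its '&'.
    static String negated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    // Assumed tightening: only entity-like tokens such as &amp; or &#169;.
    static String strict(String s) {
        return s.replaceAll("&#?\\w+;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(greedy(input));   // This Extended
        System.out.println(negated(input));  // This String Extended
        System.out.println(strict(input));   // This String Extended
    }
}
```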
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

How to extract Substring from a String in Java

I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to access whatever is between src=" and ">. I have written code like the following:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and = or space between " and > and in this case my code will not work so I was thinking to use Regex. But I am not able to come up with any Regular expression. Do you have any suggestions?
This will work, but you should look into Regular Expressions, they provide a powerful way to spot patterns and extract text accordingly.
If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0];
Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with the String.split() instead of the 5 lines of code you were using.
You can also try this regex: src\s*[=]\s*"(.*)"\s*>
Let's break it down:
src matches the literal text src
\s* matches zero or more whitespace characters
[=] matches the equals sign
"(.*)" captures the text between the double quotes
\s*> allows optional whitespace before the closing >
Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers
If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
Just a word of warning: HTML isn't a regular language, so it can't be parsed in general using regular expressions. For simple cases like this it is OK, but don't fall into the trap of thinking you can parse more complex HTML structures this way.
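Put together in Java, the whitespace-tolerant expression might be used like the sketch below. The class and method names are illustrative only, and it assumes the attribute value is double-quoted and contains no escaped quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrcExtractor {
    // src, optional whitespace around '=', then the quoted value in group 1.
    private static final Pattern SRC =
            Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    // Returns the attribute value, or null when no src attribute is found.
    static String extractSrc(String tag) {
        Matcher m = SRC.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<script language=\"JavaScript\" type=\"text/javascript\" "
                + "src = \"http://dns.net/adj/myhost.com/index;size=5x10\" ></script>";
        System.out.println(extractSrc(tag));
    }
}
```

Because the pattern ends at the closing quote, a space between " and > no longer matters.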

How to remove a word to another word from string using regular expression

How can I remove a part of a string from one word to another word using regular expressions?
For example, I have a string like
String s = "<html><body> this is test </body></html>"
In the above string I have to remove the part from the starting <body> tag to the ending </body> tag, and the value in between will be determined dynamically, the output should be s="<html></html>".
Unless I'm missing something here, you can use:
s = s.replaceFirst("<body>.+</body>", "");
Of course, with your example, you might just as well use
s = s.substring(0, 6) + s.substring(s.length() - 7, s.length());
to avoid a costly regex.
If you are editing HTML, or better XHTML and/or XML, use DOM. It's not a very good idea to try to do that with regular expressions.
If you have to or want to use a regexp:
If you want to remove from HERE to THERE, have you thought of cases like HERE A HERE B THERE C THERE? A simple non-greedy match will not behave as "expected" by removing the inner HERE to THERE; it removes from the first HERE to the first THERE and leaves C THERE.
Basically, what you have to do is find a THERE and then go left to find the first HERE, so s/(.*)HERE.*?THERE/\1/ (using PCRE syntax) should do the trick and leave HERE A C THERE. Repeat to get rid of that too. However, this will not work with a global substitution that replaces all occurrences. For that use case, use this algorithm:
while (found) {
find the first `THERE`, then go left to find the nearest `HERE`, \
with regexps or without, and remove everything from that `HERE` to that `THERE`.
}
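A rough Java sketch of that loop without regexps follows. The names are hypothetical, and it assumes the two markers are distinct strings where neither contains the other; it simply stops at the first closing marker that has no opening marker before it.

```java
class InnermostRemover {
    // Repeatedly removes the innermost open...close span, markers included.
    static String removeAll(String s, String open, String close) {
        StringBuilder sb = new StringBuilder(s);
        while (true) {
            // Find the first closing marker.
            int end = sb.indexOf(close);
            if (end < 0) {
                break;
            }
            // Go left: last opening marker that ends at or before the closing one.
            int start = sb.lastIndexOf(open, end - open.length());
            if (start < 0) {
                break; // unmatched closing marker, give up
            }
            sb.delete(start, end + close.length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "<html><body> this is test </body></html>";
        System.out.println(removeAll(s, "<body>", "</body>"));
    }
}
```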

Java Regex - exclude empty tags from xml

Let's say I have three XML strings:
String logToSearch = "<abc><number>123456789012</number></abc>";
String logToSearch2 = "<abc><number xsi:type=\"soapenc:string\" /></abc>";
String logToSearch3 = "<abc><number /></abc>";
I need a pattern which finds the number tag if the tag contains value, i.e. the match should be found only in the logToSearch.
I'm not saying i'm looking for the number itself, but rather that the matcher.find method should return true only for the first string.
For now i have this:
Pattern pattern = Pattern.compile("<(" + patternString + ").*?>",
Pattern.CASE_INSENSITIVE);
where the patternString is simply "number". I tried to add "<(" + patternString + ")[^/>].*?>" but it didn't work because in [^/>] each character is treated separately.
Thanks
This is absolutely the wrong way to parse XML. In fact, if you need more than just the basic example given here, there's provably no way to solve the more complex cases with regex.
Use an easy XML parser, like XOM. Now, using xpath, query for the elements and filter those without data. I can only imagine that this question is a precursor to future headaches unless you modify your approach right now.
A search for "<number[^/>]*>" would find the opening tag only when it is not self-closing, because [^/>]* cannot step over the / before the >. If you want to be sure the element isn't empty, try "<number[^/>]*>[^<]" or "<number[^/>]*>[0-9]"
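Checked against the three strings from the question, that expression could be wrapped as in the sketch below. The class and method names are invented for illustration; it assumes attribute values never contain a / (a URL in an attribute would defeat [^/>]*), and the \b is an addition so that a tag such as <numbers> is not picked up.

```java
import java.util.regex.Pattern;

class NonEmptyTagFinder {
    // True when the tag is opened (not self-closed) and followed by at least
    // one character that does not start another tag.
    static boolean hasValue(String xml, String tagName) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tagName) + "\\b[^/>]*>[^<]",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(xml).find();
    }

    public static void main(String[] args) {
        System.out.println(hasValue("<abc><number>123456789012</number></abc>", "number"));
        System.out.println(hasValue("<abc><number xsi:type=\"soapenc:string\" /></abc>", "number"));
        System.out.println(hasValue("<abc><number /></abc>", "number"));
    }
}
```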
