Simple Java regular expression matching fails - java

Before y'all jump on me for posting something similar to previous questions asked, yes, there seem to be a number of regex related questions but nothing which seems to help me, or at least that I can see.
I am trying to parse strings in JAVA using PATTERN and MATCHER and am really having no joy. My regular expression seems to match my input string when I use a few of the online regular expression testing websites but Java simply does not match my expression.
My input string is:
"Big apple" title="Little Apple" type="Container" url="http://malcolm.com/testing"
The regular expression I am using to match is ".*" title="(.*)" type="Container" url="(.*)"
Essentially I want to pull out the text within the second and the fourth set of quotes. There will always be 4 sets of quotes with text within and around.
I am coding as follows:
Variable XMLSubstring contains the string above (including the quotes) and is as stated, even when I print it out.
Pattern p = Pattern.compile(".* title=\"(.*)\" type=\"Container\" url=\"(.*)\"");
m = p.matcher(XMLSubstring);
It doesn't appear to be rocket science I'm attempting but I'm pulling my hair out trying to debug the bloody thing.
Is there something wrong with my regex pattern?
Is there something wrong with the code I am using?
Am I simply a moron and should stop coding with immediate effect?
EDIT & UPDATE: I have found the problem. My string had a space at the end of it which was breaking the parser! How silly, and I think based on that, I need to accept the third suggestion of mine and give up programming. Thanks all for your assistance.

Try this,
String str="\"Big apple\" title=\"Little Apple\" type=\"Container\" url=\"http://malcolm.com/testing\"";
Pattern p=Pattern.compile(".* title=\\\".*\\\" type=\\\"Container\\\" url=\\\".*\\\"");
Matcher m=p.matcher(str);

Related

Messy string. Regex for excluding '#' char in the end

I have a pretty much messy array of strings which doesn't follow any particular pattern.
Basically, it's users' properties which are messed up (all info in one string without following any kind of pattern).
I'm interested in 2 particular properties (email and number).
I found a way around to get email and thought that the following regex:
^9[0-9]{9}
would work for users' phones. However, some users do have emails which are equal to phone numbers + '#'.something. That seems to be a problem.
So, I need a regex which will exclude the following and give me just a number.
9876548877#
I've tried
^9[0-9]{9}((?!#).{0})*$"
And get full match here:
9876548877
But it works so well only if the string doesn't contain anything apart from this.
I am trying to achieve is getting exactly phone number in a string like this:
/* mess mess mess*/ John Doe Jr email: 9876548877#jdoe.com, phone number: 9876548877, /* more mess */
How do I do it? Thanks in advance.
UPD:
Thank you for your answers sirs, but what the task is a little bit different
For example, I took a regex from here and then I test it here I get the result I want. I'm trying to accomplish the same behaviour, but with the phone number and without '#' to be sure that it's exactly what I'm looking for.
The question wasn't described properly. My bad.
You could use lookarounds to assert what is on the left is not a non whitespace char and on the right in not an #:
(?<!\S)9[0-9]{9}(?!\#)
Regex demo
If there can be for example a : directly before the number you could omit the lookbehind at the start and use a word boundary \b

Replacing substrings in String

I am 16 and trying to learn Java, I have a paper that my uncle gave me that has things to do in Java. One of these things is too write and execute a program that will accept an extended message as a string such as
Each time she saw the painting, she was happy
and replace the word she with the word he.
Each time he saw the painting, he was happy.
This part is simple, but he wants me to be able to take any form of she and replace it we he like (she to he, She to He, she? to he?, she. to he., she' to he' and so on). Can someone help me make a program to accomplish this.
I have this
public static void main(String[] args) {
Scanner keyboard = new Scanner(System.in);
System.out.println("Write Sentence");
String original = keyboard.nextLine();
String changeWord = "he";
String modified = original.replaceAll("she", changeWord);
System.out.println(modified);
}
If this isn't the right site to find answers like this, can you redirect me to a site that answers such questions?
The best way to do this is with regular expressions (regex). Regex allow you to match patterns or classes of words so you can deal with general cases. Consider the cases you have already listed:
(she to he, She to He, she? to he?, she. to he., she' to he' and so on)
What is common between these cases? Can you think of some general rule(s) that would apply to all such transformations?
But also consider some cases you haven't listed: for example, as you've written it now, your code will change the word "ashes" to "ahes" because "ashes" contains "she." A properly written regex expression allows you to avoid this.
Before delving into regex, try and express, in plain English, a rule or set of rules for what you want to replace and what it should be replaced with.
Then, learn some regex and attempt to apply those rules.
Lastly, try and write some tests (i.e. using JUnit) for various cases so you can see which cases your code is working for and which cases it isn't working for.
Once you have done this, if something still doesn't work, feel free to post a new question here showing us your code and explaining what doesn't work. We'll be happy to help.
I would recommend this regular expression to solve this. It seems you have to search and replace separately the uppercase S and the lowercase s
String modified = original
.replaceAll("(she)(\\W)", "he$2")
.replaceAll("(She)(\\W)", "He$2");
Explanation :
The pattern (she) will match the word she and store it as the first captured group of characters
The pattern (\\W) will match one non alphabetic character (e.g. ', .) and store it as the second captured group of characters
Both of these patterns must match consecutive parts of the input string for replaceAll to replace something.
"he$2" put in the resulting string the word he followed by the second captured group of characters (in our case the group has only one character)
The above means that the regular expression will match a pattern like She'll and replace with He'll, but it will not match a pattern like Sherlock because here She is followed by an alphabetic character r

understanding regex if then statements

So I'm not sure if I understand how this works and would like
a simple explanation to how they work is all. I probably have it way off. A pure regex solution is required, and I don't know if this is possible. If it is, a solution would be awesome too, but a shove in the right direction would be good for my learning process ^_^
This is how I thought the if/then/else option built into my regex engines was formatted:
?(condition)if regex|else regex
I want it to capture a string from a very specific location only when this string exists within a certain section of javascript. Because this is how I thought it worked after a decent amount of research I tried out a few variations of this code but they all ended up something like this.
((?^view_large$)Tables-137(.*?)search.htm)
Also of relevance: I'm using an java based app that has regex searches which pull the data I need so I cannot write an if statement in java which would be my preferred method. It's a pain to have to do it this way, but at the moment I have no other choice. I'm trying really hard for them to allow java code functionality instead of pure regex for more versatile options.
So to summarize, is there even a if/then option in regex and if so how is it formatted for what I'm trying to accomplish?
EDIT: The string that I want to be the "if condition" is like this: if view_large string exists and is not null then capture the exact string 500/ which is captured within the catch all group I used: (.*?)
There is no conditionals in Java regexp, but you can simulate them by writing two expressions that include mutually exclusive look-behind constructs, like this:
((?<=if )then)|((?<!if )end)
This expression will match "then" when it is preceded by an "if "; it will match "end" when it is not preceded by an "if "
The Javadoc for java.util.regex.Pattern mentions, in its list of "Perl constructs not supported by this class":
The conditional constructs (?(condition)X) and (?(condition)X|Y).
So, no dice. But you should look through the Javadoc to see if you can achieve what you need by using regex features that it does support. (Or, if you post some more detailed examples, we can try to help.)
Try lookaround assertions.
For example, say you want to capture FOOBAR only if there is a 4+ digit number somewhere:
(?=.*\d{4}).*(FOOBAR)

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shoudn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
How long is the string you are trying to parse ?
How long and how complicated is the regex you are trying to match ?
Have you tried to break down your regex down to simpler bits ? Adding up the bits one after another will let you see when it breaks and maybe why.
make some RE like [a-zA-Z]* pass it as argument to compile(),here this example allows only characters small & cap.
Read my blogpost on android validation for more info.
I had the same issue and I solved it replacing all the wildchart . with [\s\S]. I really don't know why it worked for me but it did. I come from Javascript world and I know in there that expression is faster for being evaluated.

Java Regex, capturing groups with comma separated values

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group for "(?:(\s)?,(\s)?(\w+?))*". The capturing group has one more occurences.. but it returns only "dislocations". "marks" and "dislocation: are devoured.
Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.
Thanks.
When the capture group is annotated with a quantifier [ie: (foo)*] then you will only get the last match. If you wanted to get all of them then you need to quantifier inside the capture and then you will have to manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
How to fix: (?:(\s)?,(\s)?(\w+?))*
Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
Regex in not suited for (natural) language processing. With regex, you can only match well defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.
The pattern that works is: \w+(?:\s*,\s*\w+)* and then manually separate CSV
There is no other method to do this with Java Regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. , and PSpeed.

Categories