regex pick full word only

I need to match only a full word using regex. I don't want to match a word if it is contained in another word, but I do want to match it if it starts or ends with special characters, as in _test, test., test/.
Example: if I am looking for "text", I don't want to match a word that merely contains it, like "context". But I do want to match full-text, /text, text., text_test, and text's.
EDIT: Since we can't identify plural forms, I am deleting that part.

First, you will benefit a lot from completing a tutorial such as this: http://www.codeproject.com/KB/dotnet/regextutorial.aspx
And Expresso is an excellent free tool for debugging and testing regular expressions.
Second, your expression should probably be something like:
\b([^A-Za-z]|[A-Za-z][^A-Za-z]+)(text)([^A-Za-z]|[^A-Za-z]+[A-Za-z])\b
\b matches word boundaries.
([^A-Za-z]|[A-Za-z][^A-Za-z]+) means "a non-alpha character OR an alpha character followed by at least one non-alpha character"
"text" will be matched by subgroup 2.
Again, go through the tutorial above, it's short and you probably could have figured out how to create this expression in the time it's taken to get an answer here.

If you're looking for a word contained in the variable word I suggest you use
"\\b\\Q" + word + "\\E\\b"
Here's a breakdown:
\b: A word boundary
\Q: Nothing, but quotes all characters until \E
\E: Nothing, but ends quoting started by \Q
Something like this may do:
Pattern p = Pattern.compile("\\b\\Q" + word + "\\E\\b");
Matcher m = p.matcher("word like \"context\" while looking for \"text\".");
while (m.find()) {
    System.out.println(m.group());
}
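One thing worth checking against the examples in the question: \b treats the underscore as a word character, so \btext\b will not match inside text_test. The following is a minimal sketch, not the answerer's code, comparing \b with a lookaround variant that treats anything other than ASCII letters and digits as a delimiter. The class and method names are made up for illustration, and Pattern.quote is used as an equivalent of wrapping the word in \Q...\E.

```java
import java.util.regex.Pattern;

class WholeWordDemo {

    // Whole-word pattern based on \b; Pattern.quote escapes regex metacharacters in the word.
    static Pattern wordBoundary(String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
    }

    // Assumed variant: only ASCII letters and digits count as "part of a word",
    // so '_' is treated as a delimiter like '-', '/', '.' or '\''.
    static Pattern lettersAndDigitsOnly(String word) {
        return Pattern.compile("(?<![A-Za-z0-9])" + Pattern.quote(word) + "(?![A-Za-z0-9])");
    }

    // Returns true if the pattern is found anywhere in the input.
    static boolean contains(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = {"context", "full-text", "/text", "text.", "text's", "text_test"};
        Pattern boundary = wordBoundary("text");
        Pattern strict = lettersAndDigitsOnly("text");
        for (String s : samples) {
            System.out.println(s + " -> \\b: " + contains(boundary, s)
                    + ", lookaround: " + contains(strict, s));
        }
    }
}
```

With these samples, both patterns reject "context", and only the lookaround variant accepts "text_test". Which behaviour is wanted depends on whether the underscore should count as part of a word.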

Related

Replacing with this pattern doesn't work as I would expect it to, what's wrong?

I need help on extracting some words from this sentence:
String keywords = "I like to find something vicous in somewhere bla bla bla.\r\n" +
"https://address.suffix.com/level/somelongurlstuff";
And my matching code looks somewhat like this:
keywords = keywords.toLowerCase();
regex = "(I like to find )(.*)( in )(.*)(\\.){1}(.*)";
regex = regex.toLowerCase();
keywords = keywords.replaceAll(regex, "$4 $2"); //"$4 $2");
I want to extract the words between "find" and "in", and between "in" and the first dot. However, as the URL has multiple dots, some weird stuff starts happening and I get what I need PLUS the URL with its dots replaced by spaces. I want the URL to be gone, because it's supposed to be matched by the final (.*), and I only need one dot after my words with (\\.){1}, so I wonder what's going wrong there? Any ideas?
Adding (?s), or removing all newline characters from the input before matching, gives something like: somewhere bla bla bla address suffix something vicious. So the problem of the URL (minus its dots) being left in the output persists.
This is NOT just about matching multiline text.

You need two things to fix: 1) add the DOTALL modifier since you have text that spans across multiple lines and 2) use lazy dot matching or - more efficient - a negated character class [^.] to match characters up to the first . after in:
(?s)(I like to find )(.*)( in )([^.]*)(\.)(.*)
                               ^^^^^^^
See the regex demo
However, the best one would be this one:
(?s)(I like to find )(.*?)( in )([^.]*)(\.)(.*)
The reluctant (lazy) quantifier makes the engine match as few characters as possible between the lazily quantified subpattern and the next subpattern. If we use .* before ( in ), backtracking will occur, that is, the whole string after "I like to find " will be grabbed by the regex engine, and then the engine will move backwards looking for the last in . Thus, using .*? will match up to the first in .
Instead of [^.]* you can use a . with a reluctant quantifier *? to match up to the first dot, but it is costlier in terms of performance since the engine expands the subpattern upon each fail it comes across when trying to match the string with the subsequent subpatterns.
Check my answer for Perl regex matching optional phrase in longer sentence to understand how greedy and lazy (=reluctant) quantifiers work.
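As a rough illustration, the lazy pattern above can be dropped into the replaceAll call from the question. This is only a sketch: the class and method names are invented here, and the literal prefix is written in lower case because the question lower-cases the input before matching.

```java
class LazyMatchDemo {

    // Hypothetical helper: applies the (?s) + lazy + negated-class pattern
    // and swaps the two captured phrases, as in the question's "$4 $2".
    static String extract(String input) {
        String regex = "(?s)(i like to find )(.*?)( in )([^.]*)(\\.)(.*)";
        return input.toLowerCase().replaceAll(regex, "$4 $2");
    }

    public static void main(String[] args) {
        String keywords = "I like to find something vicous in somewhere bla bla bla.\r\n"
                + "https://address.suffix.com/level/somelongurlstuff";
        // prints: somewhere bla bla bla something vicous
        System.out.println(extract(keywords));
    }
}
```

Because (?s) lets the final (.*) run across the line break, the URL is consumed by the match and disappears from the result.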

word range or \w in negative lookbehind

I was trying to write a regex that extracts the word in the position of Delhi in the text
sending to: GK Delhi, where "sending to:" is fixed and I don't want to capture whatever is in the position of GK. GK will always be a single word in my case. What I wrote, and thought should work, is (?<=sending to: \w )Delhi, meaning: if the text starts with "sending to:", followed by one word, and ends with Delhi, then return Delhi.
Please help me fix this.

Three points,
\w matches a single word character. Use \w+ to match one or more or \w* to match zero or more word characters.
Don't forget about the space between GK and Delhi: \s+.
Just a note: The (?<= construct is the positive lookbehind, not negative one.
So the regex could look like this:
(?<=sending to:\s*\w+\s+)Delhi
Please also note that arbitrary-length lookbehind is only supported by very few regex engines, but you didn't say anything about the tool you are using.
Update:
Java doesn't support arbitrary-length lookbehind expressions.
The possibilities you have are:
The matched text will always be Delhi (on successful match). So if you are only checking for a match, then you could just use the regex: sending to:\s*\w+\s+Delhi.
If you want to extend the regex to other towns in future, then you could use a capturing group. The regex would be, for example, sending to:\s*\w+\s+(Delhi|Mumbai) and in Java code you would get the city name via matcher.group(1).
Please post your actual Java code of how you are using the regex if you want a more detailed advice.
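In the absence of the asker's code, here is a minimal sketch of the capturing-group option. The class name, method name and sample inputs are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityDemo {

    // Fixed prefix, one arbitrary word, then the city captured in group 1.
    private static final Pattern CITY =
            Pattern.compile("sending to:\\s*\\w+\\s+(Delhi|Mumbai)");

    // Returns the captured city, or null when the text does not match.
    static String city(String text) {
        Matcher m = CITY.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(city("sending to: GK Delhi"));  // Delhi
        System.out.println(city("sending to: Delhi"));     // null (no word before the city)
    }
}
```

The capturing group sidesteps the lookbehind entirely: the prefix is consumed by the match, but only group 1 is returned.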

multiple regular expressions vs search algorithm

I have a text file where every line is a random combination of any of the following groups
Numbers - English Letters - Arabic Letters - Punctuation
\w which is composed of a-zA-Z0-9_ for the first 2 groups
\p{InArabic} for the third group
\p{Punct} which is composed of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ for the fourth group
I got this info from here
I read each line. The ONLY time I do something to a line is if it contains Arabic letters AND (English letters OR Unicode Symbols)
After reading this post and this post I came up with the following expression. Obviously it's wrong as my output is all wrong >.<
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
Here's the input
1
1a
a!
aش
شa
ششa
aشش
شaش
aشa
!aش
The first three shouldn't be matched but my output shows that NONE are a match.
Edit: Sorry, I just realized that I forgot to change my title. But if any of you feel that a search algorithm would perform better, please suggest one. Using a search algorithm instead of regex looks ugly, but I'd go with it if it performed better. Thanks to the posts I read, I learned that I can make the regex faster by putting the following in the constructor, so that it is executed only once instead of on every iteration of my loop:
pattern = Pattern.compile("(?=\\p{InArabic})(?=[a-zA-Z])");
matcher = pattern.matcher("");

To follow your idea, the correct pattern is:
pattern = Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");
The same position in a string cannot be followed by an Arabic letter and by a punctuation character or Latin letter at the same time. In other words, you have written an always-false condition. Adding .* allows the characters to be anywhere in the string.
If you want a more optimised pattern, you can use Jason C's idea, but with negated character classes to reduce backtracking:
pattern = Pattern.compile("\\p{InArabic}[^a-zA-Z\\p{Punct}]*[a-zA-Z\\p{Punct}]|[a-zA-Z\\p{Punct}]\\P{InArabic}*\\p{InArabic}");
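To check the two-lookahead pattern against the sample input from the question, something like the sketch below could be used. The class and method names are invented for illustration, and the Arabic letter is written as the escape \u0634 only to avoid source-encoding problems. Note that it relies on find(), since the lookaheads consume no characters.

```java
import java.util.regex.Pattern;

class MixedLineDemo {

    // Compiled once and reused, as the question suggests doing in a constructor.
    private static final Pattern MIXED =
            Pattern.compile("(?=.*\\p{InArabic})(?=.*[a-zA-Z\\p{Punct}])");

    // True when the line contains an Arabic character AND a Latin letter or punctuation.
    static boolean isMixed(String line) {
        return MIXED.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "1", "1a", "a!",
            "a\u0634", "\u0634a", "\u0634\u0634a", "a\u0634\u0634",
            "\u0634a\u0634", "a\u0634a", "!a\u0634"
        };
        for (String line : lines) {
            System.out.println(line + " -> " + isMixed(line));
        }
    }
}
```

With this input the first three lines should report false and the remaining seven true, which is the behaviour the question asks for.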

If you want to find a line with a mix, all you really need are 2 boundary condition checks.
A successful match indicates a mix.
# "\\p{InArabic}(?=[\\w\\p{Punct}])|(?<=[\\w\\p{Punct}])\\p{InArabic}"
\p{InArabic}
(?= [\w\p{Punct}] )
|
(?<= [\w\p{Punct}] )
\p{InArabic}

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
A typical keyword is "face". I want to match "face" and "faces" but not "facebook". Whitespace around the word is acceptable, so matches like " face", " faces", "face " or "faces " are all fine. However, I cannot accept "duckface" or "duckface ".
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword comes from my list of keywords, but I am not getting the desired results. Can you suggest what the issue might be and how I can fix it?
Also, a pointer to a really good page on regex for Java would be appreciated.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if (m.find()) {
    System.out.println("Its a Match: " + m.group());
}
This returns a blank string...

If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
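A small sketch of the \b...s?\b approach applied to a sample sentence follows. The class name, method names and sample text are assumptions for illustration; Pattern.quote is an extra precaution in case a keyword contains regex metacharacters, and is not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordDemo {

    // Builds \b<keyword>s?\b for a singular keyword.
    static Pattern forKeyword(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
    }

    // Collects every whole-word occurrence of the keyword or its simple plural.
    static List<String> findAll(String keyword, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = forKeyword(keyword).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "a face, two faces, facebook and duckface";
        System.out.println(findAll("face", text)); // [face, faces]
    }
}
```

"facebook" and "duckface" are skipped because there is no word boundary between "face" and the adjacent letters.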

You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
where keyword is not plural. \\b means that the keyword must appear as a complete word in the searched string. s? means that the keyword may be followed by an optional s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because it has examples and explanations.

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] As far as I can tell from the (very large) input, the [?]'s are either whitespace or nothing at all; maybe there's some sort of encoding issue, or perhaps it has something to do with #text nodes (the input is XML).

The * quantifier matches "zero or more", which means it will also match an empty string, one that contains none of the characters in your class. Try the + quantifier, which means "one or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.find() instead of Matcher.matches(), it will not require a full string match and you can use the regex [a-zA-Z0-9]+. (Matcher.lookingAt() also does not require a full match, but it only matches at the start of the input.)
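The difference between the three Matcher methods can be seen in a short sketch like the one below. The class and method names are made up for illustration; the Matcher calls themselves are the standard java.util.regex API.

```java
import java.util.regex.Pattern;

class MatchMethodsDemo {

    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    // matches(): the whole input must match the pattern.
    static boolean whole(String s) {
        return ALNUM.matcher(s).matches();
    }

    // lookingAt(): the match must start at the beginning, but may end early.
    static boolean prefix(String s) {
        return ALNUM.matcher(s).lookingAt();
    }

    // find(): the match may start anywhere in the input.
    static boolean anywhere(String s) {
        return ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abc ", " abc", "  "}) {
            System.out.println("[" + s + "] matches=" + whole(s)
                    + " lookingAt=" + prefix(s) + " find=" + anywhere(s));
        }
    }
}
```

For " abc", for example, only find() succeeds, because the alphanumeric run does not start at the beginning of the input.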

You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-Z0-9]*");
if (!p.matcher(gottenData).matches())
    System.out.println("doesn't match.");
This prints "doesn't match."

The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to add the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means 0 or more. It can therefore match 0 characters, so matches() returns true for an empty string (and with find(), the pattern would match any input). What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
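The effect of the quantifier and of the A-z typo can be checked with a few sample strings. This is only a sketch with invented class and field names, not code from the question.

```java
import java.util.regex.Pattern;

class QuantifierDemo {

    static final Pattern STAR = Pattern.compile("^[a-zA-Z0-9]*$"); // zero or more
    static final Pattern PLUS = Pattern.compile("^[a-zA-Z0-9]+$"); // one or more
    static final Pattern TYPO = Pattern.compile("^[a-zA-z0-9]+$"); // A-z also covers [ \ ] ^ _ `

    // Convenience wrapper around Matcher.matches().
    static boolean ok(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(ok(STAR, ""));        // true: empty input slips through
        System.out.println(ok(PLUS, ""));        // false
        System.out.println(ok(PLUS, "riz99"));   // true
        System.out.println(ok(PLUS, "riz99.z")); // false
        System.out.println(ok(PLUS, "a_b"));     // false
        System.out.println(ok(TYPO, "a_b"));     // true: '_' lies between 'Z' and 'a' in ASCII
    }
}
```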

You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string

Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...

Did anyone consider adding a space to the regex: [a-zA-Z0-9 ]*? This should match any normal text made of letters, numbers and spaces. If you want quotes and other special characters, add them to the regex too.
You can quickly test your regex at http://www.regexplanet.com/simple/

You can check whether an input value contains only letters and numbers by using the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z.
Otherwise it does not match, e.g. 99z., riz99.z, riz99.9.
Example code (JavaScript):
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}
