Java: String.contains matches exact word - java

In Java
String term = "search engines"
String subterm_1 = "engine"
String subterm_2 = "engines"
If I do term.contains(subterm_1) it returns true. I don't want that. I want the subterm to exactly match one of the words in term
Therefore something like term.contains(subterm_1) returns false and term.contains(subterm_2) returns true

\b Matches a word boundary where a word character is [a-zA-Z0-9_].
This should work for you, and you could easily reuse this method.
public class testMatcher {
public static void main(String[] args){
String source1="search engines";
String source2="search engine";
String subterm_1 = "engines";
String subterm_2 = "engine";
System.out.println(isContain(source1,subterm_1));
System.out.println(isContain(source2,subterm_1));
System.out.println(isContain(source1,subterm_2));
System.out.println(isContain(source2,subterm_2));
}
private static boolean isContain(String source, String subItem){
String pattern = "\\b"+subItem+"\\b";
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(source);
return m.find();
}
}
Output:
true
false
false
true

I would suggest using word boundaries. If you compile a pattern like \bengines\b, your regular expression will only match on complete words.
Here is an explanation of word boundaries, as well as some examples.
http://www.regular-expressions.info/wordboundaries.html
Also, here is the java API for the pattern, which does include word boundaries
http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Here is an example using your requirements above
Pattern p = Pattern.compile("\\bengines\\b");
Matcher m = p.matcher("search engines");
System.out.println("matches: " + m.find());
p = Pattern.compile("\\bengine\\b");
m = p.matcher("search engines");
System.out.println("matches: " + m.find());
and here is the output:
matches: true
matches: false

If the words are always separated by spaces, this is one way to go:
String string = "search engines";
String[] parts = string.split(" ");
for(int i = 0; i < parts.length; i++) {
if(parts[i].equals("engine")) {
//do whatever you want
}

I want the subterm to exactly match one of the words in term
Then you can't use contains(). You could split the term into words and check equality (with or without case sensitivity).
boolean hasTerm = false;
for (String word : term.split("\\s+") {
if (word.equals("engine")) {
hasTerm = true;
break;
}
}

Use indexOf instead and then check whether char at the poistion
index + length of string plus +1 == ` ` or EOS
or I am sure there is a regex way as well.

Since the contains method verify if does exist that array of char in the string, it will aways return true, you will have to use Regex to make this validation.
If the words are aways separed by space it is easier, you can use the \s regex to get it.
Here is a good tutorial: http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

One approach could be to split the string by spaces, convert it to a list, and then use the contains method to check for exact matches, like so:
String[] results = term.split("\\s+");
Boolean matchFound = Arrays.asList(results).contains(subterm_1);
Demo

Related

Java regex to validate the beginning of a list of strings

If I have a list of strings say:
private List<String> domains;
How do I validate the strings and create a boolean to confirm they all start with a symbol that is not a word or number?
Result of string should be something like:
#yahoo.com
)cloud.com
and not:
yahoo.com
cloud.com
I have
Pattern p = Pattern.compile("/[^\\w._\\s]/g");
Matcher m = p.matcher(values.getList().get(0)); //right now it's only checking first element I'm not sure how to check them all
boolean b = m.matches();
This doesn't seem to be working.
When you're using Java 8 (or higher), you can make use of the following:
Pattern p = Pattern.compile("^[^a-zA-Z0-9].*");
boolean all = domains.stream().allMatch(st -> p.matcher(st).matches());
all then contains a boolean that checks if all the domains match the regex.
The regex matches everything that doesn't start with a lowercase, uppercase character or a numeric character.
You can use matcher.lookingAt with a simple regex \\W that means a non-word character (that includes non letters and non digits) like this:
Pattern p = Pattern.compile("\\W");
boolean status = true;
for (String str : domains) {
Matcher m = p.matcher(str);
if ((status = m.lookingAt()) == false) {
break;
}
}
System.out.println( status );
lookingAt attempts to match the input sequence, starting at the beginning of the region, against the pattern.
btw you are mixing Javascript regex syntax in Java. There is no /..../g in Java.

How to extract a number from a string in a particular format?

I have a String like this as shown below. From below string I need to extract number 123 and it can be at any position as shown below but there will be only one number in a string and it will always be in the same format _number_
text_data_123
text_data_123_abc_count
text_data_123_abc_pqr_count
text_tery_qwer_data_123
text_tery_qwer_data_123_count
text_tery_qwer_data_123_abc_pqr_count
Below is the code:
String value = "text_data_123_abc_count";
// this below code will not work as index 2 is not a number in some of the above example
int textId = Integer.parseInt(value.split("_")[2]);
What is the best way to do this?
With a little guava magic:
String value = "text_data_123_abc_count";
Integer id = Ints.tryParse(CharMatcher.inRange('0', '9').retainFrom(value)
see also CharMatcher doc
\\d+
this regex with find should do it for you.
Use Positive lookahead assertion.
Matcher m = Pattern.compile("(?<=_)\\d+(?=_)").matcher(s);
while(m.find())
{
System.out.println(m.group());
}
You can use replaceAll to remove all non-digits to leave only one number (since you say there will be only 1 number in the input string):
String s = "text_data_123_abc_count".replaceAll("[^0-9]", "");
See IDEONE demo
Instead of [^0-9] you can use \D (which also means non-digit):
String s = "text_data_123_abc_count".replaceAll("\\D", "");
Given current requirements and restrictions, the replaceAll solution seems the most convenient (no need to use Matcher directly).
u can get all parts from that string and compare with its UPPERCASE, if it is equal then u can parse it to a number and save:
public class Main {
public static void main(String[] args) {
String txt = "text_tery_qwer_data_123_abc_pqr_count";
String[] words = txt.split("_");
int num = 0;
for (String t : words) {
if(t == t.toUpperCase())
num = Integer.parseInt(t);
}
System.out.println(num);
}
}

Java String- How to get a part of package name in android?

Its basically about getting string value between two characters. SO has many questions related to this. Like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I felt it quiet confusing while dealing with multiple dots in the string and getting the value between certain two dots.
I have got the package name as :
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I tried using substrings and IndexOf also. But couldn't get the intended answer. Because the package name in android varies by different number of dots and characters, I cannot use fixed index. Please suggest any idea.
As you probably know (based on .* part in your regex) dot . is special character in regular expressions representing any character (except line separators). So to actually make dot represent only dot you need to escape it. To do so you can place \ before it, or place it inside character class [.].
Also to get only part from parenthesis (.*) you need to select it with proper group index which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// ^
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String resutl = beforeTask.substring(start, end);
System.out.println(resutl);
You can use reflections to get the name of any class. For example:
If I have a class Runner in com.some.package and I can run
Runner.class.toString() // string is "com.some.package.Runner"
to get the full name of the class which happens to have a package name inside.
TO get something after 'com' you can use Runner.class.toString().split(".") and then iterate over the returned array with boolean flag
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
String[] parts = packageName.split("\\.");
int i = 0;
for(String part : parts) {
if(part.equals("com")
break;
}
++i;
}
String result = parts[i+1];
private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
for(int i=0; i<strArr.length; i++){
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects before dealing with websites scraping and I
just have to create my own function/utils to get the job done. Regex might
be an overkill sometimes if you just want to extract a substring from
a given string like the one you have. Below is the function I normally
use to do this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()+1);
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");

How to check if only chosen characters are in a string?

What's the best and easiest way to check if a string only contains the following characters:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_
I want like an example like this pseudo-code:
//If String contains other characters
else
//if string contains only those letters
Please and thanks :)
if (string.matches("^[a-zA-Z0-9_]+$")) {
// contains only listed chars
} else {
// contains other chars
}
For that particular class of String use the regular expression "\w+".
Pattern p = Pattern.compile("\\w+");
Matcher m = Pattern.matcher(str);
if(m.matches()) {}
else {};
Note that I use the Pattern object to compile the regex once so that it never has to be compiled again which may be nice if you are doing this check in a-lot or in a loop. As per the java docs...
If a pattern is to be used multiple
times, compiling it once and reusing
it will be more efficient than
invoking this method each time.
My turn:
static final Pattern bad = Pattern.compile("\\W|^$");
//...
if (bad.matcher(suspect).find()) {
// String contains other characters
} else {
// string contains only those letters
}
Above searches for single not matching or empty string.
And according to JavaDoc for Pattern:
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

Regular Expression problem in Java

I am trying to create a regular expression for the replaceAll method in Java. The test string is abXYabcXYZ and the pattern is abc. I want to replace any symbol except the pattern with +. For example the string abXYabcXYZ and pattern [^(abc)] should return ++++abc+++, but in my case it returns ab++abc+++.
public static String plusOut(String str, String pattern) {
pattern= "[^("+pattern+")]" + "".toLowerCase();
return str.toLowerCase().replaceAll(pattern, "+");
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
When I try to replace the pattern with + there is no problem - abXYabcXYZ with pattern (abc) returns abxy+xyz. Pattern (^(abc)) returns the string without replacement.
Is there any other way to write NOT(regex) or group symbols as a word?
What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.
Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.
For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.
Here is the pattern for “not ‘abc’”:
[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c
It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:
[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded with a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded with ab
The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
You can imagine that this technique will fail once you have a target word containing one character more than once, like example. It is pretty hard to express “match e if it is not followed by x and not preceded by l”.
Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
[^ ... ] will match one character that is not any of ...
So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right bracket"; and indeed that is what happens in your test.
It is hard to say "replace all characters that are not part of the string 'abc'" in a single trivial regular expression. What you might do instead to achieve what you want could be some nasty thing like
while the input string still contains "abc"
find the next occurrence of "abc"
append to the output a string containing as many "+"s as there are characters before the "abc"
append "abc" to the output string
skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
or possibly if the input alphabet is restricted you could use regular expressions to do something like
replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"
which will be more readable but may not perform as well
Negating regexps is usually troublesome. I think you might want to use negative lookahead. Something like this might work:
String pattern = "(?<!ab).(?!abc)";
I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.
Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
Try to solve it without regular expressions:
String out = "";
int i;
for(i=0; i<text.length() - pattern.length() + 1; ) {
if (text.substring(i, i + pattern.length()).equals(pattern)) {
out += pattern;
i += pattern.length();
}
else {
out += "+";
i++;
}
}
for(; i<text.length(); i++) {
out += "+";
}
Rather than a single replaceAll, you could always try something like:
#Test
public void testString() {
final String in = "abXYabcXYabcHIH";
final String expected = "xxxxabcxxabcxxx";
String result = replaceUnwanted(in);
assertEquals(expected, result);
}
private String replaceUnwanted(final String in) {
final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
final Matcher m = p.matcher(in);
final StringBuilder out = new StringBuilder();
while (m.find()) {
out.append(m.group(1).replaceAll(".", "x"));
out.append(m.group(2));
out.append(m.group(3).replaceAll(".", "x"));
}
return out.toString();
}
Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static String plusOut(String str, String pattern) {
StringBuilder builder = new StringBuilder();
String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
while(m.find()) {
builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
}
return builder.toString();
}
public static void main(String[] args) {
String text = "abXYabcXYZ";
String pattern = "abc";
System.out.println(plusOut(text, pattern));
}
}
Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...

Categories