Regex pattern for matching words like c++ in a text - java

I have a text which can have words like c++, c, .net, asp.net in any format.
Sample Text:
Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.
I already have c,c++,java,.net,asp.net stored somewhere.
All I need is to pick the occurrences of all these words in the text.
The pattern I was using to match was (?i)\\b(" +Pattern.quote(key)+ ")\\b which doesn't match things like c++ and .net. So I tried escaping the literals using (?i)\\b(" +forRegex(key)+ ")\\b (method link here), and I got the same result.
The expected output is that it should match (case-insensitively):
C++ : 2
C : 2
java: 2
asp.net : 1
.net : 1

Set<String> keywords = new HashSet<>(); // add your keywords to this set
String text="Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
text=text.replaceAll("[,;]"," "); // turn separators into spaces
String[] textArray=text.split(" ");
for(String s : keywords){
int count=0;
for(int i=0;i<textArray.length;i++){ // length is a field on arrays, not a method
if(textArray[i].equalsIgnoreCase(s)){ // compare case-insensitively
count++;
}
}
System.out.println(s + " : " + count);
}
This works most of the time. (If you want better results, change the character class passed to replaceAll.)

I would choose a non-regex solution to your problem. Just put the keywords into an array, and search for each occurrence in the input string. This approach uses String.indexOf(String, int) to iterate through the string without creating any new objects (beyond the index and counter).
public class SearchWordCountNonRegex {
public static final void main(String[] ignored) {
//Keywords and input searched for with lowercase, so the keyword "java"
//matches "Java", "java", and "JAVA".
String[] searchWords = {"c++", "c", "java", "asp.net", ".net"};
String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.".
toLowerCase();
for(int i = 0; i < searchWords.length; i++) {
String searchWord = searchWords[i];
System.out.print(searchWord + ": ");
int foundCount = 0;
int currIdx = 0;
while(currIdx != -1) {
currIdx = input.indexOf(searchWord, currIdx);
if(currIdx != -1) {
foundCount++;
currIdx += searchWord.length();
} else {
currIdx = -1;
}
}
System.out.println(foundCount);
}
}
}
Output:
c++: 2
c: 4
java: 2
asp.net: 1
.net: 2
If you really want a regex solution, you could try something like the following, which uses a case-insensitive pattern to match each keyword.
The problem is that the number of occurrences must be kept track of separately. This could be done, for example, by adding each found keyword to a map, where the key is the keyword, and the value is its current count. In addition, once a match is found, the search continues from that point, which implies that any potential overlapping matches are hidden (such as when Asp.NET is found, that particular .NET match will never be found)--this may or may not be a desired behavior.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class SearchWordsRegexNoCounts {
public static final void main(String[] ignored) {
Matcher keywordMtchr = Pattern.compile("(C\\+\\+|C|Java|Asp\\.NET|\\.NET)",
Pattern.CASE_INSENSITIVE).matcher("");
String input = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me C,C++,Java,asp.net skills.";
keywordMtchr.reset(input);
while(keywordMtchr.find()) {
System.out.println("Keyword found at index " + keywordMtchr.start() + ": " + keywordMtchr.group(1));
}
}
}
Output:
Keyword found at index 7: java
Keyword found at index 32: .net
Keyword found at index 57: C
Keyword found at index 60: C++
Keyword found at index 90: C
Keyword found at index 92: C++
Keyword found at index 96: Java
Keyword found at index 101: asp.net
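The map-based counting mentioned above could be sketched roughly as follows. This is only an illustration and not part of the original answer: the class name KeywordCounter and the method count are hypothetical, and it assumes a keyword only counts when it is not directly preceded or followed by a word character. It also shows why \b fails for c++ and .net: \b needs a word character on exactly one side, and there is none between + and a space, or between a space and a dot. The lookarounds (?<!\w) and (?!\w) do not have that requirement. Keywords are tried longest first so that c++ wins over c.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordCounter {
    // Counts case-insensitive occurrences of each keyword in the text.
    // Keys of the returned map are the lower-cased keywords.
    static Map<String, Integer> count(String text, List<String> keywords) {
        // Longest keywords first, so "c++" is tried before "c".
        List<String> ordered = new ArrayList<>(keywords);
        ordered.sort((a, b) -> Integer.compare(b.length(), a.length()));

        // Pattern.quote makes characters such as '+' and '.' literal.
        String alternation = ordered.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));

        // Lookarounds instead of \b: no word character directly before or after.
        Pattern pattern = Pattern.compile(
                "(?<!\\w)(?:" + alternation + ")(?!\\w)",
                Pattern.CASE_INSENSITIVE);

        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            counts.put(keyword.toLowerCase(Locale.ROOT), 0);
        }

        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            counts.merge(matcher.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }
}
```

For the sample text from the question and the keywords c++, c, java, asp.net and .net, this sketch gives 2, 2, 2, 1 and 1 respectively. As with any boundary heuristic, a lone "c" in something like "c-section" would still be counted.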

Using regex I've come up with the following solution. Although it can potentially find undesired matches as described in the code comments:
// "\\" is first because we don't want to escape any escape characters we will
// be adding ourselves
private static final String[] regexSpecial = {"\\", "(", ")", "[", "]", "{",
"}", ".", "+", "*", "?", "^", "$", "|"};
private static final String regexEscape = "\\";
private static final String[] regexEscapedSpecial;
static {
regexEscapedSpecial = new String[regexSpecial.length];
for (int i = 0; i < regexSpecial.length; i++) {
regexEscapedSpecial[i] = regexEscape + regexSpecial[i];
}
}
public static void main(String[] args) throws Throwable {
Set<String> searchWords = new HashSet<String>(Arrays.asList("c++", "c",
".net", "asp.net", "java"));
String text = "Hello, java is what I want. Hmm .net should be fine too. C, C++ are also need. So, get me\nC,C++,Java,asp.net skills.";
System.out.println(numOccurrences(text, searchWords, false));
}
/**
* Counts the number of occurrences of the given words in the given text. This
* allows the given "words" to contain non-word characters. Note that it is
* possible for unexpected matches to occur. For example if one of the words
* to match is "c" then while none of the "c"s in "coconut" will be matched,
* the "c" in "c-section" will even if only matches of "c" as in the "c
* programming language" were intended.
*/
public static Map<String, Integer> numOccurrences(String text,
Set<String> searchWords, boolean caseSensitive) {
Map<String, String> lowerCaseToSearchWords = new HashMap<String, String>();
List<String> searchWordsInOrder = sortByNonInclusion(searchWords);
StringBuilder regex = new StringBuilder("(?<!\\w)(");
boolean started = false;
for (String searchWord : searchWordsInOrder) {
lowerCaseToSearchWords.put(searchWord.toLowerCase(), searchWord);
if (started) {
regex.append("|");
} else {
started = true;
}
regex.append(escapeRegex(searchWord));
}
regex.append(")(?!\\w)");
Pattern pattern = null;
if (caseSensitive) {
pattern = Pattern.compile(regex.toString());
} else {
pattern = Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
}
Matcher matcher = pattern.matcher(text);
Map<String, Integer> matches = new HashMap<String, Integer>();
while (matcher.find()) {
String match = lowerCaseToSearchWords.get(matcher.group(1).toLowerCase());
Integer oldVal = matches.get(match);
if (oldVal == null) {
oldVal = 0;
}
matches.put(match, oldVal + 1);
}
return matches;
}
/**
* Sorts the given collection of words in such a way that if A is a prefix of
* B, then it is guaranteed that A will appear after B in the sorted list.
*/
public static List<String> sortByNonInclusion(Collection<String> toSort) {
List<String> sorted = new ArrayList<String>(new HashSet<String>(toSort));
// sorting in reverse alphabetical order will ensure that if A is a prefix
// of B it will appear later in the list than B
Collections.sort(sorted, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return o2.compareTo(o1);
}
});
return sorted;
}
/**
* Escape all regex special characters in the given text.
*/
public static String escapeRegex(String toEscape) {
for (int i = 0; i < regexSpecial.length; i++) {
toEscape = toEscape.replace(regexSpecial[i], regexEscapedSpecial[i]);
}
return toEscape;
}
The printed result is
{asp.net=1, c=2, c++=2, java=2, .net=1}


Attributed Value of Ternary Operator Resets After Each Loop

I would like to be able to go through an inputted string and count the number of times "good" is written and compare it to how many times "bad" is written. If the good and the bad match, then goodVbad==0 and it returns true. Otherwise it returns false.
The code worked fine when I was using if statements inside the for-loop, but when using the ternary operator it doesn't. While debugging, I realized that each time the for-loop moves onto the next element 'goodVbad' becomes zero again. Kind of stumped, would love some advice. Thanks!
public static boolean goodbadClean(String word) {
String [] wordS;
int goodVbad=0;
String good="good";
String bad="bad";
word=word.toLowerCase();
word=word.replaceAll(good, " good ");
word=word.replaceAll(bad, " bad ");
wordS=word.split(" ");
for(String i:wordS) {
goodVbad=i.equals(good)?goodVbad++
:i.equals(bad) ?goodVbad--
:goodVbad;
}
if(goodVbad==0) {
return true;
}
return false;
}
The problem is that the postfix ++ operator evaluates to the old value. The variable is incremented, but the old value is then assigned back to it, which undoes the increment, i.e.
goodVbad = goodVbad++; // evaluates to the old value, so the assignment undoes the increment
so you should use the prefix ++ operator:
goodVbad = ++goodVbad; // increments first, then evaluates to the new value
But both of these are hard to read and brittle.
If you must use ternaries, change your code to:
goodVbad += i.equals(good) ? 1 : (i.equals(bad) ? -1 : 0);
However, nested ternaries are generally a style smell. I recommend instead:
if (i.equals(good)) {
goodVbad++;
} else if (i.equals(bad)) {
goodVbad--;
}
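The difference between the postfix and prefix forms, and the compound-assignment fix, can be checked with a small sketch. The class and method names here (GoodBadCounter, postfixAssign, prefixAssign, goodBadClean) are made up for illustration, and the method assumes the same "pad with spaces, then split" approach as the question.

```java
class GoodBadCounter {
    // x = x++ : the expression x++ yields the OLD value, which is then
    // assigned back to x, so x ends up unchanged.
    static int postfixAssign(int x) {
        x = x++;
        return x;
    }

    // x = ++x : the increment happens first, and the NEW value is assigned.
    static int prefixAssign(int x) {
        x = ++x;
        return x;
    }

    // Returns true when "good" and "bad" occur equally often.
    static boolean goodBadClean(String word) {
        int goodVbad = 0;
        String spaced = word.toLowerCase()
                .replace("good", " good ")
                .replace("bad", " bad ");
        for (String token : spaced.split(" ")) {
            // += with plain values avoids assigning the result of ++ or --.
            goodVbad += token.equals("good") ? 1 : token.equals("bad") ? -1 : 0;
        }
        return goodVbad == 0;
    }
}
```

Here postfixAssign(5) returns 5 while prefixAssign(5) returns 6, which is the behaviour the asker saw in the debugger.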
Assuming that the OP needs to count the frequency of a string pattern in a given string, then you could do something like this with Java 8 or older:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CountMatches {
public static void main(String[] args) {
String phrase1 = "goodbadbadgoodgoodbad"; // equal amount of good vs bad.
String phrase2 = "goodbadbadgoodgoodbadbad"; // more bad than good.
String phrase3 = "goodbadbadgoodgoodbadgood"; // more good than bad.
// create capturing groups for "good" and "bad"
String GOOD_REGEX = "(good)";
String BAD_REGEX = "(bad)";
Pattern gPattern = Pattern.compile(GOOD_REGEX);
Pattern bPattern = Pattern.compile(BAD_REGEX);
Matcher countGood = gPattern.matcher(phrase1);
Matcher countBad = bPattern.matcher(phrase1);
int count = 0;
while (countBad.find()) {
count--;
}
while (countGood.find()) {
count++;
}
System.out.println(count == 0);
}
}
With Java 9 or later:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CountMatches {
public static void main(String[] args) {
String phrase1 = "goodbadbadgoodgoodbad"; // equal amount of good vs bad.
String phrase2 = "goodbadbadgoodgoodbadbad"; // more bad than good.
String phrase3 = "goodbadbadgoodgoodbadgood"; // more good than bad.
String GOOD_REGEX = "(good)";
String BAD_REGEX = "(bad)";
Pattern gPattern = Pattern.compile(GOOD_REGEX);
Pattern bPattern = Pattern.compile(BAD_REGEX);
Matcher countGood = gPattern.matcher(phrase1);
Matcher countBad = bPattern.matcher(phrase1);
long cCount = countGood.results().count();
long bCount = countBad.results().count();
System.out.println(cCount - bCount == 0);
}
}
My assumption is based on this line of code word=word.replaceAll(good, " good ");. This tells me that the expected input is something similar to the phrase variables I used for my testing.
By the way, this solution should work even if the words "good" and/or "bad" are preceded or followed by spaces.
UPDATE: Integrated looping to evaluate all expressions against all phrases.
public static void main(String[] args) {
List<String> expressions = List.of("(good)", "(bad)");
List<String> phrases = List.of("goodbadbadgoodgoodbad", "goodbadbadgoodgoodbadbad", "goodbadbadgoodgoodbadgood", " good bad bad good good bad ");
for (String phrase : phrases) {
List<Long> itemCount = new ArrayList<>();
for (String regex : expressions) {
Pattern gPattern = Pattern.compile(regex);
Matcher matcher = gPattern.matcher(phrase);
long count = matcher.results().count();
System.out.println("Pattern \"" + regex + "\" appears " + count + (count == 1 ? " time" : " times"));
itemCount.add(count);
}
long count = itemCount.stream().reduce((value1, value2) -> value1 - value2).get();
System.out.println(count == 0);
}
}
This outputs:
Pattern "(good)" appears 3 times
Pattern "(bad)" appears 3 times
true
Pattern "(good)" appears 3 times
Pattern "(bad)" appears 4 times
false
Pattern "(good)" appears 4 times
Pattern "(bad)" appears 3 times
false
Pattern "(good)" appears 3 times
Pattern "(bad)" appears 3 times
true

How I can use InCombiningDiacriticalMarks ignoring one case

I'm writing code to remove all diacritics from a String.
For example: áÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑ
I'm using the Unicode property InCombiningDiacriticalMarks, but I want to leave ñ and Ñ untouched.
Currently I save these two characters before the replacement with:
s = s.replace('ñ', '\001');
s = s.replace('Ñ', '\002');
Is it possible to use InCombiningDiacriticalMarks while ignoring the diacritic of ñ and Ñ?
This is my code:
public static String stripAccents(String s)
{
/*Save ñ*/
s = s.replace('ñ', '\001');
s = s.replace('Ñ', '\002');
s = Normalizer.normalize(s, Normalizer.Form.NFD);
s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
/*Add ñ to s*/
s = s.replace('\001', 'ñ');
s = s.replace('\002', 'Ñ');
return s;
}
It works fine, but I want to know if it's possible to optimize this code.
It depends what you mean by "optimize". It's tough to reduce the number of lines of code from what you have written, but since you are processing the string six times there's scope to improve performance by processing the input string only once, character by character:
public class App {
// See SO answer https://stackoverflow.com/a/10831704/2985643 by virgo47
private static final String tab00c0
= "AAAAAAACEEEEIIII"
+ "DNOOOOO\u00d7\u00d8UUUUYI\u00df"
+ "aaaaaaaceeeeiiii"
+ "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey"
+ "AaAaAaCcCcCcCcDd"
+ "DdEeEeEeEeEeGgGg"
+ "GgGgHhHhIiIiIiIi"
+ "IiJjJjKkkLlLlLlL"
+ "lLlNnNnNnnNnOoOo"
+ "OoOoRrRrRrSsSsSs"
+ "SsTtTtTtUuUuUuUu"
+ "UuUuWwYyYZzZzZzF";
public static void main(String[] args) {
var input = "AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ";
var output = removeDiacritic(input);
System.out.println("input = " + input);
System.out.println("output = " + output);
}
public static String removeDiacritic(String input) {
var output = new StringBuilder(input.length());
for (var c : input.toCharArray()) {
if (isModifiable(c)) {
c = tab00c0.charAt(c - '\u00c0');
}
output.append(c);
}
return output.toString();
}
// Returns true if the supplied char is a candidate for diacritic removal.
static boolean isModifiable(char c) {
boolean modifiable;
if (c < '\u00c0' || c > '\u017f') {
modifiable = false;
} else {
modifiable = switch (c) {
case 'ñ', 'Ñ' ->
false;
default ->
true;
};
}
return modifiable;
}
}
This is the output from running the code:
input = AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ
output = AaBbCcaAeEiIoOuUaAeEiIoOuUñÑcC
Characters without diacritics in the input string are not modified. Otherwise the diacritic is removed (e.g. Ç to C), except in the cases of ñ and Ñ.
Notes:
The code does not use the Normalizer class or InCombiningDiacriticalMarks at all. Instead it processes each character in the input string only once, removing its accent if appropriate. The conventional approach for removing diacritics (as used in the OP) does not support selective removal as far as I know.
The code is based on an answer by user virgo47, but enhanced to support the selective removal of accents. See virgo47's answer for details of mapping an accented character to its unaccented counterpart.
This solution only works for Latin-1/Latin-2, but could be enhanced to support other mappings.
Your solution is very short and easy to understand, but it feels brittle, and for large input I suspect that it would be significantly slower than an approach that only processed each character once.
Ave Maria Purisima,
You can create a pattern excluding the tilde from the diacritical marks set:
private static final Pattern STRIP_ACCENTS_PATTERN = Pattern.compile("[\\p{InCombiningDiacriticalMarks}&&[^\u0303]]+");
public static String stripAccents(String input) {
if (input == null) {
return null;
}
final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll("");
}
Hope it helps
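Note that a pattern excluding U+0303 keeps every tilde (so ã would keep its tilde too) and leaves the text in decomposed form. A possible refinement, sketched here under those observations, keeps the tilde only when it directly follows n or N and recomposes the result with NFC. The class name AccentStripper is hypothetical, and the lookbehind assumes the tilde is the first combining mark after the base letter.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class AccentStripper {
    // Matches a combining tilde (U+0303) that does NOT directly follow n/N,
    // or any other character of the Combining Diacritical Marks block.
    private static final Pattern MARKS_TO_STRIP = Pattern.compile(
            "(?<![nN])\\x{0303}|[\\p{InCombiningDiacriticalMarks}&&[^\\x{0303}]]");

    static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // Decompose: each accented letter becomes base letter + combining mark(s).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = MARKS_TO_STRIP.matcher(decomposed).replaceAll("");
        // Recompose so that n + tilde becomes a single precomposed character again.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }
}
```

This still makes several passes over the string (two normalizations and one replace), so it is shorter than the lookup-table approach above but not necessarily faster.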

How to check if an Array contains a particular word in a String and get it?

I have a String[] and an input String:
String[] ArrayEx = new String[1];
String textInput = "a whole bunch of words";
What I want to do is check if the String contains a word present in the Array, like this.
Ex: textInput = "for example" and ArrayEx[0] = "example"
I know about this method:
Arrays.asList(yourArray).contains(yourValue)
but it checks the full String right? How do I check if the String contains a particular word present in the Array. Even if it is from an ArrayList I have no problem.
Also if yes, can I get that word from the String[]? i.e., in the above case get the String "example".
EDIT:
public void searchNearestPlace(String v2txt)
{
Log.e("TAG", "Started");
v2txt = v2txt.toLowerCase();
String[] places = {"accounting, airport, amusement_park, aquarium, art_gallery, atm, bakery, bank, bar, beauty_salon, bicycle_store, book_store, bowling_alley, bus_station, cafe, campground, car_dealer, car_rental, car_repair, car_wash, casino, cemetery, church, city_hall, clothing_store, convenience_store, courthouse, dentist, department_store, doctor, electrician, electronics_store, embassy, establishment, finance, fire_station, florist, food, funeral_home, furniture_store, gas_station, general_contractor, grocery_or_supermarket, gym, hair_care, hardware_store, health, hindu_temple, home_goods_store, hospital, insurance_agency, jewelry_store, laundry, lawyer, library, liquor_store, local_government_office, locksmith, lodging, meal_delivery, meal_takeaway, mosque, movie_rental, movie_theater, moving_company, museum, night_club, painter, park, parking, pet_store, pharmacy, physiotherapist, place_of_worship, plumber, police, post_office, real_estate_agency, restaurant, roofing_contractor, rv_park, school, shoe_store, shopping_mall, spa, stadium, storage, store, subway_station, synagogue, taxi_stand, train_station, travel_agency, university, veterinary_care, zoo"};
int index;
for(int i = 0; i<= places.length - 1; i++)
{
Log.e("TAG","for");
if(v2txt.contains(places[i]))
{
Log.e("TAG", "sensed?!");
index = i;
}
}
Say v2txt was "awesome airport": the "sensed?!" log never appears, even though all the other logs indicate that the code is running.
Edit2:
I am so embarrassed that I made such a dunder head mistake. My array is declared wrongly. There should be a " before every ,. I am such a big idiot!
Sorry will change it and let you know.
First of all it has nothing to do with android
Second the solution
boolean flag = false;
String textInput = "for example";
int index = 0;
String[] yourArray = {"ak", "example"};
for (int i = 0; i <= yourArray.length - 1; i++) {
if (textInput.contains(yourArray[i])) {
flag = true;
index = i;
}
}
if (flag)
System.out.println("found at index " + index);
else
System.out.println("not found ");
EDIT :
Change your array to
String[] places = {"accounting", "airport", "amusement_park" };
and so on for the other values. With your current declaration, the array has only one element.
You can split your string to get an array of words
String[] txArray = textInput.split(" ");
then for each element in txArray check if
Arrays.asList(ArrayEx).contains(txArray[i])
String text = "Hello I'm your String";
String[] splitStr = text.split(" ");
// check every word of the input against the array
for (String word : splitStr) {
if (Arrays.asList(ArrayEx).contains(word)) {
System.out.println("FOUND");
}
}
You can use Java - Regular Expressions.
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. They can be used to search, edit, or manipulate text and data.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Testing {
public static void main(String[] args) {
String textInput = "for example";
String[] arrayEx = new String[1];
arrayEx[0] = "example";
Pattern p = Pattern.compile(arrayEx[0]);
Matcher m = p.matcher(textInput);
boolean matchedFoundStatus = false;
while (m.find()) {
matchedFoundStatus = true;
}
System.out.println("matchedFoundStatus:" + matchedFoundStatus);
}
}
Try this:
String text2check = "Your Name";
for(int t = 0; t < array.length; t++)
{
if (text2check.equals(array[t])) {
// Process it here
break;
}
}
"How do I check if the String contains a particular word present in the Array?" is the same as asking "Is there an element in the array that the input string contains?"
Java 8
String[] words = { "example", "hello world" };
String input = "a whole bunch of words";
Arrays.stream(words).anyMatch(input::contains);
(The matching words can also be extracted, if needed:)
Arrays.stream(words)
.filter(input::contains)
.toArray();
If you are stuck with Java 7, you will have to re-implement "anyMatch" and "filter" yourself:
Java 7
boolean anyMatch(String[] words, String input) {
for(String s : words)
if(input.contains(s))
return true;
return false;
}
List<String> filter(String[] words, String input) {
List<String> matches = new ArrayList<>();
for(String s : words)
if(input.contains(s))
matches.add(s);
return matches;
}
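One caveat with contains is that it matches substrings, so "park" would also be found inside "parking". If whole-word matching is wanted, a sketch along these lines may help. It is an illustration only: WordFinder and firstWholeWordMatch are made-up names, and it assumes words in the input are separated by whitespace and compared case-insensitively.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

class WordFinder {
    // Returns the first array element that appears as a whole word in the input.
    static Optional<String> firstWholeWordMatch(String[] words, String input) {
        // Split the input on whitespace and collect the tokens for fast lookup.
        Set<String> tokens = new HashSet<>(
                Arrays.asList(input.toLowerCase().split("\\s+")));
        return Arrays.stream(words)
                .filter(word -> tokens.contains(word.toLowerCase()))
                .findFirst();
    }
}
```

The Optional is empty when nothing matches, which answers "does it contain one?" and "which one?" in a single call.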
This will take a String array and search through all of its strings, checking whether each one occurs as a character sequence in the input string. Also, native Android apps are programmed in the Java language. You might find it beneficial to read up more on Strings.
String [] stringArray = new String[5];
//populate your array
String inputText = "abc";
for(int i = 0; i < stringArray.length; i++){
if(inputText.contains(stringArray[i])){
//Do something
}
}

How do I split/parse this String properly using Regex

I am inexperienced with regex and rusty with Java, so some help here would be appreciated.
So I have a String in the form:
statement|digit|statement
statement|digit|statement
etc.
where statement can be any combination of characters, digits, and spaces.
I want to parse this string such that I save the first and last statements of each line in a separate string array.
for example if I had a string:
cats|1|short hair and long hair
cats|2|black, blue
dogs|1|cats are better than dogs
I want to be able to parse the string into two arrays.
Array one = [cats], [cats], [dogs]
Array two = [short hair and long hair],[black, blue],[cats are better than dogs]
Matcher m = Pattern.compile("(\\.+)|\\d+|=(\\.+)").matcher(str);
while(m.find()) {
String key = m.group(1);
String value = m.group(2);
System.out.printf("key=%s, value=%s\n", key, value);
}
I would have continued to add the keys and values into separate arrays had my output been right, but no luck. Any help with this would be very much appreciated.
Here is a solution with RegEx:
public class ParseString {
public static void main(String[] args) {
String data = "cats|1|short hair and long hair\n"+
"cats|2|black, blue\n"+
"dogs|1|cats are better than dogs";
List<String> result1 = new ArrayList<>();
List<String> result2 = new ArrayList<>();
Pattern pattern = Pattern.compile("(.+)\\|\\d+\\|(.+)");
Matcher m = pattern.matcher(data);
while (m.find()) {
String key = m.group(1);
String value = m.group(2);
result1.add(key);
result2.add(value);
System.out.printf("key=%s, value=%s\n", key, value);
}
}
}
Here is a great site to help with regex expressions: http://txt2re.com/. Enter some example text in step 1. Select the parts you are interested in in step 2. Then select a language in step 3. Finally, copy, paste and massage the code that it spits out.
Double split should work:
class ParseString
{
public static void main(String[] args)
{
String s = "cats|1|short hair and long hair\ncats|2|black, blue\ndogs|1|cats are better than dogs";
String[] sa1 = s.split("\n");
for (int i = 0; i < sa1.length; i++)
{
String[] sa2 = sa1[i].split("\\|");
System.out.printf("key=%s, value=%s\n", sa2[0], sa2[2]);
} // end for i
} // end main
} // end class ParseString
Output:
key=cats, value=short hair and long hair
key=cats, value=black, blue
key=dogs, value=cats are better than dogs
There is no need for a complex regex pattern; you could simply split the string by the delimiter token using the String's split method (String#split()) in Java.
Working Example
public class StackOverFlow31840211 {
private static final int SENTENCE1_TOKEN_INDEX = 0;
private static final int DIGIT_TOKEN_INDEX = SENTENCE1_TOKEN_INDEX + 1;
private static final int SENTENCE2_TOKEN_INDEX = DIGIT_TOKEN_INDEX + 1;
public static void main(String[] args) {
String[] text = {
"cats|1|short hair and long hair",
"cats|2|black, blue",
"dogs|1|cats are better than dogs"
};
ArrayList<String> arrayOne = new ArrayList<String>();
ArrayList<String> arrayTwo = new ArrayList<String>();
for (String s : text) {
String[] tokens = s.split("\\|");
int tokenType = 0;
for (String token : tokens) {
switch (tokenType) {
case SENTENCE1_TOKEN_INDEX:
arrayOne.add(token);
break;
case SENTENCE2_TOKEN_INDEX:
arrayTwo.add(token);
break;
}
++tokenType;
}
}
System.out.println("Sentences for first token: " + arrayOne);
System.out.println("Sentences for third token: " + arrayTwo);
}
}
I agree with the other answers that you should use split, but I am providing an answer that uses Pattern.split, since it uses a regex.
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.Pattern;
/* Name of the class has to be "Main" only if the class is public. */
class MatchExample
{
public static void main (String[] args) {
String[] data = {
"cats|1|short hair and long hair",
"cats|2|black, blue",
"dogs|1|cats are better than dogs"
};
Pattern p = Pattern.compile("\\|\\d+\\|");
for(String line: data){
String[] elements = p.split(line);
System.out.println(elements[0] + " // " + elements[1]);
}
}
}
Notice that the pattern will match on one or more digits between two |'s. I see what you are doing with the groupings.
The main problem is that you need to escape the | and not the . character. Also, what is the = doing in your regex? I generalized the regex a little, but you can replace .* with \\d+ to match the same thing as yours.
Matcher m = Pattern.compile("^(.+?)\\|.*\\|(.+)$", Pattern.MULTILINE).matcher(str);
Here is the strict version:"^([^|]+)\\|\\d+\\|([^|]+)$" (also with MULTILINE)
And it's indeed easier using split (on the lines) as some have said, but like this:
String[] parts = str.split("\\|\\d+\\|");
If parts.length is not two then you know it is not a legal line.
If your input is always formatted like that, then you can just do with this single statement to get the left part in the even indexes and the right part in the odd indexes (0: line1-left, 1: line1-right, 2: line2-left, 3: line2-right, 4: line3-left ...), so you will get an array twice the size of line count.
String[] parts = str.split("\\|\\d+\\||\\n+");
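The per-line split with a length check described above could be put together like this. It is a minimal sketch, not the original poster's code: PipeLineParser and parse are hypothetical names, and it assumes lines are separated by \n (optionally preceded by \r) and that malformed lines should simply be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PipeLineParser {
    // Returns two lists: index 0 holds the first statements,
    // index 1 holds the last statements, in line order.
    static List<List<String>> parse(String data) {
        List<String> first = new ArrayList<>();
        List<String> last = new ArrayList<>();
        for (String line : data.split("\\r?\\n")) {
            // Split on the "|digits|" separator; -1 keeps trailing empty parts.
            String[] parts = line.split("\\|\\d+\\|", -1);
            if (parts.length != 2) {
                continue; // not a legal "statement|digit|statement" line
            }
            first.add(parts[0]);
            last.add(parts[1]);
        }
        return Arrays.asList(first, last);
    }
}
```

Splitting line by line keeps the two halves paired even if one line is malformed, which the single combined split cannot guarantee.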

Replace substring with a regex combination

Since I'm not that familiar with java, I don't know if there's a library somewhere that can do this thing. If not, does anybody have any ideas how can this be accomplished?
For instance I have a string "foo" and I want to change the letter f with "f" and "a" so that the function returns a list of strings with values "foo" and "aoo".
How to deal with it when there's more of the same letters? "ffoo" into "ffoo", "afoo", "faoo", "aaoo".
A better explanation:
(("a",("a","b")),("c",("c","d")))
Above is a group of characters that need to be replaced with a character from the other element. "a" is to be replaced with "a" and with "b". "c" is to be replaced with "c" and "d".
If I have a string "ac", the resulting combinations I need are:
"ac"
"bc"
"ad"
"bd"
If the string is "IaJaKc", the resulting combinations are:
"IaJaKc"
"IbJaKc"
"IaJbKc"
"IbJbKc"
"IaJaKd"
"IbJaKd"
"IaJbKd"
"IbJbKd"
The number of combinations can be calculated like this:
(replacements_of_a^letter_amount_a)*(replacements_of_c^letter_amount_c)
first case: 2^1*2^1 = 4
second case: 2^2*2^1 = 8
If, say, the group is (("a",("a","b")),("c",("c","d","e"))), and the string is "aac", the number of combinations is:
2^2*3^1 = 12
Here is the code for your example with foo and aoo
public List<String> doSmthTricky (String str) {
return Arrays.asList(str.replaceAll("(^.)(.*)", "$1$2 a$2").split(" "));
}
For the input "foo" this method returns a list with 2 strings "foo" and "aoo".
It works only if there are no whitespace characters in your input string ("foo" in your example). Otherwise it's a bit more complicated.
How to deal with it when there's more of the same letters? "ffoo" into "ffoo", "afoo", "faoo", "aaoo".
I doubt that regular expressions could help here, you want to generate strings based on initial string, it's not a task for regexp.
UPD: I've created a recursive function (actually it's half-recursive half-iterative) which generates strings based on the template string by replacing its first characters with characters from a specified set:
public static List<String> generatePermutations (String template, String chars, int depth, List<String> result) {
if (depth <= 0) {
result.add (template);
return result;
}
for (int i = 0; i < chars.length(); i++) {
String newTemplate = template.substring(0, depth - 1) + chars.charAt(i) + template.substring(depth);
generatePermutations(newTemplate, chars, depth - 1, result);
}
generatePermutations(template, chars, depth - 1, result);
return result;
}
The depth parameter specifies how many characters from the beginning of the string should be replaced. The number of permutations is (chars.length() + 1) ^ depth.
Tests:
System.out.println(generatePermutations("ffoo", "a", 2, new LinkedList<String>()));
Output: [aaoo, faoo, afoo, ffoo]
--
System.out.println(generatePermutations("ffoo", "ab", 3, new LinkedList<String>()));
Output: [aaao, baao, faao, abao, bbao, fbao, afao, bfao, ffao, aabo, babo, fabo, abbo, bbbo, fbbo, afbo, bfbo, ffbo, aaoo, baoo, faoo, aboo, bboo, fboo, afoo, bfoo, ffoo]
I'm not sure what you need. Please specify the source and the result you expect. In any case, you should use the standard Java classes for this purpose: java.util.regex.Pattern and java.util.regex.Matcher. If you need to deal with repeated letters at the beginning, there are two ways: use the "^" symbol, which matches the beginning of the line, or use "\b", which matches a word boundary such as the beginning of a word. For more sophisticated cases, take a look at "lookbehind" expressions. Complete descriptions of these techniques can be found in the Javadoc for java.util.regex; if that's not enough, look at www.regular-expressions.info. Good luck.
Here it is:
public static void returnVariants(String input){
List<String> output = new ArrayList<String>();
StringBuffer word = new StringBuffer(input);
output.add(input);
String letters = "ac";
int lettersLength = letters.length();
int wordLength = word.length();
String replacement = "";
for (int i = 0; i < lettersLength; i++) {
for (int j = 0; j < wordLength; j++) {
if(word.charAt(j)==letters.charAt(i)){
if (word.charAt(j)=='a'){
replacement = "ab";
}else if (word.charAt(j)=='c'){
replacement = "cd";
}
List<String> tempList = new ArrayList<String>();
for (int k = 0; k < replacement.length(); k++) {
for (String variant : output){
StringBuffer tempBuffer = new StringBuffer(variant);
String combination = tempBuffer.replace(j, j+1, replacement.substring(k, k+1)).toString();
tempList.add(combination);
}
}
output.addAll(tempList);
if (j==0){
output.remove(0);
}
}
}
}
Set<String> uniqueCombinations = new HashSet<String>(output);
System.out.println(uniqueCombinations);
}
If input is "ac", the combinations returned are "ac", "bc", "ad", "bd". If it can be optimized further, any additional help is welcome and appreciated.
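A more general variant builds the combinations character by character from a replacement map, so no duplicates arise and nothing is hard-coded for 'a' and 'c'. This is a sketch under assumptions: VariantGenerator and variants are made-up names, and each map value is assumed to be a string listing every allowed character for that key (including the key itself if it may stay unchanged).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class VariantGenerator {
    // Builds every combination by extending all prefixes one character at a time.
    static List<String> variants(String input, Map<Character, String> replacements) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : input.toCharArray()) {
            // Characters without an entry in the map are kept as they are.
            String options = replacements.getOrDefault(c, String.valueOf(c));
            List<String> next = new ArrayList<>();
            for (int k = 0; k < options.length(); k++) {
                for (String prefix : results) {
                    next.add(prefix + options.charAt(k));
                }
            }
            results = next;
        }
        return results;
    }
}
```

With a mapped to "ab" and c mapped to "cd", the input "ac" yields ac, bc, ad, bd in that order, and the result size matches the product formula from the question.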
