Conditions:
There are many rules, maybe hundreds, which look like:
{aab*, aabc*,
aabcdd*, dtctddds*,
*ddt*,
*cddt*,
*bcddt*,
*t,
*ttt,
*ccddttt}
Each time I get one string, and I need to find the longest rule that matches it.
Examples:
Example 1: the string is aabcddttt; the matched rule should be aabcdd*
Example 2: the string is accddttt; the matched rule should be *ccddttt
Question:
I don't want to keep the rules in a long array and match the string against them one by one, because that is inefficient. Maybe I should use the string as a regex to match against the hundreds of rules, but so far I can't find an elegant way to solve this problem.
Can I use some regexes to get the result?
Which is the best/fastest way to match?
Java, plain C or shell are preferred; please don't use the C++ STL.
Longest common substring
Perhaps this algorithm is what you are looking for =).
Why not do it simply?
String[] rules = {"^aab", "bcd", "aabcdd$", "dtctddds$", "^ddt$", "^cddt$", "^bcddt$", "^t", "^ttt", "^ccddttt"};
String testCase = "aabcddttt";
for (int i = 0; i < rules.length; i++) {
Pattern p = Pattern.compile(rules[i]);
Matcher m = p.matcher(testCase);
if (m.find()) {
System.out.println("String: " + testCase + " has matched the pattern " + rules[i]);
}
}
So basically in this case, rules[0], which is ^aab, is found because the caret (^) means the string must begin with aab. On the other hand, a trailing dollar sign, as in bba$, means the string must end with bba. And rules[1] is found because a rule with no anchor (e.g. bcd) can appear anywhere in testCase.
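Since the rules in the question use * as a wildcard rather than regex anchors, one possible sketch translates each rule into a regex and keeps the matching rule with the most literal characters. The class name LongestRule, the helper toPattern, and the choice to measure "longest" by the number of non-wildcard characters are all assumptions made for illustration, not something stated in the question.

```java
import java.util.regex.Pattern;

class LongestRule {
    // Translate a wildcard rule into a regex: '*' becomes ".*", every other character is literal.
    static Pattern toPattern(String rule) {
        StringBuilder sb = new StringBuilder();
        for (char c : rule.toCharArray()) {
            sb.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString());
    }

    // Return the matching rule with the most literal characters, or null if no rule matches.
    // Assumption: "longest" means the most non-wildcard characters.
    static String longestMatch(String text, String... rules) {
        String best = null;
        int bestLength = -1;
        for (String rule : rules) {
            int literalLength = rule.replace("*", "").length();
            // matches() requires the whole string to fit the rule.
            if (literalLength > bestLength && toPattern(rule).matcher(text).matches()) {
                best = rule;
                bestLength = literalLength;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] rules = {"aab*", "aabc*", "aabcdd*", "dtctddds*", "*ddt*",
                          "*cddt*", "*bcddt*", "*t", "*ttt", "*ccddttt"};
        System.out.println(longestMatch("aabcddttt", rules)); // aabcdd*
        System.out.println(longestMatch("accddttt", rules));  // *ccddttt
    }
}
```

This sketch still tests every rule, so by itself it does not address the efficiency concern; compiling the patterns once and reusing them would at least avoid repeated compilation.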
You could try matching them all at once, with parentheses around each sub-rule. You could then use the capture groups to determine which one matched.
public static void main(String... ignored) {
for (String test : "aabaa,wwwaabcdddd,abcddtxyz".split(",")) {
System.out.println(test + " matches " + longestMatch(test, "aab*", "aabc*", "aabcdd*", "dtctddds*", "ddt"));
}
}
public static String longestMatch(String text, String... regex) {
String[] sortedRegex = regex.clone();
Arrays.sort(sortedRegex, new Comparator<String>() {
@Override
public int compare(String o1, String o2) {
return o2.length() - o1.length();
}
});
StringBuilder sb = new StringBuilder();
String sep = "(";
for (String s : sortedRegex) {
sb.append(sep).append('(').append(s).append(')');
sep = "|";
}
sb.append(")");
Matcher matcher = Pattern.compile(sb.toString()).matcher(text);
if (matcher.find()) {
for (int i = 2; i <= matcher.groupCount(); i++) {
String group = matcher.group(i);
if (group != null)
return sortedRegex[i - 2];
}
}
return "";
}
prints
aabaa matches aabc*
wwwaabcdddd matches aabcdd*
abcddtxyz matches ddt
Related
I want to find the occurrences of one or more specific characters, but occurrences that sit between quotes must not count:
Example:
"this is \"my\" example string"
If you look for the char 'm', then it should only return the index of the 'm' in "example", as the other 'm' is between double quotes.
Another example:
"th\"i\"s \"is\" \"my\" example string"
I'm expecting something like:
public List<Integer> getOccurrenceStartIndexesThatAreNotBetweenQuotes(String snippet,String stringToFind);
One "naive" way would be to:
get all the start indexes of stringToFind in snippet
get all the indexes of the quotes in snippet
Depending on the start index of stringToFind, and because you have the positions of the quotes, you can tell whether you are between quotes or not.
Is there a better way to do this?
EDIT:
What do I want to retrieve? The indexes of the matches.
Few things:
There can be many quoted sections in the string to search in: "th\"i\"s \"is\" \"my\" example string"
In the string : "th\"i\"s \"is\" \"my\" example string", "i", "is" and "my" are between quotes.
It's not limited to letters and digits, we can have ';:()_-=+[]{} etc...
Here's one solution:
Algorithm:
Find all the "Dead Zone" regions within the String (e.g. regions that are off limits because they are within quotes)
Find all the regions where the String contains the search string in question (hitZones in the code).
Retain only the regions in the hitZones that are not contained in any deadZones. I will leave this part to you :)
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class FindStrings
{
// Just a simple model class for regions
static class Pair
{
int s = 0;
int e = 0;
public Pair (int s, int e)
{
this.s = s;
this.e = e;
}
public String toString ()
{
return "[" + s + ", " + e + "]";
}
}
public static void main(String[] args)
{
String search = "other";
String str = "this is \"my\" example other string. And \"my other\" this is my str in no quotes.";
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(str);
List<Pair> deadZones = new ArrayList<Pair>();
while (m.find())
{
int s = m.start();
int e = m.end();
deadZones.add(new Pair(s, e - 1));
}
List<Pair> hitZones = new ArrayList<Pair>();
p = Pattern.compile(Pattern.quote(search)); // quote so the search string is treated literally, not as a regex
m = p.matcher(str);
while (m.find())
{
int s = m.start();
int e = m.end();
hitZones.add(new Pair(s, e - 1));
}
System.out.println(deadZones);
System.out.println(hitZones);
}
}
Note: The s component of all Pairs in the hitZones, that are not within deadZones, will ultimately be what you want.
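The filtering step that the answer leaves open could be sketched as follows. The class name QuoteAwareSearch and the method findOutsideQuotes are hypothetical, and the sketch assumes that a hit only counts as "inside quotes" when it lies entirely within one quoted region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSearch {
    // Return the start indexes of every occurrence of search that is not inside a quoted region.
    static List<Integer> findOutsideQuotes(String text, String search) {
        // Dead zones: inclusive [start, end] ranges covering each "..." region, quotes included.
        List<int[]> deadZones = new ArrayList<>();
        Matcher quoted = Pattern.compile("\"[^\"]*\"").matcher(text);
        while (quoted.find()) {
            deadZones.add(new int[] {quoted.start(), quoted.end() - 1});
        }

        List<Integer> result = new ArrayList<>();
        // Pattern.quote makes the search string a literal rather than a regex.
        Matcher hit = Pattern.compile(Pattern.quote(search)).matcher(text);
        while (hit.find()) {
            int s = hit.start();
            int e = hit.end() - 1;
            boolean inside = false;
            for (int[] zone : deadZones) {
                if (s >= zone[0] && e <= zone[1]) {
                    inside = true;
                    break;
                }
            }
            if (!inside) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "th\"i\"s \"is\" \"my\" example string";
        System.out.println(findOutsideQuotes(text, "m")); // [20]
    }
}
```

The inner loop is linear in the number of quoted regions; for long inputs a sorted structure such as a TreeMap keyed by region start would be one way to speed up the lookup.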
As Mamun suggested, you can remove all the quotes and strings between quotes and then search. The following is a regex solution (though I agree with Tim that it's probably not a job for a regex engine).
String snippetQuoteRemoved = snippet.replaceAll("(?:\")(\\w+)(?:\")","");
// Now simply search in snippetQuoteRemoved
NOTE: This will look for \w+ i.e. ([a-zA-Z0-9_]). Change it to whatever is suitable for your use case.
EDIT
I checked if it'd remove everything and that's not the case. Check here.
Also, for those extra special chars, just change the regex to (?:")([a-zA-Z0-9_';:()_\-=+\[\]\{\}]+)(?:").
Another solution:
get all the start indexes of stringToFind in snippet
get all the indexes of the quotes in snippet
Depending on the start index of stringToFind, and because you have the positions of the quotes, you can tell whether you are between quotes or not.
public List<Integer> getOccurrenceIndexesNotInQuotes(String snippet,String patternToFind) {
List<Integer> allIndexes = getStartPositions(snippet,patternToFind);
List<Integer> allQuoteIndexes = getStartPositions(snippet,"\"");
TreeSet<Integer> allQuoteIndexesTree = new TreeSet<>(allQuoteIndexes);
List<Integer> finalIndexes = new ArrayList<>();
for (Integer index : allIndexes){
Integer quoteIndexValue = allQuoteIndexesTree.floor(index);
int quoteIndex = allQuoteIndexes.indexOf(quoteIndexValue);
if (quoteIndexValue == null || !isBetweenQuote(quoteIndex)){
finalIndexes.add(index);
}
}
return finalIndexes;
}
private List<Integer> getStartPositions(String stringToProcess, String regex) {
List<Integer> out = new ArrayList<>();
Matcher matcher = Pattern.compile(regex).matcher(stringToProcess);
while(matcher.find()) {
out.add(matcher.start());
}
return out;
}
private boolean isBetweenQuote(Integer indexInQuoteList){
return indexInQuoteList % 2 != 1;
}
I'm writing a sound change program in Java which should replace patterns of tokens with a replacement string only if they match the pattern string.
The pattern string can contain literal strings and/or variables using %TokenName. Such a variable references a Token class containing a List of Strings with the possible token values. An optional anchor specifying the location of the pattern (^ and $, as in regex) can precede or follow the pattern. All whitespace is deleted while processing the replacement.
The following example should only match when a ShortVowel token occurs first, followed by a VoicelessStop, and then the string ends:
%ShortVowel %VoicelessStop $
with the following tokens:
ShortVowel: ɑ ɛ ɪ jɪ ɔ ə
VoicelessStop: k p t
I want the replacer to return an array of a ReplacerMatch class containing the List of Strings with the matched tokens per variable, and the start and end positions of the total match in the string to be processed. For every match in the string, such a class exists in the array.
This means the string dɛt should return
[
matches: [ɛ, t]
startPosition: 1
endPosition: 3
]
and the string drɛkɔp should return
[
matches: [ɔ, p]
startPosition: 4
endPosition: 6
]
since only matches at the end of the string are matched. The string dɛ should return an empty array.
The ReplacerMatch class is defined as follows:
public class ReplacerMatch
{
private List<String> matches;
private int startPosition;
private int endPosition;
[...]
}
Such a replace rule is defined in a Replacer class:
public class Replacer
{
enum Anchor
{
NONE,
STRING_START,
STRING_END;
public static Anchor fromString(String string)
{
if (string.startsWith("^"))
return STRING_START;
else if (string.endsWith("$"))
return STRING_END;
else
return NONE;
}
}
private String pattern;
private String replacement;
private List<Token> tokens;
[...]
}
The Token class contains the name of that token and a list of String with possible token values. These values can be of variable length.
public class Token
{
private final String name;
private final List<String> tokens;
[...]
}
So far I've written the code in the Replacer class to split the pattern string into a list of Tokens and extract the Anchor.
public ReplacerMatch[] matches(String string)
{
String pat = this.pattern;
// Get anchor
Anchor anchor = Anchor.fromString(pat);
if (anchor == Anchor.STRING_START)
pat = pat.substring(1);
else if (anchor == Anchor.STRING_END)
pat = pat.substring(0,pat.length() - 1);
// Parse variables
List<Token> vars = new ArrayList<>();
Pattern varPattern = Pattern.compile("%(\\w+)");
Matcher varMatcher = varPattern.matcher(pat);
while (varMatcher.find())
{
for (Token t : this.tokens)
{
if (t.getName().equals(varMatcher.group(1)))
{
vars.add(t);
pat = pat.replace(varMatcher.group(),"%");
varMatcher.reset(pat);
break;
}
}
// Error handling on non-existing token
}
return new ReplacerMatch[0];
}
Now I'm stuck on the matching of the variables, which seems to be quite hard or impossible with regex. Does anybody have an idea how to approach this problem?
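Using your Token class, you can convert its tokens field into a regex alternation wrapped in a capture group with
// Build "(a|b|c)" rather than "[a|b|c]": a character class would treat '|' as a literal
// and could not represent multi-character values such as jɪ.
StringBuilder sb = new StringBuilder("(").append(tokens.get(0));
for (int i = 1; i < tokens.size(); i++){
sb.append('|').append(tokens.get(i));
}
sb.append(')');
return sb.toString();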
If you replace in the user supplied pattern every instance of a Token name with the Token pattern like
String javaPattern = userPattern.replaceAll("\\s+", "");
for (Token t : tokens){
javaPattern = javaPattern.replaceAll('%'+t.getName(), t.toPatternString());
}
return Pattern.compile(javaPattern);
then you get a Pattern that matches as your user expects, and you just have to extract the matched parts.
Matcher matcher = pattern.matcher(userInput);
// find() rather than matches(): the pattern may cover only part of the input (e.g. its end)
if (matcher.find()){
// this gives you the limits
matcher.start();
matcher.end();
// this is the matched bit
String matchedString = matcher.group();
// Now follow your "%some %some $" pattern to separate the parts of matchedString:
// parse the pattern into parts and, for each part, find the piece of matchedString that it matches.
}
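One way to avoid re-parsing the matched string is to give each token its own capture group, so that the matcher reports the matched value per variable. The sketch below assumes exactly that; TokenPatternSketch, toGroup, findAll and the nested Match class are hypothetical names, and the IPA characters from the question are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenPatternSketch {
    // Simple holder for one match: the value captured per token plus the overall span.
    static final class Match {
        final List<String> tokens;
        final int start;
        final int end;

        Match(List<String> tokens, int start, int end) {
            this.tokens = tokens;
            this.start = start;
            this.end = end;
        }
    }

    // Build "(v1|v2|...)" with every value quoted; longer values come first so that
    // a multi-character value is tried before a shorter one sharing its prefix.
    static String toGroup(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort((a, b) -> b.length() - a.length());
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(sorted.get(i)));
        }
        return sb.append(')').toString();
    }

    // Collect every match together with the text captured by each group.
    static List<Match> findAll(Pattern pattern, String input) {
        List<Match> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            List<String> tokens = new ArrayList<>();
            for (int g = 1; g <= m.groupCount(); g++) {
                tokens.add(m.group(g));
            }
            out.add(new Match(tokens, m.start(), m.end()));
        }
        return out;
    }

    public static void main(String[] args) {
        // ShortVowel: ɑ ɛ ɪ jɪ ɔ ə   VoicelessStop: k p t
        String shortVowel = toGroup(Arrays.asList(
                "\u0251", "\u025B", "\u026A", "j\u026A", "\u0254", "\u0259"));
        String voicelessStop = toGroup(Arrays.asList("k", "p", "t"));
        Pattern p = Pattern.compile(shortVowel + voicelessStop + "$");
        for (Match m : findAll(p, "d\u025Bt")) { // the string dɛt
            System.out.println(m.tokens + " " + m.start + " " + m.end);
        }
    }
}
```

This only works as written when the user pattern itself contains no other capture groups, since group numbers are assumed to line up with the variables.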
Just for completeness, here are the finished matches() and replace() methods of the Replacer class:
public List<ReplacerMatch> matches(String string)
{
String regex = this.pattern;
List<ReplacerMatch> matches = new ArrayList<>();
for (Token t : this.tokens)
regex = regex.replaceAll('%' + t.getName(),t.toPattern());
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(string);
while (m.find())
{
string = string.substring(0,m.start()) + Util.filledString('%',m.group().length()) + string.substring(m.end());
List<String> local = new ArrayList<>();
for (int i = 0; i < m.groupCount(); i ++)
local.add(m.group(i + 1));
matches.add(new ReplacerMatch(local,m.start(),m.end()));
}
return matches;
}
public String replace(String string)
{
List<ReplacerMatch> matches = this.matches(string);
if (matches.isEmpty())
return string;
int increase = 0;
for (ReplacerMatch m : matches)
{
String replaced = this.replacement;
for (int i = 0; i < m.getMatches().size(); i ++)
{
String match = m.getMatches().get(i);
String pattern = "%" + (i + 1);
replaced = replaced.replace(pattern,match);
}
string = string.substring(0,m.getStartPosition() + increase) + replaced + string.substring(m.getEndPosition() + increase);
increase += (replaced.length() - m.getMatchesAsString().length());
}
return string;
}
For example, if I had (-> means return):
aBc123afa5 -> aBc
168dgFF9g -> 168
1GGGGG -> 1
How can I do this in Java? I assume it's something regex related but I'm not great with regex and so not too sure how to implement it (I could with some thought but I have a feeling it would be 5-10 lines long, and I think this could be done in a one-liner).
Thanks
String myString = "aBc123afa5";
String extracted = myString.replaceAll("^([A-Za-z]+|\\d+).*$", "$1");
View the regex demo and the live code demonstration!
To use Matcher.group() and reuse a Pattern for efficiency:
// Class
private static final Pattern pattern = Pattern.compile("^([A-Za-z]+|\\d+).*$");
// Your method
{
String myString = "aBc123afa5";
Matcher matcher = pattern.matcher(myString);
if(matcher.matches())
System.out.println(matcher.group(1));
}
Note: /^([A-Za-z]+|\d+).*$/ and /^([A-Za-z]+|\d+)/ both work with similar efficiency. On regex101 you can compare the matcher debug logs to confirm this.
Without using regex, you can do this:
String string = "168dgFF9g";
String chunk = "" + string.charAt(0);
boolean searchDigit = Character.isDigit(string.charAt(0));
for (int i = 1; i < string.length(); i++) {
boolean isDigit = Character.isDigit(string.charAt(i));
if (isDigit == searchDigit) {
chunk += string.charAt(i);
} else {
break;
}
}
System.out.println(chunk);
public static String prefix(String s) {
return s.replaceFirst("^(\\d+|\\pL+|).*$", "$1");
}
where
\\d = digit
\\pL = letter
postfix + = one or more
| = or
^ = begin of string
$ = end of string
$1 = first group `( ... )`
An empty alternative (last |) ensures that (...) is always matched, and always a replace happens. Otherwise the original string would be returned.
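As an alternative to replacing the whole string, the leading run can be read directly from a Matcher. This is a minimal sketch under the assumption that an input starting with neither a letter nor a digit should yield an empty string; the class name LeadingRun is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingRun {
    // A leading run of letters or a leading run of digits, anchored at the start.
    private static final Pattern LEADING = Pattern.compile("^(?:\\p{L}+|\\d+)");

    // Return the leading run, or "" when the string starts with neither a letter nor a digit.
    static String prefix(String s) {
        Matcher m = LEADING.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(prefix("aBc123afa5")); // aBc
        System.out.println(prefix("168dgFF9g"));  // 168
        System.out.println(prefix("1GGGGG"));     // 1
    }
}
```

Because nothing after the prefix has to be consumed, no trailing .*$ and no empty alternative are needed here.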
I am trying to censor specific strings, and patterns within my application but my matcher doesn't seem to be finding any results when searching for the Pattern.
public String censorString(String s) {
System.out.println("Censoring... "+ s);
if (findPatterns(s)) {
System.out.println("Found pattern");
for (String censor : foundPatterns) {
for (int i = 0; i < censor.length(); i++)
s.replace(censor.charAt(i), (char)42);
}
}
return s;
}
public boolean findPatterns(String s) {
for (String censor : censoredWords) {
Pattern p = Pattern.compile("(.*)["+censor+"](.*)");//regex
Matcher m = p.matcher(s);
while (m.find()) {
foundPatterns.add(censor);
return true;
}
}
return false;
}
At the moment I'm focusing on just the one pattern, if the censor is found in the string. I've tried many combinations and none of them seem to return "true".
"(.*)["+censor+"](.*)"
"(.*)["+censor+"]"
"["+censor+"]"
"["+censor+"]+"
Any help would be appreciated.
Usage: My censored words are "hello", "goodbye"
String s = "hello there, today is a fine day.";
System.out.println(censorString(s));
is supposed to print " ***** today is a fine day. "
Your regex is right. The problem is here:
s.replace(censor.charAt(i), (char)42);
If you expect this line to rewrite the censored parts of your string, it will not: Java strings are immutable, so replace() returns a new String and leaves s unchanged. Please check the Javadoc for String.
Below is a program that does what you intend. I removed your findPatterns method and just used replaceAll with a regex from the String API. Hope this helps.
public class Regex_SO {
private String[] censoredWords = new String[]{"hello"};
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
Regex_SO regex_SO = new Regex_SO();
regex_SO.censorString("hello there, today is a fine day. hello again");
}
public String censorString(String s) {
System.out.println("Censoring... "+ s);
for(String censoredWord : censoredWords){
String replaceStr = "";
for(int index = 0; index < censoredWord.length();index++){
replaceStr = replaceStr + "*";
}
s = s.replaceAll(censoredWord, replaceStr);
}
System.out.println("Censored String is .. " + s);
return s;
}
}
Since this seems like homework I can't give you working code, but here are a few pointers:
consider using the regex \\b(word1|word2|word3)\\b to find specific words
to create a char representing *, write '*'. Avoid (char)42, which is a magic number
to create a new string that has the same length as the old one but is filled with one specific character, you can use String newString = oldString.replaceAll(".","*")
to replace each match on the fly with a new value, you can use the appendReplacement and appendTail methods of the Matcher class. Here is how code using them should look:
StringBuffer sb = new StringBuffer();//buffer for string with replaced values
Pattern p = Pattern.compile(yourRegex);
Matcher m = p.matcher(yourText);
while (m.find()){
String match = m.group(); //this will represent current match
String newValue = ...; //here you need to decide how to replace it
m.appendReplacement(sb, newValue);
}
m.appendTail(sb);
String censoredString = sb.toString();
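Putting these pointers together, a complete version might look like the sketch below. The class name CensorSketch and the method censor are hypothetical, and whole-word matching via \b is an assumption about what the asker wants.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CensorSketch {
    // Replace every whole-word occurrence of a censored word with '*' characters of the same length.
    static String censor(String text, List<String> censoredWords) {
        // Quote each word so it is matched literally, then join with '|'.
        String alternatives = censoredWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("\\b(?:" + alternatives + ")\\b");
        Matcher m = p.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Same length as the match, but every character replaced by '*'.
            String stars = m.group().replaceAll(".", "*");
            m.appendReplacement(sb, Matcher.quoteReplacement(stars));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "goodbye");
        System.out.println(censor("hello there, today is a fine day.", words));
        // ***** there, today is a fine day.
    }
}
```

The result is returned rather than written back into the argument, because a String cannot be modified in place.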
I am trying to perform multiple string replacements using Java's Pattern and Matcher, where the regex pattern may include metacharacters (e.g. \b, (), etc.). For example, for the input string fit i am, I would like to apply the replacements:
\bi\b --> EYE
i --> I
I then followed the coding pattern from two questions (Java Replacing multiple different substring in a string at once, Replacing multiple substrings in Java when replacement text overlaps search text). In both, they create an or'ed search pattern (e.g foo|bar) and a Map of (pattern, replacement), and inside the matcher.find() loop, they look up and apply the replacement.
The problem I am having is that the matcher.group() function does not contain information on matching metacharacters, so I cannot distinguish between i and \bi\b. Please see the code below. What can I do to fix the problem?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.*;
public class ReplacementExample
{
public static void main(String argv[])
{
Map<String, String> replacements = new HashMap<String, String>();
replacements.put("\\bi\\b", "EYE");
replacements.put("i", "I");
String input = "fit i am";
String result = doit(input, replacements);
System.out.printf("%s\n", result);
}
public static String doit(String input, Map<String, String> replacements)
{
String patternString = join(replacements.keySet(), "|");
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
StringBuffer resultStringBuffer = new StringBuffer();
while (matcher.find())
{
System.out.printf("match found: %s at start: %d, end: %d\n",
matcher.group(), matcher.start(), matcher.end());
String matchedPattern = matcher.group();
String replaceWith = replacements.get(matchedPattern);
// Do the replacement here.
matcher.appendReplacement(resultStringBuffer, replaceWith);
}
matcher.appendTail(resultStringBuffer);
return resultStringBuffer.toString();
}
private static String join(Set<String> set, String delimiter)
{
StringBuilder sb = new StringBuilder();
int numElements = set.size();
int i = 0;
for (String s : set)
{
sb.append(Pattern.quote(s));
if (i++ < numElements-1) { sb.append(delimiter); }
}
return sb.toString();
}
}
This prints out:
match found: i at start: 1, end: 2
match found: i at start: 4, end: 5
fIt I am
Ideally, it should be fIt EYE am.
You mistyped one of your regexes:
replacements.put("\\bi\\", "EYE"); //Should be \\bi\\b
replacements.put("i", "I");
You may also want to make your regexes mutually exclusive. A HashMap gives no guarantee about the order of map.keySet(), so it may just be replacing i with I before checking \\bi\\b.
You could use capture groups, without straying too far from your existing design. So instead of using the matched pattern as the key, you look up based on the order within a List.
You would need to change the join method to put parentheses around each of the patterns, something like this:
private static String join(Set<String> set, String delimiter) {
StringBuilder sb = new StringBuilder();
sb.append("(");
int numElements = set.size();
int i = 0;
for (String s : set) {
sb.append(s);
if (i++ < numElements - 1) {
sb.append(")");
sb.append(delimiter);
sb.append("("); }
}
sb.append(")");
return sb.toString();
}
As a side note, the use of Pattern.quote in the original code listing would have caused the match to fail where those metacharacters were present.
Having done this, you would now need to determine which of the capture groups was responsible for the match. For simplicity I'm going to assume that none of the match patterns will themselves contain capture groups, in which case something like this would work, within the matcher while loop:
int index = -1;
for (int j=1;j<=replacements.size();j++){
if (matcher.group(j) != null) {
index = j;
break;
}
}
if (index >= 0) {
System.out.printf("Match on index %d = %s %d %d\n", index, matcher.group(index), matcher.start(index), matcher.end(index));
}
Next, we would like to use the resulting index value to index straight back into the replacements. The original code uses a HashMap, which is not suitable for this; you're going to have to refactor that to use a pair of Lists in some form, one containing the list of match patterns and the other the corresponding list of replacement strings. I won't do that here, but I hope that provides enough detail to create a working solution.
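A sketch of that refactoring, under the same assumption that no pattern contains capture groups of its own, could look like this. MultiReplace and its replaceAll method are hypothetical names; the two lists are parallel, and earlier patterns take priority because alternation tries them first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiReplace {
    // patterns.get(i) is replaced by replacements.get(i).
    // Assumption: the patterns contain no capture groups themselves.
    static String replaceAll(String input, List<String> patterns, List<String> replacements) {
        // Build "(p1)|(p2)|..." so that group i+1 corresponds to patterns.get(i).
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < patterns.size(); i++) {
            if (i > 0) {
                regex.append('|');
            }
            regex.append('(').append(patterns.get(i)).append(')');
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The first non-null group identifies which pattern produced this match.
            for (int g = 1; g <= patterns.size(); g++) {
                if (m.group(g) != null) {
                    m.appendReplacement(out, Matcher.quoteReplacement(replacements.get(g - 1)));
                    break;
                }
            }
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String result = replaceAll("fit i am",
                Arrays.asList("\\bi\\b", "i"),
                Arrays.asList("EYE", "I"));
        System.out.println(result); // fIt EYE am
    }
}
```

Listing \bi\b before i matters: with the order reversed, the plain i alternative would win at every position and EYE would never be produced.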