I want to match two strings.
For example, I have predefined words such as wheat, egg, flour, etc.
The text I get from OCR looks like wh3at, agg, f1Our, etc.
So wh3at should match wheat, f1Our should match flour, and so on.
I have worked on OCR projects where we "normalized" extracted text. You can build regular expressions that match reasonably expected/observed output.
import java.util.regex.Pattern;
public class Regex {
public static void main(String[] args) {
String[] strings = {"wh3at", "f1Our", "f10ur", "agg"};
for (String s : strings)
System.out.println(String.format("%s -> %s", s, normalizeWord(s)));
}
public static String normalizeWord(String unnormalized) {
if (Pattern.compile("(?i)wh(e|3)at").matcher(unnormalized).matches()) {
return "wheat";
} else if (Pattern.compile("(?i)f(1|L)(O|0)ur").matcher(unnormalized).matches()) {
return "flour";
} else if (Pattern.compile("(?i)(a|e)gg").matcher(unnormalized).matches()) {
return "egg";
}
return unnormalized;
}
}
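The same idea can be kept table-driven so that new words do not need another else-if branch. The following is only a sketch under assumptions: the class name OcrNormalizer and the RULES table are made up for illustration, and the patterns are precompiled once instead of on every call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class OcrNormalizer {
    // Hypothetical rule table: each precompiled pattern maps to its canonical word.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(Pattern.compile("wh[e3]at", Pattern.CASE_INSENSITIVE), "wheat");
        RULES.put(Pattern.compile("f[1l][o0]ur", Pattern.CASE_INSENSITIVE), "flour");
        RULES.put(Pattern.compile("[ae]gg", Pattern.CASE_INSENSITIVE), "egg");
    }

    // Returns the canonical word for the first rule that matches the whole input,
    // or the input unchanged when no rule applies.
    public static String normalizeWord(String raw) {
        for (Map.Entry<Pattern, String> rule : RULES.entrySet()) {
            if (rule.getKey().matcher(raw).matches()) {
                return rule.getValue();
            }
        }
        return raw;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"wh3at", "f1Our", "f10ur", "agg", "salt"}) {
            System.out.println(s + " -> " + normalizeWord(s));
        }
    }
}
```

Adding a word then means adding one line to the table, and unknown words fall through unchanged.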
Firstly, I'm aware of similar questions that have been asked such as here:
How to split a string, but also keep the delimiters?
However, I'm having an issue implementing a split of a string using Pattern.split() where the pattern is based on a list of delimiters that can sometimes appear to overlap. Here is the example:
The goal is to split a string based on a set of known codewords which are surrounded by slashes, where I need to keep both the delimiter (codeword) itself and the value after it (which may be empty string).
For this example, the codewords are:
/ABC/
/DEF/
/GHI/
Based on the thread referenced above, the pattern is built as follows using look-ahead and look-behind to tokenise the string into codewords AND values:
((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))
Working string:
"123/ABC//DEF/456/GHI/789"
Using split, this tokenises nicely to:
"123","/ABC/","/DEF/","456","/GHI/","789"
Problem string (note single slash between "ABC" and "DEF"):
"123/ABC/DEF/456/GHI/789"
Here the expectation is that "DEF/456" is the value after "/ABC/" codeword because the "DEF/" bit is not actually a codeword, but just happens to look like one!
Desired outcome is:
"123","/ABC/","DEF/456","/GHI/","789"
Actual outcome is:
"123","/ABC","/","DEF/","456","/GHI/","789"
As you can see, the slash between "ABC" and "DEF" is getting isolated as a token itself.
I've tried solutions as per the other thread using only look-ahead OR look-behind, but they all seem to suffer from the same issue. Any help appreciated!
If you are OK with find rather than split, using some non-greedy matches, try this:
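import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
public class SampleJava {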
static final String[] CODEWORDS = {
"ABC",
"DEF",
"GHI"};
static public void main(String[] args) {
String input = "/ABC/DEF/456/GHI/789";
String codewords = Arrays.stream(CODEWORDS)
.collect(Collectors.joining("|", "/(", ")/"));
// codewords = "/(ABC|DEF|GHI)/";
Pattern p = Pattern.compile(
/* codewords */ ("(DELIM)"
/* pre-delim */ + "|(.+?(?=DELIM))"
/* final bit */ + "|(.+?$)").replace("DELIM", codewords));
Matcher m = p.matcher(input);
while(m.find()) {
System.out.print(m.group(0));
if(m.group(1) != null) {
System.out.print(" ← code word");
}
System.out.println();
}
}
}
Output:
/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
Use a combination of positive and negative look arounds:
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
There's also a considerable simplification by using alternations inside single look ahead/behind.
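To try the lookaround split locally, a small harness like the one below may help. It is only a sketch: the class name LookaroundSplitDemo and the helper tokenize are made up, the split expression is copied unchanged from the line above, and I have only traced it against the two strings from the question, so other inputs may behave differently.

```java
import java.util.Arrays;

public class LookaroundSplitDemo {
    // The split expression from the answer, unchanged.
    static final String SPLIT_REGEX =
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
            + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    // Hypothetical helper: splits the input at every zero-width match.
    public static String[] tokenize(String s) {
        return s.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // Working string from the question (two consecutive codewords).
        System.out.println(Arrays.toString(tokenize("123/ABC//DEF/456/GHI/789")));
        // Problem string from the question ("DEF/" only looks like a codeword).
        System.out.println(Arrays.toString(tokenize("123/ABC/DEF/456/GHI/789")));
    }
}
```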
Following some TDD principles (Red-Green-Refactor), here is how I would implement such behaviour:
Write specs (Red)
I defined a set of unit tests that explain how I understood your "tokenization process". If any test is not correct according to what you expect, feel free to tell me and I'll edit my answer accordingly.
import static org.assertj.core.api.Assertions.assertThat;
import java.util.List;
import org.junit.Test;
public class TokenizerSpec {
Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
@Test
public void itShouldTokenizeTwoConsecutiveCodewords() {
String input = "123/ABC//DEF/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
}
@Test
public void itShouldTokenizeMisleadingCodeword() {
String input = "123/ABC/DEF/456/GHI/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
}
@Test
public void itShouldTokenizeWhenValueContainsSlash() {
String input = "1/23/ABC/456";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
}
@Test
public void itShouldTokenizeWithoutCodewords() {
String input = "123/456/789";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123/456/789");
}
@Test
public void itShouldTokenizeWhenEndingWithCodeword() {
String input = "123/ABC/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("123", "/ABC/");
}
@Test
public void itShouldTokenizeWhenStartingWithCodeword() {
String input = "/ABC/123";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "123");
}
@Test
public void itShouldTokenizeWhenOnlyCodeword() {
String input = "/ABC//DEF//GHI/";
List<String> tokens = tokenizer.splitPreservingCodewords(input);
assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
}
}
Implement according to the specs (Green)
This class makes all the tests above pass:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
public final class Tokenizer {
private final List<String> codewords;
public Tokenizer(String... codewords) {
this.codewords = Arrays.asList(codewords);
}
public List<String> splitPreservingCodewords(String input) {
List<String> tokens = new ArrayList<>();
int lastIndex = 0;
int i = 0;
while (i < input.length()) {
final int idx = i;
Optional<String> codeword = codewords.stream()
.filter(cw -> input.startsWith(cw, idx))
.findFirst();
if (codeword.isPresent()) {
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
tokens.add(codeword.get());
i += codeword.get().length();
lastIndex = i;
} else {
i++;
}
}
if (i > lastIndex) {
tokens.add(input.substring(lastIndex, i));
}
return tokens;
}
}
Improve implementation (Refactor)
Not done yet, since I can't spend more time on this answer right now. I'll happily refactor Tokenizer later if you ask me to. :-) Or you can do it yourself quite safely, since the unit tests guard against regressions.
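One possible direction for that refactor, offered only as a sketch and not as the answerer's own code: replace the stream and the substring(...).indexOf(...) check with a plain loop over startsWith(codeword, offset), which avoids creating a new string at every position. The class name TokenizerSketch and the static split method are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerSketch {
    // Splits the input into codewords and the values between them.
    public static List<String> split(String input, String... codewords) {
        List<String> tokens = new ArrayList<>();
        int valueStart = 0; // start of the value that is currently being collected
        int i = 0;
        while (i < input.length()) {
            String hit = null;
            for (String cw : codewords) {
                if (input.startsWith(cw, i)) { // no temporary substring needed
                    hit = cw;
                    break;
                }
            }
            if (hit == null) {
                i++;
                continue;
            }
            if (i > valueStart) {
                tokens.add(input.substring(valueStart, i));
            }
            tokens.add(hit);
            i += hit.length();
            valueStart = i;
        }
        if (valueStart < input.length()) {
            tokens.add(input.substring(valueStart)); // trailing value, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("123/ABC/DEF/456/GHI/789", "/ABC/", "/DEF/", "/GHI/"));
    }
}
```

Because a matched codeword is consumed in full before scanning continues, the "DEF/" that follows "/ABC/" is never tested with its leading slash and stays part of the value.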
I need help cutting a string at the point where a given substring occurs.
Example
Initial string: 123456789abcdefgh
string to substr: abcd
result : 123456789
I checked the substr method, but it only accepts index positions. Do I need to search for the occurrence of the substring first and then pass its index?
If you want to split the String after a given character (here "a"), the code would look like this.
You can change the "a" to any character within the string.
package nl.testing.startingpoint;
public class Main {
public static void main(String args[]) {
String[] part = getSplitArray("123456789abcdefgh", "a");
System.out.println(part[0]);
System.out.println(part[1]);
}
public static String[] getSplitArray(String toSplitString, String splitChar) {
return toSplitString.split("(?<=" + splitChar + ")");
}
}
Bear in mind that toSplitString.split("(?<=" + splitChar + ")"); splits after every occurrence of that character, not only the first one, and that the argument is interpreted as a regular expression.
Hope this might help:
public static void main(final String[] args)
{
searchString("123456789abcdefghabcd", "abcd");
}
public static void searchString(String inputValue, final String searchValue)
{
while (!(inputValue.indexOf(searchValue) < 0))
{
System.out.println(inputValue.substring(0, inputValue.indexOf(searchValue)));
inputValue = inputValue.substring(inputValue.indexOf(searchValue) +
searchValue.length());
}
}
Output:
123456789
efgh
Use a regular expression, like this:
static String regex = "abcd.*";
public static String remove(String string, String regex) {
// replaceAll returns the string unchanged when the regex does not match
return string.replaceAll(regex, "");
}
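For the example in the question, a plain indexOf followed by substring is probably enough. This is a minimal sketch; the class name SubstringBefore and the method before are made up for illustration, and returning the input unchanged when the marker is missing is my assumption about the desired behaviour.

```java
public class SubstringBefore {
    // Returns everything before the first occurrence of marker,
    // or the whole input when marker does not occur (assumed behaviour).
    public static String before(String input, String marker) {
        int index = input.indexOf(marker);
        return index < 0 ? input : input.substring(0, index);
    }

    public static void main(String[] args) {
        System.out.println(before("123456789abcdefgh", "abcd")); // prints 123456789
    }
}
```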
Is there a way to find the most precise regex for a string?
For example:
Let's say I have two regexes:
1) .*bourne
2) .*ne
If I try to match Melbourne against these, it matches both.
But the first regex is the more precise match. Similarly, there can be very complex regexes.
Is there a way to find the most precise match?
The most "precise" match is the one where the regex needs to process less data until it finds a match; in this case, .*bourne.
Wouldn't sorting the patterns in descending order of length solve the problem?
For example, if Java is the language being used, something like the following should be fine (sort the patterns in descending order of length, then return the first one that matches):
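import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.regex.Pattern;
public class TestPattern {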
public static void main(String args[]){
String text ="Melbourne";
System.out.println("Matching regex --> "+getMatchingRegex(text));
}
public static String getMatchingRegex(String text) {
ArrayList<String> patterns = new ArrayList<String>();
patterns.add(".*ne") ;
patterns.add(".*urne") ;
patterns.add(".*bourne") ;
patterns.add(".*rne") ;
Collections.sort(patterns, new StringComparator());
for(String pattern:patterns) {
if(Pattern.matches(pattern, text))
return pattern;
}
return "No Regex matched";
}
public static class StringComparator implements Comparator<String>
{
@Override
public int compare(String s1, String s2)
{
return s2.length()-s1.length();
}
}
}
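Pattern length is only a proxy for precision. A different heuristic, sketched here purely as an assumption and not taken from the answers above, is to put the wildcard into a capturing group and prefer the pattern whose wildcard had to absorb the fewest characters. The class name MostSpecificPattern and the convention that every pattern has its wildcard as group 1 are made up for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MostSpecificPattern {
    // Returns the matching pattern whose group 1 (the wildcard) captured
    // the fewest characters, or null when no pattern matches.
    public static String mostSpecific(String text, List<String> patterns) {
        String best = null;
        int bestWildcardLength = Integer.MAX_VALUE;
        for (String p : patterns) {
            Matcher m = Pattern.compile(p).matcher(text);
            if (m.matches() && m.group(1).length() < bestWildcardLength) {
                bestWildcardLength = m.group(1).length();
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("(.*)ne", "(.*)urne", "(.*)bourne");
        System.out.println(mostSpecific("Melbourne", patterns));
    }
}
```

This only works for patterns written in that restricted shape; for arbitrary regexes there is no general notion of "most precise" that the regex engine can report.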
I need to match if filenames have exactly 2 underscores and extension 'txt'.
For example:
asdf_assss_eee.txt -> true
asdf_assss_eee_txt -> false
asdf_assss_.txt -> false
private static final String FILENAME_PATTERN = "/^[A-Za-z0-9]+_[A-Za-z0-9]+_[A- Za-z0-9]\\.txt";
This does not work.
You just need to add + after the third character class and remove the leading forward slash:
private static final String FILENAME_PATTERN = "^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$";
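As a quick check, the corrected pattern can be run against the three file names from the question. This is a small sketch; the class name FilenameCheck and the method isValid are made up for illustration.

```java
import java.util.regex.Pattern;

public class FilenameCheck {
    // Corrected pattern from the answer, compiled once.
    private static final Pattern FILENAME_PATTERN =
            Pattern.compile("^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+\\.txt$");

    public static boolean isValid(String filename) {
        return FILENAME_PATTERN.matcher(filename).matches();
    }

    public static void main(String[] args) {
        String[] names = {"asdf_assss_eee.txt", "asdf_assss_eee_txt", "asdf_assss_.txt"};
        for (String name : names) {
            System.out.println(name + " -> " + isValid(name));
        }
    }
}
```

Since matches() already requires the whole input to match, the ^ and $ anchors are harmless but not strictly needed here.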
You can use a regex like this with the case-insensitive flag:
[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
Or with the inline case-insensitive flag:
(?i)[a-z\d]+_[a-z\d]+_[a-z\d]+\.txt
In case you want to shorten it a little, you could do:
([a-z\d]+_){2}[a-z\d]+\.txt
Update
So let's assume you want at least one character after the second underscore, before the file extension.
Regex is still not "needed" for this. You could split the String by the underscore and you should have 3 elements from the split. If the 3rd element is just ".txt" then it's not valid.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
String[] pieces = str.split("_");
return pieces.length == 3 && !pieces[2].equalsIgnoreCase(".txt");
}
Results:
true
false
false
Old Answer
Not sure I understand why your third example is false, but this is something that can easily be done without regex.
Start with checking to see if the String ends with ".txt", then check if it contains only two underscores.
Example:
public static void main(String[] args) throws Exception {
String[] data = new String[] {
"asdf_assss_eee.txt",
"asdf_assss_eee_txt",
"asdf_assss_.txt"
};
for (String d : data) {
System.out.println(validate(d));
}
}
public static boolean validate(String str) {
if (!str.endsWith(".txt")) {
return false;
}
return str.chars().filter(c -> c == '_').count() == 2;
}
Results:
true
false
true
Use this Pattern:
Pattern p = Pattern.compile("_[^_]+_[^_]+\\.txt");
and use .find() instead of .matches() on the Matcher:
Matcher m = p.matcher(filename);
if (m.find()) {
// found
}
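One caveat with the unanchored find(): it also accepts names that contain more than two underscores, because the match may start at a later underscore. If exactly two are required, a whole-string match is one option. The sketch below compares both; the class and method names are made up, and the stricter pattern is my own variant, not part of the answer above.

```java
import java.util.regex.Pattern;

public class UnderscoreCheck {
    // Pattern from the answer, used with find().
    private static final Pattern UNANCHORED = Pattern.compile("_[^_]+_[^_]+\\.txt");
    // Assumed stricter variant, used with matches() on the whole name.
    private static final Pattern WHOLE_NAME = Pattern.compile("[^_]+_[^_]+_[^_]+\\.txt");

    public static boolean looseCheck(String name) {
        return UNANCHORED.matcher(name).find();
    }

    public static boolean strictCheck(String name) {
        return WHOLE_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        String[] names = {"asdf_assss_eee.txt", "asdf_assss_.txt", "a_b_c_d.txt"};
        for (String name : names) {
            System.out.println(name + " loose=" + looseCheck(name)
                    + " strict=" + strictCheck(name));
        }
    }
}
```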
I have a line from which multiple keywords are to be matched. The whole keywords should be matched.
Example,
String str = "This is an example text for matching countries like Australia India England";
if(str.contains("Australia") ||
str.contains("India") ||
str.contains("England")){
System.out.println("Matches");
}else{
System.out.println("Does not match");
}
This code works fine. But if there are too many keywords to be matched, the line grows. Is there any elegant way of writing the same code?
Thanks
You can write a regular expression like this:
Country0|Country1|Country2
Use it like this:
String str = "This is an example text like Australia India England";
if (Pattern.compile("Australia|India|England").matcher(str).find())
System.out.println("Matches");
If you would like to know which countries have matched:
public static void main(String[] args) {
String str = "This is an example text like Australia India England";
Matcher m = Pattern.compile("Australia|India|England").matcher(str);
while (m.find())
System.out.println("Matches: " + m.group());
}
Outputs:
Matches: Australia
Matches: India
Matches: England
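When the keyword list is long or comes from data, the alternation can be built programmatically. The sketch below is one way to do that under assumptions: the class name KeywordPattern and its methods are made up, Pattern.quote guards against regex metacharacters in keywords, and the \b word boundaries are my addition to cover the "whole keywords" requirement from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordPattern {
    // Builds \b(?:kw1|kw2|...)\b with every keyword quoted literally.
    public static Pattern build(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    public static boolean containsKeyword(String text, List<String> keywords) {
        return build(keywords).matcher(text).find();
    }

    public static void main(String[] args) {
        List<String> countries = Arrays.asList("Australia", "India", "England");
        System.out.println(containsKeyword("text like Australia India England", countries));
        System.out.println(containsKeyword("NAustraliaA", countries));
    }
}
```

In real code the compiled Pattern would normally be built once and reused rather than rebuilt on every call.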
Put the countries in an array and use a small helper method. Using a Set makes it even nicer, but building the set of countries is a bit more tedious. Something like the following, but with better naming and null handling if desired:
String[] countries = {"Australia", "India", "England"};
String str = "NAustraliaA";
if (containsAny(str, countries)) {
System.out.println("Matches");
}
else {
System.out.println("Does not match");
}
public static boolean containsAny(String toCheck, String[] values) {
for (String s: values) {
if (toCheck.contains(s)) {
return true;
}
}
return false;
}
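The Set variant mentioned above could look roughly like this. It is a sketch under assumptions: the class name WholeWordLookup is made up, and the text is split on non-word characters so that only whole words are looked up, which means multi-word keywords such as "New Zealand" would not be found this way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WholeWordLookup {
    static final Set<String> COUNTRIES =
            new HashSet<>(Arrays.asList("Australia", "India", "England"));

    // True when at least one whole word of the text is in the set.
    public static boolean containsAnyWord(String text, Set<String> words) {
        for (String token : text.split("\\W+")) {
            if (words.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsAnyWord("countries like Australia India England", COUNTRIES));
        System.out.println(containsAnyWord("NAustraliaA", COUNTRIES));
    }
}
```

Unlike the contains-based helper, this one reports no match for "NAustraliaA", because the keyword is only part of a longer word there.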
From a readability point of view, an ArrayList of the strings to be matched is elegant. A loop can then check whether each word is present and set a flag accordingly.
Something like this, in case all are to be matched (flag starts out as true):
for (String checkStr : myList) {
if(!str.contains(checkStr)) {
flag=false;
break;
}
}
Or, in case any should match (flag starts out as false):
for (String checkStr : myList) {
if(str.contains(checkStr)) {
flag=true;
break;
}
}
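With Java 8 streams, both loops can be written without the flag variable. A minimal sketch, assuming the keywords are held in a List; the class name KeywordChecks and the method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class KeywordChecks {
    // True when every keyword occurs somewhere in the text.
    public static boolean containsAll(String text, List<String> keywords) {
        return keywords.stream().allMatch(text::contains);
    }

    // True when at least one keyword occurs somewhere in the text.
    public static boolean containsAny(String text, List<String> keywords) {
        return keywords.stream().anyMatch(text::contains);
    }

    public static void main(String[] args) {
        String str = "This is an example text for matching countries like Australia India England";
        List<String> keywords = Arrays.asList("Australia", "India", "UK");
        System.out.println("all: " + containsAll(str, keywords));
        System.out.println("any: " + containsAny(str, keywords));
    }
}
```

Both methods short-circuit in the same way the break statements do in the loops.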
package com.test;
public class Program {
private String str;
public Program() {
str = "This is an example text for matching countries like Australia India England";
}
public static void main(String[] args) {
Program program = new Program();
program.doWork();
}
private void doWork() {
String[] tomatch = { "Australia", "India" ,"UK"};
for(int i=0;i<tomatch.length;i++){
if (match(tomatch[i])) {
System.out.println(tomatch[i]+" Matches");
} else {
System.out.println(tomatch[i]+" Does not match");
}
}
}
private boolean match(String string) {
if (str.contains(string)) {
return true;
}
return false;
}
}
Output:
Australia Matches
India Matches
UK Does not match