Java Regex complex ID expression filtering

I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words longer than 4 characters and remove them. IDs must consist of both letters and numbers at the same time, in any order. They can have optional special symbols like :.- and are generally all uppercase, except for several cases where there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.

My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);
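
For reuse, the regex above can be wrapped in a small helper with a precompiled Pattern. The following is only a sketch under the ID rules stated in the question; the class name IdFilter and the method removeIds are made up for illustration. The comments spell out what each lookahead contributes.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original post.
final class IdFilter {
    // Characters an ID may contain: digits, letters, ( ) . : -
    private static final String ID_CHARS = "[\\dA-Z().:-]";

    private static final Pattern ID = Pattern.compile(
            "(?i)"                          // case-insensitive, so lowercase letters are accepted
            + "(?=" + ID_CHARS + "*\\d)"    // lookahead: a digit occurs within the run of ID characters
            + "(?=" + ID_CHARS + "*[A-Z])"  // lookahead: a letter occurs within the run of ID characters
            + ID_CHARS + "{5,}");           // the token itself: at least 5 allowed characters

    static String removeIds(String text) {
        String stripped = ID.matcher(text).replaceAll("");
        // Collapse the double spaces left behind by removed IDs
        return stripped.replaceAll("[ ]{2,}", " ");
    }

    public static void main(String[] args) {
        System.out.println(removeIds("Keep K-12 and (DLCS) but drop ACTDIK002 here"));
    }
}
```

Compiling the Pattern once and reusing it avoids recompiling the expression for every document that is converted.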

Related

How to remove the # in a string using Pattern in java

I need to remove a part of the string which starts with #.
My sample code works for one string and fails for another.
Failed one: Not able to remove #news4buffalo:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Success one:
String regex = "\\#\\w+ || #\\w*";
String rawContent = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(rawContent);
if (matcher.find()) {
rawContent = rawContent.replaceAll(regex, "");
}
Output:
couldn't agree more. Good crowd last night. #LetsGoFish

From your question it looks like this regex can work for you:
rawContent = rawContent.replaceAll("#\\S*", "");

You can try in this way as well.
String s = "#ZaslowShow couldn't agree more. Good crowd last night. #LetsGoFish";
System.out.println(s.replaceAll("#[^\\s]*\\s+", ""));
// Look till space is not found----^^^^ ^^^^---------remove extra spaces as well

The regex is only considering word characters whereas your input String contains a colon :. You can solve this by replacing \\w with \\S (any non-whitespace character) in your regex. Also there is no need for two patterns.
String regex = "#\\S*";

You don't need to escape # so don't add \ before it like "\\#" (it confuses people).
Don't use a Matcher to check whether the string contains something to replace and then call replaceAll, because that scans the string twice. Just call replaceAll directly; if there is nothing to replace, it leaves the string unchanged. Also, use replaceAll on the Matcher instance to avoid recompiling the Pattern.
A regex of the form foo||bar doesn't look right. Regex uses a single pipe | to represent OR, so this regex means foo OR empty string OR bar. The empty string is a special case (every string contains it at the start, at the end, and between any two characters), so it can cause surprises: "foo".replaceAll("|foo", "x") returns xfxoxox instead of, for instance, "xxx", because the empty string matched before f is consumed first, which prevents f from being used as the first character of foo.
Anyway, it seems that you would like to match any #xxxx word, so consider something like "#\\w+" if you want to make sure that there is at least one character after #.
You can also add the condition that # must be the first character of a word (in case you don't want to remove the part after # from e-mail addresses). To do this, use a look-behind like (?<=\\s|^)# which checks that # is preceded by whitespace or sits at the start of the string.
You can also remove the space after the word you wanted to remove (if there is any).
So you can try with
String regex = "(?<=\\s|^)#\\w*\\s?";
which for data like
RT #news4buffalo: Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
will return
RT : Police say a shooter fired into a crowd yesterday on the Oakmont overpass, striking and killing a 14-year-old. More: http…
But if you would also like to remove characters that \\w does not cover (it only matches letters, digits and the underscore), such as :, you can simply use \\S, which matches any non-whitespace character, so your regex can look like
String regex = "(?<=\\s|^)#\\S*\\s?";
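
Putting the points above together, a minimal sketch might look like this. The class name HashtagStripper and the method strip are hypothetical; the regex is the last one shown above, and the second print statement reproduces the empty-alternative behaviour of "|foo".

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class HashtagStripper {
    // '#' at the start of the string or right after whitespace, then the rest of
    // the token up to the next whitespace, plus one optional trailing whitespace.
    private static final Pattern HASHTAG = Pattern.compile("(?<=\\s|^)#\\S*\\s?");

    static String strip(String text) {
        // No find() check needed: replaceAll returns the text unchanged when nothing matches
        return HASHTAG.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("RT #news4buffalo: Police say a shooter fired into a crowd"));
        // The empty alternative in "|foo" matches at every position before "foo" is tried
        System.out.println("foo".replaceAll("|foo", "x"));
    }
}
```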

Regex for splitting a german address into its parts

Good evening,
I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library to do this? It should split addresses like the following:
Name der Straße 25a 88489 Teststadt
to
Name der Straße|25a|88489|Teststadt
or
Teststr. 3 88489 Beispielort (Großer Kreis)
to
Teststr.|3|88489|Beispielort (Großer Kreis)
It would be perfect if the system / regex would still work if parts like the zip code or the city are missing.
Is there any regex or library out there with which I could achieve this?
EDIT: Rule for german addresses:
Street: Characters, numbers and spaces
House no: Number and any characters (or space) until a series of numbers (zip) (at least in these examples)
Zip: 5 digits
Place or City: The rest maybe also with spaces, commas or braces

I came across a similar problem and tweaked the solutions provided here a little bit and came to this solution which also works but (imo) is a little bit simpler to understand and to extend:
/^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i
Here are some example matches.
It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.
([a-zäöüß\s\d,.-]+?) # Street name (lazy)
([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)? # Street number (optional)
After that, there has to be the zip code, which is the only part that is absolutely necessary because it's the only constant part. Everything after the zipcode is considered as the city name.
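
Since the question asks for Java and the expression above is written with /.../i delimiters, here is a rough Java port. It is a sketch only: the class name GermanAddressSplitter, the method split and the trimming of each group are my own additions, and the inline flags (?iu) stand in for the trailing i flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative, not from the original answer.
final class GermanAddressSplitter {
    // Same expression as above. The umlauts and the sharp s are written as regex-level
    // Unicode escapes so the source file stays ASCII-only.
    private static final Pattern ADDRESS = Pattern.compile(
            "(?iu)^([a-z\\u00e4\\u00f6\\u00fc\\u00df\\s\\d.,-]+?)"   // street name (lazy)
            + "\\s*([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?"    // street number (optional)
            + "\\s*(\\d{5})"                                         // zip code (required)
            + "\\s*(.+)?$");                                         // city (optional)

    // Returns {street, number, zip, city}; missing optional parts become "".
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address.trim());
        if (!m.find()) {
            return null; // the pattern did not match at all
        }
        String[] parts = new String[4];
        for (int i = 0; i < 4; i++) {
            String group = m.group(i + 1);
            // The number group can pick up a trailing space, hence the trim()
            parts[i] = (group == null) ? "" : group.trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("Teststr. 3 88489 Beispielort")));
    }
}
```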

I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; the first example I’ve found is “Weil der Stadt”). Then the five-digit number before that must be the zip code.
The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range.
Anything before that is the street name.
Anyway, here we go:
^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$
This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.
Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.
As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Unfortunately, Java 7 has no multi-line or verbatim string literals, so even with the x flag (Pattern.COMMENTS) the expression has to be pieced together from escaped, concatenated strings. This s*cks because it makes complex regular expressions much harder to read.
Still, here’s a somewhat more legible regular expression:
^
(?<street>(?:\p{L}|\ |\d|\.|-)+?)\
(?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\
(?<zip>\d{5})\
(?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$
In Java 7, the closest we can achieve is this (untested; may contain typos):
String pattern =
"^" +
"(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
"(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
"(?<zip>\\d{5}) " +
"(?<city>(?:\\p{L}| |-)+)" +
"(?: *\\((?<suffix>[^\\)]+)\\))?" +
"$";

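For what it's worth, java.util.regex does offer a comments mode (Pattern.COMMENTS, or the inline flag (?x)) next to named groups, so the legible layout can be carried into Java source at the cost of the usual string escaping. A minimal sketch, assuming the same address rules as in this answer; the class name AddressPatternDemo and the sample address are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern itself is the one from the answer above.
final class AddressPatternDemo {
    // In COMMENTS mode unescaped whitespace is ignored and '#' starts a comment,
    // so every literal space is written as an escaped space (backslash + space).
    static final Pattern ADDRESS = Pattern.compile(
            "^\n"
            + "(?<street>(?:\\p{L}|\\ |\\d|\\.|-)+?)\\   # street name (lazy)\n"
            + "(?<number>\\d+(?:\\ ?-\\ ?\\d+)?\\ *[a-zA-Z]?)\\   # house number or range\n"
            + "(?<zip>\\d{5})\\   # five-digit zip code\n"
            + "(?<city>(?:\\p{L}|\\ |-)+)   # city name\n"
            + "(?:\\ *\\((?<suffix>[^\\)]+)\\))?   # optional suffix\n"
            + "$",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = ADDRESS.matcher("Hauptstr. 23-25 a 12345 Berlin-Mitte");
        if (m.matches()) {
            System.out.println(m.group("street") + "|" + m.group("number")
                    + "|" + m.group("zip") + "|" + m.group("city"));
        }
    }
}
```

The groups are then read by name (m.group("zip")) instead of by position, which keeps the calling code stable if the expression gains or loses a group later.
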
Here is my suggestion which could be fine-tuned further e.g. to allow missing parts.
Regex Pattern:
^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
Group 1: Street
Group 2: House no.
Group 3: ZIP
Group 4: City

// Requires java.util.regex.Pattern and java.util.regex.Matcher
public static void main(String[] args) {
    String data = "Name der Strase 25a 88489 Teststadt";
    // Groups: street, house number, zip, city (ASCII letters only)
    String regexp = "([ a-zA-Z]+) ([\\w]+) (\\d+) ([a-zA-Z]+)";
    Pattern pattern = Pattern.compile(regexp);
    Matcher matcher = pattern.matcher(data);
    if (matcher.find()) {
        // Print group 0 (the whole match) followed by each capture group
        for (int i = 0; i <= matcher.groupCount(); i++) {
            System.out.println(matcher.group(i));
        }
    } else {
        System.out.println("nothing found");
    }
}
I guess it doesn't work with German umlauts, but you can fix that on your own. Anyway, it's a good start.
I recommend visiting this site; it's a great resource on regular expressions. Good luck!

At first glance it looks like a simple split on whitespace would do it; however, looking closer, I notice the address always has 4 parts, and the first part can contain whitespace.
What I would do is something like this (pseudocode):
address[4] = empty
split[?] = address_string.split(" ")
address[3] = split[last]
address[2] = split[last - 1]
address[1] = split[last - 2]
address[0] = join split[first] through split[last - 3] with whitespace, trim trailing whitespace with trim()
However, this will only handle one form of address. If addresses are written multiple ways it could be much more tricky.
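
In Java, the pseudocode above could be sketched roughly as follows. The class name AddressTokens and the method split are hypothetical, and the sketch assumes that the house number, the zip code and the city are each exactly one whitespace-free token.

```java
import java.util.Arrays;

// Hypothetical helper; assumes one token each for house number, zip and city.
final class AddressTokens {
    // Returns {street, house number, zip, city}
    static String[] split(String address) {
        String[] tokens = address.trim().split("\\s+");
        if (tokens.length < 4) {
            throw new IllegalArgumentException("expected at least 4 tokens: " + address);
        }
        int n = tokens.length;
        // Everything before the last three tokens belongs to the street name
        String street = String.join(" ", Arrays.copyOfRange(tokens, 0, n - 3));
        return new String[] { street, tokens[n - 3], tokens[n - 2], tokens[n - 1] };
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("Name der Strasse 25a 88489 Teststadt")));
    }
}
```

A city with spaces, such as "Beispielort (Großer Kreis)" from the question, breaks the one-token assumption, which is the limitation mentioned above.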

try this:
^[^\d]+[\d\w]+(\s)\d+(\s).*$
It captures groups for each of the spaces that delimits 1 of the 4 sections of the address
OR
this one gives you groups for each of the address parts:
^([^\d]+)([\d\w]+)\s(\d+)\s(.*)$
I don't know java, so not sure the exact code to use for replacing captured groups.

Java replaceAll regex With Similar Result

Alright folks, my brain is fried. I'm trying to fix up some EMLs with bad boundaries by replacing the incorrect
--Boundary_([ArbitraryName])
lines with more proper
--Boundary_([ArbitraryName])--
lines, while leaving already correct
--Boundary_([ThisOneWasFine])--
lines alone. I've got the whole message in-memory as a String (yes, it's ugly, but JavaMail dies if it tries to parse these), and I'm trying to do a replaceAll on it. Here's the closest I can get.
//Identify boundary lines that do not end in --
String regex = "^--Boundary_\\([^\\)]*\\)$";
Pattern pattern = Pattern.compile(regex,
Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher matcher = pattern.matcher(targetString);
//Store all of our unique results.
HashSet<String> boundaries = new HashSet<String>();
while (matcher.find())
boundaries.add(matcher.group());
//Add "--" at the end of the Strings we found.
for (String boundary : boundaries)
targetString = targetString.replaceAll(Pattern.quote(boundary),
boundary + "--");
This has the obvious problem of replacing all of the valid
--Boundary_([WasValid])--
lines with
--Boundary_([WasValid])----
However, this is the only setup I've gotten to even perform the replacement. If I try changing Pattern.quote(boundary) to Pattern.quote(boundary) + "$", nothing is replaced. If I try just using matcher.replaceAll("$0--") instead of the two loops, nothing is replaced. What's an elegant way to achieve my aim and why does it work?

There's no need to iterate through the matches with find(); that's part of what replaceAll() does.
s = s.replaceAll("(?im)^--Boundary_\\([^\\)]*\\)$", "$0--");
The $0 in the replacement string is a placeholder for whatever the regex matched in this iteration.
The (?im) at the beginning of the regex turns on CASE_INSENSITIVE and MULTILINE modes.
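
As a quick check of that one-liner, here is a small self-contained sketch. The class name BoundaryFixer, the method closeBoundaries and the sample message are made up for illustration.

```java
// Hypothetical helper; the regex and replacement are the ones from the answer above.
final class BoundaryFixer {
    static String closeBoundaries(String message) {
        // (?im): case-insensitive, and ^ / $ match at the start / end of every line.
        // A line that already ends in "--" cannot match, because $ must follow the ")".
        return message.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--");
    }

    public static void main(String[] args) {
        String message = "--Boundary_(A)\n--Boundary_(B)--\nbody text\n";
        System.out.print(closeBoundaries(message));
    }
}
```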

You can try something like this:
String regex = "^--Boundary_\\([^\\)]*\\)(--)?$";
then see if the string ends with -- and replace only ones that don't.

Assuming all the strings are on their own line, this works:
"(?im)^--Boundary_\\([^)]*\\)$"
Example script:
String str = "--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n--Boundary_([ArbitraryName])\n--Boundary_([ArbitraryName])--\n";
System.out.println(str.replaceAll("(?im)^--Boundary_\\([^)]*\\)$", "$0--"));
Edit: changed from JavaScript to Java; I must have read too fast. (Thanks for pointing it out.)

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the amount of key/values, but I'm keeping it short for brevity. I have them in parentheses so I can call them in groups. The keys I have stored as Constants, and they will always be the same. The problem is, it never finds a match which doesn't make sense (unless the Regex is wrong)

Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
String value1 = m.group(1);
String value2 = m.group(2);
// etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.
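
A compact, self-contained version of that sequence, as a sketch only: the class name KeyValueMatch and the method values are hypothetical, and the pattern is shortened to two keys.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; shortened to two keys for illustration.
final class KeyValueMatch {
    private static final Pattern PAIRS = Pattern.compile("Key1=(.*),Key2=(.*)");

    // Returns the two values, or null when the input does not match.
    static String[] values(String input) {
        Matcher m = PAIRS.matcher(input);
        if (!m.find()) {
            // Calling m.group(1) at this point would throw IllegalStateException
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] found = values("Key1=a,Key2=b");
        System.out.println(found[0] + " and " + found[1]);
    }
}
```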

Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time

Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
matcher.find();
System.out.println(matcher.group(1));

It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy (reluctant) quantifier changes things so that it starts off trying to match nothing and then builds from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, and so on. It can do its job in one forward pass over the input.
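
The two strategies can also produce different captures, not just different amounts of work. The sketch below uses a made-up input with a repeated key to show this; the class name GreedyVsReluctant and the method capture are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; the input with a repeated key is invented for illustration.
final class GreedyVsReluctant {
    // Returns the two captured values, or null if the regex does not match the whole input.
    static String[] capture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String input = "Key1=a,Key2=b,Key2=c";
        // Greedy: the first group takes everything, then gives back only as much as needed
        System.out.println(String.join(" | ", capture("Key1=(.*),Key2=(.*)", input)));
        // Reluctant: the first group stops at the first point where the rest can match
        System.out.println(String.join(" | ", capture("Key1=(.*?),Key2=(.*?)", input)));
    }
}
```

One caveat: with find() instead of matches(), a reluctant group at the very end of the expression is satisfied by the empty string, so the last value comes back empty unless the expression is anchored with $.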

I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict in behavior between greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas, up to the last comma before a non-capturing group made up of word characters followed by the equal sign (the next key). This last non-capturing group would have to be optional in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equal sign character.
Now you'll have a bunch of substrings, each of which consists of the characters of a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
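
The steps above can be sketched as follows. This is one possible reading of the approach, not a definitive implementation: the class name EqualsSplitParser and the method parse are hypothetical, and the sketch assumes that neither keys nor values contain '=' and that keys contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes no '=' inside keys or values and no ',' inside keys.
final class EqualsSplitParser {
    static Map<String, String> parse(String data) {
        String[] parts = data.split("=");
        Map<String, String> result = new LinkedHashMap<>();
        String key = parts[0]; // the first piece is only a key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                result.put(key, part); // the last piece is only a value
            } else {
                // Middle pieces look like "value,NextKey"; the last comma separates them
                int comma = part.lastIndexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("missing comma in: " + part);
                }
                result.put(key, part.substring(0, comma));
                key = part.substring(comma + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=value1,Key2=va,lue2,Key3=v3"));
    }
}
```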

If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(), which I would recommend because it lets you move down the string one key=value pair at a time. It also brings you into the whole "greedy" evaluation discussion.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
for (final String pair: pairs)
{
final String[] keyValue = pair.split("=");
final String key = keyValue[0];
final String value = keyValue[1];
}
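
One thing to watch with the loop above: pair.split("=") drops a trailing empty string, so a pair like Key2= yields a one-element array and keyValue[1] throws. A slightly more defensive sketch, assuming values never contain commas; the class name PairParser and the method parse are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes values never contain commas.
final class PairParser {
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String pair : data.split(",")) {
            // Limit 2: split at the first '=' only and keep an empty value as ""
            String[] keyValue = pair.split("=", 2);
            if (keyValue.length == 2) {
                result.put(keyValue[0], keyValue[1]);
            }
            // Pieces without any '=' are skipped
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Key1=value,Key2=,Key3=a=b"));
    }
}
```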

Replace a word that is not on a string

I'm trying to replace a word in a file whenever it appears except when it is contained in a string:
So I should replace this in
The test in this line consists in ...
But should not match in :
The test "in this line" consist in ...
This is what I'm trying:
line.replaceAll( "\\s+this\\s+", " that ")
But it fails with this scenario so I tried using:
line.replaceAll( "[^\"]\\s+this\\s+", " that ")
But that doesn't work either.
Any help would be appreciated

This seems to work (in so far as I understand your requirements from the examples provided):
(?!.*\s+this\s+.*\")\s+this\s+
http://rubular.com/r/jZvR4XEbRf
You may need to adjust the escaping for java.
This is a bit better actually:
(?!\".*\s+this\s+)(?!\s+this\s+.*\")\s+this\s+

The only reliable way to do this is to search for EITHER a complete, quoted sequence OR the search term. You do this with one regex, and after each match you determine which one you matched. If it's the search term, you replace it; otherwise you leave it alone.
That means you can't use replaceAll(). Instead you have to use the appendReplacement() and appendTail() methods like replaceAll() itself does. Here's an example:
String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
System.out.println(s);
Pattern p = Pattern.compile("\"[^\"]*\"|(\\bexample\\b)");
Matcher m = p.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find())
{
if (m.start(1) != -1)
{
m.appendReplacement(sb, "REPLACE");
}
}
m.appendTail(sb);
System.out.println(sb.toString());
output:
Replace this example. Don't replace "this example." Replace this example.
Replace this REPLACE. Don't replace "this example." Replace this REPLACE.
I'm assuming every quotation mark is significant and they can't be escaped--in other words, that you're working with prose, not source code. Escaped quotes can be dealt with, but it greatly complicates the regex.
If you really must use replaceAll(), there is a trick where you use a lookahead to assert that the match is followed by an even number of quotes. But it's really ugly, and for large texts you might find it prohibitively expensive, performance-wise.
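
For completeness, the lookahead trick might look like the sketch below. It relies on the same assumption as above (every double quote is significant and quotes are never escaped); the class name QuoteAwareReplace and the method replaceOutsideQuotes are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes balanced, unescaped double quotes.
final class QuoteAwareReplace {
    // Succeeds only if the rest of the text contains an even number of double
    // quotes, which means the current position is not inside a quoted section.
    private static final String OUTSIDE_QUOTES = "(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String replaceOutsideQuotes(String text, String word, String replacement) {
        String regex = "\\b" + Pattern.quote(word) + "\\b" + OUTSIDE_QUOTES;
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String s = "Replace this example. Don't replace \"this example.\" Replace this example.";
        System.out.println(replaceOutsideQuotes(s, "example", "REPLACE"));
    }
}
```

Each candidate match rescans the remainder of the text, which is where the performance cost on large inputs comes from.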
