Regex for splitting a german address into its parts - java

Good evening,
I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library that does this? It should split like the following:
Name der Straße 25a 88489 Teststadt
to
Name der Straße|25a|88489|Teststadt
or
Teststr. 3 88489 Beispielort (Großer Kreis)
to
Teststr.|3|88489|Beispielort (Großer Kreis)
It would be perfect if the system / regex still worked when parts like the zip code or the city are missing.
Is there any regex or library out there with which I could achieve this?
EDIT: Rule for german addresses:
Street: Characters, numbers and spaces
House no: Number and any characters (or space) until a series of numbers (zip) (at least in these examples)
Zip: 5 digits
Place or City: The rest maybe also with spaces, commas or braces

I came across a similar problem, tweaked the solutions provided here a little, and arrived at this solution, which also works but (imo) is a little simpler to understand and to extend:
/^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i
Here are some example matches.
It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.
([a-zäöüß\s\d,.-]+?) # Street name (lazy)
([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)? # Street number (optional)
After that, there has to be the zip code, which is the only part that is absolutely necessary because it's the only constant part. Everything after the zipcode is considered as the city name.
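A minimal Java sketch of how this pattern might be used. The class and method names are made up for illustration, the umlauts are written as Unicode escapes so the source stays ASCII, and CASE_INSENSITIVE | UNICODE_CASE is assumed to be the closest equivalent of the /i flag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GermanAddressSplitter {
    // Same pattern as above; \u00e4 \u00f6 \u00fc \u00df are the umlauts and sharp s.
    private static final Pattern ADDRESS = Pattern.compile(
        "^([a-z\u00e4\u00f6\u00fc\u00df\\s\\d.,-]+?)\\s*"
        + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?\\s*"
        + "(\\d{5})\\s*(.+)?$",
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Optional groups may be null; captured groups may carry stray whitespace.
    private static String trimOrNull(String s) {
        return s == null ? null : s.trim();
    }

    /** Returns {street, number, zip, city}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            trimOrNull(m.group(1)), trimOrNull(m.group(2)),
            trimOrNull(m.group(3)), trimOrNull(m.group(4))
        };
    }

    public static void main(String[] args) {
        String[] parts = split("Name der Stra\u00dfe 25a 88489 Teststadt");
        System.out.println(String.join("|", parts));
    }
}
```

The groups are trimmed because the number group can pick up a trailing space when the house number has no letter suffix.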

I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; first example I’ve found: “Weil der Stadt”). Then the five-digit number before that must be the zip code.
The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range.
Anything before that is the street name.
Anyway, here we go:
^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$
This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.
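A quick way to check that claim in Java, as a sketch only: the class and method names are hypothetical, and the city group is trimmed because it can swallow the space in front of a parenthesized suffix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressRegexCheck {
    // The expression from above, with backslashes doubled for a Java string.
    private static final Pattern ADDRESS = Pattern.compile(
        "^((?:\\p{L}| |\\d|\\.|-)+?) "
        + "(\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) "
        + "(\\d{5}) "
        + "((?:\\p{L}| |-)+)"
        + "(?: *\\(([^\\)]+)\\))?$");

    /** Returns "street|number|zip|city", or null if the address does not match. */
    static String split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3) + "|" + m.group(4).trim();
    }

    public static void main(String[] args) {
        // \u00df is the sharp s, written as an escape to keep the source ASCII.
        System.out.println(split("Stra\u00dfe des 17. Juni 23-25 a 12345 Berlin-Mitte"));
    }
}
```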
Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.
As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Java offers that flag as Pattern.COMMENTS (or an embedded (?x)), but Java 7 has no multi-line or raw string literals, so the pattern still has to be assembled from concatenated, double-escaped strings. That makes complex regular expressions painful to maintain.
Still, here’s a somewhat more legible regular expression:
^
(?<street>(?:\p{L}|\ |\d|\.|-)+?)\
(?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\
(?<zip>\d{5})\
(?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$
In Java 7, the closest we can achieve is this (untested; may contain typos):
String pattern =
"^" +
"(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
"(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
"(?<zip>\\d{5}) " +
"(?<city>(?:\\p{L}| |-)+)" +
"(?: *\\((?<suffix>[^\\)]+)\\))?" +
"$";

Here is my suggestion which could be fine-tuned further e.g. to allow missing parts.
Regex Pattern:
^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
Group 1: Street
Group 2: House no.
Group 3: ZIP
Group 4: City
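A short sketch of this pattern in Java (class and method names are invented for the example). Note that, because group 1 excludes digits, a street name that itself contains digits would be split in the wrong place:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleAddressPattern {
    private static final Pattern ADDRESS =
        Pattern.compile("^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$");

    /** Returns "street|houseNo|zip|city", or null when the input does not match. */
    static String split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3) + "|" + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(split("Teststr. 3 88489 Beispielort"));
    }
}
```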

import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class AddressDemo {
    public static void main(String[] args) {
        String data = "Name der Strase 25a 88489 Teststadt";
        // [a-zA-Z], not [a-zA-z]: the latter also matches [ \ ] ^ _ and `
        String regexp = "([ a-zA-Z]+) ([\\w]+) (\\d+) ([a-zA-Z]+)";
        Pattern pattern = Pattern.compile(regexp);
        Matcher matcher = pattern.matcher(data);
        boolean matchFound = matcher.find();
        if (matchFound) {
            // Group 0 is the whole match; groups 1..4 are the address parts
            for (int i = 0; i <= matcher.groupCount(); i++) {
                String groupStr = matcher.group(i);
                System.out.println(groupStr);
            }
        } else {
            System.out.println("nothing found");
        }
    }
}
I guess it doesn't work with German umlauts, but you can fix this on your own. Anyway, it's a good starting point.
I recommend visiting this site; it's a great resource on regular expressions. Good luck!

At first glance it looks like a simple split on whitespace would do it; however, looking closer, I notice the address always has 4 parts, and the first part can contain whitespace.
What I would do is something like this (pseudocode):
address[4] = empty
split[?] = address_string.split(" ")
address[3] = split[last]
address[2] = split[last - 1]
address[1] = split[last - 2]
address[0] = join split[first] through split[last - 3] with whitespace, trim trailing whitespace with trim()
However, this will only handle one form of address. If addresses are written multiple ways it could be much more tricky.
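The pseudocode could be turned into Java roughly as follows. This is only a sketch with an invented class name, and it keeps the same assumption that house number, zip code and city are each a single whitespace-free token:

```java
import java.util.Arrays;

final class SplitFromEnd {
    /** Returns {street, houseNo, zip, city}, taking the last three tokens from the end. */
    static String[] parse(String address) {
        String[] tokens = address.trim().split("\\s+");
        if (tokens.length < 4) {
            throw new IllegalArgumentException("expected at least 4 tokens: " + address);
        }
        int n = tokens.length;
        // Everything before the last three tokens is the street name.
        String street = String.join(" ", Arrays.copyOfRange(tokens, 0, n - 3));
        return new String[] { street, tokens[n - 3], tokens[n - 2], tokens[n - 1] };
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", parse("Name der Strasse 25a 88489 Teststadt")));
    }
}
```

A city such as "Beispielort (Großer Kreis)" breaks the single-token assumption, which is the limitation mentioned above.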

try this:
^[^\d]+[\d\w]+(\s)\d+(\s).*$
It captures a group for each space that delimits one of the 4 sections of the address
OR
this one gives you groups for each of the address parts:
^([^\d]+)([\d\w]+)\s(\d+)\s(.*)$
I don't know java, so not sure the exact code to use for replacing captured groups.

Related

Java Regex complex ID expression filtering

I am using Java to implement PDF to plain text conversion. Right now I am facing the problem of filtering out ID expressions from String representation of the text.
The idea here is to capture IDs as whole words longer than 4 characters and remove them. IDs must contain both letters and numbers, in any order. They can have optional special symbols like :.- and are generally all uppercase, except for some cases where there might be one and (for now) exactly one lowercase letter in them. IDs can be encountered at any place in the sentence, and there are multiple sentences inside the String. I am also trying to capture the preceding space (if there is one) so there is no double space after I remove the ID. It is acceptable to split the expression into several pieces if it gets too complex.
I've created a small test snippet to show exactly what needs and doesn't need to be caught by the regular expression, as well as display my progress so far. I am using standard java.util.regex package for implementation.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[a-z]{0,1}[\\d\\.]*";
//"[\\s]{0,1}[[A-Z]+[\\d]+[-:\\(\\)\\.]*]{4,}[[a-z]{0,1}[\\d\\.]+]*" //for comma removal
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("*");
System.out.println(testString);
It may be necessary to remove IDs together with their commas, so it would be great if the revised expression was capable of capturing commas or omitting them via minor alterations like the alternative regex I've provided.
My current solution filters out everything that needs to be filtered but also most of the things it shouldn't. It appears the rule that there must be at least one capital letter and one digit in the word isn't working, possibly because I need to use Lookahead/Lookbehind/Grouping, sadly none of which I managed to get to work properly. I also suspect the use of [] is completely incorrect in my example, but this is the only way I managed to get it to (mostly) work for now. Please help me.
My colleague and I were able to solve this issue in an elegant way. Below is a snippet from my current solution. I hope one day this proves useful to someone.
String testString = "Remove this (ACTDIK002), ACTDIK002, (L1:3.CI), 9-12.CT.d.12, and 1A-CS-01 "
+ "but not (DLCS), 781-338-3000, (DTC), (200), K-12, K or 12. "
+ "Also not (), A.I., AI, A or a. . ...";
System.out.println(testString);
String regex = "(?i)(?=[\\dA-Z\\(\\)\\.:-]*\\d)(?=[\\dA-Z\\(\\)\\.:-]*[A-Z])[\\dA-Z\\(\\)\\.:-]{5,}";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(testString);
testString = matcher.replaceAll("");
System.out.println(testString);
//Clean-up extra spaces and unneeded commas
//testString = testString.replaceAll("\\s{2,}", " ").replaceAll("(\\s\\.)|(\\s\\,)", "");
testString = testString.replaceAll("[ ]{2,}", " ").replaceAll("([ ]\\.)|([ ]\\,)", "");
System.out.println(testString);

Regex to find a word only in lines that don't start with --

I am really having a hard time creating a regex that finds a word if and only if the line it is in does not start with --.
For example:
Look for word: if
-- check if //should not match
-- more random words if //should not match
check if //should match
I have tried using negative lookbehinds like:
(?<!-- .*)\bif\b
But I'm using Java, where I cannot use unbounded quantifiers in lookbehinds.
If I try
(?<! -- )\bif\b
It does work on
-- if \\works
-- if \\does not work
I found out about the usage of SKIP and F, but it seems Java does not support these two.
Any advice on how can I deal with this?
Thanks!
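Assuming you're using Java 8, you could do something like this:
Pattern p = Pattern.compile("^(?!--).*\\bif\\b");
Predicate<String> pred = s -> p.matcher(s).find();
Files.lines(Paths.get("files/input.txt"))
.filter(pred).forEach(System.out::println);
The (?!--) part is called a negative lookahead: it only lets the match start on lines that do not begin with --. The \\b word boundaries keep the pattern from matching "if" inside longer words. I hope it helps.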
I don't know how long your lines are, but you can use for example this ugly construction (of course number could be smaller):
(?m)(?<!^--.{0,99999999999999999})if
Java allows some interval quantifiers in lookbehind and lookahead. It just looks wrong :P but it works, at least for me in this example.
Another approach: if you want to replace all occurrences, why not use capturing groups? Like:
String[] examples = {"-- check if ",
"-- more random words if ",
"check if ", "-- f",
"-- check if \n-- more random words if \ncheck if "};
for(String string : examples) {
System.out.println(string.replaceAll("(?m)(?!^--)^(.*?)if","$1" + "replacement"));
System.out.println();
}
The regex (?m)(?!^--)^(.*?)if matches everything up to the searched word and captures it in group 1; in the replacement you can then put it back into the text with $1. Not the most efficient, but it should work anyway.
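If processing the text line by line is acceptable, the lookaround can be avoided altogether: skip lines that start with -- and search the rest for the whole word. A minimal sketch with an invented class name, assuming the searched word is "if":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SkipCommentLines {
    // \b on both sides so that "if" inside a longer word is not counted.
    private static final Pattern WORD_IF = Pattern.compile("\\bif\\b");

    /** Returns the lines that contain the word "if" and do not start with "--". */
    static List<String> linesWithWord(String text) {
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\\R")) {   // \R matches any line break (Java 8+)
            if (!line.startsWith("--") && WORD_IF.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "-- check if\n-- more random words if\ncheck if";
        System.out.println(linesWithWord(text));
    }
}
```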

How would I use regex to output a specific set of Strings java?

I have a String output of a very long row of movie titles and music album titles.
e.g.
[(Pixel Quality) (Year of Release) MovieTitle.ext,...... Albumname-artistname.ext]
i.e. [(HD 1080p) (2015) Batman vs Superman.mov,........tearsinheavan-ericclapton.mp3,.......]
I am trying to tell the movies and music apart using regular expressions. A movie has a pixel quality, a release year, a movie title and an extension like (.mov,.flv...etc), while music has an album name followed by a - and the artist name, with an extension like (.mp3,.aax.....).
The expected output would be (Pixel Quality) (Year of Release) MovieTitle.ext for a movie, and Albumname-artistname.ext for music.
I am not too familiar with regex; I only know how to match single characters or a specific word. However, I can't seem to output the whole pixel quality, year of release and movietitle.ext, only the specific words or single characters I've matched.
Method I used to try and find the "categories".
public void FindPatterns () {
String patternFilms = ("REGEX PATTERN?");
Pattern pattern = Pattern.compile(patternFilms);
for (String name : names) {
Matcher matcher = pattern.matcher(name);
while(matcher.find()){
System.out.println(matcher.group());
}
}
}
UPDATE:
I've attempted to fiddle around with the regex patterns in my code, and I get nothing but syntax errors asking me to delete tokens. I can't find a clear enough example for what I am trying to achieve.
Just in case I've been putting the pattern in the wrong place this whole time: I've been putting the regex pattern in String patternFilms, and "REGEX PATTERN?" is just a placeholder where I am asking if that is the correct place to put the pattern.
On the Java side of things, your code needs to extract each individual group as a named or indexed group. That's (relatively) the easy part though. Before you get to that point, it sounds like you need help with your pattern, so let's look at that first.
Build up your regex piece by piece. A tool that allows you to quickly iterate your regex is useful. I like https://regex101.com/.
What you need to do is select "matching groups" out of the input String. So you want to match everything that you can throw away (things like commas and parentheses), as well as the data you want to extract. For the data you want to extract, surround the regex for each of those pieces of data in parentheses to denote the group.
Your input strings have lots of characters that have special meaning inside a regex, like [ and (. So if you want to match them explicitly, they need to be "escaped". Also keep in mind that when you translate your regex to Java, the \ character is itself an escape for a Java String, so it needs to be escaped too with another \. So, for example, a regex to match a [ character would be defined like \\[.
So, start by matching the entire input:
^.*$
The ^ and $ characters are "anchors" that mean "beginning of the input" and "end of the input" respectively. The . matches any character, and the * matches the previous token (any character) zero or more times (so everything).
In regex101, this will highlight the entire input.
The entire string is surrounded with square brackets, so let's match those, and remember they need to be escaped:
^\[.*\]$
Now let's start breaking up the individual components. The first two are delimited by parentheses, and remember we need to escape parentheses, so let's match (something) (something) something:
^\[\(.*\) \(.*\) .*\]$
Now the whole input should be highlighted again. Let's pull out the two pieces of data we just identified by surrounding them in parentheses:
^\[\((.*)\) \((.*)\) .*\]$
Now you should see the matches highlighted and shown over on the right side. Now continue to build up the regex, replacing that last .* with more specific matches.
Comment on this answer if you run into any particular issue!
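Carried over to Java, the pattern built up so far might be used as in the sketch below. The class name and sample input are made up, and a third group for the title has been added in place of the last .* as one possible next step:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MovieEntryDemo {
    // Every regex backslash is doubled because \ is also the Java string escape.
    private static final Pattern ENTRY =
        Pattern.compile("^\\[\\((.*)\\) \\((.*)\\) (.*)\\]$");

    /** Returns {quality, year, title}, or null if the input does not match. */
    static String[] parse(String input) {
        Matcher m = ENTRY.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("[(HD 1080p) (2015) Batman vs Superman.mov]");
        System.out.println(String.join(" | ", parts));
    }
}
```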
It looks like it's parenthesized and then comma separated, so something along the lines of ^\[\((.*)\)\((.*?)\),(.*),(.*)\]$
^ matches the start of the line, and $ matches the end of the line
\ escapes characters that have special regex meaning, like [. You need \[ and \( to match literal brackets and parentheses.
(...) marks a group, so that you can extract it when you get a match.
.* is just zero or more repetitions of any character. Use .+ to get one or more repetitions.
Also, add " *" where needed to match spaces.
An example in Perl:
echo "(hd)(2015) Avatar.ext, Douchebagson.ext" | perl -pe "s/^\((.*)\) *\((.*)\) *(.*) *, *(.*)$/\1,\2,\3,\4/g"
hd,2015,Avatar.ext,Douchebagson.ext
What's happening is a substitution. We're substituting the input string with <1st part>,<2nd part>,.... The result is a csv-format that can be interpreted by your language of choice, Excel or what ever.
\((.*)\) matches everything within parentheses. The parentheses are not part of the capturing group, since the literal parentheses \( and \) are outside the capturing clause (.*).
^ and $ are not necessary here, but can be used to anchor the match to the beginning and the end of the line.
Note: Since it's a school assignment, I'm not going to explain what's happening, so I'm leaving that to your imagination. You should be able to explain it to your teacher.
Try following code:
String data = "(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3," +
"(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3,(HD 1080p) (2015) Batman vs Superman.mov," +
"tearsinheavan-ericclapton.mp3,";
String rxString = "(?ism)(?<movie>\\(.*?\\) \\(\\d{4}\\).*?\\." +
"\\w+(?=[,\n]))|(?<music>[^(,\n]*?\\-[^,]+)";
Pattern regex = Pattern.compile(rxString);
Matcher regexMatcher = regex.matcher(data);
while (regexMatcher.find()) {
String movie = regexMatcher.group("movie");
String music = regexMatcher.group("music");
if(movie!=null) {
System.out.printf("Movie:\t%s\n", movie);
}
if(music!=null) {
System.out.printf("Music:\t%s\n", music);
}
}
It will print:
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3
Movie: (HD 1080p) (2015) Batman vs Superman.mov
Music: tearsinheavan-ericclapton.mp3

Regular expression not extracting the exact pattern

I am working in Java to read a string of over 100000 characters.
I have a list of keywords, that I search the string for, and if the string is present I call a function which does some internal processing.
The kind of keyword I have is "face", for example - I wish to get all the patterns where I have matches for "faces" not "facebook". I can accept a space character behind the face in the string so if in a string I have a match like " face" or " faces" or "face " or " faces" i can accept that too. However I can not accept "duckface" or "duckface " etc.
I have written the regex
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
where keyword is taken from my list of keywords, but I am not getting the desired results. Can you read my description and please suggest what might be the issue and how I can fix it?
Also, a pointer to a really good page on regex in Java would be appreciated.
Thank you, contributors.
Edit
The reason I know it is not working is I have used the following code:
Pattern p = Pattern.compile("\\s+"+keyword+"s\\s+|\\s+");
Matcher m = p.matcher(myInputDataSting);
if(m.find())
{
System.out.println("Its a Match: "+m.group());
}
This returns a blank string...
If keyword is "face", then your current regex is
\s+faces\s+|\s+
which matches either one or more whitespace characters, followed by faces, followed by one or more whitespace characters, or one or more whitespace characters. (The pipe | has very low precedence.)
What you really want is
\bfaces?\b
which matches a word boundary, followed by face, optionally followed by s, followed by a word boundary.
So, you can write:
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
(though obviously this will only work for words like face that form their plurals by simply adding s).
You can find a comprehensive listing of Java's regular-expression support at http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html, but it's not much of a tutorial. For that, I'd recommend just Googling "regular expression tutorial", and finding one that suits you. (It doesn't have to be Java-specific: most of the tutorials you'll find are for flavors of regular-expression that are very similar to Java's.)
You should use
Pattern p = Pattern.compile("\\b"+keyword+"s?\\b");
where keyword is the singular form. \\b (a word boundary, written with a doubled backslash in a Java string) means the keyword must appear as a complete word in the searched string. s? means the keyword may be followed by an optional s.
If you are not familiar enough with regular expressions, I recommend reading http://docs.oracle.com/javase/tutorial/essential/regex/index.html, because it has examples and explanations.
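Since the keywords come from a list, it may be worth wrapping each one in Pattern.quote so that any regex metacharacters in a keyword are matched literally. A small sketch; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeywordMatcher {
    /** Builds \bkeyword s?\b with the keyword quoted as a literal. */
    static Pattern forKeyword(String keyword) {
        return Pattern.compile("\\b" + Pattern.quote(keyword) + "s?\\b");
    }

    /** Counts whole-word occurrences of the keyword (singular or with a trailing s). */
    static int count(String keyword, String text) {
        Matcher m = forKeyword(keyword).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("face", "face faces facebook duckface"));
    }
}
```

Compiling each pattern once and reusing it is likely to matter when the input is 100000 characters or more and the keyword list is long.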

Parsing quoted text in java

Is there an easy way to parse quoted text into strings in Java? I have lines like this to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R.,The Lord of the Rings,George Allen & Unwin, 1954 as strings.
You could use a regex like
"(.+?)"
It matches the characters between a pair of quotes; the lazy +? stops at the first closing quote, so a line with several quoted values yields one match per value. In Java this would be:
Pattern p = Pattern.compile("\"(.+?)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capturing group, i.e. the text between the quotes. group(0) is the whole match, including the quotes.
Of course, you could also use substring to select everything except the first and last character:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
if (quoted.length() >= 2 && quoted.startsWith("\"") && quoted.endsWith("\"")) {
unquoted = quoted.substring(1, quoted.length() - 1);
} else {
unquoted = quoted;
}
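The sample lines in the question also contain an unquoted value (year=1954). One way to pick up both kinds is a pattern with two alternatives after the equals sign. This is a sketch under the assumption that keys are plain word characters and quoted values contain no escaped quotes; the names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedFields {
    // key="quoted value"  (group 2)   or   key=bareValue  (group 3)
    private static final Pattern FIELD =
        Pattern.compile("(\\w+)=(?:\"([^\"]*)\"|(\\S+))");

    /** Returns the values in order of appearance, without surrounding quotes. */
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\" "
            + "publisher=\"George Allen & Unwin\" year=1954";
        for (String value : values(line)) {
            System.out.println(value);
        }
    }
}
```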
There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have a String word: "hello" and I want to take out "hello", I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it with word: " and "
Of course, you run into problems if word: " is repeated twice, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
Also, be cautious around characters like . because split takes a regex, so such characters have a special meaning and will trigger funny behavior. Prefixing them with "\\" in the Java source (which is a single \ in the regex) escapes them, e.g. split("\\.").
Best of luck!
Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, some reasons you might need a more robust solution than splitting the string blindly on quotation marks: perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Or perhaps you need to handle values that are not quoted as well as quoted ones? Will \" need to be handled as an escaped quotation mark, or does that count as the end of the string? Can it have either single or double quotes (e.g. HTML), or will it always be correctly formatted with double quotes?
One robust way would be to think like a compiler and use a Java based Lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (eg: =") again, and repeat.
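The low-level loop described above could look roughly like this. It is a sketch, not a full parser: the class name is invented, it works on a String rather than a stream, it uses StringBuilder in place of StringBuffer, it treats \" as an escaped quote, and it silently drops a value whose closing quote is missing:

```java
import java.util.ArrayList;
import java.util.List;

final class QuoteScanner {
    /** Collects every value that starts with =" and ends at the next unescaped ". */
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        StringBuilder current = null;   // non-null while inside a quoted value
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (current == null) {
                // A value starts only at a quote that directly follows '='.
                if (c == '"' && i > 0 && input.charAt(i - 1) == '=') {
                    current = new StringBuilder();
                }
            } else if (c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == '"') {
                current.append('"');    // escaped quote: keep the quote, drop the backslash
                i++;
            } else if (c == '"') {
                values.add(current.toString());   // closing quote ends the value
                current = null;
            } else {
                current.append(c);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(quotedValues("author=\"Tolkien, J.R.R.\" year=1954"));
    }
}
```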
