Regex based on address in any order

I have a regex for an address format:
([0-9-]*) ?([\p{L}*,\. '-]*) ?([0-9 ]*) ?([\p{L}*,\. '-]*) ([0-9]{5}) ?([\p{L}*,\. '-]*)
It matches this:
16 Rue du Pont Louis-Philippe 75000 Paris
But I'd like the regex to match this format too:
75000 Paris 16 Rue du Pont Louis-Philippe
Can someone help me, please?

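There are a lot of optional parts in the pattern. You can make the last two groups optional, but you then have to change the quantifiers from * to + (one or more times) to prevent partial matches, or add ^ to the pattern to assert the start of the string. With the last two groups optional, both orders match, and groups 1 to 4 capture the number and text parts in whichever order they appear: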
([0-9-]+) ([\p{L}*,. '-]+) ([0-9 ]+) ([\p{L}*,. '-]+)(?: ([0-9]{5}) ([\p{L}*,. '-]+))?
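To see how the groups fill in for both orderings, here is a minimal sketch that applies the pattern above with matches(). The class name AddressGroups and the method parts are made up for illustration, and the sketch only reports groups 1 to 4; deciding which group is the postcode is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class AddressGroups {
    // The pattern from the answer, written as a Java string literal.
    private static final Pattern ADDRESS = Pattern.compile(
            "([0-9-]+) ([\\p{L}*,. '-]+) ([0-9 ]+) ([\\p{L}*,. '-]+)"
            + "(?: ([0-9]{5}) ([\\p{L}*,. '-]+))?");

    // Returns groups 1 to 4 when the whole input matches, otherwise an empty list.
    static List<String> parts(String address) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(address);
        if (m.matches()) {
            for (int i = 1; i <= 4; i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [16, Rue du Pont Louis-Philippe, 75000, Paris]
        System.out.println(parts("16 Rue du Pont Louis-Philippe 75000 Paris"));
        // prints [75000, Paris, 16, Rue du Pont Louis-Philippe]
        System.out.println(parts("75000 Paris 16 Rue du Pont Louis-Philippe"));
    }
}
```

Note that the groups are positional: in the second ordering the postcode lands in group 1 and the street number in group 3, so the calling code still has to work out which is which.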

Related

Regex to merge multiple numbers with spaces in one line

I need a Regex to merge multiple numbers in a line without merging them all together.
Example line :
Hello World9.99 123 456.00 7 890 123.45 0.97
My desired output is :
Hello World9.99 123456.00 7890123.45 0.97
I know basic regex but am not experienced with lookaheads and lookbehinds.
So far I have created this method:
final String regex = "(?<!\\.\\d{1,3})\\s+(?=\\d{1,3}\\.?\\d{2}?)";
public String mergeNumbers(String s) {
    return s.replaceAll(regex, "");
}
This works fine if the number attached to the word contains a dot.
But I can't figure out how to handle a line where that first number has no dot:
Hello World99 123 456.00 7 890 123.45 0.97
This returns:
Hello World99123456.00 7890123.45 0.97
but I want:
Hello World99 123456.00 7890123.45 0.97
So my question is:
How can I modify my regex to handle both cases?

I suggest using
.replaceAll("\\b(?<!\\.)(\\d+)\\s+(?=\\d)", "$1")
Details:
\b - a word boundary
(?<!\.) - there must be no . immediately before the current location
(\d+) - Group 1 (referred to with the $1 backreference in the replacement string): one or more digits
\s+ - one or more whitespace characters
(?=\d) - there must be a digit immediately to the right of the current location.
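As a quick check of the suggested pattern, here is a small sketch that runs it on both sample lines from the question. The class name MergeNumbers is just for illustration; the method name is the one from the question.

```java
// Hypothetical wrapper class around the replaceAll call suggested above.
final class MergeNumbers {
    // Removes the whitespace after a digit run that starts at a word boundary,
    // is not preceded by a dot, and is followed by another digit.
    static String mergeNumbers(String s) {
        return s.replaceAll("\\b(?<!\\.)(\\d+)\\s+(?=\\d)", "$1");
    }

    public static void main(String[] args) {
        // prints Hello World9.99 123456.00 7890123.45 0.97
        System.out.println(mergeNumbers("Hello World9.99 123 456.00 7 890 123.45 0.97"));
        // prints Hello World99 123456.00 7890123.45 0.97
        System.out.println(mergeNumbers("Hello World99 123 456.00 7 890 123.45 0.97"));
    }
}
```

The 99 in World99 is left alone because there is no word boundary between the letter d and the digit 9, and the fractional parts such as 00 in 456.00 are skipped because of the (?<!\.) lookbehind.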

How to find a set of words in a string?

I am working on a Java and MySQL based application, and my task is to find a set of words in a string. The position of the words in the string does not matter, but they must all be present.
Consider an example:
The string is "sector 10 , Delhi"
but I am trying to search by Delhi sector 10
or by sector-10 Delhi or sector 10 , Delhi
Please help me find this kind of pattern in a string, using Java or a MySQL query.

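You can use the FULLTEXT search feature of MySQL:
http://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
For example, in boolean mode a leading + marks a word that must be present and a leading - marks a word that must not be present:
SELECT address FROM area WHERE MATCH(address) AGAINST ('+delhi -bombay' IN BOOLEAN MODE)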

Use something like this:
Pattern p = Pattern.compile("\\b(Delhi|sector|10)\\b");
Matcher m = p.matcher("sector 10 , Delhi");
while (m.find()) {
    System.out.println(m.group()); // prints each search word that was found
}
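The alternation above reports which words occur, but it does not by itself check that every word is present. One way to do that, as a rough sketch, is to build one lookahead per search word. The class and method names are made up for illustration, and splitting the query on anything that is not a letter or digit is an assumption about what counts as a word.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answers.
final class WordSetSearch {
    // Returns true when every word of the query occurs in the text as a whole word,
    // in any order and ignoring case.
    static boolean containsAllWords(String text, String query) {
        StringBuilder regex = new StringBuilder();
        // Assumption: words are separated by any run of non-letter, non-digit characters.
        for (String word : query.split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                // One lookahead per word: the word must appear somewhere in the text.
                regex.append("(?=.*\\b").append(Pattern.quote(word)).append("\\b)");
            }
        }
        if (regex.length() == 0) {
            return false; // an empty query matches nothing in this sketch
        }
        regex.append(".*");
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL)
                .matcher(text)
                .matches();
    }

    public static void main(String[] args) {
        String stored = "sector 10 , Delhi";
        System.out.println(containsAllWords(stored, "Delhi sector 10"));  // true
        System.out.println(containsAllWords(stored, "sector-10 Delhi"));  // true
        System.out.println(containsAllWords(stored, "sector 11 Delhi"));  // false
    }
}
```

For large tables this check would run in Java after fetching the rows, so a database-side search is likely the better fit when the data lives in MySQL.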

How to extract date section from filename?

I need to find a regex to extract the date section from the names of several files.
In particular I have these two formats:
ATC0200720140828080610.xls
ATC0200720140901080346_UFF_ACC.xls
I use these two regexes to check the file name format:
^ATC02007[0-9]{14}.xls$
^ATC02007[0-9]{14}_UFF_ACC.xls$
But I need a regex to extract a specific section:
constant | yyyyMMddHHmmss | constant
   ^     |       ^        |    ^
ATC02007 | 20140901080346 | _UFF_ACC.xls
Both regexes I'm using match the entire file name, so I can't use them to extract the middle section. Which is the right expression?

You are almost there. Just use parentheses to capture the digits you want, and escape the dot so that it only matches a literal period:
^ATC02007([0-9]{14})(_UFF_ACC)?\.xls$
The digits are captured in group 1 ($1).
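In Java, reading group 1 could look like the sketch below. The dot before xls is escaped here so that it only matches a literal period. The class name FileDateExtractor is made up, and turning the 14 digits into a LocalDateTime is an extra step that assumes they really are in yyyyMMddHHmmss form, as the question describes.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class FileDateExtractor {
    private static final Pattern NAME =
            Pattern.compile("^ATC02007([0-9]{14})(_UFF_ACC)?\\.xls$");
    // Assumption: the 14 digits follow the yyyyMMddHHmmss layout from the question.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Returns the 14-digit date section, or empty if the name has a different format.
    static Optional<String> dateSection(String fileName) {
        Matcher m = NAME.matcher(fileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Parses the date section; throws DateTimeParseException for impossible dates.
    static Optional<LocalDateTime> timestamp(String fileName) {
        return dateSection(fileName).map(s -> LocalDateTime.parse(s, STAMP));
    }

    public static void main(String[] args) {
        System.out.println(dateSection("ATC0200720140828080610.xls"));       // Optional[20140828080610]
        System.out.println(timestamp("ATC0200720140901080346_UFF_ACC.xls")); // Optional[2014-09-01T08:03:46]
        System.out.println(dateSection("other.xls"));                        // Optional.empty
    }
}
```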

You need to use capturing groups.
^(ATC02007)([0-9]{14})((?:[^.]*)?\.xls)$
Group 1 contains the first constant, group 2 contains the date and time, and group 3 contains the third constant.
String s = "ATC0200720140828080610.xls\n" +
        "ATC0200720140901080346_UFF_ACC.xls";
// (?m) makes ^ and $ match at the start and end of each line
Pattern regex = Pattern.compile("(?m)^(ATC02007)([0-9]{14})((?:[^.]*)?\\.xls)$");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
    System.out.println(matcher.group(3));
}
Output:
ATC02007
20140828080610
.xls
ATC02007
20140901080346
_UFF_ACC.xls

How to extract sub-strings for a collection of text?

I extracted text from a PDF document, and I want to extract some particular fields from it using Java.
A portion of the text:
US00RE44697E (i9) United States (12) Reissued Patent (10)
Patent Number: RE44,697 E Jones et al. (45) Date of
ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa
(CA); Cormac M.O'Connell, Carp (CA) (73) Assignee: Mosaid
Technologies Incorporated, Ottawa, Ontario (CA) (21)
Appl.No.: 13/603,137 (22) Filed: Sep. 4, 2012 Related U.S.
Patent Documents Reissue of: (64) Patent No.: Issued:
Appl. No.: Filed: 6,088,800 Jul. 11, 2000
09/032,029 Feb. 27, 1998 (51) Int.CI. G06F 21/00
(2013.01) (52) U.S. CI. USPC .............713/189; 713/190;
713/193; 380/28; 380/33; 380/52 (58) Field of Classification
Search None
Now my goal is to extract fields from it and assign them to strings. That is:
the text (10) Patent Number: RE44,697 E will be extracted as String pat_no= " RE44,697 E"
the text (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT will be extracted as String title= "ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT"
the extremely irregular text block
(64) Patent No.: Issued: Appl. No.: Filed:
6,088,800 Jul. 11, 2000 09/032,029 Feb. 27, 1998
has to be extracted as
String pat_no_org = "6,088,800";
String issued = "jul.11,2000"
String filed = "feb 27 ,1998"
......
and so on.
What I tried
First I used String.split, String.substring, String.indexOf and even Apache StringUtils, but none of them helped, because the text is scattered. I also tried regular expressions, but since I am very weak at them I couldn't get a program working.
Please tell me how to achieve this using Java.

With regex, I would split it into 3 parts:
1.) (10) Patent Number. The regex could look like this:
\(10\)\s*Patent Number:\s*([\w,]+)
as a Java string:
"\\(10\\)\\s*Patent Number:\\s*([\\w,]+)"
The match for the first parenthesized group will be in group 1, i.e. matcher.group(1).
\s is shorthand for [ \t\n\x0B\f\r], any kind of whitespace.
\w is shorthand for [A-Za-z0-9_], the word characters; here it is combined with , in a character class.
Some characters have special meanings in regex, such as ( and ). To match them literally, they have to be escaped with a backslash.
2.) (54) ENCRYPT...
A pattern could look like:
(?s)\(54\)\s*(.*?)\s*(?=\(\d|$)
as a Java string:
"(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)"
(?s) The s modifier is equivalent to Pattern.DOTALL, where the dot matches newlines too.
(?=\(\d|$) A lookahead is used so that the lazy (.*?) matches any characters until either another ( followed by a digit, or the end of the string $, is seen.
3.) For the other 3 desired parts, I would try to mirror the formatting of the input in the pattern. This requires that all of the data is laid out consistently. A pattern could look like this:
(?s)\(64\).*?Filed:\s*([\d,]+)\s*(\w+\.\s*\d+,\s*\d+)\s*\n[\d+][^\n]+\n\s*(\w+\.\s*\d+,\s*\d+)
as a Java string:
"(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)\\s*\\n[\\d+][^\\n]+\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
\n matches a newline.
The matches will be in group 1 (e.g. 6,088,800), group 2 (e.g. Jul. 11, 2000) and group 3 (e.g. Feb. 27, 1998).
If you are just getting started with regex, this is a lot of information at once :)
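To make parts 1 and 2 concrete, here is a minimal sketch that runs those two patterns against the first lines of the sample text. The class name PatentFields and the helper firstGroup are made up for illustration, the line breaks in the sample are an assumption about how the PDF text was extracted, and the lookahead for the title is written as (?=\(\d|$) so that the end of the input also ends the capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class PatentFields {
    private static final Pattern PATENT_NUMBER =
            Pattern.compile("\\(10\\)\\s*Patent Number:\\s*([\\w,]+)");
    private static final Pattern TITLE =
            Pattern.compile("(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)");

    // Returns capture group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    static String patentNumber(String text) {
        return firstGroup(PATENT_NUMBER, text);
    }

    static String title(String text) {
        return firstGroup(TITLE, text);
    }

    public static void main(String[] args) {
        // Assumption: the extracted text keeps these line breaks.
        String text = "US00RE44697E (i9) United States (12) Reissued Patent (10)\n"
                + "Patent Number: RE44,697 E Jones et al. (45) Date of\n"
                + "ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED\n"
                + "MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa";
        System.out.println(patentNumber(text)); // RE44,697
        System.out.println(title(text));        // the title, still containing its line break
    }
}
```

Note that ([\w,]+) stops at the first space, so the trailing E of "RE44,697 E" is not captured; the character class would need to be widened if that suffix is wanted.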

Java regex trying to split string up

Hi, I am trying to split this string up (it's quite long):
Library Catalogue Log off | Borrower record | Course Reading | Collections | A-Z E-Journal list | ILL Request | Help   Browse | Search | Results List | Previous Searches | My e-Shelf | Self-Issue | Feedback       Selected records:  View Selected  |  Save/Mail  |  Create Subset  |  Add to My e-Shelf  |        Whole set:  Select All  |  Deselect  |  Rank  |  Refine  |  Filter   Records 1 - 15 of 101005 (maximum display and sort is 2500 records)         1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq / edited by 2012. BK Book University Library( 1/ 0) 2 Kowalski, Robin M. Cyberbullying : bullying in the digital age / Robin M. Kowalski, Susan P. Limber, Patricia W. Ag 2012. BK Book University Library( 1/ 0) ... 15 Ambrose, Gavin. Approach and language [electronic resource] / Gavin Ambrose, Nigel Aono-Billson. 2011. BK Book
So that I either get back:
1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq / edited by 2012. BK Book University Library( 1/ 0)
// Or
1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq
This is just an example and the 1 Drower, E. S. ... will not be static. While the input will be different every time (the detail between 1 and 2) the general layout of the string will always be the same.
I have:
String top = ".* (.*)";
String bottom = "\\( \\d/ \\d\\)\\W*";
Pattern p = Pattern.compile(top); //+bottom
Matcher matcher = p.matcher(td); //td is the input String
String items = matcher.group();
System.out.println(items);
When I run it with top, it is meant to remove all of the headers, but all I get back is "No match found". bottom is my attempt to split the rest of the string.
I can post all of the input up to number 15 if needed. What I need is to split up the input string so that I can work with each of the 15 results individually.
Thanks for your help!

This will give you both outputs. Is this what you wanted?
String text = "Library Catalogue Log off ..."; // truncated text
Pattern p = Pattern.compile("((1 Drower.+Iraq).+0\\)).+2 Kowalski");
Matcher m = p.matcher(text);
if (m.find()) {
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

First, you need to separate the headers from the result data. Assuming that there will always be that block of 9 whitespace characters, you can use this: .*\s{9}(.*)
Next, you need to parse the data into rows. This is more difficult because you have no row delimiters. The best you can do is assume that rows are delimited by a space, then one or more digits, then another space.
((?<=(?:^|\s))\d+\s.*?(?=(?:$|\s\d+\s)))
If you're planning to parse the records into fields, don't bother unless you can change the delimiters!
A little explanation of what each bit does:
(?<=(?:^|\s)) Lookbehind: make sure the character preceding the group is either the start of the string (first record) or a space (all other records).
\d+\s.*? Capture group: one or more digits followed by a space, then followed by text. This is the only part of the expression that shows up in the output, because the assertions around it are zero-width and use non-capturing groups ?:.
(?=(?:$|\s\d+\s)) Lookahead: make sure the characters following the group are either the end-of-string marker $ or a space followed by one or more digits and another space (indicating the next record).
This method works with the data you provided, but it will break if a record contains the custom delimiter, e.g. a book called "My 10 favourite things". There are other ways of parsing records that are a little safer, but that goes beyond what regex can reasonably be expected to do...
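Put together in Java, the two steps could look roughly like this. It is only a sketch: the class name RecordSplitter is made up, the sample page in main is a shortened stand-in for the real input, and it relies on the same assumption as above that exactly one block of 9 whitespace characters separates the headers from the records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration.
final class RecordSplitter {
    // Step 1: everything after the block of 9 whitespace characters is record data.
    private static final Pattern HEADER = Pattern.compile(".*\\s{9}(.*)");
    // Step 2: a record starts with digits plus a space and runs up to the next " digits ".
    private static final Pattern RECORD =
            Pattern.compile("((?<=(?:^|\\s))\\d+\\s.*?(?=(?:$|\\s\\d+\\s)))");

    static List<String> split(String page) {
        Matcher header = HEADER.matcher(page);
        // Fall back to the whole input when no header block is found.
        String body = header.matches() ? header.group(1) : page;
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(body);
        while (m.find()) {
            records.add(m.group(1));
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened sample; String.format("%9s", "") yields the 9-space separator.
        String page = "Records 1 - 15 of 101005 (maximum display and sort is 2500 records)"
                + String.format("%9s", "")
                + "1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. 2012. BK Book University Library( 1/ 0) "
                + "2 Kowalski, Robin M. Cyberbullying 2012. BK Book University Library( 1/ 0) "
                + "15 Ambrose, Gavin. Approach and language 2011. BK Book";
        for (String record : split(page)) {
            System.out.println(record);
        }
    }
}
```

Stripping the header first matters here, because header text such as "Records 1 - 15 of" contains the same space-digits-space sequence that is used as the row delimiter.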
