Java regex trying to split string up - java

Hi, I am trying to split this string up (it's quite long):
Library Catalogue Log off | Borrower record | Course Reading | Collections | A-Z E-Journal list | ILL Request | Help   Browse | Search | Results List | Previous Searches | My e-Shelf | Self-Issue | Feedback       Selected records:  View Selected  |  Save/Mail  |  Create Subset  |  Add to My e-Shelf  |        Whole set:  Select All  |  Deselect  |  Rank  |  Refine  |  Filter   Records 1 - 15 of 101005 (maximum display and sort is 2500 records)         1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq / edited by 2012. BK Book University Library( 1/ 0) 2 Kowalski, Robin M. Cyberbullying : bullying in the digital age / Robin M. Kowalski, Susan P. Limber, Patricia W. Ag 2012. BK Book University Library( 1/ 0) ... 15 Ambrose, Gavin. Approach and language [electronic resource] / Gavin Ambrose, Nigel Aono-Billson. 2011. BK Book
So that I either get back:
1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq / edited by 2012. BK Book University Library( 1/ 0)
// Or
1 Drower, E. S. (Ethel Stefana), Lady, b. 1879. Lady E.S. Drower’s scholarly correspondence : an intrepid English autodidact in Iraq
This is just an example and the 1 Drower, E. S. ... will not be static. While the input will be different every time (the detail between 1 and 2) the general layout of the string will always be the same.
I have:
String top = ".* (.*)";
String bottom = "\( \d/ \d\)\W*";
Pattern p = Pattern.compile(top); //+bottom
Matcher matcher = p.matcher(td); //td is the input String
String items = matcher.group();
System.out.println(items);
When I run it with top, it is meant to remove all of the headers but all I get back is No match found. bottom is my attempt to split the rest of the string.
I can post all of the input up to number 15 if it is needed. What I need is to split up the input string so that I can work with each individual of the 15 results.
Thanks for your help!

This will produce both of the outputs you asked for. Is it what you wanted?
String text = "Library Catalogue Log off ..."; // truncated text
Pattern p = Pattern.compile("((1 Drower.+Iraq).+0\\)).+2 Kowalski");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}

First off, you need to separate the headers from the result data. Assuming that block of 9 whitespace characters is there every time, you can use this: .*\s{9}(.*)
Next you need to parse the data into rows, this is more difficult because you have no row delimiters. The best you can do is assume that rows are delimited by: a space then one or more digits then another space.
((?<=(?:^|\s))\d+\s.*?(?=(?:$|\s\d+\s)))
If you're planning to try to parse the records into fields then don't bother unless you can change the delimiters!
A little explanation of what each bit does:
(?<=(?:^|\s)) Look behind: Make sure the character preceding the group is either the start of the string (1st record), or a space (all other records).
\d+\s.*? Capture group: One or more digits followed by a space, then followed by text. This is the only part of the expression that shows up in the output because of the use of non-capturing groups ?: in the assertions.
(?=(?:$|\s\d+\s)) Look ahead: Make sure the characters following the group are either the end of string marker $ or a space followed by 1+ digits, followed by a space (indicating the next record).
This method works with the fields you provided, but it will break if you have a record that contains the custom delimiter, e.g. a book called "My 10 favourite things". There are other ways of parsing records that are a little safer, but if that's what you want to do then it's beyond the expectations of regex...
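The two steps above can be put together in Java roughly as follows. This is only a sketch: the class name RecordSplitter, the method splitRecords and the shortened sample page are made up for illustration, and it assumes the block of 9 whitespace characters really does precede the first record. Note that group() may only be called after matches() or find() has succeeded, which is why the code in the question reports "No match found".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordSplitter {
    // Step 1: keep what follows the last block of 9 whitespace characters.
    private static final Pattern HEADER =
            Pattern.compile(".*\\s{9}(.*)", Pattern.DOTALL);
    // Step 2: the row pattern from the answer above.
    private static final Pattern RECORD =
            Pattern.compile("((?<=(?:^|\\s))\\d+\\s.*?(?=(?:$|\\s\\d+\\s)))", Pattern.DOTALL);

    static List<String> splitRecords(String page) {
        Matcher header = HEADER.matcher(page);
        // matches() has to succeed before group(1) is read.
        String body = header.matches() ? header.group(1) : page;

        List<String> records = new ArrayList<>();
        Matcher row = RECORD.matcher(body);
        while (row.find()) {
            records.add(row.group(1));
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the page text in the question.
        String page = "Log off | Help   Browse | Search   Records 1 - 3 of 3"
                + "         "
                + "1 Drower, E. S. Lady. 2012. BK Book University Library( 1/ 0) "
                + "2 Kowalski, Robin M. Cyberbullying. 2012. BK Book University Library( 1/ 0) "
                + "3 Ambrose, Gavin. 2011. BK Book";
        splitRecords(page).forEach(System.out::println);
    }
}
```

Stripping the header first matters here, because "Records 1 - 15 of 101005" in the header contains the same "space, digits, space" shape that is being used as the row delimiter.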

Related

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , so that I know where each item ends.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
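For comparison, the same matching idea might look like this in Java. It is a sketch under a few assumptions: the class name AddressListScanner and the method extract are invented for the example, the capture group is used so the trailing delimiter is dropped, and each item is trimmed. It keeps # as the separator exactly as the examples above do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressListScanner {
    // Lazily take text up to a '#', then up to a '.', then up to the next ',' or ';'.
    private static final Pattern ITEM = Pattern.compile("(.*?#.*?\\..*?)[,;]");

    static List<String> extract(String addressList) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(addressList);
        while (m.find()) {
            // group(1) excludes the delimiter; trim() drops the padding around it.
            items.add(m.group(1).trim());
        }
        return items;
    }

    public static void main(String[] args) {
        String input = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, "
                + "Sixpack, Joe 2 <name#domain.com> ;nameBaz#domain.com;";
        extract(input).forEach(System.out::println);
    }
}
```

Because the pattern requires a , or ; after every item, a final address without a trailing delimiter would be skipped; appending one before matching is a simple way around that.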
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:^|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
Using Java's replaceAll and split functions (mimicked in JavaScript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a JavaScript example, but Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

How to structure my XText terminals? WORDS/SL_STRING/ML_STRING

In my XText DSL, I want to be able to use three different kinds of text terminals. They are all used for adding comments on top of arrows drawn in a UML diagram:
terminal WORD:
Actor -> Actor: WORD
terminal SL_STRINGS:
Actor -> Actor: A sequence of words on a single line
terminal ML_STRINGS:
Actor -> Actor: A series of words on
multiple
lines
My initial approach was to use the ID terminal from the org.eclipse.xtext.common.Terminals as my WORD terminal, and then just have SL_STRINGS be (WORD)*, and ML_STRINGS be (NEWLINE? WORD)*, but this creates a lot of problems with ambiguity between the rules.
How would I go about structuring this in a good way?
More information about the project. (And as this is the first time working with XText, please bear with me):
I am trying to implement a DSL to be used together with the Eclipse Plugin for PlantUML http://plantuml.sourceforge.net/sequence.html mainly for Syntax Checking and Colorization. Currently my grammar works as such:
Model:
(diagrams+=Diagram)*;
Diagram:
'@startuml' NEWLINE (instructions+=(Instruction))* '@enduml' NEWLINE*
;
An instruction can be lots of things:
Instruction:
((name1=ID SEQUENCE name2=ID (':' ID)?)
| Definition (Color)?
| AutoNumber
| Title
| Legend
| Newpage
| AltElse
| GroupingMessages
| Note
| Divider
| Reference
| Delay
| Space
| Hidefootbox
| Lifeline
| ParticipantCreation
| Box)? NEWLINE
;
Example of rules that need different kinds of text terminals:
Group:
'group' TEXT
;
Reference:
'ref over' ID (',' ID)* ((':' SL_TEXT)|((ML_TEXT) NEWLINE 'end ref'))
;
For Group, the text can only be on one line, while for Reference, the text can be on two lines if there is no ":" following the rule call.
Currently my terminals look like this:
terminal NEWLINE : ('\r'? '\n');
// Multiline comment begins with /', and ends with '/
terminal ML_COMMENT : '/\'' -> '\'/';
// Singleline comment begins with ', and continues until end of line.
terminal SL_COMMENT : '\'' !('\n'|'\r')* ('\r'? '\n')?;
// INT is a sequence of numbers 0-9.
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal WS : (' '|'\t')+;
terminal ANY_OTHER: .;
On top of this, I want to add three new terminals that take care of the text.
You should implement a data type rule in order to achieve the desired behavior.
Sebastian wrote an excellent blog post on this topic which can be found here: http://zarnekow.blogspot.de/2012/11/xtext-corner-6-data-types-terminals-why.html
Here is a minimal example of a grammar:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
greetings+=Greeting*;
Greeting:
'Example' ':' comment=Comment;
Comment:
(ID ('\r'? '\n')?)+
;
That will allow you to write something like this:
Example: A series of words
Example: A series of words on
multiple lines
You then may want to implement your own value converter in order to fine-tune the conversion to a String.
Let me know if that helps!

How to extract sub-strings for a collection of text?

I extracted text from a PDF document, and I want to extract some particular fields from it using Java.
A portion of the text:
US00RE44697E (i9) United States (12) Reissued Patent (10)
Patent Number: RE44,697 E Jones et al. (45) Date of
ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa
(CA); Cormac M.O'Connell, Carp (CA) (73) Assignee: Mosaid
Technologies Incorporated, Ottawa, Ontario (CA) (21)
Appl.No.: 13/603,137 (22) Filed: Sep. 4, 2012 Related U.S.
Patent Documents Reissue of: (64) Patent No.: Issued:
Appl. No.: Filed: 6,088,800 Jul. 11, 2000
09/032,029 Feb. 27, 1998 (51) Int.CI. G06F 21/00
(2013.01) (52) U.S. CI. USPC .............713/189; 713/190;
713/193; 380/28; 380/33; 380/52 (58) Field of Classification
Search None
Now my mission is to extract fields from it and assign them to strings, that is:
the text (10) Patent Number: RE44,697 E will be extracted as String pat_no= " RE44,697 E"
the text (54) ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT will be extracted as String title= "ENCRYPTIONPROCESSORWITH SHARED
MEMORY INTERCONNECT"
the extremely irregular text block
(64) Patent No.: Issued: Appl. No.: Filed:
6,088,800 Jul. 11, 2000 09/032,029 Feb. 27, 1998
have to be extracted as
String pat_no_org = "6,088,800";
String issued = "jul.11,2000";
String filed = "feb 27 ,1998";
......
like this..
My Works
First I tried String.split, String.substring, String.indexOf and even Apache StringUtils, but none of them helped: because the text is scattered, those methods don't work. I also tried regular expressions, but since I am very weak at them I couldn't write a working pattern.
Please tell me how to achieve my objective using Java.
With regex, I would split it in 3 parts:
1.) (10) Patent Number the regex could look like this:
\(10\)\s*Patent Number:\s*([\w,]+)
as a java string:
"\\(10\\)\\s*Patent Number:\\s*([\\w,]+)"
The matches for the first parenthesized group will be in [1].
\s is a shorthand for [ \t\n\x0B\f\r] any kind of white-space.
\w is a shorthand for [A-Za-z0-9_] word-characters, together with , in a character class.
Some characters have special meanings in regex. They have to be escaped with a backslash.
2.) (54) ENCRYPT...
A pattern could look like:
(?s)\(54\)\s*(.*?)\s*(?=\(\d|$)
as a java string:
"(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)"
(?s) The s modifier equals Pattern.DOTALL where the dot matches new-lines too.
(?=\(\d|$) A lookahead is used so that (.*?) lazily matches any amount of any characters until either another ( followed by a digit, or the string end $ (anchor for end), is seen.
3.) For the other three desired parts, I would try to mirror the formatting of the input in the pattern. This requires that all the data is laid out the same way. A pattern could look like this:
(?s)\(64\).*?Filed:\s*([\d,]+)\s*(\w+\.\s*\d+,\s*\d+)\s*\n[\d+][^\n]+\n\s*(\w+\.\s*\d+,\s*\d+)
as a java string:
"(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)\\s*\\n[\\d+][^\\n]+\\n\\s*(\\w+\\.\\s*\\d+,\\s*\\d+)"
\n matches a newline.
Matches will be in [1] e.g. 6,088,800, [2] e.g. Jul. 11, 2000 and [3] e.g. Feb. 27, 1998.
For getting started with regex, this is too much information at once :)
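To see how these pieces fit together in Java, here is a minimal sketch. The class name PatentFieldExtractor, the helper extract and the shortened SAMPLE text are invented for the example. The first two patterns follow parts 1 and 2 above, with the title lookahead written as (?=\(\d|$). The third pattern is not the one from part 3: it is a looser variant that assumes the four values appear in order, separated by any whitespace, so that it also copes with the line wrapping shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatentFieldExtractor {
    // Part 1: patent number after the (10) marker.
    static final Pattern PATENT_NUMBER =
            Pattern.compile("\\(10\\)\\s*Patent Number:\\s*([\\w,]+)");
    // Part 2: title after (54), up to the next "(digit" marker or the end of the text.
    static final Pattern TITLE =
            Pattern.compile("(?s)\\(54\\)\\s*(.*?)\\s*(?=\\(\\d|$)");
    // Assumption: whitespace-tolerant variant for the (64) block
    // (patent no., issue date, application no., filing date, in that order).
    static final Pattern REISSUE = Pattern.compile(
            "(?s)\\(64\\).*?Filed:\\s*([\\d,]+)\\s+(\\w+\\.\\s*\\d+,\\s*\\d+)"
            + "\\s+[\\d/,]+\\s+(\\w+\\.\\s*\\d+,\\s*\\d+)");

    // Shortened copy of the text from the question (some lines left out).
    static final String SAMPLE =
            "US00RE44697E (i9) United States (12) Reissued Patent (10)\n"
            + "Patent Number: RE44,697 E Jones et al. (45) Date of\n"
            + "ReissuedPatent: Jan. 7, 2014 (54) ENCRYPTIONPROCESSORWITH SHARED\n"
            + "MEMORY INTERCONNECT (75) Inventors: David E.Jones, Ottawa\n"
            + "Reissue of: (64) Patent No.: Issued:\n"
            + "Appl. No.: Filed: 6,088,800 Jul. 11, 2000\n"
            + "09/032,029 Feb. 27, 1998 (51) Int.CI. G06F 21/00\n";

    /** Returns the given group of the first match, or null if nothing matches. */
    static String extract(Pattern pattern, String text, int group) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        System.out.println("pat_no     = " + extract(PATENT_NUMBER, SAMPLE, 1));
        System.out.println("title      = " + extract(TITLE, SAMPLE, 1));
        System.out.println("pat_no_org = " + extract(REISSUE, SAMPLE, 1));
        System.out.println("issued     = " + extract(REISSUE, SAMPLE, 2));
        System.out.println("filed      = " + extract(REISSUE, SAMPLE, 3));
    }
}
```

The patent number pattern stops at the first space, so it yields RE44,697 without the trailing " E"; the character class would need to be widened to capture that suffix as well.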

Regex: strip all tags except those containing keyword "univ"

[introduction][position]Lead Researcher and Research Manager[/position] in the [affiliation]Web Search and Mining Group, Microsoft Research[/affiliation]</b>.
I am a [position]lead researcher[/position] at [affiliation]Microsoft Research[/affiliation]. I am also [position]adjunct professor[/position] of [affiliation]Peking University[/affiliation], [affiliation]Xian Jiaotong University[/affiliation] and [affiliation]Nankai University[/affiliation].
I joined [affiliation]Microsoft Research[/affiliation] in June 2001. Prior to that, I worked at the Research Laboratories of NEC Corporation.
I obtained a [bsdegree]B.S.[/bsdegree] in [bsmajor]Electrical Engineering[/bsmajor] from [bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate] and a [msdegree]M.S.[/msdegree] in [msmajor]Computer Science[/msmajor] from [msuniv]Kyoto University[/msuniv] in [msdate]1990[/msdate]. I earned my [phddegree]Ph.D.[/phddegree] in [phdmajor]Computer Science[/phdmajor] from the [phduniv]University of Tokyo[/phduniv] in [phddate]1998[/phddate].
I am interested in [interests]statistical learning[/interests], [interests]natural language processing[/interests], [interests]data mining, and information retrieval[/interests].[/introduction]
I'm able to strip all tags from the paragraph above with:
String stripped = html.replaceAll("\\[.*?\\]", "");
But I'd like to keep three pairs of tags in the paragraph, which are [bsuniv][/bsuniv], [msuniv][/msuniv] and [phduniv][/phduniv]. In other words, I don't want to strip those tags containing the keyword "univ". I can't find a convenient way to rewrite the regular expression. Can anyone help me?
You can use a negative-look ahead assertion here: -
str = str.replaceAll("\\[(.(?!univ))*?\\]", "");
or: -
str = str.replaceAll("\\[((?!univ).)*?\\]", "");
Both of them will give you the desired output. There is only one difference -
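The first one matches a character and then does a negative look-ahead after it: the character is accepted only if it is not immediately followed by univ.
The second one does the negative look-ahead at the position before each character: if the text there does not start with univ, it goes ahead and matches a single character. In practice this only matters for a tag that begins with univ (for example [univ]), which the first pattern would still strip and the second would keep.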
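A small Java sketch makes the behaviour of the two patterns easy to check. The class and method names (TagStripper, stripLookaheadFirst, stripCharFirst) are hypothetical, and the sample sentence is a shortened piece of the paragraph from the question.

```java
class TagStripper {
    // Second pattern from the answer: look ahead first, then consume one character.
    static String stripLookaheadFirst(String text) {
        return text.replaceAll("\\[((?!univ).)*?\\]", "");
    }

    // First pattern from the answer: consume one character, then look ahead.
    static String stripCharFirst(String text) {
        return text.replaceAll("\\[(.(?!univ))*?\\]", "");
    }

    public static void main(String[] args) {
        String sample = "a [bsdegree]B.S.[/bsdegree] from "
                + "[bsuniv]Kyoto University[/bsuniv] in [bsdate]1988[/bsdate]";
        System.out.println(stripLookaheadFirst(sample));
        System.out.println(stripCharFirst(sample));
        // The two patterns only disagree on a tag that starts with "univ".
        System.out.println(stripLookaheadFirst("[univ]X[/univ]"));
        System.out.println(stripCharFirst("[univ]X[/univ]"));
    }
}
```

For the sample sentence both methods return the same string, with only the [bsuniv] pair left in place.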

Extracting data from a text file - repeated values

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the numbers 47, 46, 45, 44 and so on, and ignore the rest. The numbers always follow this pattern: 9!0, then the number, then 9%
for example: 9!0 42 9%
Which language should I go about to solve this and which function might help me?
Is there any function that can position a special character and copy the next two or three elements?
Ex: 9!0 42 9% and ' 009
look out for ! and then copy 42 from there and look out for ' that refers to another value (009). It's like two different regex to be used.
You can use whatever language you want, or even a Unix command-line utility like sed, awk, or grep. You want to match 9!0, followed by digits, followed by 9%. Use this regex: 9!0(\d+)9% (or, if the numbers are all two digits, 9!0(\d{2})9%).
The other answers are fine, my regex solution is simply "9!.(\d\d)"
And here's a full solution in PowerShell, which can easily be translated to other .NET languages
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}
This is perl:
@result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
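Filled in as a complete program, the Java loop above could look like the following sketch. The class name NumberExtractor, the method extract and the choice to return a List<Integer> are assumptions made for the example; the input in main is the first part of the data from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberExtractor {
    // Digits that are preceded by "9!0" and followed by "9%".
    private static final Pattern VALUE = Pattern.compile("(?<=9!0)\\d+(?=9%)");

    static List<Integer> extract(String data) {
        List<Integer> values = new ArrayList<>();
        Matcher m = VALUE.matcher(data);
        while (m.find()) {
            // group() is only the digits; the look-arounds are not part of the match.
            values.add(Integer.parseInt(m.group()));
        }
        return values;
    }

    public static void main(String[] args) {
        String data = "79 0009!017009!0479%0009!0479 0009!0469%0009!0469 "
                + "0009!0459%0009!0459'009";
        System.out.println(extract(data)); // prints [47, 46, 45]
    }
}
```

Values such as the 17 in 9!017009 are skipped because they are not followed by 9%, which is the difference from the shorter 9!.(\d\d) pattern.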
