Trying to create a regexp

I have a string that I want to parse with a Java or Python regexp:
something (\var1 \var2 \var3 $var4 #var5 $var6 *fdsfdsfd #uytuytuyt fdsgfdgfdgf aaabbccc)
The number of variables is unknown, and so are their exact names. A name may or may not start with "\", "$", "*" or "#", and the names are delimited by whitespace.
I'd like to parse them separately, that is, in capture groups, if possible. How can I do that? The output I want is a list of:
[\var1 , \var2 , \var3 , $var4 , #var5 , $var6 , *fdsfdsfd , #uytuytuyt , fdsgfdgfdgf , aaabbccc]
I don't need the Java or Python code, I just need the regexp. My incomplete attempt is:
something\s\(.+\)

something\s\((.+)\)
This regex captures the whole string containing all the variables. Split that capture on whitespace, since you are sure the variables are delimited by whitespace.
import re
m = re.search(r'something\s\((.+)\)', input_string)
if m:
    list_of_vars = m.group(1).split()
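Since the question also mentions Java, here is a minimal sketch of the same capture-then-split approach in Java. The class and method names (VarExtractor, extractVars) are made up for illustration, and List.of assumes Java 9 or later.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are not from the original answer.
class VarExtractor {
    // Same pattern as above: capture everything between the parentheses.
    private static final Pattern WRAPPER = Pattern.compile("something\\s\\((.+)\\)");

    static List<String> extractVars(String input) {
        Matcher m = WRAPPER.matcher(input);
        if (!m.find()) {
            return List.of(); // no "something (...)" in the input
        }
        // Split the captured body on runs of whitespace.
        return Arrays.asList(m.group(1).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "something (\\var1 \\var2 $var4 #var5 *fdsfdsfd aaabbccc)";
        System.out.println(extractVars(input));
    }
}
```

A single regex cannot return an unknown number of separate capture groups, which is why the split step is done in code rather than in the pattern.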

Related

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms, "name#domain.com" and "Joe Sixpack <name#domain.com>", which can appear in a comma- or semicolon-delimited string, ignoring whitespace padding. The problem is that the names can contain delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
  .replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
  .replace(/<\|>$/, "") // remove trailing delimiter
  .split(/\s*<\|>\s*/) // split on delimiter including surrounding space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
    return emailAddressList
        .replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
        .replaceAll("<\\|>$", "")
        .split("\\s*<\\|>\\s*");
}

Since you are not validating, I assume that the email addresses are valid.
Based on this assumption, I look for an email address followed by ; or , which tells me where each entry ends.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)

This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:^|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:^|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
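As a rough illustration of how this pattern could be driven from Java, the sketch below walks the matches with Matcher.find() and rebuilds each entry from the capture groups. It uses ^ for the start-of-line branch, as the annotation describes. The class and method names are hypothetical, and named entries are rebuilt with a single space before the angle bracket, which is a normalisation not requested in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Simple Java Mail.
class AddressScanner {
    // An address: characters other than # , ; and whitespace on both sides of '#'.
    private static final String EMAIL = "[^#,;\\s]+#[^#,;\\s]+";

    private static final Pattern ENTRY = Pattern.compile(
        "(" + EMAIL + ")"                                 // group 1: bare address
        + "|(?:^|\\s*[,;])\\s*(.*?)<(" + EMAIL + ")>");   // group 2: name, group 3: address

    static List<String> scan(String input) {
        List<String> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                entries.add(m.group(1));
            } else {
                // Rebuild "name <address>" with normalised spacing.
                entries.add(m.group(2).trim() + " <" + m.group(3) + ">");
            }
        }
        return entries;
    }
}
```

Like the pattern itself, this only covers inputs shaped like the example: a bare address that sits between a separator and a later named entry would be absorbed into that entry's name by the lazy (.*?) group.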

Using Java's replaceAll and split functions (mimicked in JavaScript below), I would lock onto what you know ends an item (the ".com"), replace the separator characters that follow it with a unique temporary token (a UUID, or something like <|>), and then split on that token.
Here is a JavaScript example, but Java's replaceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

search and replace string in java using pattern

Given the string
Content ID [9283745997] Content ID [9283005997] There can be text in between Content ID [9283745953] Content ID [9283741197] Content ID [928374500] There can be valid text here which should not be removed.
I want to remove every piece of text that starts with Content ID and is followed by a number in square brackets, such as [9283745997]; any digits can appear between the brackets. Eventually I want the resulting string to be
There can be text in between There can be valid text here which should not be removed.
Could anyone please provide a regex that captures this recurring text, given that the numerals within the square brackets differ each time?
I appreciate your help!
My solution to this was:
Pattern p = Pattern.compile("(Content ID \\[\\d*\\] )");
Matcher m = p.matcher(str);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "");
}
m.appendTail(sb);
System.out.println(sb);

So basically you are trying to remove every occurrence of Content ID [one or more digits].
To do this you can use the replaceAll("regex", "replacement") method of the String class. As the replacement you can use the empty string "".
The only remaining question is which regex to use.
To match Content ID, just write it literally as "Content ID ".
To match [ or ], add \ before each of them, because they are regex metacharacters and need to be escaped (in a Java string literal you write \ as "\\").
To represent one digit (a character in the range 0-9), regex uses \d (again, in Java you write \ as "\\", which gives "\\d").
To say "one or more of the previously described element", add + after that element. For example, to match one or more letters a, write a+.
Now you should be able to build the correct regex. If you have any questions, feel free to ask them in the comments.
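One way the pieces listed above could be assembled is sketched here. The class name and constant names are illustrative only, and the trailing " ?" (an optional space after the closing bracket) is an extra assumption so that no double spaces are left behind.

```java
// Hypothetical helper assembling the regex from the pieces described above.
class ContentIdStripper {
    private static final String PREFIX = "Content ID "; // literal text
    private static final String OPEN = "\\[";           // escaped [
    private static final String DIGITS = "\\d+";        // one or more digits
    private static final String CLOSE = "\\]";          // escaped ]

    // " ?" also consumes a single trailing space when there is one.
    private static final String REGEX = PREFIX + OPEN + DIGITS + CLOSE + " ?";

    static String strip(String text) {
        return text.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String str = "Content ID [9283745997] Content ID [9283005997] There can be text in between "
                + "Content ID [9283745953] There can be valid text here which should not be removed.";
        System.out.println(strip(str));
    }
}
```

If the same replacement runs many times, compiling the regex once with Pattern.compile and reusing the Pattern avoids recompiling it on every call.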

Try this one:
(Content ID \[[0-9]+\])
You can test it here: http://regexpal.com/

I would use the regex
Content ID \[\d+\] ?
Implement it like this:
str.replaceAll("Content ID \\[\\d+\\] ?", "");
You can find an explanation and demonstration here: http://regex101.com/r/qD5rJ6

How to distinguish in quotes delimiter vs out of quotes delimiter

I have a txt file that contains the following
SELECT TOP 20 personid AS "testQu;otes"
FROM myTable
WHERE lname LIKE '%pi%' OR lname LIKE '%m;i%';
SELECT TOP 10 personid AS "testQu;otes"
FROM myTable2
WHERE lname LIKE '%ti%' OR lname LIKE '%h;i%';
............
The queries above can be any legitimate SQL statements, on one line or several (i.e. however the user wishes to type them in).
I need to split this text into statements and put them into an array:
File file ... blah blah blah
..........................
String myArray [] = text.split(";");
But this does not work properly because it takes ALL ; characters into account. I need to ignore any ; that sits inside double quotes or single quotes. For example, the ; in '%h;i%' does not count because it is inside ''. How can I split correctly?

Assuming that each ; you want to split on is at the end of a line, you can split on ; followed by the line separator, like
text.split(";"+System.lineSeparator())
If your file uses a line separator other than the platform default, you can try
text.split(";\n")
text.split(";\r\n")
text.split(";\r")
BTW, if you want to keep the ; in the split result (rather than discard it), you can use a look-behind like
text.split("(?<=;)"+System.lineSeparator())
If you are reading the file line by line, just check whether line.endsWith(";").
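A minimal sketch of the line-by-line variant, under the same assumption that a statement always ends with ; at the end of a line. The class and method names are hypothetical, and trailing whitespace after the ; is tolerated via trim().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; assumes every statement ends with ';' at the end of a line.
class LineWiseSplitter {
    static List<String> readStatements(BufferedReader reader) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (current.length() > 0) {
                    current.append('\n'); // keep the original line structure
                }
                current.append(line);
                if (line.trim().endsWith(";")) {
                    statements.add(current.toString());
                    current.setLength(0); // start collecting the next statement
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Keep a final statement that has no terminating ';'.
        if (!current.toString().trim().isEmpty()) {
            statements.add(current.toString());
        }
        return statements;
    }
}
```

A ; inside quotes does no harm here unless it happens to be the last character on a line.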

I see a newline after each of your ';'. Does that hold for the whole text file?
If you must (or want to) use a regular expression, you could split with a regex of the form
;$
In Java, $ means "end of line" only in multiline mode (Pattern.MULTILINE, or the inline flag (?m) as in (?m);$); by default it matches only at the very end of the input, or just before a final line terminator.
I would not use a regex for this kind of task. Parsing the text and tracking whether you are inside ' or " quotes is enough to recognize the real ; delimiters.
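The quote-tracking idea could be sketched as a single pass over the characters. The names below are hypothetical, and the sketch assumes standard SQL quoting only: it does not handle SQL comments or backslash escapes. A doubled quote such as 'it''s' works because it simply closes and reopens the quoted state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits on ';' only when outside '...' and "...".
class SqlSplitter {
    static List<String> splitStatements(String text) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0; // 0 = outside quotes, otherwise the opening quote character

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // matching closing quote
                }
                current.append(c);
            } else if (c == '\'' || c == '"') {
                quote = c; // entering a quoted section
                current.append(c);
            } else if (c == ';') {
                addIfNotBlank(statements, current); // real delimiter
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNotBlank(statements, current); // trailing statement without ';'
        return statements;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
    }
}
```

Unlike splitting on ; plus a line separator, this does not depend on how the user laid out the statements across lines.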

Java replace all invalid character for regex

My String is huge and it keeps changing as I read each String in a loop. It can contain any characters like " , / , \ . $ ,? , [ , & , . , ' , ) , % , ^ , + , * etc. I would like to escape all characters that might cause a regex built from this string to fail in Java. For JavaScript, one post suggests something like this:
return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&");
Is there something similar for Java? I'm not sure what should be the character set to escape. Would something like str.replaceAll("[^\u0000-\u00ff]+", " ") do that? (But I'm losing data here if I'm replacing ALL of them with a space, which I want to avoid)

Use this:
String myEscapedString = Pattern.quote(myRawString);
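To show what this buys you, here is a small sketch with two hypothetical helpers. Pattern.quote makes the search text literal; Matcher.quoteReplacement does the same for the replacement side, where $ and \ are otherwise special.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helpers showing literal matching with Pattern.quote.
class QuoteDemo {
    // True if haystack contains needle, with needle taken literally.
    static boolean containsLiteral(String haystack, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(haystack).find();
    }

    // Replaces every literal occurrence of target with a literal replacement.
    static String replaceLiteral(String text, String target, String replacement) {
        return text.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("price is $5.00 (approx)", "$5.00 ("));
        System.out.println(replaceLiteral("a.b.c", ".", "$"));
    }
}
```

No data is lost with this approach, because nothing in the original string is replaced or removed; it is only wrapped so the regex engine reads it as plain text.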

match a string of characters between tags:

I have the following strings:
<PAUL SAINT-KARL 1997-05-07>
<BOB DEAN 2001-05-07>
<GUY JEDDY 2007-05-07>
I want a java regex that would match this type of pattern "name and date" and then extract the name and date separately.
I am able to match them separately with the following Java regexes:
1) (\d{4}-\d{2}-\d{2})>
2) <([ A-Z&#;0-9-]*+)
What I'm looking for is one regex that would identify the full text pattern as provided, and then extract the subsections, such as the actual name, and the date.
I'm looking to use Matcher.group() to retrieve the complete match from the target string.
Thanks

Try this:
"<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>"
I changed the *+ to *? to make the * match lazily.
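A short sketch of how the combined pattern could be used with Matcher, assuming the input looks like the samples in the question. The class and method names are made up for illustration; group(1) is the name, group(2) the date, and group() the complete match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper around the combined pattern from the answer.
class TagParser {
    private static final Pattern TAG =
        Pattern.compile("<([ A-Z&#;0-9-]*?) (\\d{4}-\\d{2}-\\d{2})>");

    // Returns {name, date}, or null when the input contains no matching tag.
    static String[] parse(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("<PAUL SAINT-KARL 1997-05-07>");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

The possessive *+ in the original attempt never gives characters back, so once it is followed by the date group it swallows the date digits and the overall match fails; the lazy *? expands only as far as needed.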
