Regex for almost JSON but not quite - java

Hello all I'm trying to parse out a pretty well formed string into it's component pieces. The string is very JSON like but it's not JSON strictly speaking. They're formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
With output just as chunks of text nothing special has to be done at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression I am able to get most of the fields separated out
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
Which will split on all the commas not in quotes of any type, but I can't seem to make the leap to where it splits on commas not in brackets or braces as well.

Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately, and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting to Java shouldn't be too hard.
# just comma
sep_re = re.compile(r',')
# open paren or open bracket
inc_re = re.compile(r'[[(]')
# close paren or close bracket
dec_re = re.compile(r'[)\]]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')
# This class could've been just a generator function, but I couldn;'t
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
def __init__(self):
self.pos = 0
def _match(self, regex, s):
m = regex.match(s, self.pos)
if m:
self.pos += len(m.group(0))
self.token = m.group(0)
else:
self.token = ''
return self.token
def tokenize(self, s):
field = '' # the field we're working on
depth = 0 # how many parens/brackets deep we are
while self.pos < len(s):
if not depth and self._match(sep_re, s):
# In Java, change the "yields" to append to a List, and you'll
# have something roughly equivalent (but non-lazy).
yield field
field = ''
else:
if self._match(inc_re, s):
depth += 1
elif self._match(dec_re, s):
depth -= 1
elif self._match(chunk_re, s):
pass
else:
# everything else we just consume one character at a time
self.token = s[self.pos]
self.pos += 1
field += self.token
yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:
The string escapes are really lazy: it only supports \" in double quoted strings and \' in single-quoted strings. This is easy to fix.
It only keeps track of nesting level. It does not verify that parens are matched up with parens (rather than brackets). If you care about that you can change depth into some sort of stack and push/pop parens/brackets onto it.

Instead of splitting on the comma, you can use the following regular expression to match the chunks that you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up grouping so it will separate out the "key" and the "value". It will do the same in Java - See it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
Regular Expression explained:
(?:^| ) Non-capturing group that matches the beginning of a line, or a space
(.+?) Matches the "key" before the...
= equal sign
(\{.+?\}|\[.+?\]|.+?) Matches either a set of {characters}, [characters], or finally just characters
(?=,|$) Look ahead that matches either a , or the end of a line.

Related

how to check regex starts and ends with regex

I am having the regex for capturing string if they are in between double quote and not start or end with /.
But the regex solution which I wanted.
The regex should not capture
Condition 1. Capture text between two double or single quotes.
Condition 2. But it shouldn't capture if starts with [ and ends with ]
Condition 3. But it shouldn't if starts with /" and ends with /' or starts with /" and ends with /'
Example:
REGEX: \"(\/?.)*?\"
Input: Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson("test"), "m2m:cin.as"),"payloads_ul.test"),"[/"Dimming Value/"]",input["test"]["in"])
output:
captured output:
1. "test"
2. "m2m:cin.as"
3. "payloads_ul.test"
4. [/"Dimming Value/"]
5. "test"
6. "in"
Expected result:
1. "test"
2. "m2m:cin.as"
3. "payloads_ul.test"
4. [/"Dimming Value/"]
Condition 1 explanation:
Capture the text between double or single quotes.
example:
input : "test","m2m:cin.as"
output: "test","m2m:cin.as"
Condition 2 explanation:
If the regex is between starts with [ and ends with ] but it is having double or single quote then also it should not capture.
example:
input: ["test"]
output: it should not capture
Condition 3 explanation:
In the above-expected result for the input "[/"Dimming Value/"]" there is a two-time double quote but is capturing only one excluding /". So, the output is [/"Dimming Value/"]. Like this, I want if /' (single quote preceded by /).
Note:
For input "[/"Dimming Value/"]" or '[/'Dimming Value/']', here although the text is between double quote and single quote and having [ and ] it should not ignore the string. The output should be [/"Dimming Value/"].
As I understood, you want to capture text between double quotes, except:
if initial double quotes prefixed by [ or final double quotes suffixed by ]
doubles quotes prefixed by / should not be the begin or end of matched text
I don't know if you want also capture text between single quotes, because you text is not complete clear.
For create a non capture group with negative matching of prefixed chars, you need a group of type Negative Lookbehind, with syntax (?<!prefix that you dont want), but this is not present on java or javascript regex engine.
The best regex that I build to return what you want for you example (but only work on PHP or python (you can check it on site regex101.com or similar)) is:
(?<![\[/])\"(?!\])(\/?.)*?\"(?![\]/])
I added the restriction for don't match if initial double quotes suffixed by ] to prevent match "][" on text ["test"]["in"]
Anyway, this will not solve your problem, since will not work within java or javascript engine!
Do you have any way to process the results, and exclude the bad matches?
If so, you can match bad prefix and bad suffix and exclude it from the results:
[\[]?\"(\/?.)*?\"[\]]?
this will return:
"test"
"m2m:cin.as"
"payloads_ul.test"
"[/"Dimming Value/"]"
["test"]
["in"]
Full javascript code, including pos processing:
'Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson("test"), "m2m:cin.as"),"payloads_ul.test"),"[/"Dimming Value/"]",input["test"]["in"])'
.match(/[\[]?\"(\/?.)*?\"[\]]?/g).filter(s => !s.startsWith('[') && !s.endsWith(']'))
this will return:
"test"
"m2m:cin.as"
"payloads_ul.test"
"[/"Dimming Value/"]"
EDIT:
equivalent java code:
CharSequence yourStringHere = "Functions.getJsonPath(Functions.getJsonPath(Functions.getJsonPath(Functions.unescapeJson(\"test\"), \"m2m:cin.as\"),\"payloads_ul.test\"),\"[/\"Dimming Value/\"]\",input[\"test\"][\"in\"])";
Matcher m = Pattern.compile("[\\[]?\\\"(\\/?.)*?\\\"[\\]]?")
.matcher(yourStringHere);
while (m.find()) {
String s = m.group();
if (!s.startsWith("[") && !s.endsWith("]")) {
allMatches.add(s);
}
}

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms "name#domain.com" and "Joe Sixpack ", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contains delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
since you are not validating, i assume that the email addresses are valid.
Based on this assumption, i will look up an email address followed by ; or , this way i know its valid.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a javascript example, but Java's repalceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Matching groups with lookahead expression

I have problem with matching groups that contain lookahead expression. I don't know why this expressions doesn't work:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
When I compile them separately, for example:
"""(?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)"""
I get the correct result
My sample text:
May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection
Expressions have been checked with this tool: http://regex-testdrive.com/en/dotest
My Scala code:
import scala.util.matching.Regex
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = new Regex("""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%))((?<=[\w:]\s)(\w+)(?=\s[cr]))""")
val result = regex.findAllIn(text)
Does anyone know solution of this problem?
Multiple matching
You may fix the pattern as
^.*?(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])
See the regex demo. The main point is to introduce the | alternation operator to match either of the 2 subpatterns. Note you do not need to put the ^ start of string anchor into a lookbehind, as ^ is already a zero-width assertion. Also, there are too many groupings that you do not seem to use any way. Also, to match a literal dot you need to escape it (. -> \.).
To obtain the multiple matches, you may use the following code snippet:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^.*?(?=\s\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s%)|(?<=[\w:]\s)\w+(?=\s[cr])""".r
val result = regex.findAllIn(text)
result.foreach { x => println(x) }
// => May 5 23:00:01
// UDP
See the Scala online demo.
Note that once a pattern is used with .FindAllIn, it is not anchored by default, so you will get all the matches there are in the input string.
Capturing groups
Another approach you may use is matching the whole line while capturing the necessary bits with capturing groups:
val text = "May 5 23:00:01 10.14.3.10 %ASA-6-302015: Built inbound UDP connection"
val regex = """^(.*?)\s+\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%.*[\w:]\s+(\w+)\s+[cr].*""".r
val results = text match {
case regex(date, protocol) => Array(date, protocol)
case _ => Array[String]()
}
// Demo printing
results.foreach { m =>
println(m)
}
See another Scala demo. Since match requires a full string match, .* is added at the end of the pattern, and only relevant pairs of unescaped (...) are kept in the pattern. See the regex demo here.
your matches are not next to each other,
try this:
"""((?<=^)(.*)(?=\s\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\s%)).*((?<=[\w:]\s)(\w+)(?=\s[cr]))"""
I just added the .* between them, it works on the link you sent :)

Regex match repeating pattern only after string

let a PropDefinition be a string of the form prop\d+ (true|false)
I have a string like:
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
I'd like to extract the bottom PropDefinitions only after the text 'sat', so the matches should be:
prop0 false
prop1 false
prop2 true
I originally tried using /(prop\d (?:true|false))/s (see example here) but that obviously matches all PropDefinitions and I couldn't make it match repeats only after the sat string
I used rubular as an example above because it was convenient, but I'm really looking for the most language agnostic solution. If it's vital info, I'll most likely be using the regex in a Java application.
str =<<-Q
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
Q
p str[/^sat(.*)/m, 1].scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
When you have patterns that are very different in nature as in this case (string after sat and selecting the specific patterns), it is usually better to express them in multiple regexes rather than trying to do it with a single regex.
s = <<_
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
_
s.split(/^sat\s+/, 2).last.scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]
\s+[(]+\K(prop\d (?:true|false)(?=[)]))
Live demo
If Ruby can support the \G anchor this is one solution.
It looks nasty, but several things are going on.
1. It only allows a single nest (outer plus many inners)
2. It will not match invalid forms that don't comply with '(prop\d true|false)'
Without condition 2, it would be alot easier which is an indicator that a two regex
solution would do the same. First to capture the outer form sat((..)..(..)..)
second to globally capture the inner form (prop\d true|false).
Can be done in a single regex, though this is going to be hard to look at, but should work (test case below in Perl).
# (?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))
(?:
(?! \A | sat \s* \( )
\G # Start match from end of last match
| # or,
sat \s* \( # Start form 'sat ('
)
[^()]* # This check section consumes invalid inner '(..)' forms
(?: # since we are looking specifically for '(prop\d true|false)'
\(
(?!
prop \d [ ]
(?: true | false )
\)
)
[^()]*
\)
[^()]*
)* # End section, do optionally many times
\(
( # (1 start), match inner form '(prop\d true|false)'
prop \d [ ]
(?: true | false )
) # (1 end)
\)
(?= # Look ahead for end form '(..)(..))'
(?:
[^()]*
\( [^()]* \)
)*
[^()]*
\)
)
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))/g)
{
print "'$1'\n";
}
__DATA__
((prop10 true))
sat
((prop3 false)
(asdg)
(propa false)
(prop1 false)
(prop2 true)
)
((prop5 true))
Output >>
'prop3 false'
'prop1 false'
'prop2 true'
Part of the confusion has to do with SingleLine vs MultiLine matching. The patterns below work for me and return all matches in a single execution and without requiring a preliminary operation to split the string.
This one requires SingleLine mode to be specified separately (as in .Net RegExOptions):
(?<=sat.*)(prop\d (?:true|false))
This one specifies SingleLine mode inline which works with many, but not all, RegEx engines:
(?s)(?<=sat.*)(?-s)(prop\d (?:true|false))
You don't need to turn SingleLine mode off via the (?-s) but I think it is clearer in its intent.
The following pattern also toggles SingleLine mode inline, but uses a Negative LookAhead instead of a Positive LookBehind as it seems (according to regular-expressions.info [be sure to select Ruby and Java from the drop-downs]) the Ruby engine doesn't support LookBehinds--Positive or Negative--depending on the version, and even then doesn't allow quantifiers (also noted by #revo in a comment below). This pattern should work in Java, .Net, most likely Ruby, and others:
(prop\d (?:true|false))(?s)(?!.*sat)(?-s)
/(?<=sat).*?(prop\d (true|false))/m
Match group 1 is what you want. See example.
BUT, I would really recommend split the string first. It's much easier.

Extracting data from a text file - repeated values

79 0009!017009!0479%0009!0479 0009!0469%0009!0469
0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449
0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419
0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009
0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339
0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032
In this data, I'm supposed to extract the number 47, 46 , 45 , 44 and so on. I´m supposed to avoid the rest. The numbers always follow this flow - 9!0 no 9%
for example: 9!0 42 9%
Which language should I go about to solve this and which function might help me?
Is there any function that can position a special character and copy the next two or three elements?
Ex: 9!0 42 9% and ' 009
look out for ! and then copy 42 from there and look out for ' that refers to another value (009). It's like two different regex to be used.
You can use whatever language you want, or even a unix command line utility like sed, awk, or grep. The regex should be something like this - you want to match 9!0 followed by digits followed by 0%. Use this regex: 9!0(\d+)0% (or if the numbers are all two digits, 9!0(\d{2})0%).
The other answers are fine, my regex solution is simply "9!.(\d\d)"
And here's a full solution in powershell, which can be easily correlated to other .net langs
$t="79 0009!017009!0479%0009!0479 0009!0469%0009!0469 0009!0459%0009!0459'009 0009!0459%0009!0449 0009!0449%0009!0449 0009!0439%0009!0439 0009!0429%0009!0429'009 0009!0429%0009!0419 0009!0419%0009!0409 000'009!0399 0009!0389%0009!0389'009 0009!0379%0009!0369 0009!0349%0009!0349 0009!0339%0009!0339 0009!0339%0009!0329'009 0009!0329%0009!0329 0009!032"
$p="9!.(\d\d)"
$ms=[regex]::match($t,$p)
while ($ms.Success) {write-host $ms.groups[1].value;$ms=$ms.NextMatch()}
This is perl:
#result = $subject =~ m/(?<=9!0)\d+(?=9%)/g;
It will give you an array of all your numbers. You didn't provide a language so I don't know if this is suitable for you or not.
Pattern regex = Pattern.compile("(?<=9!0)\\d+(?=9%)");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}

Categories