Regular expression for splitting JSON text in lines after symbols

Regular expression for splitting JSON text in lines after symbols - java

I am trying to use a regular expression to have this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, does what they say.
1- if this is works also with commas inside double-quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be be used for replacing each symbol with a two-char sequence made of the symbol itself plus \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).

Its not that much simple, but yes if you want to do then you need to filter characters([,{,",',:) and replace then with a new line character against it.
like:
[ should get replaced with [\n
Answer to your question is Yes its very much reliable and good to implement its just a single line of code doing all. Thats what regex is made for.

0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this is works also with commas inside double-quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON sepcification but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account and that makes no. 2 even more complex because now there might be an unescaped double quote in a single quote text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepeding (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources ).

Related

Match custom pattern in regex multiple times

I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specify property and its value that I need.
Here are some examples to illustrate my point. test:property is the property name that we need to match.
Property with a single value: test:property:schema:Person
Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
Property with a single value in brackets: test:property:(schema:Person)
Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue
Also note that other combinations are possible such as other properties being before the property I need to capture, my property having multiple values with another property present in the query.
I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:
#
Match
Groups
1
test:property:schema:Person
schema:Person
2
test:property:(schema:Person OR schema:Organization OR schema:Place)
schema:Personschema:Organizationschema:Person
3
test:property:(schema:Person)
schema:Person
4
test:property:schema:Person
schema:Person
Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).
The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.
If we define the known parts of the string with names I think I can express what I want to match.
schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
( // optional open bracket
<TypeName>
(OR <TypeName>)* // optional additional TypeNames separated by an OR
) // optional close bracket
Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:
(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)
Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.
Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!
(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*

This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.
If you're looking to match the entire sequence, the following regex will work.
test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))
Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.
The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.
(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+
The issue with this method is that it looks like Java lookbehind has some limitations, specifically, not allowing unbound or complex quantifiers. I'm not a Java person so I haven't tried things out for myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!
With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.
To match the entire part inside parentheses or the single value no parentheses, you can use this regex:
test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))
It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma

storing CSV files containing floating point numbers with phone language set to Deutsch [duplicate]

I've got a two column CSV with a name and a number. Some people's name use commas, for example Joe Blow, CFA. This comma breaks the CSV format, since it's interpreted as a new column.
I've read up and the most common prescription seems to be replacing that character, or replacing the delimiter, with a new value (e.g. this|that|the, other).
I'd really like to keep the comma separator (I know excel supports other delimiters but other interpreters may not). I'd also like to keep the comma in the name, as Joe Blow| CFA looks pretty silly.
Is there a way to include commas in CSV columns without breaking the formatting, for example by escaping them?

To encode a field containing comma (,) or double-quote (") characters, enclose the field in double-quotes:
field1,"field, 2",field3, ...
Literal double-quote characters are typically represented by a pair of double-quotes (""). For example, a field exclusively containing one double-quote character is encoded as """".
For example:
Sheet: |Hello, World!|You "matter" to us.|
CSV: "Hello, World!","You ""matter"" to us."
More examples (sheet → csv):
regular_value → regular_value
Fresh, brown "eggs" → "Fresh, brown ""eggs"""
" → """"
"," → ""","""
,,," → ",,,"""
,"", → ","""","
""" → """"""""
See wikipedia.

I found that some applications like Numbers in Mac ignore the double quote if there is space before it.
a, "b,c" doesn't work while a,"b,c" works.

The problem with the CSV format, is there's not one spec, there are several accepted methods, with no way of distinguishing which should be used (for generate/interpret). I discussed all the methods to escape characters (newlines in that case, but same basic premise) in another post. Basically it comes down to using a CSV generation/escaping process for the intended users, and hoping the rest don't mind.
Reference spec document.

If you want to make that you said, you can use quotes. Something like this
$name = "Joe Blow, CFA.";
$arr[] = "\"".$name."\"";
so now, you can use comma in your name variable.

You need to quote that values.
Here is a more detailed spec.

In addition to the points in other answers: one thing to note if you are using quotes in Excel is the placement of your spaces. If you have a line of code like this:
print '%s, "%s", "%s", "%s"' % (value_1, value_2, value_3, value_4)
Excel will treat the initial quote as a literal quote instead of using it to escape commas. Your code will need to change to
print '%s,"%s","%s","%s"' % (value_1, value_2, value_3, value_4)
It was this subtlety that brought me here.

You can use Template literals (Template strings)
e.g -
`"${item}"`

CSV files can actually be formatted using different delimiters, comma is just the default.
You can use the sep flag to specify the delimiter you want for your CSV file.
Just add the line sep=; as the very first line in your CSV file, that is if you want your delimiter to be semi-colon. You can change it to any other character.

This isn't a perfect solution, but you can just replace all uses of commas with ‚ or a lower quote. It looks very very similar to a comma and will visually serve the same purpose. No quotes are required
in JS this would be
stringVal.replaceAll(',', '‚')
You will need to be super careful of cases where you need to directly compare that data though

Depending on your language, there may be a to_json method available. That will escape many things that break CSVs.

I faced the same problem and quoting the , did not help. Eventually, I replaced the , with +, finished the processing, saved the output into an outfile and replaced the + with ,. This may seem ugly but it worked for me.

May not be what is needed here but it's a very old question and the answer may help others. A tip I find useful with importing into Excel with a different separator is to open the file in a text editor and add a first line like:
sep=|
where | is the separator you wish Excel to use.
Alternatively you can change the default separator in Windows but a bit long-winded:
Control Panel>Clock & region>Region>Formats>Additional>Numbers>List separator [change from comma to your preferred alternative]. That means Excel will also default to exporting CSVs using the chosen separator.

You could encode your values, for example in PHP base64_encode($str) / base64_decode($str)
IMO this is simpler than doubling up quotes, etc.
https://www.php.net/manual/en/function.base64-encode.php
The encoded values will never contain a comma so every comma in your CSV will be a separator.

You can use the Text_Qualifier field in your Flat file connection manager to as ". This should wrap your data in quotes and only separate by commas which are outside the quotes.

First, if item value has double quote character ("), replace with 2 double quote character ("")
item = item.ToString().Replace("""", """""")
Finally, wrap item value:
ON LEFT: With double quote character (")
ON RIGHT: With double quote character (") and comma character (,)
csv += """" & item.ToString() & ""","

Double quotes not worked for me, it worked for me \". If you want to place a double quotes as example you can set \"\".
You can build formulas, as example:
fprintf(strout, "\"=if(C3=1,\"\"\"\",B3)\"\n");
will write in csv:
=IF(C3=1,"",B3)

A C# method for escaping delimiter characters and quotes in column text. It should be all you need to ensure your csv is not mangled.
private string EscapeDelimiter(string field)
{
if (field.Contains(yourEscapeCharacter))
{
field = field.Replace("\"", "\"\"");
field = $"\"{field}\"";
}
return field;
}

JSON broken when double quotes comes inside the key/value

Sample data:
{"630":{"TotalLength":"33-3/8" - 36-3/4""},"631":{"Length":"34 37 7/8"}}
We are facing the double quotes issue in JSON response. How we can replace the double quotes with " \" " which comes inside the key or value? Java is the development platform.

This answer is assuming that you are not in control of creating this JSON-like string. If you can control that part, then you should be escaping properly there itself.
In this case, since parsing systematically is not an option as it's not a valid JSON yet, all I could suggest is to go through the various strings and see if you can find a pattern on which you can apply some logic and escape all the "s which prevent the string from being a valid JSON.
Here is probably a way to start:
All of the "s that are needed to be there for the string to be a vaild JSON are surrounded by one or multiple characters among {, :, ,, and }, with or without space in between the " and the other JSON characters.
So, if you parse the JSON-like string using Java and look for all the "s, and, when encountered with one, if they are along with any of the above characters (with or without space in between), you just leave it as it is. If not, replace that " with a \".
Note that the above method may or may not work depending on the data in question. What I mean to convey is the approach that you may find useful if there's absolutely no way for the string to be escaped during it's creation, and, if these strings follow a strict pattern with respect to the unescaped "s.

How to write a regular expression that can matches the whole function #Prompt (…) whatever written inside () even if it contains another ()

For example I want replace any prompt function in an SQL query
I have used this expression
Query = Query.replaceAll("#prompt\\s*\\(.*?\\)", "(1)");
This expression works in this example
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
#Prompt('Bill Cycle','A','MIGRATION\BC',,,)
and the output is is (1)
but when it does not work on this example
#Prompt('Groups','A','[Lookup] Price Group (2)\Nested Group',,,)
the out put is (1) \Nested Group',,,) which is not valid

Sadly, as pointed out by Joe C in a comment, what you are trying to do cannot be done in a regular expression for arbitrary depth parenthesis. The reason is because regular-expressions are not capable of "counting". You need a stack machine for that, or a context-free language parser.
However, you also suggest that the 'prompted' content is always inside single quotes. I assume below the standard Java regexp library. Other regexp libraries might need translation...
"#Prompt\\('[^']*'(\s*,\s*(('[^']*')|([^',)]*)))*\\)"
So, you are searching within prompt for blocks of single-quoted text. The search assumes that each internal bit of content is enclosed in single quotes.
Verify at https://regex101.com/r/nByy0Y/1 (I made a couple fixes). Note that at regex101.com, it will treat the double back-slash as intending a literal back-slash. What you want instead is just to quote the parenthesis so that you want a literal parenthesis.

Because you are using the lazy quantifier '?', it is stopping the match at the end of the first ')'. removing that will let it go to the end greedily, as such:
#prompt\(.*\)
But if there is concern that the entries may have more parans after the one in question, it will cause problems.
Assuming the additional parens will always be in quotes, you can do this:
#prompt\((('([^'])*',*)*|(.*,*)*)\)
Here is it looking for items wrapped in single quotes OR text without parens, which should capture all of the single quoted elements or null params or unquoted text params

Java Splitting a String

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
That I have been trying to split in java. The file is comma delimeted but some of the strings have commas within them and I don't want them to get split up. Currently in the above example
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how to best split this up by comma's. Not all of the strings have the " " for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.

http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.

I would think that this would be a multi step process. First, find all the comma's in quotes from your original string, replace it with something like {comma}. You can do this with some regex. Then on the new string, split the new string with the comma symbol(,). Then go through your list, and replace the {comma} with the comma symbol {,}.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.