Java Splitting a String - java

I have this string
G234101,Non-Essential,ATPases,Respiration chain complexes,"Auxotrophies, carbon and",PS00017,2,IONIC HOMEOSTASIS,mitochondria.
that I have been trying to split in Java. The file is comma delimited, but some of the strings have commas within them and I don't want them to get split up. Currently, in the above example,
"Auxotrophies, carbon and"
is getting split into two strings.
Any suggestions on how best to split this up by commas? Not all of the strings have the " ", for example the following string:
G234103,Essential,Protein Kinases,?,Cell cycle defects,PS00479,2,CELLULAR COMMUNICATION/SIGNAL TRANSDUCTION,cytoplasm.

http://opencsv.sourceforge.net/
But if you really do need to reinvent the wheel (homework), you need to use a more complicated regular expression than just "what,ever".split(","). It's not simple though. And you might be better off creating your own custom Lexer. http://en.wikipedia.org/wiki/Lexical_analysis
This isn't too hard in your case. As you process your text character by character you just need to keep track of opening and closing quotes to decide when to ignore commas and when to act on them.
Also see StreamTokenizer for a built-in configurable Lexer - you should be able to use this to meet your requirements.
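The character-by-character idea above can be sketched roughly as follows. This is a minimal sketch, not a full CSV parser: the class and method names are made up for illustration, and it assumes a quote only ever opens or closes a field (no doubled "" escapes inside a quoted field).

```java
import java.util.ArrayList;
import java.util.List;

class QuotedCsvSplitter {
    // Hypothetical helper: splits one line on commas that are outside double quotes.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A quote toggles the state; the quote character itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field has no trailing comma
        return fields;
    }
}
```

With the first line from the question this should give nine fields, with "Auxotrophies, carbon and" kept together as one of them.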

I would think that this would be a multi-step process. First, find all the commas inside quotes in your original string and replace each one with something like {comma}. You can do this with some regex. Then split the new string on the comma symbol (,). Finally, go through your list and replace each {comma} with the comma symbol (,).
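A rough sketch of that multi-step idea, assuming the text {comma} never occurs in the real data and that quoted sections contain no embedded quotes (the class name and the QUOTED pattern are illustrative only):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplit {
    // Assumption: a quoted section is a pair of double quotes with no quote in between.
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static String[] split(String line) {
        // Step 1: mask commas that sit inside quoted sections.
        Matcher m = QUOTED.matcher(line);
        StringBuffer masked = new StringBuffer();
        while (m.find()) {
            String protectedText = m.group().replace(",", "{comma}");
            m.appendReplacement(masked, Matcher.quoteReplacement(protectedText));
        }
        m.appendTail(masked);
        // Step 2: split on the remaining (real) delimiters.
        String[] parts = masked.toString().split(",");
        // Step 3: put the masked commas back.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace("{comma}", ",");
        }
        return parts;
    }
}
```

Note that in this sketch the surrounding quotes stay in the field, so you would still have to strip them if you don't want them.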

Related

String split fails due to unwanted delimiter

This is the string I need to split for putting in map as key-val pair:
"jti":"4ef61081-e2e0-40e4-a9ad-8f2bf33f8923","exp":1525357546,"nbf":0,"iat":1525271146,"iss":"https://dev.open-sunbird.org/auth/realms/sunbird","aud":"admin-cli"
I tried with
String[] parts = body.split(":|,");
Problem with this approach is the ":" in https link. See the output as follows
--"jti"--"4ef61081-e2e0-40e4-a9ad-8f2bf33f8923"
--"exp"--1525357546
--"nbf"--0
--"iat"--1525271146
--"iss"--"https
--//dev.open-sunbird.org/auth/realms/sunbird"--"aud"
Any lead on the exact regex to solve the issue will be appreciated. (Off the top of my head, we could check that every split word either starts and ends with " or neither starts nor ends with ". But I feel that is a naive approach, even if we can do it.)
No need to get fancy with regex. There are a couple options.
This is clearly claims / attributes on a JWT token. Use a library to parse the JWT instead of parsing the string this way.
Just split first by commas, and then by the FIRST colon. Should give you what you want without trying to respect the position of the quotes.
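For the second option, a minimal sketch could look like this. The class name is made up, and it assumes no value contains a comma (true for the sample string, not for JSON in general). The limit argument 2 of String.split is what keeps the colon in the https link intact.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ClaimSplitter {
    // Hypothetical helper: splits on commas, then on the FIRST colon of each pair.
    static Map<String, String> toMap(String body) {
        Map<String, String> claims = new LinkedHashMap<>();
        for (String pair : body.split(",")) {
            String[] keyValue = pair.split(":", 2); // at most two parts
            if (keyValue.length == 2) {
                claims.put(keyValue[0], keyValue[1]);
            }
        }
        return claims;
    }
}
```

Keys and values keep their surrounding quotes here, so the "iss" entry is looked up with the quoted key.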
It's JSON, so use a JSON parser.

How to validate String in Java by matches?

To validate String in Java I can use String.matches(). I would like to validate a simple string "*.txt" where "*" means anything. Input e.g. test.txt is correct, but test.tt is not correct, because of ".tt". I tried to use matches("[*].txt"), but it doesn't work. How can I improve this matches? Thanks.
Do not use code you don't understand!
For your simple problem you could totally avoid using a regular expression and just use
yourString.endsWith(".txt")
and if you want to perform this comparison case insensitive (i.e. allow ".TXT" or ".tXt") use
yourString.toLowerCase().endsWith(".txt")
If you want to learn more about regular expressions in Java, I'd recommend a tutorial. For example this one.
You may try this for txt files:
"file.txt".matches("^.*[.]txt$")
Basically ^ means the start of your string. .* means match anything greedy, hence as much as you can get to make the expression match. And [.] means match the dot character. The suffix txt is just the txt text itself. And finally $ is the anchor for the end of the string, which ensures that the string does not contain anything more.
Use .+, which matches one or more of any character. This ensures that an input consisting only of .txt is rejected.
matches(".+[.]txt")
FYI: [.] simply matches with the dot character.
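To compare the suggestions above side by side, here is a small sketch (class and method names are made up for illustration):

```java
import java.util.Locale;

class TxtNameCheck {
    // Regex variant: at least one character, then a literal dot, then "txt".
    static boolean isTxtName(String name) {
        return name.matches(".+[.]txt");
    }

    // Non-regex variant: case-insensitive suffix check.
    static boolean endsWithTxtIgnoreCase(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".txt");
    }
}
```

String.matches() always tests the whole string, so the ^ and $ anchors are optional with it. Note that the endsWith variant also accepts a bare ".txt", while ".+[.]txt" does not.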

Best delimiter to separate multiple regex

I need to put multiple regular expressions in a single string and then parse it back as separate regex. Like below
regex1<!>regex2<!>regex3
Problem is I am not sure which delimiter will be best to use to separate the expressions in place of <!> shown in example, so that I can safely split the string when parsing it back.
Constraints are, I can not make the string in multiple lines or use xml or json string. Because this string of expressions should be easily configurable.
Looking forward for any suggestion.
Edited:
Q: Why does it have to be a single string?
A: The system has a configuration manager that loads config from properties file. And properties are containing lines like
com.some.package.Class1.Field1: value
com.some.package.Class1.Expressions: exp1<!>exp2<!>exp3
There is no way to write the value in multiple lines in the properties file. That's why.
The best way would be to use an invalid regex as the delimiter, such as **, because it cannot be used in a normal regex: it won't work and would throw an exception. (Note: ++ is valid.)
regex1+"**"+regex2
Now you can split it with this regex
(?<!\\\\)[*][*](?![*])
(?<!\\\\) checks that the * is not escaped
(?![*]) avoids a wrong match in a pattern like "A*"+"**"+"n+"
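Following is a list of sequences that are invalid as a regex on their own
[+
(+
[*
(*
[?
*+
** (delimiter would be (?<!\\\\)[*][*](?![*]))
?? (delimiter would be (?<!\\\\)[?][?](?![?]))
Note that *+ and ?? can still occur inside a valid Java regex as possessive and reluctant quantifiers (a*+, a??), and the sequences starting with [ occur inside valid character classes such as [+], so ** is the safest choice from this list.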
While splitting you need to check if they are escaped
(?<!\\\\)delimiter
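Putting the ** delimiter and the escape check together, a sketch might look like this. The class and method names are made up for illustration. One known gap: an expression that ends in an escaped backslash (\\) directly before the delimiter would be treated as an escaped star.

```java
import java.util.regex.Pattern;

class RegexListSplit {
    // "**" not preceded by a backslash and not followed by a third '*'.
    private static final Pattern DELIMITER =
            Pattern.compile("(?<!\\\\)[*][*](?![*])");

    // Hypothetical helper: splits the joined config value back into expressions.
    static String[] splitExpressions(String joined) {
        return DELIMITER.split(joined);
    }
}
```

For example, "A*" + "**" + "n+" is split back into A* and n+, because the lookahead pushes the match to the last two stars.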
The best delimiter depends on your requirements. As a best practice, use a sequence of special characters, so that the chance of this sequence occurring in an expression is minimal,
like
$$**##$$
#$%&&%$#
I think this may be helpful for you.
First you have to replace the tag content with a single special character and then split:
String inputString="regex1<!>regex2<!>regex3";
String noHTMLString = inputString.replaceAll("\\<.*?>","-");
String[] splitString1 = (noHTMLString.split("[-]+"));
for (String string : splitString1) {
    System.out.println(string);
}

Suggested ways of reading a text file with inconsistent formatting

I'm trying to read a text file of numbers as a double array, and after trying various methods (usually resulting in an input format exception) I have come to the conclusion that the text file I am trying to read is inconsistent in its delimiting.
The majority of the text format is in the form "0.000,0.000" so I have been using a Scanner and the useDelimiter(",") to read in each value.
It turns out though (this is a big file of numbers) that some of the formatting is in the form "0.000 0.000" (at the end of a line I presume) which of course produces an input format exception.
This is an open question really, I'm a pretty basic Java programmer so I would just like to see if there are any suggestions/ways of performing this. Is Scanner the correct class to go on this?
Thank you for your time!
Read file as text line-by-line. Then split line into parts:
String[] parts = line.split("[ ,]");
Now iterate over the parts and call Double.parseDouble() for each part.
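A sketch of the line-by-line approach, under the assumption that values are separated by commas and/or whitespace. The class and method names are made up, and the pattern uses [,\\s]+ rather than "[ ,]" so that runs like ", " or tabs count as one delimiter.

```java
import java.util.ArrayList;
import java.util.List;

class MixedDelimiterParser {
    // Hypothetical helper: parses one line such as "0.000,0.000 0.000".
    static double[] parseLine(String line) {
        List<Double> values = new ArrayList<>();
        for (String part : line.trim().split("[,\\s]+")) {
            if (!part.isEmpty()) { // an empty line yields one empty part
                values.add(Double.parseDouble(part));
            }
        }
        double[] result = new double[values.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = values.get(i);
        }
        return result;
    }
}
```

You would call this for each line returned by something like BufferedReader.readLine() and collect the results.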
Scanner allows any Java Regex Pattern to function as a delimiter. You should be able to use any number of delimiters by doing the following:
scanner.useDelimiter("[,\\s]"); // Will match commas and whitespace
I'd like to post this as a comment instead of making it a separate answer, but my reputation is too low. Apologies, Alex.
You mentioned having two different delimiter characters used in different instances, not a combination of the two as a single delimiter.
You can use the vertical bar as logical OR in a regular expression. Note that inside a character class such as [,|\\s] the bar is a literal character, so put it outside the brackets:
scanner.useDelimiter(",|\\s"); // Will match commas or whitespace as appropriate
line by line:
String[] parts = line.split(",|\\s");
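If you would rather stay with Scanner, a sketch along these lines should work. Scanner's real method is useDelimiter; the class and method names below are made up. It assumes the numbers use a dot as decimal separator, hence the explicit locale, and it stops at the first token that is not a double.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

class ScannerNumbers {
    // Hypothetical helper: reads every double from text delimited by commas or whitespace.
    static List<Double> readAll(String text) {
        List<Double> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useLocale(Locale.US);        // "0.000" style decimals
            scanner.useDelimiter("[,\\s]+");     // one or more commas/whitespace characters
            while (scanner.hasNextDouble()) {
                values.add(scanner.nextDouble());
            }
        }
        return values;
    }
}
```

For a real file you would construct the Scanner from a File or Path instead of a String; the delimiter setup stays the same.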

Regular expression for splitting JSON text in lines after symbols

I am trying to use a regular expression to have this kind of string
{
"key1"
:
value1
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
valueA
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}
from
JSONObject.toString()
that is one long line of text in my Android Java app
{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}
I found this regular expression for finding all commas.
/(,)(?=(?:[^"]|"[^"]*")*$)/
Now I need to know:
0- if this is reliable, that is, does what they say.
1- if this also works with commas inside double-quotes.
2- if this takes into account escaped double-quotes.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
5- It has to be used with the multi-line flag to work with multi-line text.
6- It has to work with replaceAll().
The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus the \n character.
The resulting text has to be still JSON text.
Subsequent replace actions will take place also for the other symbols
: [ ] { }
and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).
It's not that simple, but if you want to do it this way, you need to find the characters ([,{,",',:) and replace each one with itself followed by a newline character.
like:
[ should get replaced with [\n
The answer to your question is yes: it is quite reliable and easy to implement, since a single line of code does it all. That's what regex is made for.
0- if this is reliable, that is, does what they say.
Let's break down the expression a little:
(,) is a capturing group that matches a single comma
(?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
(?:...)* would be a non-capturing group that can occur 0 to many times
[^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")
As you can see especially the last part could make it unreliable if there are escaped double quotes in a text value, so the answer would be "this is reliable if the input is simple enough".
1- if this also works with commas inside double-quotes.
If the double quote pairs are correctly identified any commas in between would be ignored.
2- if this takes into account escaped double-quotes.
Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.
Single quotes aren't allowed by the JSON specification, but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account, and that makes no. 2 even more complex because now there might be an unescaped double quote in a single-quoted text.
5- It has to be used with the multi-line flag to work with multi-line text.
I'm not entirely sure about that, but adding the multi-line flag shouldn't hurt. You could add it to the expression itself though, i.e. by prepending (?m).
6- It has to work with replaceAll().
In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.
Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.
Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/
In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources ).
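If plain indentation is enough, org.json's JSONObject also has a toString(int) overload that pretty-prints. If you really need every symbol on its own line, as in the question, a hand-rolled formatter can be sketched like this. It is only a sketch with made-up names: it assumes valid JSON with double-quoted strings, tracks whether it is inside a string, and honours backslash escapes, which is exactly the part the regex struggles with.

```java
class JsonLineSplitter {
    // Hypothetical helper: puts each structural symbol outside string literals on its own line.
    static String splitIntoLines(String json) {
        StringBuilder out = new StringBuilder();
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (escaped) {
                    escaped = false;          // character after a backslash is taken literally
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;         // unescaped quote closes the string
                }
            } else if (c == '"') {
                inString = true;
                out.append(c);
            } else if ("{}[]:,".indexOf(c) >= 0) {
                // Finish the current line (if any), then emit the symbol on its own line.
                if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                    out.append('\n');
                }
                out.append(c).append('\n');
            } else if (!Character.isWhitespace(c)) {
                out.append(c);                // numbers, true, false, null
            }
        }
        return out.toString();
    }
}
```

Since only whitespace outside string literals is added or removed, the result should still be valid JSON text. Single-quoted strings from manual edits are not handled here.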
