How can I split this string in Java? [duplicate] - java

This question already has answers here:
Java: splitting a comma-separated string but ignoring commas in quotes
(12 answers)
Closed 9 years ago.
I have a problem splitting a string in Java.
Input string:
"retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";
I want to split it and get the following terms:
retinol
3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid
C034534
81485-25-8
Carcinoma, Hepatocellular
MESH:D006528
Cancer|Digestive system disease
(nothing)
17270033
therapeutic
I tried a few ways to solve this problem, such as Pattern/Matcher and split(","), but I couldn't find the answer.

As discussed in the comments, since you're parsing a CSV file, you're going to want to use a library specifically written to parse CSVs. Otherwise you'll continue to run into problems where what you write is "useless when a different patten comes out" (as you said).
However, to solve the question at hand you just have to split on a comma, ignoring commas inside of quotes. So you can do this (from this answer):
String input = "retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";
String[] output = input.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
for (String s : output) {
    System.out.println(s);
}
This will give you this output (note the quotes and empty line):
retinol
"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid"
C034534
81485-25-8
"Carcinoma, Hepatocellular"
MESH:D006528
Cancer|Digestive system disease

17270033
therapeutic
You can replace the quotes and ignore the empty line as you wish. This loop will print the exact output requested in the question:
int i = 1;
for (String s : output) {
    if (!s.isEmpty()) {
        System.out.println(i++ + ". " + s.replace("\"", ""));
    }
}
Output:
1. retinol
2. 3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid
3. C034534
4. 81485-25-8
5. Carcinoma, Hepatocellular
6. MESH:D006528
7. Cancer|Digestive system disease
8. 17270033
9. therapeutic
But, please, use a library like OpenCSV.
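For a quick experiment without a library, the regex approach above can be wrapped in a small helper. This is only a sketch: the names CsvSplitSketch and splitLine are made up for illustration, it assumes quotes are balanced and that a field is either fully quoted or not quoted at all, and it passes a limit of -1 to split so that trailing empty fields are kept. It is not a substitute for a real CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
class CsvSplitSketch {
    // A comma followed by an even number of double quotes up to the end of the
    // line, i.e. a comma that is outside any quoted field (assumes balanced quotes).
    private static final String COMMA_OUTSIDE_QUOTES = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty fields, which split() drops by default.
        for (String raw : line.split(COMMA_OUTSIDE_QUOTES, -1)) {
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                raw = raw.substring(1, raw.length() - 1); // strip the enclosing quotes
            }
            fields.add(raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        String input = "retinol,\"3,7,11,15-tetramethyl-2,4,6,10,14-hexadecapentaenoic acid\",C034534,81485-25-8,\"Carcinoma, Hepatocellular\",MESH:D006528,Cancer|Digestive system disease,,17270033,therapeutic";
        for (String field : splitLine(input)) {
            // Show empty fields explicitly instead of printing a blank line.
            System.out.println(field.isEmpty() ? "(nothing)" : field);
        }
    }
}
```

Unlike the filtering loop above, this keeps the empty field in place, so the column positions stay aligned with the input.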

Related

How to print the symbols &nbsp; and &lt; and &gt; to a file as space, < and > respectively

How do you print symbols in Java to a file when you have only the symbol description?
I received a string from DB2 which contains symbols.
Two samples:
1) &lt;0800&gt;
2) 51V&nbsp;3801Z
Such a string goes to two different places. One is a JSP rendering it as HTML. That is perfect; I get <0800> and 51V 3801Z, respectively. The other place is a CSV file created with java.io.FileWriter, and there the entities are not converted to "<", ">", and " ". Instead, the string is printed exactly as it came from DB2:
&lt;0800&gt;
and 51V&nbsp;3801Z.
Is there anything in the "new" nio library that could help me? I have tried apache.commons.lang3.StringEscapeUtils.escapeHtml4 without success.
I suggest looking into Apache's StringEscapeUtils, namely the unescapeHtml4() method.
Example:
String input = "&lt;0800&gt;";
String output = StringEscapeUtils.unescapeHtml4(input);
Ensure you are using the unescapeHtml4 method, and not the regular escapeHtml4 method!
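If adding the library is not an option, a hand-rolled stand-in could cover just the entities mentioned in the question. The sketch below is not StringEscapeUtils: the names EntityUnescapeSketch, unescapeBasic and writeLine are hypothetical, only four entities are handled, and &nbsp; is mapped to a plain space because that is what the question asks for (unescapeHtml4 would produce the non-breaking space U+00A0 instead). It assumes the DB2 value contains HTML entities such as &lt;, &gt; and &nbsp;.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical stand-in; handles only the entities named below.
class EntityUnescapeSketch {
    static String unescapeBasic(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&nbsp;", " ")  // assumption: a regular space is wanted
                .replace("&amp;", "&");  // last, so "&amp;lt;" becomes "&lt;", not "<"
    }

    // Works with any Writer, including the java.io.FileWriter from the question.
    static void writeLine(Writer out, String value) throws IOException {
        out.write(unescapeBasic(value));
        out.write(System.lineSeparator());
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter stands in for the FileWriter so the sketch touches no files.
        StringWriter out = new StringWriter();
        writeLine(out, "&lt;0800&gt;");
        writeLine(out, "51V&nbsp;3801Z");
        System.out.print(out);
    }
}
```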

How to replace String with different slashes? [duplicate]

This question already has answers here:
java split function
(6 answers)
Closed 7 years ago.
I need to rename some paths in a database.
I renamed a folder:
String mainFolder = "D:\\test\\1\\data"; // folder renamed from fd
Then I need to rename all files and directories inside that folder:
String file1 = "D:\\test\\1\\fd\\dr.jpg";
String folder1 = "D:\\test\\1\\fd\\fd"; // in this case the last fd needs to be renamed
String folder2 = "src/fd/fd/"; // fake path also needs to be renamed
What is the best and fastest way to rename those strings?
My thoughts about "/":
String folder2= "src/da/da";
String[] splittedFakePath = folder2.split("/");
splittedFakePath[splittedFakePath.length - 2] = "data";
StringBuffer newFakePath = new StringBuffer();
for (String str : splittedFakePath) {
newFakePath.append(str).append("/");
}
String after rename: src/data/da/
But when I try to split by "\":
Arrays.toString(Pattern.compile(File.separator).split(folder1));
I receive:
java.util.regex.PatternSyntaxException: Unexpected internal error near index 1
\
^
Look into Java's String replace(...) method.
It is wonderful for string replacement, much better than attempting a regex.
Keep in mind that real directory handling has a few special cases, which don't lend themselves well to direct string manipulation. For example, '//' often gets compacted to '/' on Unix-like systems, and if you care about proper directory corner cases, then use the Java Path class.
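As a rough illustration of the replace(...) suggestion, and of why the split in the question throws, here is a minimal sketch. The class and method names are hypothetical, and it assumes the renamed folder can be identified by its full old path. String.replace treats both arguments as plain text, while Pattern.quote turns the backslash into a literal so it can be used as a split pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper names, for illustration only.
class PathRenameSketch {
    // Literal replacement: String.replace does not interpret its arguments as regex.
    static String renameByReplace(String path, String oldPart, String newPart) {
        return path.replace(oldPart, newPart);
    }

    // Pattern.quote escapes the separator, so a lone "\" no longer causes
    // a PatternSyntaxException.
    static String[] splitOn(String path, String separator) {
        return path.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        String file1 = "D:\\test\\1\\fd\\dr.jpg";
        System.out.println(renameByReplace(file1, "D:\\test\\1\\fd", "D:\\test\\1\\data"));
        System.out.println(String.join(" | ", splitOn(file1, "\\")));
    }
}
```

Note that replace changes every occurrence of the old text, so the more specific the old part is, the less likely an unrelated segment gets renamed.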

regex extract in hadoop with pipe (escape character) as delimiter

I have a string "Hadoop|regex|Issue". I want to split it using | as the delimiter. I used this code:
String[] afterSplit = string.split("\\|");
but afterSplit contains only 2 strings, "Hadoop" and "regex". I get an ArrayIndexOutOfBoundsException when I try to retrieve afterSplit[2]. I want "Issue" in afterSplit[2].
I also tried the code below:
String regex = Pattern.quote("|");
String[] parts = line.split(regex);
Note: Both work in plain Java code, but I get an error when I try to use them in Hadoop. Please suggest a fix. Thanks.
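The question does not show the records the Hadoop job actually reads, so the cause cannot be pinned down from the text. One possibility, stated here purely as an assumption, is that some records have an empty or missing third field: split with the default limit drops trailing empty strings, so "Hadoop|regex|" yields only two parts. A sketch for checking this, with hypothetical names PipeSplitSketch and fieldAt:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the cause it guards against is an assumption.
class PipeSplitSketch {
    private static final Pattern PIPE = Pattern.compile("\\|");

    // Returns the field at the given index, or null if the record is too short.
    static String fieldAt(String line, int index) {
        String[] parts = PIPE.split(line, -1); // -1 keeps trailing empty fields
        return index < parts.length ? parts[index] : null;
    }

    public static void main(String[] args) {
        System.out.println(fieldAt("Hadoop|regex|Issue", 2));
        // Default limit: the trailing empty field is dropped.
        System.out.println("Hadoop|regex|".split("\\|").length);
        // Guarded access instead of an ArrayIndexOutOfBoundsException.
        System.out.println(fieldAt("Hadoop|regex", 2));
    }
}
```

Logging the lines for which fieldAt returns null would show whether the input, rather than the regex, is the problem.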

Java Western + Arabic String concatenation issues

I'm having trouble concatenating pieces of text that mix Western and Arabic characters.
I have a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate this list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I have a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + " *)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
EDIT 2
Actually, I've discovered that my problem is not a problem. I wrote my string to a file and opened it with nano (in the console), and it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit), which, let's say, incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatenation. You will make your garbage collector a lot happier. Second, without seeing the input or how your StringTokenizer is set up, I would venture a guess that you are having problems tokenizing the string properly.
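A minimal sketch of the StringBuilder suggestion, using the tokens from the question (the names TokenWrapSketch and wrap are made up for illustration). The resulting string holds the characters in the order they were appended; how a console or editor lays out the right-to-left words on screen is a separate rendering matter.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration.
class TokenWrapSketch {
    // Wraps every token as "(token *)" and appends it to a single builder,
    // avoiding a new String object on each loop iteration.
    static String wrap(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            sb.append('(').append(tok).append(" *)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("-LRB-", "دریای", "مازندران", "-RRB-", ",");
        // The on-screen order of the Arabic-script words depends on the viewer.
        System.out.println(wrap(tokens));
    }
}
```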

Unable to split a string

I have a string
Mr praneel PIDIKITI
When I use this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I am only getting two, Mr and Praneel PIDIKITI.
I am unable to split the second string. Does anyone know what could be the problem?
I even used split(" ");.
The problem is that I used replaceAll("\\<.*?>", " ").trim(); to convert HTML into this string, and then I use name.split("\\s+"); to get the name value.
I think it must be something other than space (some special character).
Your code should work. I suspect your input. There could be a non-printable junk character between Praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 +"PIDIKITI";
String[] nameParts = name.split("\\s+");
for (String s : nameParts)
    System.out.println(s);
Are you sure that there is no junk character between Praneel and PIDIKITI?
Remove non-printable characters like this:
// remove non printable characters excluding white space characters
name = name.replaceAll("[^\\p{Print}\\s]","");
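To find out which character is actually sitting between the words, it can help to print the code points first. The sketch below uses hypothetical names (HiddenCharSketch, codePoints, splitName) and assumes, without confirmation from the question, that the culprit is a non-breaking space (U+00A0) left over from the HTML. Java's \s does not match U+00A0 by default, so the split pattern lists it explicitly and treats it as a separator rather than deleting it.

```java
// Hypothetical helper; the U+00A0 guess is an assumption about the input.
class HiddenCharSketch {
    // Renders every code point as U+XXXX so invisible characters become visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Splits on runs of ordinary whitespace and non-breaking spaces.
    static String[] splitName(String name) {
        return name.trim().split("[\\s\\u00A0]+");
    }

    public static void main(String[] args) {
        String name = "Mr praneel\u00A0PIDIKITI"; // assumed input, for demonstration
        System.out.println(codePoints(name));
        for (String part : splitName(name)) {
            System.out.println(part);
        }
    }
}
```

If the dump shows a different code point, the character class in splitName would need to be adjusted accordingly.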
If you're parsing HTML, may I recommend JSoup? It's a good HTML parser for Java.
