Find Comments with StringTokenizer - java

I used the following code to count the number of comments in a piece of source code:
StringTokenizer stringTokenizer = new StringTokenizer(str);
int x = 0;
while (stringTokenizer.hasMoreTokens()) {
if (exists == false && stringTokenizer.nextToken().contains("/*")) {
exists = true;
} else if (exists == true && stringTokenizer.nextToken().contains("*/")) {
x++;
exists = false;
}
}
System.out.println(x);
It works if comments have spaces:
e.g.: "/* fgdfgfgf */ /* fgdfgfgf */ /* fgdfgfgf */".
But it does not work for comments without spaces:
e.g.: "/*fgdfgfgf *//* fgdfgfgf*//* fgdfgfgf */".

Using the StringUtils class from Apache Commons Lang, you can achieve this very easily:
String str = "Your String";
// every complete comment ends with "*/", so counting the closing delimiters is enough
int x = StringUtils.countMatches(str, "*/");
System.out.println(x);
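If pulling in Commons Lang is not an option, the same idea can probably be sketched with plain String.indexOf. This is only a minimal sketch: the names BlockCommentCounter and countBlockComments are made up for illustration, and it assumes that every /* pairs with the next */ and that delimiters never appear inside string literals.

```java
class BlockCommentCounter {

    // Counts complete "/* ... */" pairs; an opening "/*" without a closing "*/" is not counted.
    static int countBlockComments(String source) {
        int count = 0;
        int pos = 0;
        while (true) {
            int open = source.indexOf("/*", pos);
            if (open < 0) {
                break;
            }
            // Search after the opener so that "/*/" is not mistaken for a whole comment.
            int close = source.indexOf("*/", open + 2);
            if (close < 0) {
                break;
            }
            count++;
            pos = close + 2;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBlockComments("/* fgdfgfgf */ /* fgdfgfgf */ /* fgdfgfgf */")); // 3
        System.out.println(countBlockComments("/*fgdfgfgf *//* fgdfgfgf*//* fgdfgfgf */"));     // 3
    }
}
```

Because it never splits on whitespace, both example strings from the question give the same count.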

new StringTokenizer(str, "\n") splits str into lines instead of using the default delimiter set " \t\n\r\f" (space, tab, newline, carriage return and form feed):
StringTokenizer stringTokenizer = new StringTokenizer(str,"\n");
This specifies newline as the only delimiter to use for tokenizing.
Using your current approach:
String line;
while(stringTokenizer.hasMoreTokens()){
line=stringTokenizer.nextToken();
if(!exists && line.contains("/*")){
exists = true;
}
if(exists && line.contains("*/")){
x++;
exists = false;
}
}
For multiple comments I tried using /\\* and \\*/ as patterns in split() and taking the lengths of the resulting arrays as the number of occurrences, but unfortunately the lengths were not exact because of uneven splitting.
Multiple/Single Comments can be: (IMO)
COMMENT=/* .* */
A = COMMENT;
B = CODE;
C = AB/BA/ABA/BAB/AAB/BAA/A;

This reminds me of flip-flops in Ruby/Perl/Awk et al. There is no need to use a StringTokenizer. You just need to keep track of state to count the number of lines with comments.
State 1: you are inside a comment block. You collect all the characters, and as soon as you encounter a complete */ you toggle the comment-block switch and move to state 2.
State 2: you reject everything until you encounter a /*, which takes you back to state 1.
Something like this:
public static int countLines(Reader reader) throws IOException {
int commentLines = 0;
boolean inComments = false;
StringBuilder line = new StringBuilder();
for (int ch = -1, prev = -1; ((ch = reader.read())) != -1; prev = ch) {
if (inComments) {
if (prev == '*' && ch == '/') { //flip state
inComments = false;
}
if (ch != '\n' && ch != '\r') {
line.append((char)ch);
}
if (!inComments || ch == '\n' || ch == '\r') {
String actualLine = line.toString().trim();
//ignore lines which only have '*/' in them
commentLines += actualLine.length() > 0 && !"*/".equals(actualLine) ? 1 : 0;
line = new StringBuilder();
}
} else {
if (prev == '/' && ch == '*') { //flip state
inComments = true;
}
}
}
return commentLines;
}
public static void main(String[] args) throws FileNotFoundException, IOException {
System.out.println(countLines(new FileReader(new File("/tmp/b"))));
}
The program above ignores empty comment lines and lines that contain only /* or */. We also need to ignore nested comment openers, which a StringTokenizer may fail to do.
Example file /tmp/b
#include <stdio.h>
int main()
{
/* /******* wrong! The variable declaration must appear first */
printf( "Declare x next" );
int x;
return 0;
}
returns 1.

Related

Remove all the words in a string that are in round brackets in java?

The input is a (good) example((eo)--)e). I have used an iterative way.
I tried with the following code:
public String scartaParentesi(String s)
{
String ups = s.replaceAll("\\([^()]*\\)", "");
return ups;
}
The output of this code is a example(--)e).
The expected output is a examplee).
Based on description and comments, you can do:
String str = "a (good) example((eo)--)e";
StringBuilder stringBuilder = new StringBuilder();
int openedParenthesesCount = 0;
for (char c : str.toCharArray()) {
if (c == '(') {
openedParenthesesCount++;
} else if (c == ')') {
openedParenthesesCount--;
} else if (openedParenthesesCount == 0) {
stringBuilder.append(c);
}
}
System.out.println(stringBuilder);
Output:
a examplee
Assumption: the number of '(' equals the number of ')'.
A more robust solution that makes no assumption about the number of opening and closing parentheses:
String text = "a (good) example((eo)--)e)";
StringBuilder outText = new StringBuilder();
Deque<Character> stack = new ArrayDeque<Character>();
int i=0;
while (i<text.length()) {
if (text.charAt(i) == '(') {
stack.addFirst(text.charAt(i));
i++;
}
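while (!stack.isEmpty() && i < text.length()) { // stop at the end of the input if a '(' is never closed
if (text.charAt(i) != ')') {
stack.addFirst(text.charAt(i));
i++;
} else {
if (stack.removeFirst() == '(') {
i++;
}
}
}
// copy the next character unless the input is exhausted or another group starts here
if (i < text.length() && text.charAt(i) != '(') {
outText.append(text.charAt(i));
i++;
}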
}
Output:
before: a (good) example((eo)--)e)
after: a examplee)
You can also use your original String replaceAll method by putting it in a loop, replacing the same pattern on the last updated string. The loop ends when two consecutive iterations produce the same string, i.e. there is nothing left to replace:
String prev = text.replaceAll("\\([^()]*\\)", "");
while (!text.equals(prev)) {
prev = text;
text = text.replaceAll("\\([^()]*\\)", "");
}
System.out.println("after2: " + text);
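The stack in the Deque version only ever needs to know how deep the nesting is, so a plain counter can do the same job. The sketch below is one way to write that; the names ParenStripper and stripParenthesized are invented here, and it assumes that an unmatched ')' should be kept while an unclosed '(' swallows the rest of the input.

```java
class ParenStripper {

    // Drops every "(...)" group, including nested ones.
    // Assumptions: a ')' with no matching '(' is kept as-is,
    // and everything after an unclosed '(' is dropped.
    static String stripParenthesized(String text) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open parentheses
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The spaces on both sides of the removed "(good)" are kept.
        System.out.println("[" + stripParenthesized("a (good) example((eo)--)e)") + "]");
    }
}
```

Unlike the repeated replaceAll, this makes a single pass over the input regardless of nesting depth.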

Read only Strings from a file using Scanner

I am creating a method that creates a file containing only the Strings from another file, which can contain anything (ints, doubles, ...). I am using another method that returns true if the input is a String.
public static void buscarFichero(String ftx){
File f = new File(ftx);
Scanner s = null;
PrintWriter p = null;
try{
s = new Scanner(f).useLocale(Locale.US);
p = new PrintWriter(ftx + "_nuevo");
while(s.hasNextLine()){
String aux = s.nextLine();
if(esString(aux) == true){
String b = aux.trim();
p.println(b);
}
}
}catch(FileNotFoundException e){}
finally{
if(s != null){ s.close(); }
if(p != null){ p.close(); }
}
}
public static boolean esString(String x){
if(x.equals(x.toString())){ return true;}
else{ return false; }
}
I know I am using an auxiliary variable that always turns the result of nextLine() into a String, but I do not know how to fix it. I want to get rid of everything that is not a String.
Everything you read from a file is going to technically be a String. I believe that what you are trying to accomplish is to distinguish whether or not a particular String contains only letters. If this is true then what you need to do is compare the character codes. In this example I check if the character is not a character from a-z or A-Z. If so, then it is not a word.
private static boolean isWord(String str) {
for (int i = 0; i < str.length(); i++) {
char c = str.charAt(i);
if ((c < 'A' || c > 'Z') && (c < 'a' || c > 'z')) {
return false;
}
}
return true;
}

Programmatically remove comments from Java File [duplicate]

I have a Java project and I have used comments in many places across its Java files. Now I need to remove all types of comments: single-line and multi-line comments.
Please suggest a way to automate removing the comments, using a tool, Eclipse, etc.
Currently I am trying to remove all the comments manually.
You can remove all single- or multi-line block comments (but not line comments with //) by searching for the following regular expression in your project(s)/file(s) and replacing by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)(?<!/)/\*[^\*]*(?:\*+[^/][^\*]*)*?\*+/
It's possible that you have to execute it more than once.
This regular expression avoids the following pitfalls:
Code between two comments /* Comment 1 */ foo(); /* Comment 2 */
Line comments starting with an asterisk: //***NOTE***
Comment delimiters inside string literals: stringbuilder.append("/*");; also if there is a double quote inside single quotes before the comment
To remove all single-line comments, search for the following regular expression in your project(s)/file(s) and replace by $1:
^([^"\r\n]*?(?:(?<=')"[^"\r\n]*?|(?<!')"[^"\r\n]*?"[^"\r\n]*?)*?)\s*//[^\r\n]*
This regular expression also avoids comment delimiters inside double quotes, but does NOT check for multi-line comments, so /* // */ will be incorrectly removed.
I had to write something to do this a few weeks ago. This should handle all comments, nested or otherwise. It is long, but I haven't seen a regex version that handled nested comments properly. I didn't have to preserve Javadoc, but I presume you do, so I added some code that I believe should handle that. I also added code to support the \r\n and \r line separators. The new code is marked as such.
public static String removeComments(String code) {
StringBuilder newCode = new StringBuilder();
try (StringReader sr = new StringReader(code)) {
boolean inBlockComment = false;
boolean inLineComment = false;
boolean out = true;
int prev = sr.read();
int cur;
for(cur = sr.read(); cur != -1; cur = sr.read()) {
if(inBlockComment) {
if (prev == '*' && cur == '/') {
inBlockComment = false;
out = false;
}
} else if (inLineComment) {
if (cur == '\r') { // start untested block
sr.mark(1);
int next = sr.read();
if (next != '\n') {
sr.reset();
}
inLineComment = false;
out = false; // end untested block
} else if (cur == '\n') {
inLineComment = false;
out = false;
}
} else {
if (prev == '/' && cur == '*') {
sr.mark(1); // start untested block
int next = sr.read();
if (next != '*') {
inBlockComment = true; // tested line (without rest of block)
}
sr.reset(); // end untested block
} else if (prev == '/' && cur == '/') {
inLineComment = true;
} else if (out){
newCode.append((char)prev);
} else {
out = true;
}
}
prev = cur;
}
if (prev != -1 && out && !inLineComment) {
newCode.append((char)prev);
}
} catch (IOException e) {
e.printStackTrace();
}
return newCode.toString();
}
you can try it with the java-comment-preprocessor:
java -jar ./jcp-6.0.0.jar --i:/sourceFolder --o:/resultFolder -ef:none --r
I made an open source library and uploaded it to GitHub. It is called CommentRemover, and it can remove single-line and multi-line Java comments.
It can either remove TODOs or leave them in place.
It also supports JavaScript, HTML, CSS, Properties, JSP and XML comments.
Here is a small code snippet showing how to use it (there are two types of usage):
First way InternalPath
public static void main(String[] args) throws CommentRemoverException {
// root dir is: /Users/user/Projects/MyProject
// example for startInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc.. goes like that
.removeTodos(false) // Do Not Touch Todos (leave them alone)
.removeSingleLines(true) // Remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startInternalPath("src.main.app") // Starts from {rootDir}/src/main/app , leave it empty string when you want to start from root dir
.setExcludePackages(new String[]{"src.main.java.app.pattern"}) // Refers to {rootDir}/src/main/java/app/pattern and skips this directory
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
Second way ExternalPath
public static void main(String[] args) throws CommentRemoverException {
// example for externalInternalPath
CommentRemover commentRemover = new CommentRemover.CommentRemoverBuilder()
.removeJava(true) // Remove Java file Comments....
.removeJavaScript(true) // Remove JavaScript file Comments....
.removeJSP(true) // etc..
.removeTodos(true) // Remove todos
.removeSingleLines(false) // Do not remove single line type comments
.removeMultiLines(true) // Remove multiple type comments
.startExternalPath("/Users/user/Projects/MyOtherProject")// Give it full path for external directories
.setExcludePackages(new String[]{"src.main.java.model"}) // Refers to /Users/user/Projects/MyOtherProject/src/main/java/model and skips this directory.
.build();
CommentProcessor commentProcessor = new CommentProcessor(commentRemover);
commentProcessor.start();
}
This is an old post but this may help someone who enjoys working on command line like myself:
The perl one-liner below will remove all comments:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g;' test.java
Example:
cat test.java
this is a test
/**
*This should be removed
*This should be removed
*/
this should not be removed
//this should be removed
this should not be removed
this should not be removed //this should be removed
Output:
perl -0pe 's#/\*\*(.|\n)*?\*/##g; s|//.*?\n|\n|g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
If you want to get rid of multiple blank lines as well:
perl -0pe 's|//.*?\n|\n|g; s#/\*(.|\n)*?\*/##g; s/\n\n+/\n\n/g' test.java
this is a test
this should not be removed
this should not be removed
this should not be removed
EDIT: Corrected regex
Dealing with source code is hard unless you know more about how the comments were written.
In the more general case, you could have // or /* inside string constants. So you really need to parse the file at a syntactic level, not only a lexical one. IMHO the only bulletproof solution would be to start, for example, with the Java parser from OpenJDK.
If you know that your comments are never deeply mixed with the code (in my example comments MUST be full lines), a Python script could help:
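# text: any iterable of source lines, e.g. an open file
multiple = False  # True while inside a /* ... */ block
for line in text:
    stripped = line.strip()
    if multiple:
        if stripped.endswith('*/'):
            multiple = False
        continue
    elif stripped.startswith('/*'):
        # a block comment that also closes on this line does not open a block
        multiple = not stripped.endswith('*/')
    elif stripped.startswith('//'):
        pass
    else:
        print(line)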
If you are using Eclipse IDE, you could make regex do the work for you.
Open the search window (Ctrl+F), and check 'Regular Expression'.
Provide the expression as
/\*\*(?s:(?!\*/).)*\*/
Prasanth Bhate has explained it in Tool to remove JavaDoc comments?
public class TestForStrings {
/**
* The main method.
*
* @param args
* the arguments
* @throws Exception
* the exception
*/
public static void main(String args[]) throws Exception {
String[] imports = new String[100];
String fileName = "Menu.java";
// This will reference one API at a time
String line = null;
try {
FileReader fileReader = new FileReader(fileName);
// Always wrap FileReader in BufferedReader.
BufferedReader bufferedReader = new BufferedReader(fileReader);
int startingOffset = 0;
// This will reference one API at a time
List<String> lines = Files.readAllLines(Paths.get(fileName),
Charset.forName("ISO-8859-1"));
// remove single line comments
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
lines.set(count, removeSingleLineComment(tempString));
}
// remove multiple lines comment
for (int count = 0; count < lines.size(); count++) {
String tempString = lines.get(count);
removeMultipleLineComment(tempString, count, lines);
}
for (int count = 0; count < lines.size(); count++) {
System.out.println(lines.get(count));
}
} catch (FileNotFoundException ex) {
System.out.println("Unable to open file '" + fileName + "'");
} catch (IOException ex) {
System.out.println("Error reading file '" + fileName + "'");
} catch (Exception e) {
}
}
/**
* Removes the multiple line comment.
*
* @param tempString
* the temp string
* @param count
* the count
* @param lines
* the lines
* @return the string
*/
private static List<String> removeMultipleLineComment(String tempString,
int count, List<String> lines) {
try {
if (tempString.contains("/**") || (tempString.contains("/*"))) {
int StartIndex = count;
while (!(lines.get(count).contains("*/") || lines.get(count)
.contains("**/"))) {
count++;
}
int endIndex = ++count;
if (StartIndex != endIndex) {
while (StartIndex != endIndex) {
lines.set(StartIndex, "");
StartIndex++;
}
}
}
} catch (Exception e) {
// Do Nothing
}
return lines;
}
/**
* Remove single line comments .
*
* @param line
* the line
* @return the string
* @throws Exception
* the exception
*/
private static String removeSingleLineComment(String line) throws Exception {
try {
if (line.contains(("//"))) {
int startIndex = line.indexOf("//");
int endIndex = line.length();
String tempoString = line.substring(startIndex, endIndex);
line = line.replace(tempoString, "");
}
if (line.contains("/*") && line.contains("*/")) {
// search for "/*" so that both /* and /** comments are found
int startIndex = line.indexOf("/*");
int endIndex = line.indexOf("*/", startIndex + 2);
if (endIndex >= 0) {
// remove only the comment itself and keep any code that follows it
line = line.substring(0, startIndex) + line.substring(endIndex + 2);
}
}
} catch (Exception e) {
// Do Nothing
}
return line;
}
}
This is what I came up with yesterday.
This is actually homework I got from school so if anybody reads this and finds a bug before I turn it in, please leave a comment =)
PS: 'FilterState' is an enum class.
public static String deleteComments(String javaCode) {
FilterState state = FilterState.IN_CODE;
StringBuilder strB = new StringBuilder();
char prevC=' ';
for(int i = 0; i<javaCode.length(); i++){
char c = javaCode.charAt(i);
switch(state){
case IN_CODE:
if(c=='/')
state = FilterState.CAN_BE_COMMENT_START;
else {
if (c == '"')
state = FilterState.INSIDE_STRING;
strB.append(c);
}
break;
case CAN_BE_COMMENT_START:
if(c=='*'){
state = FilterState.IN_COMMENT_BLOCK;
}
else if(c=='/'){
state = FilterState.ON_COMMENT_LINE;
}
else {
state = FilterState.IN_CODE;
strB.append(prevC).append(c); // prevC + c would add the two char codes and append a number
}
break;
case ON_COMMENT_LINE:
if(c=='\n' || c=='\r') {
state = FilterState.IN_CODE;
strB.append(c);
}
break;
case IN_COMMENT_BLOCK:
if(c=='*')
state=FilterState.CAN_BE_COMMENT_END;
break;
case CAN_BE_COMMENT_END:
if(c=='/')
state = FilterState.IN_CODE;
else if(c!='*')
state = FilterState.IN_COMMENT_BLOCK;
break;
case INSIDE_STRING:
if(c == '"' && prevC!='\\')
state = FilterState.IN_CODE;
strB.append(c);
break;
default:
System.out.println("unknown case");
return null;
}
prevC = c;
}
return strB.toString();
}
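The FilterState enum itself is not shown in the answer. Judging only from the constants that deleteComments refers to, it was presumably something like the following; the exact declaration and the comments on each constant are assumptions.

```java
// Hypothetical reconstruction: one constant per state used in deleteComments().
enum FilterState {
    IN_CODE,               // ordinary source text
    CAN_BE_COMMENT_START,  // just saw '/', the next character decides
    ON_COMMENT_LINE,       // inside a "//" comment
    IN_COMMENT_BLOCK,      // inside a "/* ... */" comment
    CAN_BE_COMMENT_END,    // saw '*' inside a block comment
    INSIDE_STRING          // inside a double-quoted string literal
}
```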
private static int find(String s, String t, int start) {
int ret = s.indexOf(t, start);
return ret < 0 ? Integer.MAX_VALUE : ret;
}
private static int findSkipEsc(String s, String t, int start) {
while(true) {
int ret = find(s, t, start);
if( ret == Integer.MAX_VALUE) return -1;
int esc = find(s, "\\", start);
if( esc > ret) return ret;
start += 2;
}
}
private static String removeLineCommnt(String s) {
int i, start = 0;
while (0 <= (i = find(s, "//", start))) { //Speed it up
int j = find(s, "'", start);
int k = find(s, "\"", start);
int first = min(i, min(j, k));
if (first == Integer.MAX_VALUE) return s;
if (i == first) return s.substring(0, i);
//skipp quoted string
start = first+1;
if (k == first) { // " asdas\"dasd "
start = findSkipEsc(s,"\"",start);
if (start < 0) return s;
start++;
continue;
}
//if j == first ' asda\'sasd ' --- not in JSON
start = findSkipEsc(s,"'\"'",start);
if (start < 0) return s;
start++;
}
return s;
}
static String removeLineCommnts(String s) {
if (!s.contains("//")) return s; //Speed it up
return Arrays.stream(s.split("[\\n\\r]+")).
map(Common::removeLineCommnt).
collect(Collectors.joining("\n"));
}

Read a file and count how many times the vowels (characters) "aeiou" appear in order

It ignores consonants.
It ignores any type of space.
It ignores case.
The only thing it cannot ignore is if another vowel occurs out of order.
These count:
AEIOU,
aeiou,
hahehihohu,
Take it out
These do not:
AEIuO,
Taco is good,
Take it over
Here is what I have so far:
import java.util.Scanner;
public class AEIOU_Counter {
public static void main(String[] args) throws Exception {
java.io.File file = new java.io.File("vowels.txt");
Scanner input = new Scanner(file);
String fileContent = "";
while (input.hasNext())
{
fileContent += input.next() + " ";
}
input.close();
char[] charArr = fileContent.toCharArray();
int counter = 0;
for (char c : charArr)
{
if(c == 'a' || c == 'e' ||c == 'i' ||c == 'o' ||c == 'u')
counter++;
}
System.out.println("The file " + file + " has AEIOU in order " + counter + " times");
}
}
The problem is the output:
The file vowels.txt has AEIOU in order 50 times
However, the file vowels.txt contains:
AEIOU aeiou baeiboeu bbbaaaaaa beaeiou caeuoi ajejijoju aeioo
aeiOu ma me mi mo mu take it OUT!
So the correct output should be:
The file vowels.txt has AEIOU in order 8 times
There are two ways I can think of to do it. No real code, since this is your assignment :)
The first way is to reduce the input to its simplest possible form.
1. Read the input from the file.
2. toLowerCase() the input (so that "aEiOU" becomes simply "aeiou").
3. Remove all non-vowel characters (so that 'hahehihohu' becomes 'aeiou').
4. Search for the literal string "aeiou" and count the occurrences.
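For readers who are not doing this as an assignment, steps 2 to 4 could be sketched roughly as follows. The class and method names (OrderedVowelCounter, countAeiou) are made up, and reading the file (step 1) is left out so the sketch works on any String.

```java
import java.util.Locale;

class OrderedVowelCounter {

    // Lower-cases the input, strips everything except vowels,
    // then counts occurrences of the literal "aeiou".
    static int countAeiou(String input) {
        String vowelsOnly = input.toLowerCase(Locale.ROOT).replaceAll("[^aeiou]", "");
        int count = 0;
        int from = 0;
        while ((from = vowelsOnly.indexOf("aeiou", from)) >= 0) {
            count++;
            from += "aeiou".length(); // matches cannot overlap, so skip past this one
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countAeiou("AEIOU aeiou hahehihohu Take it out")); // 4
        System.out.println(countAeiou("AEIuO, Taco is good, Take it over"));  // 0
    }
}
```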
The second way is to leave the input alone, but use loops and counters. The 'sequence' could be an array, or maybe a linked list:
sequence = [a,e,i,o,u] // (or a->e->i->o->u)
curr_char_of_sequence = 'a'
counter = 0
for each char in the input, loop {
if the char is not a vowel {
continue to next char
}
//see if the vowel is the one we want next
if char == curr_char_of_sequence {
//it is! update whats the next vowel we want.
// ie, if we were looking for an 'a', now look for an 'e'
curr_char_of_sequence = sequence.next
//check to see if we reached the end of the sequence, if so, we found a completed 'aeiou' set
if curr_char_of_sequence == invalid {
counter++
curr_char_of_sequence = 'a'
}
//we found a vowel that isn't the right one, restart the sequence
} else {
curr_char_of_sequence = 'a'
}
}
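A Java rendering of this loop might look like the sketch below (names such as VowelSequenceScanner and countInOrder are hypothetical). It makes one deliberate change to the pseudocode, which is my assumption about the intended behaviour: when an out-of-order vowel is an 'a', it is treated as the start of a new attempt instead of being thrown away, so that "aaeiou" still counts once.

```java
class VowelSequenceScanner {

    private static final String SEQUENCE = "aeiou";

    static int countInOrder(String input) {
        int next = 0;   // index in SEQUENCE of the vowel we want to see next
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = Character.toLowerCase(input.charAt(i));
            if (SEQUENCE.indexOf(c) < 0) {
                continue; // not a vowel: ignore it
            }
            if (c == SEQUENCE.charAt(next)) {
                next++;
                if (next == SEQUENCE.length()) { // completed one "aeiou"
                    count++;
                    next = 0;
                }
            } else {
                // wrong vowel: restart, but an 'a' already begins a new attempt
                next = (c == 'a') ? 1 : 0;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countInOrder("AEIOU aeiou hahehihohu Take it out")); // 4
    }
}
```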
As people pointed out, you should use regular expression.
Here is a little help to get you every AEIOU in this specific order (doesn't ignore the non vowels in between)
java.io.File file = new java.io.File("vowels.txt");
Scanner input = null;
try {
input = new Scanner(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
String fileContent = "";
while (input.hasNext())
{
fileContent += input.next().toLowerCase();
}
input.close();
int counter = 0;
for (int i = 0; i<fileContent.length()-4;i++)
{
if(fileContent.charAt(i) == 'a'){
if(fileContent.charAt(i+1)=='e'){
if(fileContent.charAt(i+2)=='i')
if(fileContent.charAt(i+3)=='o')
if(fileContent.charAt(i+4)=='u'){
counter++;
}
}
}
}
System.out.println("The file " + file + " has AEIOU in order " + counter + " times");
}
Of course, your code counts every single occurrence of any ONE of these letters (a OR e OR i...).
Try using booleans: once you have seen a letter ("a"), you look for the next one ("e").
for (char c : charArr)
{
if(c == 'a') {
boolA = true;
}
else if(c=='e') {
if (boolA) {
boolE = true;
}
else {
boolA = false;
boolE = false;
boolI = false;
boolO = false;
boolU = false;
}
}
else if (c=='i') {
if (boolE) {
boolI = true;
}
else {
boolA = false;
boolE = false;
boolI = false;
boolO = false;
boolU = false;
}
//etc, etc ....
}
if you see what I mean ^^
Or there is another way (for the lazy):
you remember the last valid character you found, and if the current letter is the one that follows it in the sequence, it is a match.
char lastChar = 'w'; // start with a char that is not in AEIOU
String validLetters = "aeiou";
String myArray = "eiou"; // I removed the first letter
for (char c : charArr) {
if (c == 'a') {
lastChar = 'a';
}
else if (validLetters.indexOf(c) >= 0 && lastChar == validLetters.charAt(myArray.indexOf(c))) {
lastChar = c; // you understand it, you get the answer ^^
}
else {
lastChar = 'w'; // just a random char, not in AEIOU
}
}
The last one is better.
Hope it helps, Bye :)
Looks like Me Good Guy beat me to it, but here is a full example with the boolean idea
public static void main(String[] args) {
boolean a = false;
boolean e = false;
boolean i = false;
boolean o = false;
boolean u = false;
int vowelCounter = 0;
String s = "AEIOU aeiou hahehihohu Take it out";
for (int index = 0; index < s.length(); index++) {
Character c = Character.toLowerCase(s.charAt(index));
if (c == 'a') {
a = true;
continue;
}
if (a && c == 'e') {
e = true;
continue;
}
if (a && e && c == 'i') {
i = true;
continue;
}
if (a && e && i && c == 'o') {
o = true;
continue;
}
if (a && e && i && o && c == 'u') {
u = true;
// no continue because we want to exit this if-chain
}
if (a && e && i && o && u) {
vowelCounter++;
a = e = i = o = u = false; // reset
}
}
System.out.printf("The string \"%s\" contains 'aeiou' in order %d times.\n", s, vowelCounter);
// The string "AEIOU aeiou hahehihohu Take it out" contains 'aeiou' in order 4 times.
}

java regex, split on comma only if not in quotes or brackets

I would like to do a java split via regex.
I would like to split my string on every comma when it is NOT in single quotes or brackets.
example:
Hello, 'my,',friend,(how ,are, you),(,)
should give:
hello
my,
friend
how, are, you
,
I tried this:
(?i),(?=([^\'|\(]*\'|\([^\'|\(]*\'|\()*[^\'|\)]*$)
But I can't get it to work (I tested via http://java-regex-tester.appspot.com/)
Any ideas?
Nested parentheses can't be handled by a regex split. It's easier to split them manually.
public static List<String> split(String orig) {
List<String> splitted = new ArrayList<String>();
int nextingLevel = 0;
StringBuilder result = new StringBuilder();
for (char c : orig.toCharArray()) {
if (c == ',' && nextingLevel == 0) {
splitted.add(result.toString());
result.setLength(0);// clean buffer
} else {
if (c == '(')
nextingLevel++;
if (c == ')')
nextingLevel--;
result.append(c);
}
}
// Thanks PoeHah for pointing it out. This adds the last element to it.
splitted.add(result.toString());
return splitted;
}
Hope this helps.
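The loop above tracks parentheses only, so the comma inside 'my,' would still cause a split. A variant that also toggles a flag on single quotes could look like this sketch; the names QuoteAwareSplitter and split are invented, and trimming each piece is my own choice, not something the question asks for.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteAwareSplitter {

    // Splits on commas that are outside single quotes and outside parentheses.
    // Quotes and brackets are left in the returned pieces; each piece is trimmed.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;            // open parentheses
        boolean inQuote = false;  // inside '...'
        for (char c : input.toCharArray()) {
            if (c == '\'') {
                inQuote = !inQuote;
            } else if (!inQuote && c == '(') {
                depth++;
            } else if (!inQuote && c == ')' && depth > 0) {
                depth--;
            }
            if (c == ',' && depth == 0 && !inQuote) {
                parts.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString().trim()); // last piece
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("Hello, 'my,',friend,(how ,are, you),(,)")) {
            System.out.println(part);
        }
    }
}
```

Stripping the surrounding quotes or brackets from each piece, as in the question's expected output, would be a separate post-processing step.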
A java CSV parser library would be better suited to this task than regex: http://sourceforge.net/projects/javacsv/
Assuming no nested (), you could split on
",(?=(?:[^']*'[^']*')*[^']*$)(?=(?:[^()]*\\([^()]*\\))*[^()]*$)"
It will only split on a comma when ahead in the string is an even number of ' and bracket pairs.
It's a brittle solution, but it may be good enough.
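To see what this pattern does with the string from the question, it can be passed straight to String.split. The wrapper class RegexSplitDemo below is just scaffolding added for this illustration.

```java
import java.util.Arrays;

class RegexSplitDemo {

    // The pattern from the answer above, unchanged.
    static final String SPLIT_REGEX =
            ",(?=(?:[^']*'[^']*')*[^']*$)(?=(?:[^()]*\\([^()]*\\))*[^()]*$)";

    static String[] split(String input) {
        return input.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        String input = "Hello, 'my,',friend,(how ,are, you),(,)";
        System.out.println(Arrays.toString(split(input)));
    }
}
```

The pieces keep their quotes, brackets and any leading whitespace, so the second piece still starts with a space; trimming and unwrapping would have to happen afterwards.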
As noted in some comments and in the answer by @Balthus, this is better done with a CSV parser. You do need some smart regex replacement to prepare the input string for parsing. Consider code like this:
String str = "Hello, 'my,',friend,(how ,are, you),(,)"; // input string
// prepare String for CSV parser: replace left/right brackets OR ' by a "
CsvReader reader = CsvReader.parse(str.replaceAll("[(')]", "\""));
reader.readRecord(); // read the CSV input
for (int i=0; i<reader.getColumnCount(); i++)
System.out.printf("col[%d]: [%s]%n", i, reader.get(i));
OUTPUT
col[0]: [Hello]
col[1]: [my,]
col[2]: [friend]
col[3]: [how ,are, you]
col[4]: [,]
I also need to split on comma outside of quotes and brackets.
After searching over all the related answers on SO, I realized a lexer is needed in such a case, and I wrote a generic implementation for myself. It supports a separator, multiple quotes and multiple brackets as regexes.
public static List<String> split(String string, String regex, String[] quotesRegex, String[] leftBracketsRegex,
String[] rightBracketsRegex) {
if (leftBracketsRegex.length != rightBracketsRegex.length) {
throw new IllegalArgumentException("Bracket count mismatch, left: " + leftBracketsRegex.length + ", right: "
+ rightBracketsRegex.length);
}
// Prepare all delimiters.
String[] delimiters = new String[1 + quotesRegex.length + leftBracketsRegex.length + rightBracketsRegex.length];
delimiters[0] = regex;
System.arraycopy(quotesRegex, 0, delimiters, 1, quotesRegex.length);
System.arraycopy(leftBracketsRegex, 0, delimiters, 1 + quotesRegex.length, leftBracketsRegex.length);
System.arraycopy(rightBracketsRegex, 0, delimiters, 1 + quotesRegex.length + leftBracketsRegex.length,
rightBracketsRegex.length);
// Build delimiter regex.
StringBuilder delimitersRegexBuilder = new StringBuilder("(?:");
boolean first = true;
for (String delimiter : delimiters) {
if (delimiter.endsWith("\\") && !delimiter.endsWith("\\\\")) {
throw new IllegalArgumentException("Delimiter contains trailing single \\: " + delimiter);
}
if (first) {
first = false;
} else {
delimitersRegexBuilder.append("|");
}
delimitersRegexBuilder
.append("(")
.append(delimiter)
.append(")");
}
delimitersRegexBuilder.append(")");
String delimitersRegex = delimitersRegexBuilder.toString();
// Scan.
int pendingQuoteIndex = -1;
Deque<Integer> bracketStack = new LinkedList<>();
StringBuilder pendingSegmentBuilder = new StringBuilder();
List<String> segmentList = new ArrayList<>();
Matcher matcher = Pattern.compile(delimitersRegex).matcher(string);
int matcherIndex = 0;
while (matcher.find()) {
pendingSegmentBuilder.append(string.substring(matcherIndex, matcher.start()));
int delimiterIndex = -1;
for (int i = 1; i <= matcher.groupCount(); ++i) {
if (matcher.group(i) != null) {
delimiterIndex = i - 1;
break;
}
}
if (delimiterIndex < 1) {
// Regex.
if (pendingQuoteIndex == -1 && bracketStack.isEmpty()) {
segmentList.add(pendingSegmentBuilder.toString());
pendingSegmentBuilder.setLength(0);
} else {
pendingSegmentBuilder.append(matcher.group());
}
} else {
delimiterIndex -= 1;
pendingSegmentBuilder.append(matcher.group());
if (delimiterIndex < quotesRegex.length) {
// Quote.
if (pendingQuoteIndex == -1) {
pendingQuoteIndex = delimiterIndex;
} else if (pendingQuoteIndex == delimiterIndex) {
pendingQuoteIndex = -1;
}
// Ignore unpaired quotes.
} else if (pendingQuoteIndex == -1) {
delimiterIndex -= quotesRegex.length;
if (delimiterIndex < leftBracketsRegex.length) {
// Left bracket
bracketStack.push(delimiterIndex);
} else {
delimiterIndex -= leftBracketsRegex.length;
// Right bracket
Integer topBracket = bracketStack.peek(); // null when no bracket is open
// Ignore unbalanced brackets.
if (topBracket != null && delimiterIndex == topBracket) {
bracketStack.pop();
}
}
}
}
matcherIndex = matcher.end();
}
pendingSegmentBuilder.append(string.substring(matcherIndex, string.length()));
segmentList.add(pendingSegmentBuilder.toString());
while (segmentList.size() > 0 && segmentList.get(segmentList.size() - 1).isEmpty()) {
segmentList.remove(segmentList.size() - 1);
}
return segmentList;
}
