Regular expression for \" in Java

I need to write a regular expression for a string read from a file:
apple,boy,cat,"dog,cat","time\" after\"noon"
I need to split it into:
apple
boy
cat
dog,cat
time"after"noon
I tried using
Pattern pattern =
Pattern.compile("[\\\"]");
String items[]=pattern.split(match);
for the second part, but I could not get the right answer. Can you help me with this?

Since your question is more of a parsing problem than a regex problem, here's another solution that will work:
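import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
public class CsvReader {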
Reader r;
int row, col;
boolean endOfRow;
public CsvReader(Reader r){
this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
this.row = -1;
this.col = 0;
this.endOfRow = true;
}
/**
* Returns the next string in the input stream, or null when no input is left
* @return the next field, or null at end of input
* @throws IOException if reading from the underlying stream fails
*/
public String next() throws IOException {
int i = r.read();
if(i == -1)
return null;
if(this.endOfRow){
this.row++;
this.col = 0;
this.endOfRow = false;
} else {
this.col++;
}
StringBuilder b = new StringBuilder();
outerLoop:
while(true){
char c = (char) i;
if(i == -1)
break;
if(c == ','){
break;
} else if(c == '\n'){
endOfRow = true;
break;
} else if(c == '\\'){
i = r.read();
if(i == -1){
break;
} else {
b.append((char)i);
}
} else if(c == '"'){
while(true){
i = r.read();
if(i == -1){
break outerLoop;
}
c = (char)i;
if(c == '\\'){
i = r.read();
if(i == -1){
break outerLoop;
} else {
b.append((char)i);
}
} else if(c == '"'){
r.mark(2);
i = r.read();
if(i == '"'){
b.append('"');
} else {
r.reset();
break;
}
} else {
b.append(c);
}
}
} else {
b.append(c);
}
i = r.read();
}
return b.toString().trim();
}
public int getColNum(){
return col;
}
public int getRowNum(){
return row;
}
public static void main(String[] args){
try {
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
System.out.println(input);
Reader r = new StringReader(input);
CsvReader csv = new CsvReader(r);
String s;
while((s = csv.next()) != null){
System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
}
} catch(IOException e){
e.printStackTrace();
}
}
}
Running this code, I get the output:
R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?
This should fit your needs pretty well.
A few disclaimers, though:
It won't catch errors in the syntax of the CSV format, such as an unescaped quotation mark in the middle of a value.
It won't perform any character conversion (such as converting "\n" to a newline character). Backslashes simply cause the following character to be treated literally, including other backslashes. (That should be easy enough to alter if you need additional functionality)
Some CSV files escape quotes by doubling them rather than using a backslash; this code now looks for both.
Edit: Looked up the csv format, discovered there's no real standard, but updated my code to catch quotes escaped by doubling rather than backslashes.
Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.

First thing: String.split() uses the regex to find the separators, not the substrings.
Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be by lookahead and lookbehind, and that's going to break in quite a lot of cases.
Edit2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with string.split() -- but a general solution wouldn't be simple.
Basically, you're looking for anything that isn't a comma as input [^,], you can handle quotes as a separate character. I've gotten most of the way there myself. I'm getting this as output:
apple
boy
cat
dog
cat
time\" after\"noon
But I'm not sure why it has so many blank lines.
My complete code is:
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
Pattern pattern =
Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);
while(m.find()){
System.out.println(m.group());
}
But yeah, I'll echo the guy above and say that if there's no requirement to use a regular expression, then it's probably simpler to do it manually.
But then I guess I'm almost there. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.
But I'm going to echo the guy above and say that if there's no requirement to use a regular expression, it's probably better to do it one character at a time and implement the logic manually. If your regex isn't picture-perfect, then it could cause all kinds of unpredictable weirdness down the line.
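For the simple case without backslash-escaped quotes, the lookahead idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are made up for this example, and the pattern assumes that quotes are balanced and never escaped. It splits on a comma only when an even number of double quotes follows it, and it leaves the surrounding quotes on the field.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from any answer above.
class LookaheadSplit {
    // A comma followed by an even number of double quotes up to the end of
    // the line, i.e. a comma that is not inside a quoted field.
    // Assumes balanced quotes and no escaped quotes.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static String[] split(String line) {
        return COMMA_OUTSIDE_QUOTES.split(line);
    }

    public static void main(String[] args) {
        for (String field : split("apple,boy,\"dog,cat\"")) {
            System.out.println(field);
        }
    }
}
```

Input with escaped quotes, such as the line in the question, still needs a real parser or a more involved pattern.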

I am not really sure about this, but you could have a go at Pattern.compile("[\\\\\"]");
\ is an escape character both in Java string literals and in regular expressions, so to match a single literal \ you write \\\\ in the Java source.
A similar thing worked for me in another context and I hope it solves your problem too.
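A small sketch of the double escaping described above, assuming the goal is a character class that matches either a backslash or a double quote. The class name and the sample strings are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are for illustration only.
class BackslashOrQuote {
    // The Java literal "[\\\\\"]" is the regex [\\"] : a character class
    // containing one backslash and one double quote.
    private static final Pattern BACKSLASH_OR_QUOTE = Pattern.compile("[\\\\\"]");

    static String[] split(String s) {
        return BACKSLASH_OR_QUOTE.split(s);
    }

    public static void main(String[] args) {
        // The input below is the five characters a \ b " c
        for (String part : split("a\\b\"c")) {
            System.out.println(part);
        }
    }
}
```

Note that splitting this way yields an empty string between an adjacent backslash and quote, so it does not by itself produce the fields asked for in the question.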

Related

Trouble parsing a sentence with Java Pattern matcher [duplicate]

I'm working with a Lexical Analyzer program right now and I'm using Java. I've been researching for answers on this problem but until now I failed to find any. Here's my problem:
Input:
System.out.println ("Hello World");
Desired Output:
Lexeme----------------------Token
System [Key_Word]
. [Object_Accessor]
out [Key_Word]
. [Object_Accessor]
println [Key_Word]
( [left_Parenthesis]
"Hello World" [String_Literal]
) [right_Parenthesis]
; [statement_separator]
I'm still a beginner so I hope you guys can help me on this. Thanks.
You need neither ANTLR nor the Dragon book to write a simple lexical analyzer by hand. Even lexical analyzers for fuller languages (like Java) aren't terribly complicated to write by hand. Obviously if you have an industrial task you might want to consider industrial strength tools like ANTLR or some lex variant, but for the sake of learning how lexical analysis works, writing one by hand would likely prove to be a useful exercise. I'm assuming that this is the case, since you said you're still a beginner.
Here's a simple lexical analyzer, written in Java, for a subset of a Scheme-like language, that I wrote after seeing this question. I think the code is relatively easy to understand even if you've never seen a lexer before, simply because breaking a stream of characters (in this case a String) into a stream of tokens (in this case a List<Token>) isn't that hard. If you have questions I can try to explain in more depth.
import java.util.List;
import java.util.ArrayList;
/*
* Lexical analyzer for Scheme-like minilanguage:
* (define (foo x) (bar (baz x)))
*/
public class Lexer {
public static enum Type {
// This Scheme-like language has three token types:
// open parens, close parens, and an "atom" type
LPAREN, RPAREN, ATOM;
}
public static class Token {
public final Type t;
public final String c; // contents mainly for atom tokens
// could have column and line number fields too, for reporting errors later
public Token(Type t, String c) {
this.t = t;
this.c = c;
}
public String toString() {
if(t == Type.ATOM) {
return "ATOM<" + c + ">";
}
return t.toString();
}
}
/*
* Given a String, and an index, get the atom starting at that index
*/
public static String getAtom(String s, int i) {
int j = i;
for( ; j < s.length(); ) {
if(Character.isLetter(s.charAt(j))) {
j++;
} else {
return s.substring(i, j);
}
}
return s.substring(i, j);
}
public static List<Token> lex(String input) {
List<Token> result = new ArrayList<Token>();
for(int i = 0; i < input.length(); ) {
switch(input.charAt(i)) {
case '(':
result.add(new Token(Type.LPAREN, "("));
i++;
break;
case ')':
result.add(new Token(Type.RPAREN, ")"));
i++;
break;
default:
if(Character.isWhitespace(input.charAt(i))) {
i++;
} else {
String atom = getAtom(input, i);
i += atom.length();
result.add(new Token(Type.ATOM, atom));
}
break;
}
}
return result;
}
public static void main(String[] args) {
if(args.length < 1) {
System.out.println("Usage: java Lexer \"((some Scheme) (code to) lex)\".");
return;
}
List<Token> tokens = lex(args[0]);
for(Token t : tokens) {
System.out.println(t);
}
}
}
Example use:
~/code/scratch $ java Lexer ""
~/code/scratch $ java Lexer "("
LPAREN
~/code/scratch $ java Lexer "()"
LPAREN
RPAREN
~/code/scratch $ java Lexer "(foo)"
LPAREN
ATOM<foo>
RPAREN
~/code/scratch $ java Lexer "(foo bar)"
LPAREN
ATOM<foo>
ATOM<bar>
RPAREN
~/code/scratch $ java Lexer "(foo (bar))"
LPAREN
ATOM<foo>
LPAREN
ATOM<bar>
RPAREN
RPAREN
Once you've written one or two simple lexers like this, you will get a pretty good idea of how this problem decomposes. Then it would be interesting to explore how to use automated tools like lex. The theory behind regular expression based matchers is not too difficult, but it does take a while to fully understand. I think writing lexers by hand motivates that study and helps you come to grips with the problem better than diving into the theory behind converting regular expressions to finite automata (first NFAs, then NFAs to DFAs), etc. Wading into that theory can be a lot to take in at once, and it is easy to get overwhelmed.
Personally, while the Dragon book is good and very thorough, the coverage might not be the easiest to understand because it aims to be complete, not necessarily accessible. You might want to try some other compiler texts before opening up the Dragon book. Here are a few free books, which have pretty good introductory coverage, IMHO:
http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf
http://www.diku.dk/~torbenm/Basics/
Some articles on the implementation of regular expressions (automated lexical analysis usually uses regular expressions)
http://swtch.com/~rsc/regexp/
ANTLR 4 will do exactly this with the Java.g4 reference grammar. You have two options depending on how closely you want the handling of Unicode escape sequences to follow the language specification.
https://github.com/antlr/grammars-v4/blob/master/java/Java.g4: This grammar only handles Unicode escape sequences as characters within a string or character literal.
https://github.com/antlr/antlr4/blob/master/tool/test/org/antlr/v4/test/Java-LR.g4 (must be renamed to Java.g4 before use): This grammar requires that you wrap your ANTLRInputStream in a JavaUnicodeInputStream, which processes Unicode escape sequences according to the JLS prior to feeding them to the lexer.
Edit: The names of the tokens produced by this grammar differ slightly from your table.
Your Key_Word token is Identifier
Your Object_Accessor token is DOT
Your left_Parenthesis token is LPAREN
Your String_Literal token is StringLiteral
Your right_Parenthesis token is RPAREN
Your statement_separator token is SEMI
Lexical analysis is a topic by itself that usually goes together with compiler design and analysis. You should read up about it before trying to code anything. My favourite book on this topic is the Dragon book which should give you a good introduction to compiler design and even provides pseudocodes for all compiler phases which you can easily translate to Java and move from there.
In short, the main idea is to parse the input and divide it into tokens which belong to certain classes (parentheses or keywords, for example, in your desired output) using a finite state machine. Process of state machine building is actually the only hard part of this analysis and the Dragon book will provide you with great insight into this thing.
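As a rough sketch of dividing input into tokens that belong to classes, here is one way to do it with java.util.regex for the exact input in the question. The class name, the group names and the very small token grammar are assumptions made for this example; the labels are copied from the question's table (which calls every word a Key_Word). String literals with escaped quotes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not a full Java lexer.
class TinyTokenizer {
    // {regex group name, label from the question's table}
    private static final String[][] KINDS = {
        {"word", "Key_Word"},
        {"string", "String_Literal"},
        {"dot", "Object_Accessor"},
        {"lparen", "left_Parenthesis"},
        {"rparen", "right_Parenthesis"},
        {"semi", "statement_separator"},
    };

    // Optional leading whitespace, then exactly one token.
    private static final Pattern TOKEN = Pattern.compile(
            "\\s*(?:(?<word>[A-Za-z_][A-Za-z0-9_]*)"
            + "|(?<string>\"[^\"]*\")"
            + "|(?<dot>\\.)"
            + "|(?<lparen>\\()"
            + "|(?<rparen>\\))"
            + "|(?<semi>;))");

    static List<String> tokenize(String input) {
        String text = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input near index " + pos);
            }
            // Exactly one named group took part in the match; report it.
            for (String[] kind : KINDS) {
                String lexeme = m.group(kind[0]);
                if (lexeme != null) {
                    tokens.add(lexeme + " [" + kind[1] + "]");
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("System.out.println (\"Hello World\");")) {
            System.out.println(t);
        }
    }
}
```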
You can use libraries like Lex & Bison in C or ANTLR in Java. Lexical analysis can be done by building automata. I'll give you a small example:
Suppose you need to tokenize a string where the keywords of the language are {'echo', '.', ' ', 'end'}. By keywords I mean that the language consists of these keywords only. So if I input
echo .
end .
My lexer should output
echo ECHO
SPACE
. DOT
end END
SPACE
. DOT
Now to build automata for such a tokenizer, I can start by
->(SPACE) (Back)
|
(S)-------------E->C->H->O->(ECHO) (Back)
| |
.->(DOT)(Back) ->N->D ->(END) (Back to Start)
The above diagram is probably very bad, but the idea is that you have a start state represented by S. You consume E and go to some other state, where you expect N or C to come next for END and ECHO respectively. You keep consuming characters and reach different states within this simple finite state machine. Ultimately you reach a certain emit state; for example, after consuming E, N, D you reach the emit state for END, which emits the token, and then you go back to the start state. This cycle continues for as long as characters keep streaming into your tokenizer. On an invalid character you can either throw an error or ignore it, depending on the design.
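The state machine described above could be written out in Java roughly like this. It is a sketch under assumptions: the class, enum and token names are invented for the example, keywords are taken to be lowercase, and an invalid character raises an error rather than being ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the echo / end / . / space tokenizer.
class EchoLexer {
    // START is the state S; the others record which prefix of
    // "echo" or "end" has been consumed so far.
    enum State { START, E, EC, ECH, EN }

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.START;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case START:
                    if (c == ' ') tokens.add("SPACE");
                    else if (c == '.') tokens.add("DOT");
                    else if (c == 'e') state = State.E;
                    else throw unexpected(c, i);
                    break;
                case E:
                    if (c == 'c') state = State.EC;
                    else if (c == 'n') state = State.EN;
                    else throw unexpected(c, i);
                    break;
                case EC:
                    if (c == 'h') state = State.ECH;
                    else throw unexpected(c, i);
                    break;
                case ECH:
                    // Emit state for ECHO, then back to the start state.
                    if (c == 'o') { tokens.add("ECHO"); state = State.START; }
                    else throw unexpected(c, i);
                    break;
                case EN:
                    // Emit state for END, then back to the start state.
                    if (c == 'd') { tokens.add("END"); state = State.START; }
                    else throw unexpected(c, i);
                    break;
            }
        }
        if (state != State.START) {
            throw new IllegalArgumentException("Input ended in the middle of a keyword");
        }
        return tokens;
    }

    private static IllegalArgumentException unexpected(char c, int index) {
        return new IllegalArgumentException("Unexpected '" + c + "' at index " + index);
    }
}
```

Calling EchoLexer.lex on one line of the example input at a time keeps the sketch within the four keywords; a newline would be rejected as an invalid character.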
CookCC ( https://github.com/coconut2015/cookcc ) generates a very fast, small, zero-dependency lexer for Java.
Write a program to make a simple lexical analyzer that will build a symbol table from a given stream of chars. You will need to read a file named “input.txt” to collect all chars. For simplicity, the input file will be a C/Java/Python program without headers and methods (the body of the main program). Then you will identify all the numerical values, identifiers, keywords, math operators, logical operators and others [distinct]. See the example for more details. You can assume that there will be a space after each keyword.
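#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<ctype.h> /* isalnum */
/* Forward declarations: main calls these before they are defined. */
void keyword_check(void);
void identifier_check(void);
void math_operator_check(void);
void logical_operator_check(void);
void numerical_check(void);
void others_check(void);
int isKeyword(char string_input[]);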
int main(){
/* By Ashik Rabbani
Daffodil International University,CSE43 */
keyword_check();
identifier_check();
math_operator_check();
logical_operator_check();
numerical_check();
others_check();
return 0;
}
void math_operator_check()
{
char ch, string_input[15], operators[] = "+-*/%";
FILE *fp;
char tr[20];
int i,j=0;
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nMath Operators : ");
while((ch = fgetc(fp)) != EOF){
for(i = 0; i < 6; ++i){
if(ch == operators[i])
printf("%c ", ch);
}
}
printf("\n");
fclose(fp);
}
void logical_operator_check()
{
char ch, string_input[15], operators[] = "&&||<>";
FILE *fp;
char tr[20];
int i,j=0;
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nLogical Operators : ");
while((ch = fgetc(fp)) != EOF){
for(i = 0; i < 6; ++i){
if(ch == operators[i])
printf("%c ", ch);
}
}
printf("\n");
fclose(fp);
}
void numerical_check()
{
char ch, string_input[15], operators[] ={'0','1','2','3','4','5','6','7','8','9'};
FILE *fp;
int i,j=0;
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nNumerical Values : ");
while((ch = fgetc(fp)) != EOF){
for(i = 0; i < 10; ++i){ /* all ten digits, not just the first six */
if(ch == operators[i])
printf("%c ", ch);
}
}
printf("\n");
fclose(fp);
}
void others_check()
{
char ch, string_input[15], symbols[] = "(){}[]";
FILE *fp;
char tr[20];
int i,j=0;
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nOthers : ");
while((ch = fgetc(fp)) != EOF){
for(i = 0; i < 6; ++i){
if(ch == symbols[i])
printf("%c ", ch);
}
}
printf("\n");
fclose(fp);
}
void identifier_check()
{
char ch, string_input[15];
FILE *fp;
char operators[] ={'0','1','2','3','4','5','6','7','8','9'};
int i,j=0;
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nIdentifiers : ");
while((ch = fgetc(fp)) != EOF){
if(isalnum(ch)){
string_input[j++] = ch;
}
else if((ch == ' ' || ch == '\n') && (j != 0)){
string_input[j] = '\0';
j = 0;
if(isKeyword(string_input) == 1)
{
}
else
printf("%s ", string_input);
}
}
printf("\n");
fclose(fp);
}
int isKeyword(char string_input[]){
char keywords[32][10] = {"auto","break","case","char","const","continue","default",
"do","double","else","enum","extern","float","for","goto",
"if","int","long","register","return","short","signed",
"sizeof","static","struct","switch","typedef","union",
"unsigned","void","volatile","while"};
int i, flag = 0;
for(i = 0; i < 32; ++i){
if(strcmp(keywords[i], string_input) == 0){
flag = 1;
break;
}
}
return flag;
}
void keyword_check()
{
char ch, string_input[15], operators[] = "+-*/%=";
FILE *fp;
char tr[20];
int i,j=0;
printf(" Token Identification using C \n By Ashik-E-Rabbani \n 161-15-7093\n\n");
fp = fopen("input.txt","r");
if(fp == NULL){
printf("error while opening the file\n");
exit(0);
}
printf("\nKeywords : ");
while((ch = fgetc(fp)) != EOF){
if(isalnum(ch)){
string_input[j++] = ch;
}
else if((ch == ' ' || ch == '\n') && (j != 0)){
string_input[j] = '\0';
j = 0;
if(isKeyword(string_input) == 1)
printf("%s ", string_input);
}
}
printf("\n");
fclose(fp);
}

How to optimize do-while loops?

I want to execute a loop as long as a certain condition applies. At the end, I want to return the value that was last being found inside the loop.
Non-realworld example:
teststring = " abcde";
String letter = null;
do {
letter = reader.read(); //reads the teststring char by char
} while (letter.equals(" "));
return letter; //return "a"
Could this be optimized from the coding point of view, eg transform it from a do-while loop to just a while-loop?
If you use Java 1.7 or 1.8, you can do this:
while((letter=reader.read()).equals(" ")){
}
return letter;
If you are reading from a Reader, read() returns an int, which is the char or -1 at the end of input.
int ch;
while((ch = reader.read()) == ' ');
return ch;
Note: " " is a String and ' ' is a char.
Not sure about which is more efficient, but you could do something like:
return teststring.trim().charAt(0);
do {
...
} while (<condition>);
I am going to explain your question on do/while vs while alone.
The do is only a label. It has no impact on efficiency. The while at the bottom is effectively an if(condition) goto line #, where line # is the do. The "do" is simply a way of telling the compiler what number you want in that goto statement at the bottom.
Putting the while statement at the top would actually be less efficient because it means the condition has to be evaluated on the first iteration. Perhaps your reader does need to be checked on the first iteration, then it should be a while statement, but that requires more work, you see?
Second, even transforming it to only a while statement still places an unconditional goto at the bottom, with a conditional goto at the top, so even though it looks like less code, it could possibly be more.
I think it would be easier to just use String.toCharArray() and a For Each loop like
String teststring = " abcde";
for (char ch : teststring.toCharArray()) {
if (ch != ' ') return ch; // <-- 'a'
}
throw new ParseException("Whitespace only", 0); // java.text.ParseException also takes an error offset
But, you could use a StringReader and you're using char (a primitive), so I think you've asked for
String teststring = " abcde";
StringReader reader = new StringReader(teststring);
try {
int letter;
do {
letter = reader.read();
} while (letter == ' ');
return ((char) letter);
} catch (IOException e) {
e.printStackTrace();
}
throw new ParseException("Whitespace only", 0); // java.text.ParseException also takes an error offset
or return a default value if the character isn't found.

Parsing comma-separated values enclosed with quotes

I'm trying to parse comma-separated values that are enclosed in quotes using only standard Java libraries (I know this must be possible).
As an example, file.txt contains one row per line:
"Foo","Bar","04042013","04102013","Stuff"
"Foo2","Bar2","04042013","04102013","Stuff2"
However when I parse the file with the code I've written so far:
import java.io.*;
import java.util.Arrays;
public class ReadCSV{
public static void main(String[] arg) throws Exception {
BufferedReader myFile = new BufferedReader(new FileReader("file.txt"));
String myRow = myFile.readLine();
while (myRow != null){
//split by comma separated quote enclosed values
//BUG - first and last values get an extra quote
String[] myArray = myRow.split("\",\""); //the problem
for (String item:myArray) { System.out.print(item + "\t"); }
System.out.println();
myRow = myFile.readLine();
}
myFile.close();
}
}
However the output is
"Foo Bar 04042013 04102013 Stuff"
"Foo2 Bar2 04042013 04102013 Stuff2"
Instead of
Foo Bar 04042013 04102013 Stuff
Foo2 Bar2 04042013 04102013 Stuff2
I know I went wrong on the Split but I'm not sure how to fix it.
Before doing the split, just remove the first and last double quotes from the myRow variable using the line below.
myRow = myRow.substring(1, myRow.length() - 1);
(UPDATE) Also check that myRow is not empty; otherwise the above code will throw an exception. For example, the code below removes the double quotes only if myRow is not empty.
if (!myRow.isEmpty()) {
myRow = myRow.substring(1, myRow.length() - 1);
}
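Putting the two steps together, a minimal sketch might look like the following. It assumes every value is wrapped in double quotes and that no value contains the sequence "," itself; the class and method names are hypothetical.

```java
// Hypothetical helper combining the quote stripping with the original split.
class QuotedRowSplitter {
    static String[] splitRow(String row) {
        // Strip the outer quotes only if both are actually present.
        if (row.length() >= 2 && row.startsWith("\"") && row.endsWith("\"")) {
            row = row.substring(1, row.length() - 1);
        }
        // What remains is separated by the three characters quote, comma, quote.
        return row.split("\",\"");
    }

    public static void main(String[] args) {
        for (String item : splitRow("\"Foo\",\"Bar\",\"04042013\",\"04102013\",\"Stuff\"")) {
            System.out.print(item + "\t");
        }
        System.out.println();
    }
}
```

Checking the length as well as the quotes avoids the exception on a row that is empty or consists of a single quote character.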
I think you will probably have to go for a stateful approach, basically like the code below (another state would be necessary if you want to allow escaping of quotes within a value):
import java.util.ArrayList;
import java.util.List;
public class CSV {
public static void main(String[] args) {
String s = "\"hello, i am\",\"a string\"";
String x = s;
List<String> l = new ArrayList<String>();
int state = 0;
while(x.length()>0) {
if(state == 0) {
if(x.indexOf("\"")>-1) {
x = x.substring(x.indexOf("\"")+1).trim();
state = 1;
} else {
break;
}
} else if(state == 1) {
if(x.indexOf("\"")>-1) {
String found = x.substring(0,x.indexOf("\""));
System.err.println("found: "+found);
l.add(found);
x = x.substring(x.indexOf("\"")+1).trim();
state = 0;
} else {
throw new RuntimeException("bad format");
}
} else if(state == 2) {
if(x.indexOf(",")>-1) {
x = x.substring(x.indexOf(",")+1).trim();
state = 0;
} else {
break;
}
}
}
for(String f : l) {
System.err.println(f);
}
}
}
Instead, you can use replaceAll, which, for me, looks more suitable for this task:
myRow = myRow.replaceAll("\"", "").replaceAll(","," ");
This will replace all the " with nothing (Will remove them), then it'll replace all , with space (You can increase the number of spaces of course).
The problem in the above code snippet is that you are splitting the String on "," (quote, comma, quote).
At the start of your line ("Foo",") and at the end (","Stuff"), the leading and trailing quotes are not part of a "," sequence, so they are not split off.
So this is definitely not a bug in Java; in your case you need to handle that part yourself.
You have multiple options. Some of them are below.
1. If you are sure there will always be a starting " and an ending ", you can remove them from the String before splitting.
2. If the starting and ending " are optional, you can first check for them with startsWith and endsWith, and remove them if they exist before splitting.
You can simply get the String delimited by the comma and then delete the first and last '"'.
=)
Hope that's helpful.
Don't have much time :D
String s = "\"Foo\",\"Bar\",\"04042013\",\"04102013\",\"Stuff\"";
String[] bufferArray = new String[10];
String bufferString;
int i = 0;
System.out.println(s);
Scanner scanner = new Scanner(s);
scanner.useDelimiter(",");
while(scanner.hasNext()) {
bufferString = scanner.next();
bufferArray[i] = bufferString.subSequence(1, bufferString.length() - 1).toString();
i++;
}
System.out.println(bufferArray[0]);
System.out.println(bufferArray[1]);
System.out.println(bufferArray[2]);
This solution is less elegant than a String.split() one-liner. The advantage is that we avoid fragile string manipulation, i.e. the use of String.substring(). The string must end with ," however.
This version handles spaces between delimiters. Delimiter characters within quotes are ignored as expected, as are escaped quotes (for example \").
String s = "\"F\\\",\\\"oo\" , \"B,ar\",\"04042013\",\"04102013\",\"St,u\\\"ff\"";
Pattern p = Pattern.compile("(.*?)\"\\s*,\\s*\"");
Matcher m = p.matcher(s + ",\""); // String must end with ,"
while (m.find()) {
String result = m.group(1);
System.out.println(result);
}

I can't find what causes the Exception

I am a java noob as well as very new to this site so please bear with me here. If I do something wrong in posting this please let me know and I apologize in advance for anything I do happen to do or any bad grammar.
I need to write a custom CSV parser in java that does not separate commas inside quotations. I cannot use anything related to the CSV class.
1,2,3,4 -> 1 2 3 4
a,"b,c",d -> a b,c d
No matter what I try, I always get an exception.
import java.io.*;
import java.util.*;
public class csv
{
public static void main(String[] args)
{
File file = new File("csv.test");
BufferedReader buf = null;
try
{
buf = new BufferedReader(new FileReader(file));
String str;
while ((str = buf.readLine()) != null)
{
String[] values = null; // array for saving each split substr
int c1 = 0; // counter for current position in string
int c2 = 0; // counter for next space in array to save substr
int c3 = 0; // location of last comma or quotation for substring creation
int i = 0; // counter used later for printing
while (c1 <= str.length())
{
char x = str.charAt(c1);
String s = Character.toString(x);
if (c1 == str.length())
{
values[c2] = str.substring(c3, (c1 - 1));
}
else if (s == ",")
{
values[c2] = str.substring(c3, (c1 - 1));
c1++;
c2++;
c3++;
}
else if (s == "\"")
{
c1++;
c3 = c1;
while (s != "\"")
c1++;
values[c2] = str.substring(c3, (c1 - 1));
c1 = c1 + 2;
c2++;
c3 = c1;
}
else
{
c1++;
}
while (i < values.length)
System.out.println("\"" + values[i] + "\"");
}
}
}
catch (Exception e)
{
System.out.println("Error locating test file");
}
}
}
I have tried cutting out all the logic and testing whether it is file I/O related. That reads fine, so I'm down to it being logic related. I looked it over and cannot find anything wrong with it. I even had a friend look it through, and he said it looked fine. Where is the exception thrown?
You are not initializing values in the line String[] values = null;, so it will fail as soon as you use it, e.g. at the line values[c2] = str.substring(c3, (c1 - 1));.
To resolve the issue, please initialize the values array with proper length e.g. String[] values = new String[SIZE];. Probably you need to use str.length() as SIZE of the array.
Also in your comparison else if (s == ","), you are comparing String using ==, which is incorrect. Instead, you can use x itself as else if (x == ',').
Edit: Your condition below will make c1 increment indefinitely, as you are not changing the value of s (x after the correction advised above) anywhere.
Old condition:
while (s != "\"")
c1++;
After change as advised(still wrong as x is not changing within the while loop):
while (x != '"')
c1++;
Please correct the loop logic.
Not related to the concrete problem but instead of s == ",", you should do ",".equals(s), similar for other string equality checks.
This is not the cause of your exception, but it is obviously incorrect nonetheless:
while (s != "\"")
c1++;
It is incorrect for two reasons:
Since the c1++ doesn't alter the loop condition, this will loop for ever.
You are using == / != to compare strings when you should be using equals(...). You appear to have made this mistake in other places too.
To find out what is causing the exception, the first thing you should do is to print out the stacktrace. Add e.printStackTrace(); to the catch block.
Or better still, change it to just catch IOException. Catching Exception is a bad idea ... unless you are going to log / output the full stacktrace.
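For comparison, here is a minimal sketch of the kind of parser the question describes, written to avoid the problems listed above: it collects values in a List instead of an unallocated array, compares char values with ==, and advances the index on every iteration. The class and method names are hypothetical, and escaped quotes inside a quoted value are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the asker's code.
class SimpleCsvLine {
    static List<String> split(String line) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // Toggle quoted mode; the quote itself is not part of the value.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                // A comma outside quotes ends the current value.
                values.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        values.add(current.toString()); // last value has no trailing comma
        return values;
    }

    public static void main(String[] args) {
        System.out.println(split("1,2,3,4"));
        System.out.println(split("a,\"b,c\",d"));
    }
}
```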

Determine if a String is a number and convert in Java?

I know variants of this question have been asked frequently before (see here and here for instance), but this is not an exact duplicate of those.
I would like to check if a String is a number, and if so I would like to store it as a double. There are several ways to do this, but all of them seem inappropriate for my purposes.
One solution would be to use Double.parseDouble(s) or similarly new BigDecimal(s). However, those solutions don't work if there are commas present (so "1,234" would cause an exception). I could of course strip out all commas before using these techniques, but that would seem to pose loads of problems in other locales.
I looked at Apache Commons NumberUtils.isNumber(s), but that suffers from the same comma issue.
I considered NumberFormat or DecimalFormat, but those seemed far too lenient. For instance, "1A" is formatted to "1" instead of indicating that it's not a number. Furthermore, something like "127.0.0.1" will be counted as the number 127 instead of indicating that it's not a number.
I feel like my requirements aren't so exotic that I'm the first to do this, but none of the solutions does exactly what I need. I suppose even I don't know exactly what I need (otherwise I could write my own parser), but I know the above solutions do not work for the reasons indicated. Does any solution exist, or do I need to figure out precisely what I need and write my own code for it?
Sounds quite weird, but I would try to follow this answer and use java.util.Scanner.
Scanner scanner = new Scanner(input);
if (scanner.hasNextInt())
System.out.println(scanner.nextInt());
else if (scanner.hasNextDouble())
System.out.println(scanner.nextDouble());
else
System.out.println("Not a number");
For inputs such as 1A, 127.0.0.1, 1,234, 6.02e-23 I get the following output:
Not a number
Not a number
1234
6.02E-23
Scanner.useLocale can be used to change to the desired locale.
You can specify the Locale that you need:
NumberFormat nf = NumberFormat.getInstance(Locale.GERMAN);
double myNumber = nf.parse(myString).doubleValue();
This should work in your example since German Locale has commas as decimal separator.
You can use the ParsePosition as a check for complete consumption of the string in a NumberFormat.parse operation. If the string is consumed, then you don't have a "1A" situation. If not, you do and can behave accordingly. See here for a quick outline of the solution and here for the related JDK bug that is closed as "won't fix" because of the ParsePosition option.
Unfortunately Double.parseDouble(s) or new BigDecimal(s) seem to be your best options.
You cite localisation concerns, but unfortunately there is no way to reliably support all locales without specification by the user anyway. It is just impossible.
Sometimes you can reason about the scheme used by looking at whether commas or periods are used first, if both are used, but this isn't always possible, so why even try? Better to have a system which you know works reliably in certain situations than try to rely on one which may work in more situations but can also give bad results...
What does the number 123,456 represent? 123456 or 123.456?
Just strip commas, or spaces, or periods, depending on locale specified by user. Default to stripping spaces and commas. If you want to make it stricter, only strip commas OR spaces, not both, and only before the period if there is one. Also should be pretty easy to check manually if they are spaced properly in threes. In fact a custom parser might be easiest here.
Here is a bit of a proof of concept. It's a bit (very) messy but I reckon it works, and you get the idea anyways :).
public class StrictNumberParser {
public double parse(String numberString) throws NumberFormatException {
numberString = numberString.trim();
char[] numberChars = numberString.toCharArray();
Character separator = null;
int separatorCount = 0;
boolean noMoreSeparators = false;
for (int index = 1; index < numberChars.length; index++) {
char character = numberChars[index];
if (noMoreSeparators || separatorCount < 3) {
if (character == '.') {
if (separator != null) {
throw new NumberFormatException();
} else {
noMoreSeparators = true;
}
} else if (separator == null && (character == ',' || character == ' ')) {
if (noMoreSeparators) {
throw new NumberFormatException();
}
separator = Character.valueOf(character);
separatorCount = -1;
} else if (!Character.isDigit(character)) {
throw new NumberFormatException();
}
separatorCount++;
} else {
if (character == '.') {
noMoreSeparators = true;
} else if (separator == null) {
if (Character.isDigit(character)) {
noMoreSeparators = true;
} else if (character == ',' || character == ' ') {
separator = Character.valueOf(character);
} else {
throw new NumberFormatException();
}
} else if (!separator.equals(character)) {
throw new NumberFormatException();
}
separatorCount = 0;
}
}
if (separator != null) {
if (!noMoreSeparators && separatorCount != 3) {
throw new NumberFormatException();
}
numberString = numberString.replace(separator.toString(), ""); // literal replace, no regex needed
}
return Double.parseDouble(numberString);
}
public void testParse(String testString) {
try {
System.out.println("result: " + parse(testString));
} catch (NumberFormatException e) {
System.out.println("Couldn't parse number!");
}
}
public static void main(String[] args) {
StrictNumberParser p = new StrictNumberParser();
p.testParse("123 45.6");
p.testParse("123 4567.8");
p.testParse("123 4567");
p.testParse("12 45");
p.testParse("123 456 45");
p.testParse("345.562,346");
p.testParse("123 456,789");
p.testParse("123,456,789");
p.testParse("123 456 789.52");
p.testParse("23,456,789");
p.testParse("3,456,789");
p.testParse("123 456.12");
p.testParse("1234567.8");
}
}
EDIT: obviously this would need to be extended for recognising scientific notation, but this should be simple enough, especially as you don't have to actually validate anything after the e, you can just let parseDouble fail if it is badly formed.
It might also be a good idea to properly extend NumberFormat with this: have a getSeparator() for parsed numbers and a setSeparator() for giving the desired output format. This sort of takes care of localisation, but again more work would need to be done to support ',' for decimals...
Not sure if it meets all your requirements, but the code found here might point you in the right direction?
From the article:
To summarize, the steps for proper input processing are:
Get an appropriate NumberFormat and define a ParsePosition variable.
Set the ParsePosition index to zero.
Parse the input value with parse(String source, ParsePosition parsePosition).
Perform error operations if the input length and ParsePosition index value don't match or if the parsed Number is null.
Otherwise, the value passed validation.
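The steps above can be sketched roughly as follows. This is only a minimal illustration, not the article's code: the class name StrictParseSketch and the method parseStrict are made up for the example, and it returns null where the article says to "perform error operations".

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

// Hypothetical helper, named for this example only.
class StrictParseSketch {

    /** Returns the parsed number, or null unless the whole input was consumed. */
    public static Number parseStrict(String input, Locale locale) {
        NumberFormat format = NumberFormat.getInstance(locale);
        ParsePosition position = new ParsePosition(0); // start parsing at index zero
        Number parsed = format.parse(input, position);
        // parse() stops at the first character it cannot use, so a leftover
        // tail such as the "A" in "1A" shows up as an index short of the length.
        if (parsed == null || position.getIndex() != input.length()) {
            return null;
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("1A", Locale.US));        // rejected: null
        System.out.println(parseStrict("127.0.0.1", Locale.US)); // rejected: null
        System.out.println(parseStrict("99.64", Locale.US));     // accepted
    }
}
```

Note that this only checks that nothing is left over. As far as I know, NumberFormat does not check where the grouping separators sit, so you would still need your own check if you care about that.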
This is an interesting problem. But perhaps it is a little open-ended? Are you looking specifically to identify base-10 numbers, or hex, or what? I'm assuming base-10. What about currency? Is that important? Or is it just numbers.
In any case, I think that you can use the deficiencies of NumberFormat to your advantage. Since you know that something like "1A" will be interpreted as 1, why not check the result by formatting it and comparing against the original string?
public static boolean isNumber(String s){
try{
Locale l = Locale.getDefault();
DecimalFormat df = new DecimalFormat("###.##;-##.##");
Number n = df.parse(s);
String sb = df.format(n);
return sb.equals(s);
}
catch(Exception e){
return false;
}
}
What do you think?
This is really interesting, and I think people are trying to overcomplicate it. I would really just break this down by rules:
1) Check for scientific notation (does it match the pattern of being all numbers, commas, periods, -/+ and having an 'e' in it?) -- if so, parse however you want
2) Does it match the regexp for valid numeric characters (0-9 , . - +) (only 1 . - or + allowed)
if so, strip out everything that's not a digit and parse appropriately, otherwise fail.
I can't see a shortcut that's going to work here, just take the brute force approach, not everything in programming can be (or needs to be) completely elegant.
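A rough sketch of those two rules, assuming US conventions (comma for grouping, period for decimals). The class name and both patterns are my own guesses at what the rules mean, not a definitive implementation:

```java
import java.util.regex.Pattern;

// Hypothetical class, named for this example only.
class RuleBasedParserSketch {

    // Rule 1: a plain number followed by an exponent part.
    private static final Pattern SCIENTIFIC =
            Pattern.compile("[-+]?[0-9][0-9,]*(\\.[0-9]+)?[eE][-+]?[0-9]+");

    // Rule 2: at most one leading sign, digits and commas, at most one period.
    private static final Pattern PLAIN =
            Pattern.compile("[-+]?[0-9][0-9,]*(\\.[0-9]+)?");

    public static double parse(String input) {
        String s = input.trim();
        if (SCIENTIFIC.matcher(s).matches() || PLAIN.matcher(s).matches()) {
            // Commas are simply dropped; their position is not checked here.
            return Double.parseDouble(s.replace(",", ""));
        }
        throw new NumberFormatException("Not a number: " + input);
    }

    public static void main(String[] args) {
        System.out.println(parse("1,344"));   // commas stripped before parsing
        System.out.println(parse("1.47E-9")); // matched by the scientific rule
    }
}
```

Inputs like "1A" or "127.0.0.1" match neither pattern, so they end up in the NumberFormatException branch.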
My understanding is that you want to cover Western/Latin languages while retaining as much strict interpretation as possible. So what I'm doing here is asking DecimalFormatSymbols to tell me what the grouping, decimal, negative, and zero separators are, and swapping them out for symbols Double will recognize.
How does it perform?
In the US, it rejects: "1A", "127.100.100.100"
and accepts "1.47E-9"
In Germany it still rejects "1A"
It ACCEPTS "1,024.00" but interprets it correctly as 1.024. Likewise, it accepts "127.100.100.100" as 127100100100.0
In fact, the German locale correctly identifies and parses "1,47E-9"
Let me know if you have any trouble in a different locale.
import java.util.Locale;
import java.text.DecimalFormatSymbols;
public class StrictNumberFormat {
public static boolean isDouble(String s, Locale l) {
String clean = convertLocaleCharacters(s,l);
try {
Double.valueOf(clean);
return true;
} catch (NumberFormatException nfe) {
return false;
}
}
public static double doubleValue(String s, Locale l) {
return Double.valueOf(convertLocaleCharacters(s,l));
}
public static boolean isDouble(String s) {
return isDouble(s,Locale.getDefault());
}
public static double doubleValue(String s) {
return doubleValue(s,Locale.getDefault());
}
private static String convertLocaleCharacters(String number, Locale l) {
DecimalFormatSymbols symbols = new DecimalFormatSymbols(l);
String grouping = getUnicodeRepresentation( symbols.getGroupingSeparator() );
String decimal = getUnicodeRepresentation( symbols.getDecimalSeparator() );
String negative = getUnicodeRepresentation( symbols.getMinusSign() );
String zero = getUnicodeRepresentation( symbols.getZeroDigit() );
String clean = number.replaceAll(grouping, "");
clean = clean.replaceAll(decimal, ".");
clean = clean.replaceAll(negative, "-");
clean = clean.replaceAll(zero, "0");
return clean;
}
private static String getUnicodeRepresentation(char ch) {
String unicodeString = Integer.toHexString(ch); //ch implicitly promoted to int
while(unicodeString.length()<4) unicodeString = "0"+unicodeString;
return "\\u"+unicodeString;
}
}
You're best off doing it manually. Figure out what you can accept as a number and disregard everything else:
import java.lang.NumberFormatException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class ParseDouble {
public static void main(String[] argv) {
String line = "$$$|%|#|1A|127.0.0.1|1,344|95|99.64";
for (String s : line.split("\\|")) {
try {
System.out.println("parsed: " +
any2double(s)
);
}catch (NumberFormatException ne) {
System.out.println(ne.getMessage());
}
}
}
public static double any2double(String input) throws NumberFormatException {
double out =0d;
Pattern special = Pattern.compile("[^a-zA-Z0-9\\.,]+");
Pattern letters = Pattern.compile("[a-zA-Z]+");
Pattern comma = Pattern.compile(",");
Pattern allDigits = Pattern.compile("^[0-9]+$");
Pattern singleDouble = Pattern.compile("^[0-9]+\\.[0-9]+$");
Matcher[] goodCases = new Matcher[]{
allDigits.matcher(input),
singleDouble.matcher(input)
};
Matcher[] nanCases = new Matcher[]{
special.matcher(input),
letters.matcher(input)
};
// maybe cases
if (comma.matcher(input).find()){
out = Double.parseDouble(
comma.matcher(input).replaceFirst("."));
return out;
}
for (Matcher m : nanCases) {
if (m.find()) {
throw new NumberFormatException("Bad input "+input);
}
}
for (Matcher m : goodCases) {
if (m.find()) {
try {
out = Double.parseDouble(input);
return out;
} catch (NumberFormatException ne){
System.out.println(ne.getMessage());
}
}
}
throw new NumberFormatException("Could not parse "+input);
}
}
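If you set your locale right, the built-in parsing will work with commas: a NumberFormat obtained for that locale parses them, whereas Double.parseDouble itself ignores the locale. Example is here.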
I think you've got a multi step process to handle here with a custom solution, if you're not willing to accept the results of DecimalFormat or the answers already linked.
1) Identify the decimal and grouping separators. You might need to identify other format symbols (such as scientific notation indicators).
http://download.oracle.com/javase/1.4.2/docs/api/java/text/DecimalFormat.html#getDecimalFormatSymbols()
2) Strip out all grouping symbols (or craft a regex, be careful of other symbols you accept such as the decimal if you do). Then strip out the first decimal symbol. Other symbols as needed.
3) Call parse or isNumber.
One easy hack would be to call replaceFirst on the String you get and check whether the new String is a double. If it is, convert it back (if needed).
If you want to convert a string holding a comma-separated decimal number to a double, you could use DecimalFormat + DecimalFormatSymbols:
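A minimal sketch of that hack, assuming the idea is to swap the first comma for a period and let Double.parseDouble decide. The class and method names are made up for the example:

```java
// Hypothetical helper, named for this example only.
class ReplaceFirstSketch {

    /** Returns the value, or null if the string is not a double after the swap. */
    public static Double commaToDouble(String input) {
        // Only the first comma is replaced, so a second comma makes parsing fail.
        String candidate = input.replaceFirst(",", ".");
        try {
            return Double.parseDouble(candidate);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(commaToDouble("2,3"));       // 2.3
        System.out.println(commaToDouble("1,344,000")); // null
    }
}
```

Keep in mind that Double.parseDouble also accepts a trailing type suffix such as "1d", so this is not a strict check on its own.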
final double strToDouble(String str, char separator){
DecimalFormatSymbols s = new DecimalFormatSymbols();
s.setDecimalSeparator(separator);
DecimalFormat df = new DecimalFormat();
double num = 0;
df.setDecimalFormatSymbols(s);
try{
num = ((Double) df.parse(str)).doubleValue();
}catch(ClassCastException | ParseException ex){
// if you want, you could add something here to
// indicate the string is not double
}
return num;
}
Well, let's test it:
String a = "1.2";
String b = "2,3";
String c = "A1";
String d = "127.0.0.1";
System.out.println("\"" + a + "\" = " + strToDouble(a, ','));
System.out.println("\"" + a + "\" (with '.' as separator) = "
+ strToDouble(a, '.'));
System.out.println("\"" + b + "\" = " + strToDouble(b, ','));
System.out.println("\"" + c + "\" = " + strToDouble(c, ','));
System.out.println("\"" + d + "\" = " + strToDouble(d, ','));
If you run the above code, you'll see:
"1.2" = 0.0
"1.2" (with '.' as separator) = 1.2
"2,3" = 2.3
"A1" = 0.0
"127.0.0.1" = 0.0
This will take a string, count its decimals and commas, remove commas, conserve a valid decimal (note that this is based on US standardization - in order to handle 1.000.000,00 as 1 million this process would have to have the decimal and comma handling switched), determine if the structure is valid, and then return a double. Returns null if the string could not be converted. Edit: Added support for international or US. convertStoD(string,true) for US, convertStoD(string,false) for non US. Comments are now for US version.
public Double convertStoD(String s, boolean isUS) {
    // String s = "some string or number, something dynamic";
    if (s == null || s.isEmpty()) return null;
    boolean isNegative = false;
    if (s.charAt(0) == '-') {
        s = s.substring(1);
        isNegative = true;
    }
    // Position 0 holds the grouping separator, position 1 the decimal separator.
    String validNumberArguments = isUS ? ",." : ".,";
    char groupSeparator = validNumberArguments.charAt(0);   // ',' for US
    char decimalSeparator = validNumberArguments.charAt(1); // '.' for US
    int currentCommas = 0;
    int currentDecimals = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == groupSeparator) {
            currentCommas++;
            continue;
        }
        if (c == decimalSeparator) {
            currentDecimals++;
            continue;
        }
        if (c < '0' || c > '9') return null; // remove 1 A
    }
    if (currentDecimals > 1) return null; // remove 1.00.00
    String decimalValue = "";
    if (currentDecimals > 0) {
        int index = s.indexOf(decimalSeparator);
        String fraction = s.substring(index + 1);
        if (fraction.indexOf(groupSeparator) != -1) return null; // remove 1.00,000
        // Double.parseDouble only understands '.', so normalise the decimal symbol.
        decimalValue = "." + fraction;
        s = s.substring(0, index);
    }
    int allowedCommas = (s.length() - 1) / 3;
    if (currentCommas > allowedCommas) return null; // too many separators for this length
    // The separator may be '.', a regex metacharacter, so it has to be quoted.
    // The -1 limit keeps trailing empty groups so they fail the length check below.
    String[] numberParser = s.split(java.util.regex.Pattern.quote(String.valueOf(groupSeparator)), -1);
    int length = numberParser.length;
    StringBuilder returnString = new StringBuilder();
    for (int i = 0; i < length; i++) {
        if (i == 0) {
            if (numberParser[i].length() > 3 && length > 1) return null; // remove 1234,0,000
            returnString.append(numberParser[i]);
            continue;
        }
        if (numberParser[i].length() != 3) return null; // ensure proper 1,000,000 (also removes 10,00,000)
        returnString.append(numberParser[i]);
    }
    returnString.append(decimalValue);
    double answer;
    try {
        answer = Double.parseDouble(returnString.toString());
    } catch (NumberFormatException e) {
        return null; // e.g. "-" or "." on its own
    }
    if (isNegative) answer *= -1;
    return answer;
}
This code should handle most inputs, except IP addresses where all groups of digits are in threes (e.g. 255.255.255.255 is accepted as a number, but 255.1.255.255 is not). It also doesn't support scientific notation.
It will work with most variants of separators (",", "." or space). If more than one separator is detected, the first is assumed to be the thousands separator, with additional checks (validity etc.)
Edit: prevDigits is used for checking that the number uses thousand separators correctly. If there is more than one group of thousands, all but the first one must be in groups of 3. I modified the code to make it clearer so that "3" is not a magic number but a constant.
Edit 2: I don't mind the down votes much, but can someone explain what the problem is?
/* A number using thousand separator must have
groups of 3 digits, except the first one.
Numbers following the decimal separator can
of course be unlimited. */
private final static int GROUP_SIZE=3;
public static boolean isNumber(String input) {
boolean inThousandSep = false;
boolean inDecimalSep = false;
boolean endsWithDigit = false;
char thousandSep = '\0';
int prevDigits = 0;
for(int i=0; i < input.length(); i++) {
char c = input.charAt(i);
switch(c) {
case ',':
case '.':
case ' ':
endsWithDigit = false;
if(inDecimalSep)
return false;
else if(inThousandSep) {
if(c != thousandSep)
inDecimalSep = true;
if(prevDigits != GROUP_SIZE)
return false; // Invalid use of separator
}
else {
if(prevDigits > GROUP_SIZE || prevDigits == 0)
return false;
thousandSep = c;
inThousandSep = true;
}
prevDigits = 0;
break;
default:
if(Character.isDigit(c)) {
prevDigits++;
endsWithDigit = true;
}
else {
return false;
}
}
}
return endsWithDigit;
}
Test code:
public static void main(String[] args) {
System.out.println(isNumber("100")); // true
System.out.println(isNumber("100.00")); // true
System.out.println(isNumber("1,5")); // true
System.out.println(isNumber("1,000,000.00.")); // false
System.out.println(isNumber("100,00,2")); // false
System.out.println(isNumber("123.123.23.123")); // false
System.out.println(isNumber("123.123.123.123")); // true
}
