Determine if a String is a number and convert in Java? - java

I know variants of this question have been asked frequently before (see here and here for instance), but this is not an exact duplicate of those.
I would like to check if a String is a number, and if so I would like to store it as a double. There are several ways to do this, but all of them seem inappropriate for my purposes.
One solution would be to use Double.parseDouble(s) or similarly new BigDecimal(s). However, those solutions don't work if there are commas present (so "1,234" would cause an exception). I could of course strip out all commas before using these techniques, but that would seem to pose loads of problems in other locales.
I looked at Apache Commons NumberUtils.isNumber(s), but that suffers from the same comma issue.
I considered NumberFormat or DecimalFormat, but those seemed far too lenient. For instance, "1A" is parsed as 1 instead of indicating that it's not a number. Furthermore, something like "127.0.0.1" will be counted as the number 127 instead of indicating that it's not a number.
I feel like my requirements aren't so exotic that I'm the first to do this, but none of the solutions does exactly what I need. I suppose even I don't know exactly what I need (otherwise I could write my own parser), but I know the above solutions do not work for the reasons indicated. Does any solution exist, or do I need to figure out precisely what I need and write my own code for it?

Sounds quite weird, but I would try to follow this answer and use java.util.Scanner.
Scanner scanner = new Scanner(input);
if (scanner.hasNextInt())
System.out.println(scanner.nextInt());
else if (scanner.hasNextDouble())
System.out.println(scanner.nextDouble());
else
System.out.println("Not a number");
For the inputs "1A", "127.0.0.1", "1,234" and "6.02e-23" I get the following output:
Not a number
Not a number
1234
6.02E-23
Scanner.useLocale can be used to change to the desired locale.
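As a rough sketch of how the Scanner idea could be wrapped into a reusable check: the class name ScannerParse and the method parse are made up for illustration, and the extra hasNext() test is an added assumption so that input with trailing tokens (for example "12 abc") is rejected rather than silently truncated.

```java
import java.util.Locale;
import java.util.OptionalDouble;
import java.util.Scanner;

// Hypothetical helper, not part of the original answer.
final class ScannerParse {
    // Returns the parsed value, or an empty OptionalDouble if the whole
    // input is not a single number in the given locale.
    static OptionalDouble parse(String input, Locale locale) {
        try (Scanner scanner = new Scanner(input.trim()).useLocale(locale)) {
            if (scanner.hasNextDouble()) {
                double value = scanner.nextDouble();
                // Assumption: anything left after the number makes the input invalid.
                if (!scanner.hasNext()) {
                    return OptionalDouble.of(value);
                }
            }
            return OptionalDouble.empty();
        }
    }
}
```

With Locale.US this accepts "1,234" and rejects "1A" and "127.0.0.1"; passing Locale.GERMAN instead switches the grouping and decimal separators.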

You can specify the Locale that you need:
NumberFormat nf = NumberFormat.getInstance(Locale.GERMAN);
double myNumber = nf.parse(myString).doubleValue();
This should work in your example since German Locale has commas as decimal separator.

You can use the ParsePosition as a check for complete consumption of the string in a NumberFormat.parse operation. If the string is fully consumed, then you don't have a "1A" situation. If not, you do and can behave accordingly. See here for a quick outline of the solution and here for the related JDK bug that is closed as "won't fix" because of the ParsePosition option.

Unfortunately Double.parseDouble(s) or new BigDecimal(s) seem to be your best options.
You cite localisation concerns, but unfortunately there is no way to reliably support all locales without the user specifying one anyway. It is just impossible.
Sometimes you can reason about the scheme used by looking at whether commas or periods are used first, if both are used, but this isn't always possible, so why even try? Better to have a system which you know works reliably in certain situations than try to rely on one which may work in more situations but can also give bad results...
What does the number 123,456 represent? 123456 or 123.456?
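To make the ambiguity concrete, here is a small sketch that parses the same text under two locales with the standard NumberFormat. The class name AmbiguityDemo and the helper parse are invented for this illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class, not part of the original answer.
final class AmbiguityDemo {
    // Parses s with the number format of the given locale.
    static double parse(String s, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(s).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + s, e);
        }
    }

    public static void main(String[] args) {
        // In the US locale ',' is the grouping separator.
        System.out.println(parse("123,456", Locale.US));
        // In the German locale ',' is the decimal separator.
        System.out.println(parse("123,456", Locale.GERMANY));
    }
}
```

The same characters yield 123456 in one locale and 123.456 in the other, which is why the locale has to come from the user rather than from the string.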
Just strip commas, or spaces, or periods, depending on locale specified by user. Default to stripping spaces and commas. If you want to make it stricter, only strip commas OR spaces, not both, and only before the period if there is one. Also should be pretty easy to check manually if they are spaced properly in threes. In fact a custom parser might be easiest here.
Here is a bit of a proof of concept. It's a bit (very) messy but I reckon it works, and you get the idea anyways :).
public class StrictNumberParser {
public double parse(String numberString) throws NumberFormatException {
numberString = numberString.trim();
char[] numberChars = numberString.toCharArray();
Character separator = null;
int separatorCount = 0;
boolean noMoreSeparators = false;
for (int index = 1; index < numberChars.length; index++) {
char character = numberChars[index];
if (noMoreSeparators || separatorCount < 3) {
if (character == '.') {
if (separator != null) {
throw new NumberFormatException();
} else {
noMoreSeparators = true;
}
} else if (separator == null && (character == ',' || character == ' ')) {
if (noMoreSeparators) {
throw new NumberFormatException();
}
separator = Character.valueOf(character); // new Character(char) is deprecated
separatorCount = -1;
} else if (!Character.isDigit(character)) {
throw new NumberFormatException();
}
separatorCount++;
} else {
if (character == '.') {
noMoreSeparators = true;
} else if (separator == null) {
if (Character.isDigit(character)) {
noMoreSeparators = true;
} else if (character == ',' || character == ' ') {
separator = Character.valueOf(character); // new Character(char) is deprecated
} else {
throw new NumberFormatException();
}
} else if (!separator.equals(character)) {
throw new NumberFormatException();
}
separatorCount = 0;
}
}
if (separator != null) {
if (!noMoreSeparators && separatorCount != 3) {
throw new NumberFormatException();
}
numberString = numberString.replaceAll(separator.toString(), "");
}
return Double.parseDouble(numberString);
}
public void testParse(String testString) {
try {
System.out.println("result: " + parse(testString));
} catch (NumberFormatException e) {
System.out.println("Couldn't parse number!");
}
}
public static void main(String[] args) {
StrictNumberParser p = new StrictNumberParser();
p.testParse("123 45.6");
p.testParse("123 4567.8");
p.testParse("123 4567");
p.testParse("12 45");
p.testParse("123 456 45");
p.testParse("345.562,346");
p.testParse("123 456,789");
p.testParse("123,456,789");
p.testParse("123 456 789.52");
p.testParse("23,456,789");
p.testParse("3,456,789");
p.testParse("123 456.12");
p.testParse("1234567.8");
}
}
EDIT: obviously this would need to be extended to recognise scientific notation, but that should be simple enough, especially as you don't have to validate anything after the e; you can just let parseDouble fail if it is badly formed.
It might also be a good idea to properly extend NumberFormat with this: have a getSeparator() for parsed numbers and a setSeparator() for the desired output format. This sort of takes care of localisation, but again more work would be needed to support ',' as the decimal separator.

Not sure if it meets all your requirements, but the code found here might point you in the right direction?
From the article:
To summarize, the steps for proper input processing are:
Get an appropriate NumberFormat and define a ParsePosition variable.
Set the ParsePosition index to zero.
Parse the input value with parse(String source, ParsePosition parsePosition).
Perform error operations if the input length and ParsePosition index value don't match or if the parsed Number is null.
Otherwise, the value passed validation.
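The steps above can be sketched roughly as follows. The class name StrictParse and the use of OptionalDouble are choices made for this illustration, not something the article prescribes.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;
import java.util.OptionalDouble;

// Hypothetical helper illustrating the ParsePosition check.
final class StrictParse {
    static OptionalDouble parse(String input, Locale locale) {
        // Step 1: get a NumberFormat and a ParsePosition starting at index zero.
        NumberFormat format = NumberFormat.getInstance(locale);
        ParsePosition position = new ParsePosition(0);
        // Step 2: parse; on failure this returns null instead of throwing.
        Number number = format.parse(input, position);
        // Step 3: reject the input unless every character was consumed.
        if (number == null || position.getIndex() != input.length()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(number.doubleValue());
    }
}
```

This rejects "1A" and "127.0.0.1" because parsing stops before the end of the string. As far as I know, NumberFormat does not check that grouping separators are placed every three digits, so a stricter grouping rule would still need a separate check.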

This is an interesting problem. But perhaps it is a little open-ended? Are you looking specifically to identify base-10 numbers, or hex, or what? I'm assuming base-10. What about currency? Is that important? Or is it just numbers?
In any case, I think that you can use the deficiencies of NumberFormat to your advantage. Since you know that something like "1A" will be interpreted as 1, why not check the result by formatting it and comparing it against the original string?
public static boolean isNumber(String s){
try{
Locale l = Locale.getDefault();
DecimalFormat df = new DecimalFormat("###.##;-##.##");
Number n = df.parse(s);
String sb = df.format(n);
return sb.equals(s);
}
catch(Exception e){
return false;
}
}
What do you think?

This is really interesting, and I think people are trying to overcomplicate it. I would really just break this down by rules:
1) Check for scientific notation (does it match the pattern of being all numbers, commas, periods, -/+ and having an 'e' in it?) -- if so, parse however you want
2) Does it match the regexp for valid numeric characters (0-9 , . - +) (only 1 . - or + allowed)
if so, strip out everything that's not a digit and parse appropriately, otherwise fail.
I can't see a shortcut that's going to work here, just take the brute force approach, not everything in programming can be (or needs to be) completely elegant.
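A minimal sketch of this rule-based approach, assuming US-style input where ',' groups thousands and '.' is the decimal point. The class name, the two patterns, and the choice to return null on failure are all assumptions made for illustration. Unlike rule 2 as literally stated, the sketch strips only the grouping commas so that the decimal point and sign survive.

```java
import java.util.regex.Pattern;

// Hypothetical rule-based parser; the patterns encode US-style conventions only.
final class RuleBasedParser {
    // Rule 1: scientific notation such as 6.02e-23.
    private static final Pattern SCIENTIFIC =
            Pattern.compile("[+-]?\\d+(\\.\\d+)?[eE][+-]?\\d+");
    // Rule 2: plain digits, or groups of three separated by commas,
    // with an optional fractional part.
    private static final Pattern PLAIN =
            Pattern.compile("[+-]?(\\d+|\\d{1,3}(,\\d{3})+)(\\.\\d+)?");

    // Returns the parsed value, or null if neither rule matches.
    static Double parseOrNull(String s) {
        String text = s.trim();
        if (SCIENTIFIC.matcher(text).matches()) {
            return Double.valueOf(text);
        }
        if (PLAIN.matcher(text).matches()) {
            return Double.valueOf(text.replace(",", ""));
        }
        return null;
    }
}
```

Inputs such as "1A", "127.0.0.1" and "12,34" match neither pattern and are rejected, while "1,234" and "6.02e-23" are accepted.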

My understanding is that you want to cover Western/Latin languages while retaining as much strict interpretation as possible. So what I'm doing here is asking DecimalFormatSymbols to tell me what the grouping, decimal, negative, and zero separators are, and swapping them out for symbols Double will recognize.
How does it perform?
In the US, it rejects: "1A", "127.100.100.100"
and accepts "1.47E-9"
In Germany it still rejects "1A"
It ACCEPTS "1,024.00" but interprets it correctly as 1.024. Likewise, it accepts "127.100.100.100" as 127100100100.0
In fact, the German locale correctly identifies and parses "1,47E-9"
Let me know if you have any trouble in a different locale.
import java.util.Locale;
import java.text.DecimalFormatSymbols;
public class StrictNumberFormat {
public static boolean isDouble(String s, Locale l) {
String clean = convertLocaleCharacters(s,l);
try {
Double.valueOf(clean);
return true;
} catch (NumberFormatException nfe) {
return false;
}
}
public static double doubleValue(String s, Locale l) {
return Double.valueOf(convertLocaleCharacters(s,l));
}
public static boolean isDouble(String s) {
return isDouble(s,Locale.getDefault());
}
public static double doubleValue(String s) {
return doubleValue(s,Locale.getDefault());
}
private static String convertLocaleCharacters(String number, Locale l) {
DecimalFormatSymbols symbols = new DecimalFormatSymbols(l);
String grouping = getUnicodeRepresentation( symbols.getGroupingSeparator() );
String decimal = getUnicodeRepresentation( symbols.getDecimalSeparator() );
String negative = getUnicodeRepresentation( symbols.getMinusSign() );
String zero = getUnicodeRepresentation( symbols.getZeroDigit() );
String clean = number.replaceAll(grouping, "");
clean = clean.replaceAll(decimal, ".");
clean = clean.replaceAll(negative, "-");
clean = clean.replaceAll(zero, "0");
return clean;
}
private static String getUnicodeRepresentation(char ch) {
String unicodeString = Integer.toHexString(ch); //ch implicitly promoted to int
while(unicodeString.length()<4) unicodeString = "0"+unicodeString;
return "\\u"+unicodeString;
}
}

You're best off doing it manually. Figure out what you can accept as a number and disregard everything else:
import java.lang.NumberFormatException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class ParseDouble {
public static void main(String[] argv) {
String line = "$$$|%|#|1A|127.0.0.1|1,344|95|99.64";
for (String s : line.split("\\|")) {
try {
System.out.println("parsed: " +
any2double(s)
);
}catch (NumberFormatException ne) {
System.out.println(ne.getMessage());
}
}
}
public static double any2double(String input) throws NumberFormatException {
double out =0d;
Pattern special = Pattern.compile("[^a-zA-Z0-9\\.,]+");
Pattern letters = Pattern.compile("[a-zA-Z]+");
Pattern comma = Pattern.compile(",");
Pattern allDigits = Pattern.compile("^[0-9]+$");
Pattern singleDouble = Pattern.compile("^[0-9]+\\.[0-9]+$");
Matcher[] goodCases = new Matcher[]{
allDigits.matcher(input),
singleDouble.matcher(input)
};
Matcher[] nanCases = new Matcher[]{
special.matcher(input),
letters.matcher(input)
};
// maybe cases
if (comma.matcher(input).find()){
out = Double.parseDouble(
comma.matcher(input).replaceFirst("."));
return out;
}
for (Matcher m : nanCases) {
if (m.find()) {
throw new NumberFormatException("Bad input "+input);
}
}
for (Matcher m : goodCases) {
if (m.find()) {
try {
out = Double.parseDouble(input);
return out;
} catch (NumberFormatException ne){
System.out.println(ne.getMessage());
}
}
}
throw new NumberFormatException("Could not parse "+input);
}
}

If you set your locale correctly, a locale-aware parser such as NumberFormat will work with commas (Double.parseDouble itself ignores the locale). Example is here.

I think you've got a multi step process to handle here with a custom solution, if you're not willing to accept the results of DecimalFormat or the answers already linked.
1) Identify the decimal and grouping separators. You might need to identify other format symbols (such as scientific notation indicators).
http://download.oracle.com/javase/1.4.2/docs/api/java/text/DecimalFormat.html#getDecimalFormatSymbols()
2) Strip out all grouping symbols (or craft a regex, be careful of other symbols you accept such as the decimal if you do). Then strip out the first decimal symbol. Other symbols as needed.
3) Call parse or isNumber.

One of the easy hacks would be to use replaceFirst for String you get and check the new String whether it is a double or not. In case it's a double - convert back (if needed)

If you want to convert a string holding a comma-separated decimal number to a double, you could use DecimalFormat + DecimalFormatSymbols:
final double strToDouble(String str, char separator){
DecimalFormatSymbols s = new DecimalFormatSymbols();
s.setDecimalSeparator(separator);
DecimalFormat df = new DecimalFormat();
double num = 0;
df.setDecimalFormatSymbols(s);
try{
num = ((Double) df.parse(str)).doubleValue();
}catch(ClassCastException | ParseException ex){
// if you want, you could add something here to
// indicate the string is not double
}
return num;
}
Well, let's test it:
String a = "1.2";
String b = "2,3";
String c = "A1";
String d = "127.0.0.1";
System.out.println("\"" + a + "\" = " + strToDouble(a, ','));
System.out.println("\"" + a + "\" (with '.' as separator) = "
+ strToDouble(a, '.'));
System.out.println("\"" + b + "\" = " + strToDouble(b, ','));
System.out.println("\"" + c + "\" = " + strToDouble(c, ','));
System.out.println("\"" + d + "\" = " + strToDouble(d, ','));
If you run the above code, you'll see:
"1.2" = 0.0
"1.2" (with '.' as separator) = 1.2
"2,3" = 2.3
"A1" = 0.0
"127.0.0.1" = 0.0

This will take a string, count its decimals and commas, remove commas, conserve a valid decimal (note that this is based on US standardization - in order to handle 1.000.000,00 as 1 million this process would have to have the decimal and comma handling switched), determine if the structure is valid, and then return a double. Returns null if the string could not be converted. Edit: Added support for international or US. convertStoD(string,true) for US, convertStoD(string,false) for non US. Comments are now for US version.
public Double convertStoD(String s, boolean isUS) {
    // Returns null when the string is not a valid number.
    if (s == null || s.isEmpty()) return null;
    boolean isNegative = false;
    if (s.charAt(0) == '-') {
        s = s.substring(1);
        isNegative = true;
    }
    // First character is the grouping separator, second is the decimal separator.
    String validNumberArguments = isUS ? ",." : ".,";
    char groupChar = validNumberArguments.charAt(0);   // ',' for US
    char decimalChar = validNumberArguments.charAt(1); // '.' for US
    int length = s.length();
    int currentCommas = 0;
    int currentDecimals = 0;
    for (int i = 0; i < length; i++) {
        char c = s.charAt(i);
        if (c == groupChar) {
            currentCommas++;
            continue;
        }
        if (c == decimalChar) {
            currentDecimals++;
            continue;
        }
        if (!Character.isDigit(c)) return null; // remove 1 A
    }
    if (currentDecimals > 1) return null; // remove 1.00.00
    String decimalValue = "";
    if (currentDecimals > 0) {
        int index = s.indexOf(decimalChar);
        String fraction = s.substring(index + 1);
        if (fraction.indexOf(groupChar) != -1) return null; // remove 1.00,000
        // Double.parseDouble only understands '.', so normalise the separator.
        decimalValue = "." + fraction;
        s = s.substring(0, index);
    }
    int allowedCommas = (s.length() - 1) / 3;
    if (currentCommas > allowedCommas) return null; // remove 10,00,000
    // Quote the separator because split() takes a regex and '.' is a metacharacter.
    String[] numberParser = s.split(java.util.regex.Pattern.quote(String.valueOf(groupChar)), -1);
    length = numberParser.length;
    StringBuilder returnString = new StringBuilder();
    for (int i = 0; i < length; i++) {
        if (i == 0) {
            // remove 1234,0,000 and a leading separator such as ,000
            if (length > 1 && (numberParser[i].isEmpty() || numberParser[i].length() > 3)) return null;
            returnString.append(numberParser[i]);
            continue;
        }
        if (numberParser[i].length() != 3) return null; // ensure proper 1,000,000
        returnString.append(numberParser[i]);
    }
    returnString.append(decimalValue);
    double answer;
    try {
        answer = Double.parseDouble(returnString.toString());
    } catch (NumberFormatException e) {
        return null; // e.g. nothing but a sign or a separator
    }
    if (isNegative) answer *= -1;
    return answer;
}

This code should handle most inputs, except IP addresses where all groups of digits are in threes (e.g. 255.255.255.255 is accepted as valid, but 255.1.255.255 is not). It also doesn't support scientific notation.
It will work with most variants of separators (",", "." or space). If more than one separator is detected, the first is assumed to be the thousands separator, with additional checks (validity etc.).
Edit: prevDigits is used for checking that the number uses thousand separators correctly. If there is more than one group of thousands, all but the first one must be in groups of 3. I modified the code to make it clearer so that "3" is not a magic number but a constant.
Edit 2: I don't mind the down votes much, but can someone explain what the problem is?
/* A number using thousand separator must have
groups of 3 digits, except the first one.
Numbers following the decimal separator can
of course be unlimited. */
private final static int GROUP_SIZE=3;
public static boolean isNumber(String input) {
boolean inThousandSep = false;
boolean inDecimalSep = false;
boolean endsWithDigit = false;
char thousandSep = '\0';
int prevDigits = 0;
for(int i=0; i < input.length(); i++) {
char c = input.charAt(i);
switch(c) {
case ',':
case '.':
case ' ':
endsWithDigit = false;
if(inDecimalSep)
return false;
else if(inThousandSep) {
if(c != thousandSep)
inDecimalSep = true;
if(prevDigits != GROUP_SIZE)
return false; // Invalid use of separator
}
else {
if(prevDigits > GROUP_SIZE || prevDigits == 0)
return false;
thousandSep = c;
inThousandSep = true;
}
prevDigits = 0;
break;
default:
if(Character.isDigit(c)) {
prevDigits++;
endsWithDigit = true;
}
else {
return false;
}
}
}
return endsWithDigit;
}
Test code:
public static void main(String[] args) {
System.out.println(isNumber("100")); // true
System.out.println(isNumber("100.00")); // true
System.out.println(isNumber("1,5")); // true
System.out.println(isNumber("1,000,000.00.")); // false
System.out.println(isNumber("100,00,2")); // false
System.out.println(isNumber("123.123.23.123")); // false
System.out.println(isNumber("123.123.123.123")); // true
}

Related

Convert String to Number, Java

Edit: Clarification convert any valid number encoding from a string to a number
How does one convert a string to a number, say just for integers, for all accepted integer formats, particularly the ones that throw NumberFormatException under Integer.parseInt. For example, the code
...
int i = 0xff;
System.out.println(i);
String s = "0xff";
System.out.println( Integer.parseInt(s) );
....
Will throw a NumberFormatException on the fourth line, even though the string is clearly a valid encoding for a hexadecimal integer. We can assume that we already know that the encoding is a valid number, say by checking it against a regex. It would be nice to also check for overflow (like Integer.parseInt does), but it would be okay if that has to be done as a separate step.
I could loop through every digit and manually calculate the composite, but that would be pretty difficult. Is there a better way?
EDIT: a lot of people are answering this for hexadecimal, which is great, but not completely what I was asking (it's my fault, I used hexadecimal as the example). I'm wondering if there's a way to decode all valid Java numbers. Long.decode is definitely great for just catching hex, but it fails on
222222L
which is a perfectly valid long. Do I have to catch for every different number format separately? I'm assuming you've used a regex to tell what category of number it is, i.e, distinguish floats, integers, etc.
You could do
System.out.println(Integer.decode(s));
You need to specify the base of the number you are trying to parse:
Integer.parseInt(s,16);
This will fail if you have that "0x" starting it off so you could just add a check:
if (s.startsWith("0x")) {
s = s.substring(2);
}
Integer.parseInt(s,16);
EDIT
In response to the information that this was not a hex-specific question, I would recommend writing your own method to parse all the number formats you like and building it on top of Integer.decode, which can save you from having to handle a couple of cases.
I would say use regex or create your own methods to validate other formats:
public static int manualDecode(String s) throws NumberFormatException {
// Match against #####L long format
Pattern p = Pattern.compile("\\d+L"); // Matches ########L
Matcher m = p.matcher(s);
if (m.matches()) {
return Integer.decode(s.substring(0,s.length()-1));
}
// Match against that weird underscore format
p = Pattern.compile("(\\d{1,3})_((\\d{3})_)*?(\\d{3})"); // Matches ###_###_###_###_###
m = p.matcher(s);
if (m.matches()) {
String reformattedString = "";
char c;
for (int i = 0; i < s.length(); i++) {
c = s.charAt(i);
if ( c >= '0' && c <= '9') {
reformattedString += c;
}
}
return Integer.decode(reformattedString);
}
// Add as many more as you wish
throw new NumberFormatException();
}
public int parseIntExtended(String s) {
try {
return Integer.decode(s);
} catch (NumberFormatException e) {
return manualDecode(s);
}
}
Integer.decode should do the trick:
public class a{
public static void main(String[] args){
String s="0xff";
System.out.println(Integer.decode(s));
}
}
You can also try using BigInteger, but you still have to remove the 0x prefix first:
int val = new BigInteger("ff", 16).intValue(); // output 255
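A small sketch of the BigInteger route with the prefix handling and an overflow check added. The class name HexDecode and the method parseHex are hypothetical, and intValueExact requires Java 8 or later.

```java
import java.math.BigInteger;

// Hypothetical helper built around the BigInteger suggestion.
final class HexDecode {
    // Parses a hexadecimal string with an optional 0x/0X prefix.
    // Throws NumberFormatException for bad digits and
    // ArithmeticException if the value does not fit in an int.
    static int parseHex(String s) {
        boolean hasPrefix = s.startsWith("0x") || s.startsWith("0X");
        String digits = hasPrefix ? s.substring(2) : s;
        return new BigInteger(digits, 16).intValueExact();
    }
}
```

Note that this treats the digits as an unsigned magnitude, so anything above Integer.MAX_VALUE is rejected instead of wrapping around.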

Java codingbat help - withoutString

I'm using codingbat.com to get some java practice in. One of the String problems, 'withoutString' is as follows:
Given two strings, base and remove, return a version of the base string where all instances of the remove string have been removed (not case sensitive).
You may assume that the remove string is length 1 or more. Remove only non-overlapping instances, so with "xxx" removing "xx" leaves "x".
This problem can be found at: http://codingbat.com/prob/p192570
As you can see from the dropbox-linked screenshot below, all of the runs pass except for three and a final one called "other tests." The thing is, even though they are marked as incorrect, my output matches exactly the expected output for the correct answer.
Here's a screenshot of my output:
And here's the code I'm using:
public String withoutString(String base, String remove) {
String result = "";
int i = 0;
for(; i < base.length()-remove.length();){
if(!(base.substring(i,i+remove.length()).equalsIgnoreCase(remove))){
result = result + base.substring(i,i+1);
i++;
}
else{
i = i + remove.length();
}
if(result.startsWith(" ")) result = result.substring(1);
if(result.endsWith(" ") && base.substring(i,i+1).equals(" ")) result = result.substring(0,result.length()-1);
}
if(base.length()-i <= remove.length() && !(base.substring(i).equalsIgnoreCase(remove))){
result = result + base.substring(i);
}
return result;
}
Your solution IS failing AND there is a display bug in coding bat.
The correct output should be:
withoutString("This is a FISH", "IS") -> "Th  a FH"
Yours is:
withoutString("This is a FISH", "IS") -> "Th a FH"
Yours fails because it is removing spaces, but also, coding bat does not display the correct expected and run output string due to HTML removing extra spaces.
This recursive solution passes all tests:
public String withoutString(String base, String remove) {
int remIdx = base.toLowerCase().indexOf(remove.toLowerCase());
if (remIdx == -1)
return base;
return base.substring(0, remIdx ) +
withoutString(base.substring(remIdx + remove.length()) , remove);
}
Here is an example of an iterative solution. It has more code than the recursive solution but avoids the recursive function calls.
public String withoutString(String base, String remove) {
int remIdx = 0;
int remLen = remove.length();
remove = remove.toLowerCase();
while (true) {
remIdx = base.toLowerCase().indexOf(remove);
if (remIdx == -1)
break;
base = base.substring(0, remIdx) + base.substring(remIdx + remLen);
}
return base;
}
I just ran your code in an IDE. It compiles correctly and matches all tests shown on codingbat. There must be some bug with codingbat's test cases.
If you are curious, this problem can be solved with a single line of code:
public String withoutString(String base, String remove) {
return base.replaceAll("(?i)" + remove, ""); //String#replaceAll(String, String) with case insensitive regex.
}
Regex explanation:
The first argument taken by String#replaceAll(String, String) is what is known as a Regular Expression or "regex" for short.
Regex is a powerful tool to perform pattern matching within Strings. In this case, the regular expression being used is (assuming that remove is equal to IS):
(?i)IS
This particular expression has two parts: (?i) and IS.
IS matches the string "IS" exactly, nothing more, nothing less.
(?i) is simply a flag to tell the regex engine to ignore case.
With (?i)IS, all of: IS, Is, iS and is will be matched.
As an addition, this is (almost) equivalent to the regular expressions: (IS|Is|iS|is), (I|i)(S|s) and [Ii][Ss].
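One caveat with the one-liner: remove is inserted into the pattern as-is, so characters like '.' or '*' would be interpreted as regex syntax. A hedged variant that quotes the literal first could look like this (the wrapper class name is made up for illustration):

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only withoutString matters here.
final class WithoutString {
    // Removes every case-insensitive occurrence of remove from base.
    // Pattern.quote makes regex metacharacters in remove match literally.
    static String withoutString(String base, String remove) {
        return base.replaceAll("(?i)" + Pattern.quote(remove), "");
    }
}
```

For example, removing "." from "a.b.c" leaves "abc", whereas the unquoted pattern would treat "." as "any character" and remove everything.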
EDIT
Turns out that your output is not correct and is failing as expected. See: dansalmo's answer.
public String withoutString(String base, String remove) {
String temp = base.replaceAll(remove, "");
String temp2 = temp.replaceAll(remove.toLowerCase(), "");
return temp2.replaceAll(remove.toUpperCase(), "");
}
Please find below my solution
public String withoutString(String base, String remove) {
final int rLen=remove.length();
final int bLen=base.length();
String op="";
for(int i = 0; i < bLen;)
{
if(!(i + rLen > bLen) && base.substring(i, i + rLen).equalsIgnoreCase(remove))
{
i +=rLen;
continue;
}
op += base.substring(i, i + 1);
i++;
}
return op;
}
Sometimes things go really weird on CodingBat; this is just one of them.
I am adding to a previous solution, but using a StringBuilder for better practice. Most credit goes to Anirudh.
public String withoutString(String base, String remove) {
//create a constant integer the size of remove.length();
final int rLen=remove.length();
//create a constant integer the size of base.length();
final int bLen=base.length();
//Create an empty StringBuilder.
StringBuilder op = new StringBuilder();
//Create the for loop.
for(int i = 0; i < bLen;)
{
//if the remove string still fits in what is left of the base string
// and the base substring equals the remove string (ignoring case).
if(!(i + rLen > bLen) && base.substring(i, i + rLen).equalsIgnoreCase(remove))
{
//Increment by the remove length, and skip adding it to the string.
i +=rLen;
continue;
}
//else, we add the character at i to the string builder.
op.append(base.charAt(i));
//and increment by one.
i++;
}
//We return the string.
return op.toString();
}
Taylor's solution is the most efficient one, however I have another solution that is a naive one and it works.
public String withoutString(String base, String remove) {
String returnString = base;
while(returnString.toLowerCase().indexOf(remove.toLowerCase())!=-1){
int start = returnString.toLowerCase().indexOf(remove.toLowerCase());
int end = remove.length();
returnString = returnString.substring(0, start) + returnString.substring(start+end);
}
return returnString;
}
@Daemon
Your code works. Thanks for the regex explanation. Though dansalmo pointed out that codingbat is displaying the intended output incorrectly, I threw in some extra lines to your code to unnecessarily account for the double spaces with the following:
public String withoutString(String base, String remove){
String result = base.replaceAll("(?i)" + remove, "");
for(int i = 0; i < result.length()-1;){
if(result.substring(i,i+2).equals("  ")){
result = result.replace(result.substring(i,i+2), " ");
}
else i++;
}
if(result.startsWith(" ")) result = result.substring(1);
return result;
}
public String withoutString(String base, String remove){
return base.replace(remove,"");
}

Parsing comma-separated values enclosed with quotes

I'm trying to parse comma separated values that are enclosed in quotes using only standard Java libraries (I know this must be possible)
As an example file.txt contains a new line for each row of
"Foo","Bar","04042013","04102013","Stuff"
"Foo2","Bar2","04042013","04102013","Stuff2"
However when I parse the file with the code I've written so far:
import java.io.*;
import java.util.Arrays;
public class ReadCSV{
public static void main(String[] arg) throws Exception {
BufferedReader myFile = new BufferedReader(new FileReader("file.txt"));
String myRow = myFile.readLine();
while (myRow != null){
//split by comma separated quote enclosed values
//BUG - first and last values get an extra quote
String[] myArray = myRow.split("\",\""); //the problem
for (String item:myArray) { System.out.print(item + "\t"); }
System.out.println();
myRow = myFile.readLine();
}
myFile.close();
}
}
However the output is
"Foo Bar 04042013 04102013 Stuff"
"Foo2 Bar2 04042013 04102013 Stuff2"
Instead of
Foo Bar 04042013 04102013 Stuff
Foo2 Bar2 04042013 04102013 Stuff2
I know I went wrong on the Split but I'm not sure how to fix it.
Before doing split, just remove first double quote and last double quote in myRow variable using below line.
myRow = myRow.substring(1, myRow.length() - 1);
(UPDATE) Also check if myRow is not empty. Otherwise above code will cause exception. For example below code checks if myRow is not empty and then only removes double quotes from the string.
if (!myRow.isEmpty()) {
myRow = myRow.substring(1, myRow.length() - 1);
}
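Put together, the strip-then-split idea could look roughly like the sketch below. The class name QuotedRow and the decision to return an empty array for rows that are not wrapped in quotes are assumptions made for illustration, and the sketch does not handle escaped quotes inside a value.

```java
// Hypothetical helper combining the quote stripping with the original split.
final class QuotedRow {
    // Splits a row like "Foo","Bar","Stuff" into Foo, Bar, Stuff.
    static String[] split(String row) {
        // Assumption: a row that is not wrapped in quotes is treated as empty.
        if (row.length() < 2 || !row.startsWith("\"") || !row.endsWith("\"")) {
            return new String[0];
        }
        // Drop the outer quotes, then split on the quote-comma-quote sequence.
        String inner = row.substring(1, row.length() - 1);
        return inner.split("\",\"", -1);
    }
}
```

The length check also covers the case of a row consisting of a single quote character, where substring(1, length - 1) would otherwise throw.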
I think you will probably have to go for a stateful approach, basically like the code below (another state would be necessary if you want to allow escaping of quotes within a value):
import java.util.ArrayList;
import java.util.List;
public class CSV {
public static void main(String[] args) {
String s = "\"hello, i am\",\"a string\"";
String x = s;
List<String> l = new ArrayList<String>();
int state = 0;
while(x.length()>0) {
if(state == 0) {
if(x.indexOf("\"")>-1) {
x = x.substring(x.indexOf("\"")+1).trim();
state = 1;
} else {
break;
}
} else if(state == 1) {
if(x.indexOf("\"")>-1) {
String found = x.substring(0,x.indexOf("\""));
System.err.println("found: "+found);
l.add(found);
x = x.substring(x.indexOf("\"")+1).trim();
state = 0;
} else {
throw new RuntimeException("bad format");
}
} else if(state == 2) {
if(x.indexOf(",")>-1) {
x = x.substring(x.indexOf(",")+1).trim();
state = 0;
} else {
break;
}
}
}
for(String f : l) {
System.err.println(f);
}
}
}
Instead, you can use replaceAll, which, for me, looks more suitable for this task:
myRow = myRow.replaceAll("\"", "").replaceAll(","," ");
This will replace all the " with nothing (Will remove them), then it'll replace all , with space (You can increase the number of spaces of course).
The problem in the code snippet above is that you are splitting the String on "," (quote, comma, quote).
At the start of the line ("Foo",") and at the end (","Stuff") the leading and trailing quotes are not part of a "," sequence, so they are not split off.
So this is definitely not a bug in Java; in your case you need to handle that part yourself.
You have multiple options. Some of them are listed below.
1. If you are sure there will always be a starting " and an ending ", you can remove them from the String before splitting.
2. If the starting and ending " are optional, you can first check for them with startsWith and endsWith and remove them, if they exist, before splitting.
You can simply split the String on the comma and then delete the first and last '"' of each piece.
Hope that's helpful.
String s = "\"Foo\",\"Bar\",\"04042013\",\"04102013\",\"Stuff\"";
String[] bufferArray = new String[10];
String bufferString;
int i = 0;
System.out.println(s);
Scanner scanner = new Scanner(s);
scanner.useDelimiter(",");
while(scanner.hasNext()) {
bufferString = scanner.next();
bufferArray[i] = bufferString.subSequence(1, bufferString.length() - 1).toString();
i++;
}
System.out.println(bufferArray[0]);
System.out.println(bufferArray[1]);
System.out.println(bufferArray[2]);
This solution is less elegant than a String.split() one-liner. The advantage is that we avoid fragile string manipulation, i.e. the use of String.substring(). The string must end with ," however.
This version handles spaces between delimiters. Delimiter characters within quotes are ignored as expected, as are escaped quotes (for example \").
String s = "\"F\\\",\\\"oo\" , \"B,ar\",\"04042013\",\"04102013\",\"St,u\\\"ff\"";
Pattern p = Pattern.compile("(.*?)\"\\s*,\\s*\"");
Matcher m = p.matcher(s + ",\""); // String must end with ,"
while (m.find()) {
String result = m.group(1);
System.out.println(result);
}

regular expression for \" in java

I need to write a regular expression for string read from a file
apple,boy,cat,"dog,cat","time\" after\"noon"
I need to split it into
apple
boy
cat
dog,cat
time"after"noon
I tried using
Pattern pattern =
Pattern.compile("[\\\"]");
String items[]=pattern.split(match);
for the second part, but I could not get the right answer. Can you help me with this?
Since your question is more of a parsing problem than a regex problem, here's another solution that will work:
public class CsvReader {
Reader r;
int row, col;
boolean endOfRow;
public CsvReader(Reader r){
this.r = r instanceof BufferedReader ? r : new BufferedReader(r);
this.row = -1;
this.col = 0;
this.endOfRow = true;
}
/**
* Returns the next string in the input stream, or null when no input is left.
* @return the next field, or null at end of input
* @throws IOException if reading from the underlying reader fails
*/
public String next() throws IOException {
int i = r.read();
if(i == -1)
return null;
if(this.endOfRow){
this.row++;
this.col = 0;
this.endOfRow = false;
} else {
this.col++;
}
StringBuilder b = new StringBuilder();
outerLoop:
while(true){
char c = (char) i;
if(i == -1)
break;
if(c == ','){
break;
} else if(c == '\n'){
endOfRow = true;
break;
} else if(c == '\\'){
i = r.read();
if(i == -1){
break;
} else {
b.append((char)i);
}
} else if(c == '"'){
while(true){
i = r.read();
if(i == -1){
break outerLoop;
}
c = (char)i;
if(c == '\\'){
i = r.read();
if(i == -1){
break outerLoop;
} else {
b.append((char)i);
}
} else if(c == '"'){
r.mark(2);
i = r.read();
if(i == '"'){
b.append('"');
} else {
r.reset();
break;
}
} else {
b.append(c);
}
}
} else {
b.append(c);
}
i = r.read();
}
return b.toString().trim();
}
public int getColNum(){
return col;
}
public int getRowNum(){
return row;
}
public static void main(String[] args){
try {
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"\nquick\"fix\" hello, \"\"\"who's there?\"";
System.out.println(input);
Reader r = new StringReader(input);
CsvReader csv = new CsvReader(r);
String s;
while((s = csv.next()) != null){
System.out.println("R" + csv.getRowNum() + "C" + csv.getColNum() + ": " + s);
}
} catch(IOException e){
e.printStackTrace();
}
}
}
Running this code, I get the output:
R0C0: apple
R0C1: boy
R0C2: cat
R0C3: dog,cat
R0C4: time" after"noon
R1C0: quickfix hello
R1C1: "who's there?
This should fit your needs pretty well.
A few disclaimers, though:
It won't catch errors in the syntax of the CSV format, such as an unescaped quotation mark in the middle of a value.
It won't perform any character conversion (such as converting "\n" to a newline character). Backslashes simply cause the following character to be treated literally, including other backslashes. (That should be easy enough to alter if you need additional functionality)
Some CSV files escape quotes by doubling them rather than using a backslash; this code now looks for both.
Edit: Looked up the csv format, discovered there's no real standard, but updated my code to catch quotes escaped by doubling rather than backslashes.
Edit 2: Fixed. Should work as advertised now. Also modified it to test the tracking of row and column numbers.
First thing: String.split() uses the regex to find the separators, not the substrings.
Edit: I'm not sure if this can be done with String.split(). I think the only way you could deal with the quotes while only matching the comma would be by lookahead and lookbehind, and that's going to break in quite a lot of cases.
Edit2: I'm pretty sure it can be done with a regular expression. And I'm sure this one case could be solved with string.split() -- but a general solution wouldn't be simple.
Basically, you're looking for anything that isn't a comma as input [^,], you can handle quotes as a separate character. I've gotten most of the way there myself. I'm getting this as output:
apple
boy
cat
dog
cat
time\" after\"noon
But I'm not sure why it has so many blank lines.
My complete code is:
String input = "apple,boy,cat,\"dog,cat\",\"time\\\" after\\\"noon\"";
Pattern pattern =
Pattern.compile("(\\s|[^,\"\\\\]|(\\\\.)||(\".*\"))*");
Matcher m = pattern.matcher(input);
while(m.find()){
System.out.println(m.group());
}
But then I guess I'm almost there. It's spitting out ... oh hey, I see what's going on here. I think I can fix that.
But I'm going to echo the guy above and say that if there's no requirement to use a regular expression, it's probably simpler and better to do it one character at a time and implement the logic manually. If your regex isn't picture-perfect, then it could cause all kinds of unpredictable weirdness down the line.
I am not really sure about this but you could have a go at Pattern.compile("[\\\\\"]");
\ is an escape character in both Java string literals and regular expressions, so to match a literal \ in the expression, \\\\ has to be written in the Java source.
A similar thing worked for me in another context and I hope it solves your problem too.

Most elegant way to detect if a String is a number?

Is there a better, more elegant (and/or possibly faster) way than
boolean isNumber = false;
try{
Double.valueOf(myNumber);
isNumber = true;
} catch (NumberFormatException e) {
}
...?
Edit:
Since I can't pick two answers I'm going with the regex one because a) it's elegant and b) saying "Jon Skeet solved the problem" is a tautology because Jon Skeet himself is the solution to all problems.
I don't believe there's anything built into Java to do it faster and still reliably, assuming that later on you'll want to actually parse it with Double.valueOf (or similar).
I'd use Double.parseDouble instead of Double.valueOf to avoid creating a Double unnecessarily, and you can also get rid of blatantly silly numbers quicker than the exception will by checking for digits, e/E, - and . beforehand. So, something like:
public boolean isDouble(String value)
{
boolean seenDot = false;
boolean seenExp = false;
boolean justSeenExp = false;
boolean seenDigit = false;
for (int i=0; i < value.length(); i++)
{
char c = value.charAt(i);
if (c >= '0' && c <= '9')
{
seenDigit = true;
continue;
}
if ((c == '-' || c=='+') && (i == 0 || justSeenExp))
{
continue;
}
if (c == '.' && !seenDot)
{
seenDot = true;
continue;
}
justSeenExp = false;
if ((c == 'e' || c == 'E') && !seenExp)
{
seenExp = true;
justSeenExp = true;
continue;
}
return false;
}
if (!seenDigit)
{
return false;
}
try
{
Double.parseDouble(value);
return true;
}
catch (NumberFormatException e)
{
return false;
}
}
Note that despite taking a couple of tries, this still doesn't cover "NaN" or hex values. Whether you want those to pass or not depends on context.
In my experience regular expressions are slower than the hard-coded check above.
You could use a regex, i.e. something like String.matches("^[\\d\\-\\.]+$"); (if you're not testing for negative numbers or floating point numbers you could simplify a bit).
Not sure whether that would be faster than the method you outlined though.
Edit: in light of all this controversy, I decided to run a test and get some data about how fast each of these methods is. Not so much the correctness, just how quickly they run.
You can read about my results on my blog. (Hint: Jon Skeet FTW).
See java.text.NumberFormat (javadoc).
NumberFormat nf = NumberFormat.getInstance(Locale.FRENCH);
Number myNumber = nf.parse(myString);
int myInt = myNumber.intValue();
double myDouble = myNumber.doubleValue();
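Note that nf.parse(String) stops at the first character it cannot interpret, which is why inputs such as "1A" or "127.0.0.1" are accepted as 1 and 127. A minimal sketch of a stricter check, assuming you want to reject any input that is not consumed completely, uses the parse(String, ParsePosition) overload. The names StrictNumberParse and parseStrict are made up for this example.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

class StrictNumberParse {
    // Hypothetical helper: returns the parsed value, or null when the input
    // is not consumed completely by the locale's number format.
    static Double parseStrict(String text, Locale locale) {
        if (text == null || text.isEmpty()) {
            return null;
        }
        NumberFormat nf = NumberFormat.getInstance(locale);
        ParsePosition pos = new ParsePosition(0);
        // This overload does not throw; it returns null on failure and
        // leaves pos at the index where parsing stopped.
        Number n = nf.parse(text, pos);
        if (n == null || pos.getIndex() != text.length()) {
            return null;
        }
        return n.doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("1,234", Locale.US));     // 1234.0
        System.out.println(parseStrict("1A", Locale.US));        // null
        System.out.println(parseStrict("127.0.0.1", Locale.US)); // null
        System.out.println(parseStrict("1,5", Locale.FRENCH));   // 1.5
    }
}
```

This only checks that the whole string was consumed. The format is still lenient about where grouping separators appear, so add your own checks if that matters.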
The correct regex is actually given in the Double javadocs:
To avoid calling this method on an invalid string and having a NumberFormatException be thrown, the regular expression below can be used to screen the input string:
final String Digits = "(\\p{Digit}+)";
final String HexDigits = "(\\p{XDigit}+)";
// an exponent is 'e' or 'E' followed by an optionally
// signed decimal integer.
final String Exp = "[eE][+-]?"+Digits;
final String fpRegex =
("[\\x00-\\x20]*"+ // Optional leading "whitespace"
"[+-]?(" + // Optional sign character
"NaN|" + // "NaN" string
"Infinity|" + // "Infinity" string
// A decimal floating-point string representing a finite positive
// number without a leading sign has at most five basic pieces:
// Digits . Digits ExponentPart FloatTypeSuffix
//
// Since this method allows integer-only strings as input
// in addition to strings of floating-point literals, the
// two sub-patterns below are simplifications of the grammar
// productions from the Java Language Specification, 2nd
// edition, section 3.10.2.
// Digits ._opt Digits_opt ExponentPart_opt FloatTypeSuffix_opt
"((("+Digits+"(\\.)?("+Digits+"?)("+Exp+")?)|"+
// . Digits ExponentPart_opt FloatTypeSuffix_opt
"(\\.("+Digits+")("+Exp+")?)|"+
// Hexadecimal strings
"((" +
// 0[xX] HexDigits ._opt BinaryExponent FloatTypeSuffix_opt
"(0[xX]" + HexDigits + "(\\.)?)|" +
// 0[xX] HexDigits_opt . HexDigits BinaryExponent FloatTypeSuffix_opt
"(0[xX]" + HexDigits + "?(\\.)" + HexDigits + ")" +
")[pP][+-]?" + Digits + "))" +
"[fFdD]?))" +
"[\\x00-\\x20]*");// Optional trailing "whitespace"
if (Pattern.matches(fpRegex, myString))
Double.valueOf(myString); // Will not throw NumberFormatException
else {
// Perform suitable alternative action
}
This does not allow for localized representations, however:
To interpret localized string representations of a floating-point value, use subclasses of NumberFormat.
Use NumberUtils.isNumber(String) in Apache Commons Lang (NumberUtils.isCreatable(String) in Lang 3.5 and later).
Leveraging off Mr. Skeet:
// Cheap pre-filter: only characters that can appear in a decimal double literal.
private boolean isValidDoubleChar(char c)
{
return "0123456789.+-eE".indexOf(c) >= 0;
}
public boolean isDouble(String value)
{
for (int i=0; i < value.length(); i++)
{
char c = value.charAt(i);
if (isValidDoubleChar(c))
continue;
return false;
}
try
{
Double.parseDouble(value);
return true;
}
catch (NumberFormatException e)
{
return false;
}
}
Most of these answers are somewhat acceptable solutions. All of the regex solutions have the issue of not being correct for all cases you may care about.
If you really want to ensure that the String is a valid number, then I would use your own solution. Don't forget that most of the time, I imagine, the String will be a valid number and won't raise an exception, so the performance will usually be identical to that of Double.valueOf().
I guess this really isn't an answer, except that it validates your initial instinct.
Randy
I would use Jakarta commons-lang, as always! But I have no idea whether their implementation is fast or not. It doesn't rely on exceptions, which might be a good thing performance-wise...
Following Phill's answer can I suggest another regex?
String.matches("^-?\\d+(\\.\\d+)?$");
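As a rough sketch of how that regex could be used in practice, the pattern can be compiled once and reused, since String.matches compiles it again on every call. The class and method names here are made up for the example; the ^ and $ anchors are dropped because Matcher.matches() already requires the whole input to match.

```java
import java.util.regex.Pattern;

class SimpleDecimalRegex {
    // Compiled once and reused across calls.
    private static final Pattern SIMPLE_DECIMAL = Pattern.compile("-?\\d+(\\.\\d+)?");

    // Hypothetical helper: true for plain decimals such as "12" or "-12.5".
    static boolean isSimpleDecimal(String s) {
        return s != null && SIMPLE_DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleDecimal("-12.5")); // true
        System.out.println(isSimpleDecimal("12."));   // false
        System.out.println(isSimpleDecimal("1e5"));   // false
        System.out.println(isSimpleDecimal("1,234")); // false
    }
}
```

This pattern deliberately rejects exponents, a leading +, a bare leading or trailing dot, and grouping separators, so it is stricter than Double.parseDouble.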
I prefer using a loop over the String's char[] representation and using the Character.isDigit() method. If elegance is desired, I think this is the most readable:
package tias;
public class Main {
private static final String NUMERIC = "123456789";
private static final String NOT_NUMERIC = "1L5C";
public static void main(String[] args) {
System.out.println(isStringNumeric(NUMERIC));
System.out.println(isStringNumeric(NOT_NUMERIC));
}
private static boolean isStringNumeric(String aString) {
if (aString == null || aString.length() == 0) {
return false;
}
for (char c : aString.toCharArray() ) {
if (!Character.isDigit(c)) {
return false;
}
}
return true;
}
}
If you want something that's blisteringly fast, and you have a very clear idea of what formats you want to accept, you can build a state machine DFA by hand. This is essentially how regexes work under the hood anyway, but you can avoid the regex compilation step this way, and it may well be faster than a generic regex compiler.
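A hand-built DFA along those lines might look like the sketch below. The accepted grammar is an assumption chosen for illustration (optional sign, ASCII digits with an optional decimal point, optional exponent); it does not cover NaN, Infinity, hex literals, type suffixes or surrounding whitespace, and the names are made up.

```java
class NumberDfa {
    private enum State { START, SIGN, INT, DOT_NO_INT, FRAC, EXP, EXP_SIGN, EXP_DIGITS }

    // Accepts [+-]? (digits ('.' digits?)? | '.' digits) ([eE] [+-]? digits)?
    static boolean isDecimal(String s) {
        if (s == null) {
            return false;
        }
        State state = State.START;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            boolean sign = c == '+' || c == '-';
            boolean exp = c == 'e' || c == 'E';
            switch (state) {
                case START:
                    if (sign) state = State.SIGN;
                    else if (digit) state = State.INT;
                    else if (c == '.') state = State.DOT_NO_INT;
                    else return false;
                    break;
                case SIGN:
                    if (digit) state = State.INT;
                    else if (c == '.') state = State.DOT_NO_INT;
                    else return false;
                    break;
                case INT:
                    if (digit) state = State.INT;
                    else if (c == '.') state = State.FRAC;
                    else if (exp) state = State.EXP;
                    else return false;
                    break;
                case DOT_NO_INT:
                    // A dot with no digits before it needs at least one digit after it.
                    if (digit) state = State.FRAC;
                    else return false;
                    break;
                case FRAC:
                    if (digit) state = State.FRAC;
                    else if (exp) state = State.EXP;
                    else return false;
                    break;
                case EXP:
                    if (sign) state = State.EXP_SIGN;
                    else if (digit) state = State.EXP_DIGITS;
                    else return false;
                    break;
                case EXP_SIGN:
                case EXP_DIGITS:
                    if (digit) state = State.EXP_DIGITS;
                    else return false;
                    break;
            }
        }
        // Only these states correspond to a complete number.
        return state == State.INT || state == State.FRAC || state == State.EXP_DIGITS;
    }

    public static void main(String[] args) {
        System.out.println(isDecimal("6.02e-23"));  // true
        System.out.println(isDecimal("127.0.0.1")); // false
        System.out.println(isDecimal("1A"));        // false
    }
}
```

Each character is inspected exactly once and no objects are allocated, which is where the speed advantage over a general-purpose regex engine would come from.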
