finding standard deviation in csv file - java

I am trying to find the standard deviation (σ = √[Σ(x − MEAN)² ÷ n]) of a single extracted column of a CSV file. The CSV file contains around 45,000 instances and 17 attributes separated by ';'.
To find the standard deviation, every iteration of the while loop needs the MEAN value so it can be subtracted from Xi. So I think MEAN has to be available before the while loop starts, but I don't know how to do this, or whether there is another way. I am getting stuck here. After that I have put code to replace the old Xi with the new Xi and then write (generate) a new CSV file.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;

public class Main {
    public static void main(String[] args) throws IOException {
        String filename = "ly.csv";
        File file = new File(filename);
        BufferedWriter writer = null;
        try {
            writer = new BufferedWriter(new FileWriter("bank-full_updated.csv"));
        } catch (IOException e) {
        }
        try {
            double temp1 = 0;
            Scanner inputStream = new Scanner(file);
            inputStream.next(); // skip the first token (header)
            while (inputStream.hasNext()) {
                String data1 = inputStream.next();
                String[] values = data1.split(";");
                double Xi = Double.parseDouble(values[1]);
                // now finding standard deviation -- this is where I am stuck,
                // because MEAN is not available inside this loop:
                temp1 += (Xi - MEAN);
                // temp2 = (temp1 * temp1);
                // temp3 = (temp2 / count);
                // standardDeviation = Math.sqrt(temp3);
                Xi = standardDeviation * Xi;
                // now replace the old Xi with the new Xi in values[1]
                values[1] = String.valueOf(Xi);
                // iterate through the values and build a string out of them to write a new file
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < values.length; i++) {
                    sb.append(values[i]);
                    if (i < values.length - 1) {
                        sb.append(";");
                    }
                }
                // get the new string
                System.out.println(sb.toString());
                writer.write(sb.toString() + "\n");
            }
            writer.close();
            inputStream.close();
        } catch (FileNotFoundException ex) {
            Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

It is possible to calculate the standard deviation in a single pass. Donald Knuth describes such an algorithm (Welford's method), and Kahan summation can be used to reduce the rounding error further. Here is a paper on the topic: http://researcher.ibm.com/files/us-ytian/stability.pdf
Here is another way, but it suffers from rounding errors:
double std_dev2(double a[], int n) {
    if (n == 0)
        return 0.0;
    double sum = 0;
    double sq_sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += a[i];
        sq_sum += a[i] * a[i];
    }
    double mean = sum / n;
    double variance = sq_sum / n - mean * mean;
    return sqrt(variance);
}
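For reference, here is a minimal Java sketch of the single-pass update mentioned above (Welford's method); the method name is only illustrative:
double stdDevOnePass(double a[]) {
    double mean = 0.0;
    double m2 = 0.0;   // running sum of squared deviations from the current mean
    int n = 0;
    for (double x : a) {
        n++;
        double delta = x - mean;
        mean += delta / n;          // update the running mean
        m2 += delta * (x - mean);   // uses the deltas against the old and the new mean
    }
    return n > 0 ? Math.sqrt(m2 / n) : 0.0;   // population standard deviation
}
Unlike std_dev2 above, this never subtracts two large, nearly equal quantities, so it is much less sensitive to rounding.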

Related

Read binary numbers from file and separate integers from longs/doubles

In this program I find the prime numbers among the first 100 integers. The numbers are written in int format and the total count of them is written in double format. I want to read that file back; I managed it for the int numbers, but I don't know how to do it for the double number.
Here is the code:
package int1;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.IntBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
public class int_upis {
public static void main(String[] args) {
File a = new File("C:\\Users\\Jovan\\Desktop\\Pisem.txt");
FileOutputStream fos = null;
try {
fos = new FileOutputStream(a);
} catch (Exception e) {
}
FileChannel ch = fos.getChannel();
ByteBuffer bff = ByteBuffer.allocate(100);
IntBuffer ibf = bff.asIntBuffer(); // Int type
DoubleBuffer db = bff.asDoubleBuffer(); // Double type
double p = 0;
for (int i = 1; i <= 100; i++) {
int t = 0;
for (int j = 1; j <= i; j++) {
if (i % j == 0) {
t = t + 1;
}
}
if (t < 3) {
p = p + 1; // number of Prime numbers
System.out.println(i);
ibf.put(i);
bff.position(4 * ibf.position());
bff.flip();
try {
ch.write(bff);
bff.clear();
ibf.clear();
} catch (IOException e) {
}
}
}
try {
db.put(p); //At the end of the txt-file i put double format of number (Number of Prime numbers)
bff.position(8*db.position());
bff.flip();
ch.write(bff);
System.out.println("File is writen with: " + ch.size());
ch.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Now I tried to read this file:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class int_ispis {
    public static void main(String[] args) throws IOException {
        File a = new File("C:\\Users\\Jovan\\Desktop\\Pisem.txt");
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(a);
        } catch (Exception e) {
        }
        FileChannel ch = fis.getChannel();
        ByteBuffer bff = ByteBuffer.allocate(6 * 4);
        // One line of prime numbers is put into a 6-element array (the line below):
        int[] niz = new int[6];
        System.out.println("Pre flipa: " + bff.position() + " " + bff.limit());
        System.out.println("++++++++++++++++++++++++++++++++++++");
        while (ch.read(bff) != -1) {
            bff.flip();
            System.out.println();
            System.out.println("Posle flipa: " + bff.position() + " " + bff.limit());
            IntBuffer ib = bff.asIntBuffer();
            System.out.println("IB: " + ib.position() + " " + ib.limit());
            int read = ib.remaining();
            System.out.println(read);
            // When it comes to the end of the file, the double value is read as integers and
            // the wrong numbers are printed. (How do I separate the integers from the double?)
            ib.get(niz, 0, ib.remaining());
            for (int i = 0; i < read; i++) {
                System.out.print(niz[i] + " ");
            }
            System.out.println();
            System.out.println("=================================");
            ib.clear();
            bff.clear();
        }
    }
}
A binary file does not have any "separators".
You need to know the structure of the file content and use this knowledge.
In this program I find the prime numbers among the first 100 integers. The numbers are in int format and the total count of them is in double format.
This means that there is only one double value in the file, and it occupies the last 8 bytes. So you simply have to check whether the current position is fileLength - 8 and then read those last 8 bytes as a double value.
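A minimal sketch of that check, assuming the layout the writer above intends (all the ints first, then one trailing double); the class name is only illustrative, and a single read() is assumed to return the whole prefix for a file this small:
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ReadPrimesAndCount {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fis = new FileInputStream("C:\\Users\\Jovan\\Desktop\\Pisem.txt");
             FileChannel ch = fis.getChannel()) {
            long intBytes = ch.size() - 8;               // everything except the trailing double
            ByteBuffer buf = ByteBuffer.allocate((int) intBytes);
            ch.read(buf);                                // read the int part
            buf.flip();
            while (buf.remaining() >= 4) {
                System.out.println(buf.getInt());        // the prime numbers
            }
            ByteBuffer tail = ByteBuffer.allocate(8);
            ch.read(tail, ch.size() - 8);                // absolute read at fileLength - 8
            tail.flip();
            System.out.println("Count: " + tail.getDouble());
        }
    }
}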

Fast parsing of strings of numbers in java

I have found plenty of different suggestions on how to parse an ASCII file containing double precision numbers into an array of doubles in Java. What I currently use is roughly the following:
InputStream stream = new FileInputStream(fname);
BufferedReader breader = new BufferedReader(new InputStreamReader(stream));
Scanner scanner = new Scanner(breader);
double[] array = new double[size]; // size is known upfront
int idx = 0;
try {
    while (idx < size) {
        array[idx] = scanner.nextDouble();
        idx++;
    }
} catch (Exception e) { /* ... */ }
For an example file with 1 million numbers this code takes roughly 2 seconds. Similar code written in C, using fscanf, takes 0.1 second (!) Clearly I got it all wrong. I guess calling nextDouble() so many times is the wrong way to go because of the overhead, but I cannot figure out a better way.
I am no Java expert and hence I need a little help with this: can you tell me how to improve this code?
Edit: The corresponding C code follows:
FILE *fd = fopen(fname, "r+");
double *vals = calloc(size, sizeof(double));
int idx = 0, nel;
do {
    nel = fscanf(fd, "%lf", vals + idx);
    idx++;
} while (nel != -1);
(Summarizing some of the things that I already mentioned in the comments:)
You should be careful with manual benchmarks. The answer to the question How do I write a correct micro-benchmark in Java? points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: When the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disc. This involves the usual side effects of caching.
However, there obviously is an overhead beyond the pure file IO.
You should be aware that the Scanner class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions and hides a tremendous complexity from the user - a complexity that is not necessary at all when your intention is to only read double values!
There are solutions with less overhead.
Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as
double result[] =
Files.lines(Paths.get(fileName))
.mapToDouble(Double::parseDouble)
.toArray();
and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comment), then this could be extended:
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
So regarding the general question of how to efficiently read a set of double values from a file, separated by whitespaces (but not necessarily separated by newlines), I wrote a small test.
This should not be considered as a real benchmark, and be taken with a grain of salt, but it at least tries to address some basic issues: It reads files with different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disc caching should be the same for all methods:
Updated to generate sample data as described in the comment, and added the stream-based approach
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;
public class ReadingFileWithDoubles
{
private static final int MIN_SIZE = 256000;
private static final int MAX_SIZE = 2048000;
public static void main(String[] args) throws IOException
{
generateFiles();
long before = 0;
long after = 0;
double result[] = null;
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
for (int i=0; i<10; i++)
{
before = System.nanoTime();
result = readWithScanner(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithScanner " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStreamTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStreamTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithBufferAndStringTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithBufferAndStringTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStream(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStream " +
(after - before) / 1e6 +
", result " + result);
}
}
}
private static double[] readWithScanner(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
Scanner scanner = new Scanner(br))
{
// Do this to avoid surprises on systems with a different locale!
scanner.useLocale(Locale.ENGLISH);
int idx = 0;
double array[] = new double[size];
while (idx < size)
{
array[idx] = scanner.nextDouble();
idx++;
}
return array;
}
}
private static double[] readWithStreamTokenizer(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StreamTokenizer st = new StreamTokenizer(br);
st.resetSyntax();
st.wordChars('0', '9');
st.wordChars('.', '.');
st.wordChars('-', '-');
st.wordChars('e', 'e');
st.wordChars('E', 'E');
double array[] = new double[size];
int index = 0;
boolean eof = false;
do
{
int token = st.nextToken();
switch (token)
{
case StreamTokenizer.TT_EOF:
eof = true;
break;
case StreamTokenizer.TT_WORD:
double d = Double.parseDouble(st.sval);
array[index++] = d;
break;
}
} while (!eof);
return array;
}
}
// This one is reading the whole file into memory, as a String,
// which may not be appropriate for large files
private static double[] readWithBufferAndStringTokenizer(
String fileName, int size) throws IOException
{
double array[] = new double[size];
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StringBuilder sb = new StringBuilder();
char buffer[] = new char[1024];
while (true)
{
int n = br.read(buffer);
if (n == -1)
{
break;
}
sb.append(buffer, 0, n);
}
int index = 0;
StringTokenizer st = new StringTokenizer(sb.toString());
while (st.hasMoreTokens())
{
array[index++] = Double.parseDouble(st.nextToken());
}
return array;
}
}
private static double[] readWithStream(
String fileName, int size) throws IOException
{
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
return result;
}
private static void generateFiles() throws IOException
{
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
if (!new File(fileName).exists())
{
System.out.println("Creating "+fileName);
writeDoubles(new FileOutputStream(fileName), n);
}
else
{
System.out.println("File "+fileName+" already exists");
}
}
}
private static void writeDoubles(OutputStream os, int n) throws IOException
{
OutputStreamWriter writer = new OutputStreamWriter(os);
Random random = new Random(0);
int numbersPerLine = random.nextInt(4) + 1;
for (int i=0; i<n; i++)
{
writer.write(String.valueOf(random.nextDouble()));
numbersPerLine--;
if (numbersPerLine == 0)
{
writer.write("\n");
numbersPerLine = random.nextInt(4) + 1;
}
else
{
writer.write(" ");
}
}
writer.close();
}
}
It compares 4 methods:
Reading with a Scanner, as in your original code snippet
Reading with a StreamTokenizer
Reading the whole file into a String, and dissecting it with a StringTokenizer
Reading the file as a Stream of lines, which are then flat-mapped to a Stream of tokens, which are then mapped to a DoubleStream
Reading the file as one large String may not be appropriate in all cases: When the files become (much) larger, then keeping the whole file in memory as a String may not be a viable solution.
A test run (on a rather old PC, with a slow hard disc drive (no solid state)) showed roughly these results:
...
size = 1024000, readWithScanner 9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer 1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream 2197.785473, result [D@1469ea2
...
Obviously, the scanner imposes a considerable overhead that may be avoided when reading more directly from the stream.
This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to see them!), but maybe it is helpful at least.
EDIT
A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is
double array[] = new double[size];
for (int i=0; i<size; i++)
{
array[i] = readDoubleFromInput();
}
versus
double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
double d = readDoubleFromInput();
array[index++] = d;
}
Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that the size is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.

how to reverse arraylist<Double> of extracted column data in file

I am working with a CSV file that holds a very large dataset. While reading the file, in each iteration of the while loop I extract the ';'-separated numeric value at the 4th place (BALANCE) of the row and, after some mathematical calculation (here a division), build an ArrayList of Double from it.
Now I want to store this ArrayList of Double back at its original position (here the 4th place in the file) in reverse order (from end to beginning). Example:
input
1,2,3,4
2,3,4,5
3,4,5,6
output
1,2,3,6
2,3,4,5
3,4,5,4
I have tried to reverse it but did not succeed. I don't know whether that was a suitable method for my problem or not. How can I do this?
Afterwards, using a StringBuilder, I write the data back into a new file with the writer.
package csvtest7;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.FileWriter;
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.ListIterator;
public class Csvtest7 {
public static void main(String[] args)throws IOException {
String filename = "sample dataset.csv";
List<Double> list = new ArrayList<Double>();
File file = new File(filename);
BufferedWriter writer = null;
try {
writer = new BufferedWriter(new FileWriter("lyupdated.csv"));
} catch (IOException e) {
}
try {
Scanner inputStream = new Scanner(file);
inputStream.next();
int count = 0;
int number = 11;
while (inputStream.hasNext()) {
String data = inputStream.next();
String[] values = data.split(",");
double balance = Double.parseDouble(values[3]);
balance = balance / number;
count = count+1;
values[3] = String.valueOf(balance);
list.add(balance);
Collections.reverse(list); // I tried this method but it does not work.
// iterate through the values and build a string out of them
StringBuilder sb = new StringBuilder();
// String newData = sb.toString();
for (int i = 0; i < values.length; i++) {
sb.append(values[i]);
if (i < values.length - 1) {
sb.append(";");
}
}
// get the new string
System.out.println(sb.toString());
writer.write(sb.toString()+"\n");
}
writer.close();
inputStream.close();
} catch (FileNotFoundException ex) {
Logger.getLogger(Csvtest7.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
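No answer is attached to this question here, so the following is only a minimal two-pass sketch of one way to do it, assuming the delimiter is ';' throughout (the code above splits on ',' but writes ';') and that the whole file fits in memory; the class name is illustrative:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;

public class ReverseColumnSketch {
    public static void main(String[] args) throws IOException {
        List<String[]> rows = new ArrayList<>();
        List<Double> column = new ArrayList<>();
        try (Scanner in = new Scanner(new File("sample dataset.csv"))) {
            while (in.hasNext()) {
                String[] values = in.next().split(";");
                double balance = Double.parseDouble(values[3]) / 11; // same calculation as above
                column.add(balance);
                rows.add(values);                                    // keep the whole row
            }
        }
        Collections.reverse(column);                                 // reverse only the extracted column
        try (BufferedWriter writer = new BufferedWriter(new FileWriter("lyupdated.csv"))) {
            for (int r = 0; r < rows.size(); r++) {
                String[] values = rows.get(r);
                values[3] = String.valueOf(column.get(r));           // put the reversed value back in place
                writer.write(String.join(";", values) + "\n");
            }
        }
    }
}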

modify and generate new csv file by extracting column values

I want to modify the values of one column of a CSV file with a large dataset. I extract the values of a single column (here the 2nd) and then find the standard deviation with two iterations of a while loop: the 1st to find the mean and the 2nd to find the standard deviation. The standard deviation is multiplied with the extracted value, the result replaces the original value, and an updated CSV file is generated. When I run the code it creates the new file successfully, but the file is blank and the while loops do not iterate. I think there is a problem with both while loops, or the file is not being read; I don't know which it is. Standard deviation: σ = √[Σ(x − MEAN)² ÷ n]. Please help me.
package csvtest7;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.FileWriter;
import java.io.*;
public class Csvtest7 {
public static void main(String[] args)throws IOException {
String filename = "ly.csv";
File file = new File(filename);
BufferedWriter writer = null;
try {
writer = new BufferedWriter(new FileWriter("ly_updated.csv"));
}
catch (IOException e) {
}
try {
Scanner inputStream = new Scanner(file);
inputStream.next();
double Tuple;
int count=0;
Tuple = 0;
double stddev=0;
double stddev1;
double stddev2;
//double Xi;
double MEAN;
double standarddeviation;
while (inputStream.hasNext()) {
String data = inputStream.next();
String[] values = data.split(";");
double balance = Double.parseDouble(values[2]);
balance = balance + 1;
Tuple += balance ;
}
MEAN=Tuple/count;
while (inputStream.hasNext()) {
String data = inputStream.next();
String[] values = data.split(";");
double balance = Double.parseDouble(values[2]);
stddev=balance-MEAN;
stddev1=(stddev*stddev);
stddev2=(stddev1/count);
standarddeviation=Math.sqrt(stddev2);
balance=standarddeviation*balance;
values[2] = String.valueOf(balance);
// iterate through the values and build a string out of them
StringBuilder sb = new StringBuilder();
// String newData = sb.toString();
for (int i = 0; i < values.length; i++) {
sb.append(values[i]);
if (i < values.length - 1) {
sb.append(";");
}
}
// get the new string
System.out.println(sb.toString());
writer.write(sb.toString()+"\n");
}
writer.close();
inputStream.close();
} catch (FileNotFoundException ex) {
Logger.getLogger(Csvtest7.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
You are skipping the second while loop.
The first while loop, while (inputStream.hasNext()) {, executes successfully until there are no more tokens to read from the file. Your second while loop then says while (inputStream.hasNext()) { again. Since you have already read through the file, the Scanner will not move its pointer back to the start of the file; it reports that there are no more tokens to read, and hence the second while loop is skipped.
One way to resolve this issue is to redefine the inputStream:
inputStream = new Scanner(file);
while (inputStream.hasNext()) { // start the second while loop
Or else, within your first while loop, you could do the processing you are trying to do in the second while loop; then you don't need a second while loop.
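A minimal sketch of the two-pass version with the Scanner re-created before the second loop, assuming ';'-separated data and the population formula σ = √[Σ(x − MEAN)² ÷ n] from the question; note that count is incremented here, which the original code never does:
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class StdDevTwoPass {
    public static void main(String[] args) throws IOException {
        File file = new File("ly.csv");

        // Pass 1: mean of column 2
        double sum = 0;
        int count = 0;
        Scanner inputStream = new Scanner(file);
        inputStream.next();                                   // skip the first token (header)
        while (inputStream.hasNext()) {
            String[] values = inputStream.next().split(";");
            sum += Double.parseDouble(values[2]);
            count++;                                          // the original code never increments this
        }
        inputStream.close();
        double mean = sum / count;

        // Pass 2: re-create the Scanner so it starts at the top of the file again
        inputStream = new Scanner(file);
        inputStream.next();
        double sqDevSum = 0;
        while (inputStream.hasNext()) {
            String[] values = inputStream.next().split(";");
            double x = Double.parseDouble(values[2]);
            sqDevSum += (x - mean) * (x - mean);
        }
        inputStream.close();

        double standardDeviation = Math.sqrt(sqDevSum / count);
        System.out.println("mean = " + mean + ", stddev = " + standardDeviation);
        // To write the scaled column (standardDeviation * value) to ly_updated.csv, read the
        // file once more with a fresh Scanner (or buffer the rows) now that the value is known.
    }
}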

Display data on console and also save data to text file.

So here is my code. It seems to work, but it only prints the info to the file rather than doing both (displaying the data on the console and saving it to a text file). Help appreciated.
// imports
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
public class DTM {
// The main method for our Digital Terrain Models
/** @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
//Prints the console output on a text file (Output.txt)
PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
System.setOut(out);
//Declare some variables
int aRows = 401;
int bCols = 401;
String DMTfile = "sk28.asc";
//Declare some tables
double data[][] = new double[aRows][bCols];
BufferedReader file = new BufferedReader(new FileReader(DMTfile));
//Write data into array
for (int i = 0; i < aRows; i++) {
String rowArray[] = file.readLine().split(" ");
for (int j = 0; j < bCols; j++) {
data[i][j] = Double.parseDouble(rowArray[j]);
}
}
//Closing the file
file.close();
//print out the array
for (int i = 0; i < aRows; i++) {
for (int j = 0; j < bCols; j++) {
System.out.println(data[i][j]);
}
}
// this holds the highest value found so far (start at the smallest possible number)
double high = Double.MIN_VALUE;
// this holds the lowest value found so far (start at the biggest possible number)
double low = Double.MAX_VALUE;
//initiate a "For" loop to act as a counter through an array
for (int i = 0; i < data.length; i++) {
for (int j = 0; j < data[i].length; j++)
//determine the highest value
if (data[i][j] > high) {
high = data[i][j];
}
//determine the lowest value
else if (data[i][j] < low) {
low = data[i][j];
}
}
// Code here to find the highest number
System.out.println("Peak in this area = " + high);
// Code here to find the lowest number
System.out.println("Dip in this area = " + low);
}
}
Try the Apache Commons TeeOutputStream.
Untested, but it should do the trick:
PrintStream outStream = System.out;
// only the file output stream
OutputStream os = new FileOutputStream("output.txt", true);
// create a TeeOutputStream that duplicates data to outStream and os
os = new TeeOutputStream(outStream, os);
PrintStream printStream = new PrintStream(os);
System.setOut(printStream);
You're merely redirecting standard output to a file instead of the console. As far as I know there is no way to automagically clone an output onto two streams, but it's pretty easy to do it by hand:
public static void multiPrint(String s, FileOutputStream out) throws IOException {
    System.out.print(s);
    out.write(s.getBytes());
}
Whenever you want to print, you just call this function:
FileOutputStream out = new FileOutputStream("out.txt");
multiPrint("hello world\n", out);
