Loading a large matrix from a text file into Java arrays

My data is stored as large matrices in .txt files with millions of rows and 4 columns of comma-separated values. (Each column stores a different variable, and each row stores a different millisecond's data for all four variables.) There is also some irrelevant header data in the first dozen or so lines. I need to write Java code to load this data into four arrays, one array per column of the txt matrix. The code also needs to be able to tell when the header is done, so that the first data row can be split into entries for the 4 arrays. Finally, it needs to iterate through the millions of data rows, decomposing each row into four numbers and entering each number into the array for the column in which it was located.
Can anyone show me how to alter the code below in order to accomplish this?
I want to find the fastest way to accomplish this processing of millions of rows. Here is my code:
MainClass2.java
package packages;

public class MainClass2 {
    public static void main(String[] args) {
        readfile2 r = new readfile2();
        r.openFile();
        int x1Count = r.readFile();
        r.populateArray(x1Count);
        r.closeFile();
    }
}
readfile2.java
package packages;

import java.io.*;
import java.util.*;

public class readfile2 {

    private Scanner scan1;
    private Scanner scan2;

    public void openFile() {
        try {
            scan1 = new Scanner(new File("C:\\test\\samedatafile.txt"));
            scan2 = new Scanner(new File("C:\\test\\samedatafile.txt")); // second scanner for the populate pass
        }
        catch (Exception e) {
            System.out.println("could not find file");
        }
    }

    // First pass: count the tokens so the arrays can be sized.
    public int readFile() {
        int scan1Count = 0;
        while (scan1.hasNext()) {
            scan1.next();
            scan1Count += 1;
        }
        return scan1Count;
    }

    public double[][] populateArray(int scan1Count) {
        double[] outputArray1 = new double[scan1Count];
        double[] outputArray2 = new double[scan1Count];
        double[] outputArray3 = new double[scan1Count];
        double[] outputArray4 = new double[scan1Count];
        int i = 0;
        while (scan2.hasNext()) {
            // what code do I write here to:
            // 1.) identify the start of my time series rows after the end of the header rows
            //     (e.g. row starts with a number AT LEAST 4 digits in length)
            // 2.) split each time series row's data into a separate new entry for each of the 4 output arrays
            i++;
        }
        return new double[][] { outputArray1, outputArray2, outputArray3, outputArray4 };
    }

    public void closeFile() {
        scan1.close();
        scan2.close();
    }
}
Here are the first 19 lines of a typical data file:
text and numbers on first line
1 msec/sample
3 channels
ECG
Volts
Z_Hamming_0_05_LPF
Ohms
dz/dt
Volts
min,CH2,CH4,CH41,
,3087747,3087747,3087747,
0,-0.0518799,17.0624,0,
1.66667E-05,-0.0509644,17.0624,-0.00288295,
3.33333E-05,-0.0497437,17.0624,-0.00983428,
5E-05,-0.0482178,17.0624,-0.0161573,
6.66667E-05,-0.0466919,17.0624,-0.0204402,
8.33333E-05,-0.0448608,17.0624,-0.0213986,
0.0001,-0.0427246,17.0624,-0.0207532,
0.000116667,-0.0405884,17.0624,-0.0229672,
EDIT
I tested Shilaghae's code suggestion. It seems to work. However, the length of all the resulting arrays is the same as x1Count, so that zeros remain in the places where Shilaghae's pattern matching code is not able to place a number. (This is a result of how I wrote the code originally.)
I was having trouble finding the indices where zeros remain, but there seemed to be a lot more zeros besides the ones expected where the header was. When I graphed the derivative of the temp[1] output, I saw a number of sharp spikes where false zeros in temp[1] might be. If I can tell where the zeros in temp[1], temp[2], and temp[3] are, I might be able to modify the pattern matching to better retain all the data.
Also, it would be nice to simply shorten the output array to no longer include the rows where the header was in the input file. However, the tutorials I have found regarding variable length arrays only show oversimplified examples like:
int[] anArray = {100, 200, 300, 400};
The code might run faster if it no longer uses scan1 to produce scan1Count. I do not want to slow the code down by using an inefficient method to produce a variable-length array. And I also do not want to skip data in my time series in the cases where the pattern matching is not able to split the input row into 4 numbers. I would rather keep the in-time-series zeros so that I can find them and use them to debug the pattern matching.
Can anyone show how to do these things in fast-running code?
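As a note on the variable-length question: the parsed arrays can be shortened after the fact with Arrays.copyOf, with no variable-length structure needed. A minimal sketch; validCount is a hypothetical counter of successfully parsed rows, not a variable from the code above, and the sketch assumes the populate loop increments its row index only when a line parses, so valid rows are packed at the front:
// Uses java.util.Arrays.
// Copies only the first validCount entries of each column array.
public static double[][] trim(double[] a1, double[] a2, double[] a3, double[] a4, int validCount) {
    return new double[][] {
        Arrays.copyOf(a1, validCount),
        Arrays.copyOf(a2, validCount),
        Arrays.copyOf(a3, validCount),
        Arrays.copyOf(a4, validCount)
    };
}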
SECOND EDIT
So
"-{0,1}\\d+.\\d+,"
repeats four times in the expression:
"-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,"
Does
"-{0,1}\\d+.\\d+,"
decompose into the following three statements:
"-{0,1}" means that a minus sign occurs zero or one times, while
"\\d+." means that the minus sign(or lack of minus sign) is followed by several digits of any value followed by a decimal point, so that finally
"\\d+," means that the decimal point is followed by several digits of any value?
If so, what about numbers in my data like "1.66667E-05," or "-8.06131E-05," ? I just scanned one of the input files, and (out of 3+ million 4-column rows) it contains 638 numbers that contain E, of which 5 were in the first column, and 633 were in the last column.
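As a note on that: an unescaped dot in a regex matches any single character, not just a literal decimal point, and the pattern above has no branch for an exponent, so fields like "1.66667E-05" will not match it. Double.parseDouble, however, accepts E-notation directly. A small check, with a pattern extended to admit an optional exponent (an illustration, not the pattern from the answer):
import java.util.regex.Pattern;

public class ENotationCheck {
    public static void main(String[] args) {
        // One field: optional minus, digits, optional fraction, optional exponent.
        // Note the escaped dot; a bare "." in a regex matches any character.
        String field = "-?\\d+(\\.\\d+)?([eE][-+]?\\d+)?";
        Pattern row = Pattern.compile(field + "," + field + "," + field + "," + field + ",");
        System.out.println(row.matcher("1.66667E-05,-0.0509644,17.0624,-0.00288295,").matches()); // true
        System.out.println(Double.parseDouble("1.66667E-05")); // 1.66667E-5 -- parseDouble handles E-notation
    }
}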
FINAL EDIT
The final code was very simple: it just uses String.split() with "," as the regular expression. To do that, I had to manually delete the headers from the input file so that the data only contained rows with 4 comma-separated numbers.
In case anyone is curious, the final working code for this is:
public double[][] populateArray(int scan1Count) {
    double[] outputArray1 = new double[scan1Count];
    double[] outputArray2 = new double[scan1Count];
    double[] outputArray3 = new double[scan1Count];
    double[] outputArray4 = new double[scan1Count];
    try {
        File tempfile = new File("C:\\test\\mydatafile.txt");
        BufferedReader br = new BufferedReader(new FileReader(tempfile));
        String strLine;
        int i = 0;
        while ((strLine = br.readLine()) != null) {
            String[] split = strLine.split(",");
            outputArray1[i] = Double.parseDouble(split[0]);
            outputArray2[i] = Double.parseDouble(split[1]);
            outputArray3[i] = Double.parseDouble(split[2]);
            outputArray4[i] = Double.parseDouble(split[3]);
            i++;
        }
        br.close();
    } catch (IOException e) {
        System.out.println("e for exception is: " + e);
        e.printStackTrace();
    }
    double[][] temp = new double[4][];
    temp[0] = outputArray1;
    temp[1] = outputArray2;
    temp[2] = outputArray3;
    temp[3] = outputArray4;
    return temp;
}
Thank you for everyone's help. I am going to close this thread now because the question has been answered.

You could read the file line by line and, for each line, check with a regular expression (http://www.vogella.de/articles/JavaRegularExpressions/article.html) whether the line contains exactly 4 comma-separated numbers.
If it does, you can split the line with String.split and fill the 4 arrays; otherwise you move on to the next line.
public double[][] populateArray(int scan1Count) {
    double[] outputArray1 = new double[scan1Count];
    double[] outputArray2 = new double[scan1Count];
    double[] outputArray3 = new double[scan1Count];
    double[] outputArray4 = new double[scan1Count];
    // Compile the pattern once, outside the loop
    Pattern pattern = Pattern.compile("-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,-{0,1}\\d+.\\d+,");
    // Read file line by line
    try {
        File tempfile = new File("samedatafile.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(tempfile)));
        String strLine;
        int i = 0;
        while ((strLine = br.readLine()) != null) {
            Matcher matcher = pattern.matcher(strLine);
            if (matcher.matches()) {
                String[] split = strLine.split(",");
                outputArray1[i] = Double.parseDouble(split[0]);
                outputArray2[i] = Double.parseDouble(split[1]);
                outputArray3[i] = Double.parseDouble(split[2]);
                outputArray4[i] = Double.parseDouble(split[3]);
            }
            i++;
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    double[][] temp = new double[4][];
    temp[0] = outputArray1;
    temp[1] = outputArray2;
    temp[2] = outputArray3;
    temp[3] = outputArray4;
    return temp;
}

You can split up each line using String.split().
To skip the headers, you can either read the first N lines and discard them (if you know how many there are) or you will need to look for a specific marker - difficult to advise without seeing your data.
You may also need to change your approach a little because you currently seem to be sizing the arrays according to the total number of lines (assuming your Scanner returns lines?) rather than omitting the count of header lines.
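A minimal sketch of the skip-N-lines variant; the header count of 11 is an assumption read off the sample file shown in the question:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SkipHeaderDemo {
    public static void main(String[] args) throws IOException {
        int headerLines = 11; // assumption: the sample shows 11 lines before the first data row
        BufferedReader br = new BufferedReader(new FileReader("C:\\test\\samedatafile.txt"));
        for (int n = 0; n < headerLines; n++) {
            br.readLine(); // read and discard one header line
        }
        String line;
        while ((line = br.readLine()) != null) {
            String[] fields = line.split(",");
            // fields[0]..fields[3] hold the four column values
        }
        br.close();
    }
}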

I'd deal with the problem of the headers by simply attempting to parse every line as four numbers, and throwing away any lines where the parsing doesn't work. If there is a possibility of unparseable lines after the header lines, then you can set a flag the first time you get a "good" line, and then report any subsequent "bad" lines.
Split the lines with String.split(...). It is not the absolute fastest way to do it, but the CPU time of your program will be spent elsewhere ... so it probably doesn't matter.
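A minimal sketch of that parse-and-discard approach; the file path and the growable row list are illustrative, not from the question:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ParseOrSkip {
    public static void main(String[] args) throws IOException {
        List<double[]> rows = new ArrayList<>();
        boolean seenGoodLine = false;
        BufferedReader br = new BufferedReader(new FileReader("C:\\test\\samedatafile.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            String[] parts = line.split(",");
            try {
                if (parts.length < 4) throw new NumberFormatException("too few fields");
                double[] row = { Double.parseDouble(parts[0]), Double.parseDouble(parts[1]),
                                 Double.parseDouble(parts[2]), Double.parseDouble(parts[3]) };
                rows.add(row);
                seenGoodLine = true;
            } catch (NumberFormatException e) {
                // Header lines land here; a bad line after good data is worth reporting.
                if (seenGoodLine) System.err.println("Bad line after data started: " + line);
            }
        }
        br.close();
        System.out.println("Parsed " + rows.size() + " data rows");
    }
}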

Related

How to delimit new line when reading CSV file?

I am trying to read a file where each line has data members, separated by commas, that are meant to populate an object's data members. I tried using the regex "|" symbol to combine "," with "\n" and "\r" as delimiters, to handle getting to the new line. However, after reading the first line, the first data member of the second line does not get read right away; instead a "" character gets read beforehand. Am I using the wrong regex symbols, or am I not using the right approach? I read that there are many ways to tackle this and opted to use Scanner since it seemed the simplest; using the BufferedReader seemed very confusing, since it seems like it returns arrays and not the individual strings and ints I'm trying to get.
The CSV file looks something like this
stringA,stringB,stringC,1,2,3
stringD,stringE,stringF,4,5,6
stringG,stringH,stringI,7,8,9
My code looks something like this
// In list class
public void load() throws FileNotFoundException
{
    Scanner input = new Scanner(new FileReader("a_file.csv"));
    object to_add; // To be added to the list
    input.useDelimiter(",|\\n|\\r");
    while (input.hasNext())
    {
        String n = input.next(); // After the first loop run, this data gets the value ""
        String l = input.next(); // During this second run, this member gets the data that n was supposed to get, "stringD"
        String d = input.next(); // This one gets "stringE"
        int a = input.nextInt(); // And this one tries to get "stringF", which makes it crash
        int b = input.nextInt();
        int c = input.nextInt();
        to_add = new object(n, l, d, a, b, c); // Calling copy constructor to populate data members
        insert(to_add); // Inserting object to the list
    }
    input.close();
}
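A note on the "" token: with the delimiter ",|\\n|\\r", a Windows-style \r\n line ending is seen as two adjacent delimiters, and Scanner hands back the empty string between them. Treating the line break as a single delimiter avoids it. A sketch, assuming the file has \r\n line endings:
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class DelimiterFix {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner input = new Scanner(new FileReader("a_file.csv"));
        // "," or an optional \r followed by \n counts as ONE delimiter,
        // so no empty token appears between \r and \n.
        input.useDelimiter(",|\\r?\\n");
        while (input.hasNext()) {
            System.out.println(input.next());
        }
        input.close();
    }
}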
Use Apache Commons CSV. Here is the user guide https://commons.apache.org/proper/commons-csv/user-guide.html
You can do this with OpenCSV; here is a tutorial on how to use the library. You can download it from the Maven Repository.
The following code does what you need:
Reader reader = Files.newBufferedReader(Paths.get("path/to/csvfile.csv"));
CSVReader csvReader = new CSVReader(reader);
List<String[]> dataList = csvReader.readAll();
csvReader.close(); // also closes the underlying reader

object to_add;
for (String[] rowData : dataList) {
    String textOne = rowData[0];
    String textTwo = rowData[1];
    String textThree = rowData[2];
    int numberOne = Integer.parseInt(rowData[3]);
    int numberTwo = Integer.parseInt(rowData[4]);
    int numberThree = Integer.parseInt(rowData[5]);
    to_add = new object(textOne, textTwo, textThree, numberOne, numberTwo, numberThree);
    insert(to_add);
}

Having a bit of trouble reading in a file that contains specific information in each line

So I am having trouble reading in a file. The file contains 2 integers in the first line, and the rest of the file contains Strings on separate lines. For some reason the logic in this code does not seem to consume each line in the file correctly. I tried to troubleshoot by printing out what was happening, and it seems like the second nextLine() is not even executing.
while (inputFile.hasNext())
{
    try
    {
        String start = inputFile.nextLine();
        System.out.println(start); // tried to troubleshoot here
        String[] rowsAndCols = start.split(" "); // part where it should read the first two integers
        System.out.println(rowsAndCols[0]); // tried to troubleshoot here
        int rows = Integer.parseInt(rowsAndCols[0]);
        int cols = Integer.parseInt(rowsAndCols[1]);
        cell = new MazeCell.CellType[rows+2][cols+2];
        String mazeStart = inputFile.nextLine(); // part where it should begin to read the rest of the file containing strings
        String[] mazeRowsAndCols = mazeStart.split(" ");
        MazeCell.CellType cell2Add;
Based upon your description above, only the first line contains integers, so your while loop is wrong: it is trying to convert every line into integers.
Split the code into:
if (inputFile.hasNextLine())
{
    String start = inputFile.nextLine();
    String[] rowsAndCols = start.split(" "); // reads the first two integers
    int rows = Integer.parseInt(rowsAndCols[0]);
    int cols = Integer.parseInt(rowsAndCols[1]);
}
then a while loop for the String reading
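Putting the two parts together, a sketch (MazeCell and cell are names borrowed from the question; the maze-row parsing is left as a comment since the rest of the format isn't shown):
if (inputFile.hasNextLine())
{
    String[] rowsAndCols = inputFile.nextLine().split(" ");
    int rows = Integer.parseInt(rowsAndCols[0]);
    int cols = Integer.parseInt(rowsAndCols[1]);
    cell = new MazeCell.CellType[rows + 2][cols + 2];
}
int row = 0;
while (inputFile.hasNextLine())
{
    String mazeRow = inputFile.nextLine(); // one String row of the maze
    // ... convert mazeRow's characters into cell[row + 1][...] here
    row++;
}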

Read from .csv excel file and compute average

How do I read a .csv excel file with x number of rows and y number of columns, ignore irrelevant cells (things like names), then compute an average of the numbers in each column?
The Excel I have is something like this (, indicates new cell):
ID, week 1, week 2, week 3, .... , week 7
0 , 1 , 0.5 , 0 , , 1.2
1 , 0.5 , 1 , 0.5 , , 0.5
y , ......
So, how do I make it read that kind of .csv file and then compute an average in the format "Week 1 = (week 1 average), Week 2 = (week 2 average)" for all weeks?
Also am I correct in assuming I need to use a 2D Array for this?
Edit
Here's my code so far, it's very crude and I'm not sure if it does things properly yet:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ClassAverage {

    public static void main(String[] args) {
        readFile2Array("attendance.csv");
    }

    public static double[][] readFile2Array(String fileName) {
        try {
            int rowCount = 0;
            int colCount = 0;
            // first pass: count the rows
            Scanner rc = new Scanner(new File("attendance.csv"));
            while (rc.hasNextLine()) {
                rowCount++;
                rc.nextLine();
            }
            rc.close();
            System.out.println(rowCount);
            // second pass: count the cells, then columns = cells / rows
            Scanner cc = new Scanner(new File("attendance.csv"));
            while (cc.hasNext()) {
                colCount++;
                cc.next();
            }
            cc.close();
            colCount = colCount / rowCount;
            System.out.println(colCount);
            // third pass: fill the 2D array
            Scanner sc = new Scanner(new File("attendance.csv"));
            double[][] spreadSheet = new double[rowCount][colCount];
            while (sc.hasNext()) {
                for (int i = 0; i < spreadSheet.length; ++i) {
                    for (int j = 0; j < spreadSheet[i].length; ++j) {
                        spreadSheet[i][j] = Double.parseDouble(sc.next());
                    }
                }
            }
            sc.close();
            return spreadSheet;
        } catch (FileNotFoundException e) {
            System.out.println("File cannot be opened");
            e.printStackTrace();
        }
        return null;
    }

    public static double weeklyAvg(double[][] a) {
        return 0; // not written yet
    }
}
So a summary of what it's intended to do
readFile2Array: read the csv file and count the number of rows, then count the total number of cells, divide total number of cells by number of rows to find number of columns. Read again and put each cell into the correct place in a 2D array.
weeklyAvg: I haven't thought up a way to do this yet, but it's supposed to read the array column by column and compute an average for each column, then print out the result.
PS. I'm very new at Java, so I have no idea what some suggestions mean; I'd really appreciate suggestions that are pure Java, without add-ons and the like (I'm not sure if that's even what some people are suggesting). I hope it's not too much to ask for (if it's even possible).
You can use a Java library to handle your CSV file, for example opencsv (you can find the latest Maven version here: http://mvnrepository.com/artifact/com.opencsv/opencsv/3.5).
Then you can parse your file like this:
CSVReader reader = new CSVReader(new FileReader("PATH_TO_YOUR_FILE"));
String[] nextLine;
int counter = 0;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    System.out.println(nextLine[0] + nextLine[1]);
}
You have to ignore the header line; you can do this by incrementing a counter and skipping the row where the counter is zero.
To compute the average you can use a HashMap where the key is the column header name (for example "week 1"). For each data line you add the current value to that column's running total, and after the loop completes you divide by the number of lines (don't forget to subtract the ignored lines, like the header line).
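A minimal sketch of that idea with opencsv; the file name, and the assumption that every cell after the ID column parses as a number, are illustrative:
import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class WeeklyAverages {
    public static void main(String[] args) throws Exception {
        CSVReader reader = new CSVReader(new FileReader("attendance.csv"));
        String[] header = reader.readNext(); // "ID, week 1, ..., week 7"
        Map<String, Double> totals = new LinkedHashMap<>();
        String[] row;
        int dataLines = 0;
        while ((row = reader.readNext()) != null) {
            dataLines++;
            for (int col = 1; col < row.length; col++) { // skip the ID column
                totals.merge(header[col].trim(), Double.parseDouble(row[col].trim()), Double::sum);
            }
        }
        reader.close();
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue() / dataLines);
        }
    }
}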
To parse simple CSV files, it's pretty simple to just parse through them manually, as long as you know the format is the same throughout the file and it does not contain errors (a sketch follows the steps below):
Create a storage data structure for each column you wish to compute (use a LinkedList<String>)
Read through the CSV file line by line with a BufferedReader
Use String.split(",") on each line and add the specific columns in the returned array to the correct LinkedList
Loop through the LinkedLists at the end and compute your averages (using Double.parseDouble() to convert the Strings to doubles)
To make sure that the String you're attempting to parse is a double, you can either use a try-catch statement or use a regex. Check Java: how to check that a string is parsable to a double? for more information
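A sketch of those four steps for the attendance file above (the column layout is an assumption from the sample: the ID first, then seven week columns):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ManualCsvAverage {
    public static void main(String[] args) throws IOException {
        int weekCount = 7; // assumption from the sample layout
        List<LinkedList<String>> columns = new ArrayList<>();
        for (int w = 0; w < weekCount; w++) columns.add(new LinkedList<>());

        BufferedReader br = new BufferedReader(new FileReader("attendance.csv"));
        br.readLine(); // discard the header line
        String line;
        while ((line = br.readLine()) != null) {
            String[] cells = line.split(",");
            for (int w = 0; w < weekCount; w++) {
                columns.get(w).add(cells[w + 1].trim()); // cells[0] is the ID
            }
        }
        br.close();

        for (int w = 0; w < weekCount; w++) {
            double sum = 0;
            for (String s : columns.get(w)) sum += Double.parseDouble(s);
            System.out.println("Week " + (w + 1) + " = " + sum / columns.get(w).size());
        }
    }
}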

Scanning from a certain, random line of a file in java?

I have a .txt file that lists integers in groups like so:
20,15,10,1,2
7,8,9,22,23
11,12,13,9,14
and I want to read in one of those groups randomly and store that group's integers in an array. How would I go about doing this? Every group has one line of five integers separated by commas. The only way I could think of is incrementing a variable in a while loop to count the number of lines, and then somehow reading from one randomly chosen line, but I'm not sure how to read from just that one line. Here's the code I could come up with to sort of explain what I'm thinking:
int line = 0;
Scanner filescan = new Scanner(new File("Coords.txt"));
while (filescan.hasNextLine())
{
    line++;
}
Random r = new Random(line);
Now what do I do to make it scan line r and place all of the integers read on line r into a 1-d array?
There is an old answer on StackOverflow about choosing a line randomly. By using the choose() method from it you can randomly get any line. I take no credit for the answer; if you like my answer, upvote the original one.
String[] numberLine = choose(new File("Coords.txt")).split(",");
int[] numbers = new int[5];
for (int i = 0; i < 5; i++)
    numbers[i] = Integer.parseInt(numberLine[i]);
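The linked answer's choose() isn't reproduced above; a sketch of the usual single-pass reservoir-sampling version (my reconstruction, not necessarily the original's exact code):
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Random;
import java.util.Scanner;

public class RandomLine {
    // Pick one line uniformly at random in a single pass:
    // the n-th line read replaces the current choice with probability 1/n.
    static String choose(File f) throws FileNotFoundException {
        String result = null;
        Random rand = new Random();
        int n = 0;
        Scanner sc = new Scanner(f);
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
            n++;
            if (rand.nextInt(n) == 0) {
                result = line;
            }
        }
        sc.close();
        return result;
    }
}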
I'm assuming you know how to parse the line and get the integers out (Integer.parseInt, perhaps with a regular expression). If you're using a Scanner, you can specify that in your constructor.
Keep the contents of each line, and use that:
int line = 0;
Scanner filescan = new Scanner(new File("Coords.txt"));
List<String> content = new ArrayList<String>(); // new
while (filescan.hasNextLine())
{
    content.add(filescan.nextLine()); // new -- nextLine(), so the whole line is stored
    line++;
}
Random r = new Random();
String numbers = content.get(r.nextInt(content.size())); // new
// Get numbers out of "numbers"
Read lines one by one from the file, store them in a list, then generate a random number bounded by the list's size and use it to pick the random line.
public static void main(String[] args) throws Exception {
    List<String> aList = new ArrayList<String>();
    Scanner filescan = new Scanner(new File("Coords.txt"));
    while (filescan.hasNextLine()) {
        String nxtLn = filescan.nextLine();
        // there can be empty lines in your file, ignore them
        if (!nxtLn.isEmpty()) {
            // add lines to the list
            aList.add(nxtLn);
        }
    }
    System.out.println();
    Random r = new Random();
    int randomIndex = r.nextInt(aList.size());
    // get the random line
    String line = aList.get(randomIndex);
    // make 1-d array
    // ...
}

Merge sorting multiple files which have variable word counts

I am splitting a 10 GB file into multiple files of 100,000-plus words each (plus a few hundred extra, since I read up to the end of the line on which I encounter the 100,000th word).
private void splitInputFile(String path) {
    try {
        File file = new File(path);
        FileReader fr = new FileReader(file);
        BufferedReader br = new BufferedReader(fr);
        String temp;
        temp = br.readLine();
        String fileName = "fileName";
        int fileCount = 1;
        while (temp != null) {
            // TODO Read 100000 words, sort and write to a file. Repeat for the entire file
            if (wordsToBeSorted.size() <= 100000) {
                startCounting(temp);
                temp = br.readLine();
            } // end of if -> place 100000+ words inside the list
            else {
                Collections.sort(wordsToBeSorted);
                fileName = "fileName" + fileCount;
                fileCount++;
                File splitFile = new File(fileName);
                PrintWriter pr = new PrintWriter(splitFile);
                for (String word : wordsToBeSorted) {
                    pr.write(word);
                    pr.write("\n"); // one word per line
                } // end of for
                pr.close(); // flush this chunk to disk
                wordsToBeSorted.clear(); // start collecting the next chunk
            } // end of else
        } // end of while
        mergeSort(fileCount);
    } // end of try
    catch (Exception e) {
        e.printStackTrace();
    }
}
private void startCounting(String sb) {
    StringTokenizer tokenizer = new StringTokenizer(sb); // split by space
    while (tokenizer.hasMoreTokens()) {
        String text = tokenizer.nextToken();
        text = text.replaceAll("\\W", ""); // remove all symbols
        if ("".equals(text.trim()))
            continue;
        wordsToBeSorted.add(text);
    }
}
Now I wonder how to sort these files. I found out that I am supposed to do a merge sort. Considering that each splitFile will have a variable number of words (100,000 plus a few extra), is it possible to do a merge sort involving files of variable word counts? Or should I follow some other approach to splitting the file?
is it possible to do a merge sort involving files of variable word counts?
Sure. I assume the goal here is external sorting. Just open up all input files (unless there are really, really many, in which case you might have to do multiple runs) and read the first word from each. Then identify the input with the smallest word, put that word into the output, and read the next word from that input. Close and remove any inputs which become empty, until you have no more inputs.
If you have many inputs, you can use a heap to organize your inputs, with the next word as key. You'd remove the minimal object and then reinsert it after you have proceeded to the next word.
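A sketch of that heap-based merge; the fileName1, fileName2, ... names follow the question's naming scheme, and everything else is illustrative:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.PriorityQueue;

public class KWayMerge {
    // One open chunk file plus the word most recently read from it.
    static class Source {
        BufferedReader reader;
        String word;
        Source(BufferedReader r) throws IOException { reader = r; word = r.readLine(); }
    }

    public static void merge(int fileCount, String outPath) throws IOException {
        // Heap keyed on each source's current word.
        PriorityQueue<Source> heap = new PriorityQueue<>((a, b) -> a.word.compareTo(b.word));
        for (int i = 1; i < fileCount; i++) { // chunks were named fileName1 .. fileName(fileCount-1)
            Source s = new Source(new BufferedReader(new FileReader("fileName" + i)));
            if (s.word != null) heap.add(s);
        }
        PrintWriter out = new PrintWriter(outPath);
        while (!heap.isEmpty()) {
            Source smallest = heap.poll();              // input with the smallest current word
            out.println(smallest.word);
            smallest.word = smallest.reader.readLine(); // advance that input
            if (smallest.word != null) {
                heap.add(smallest);                     // reinsert with its new key
            } else {
                smallest.reader.close();                // this chunk is exhausted
            }
        }
        out.close();
    }
}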
