selecting random lines from huge text file - java

I have a very large text file (18,000,000 lines, about 4 GB), and I want to pick some random lines from it. I wrote the following piece of code to do this, but it is slow:
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Main {
    public static void main(String[] args) throws IOException {
        int sampleSize = 3000;
        int fileSize = 18000000;
        int[] linesNumber = new int[sampleSize];
        Random r = new Random();
        for (int i = 0; i < linesNumber.length; i++) {
            linesNumber[i] = r.nextInt(fileSize);
        }
        List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList());
        Collections.sort(list);
        BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("output.txt"));
        for (int i : list) {
            try (Stream<String> lines = Files.lines(Paths.get("huge_text_file"))) {
                String line = lines.skip(i).findFirst().get();
                outputWriter.write(line + "\n");
            } catch (Exception e) {
                System.err.println(e);
            }
        }
        outputWriter.close();
    }
}
Is there a more elegant, faster method to do this?
Thanks.

There are several things that I find troublesome about your current code.
You are currently loading the entire file into RAM. I don't know much about your sample file, but the one I used crashed my default JVM.
You are skipping the same lines over and over again, more so for the earlier lines. This is horribly inefficient, roughly O(k * n) where k is the sample size and n is the number of lines, because the file is re-read from the beginning for every sampled line. I would be surprised if you could handle even a 500MB file with that approach.
Here's what I came up with:
public static void main(String[] args) throws IOException {
    int sampleSize = 3000;
    int fileSize = 50000;
    int[] linesNumber = new int[sampleSize];
    Random r = new Random();
    for (int i = 0; i < linesNumber.length; i++) {
        linesNumber[i] = r.nextInt(fileSize);
    }
    List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList());
    Collections.sort(list);
    BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("localOutput/output.txt"));
    long t1 = System.currentTimeMillis();
    try (BufferedReader reader = new BufferedReader(new FileReader("extremely large file.txt"))) {
        int index = 0;        // keep track of what item we're on in the list
        int currentIndex = 0; // keep track of what line we're on in the input file
        while (index < sampleSize) { // while we still haven't finished the list
            if (currentIndex == list.get(index)) { // if we've reached a sampled line
                outputWriter.write(reader.readLine());
                outputWriter.write("\n"); // readLine doesn't include the newline characters
                while (index < sampleSize && list.get(index) <= currentIndex) {
                    index++; // have to do this here in case of duplicates in the list
                }
            } else {
                reader.readLine(); // readLine is dang fast. There may be faster ways to skip a line, but this is still plenty fast.
            }
            currentIndex++;
        }
    } catch (Exception e) {
        System.err.println(e);
    }
    outputWriter.close();
    System.out.println(String.format("Took %d milliseconds", System.currentTimeMillis() - t1));
}
This takes ~87 milliseconds for me on a 4.7GB file with a sample size of 30 and a fileSize of 50,000, and took ~91 milliseconds when I changed the sample size to 3,000. It took 122 milliseconds when I increased the sample size to 10,000. Tl;dr for this paragraph: it scales pretty well, and it scales extremely well with larger sample sizes.
In direct answer to your question "is there more elegant faster method to do this?": yes, there is. The faster way is to skip the lines yourself, avoid loading the entire file into memory, and keep using buffered readers and writers. Also, I'd avoid trying to roll your own raw array buffers or anything like that - just don't.
Feel free to step through the method I've included if you want to see more of how it works.

My first cut at an approach would be to have a look at random access files in Java, cf. https://docs.oracle.com/javase/tutorial/essential/io/rafs.html. Typically random seeks will be a lot faster than reading the whole file, but you'd then need to read byte by byte to get to the beginning of the next line, then read that line byte by byte up to the following newline, then seek to another random location.
I'm not sure the approach would be more elegant (that depends partly on how you code it, I guess), but I'd expect it to be faster.

There is no efficient way to seek to a line. The only thing I can think of is using a RandomAccessFile, seeking to a random position, and then reading the next 200(?) characters into an array. Then do the line-break finding and form a String.
See the RandomAccessFile documentation.
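A rough sketch of the seek-and-scan idea both of these answers describe; this is my own illustration, not tested against the original file. It assumes \n line endings, and note that the selection is biased toward longer lines (they are more likely to be hit by a random byte offset):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomLineSketch {
    // Picks one "random" line by seeking to a random byte offset,
    // discarding the remainder of the line we landed in, and
    // returning the next full line.
    static String randomLine(RandomAccessFile raf, Random r) throws IOException {
        long offset = (long) (r.nextDouble() * raf.length());
        raf.seek(offset);
        raf.readLine();          // discard the partial line we landed in
        String line = raf.readLine();
        if (line == null) {      // fell off the end of the file; wrap around
            raf.seek(0);
            line = raf.readLine();
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        Random r = new Random();
        try (RandomAccessFile raf = new RandomAccessFile("huge_text_file", "r")) {
            for (int i = 0; i < 3000; i++) {
                System.out.println(randomLine(raf, r));
            }
        }
    }
}

Each sample costs one seek plus two short reads instead of a scan from the start of the file, which is why this tends to beat rereading the file per sample.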

Related

Sum of Numbers program that prints the total sum of all integers stored on a file

I'm still pretty new to files in Java, and I have not yet grasped using loops to process a file. Though it may be a simple question to most, I'm having difficulty getting the total sum of the integer values stored in a text file, try as I might. Any help is appreciated!
Here is the question I'm working on:
Assume that a file containing a series of integers is named number.dat and exists on the computer's disk. Design a program that calculates the average of all the numbers stored in the file.
Here is the code I currently have:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class Sum_of_Numbers {
    public static void main(String[] args) throws FileNotFoundException {
        File CalFile = new File("C:\\Users\\Tyrese\\JAVA 2\\Chapter 10\\number.dat.txt");
        Scanner bot = new Scanner(CalFile);
        int sum = 0;
        while (bot.hasNextLine()) {
            sum += 1;
            bot.nextLine();
        }
        System.out.println("The number in the file sum up to:" + sum);
    }
}
I'm unable to get the correct output as intended. Please feel free to modify the file path if necessary, or the code, to aid me in completing the question I'm baffled by; again, any help is appreciated.
When you do
int sum = 0;
while (bot.hasNextLine()) {
    sum += 1;
    bot.nextLine();
}
you're just counting how many lines are in the file. You can do something like this instead:
int sum = 0;
while (bot.hasNextLine()) {
    sum += bot.nextInt();
    bot.nextLine();
}
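Since the assignment asks for the average rather than just the sum, here is a minimal self-contained sketch along the same lines; the file name and one-number-per-line format are assumed from the question, and hasNextInt avoids the edge case where the last line has no trailing newline:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class AverageOfNumbers {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner bot = new Scanner(new File("number.dat.txt"));
        int sum = 0;
        int count = 0;
        while (bot.hasNextInt()) { // read every integer in the file
            sum += bot.nextInt();
            count++;
        }
        bot.close();
        if (count > 0) {
            System.out.println("Sum: " + sum);
            System.out.println("Average: " + (double) sum / count);
        } else {
            System.out.println("The file contains no integers.");
        }
    }
}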

Java "Quicksave" Execution

The problem I seem to have hit is one relating to loading times; I'm not running on a particularly fast machine by any means, but I still want to dabble in neural networks. In short, I have to load 35,280,000 integers into one large array (I'm using the MNIST database; each image is 28x28, which amounts to 784 pixels per image, times 45,000 images). It works fine, and surprisingly I don't run out of RAM, but... it takes 4 and a half hours just to get the data into an array.
I can supply the rest of the code if you want me to, but here's the function that runs through the file.
public static short[][] readfile(String fileName) throws FileNotFoundException, IOException {
    short[][] array = new short[45000][784]; // was new short[10000][784], too small for 45,000 rows
    BufferedReader br = new BufferedReader(new FileReader(System.getProperty("user.dir") + "/MNIST/" + fileName + ".csv"));
    br.readLine(); // skip the CSV header
    try {
        for (int i = 0; i < 45000; i++) { // int, not short: 45000 overflows a short counter
            String line = br.readLine();
            for (int j = 0; j < 784; j++) {
                array[i][j] = Short.parseShort(line.split(",")[j]);
            }
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return array;
}
What I want to know is, is there some way to "quicksave" the execution of the program so that I don't have to rebuild the array for every small tweak?
Note: I haven't touched Java in a while, and my code is mostly chunked together from a lot of different sources. I wouldn't be surprised if there were some serious errors (or just Java "no-nos"); it would actually help me a lot if you could fix them in your answer.
Edit: Bad question, I'm just blind... sorry for wasting time
Edit 2: I've decided after a while that instead of loading all of the images, and then training with them one by one, I could simply train one by one and load the next. Thank you all for your ideas!
array[i][j] = Short.parseShort(line.split(",")[j]);
You are calling String#split() for every single integer.
Call it once outside the loop and copy the values into your 2D array.
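A minimal sketch of that fix, reusing the names from the question's method:

// Parse each CSV line once instead of 784 times.
String line = br.readLine();
String[] tokens = line.split(",");  // one split per line
for (int j = 0; j < 784; j++) {
    array[i][j] = Short.parseShort(tokens[j]);
}

For the "quicksave" part of the question, one option (my suggestion, with hypothetical helper names, not something covered above) is to parse the CSV once and cache the array in a compact binary file, which loads far faster on later runs:

import java.io.*;

// Hypothetical helpers for caching the parsed MNIST array between runs.
static void saveCache(short[][] array, File cache) throws IOException {
    try (DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(cache)))) {
        for (short[] row : array)
            for (short v : row)
                out.writeShort(v);
    }
}

static short[][] loadCache(File cache, int rows, int cols) throws IOException {
    short[][] array = new short[rows][cols];
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(cache)))) {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                array[i][j] = in.readShort();
    }
    return array;
}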

Heap error in fibonacci exercise

I was given a simple task of writing a program that finds all Fibonacci numbers in a given file and then returns the biggest and the smallest.
Unfortunately, while trying to execute it I get a Java heap space error.
I really don't have a clue where the mistake is, so can you help me a little?
And how do I avoid repeating the same mistake in the future?
package maturaWielkanoc;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class Zad4 {
    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        list.add(0);
        list.add(1);
        int counter = 1;
        while (list.get(list.size() - 1) < 908589244) {
            // 908589244 is one bigger than the biggest number in dane.txt
            list.add(list.get(counter) + list.get(counter));
        }
        try (BufferedReader br = new BufferedReader(new FileReader("dane.txt"))) {
            String line;
            List<Integer> answer = new ArrayList<>();
            while ((line = br.readLine()) != null) {
                int value = Integer.parseInt(line);
                if (list.contains(value)) {
                    answer.add(value);
                }
            }
            answer.sort(null);
            System.out.println("Min = " + answer.get(0));
            System.out.println("Max = " + answer.get(answer.size() - 1));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Your application runs with a fixed amount of memory, while a List stores and holds its data in memory, continuously filling it up. Because counter is never incremented, list.add(list.get(counter) + list.get(counter)) appends the same value (2) forever, so the loop condition is never satisfied and you keep adding Integers to the list with no end in sight.
A boxed Integer is around 16 bytes on a typical 64-bit JVM. If you imagine trying to store a billion of them, that's roughly 16 GB, which will quickly fill up your heap.
This task is clearly a lesson in memory management. You will want to find a better way of holding the data, tracking the biggest and smallest Fibonacci numbers without storing every number in memory (consider evaluating each value as you read it from the file instead of adding it to a list).
Instead of keeping every number in a list (you're likely running out of memory), store the largest and the smallest number you have seen so far; if the number read from the file is larger than the largest, replace the largest, and likewise for the smallest:
int largest = Integer.MIN_VALUE;
int smallest = Integer.MAX_VALUE;
try (BufferedReader br = new BufferedReader(new FileReader("dane.txt"))) {
    String fromFile;
    while ((fromFile = br.readLine()) != null) { // read integers from the file one line at a time
        int numberFromFile = Integer.parseInt(fromFile);
        if (numberFromFile > largest) {
            largest = numberFromFile;
        }
        if (numberFromFile < smallest) {
            smallest = numberFromFile;
        }
    }
}
// Now you have largest and smallest without a huge list.

Program is delayed in writing to a .txt file?

So, I've searched around stackoverflow for a bit, but I can't seem to find an answer to this issue.
My current homework for my CS class involves reading from a file of 5000 random numbers and doing various things with the data, like putting it into an array, seeing how many times a number occurs, and finding what the longest increasing sequence is. I've got all that done just fine.
In addition to this, I am (for myself) adding in a method that will allow me to overwrite the file and create 5000 new random numbers to make sure my code works with multiple different test cases.
The method works for the most part; however, after I call it, it doesn't seem to "activate" until after the rest of the program finishes. If I run the program and tell it to change the numbers, I have to run it again to actually see the changed values. Is there a way to fix this?
Current output showing the delay in changing the data:
Not trying to change the data here (control case).
elkshadow5$ ./CompileAndRun.sh
Create a new set of numbers? Y for yes. n
What number are you looking for? 66
66 was found 1 times.
The longest sequence is [606, 3170, 4469, 4801, 5400, 8014]
It is 6 numbers long.
The numbers should change here, but they don't.
elkshadow5$ ./CompileAndRun.sh
Create a new set of numbers? Y for yes. y
What number are you looking for? 66
66 was found 1 times.
The longest sequence is [606, 3170, 4469, 4801, 5400, 8014]
It is 6 numbers long.
Now the data shows the change, one run after it should have appeared.
elkshadow5$ ./CompileAndRun.sh
Create a new set of numbers? Y for yes. n
What number are you looking for? 1
1 was found 3 times.
The longest sequence is [1155, 1501, 4121, 5383, 6000]
It is 5 numbers long.
My code:
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;
public class jeftsdHW2 {
    static Scanner input = new Scanner(System.in);

    public static void main(String args[]) throws Exception {
        jeftsdHW2 random = new jeftsdHW2();
        int[] data;
        data = new int[5000];
        random.readDataFromFile(data);
        random.overwriteRandNums();
    }

    public int countingOccurrences(int find, int[] array) {
        int count = 0;
        for (int i : array) {
            if (i == find) {
                count++;
            }
        }
        return count;
    }

    public int[] longestSequence(int[] array) {
        int[] sequence = new int[0]; // body trimmed for the post; the real method finds the longest increasing sequence
        return sequence;
    }

    public void overwriteRandNums() throws Exception {
        System.out.print("Create a new set of numbers? Y for yes.\t");
        String answer = input.next();
        char yesOrNo = answer.charAt(0);
        if (yesOrNo == 'Y' || yesOrNo == 'y') {
            writeDataToFile();
        }
    }

    public void readDataFromFile(int[] data) throws Exception {
        try {
            java.io.File infile = new java.io.File("5000RandomNumbers.txt");
            Scanner readFile = new Scanner(infile);
            for (int i = 0; i < data.length; i++) {
                data[i] = readFile.nextInt();
            }
            readFile.close();
        } catch (FileNotFoundException e) {
            System.out.println("Please make sure the file \"5000RandomNumbers.txt\" is in the correct directory before trying to run this.");
            System.out.println("Thank you.");
            System.exit(1);
        }
    }

    public void writeDataToFile() throws Exception {
        int j;
        StringBuilder theNumbers = new StringBuilder();
        try {
            PrintWriter writer = new PrintWriter("5000RandomNumbers.txt", "UTF-8");
            for (int i = 0; i < 5000; i++) {
                if (i > 1 && i % 10 == 0) {
                    theNumbers.append("\n");
                }
                j = (int) (9999 * Math.random());
                if (j < 1000) {
                    theNumbers.append(j + "\t\t");
                } else {
                    theNumbers.append(j + "\t");
                }
            }
            writer.print(theNumbers);
            writer.flush();
            writer.close();
        } catch (IOException e) {
            System.out.println("error");
        }
    }
}
It is possible that the file has not been physically written to disk; using flush is not enough to guarantee this. From the Java documentation:
"If the intended destination of this stream is an abstraction provided by the underlying operating system, for example a file, then flushing the stream guarantees only that bytes previously written to the stream are passed to the operating system for writing; it does not guarantee that they are actually written to a physical device such as a disk drive."
Because of HDD read and write speeds, it is advisable to depend as little as possible on HDD access.
Perhaps storing the random number strings in a list when regenerating, and reading from that list, would be a solution. You could still write the list to disk, but this way the rest of the program does not depend on when the file is physically written.
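If the bytes really must reach the physical device before the program continues, FileDescriptor#sync can be used to force them out. A minimal sketch of my own, using the question's file name for illustration:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

public class SyncExample {
    public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream("5000RandomNumbers.txt");
        PrintWriter writer = new PrintWriter(fos);
        writer.print("42\n");
        writer.flush();     // push buffered chars to the OS
        fos.getFD().sync(); // ask the OS to force them to the physical device
        writer.close();
    }
}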
EDIT
After the OP posted more of their code, it became apparent that my original answer is not related to the problem. Nonetheless, it is sound.
The posted code reveals that the file is written after it is read, which is of course what is perceived as the error. Reading after writing would produce a program that does what you want.
That is, this:
random.readDataFromFile(data);
random.overwriteRandNums();
will not be reflected until the next execution, whereas this:
random.overwriteRandNums();
random.readDataFromFile(data);
will use the updated file in the current execution.

Improving the runtime of Insertion sort

Just to practice and improve my programming skills, I decided to solve the questions on InterviewStreet. I decided to start off using simple insertion sort (I expected it to be simple).
https://www.interviewstreet.com/challenges/dashboard/#problem/4e90477dbd22b
I am able to get correct answers; however, the runtime is a problem. The maximum allowed runtime for the test cases is 5s, and I am going slightly over.
I used a few tricks (like removing unnecessary work from the code, storing the result of str.length(), etc.). However, I am still slightly over the limit.
The current runtimes for the ten test cases are:
1 Passed Success 0.160537
2 Passed Success 0.182606
3 Passed Success 0.172744
4 Passed Success 0.186676
5 Failed Time limit exceeded. 5.19279
6 Failed Time limit exceeded. 5.16129
7 Passed Success 2.91226
8 Failed Time limit exceeded. 5.14609
9 Failed Time limit exceeded. 5.14648
10 Failed Time limit exceeded. 5.16734
I am not aware of what the test cases are.
Kindly help me improve the runtime.
Thank you.
import java.io.*;

public class Solution {
    public static int[] A = new int[100001];
    public static int swap = 0;

    public static void InsertionSort(int n) {
        for (int i = 1; i <= n; i++) {
            for (int var = i; var > 0; var--) {
                if (A[var] < A[var - 1]) {
                    int temp = A[var - 1];
                    A[var - 1] = A[var];
                    A[var] = temp;
                    swap++;
                } else {
                    break;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String str = br.readLine();
        int number_of_cases = Integer.parseInt(str);
        int counter;
        int[] spacearray = new int[100000];
        for (int j = 0; j < number_of_cases; j++) {
            swap = 0;
            str = br.readLine();
            int arraylength = Integer.parseInt(str);
            str = br.readLine();
            counter = 0;
            int strlen = str.length();
            for (int i = 0; i < strlen - 1; i++) {
                if (str.charAt(i) == ' ') {
                    spacearray[counter] = i;
                    counter++;
                }
            }
            spacearray[counter] = strlen;
            A[0] = Integer.parseInt(str.substring(0, spacearray[0]));
            for (int i = 1; i <= arraylength - 1; i++) {
                A[i] = Integer.parseInt(str.substring(spacearray[i - 1] + 1, spacearray[i]));
            }
            InsertionSort(arraylength - 1);
            System.out.println(swap);
        }
    }
}
Use binary indexed trees to solve this problem.
The bottleneck here is the insertion sort algorithm. Its time complexity is O(n^2), and with n up to 10^5 one can easily exceed 5 seconds on the InterviewStreet judge machine. Also, when a TLE signal is thrown, your program stops executing, so the slight overshoot past 5 isn't really an indicator of the actual running time; it is introduced by the delay between detecting the TLE and stopping the execution.
For the sake of history, this question originally appeared as part of codesprint-1. Using insertion sort isn't the way to proceed here; otherwise the question would have been a complete giveaway.
Hint
Use the fact that all values will be within [1, 10^6]. What you are really doing here is counting the number of inversions in the array A, i.e. finding all pairs i < j such that A[i] > A[j]. Think of a data structure that allows you to find the number of swaps needed for each insert operation in logarithmic time complexity (like a Binary Indexed Tree). Of course, there are other approaches.
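A minimal sketch of the hinted technique, my own illustration, assuming all values lie in [1, 10^6] as stated; the number of swaps insertion sort performs equals the number of inversions:

public class InversionCount {
    static final int MAX = 1000000;
    static int[] tree = new int[MAX + 1]; // 1-based Fenwick tree over the value range

    // record one occurrence of value i
    static void update(int i) {
        for (; i <= MAX; i += i & -i) tree[i]++;
    }

    // count of values <= i recorded so far
    static long query(int i) {
        long s = 0;
        for (; i > 0; i -= i & -i) s += tree[i];
        return s;
    }

    static long countInversions(int[] a) {
        java.util.Arrays.fill(tree, 0);
        long inversions = 0;
        for (int k = 0; k < a.length; k++) {
            inversions += k - query(a[k]); // elements seen so far that are strictly greater than a[k]
            update(a[k]);
        }
        return inversions;
    }

    public static void main(String[] args) {
        int[] a = {2, 4, 1, 3, 5};
        System.out.println(countInversions(a)); // prints 3
    }
}

Each of the n elements costs O(log MAX) for the query and the update, so the whole count is O(n log MAX) instead of insertion sort's O(n^2).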
