Merging two files line by line in Java

Is there a more efficient way than the one I'm currently using to merge two files line by line, appending each line from file2 onto the corresponding line of file1?
If file1 contains
a1
b1
c1
And file2 contains
a2
b2
c2
Then the output file should contain
a1,a2
b1,b2
c1,c2
The current combineRecords method looks like
private FileSheet combineRecords(ArrayList<FileSheet> toCombine) throws IOException
{
ArrayList<String> filepaths = new ArrayList<String>();
for (FileSheet sheetIterator : toCombine)
{
filepaths.add(sheetIterator.filepath);
}
String filepathAddition = "";
for (String s : filepaths)
{
filepathAddition = filepathAddition + s.split(".select.")[1].replace(".csv", "") + ".";
}
String outputFilepath = subsheetDirectory + fileHandle.getName().split(".csv")[0] + ".select." + filepathAddition + "csv";
Log.log("Output filepath " + outputFilepath);
long mainFileLength = toCombine.get(0).recordCount();
for (FileSheet f : toCombine)
{
int ordinal = toCombine.indexOf(f);
if (toCombine.get(ordinal).recordCount() != mainFileLength)
{
Log.log("Error : Record counts for 0 + " + ordinal);
return null;
}
}
FileSheet finalValues;
Log.log("Starting iteration streams");
BufferedWriter out = new BufferedWriter(new FileWriter(outputFilepath, false));
List<BufferedReader> streams = new ArrayList<>();
for (FileSheet j : toCombine)
{
streams.add(new BufferedReader(new FileReader(j.filepath)));
}
String finalWrite = "";
for (int i = 0; i < toCombine.get(0).recordCount(); i++)
{
for (FileSheet j : toCombine)
{
int ordinal = toCombine.indexOf(j);
finalWrite = finalWrite + streams.get(ordinal).readLine();
if (toCombine.indexOf(j) != toCombine.size() - 1)
{
finalWrite = finalWrite + ",";
}
else
{
finalWrite = finalWrite + "\n";
}
}
if (i % 1000 == 0 || i == toCombine.get(0).recordCount() - 1)
{
// out.write(finalWrite + "\n");
Files.write(Paths.get(outputFilepath),(finalWrite).getBytes(),StandardOpenOption.APPEND);
finalWrite = "";
}
}
out.close();
Log.log("Finished combineRecords");
finalValues = new FileSheet(outputFilepath,0);
return finalValues;
}
I've tried both BufferedWriter and Files.write, and they take a similar time to create file3, both in the 1:30 minute range, but I'm not sure whether the bottleneck is in reading or writing.
The sample files I'm using currently have 36,000 records, but the actual file I'll be using has ~650,000, so taking (if it scales linearly) 1625 seconds is completely unfeasible for this operation.
Edit: I've modified the code to open the files only once rather than per iteration; however, I'm now getting a "stream closed" error when skipping to the nth line.
I thought that streams.get(ordinal).skip(i).findFirst().get(); would return a new stream instead of skipping and then closing the stream.
Edit 2: Modified the code to use BufferedReaders instead of streams and to write to the file every 1000 lines read. That shows the bottleneck is reading, because it still takes ~1:30.

First of all, concatenating strings with the + operator is fine outside a loop. But when you build up a string inside a loop, you should use a StringBuilder for better performance.
The second thing you can improve is to write to the file once, at the end:
StringBuilder finalWrite = new StringBuilder();
for (int i = 0; i < toCombine.get(0).recordCount(); i++)
{
    for (FileSheet j : toCombine)
    {
        int ordinal = toCombine.indexOf(j);
        finalWrite.append(streams.get(ordinal).readLine());
        if (toCombine.indexOf(j) != toCombine.size() - 1)
        {
            finalWrite.append(",");
        }
        else
        {
            finalWrite.append("\n");
        }
    }
}
Files.write(Paths.get(outputFilepath), finalWrite.toString().getBytes());
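A streaming variant of the same idea might look like the sketch below. It is only an illustration, not the asker's FileSheet code: the class name LineMerger and the method mergeLines are hypothetical, and it assumes plain text inputs where the n-th line of each file belongs on the n-th output row. Each reader is opened once, one row is built at a time in a reused StringBuilder, and the row goes straight to a BufferedWriter, so the whole output never has to be held in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LineMerger {

    // Hypothetical helper: joins the n-th line of every input with commas
    // and writes one row per line. Stops as soon as any input runs out.
    // Returns the number of rows written.
    public static long mergeLines(List<Path> inputs, Path output) throws IOException {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("at least one input file is required");
        }
        List<BufferedReader> readers = new ArrayList<>();
        long written = 0;
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            // Open every input exactly once.
            for (Path p : inputs) {
                readers.add(Files.newBufferedReader(p, StandardCharsets.UTF_8));
            }
            StringBuilder row = new StringBuilder();
            while (true) {
                row.setLength(0); // reuse the same builder for every row
                boolean exhausted = false;
                for (int i = 0; i < readers.size(); i++) {
                    String line = readers.get(i).readLine();
                    if (line == null) {
                        exhausted = true;
                        break;
                    }
                    if (i > 0) {
                        row.append(',');
                    }
                    row.append(line);
                }
                if (exhausted) {
                    break;
                }
                out.write(row.toString());
                out.write('\n');
                written++;
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
        return written;
    }
}
```

Iterating by index also avoids calling indexOf on the list for every line. Whether this removes the bottleneck in the original program depends on what recordCount() and the rest of FileSheet do, which is not shown here.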


How can I scope three different conditions using the same loop in Java?

I would like to count countSign, countX and countI using the same loop instead of creating three different loops. Is there an easy way to approach that?
public class Absence {
private static File file = new File("/Users/naplo.txt");
private static File file_out = new File("/Users/naplo_out.txt");
private static BufferedReader br = null;
private static BufferedWriter bw = null;
public static void main(String[] args) throws IOException {
int countSign = 0;
int countX = 0;
int countI = 0;
String sign = "#";
String absenceX = "X";
String absenceI = "I";
try {
br = new BufferedReader(new FileReader(file));
bw = new BufferedWriter(new FileWriter(file_out));
String st;
while ((st = br.readLine()) != null) {
for (String element : st.split(" ")) {
if (element.matches(sign)) {
countSign++;
continue;
}
if (element.matches(absenceX)) {
countX++;
continue;
}
if (element.matches(absenceI)) {
countI++;
}
}
}
System.out.println("2. exerc.: There are " + countSign + " rows int the file with that sign.");
System.out.println("3. exerc.: There are " + countX + " with sick note, and " + countI + " without sick note!");
} catch (FileNotFoundException ex) {
Logger.getLogger(Absence.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
text file example:
# 03 26
Jujuba Ibolya IXXXXXX
Maracuja Kolos XXXXXXX
I think you meant using fewer than 3 if statements. You can actually do it with no ifs.
In your for loop write this:
countSign += (element.matches(sign)) ? 1 : 0;
countX += (element.matches(absenceX)) ? 1 : 0;
countI += (element.matches(absenceI)) ? 1 : 0;
Both answers check if the word (element) matches all regular expressions while this can (and should, if you ask me) be avoided since a word can match only one regex. I am referring to the continue part your original code has, which is good since you do not have to do any further checks.
So, here is one way to do it with Java 8 Streams as a "one-liner".
But let's assume the following regular expressions:
String absenceX = "X*";
String absenceI = "I.*";
and one more (for the sake of the example):
String onlyNumbers = "[0-9]*";
In order to have some matches on them.
The text is as you gave it.
public class Test {
    public static void main(String[] args) throws IOException {
        File desktop = new File(System.getProperty("user.home"), "Desktop");
        File txtFile = new File(desktop, "test.txt");
        String sign = "#";
        String absenceX = "X*";
        String absenceI = "I.*";
        String onlyNumbers = "[0-9]*";
        List<String> regexes = Arrays.asList(sign, absenceX, absenceI, onlyNumbers);
        List<String> lines = Files.readAllLines(txtFile.toPath());
        //@formatter:off
        Map<String, Long> result = lines.stream()
            .flatMap(line -> Stream.of(line.split(" "))) //map these lines to words
            .map(word -> regexes.stream().filter(word::matches).findFirst()) //find the first regex this word matches
            .filter(Optional::isPresent) //If it matches no regex, it will be ignored
            .collect(Collectors.groupingBy(Optional::get, Collectors.counting())); //collect
        System.out.println(result);
    }
}
The result:
{X*=1, #=1, I.*=2, [0-9]*=2}
X*=1 came from word: XXXXXXX
#=1 came from word: #
I.*=2 came from words: IXXXXXX and Ibolya
[0-9]*=2 came from words: 03 and 26
Ignore the fact I load all lines in memory.
So I got it working with the following lines. It had escaped my attention that every character needs to be separated from the others. Your ternary operator suggestion is also nice, so I will use it.
String myString;
while ((myString = br.readLine()) != null) {
    // Insert a space between every character so each one becomes its own token
    String newString = myString.replaceAll("", " ").trim();
    for (String element : newString.split(" ")) {
        countSign += (element.matches(sign)) ? 1 : 0;
        countX += (element.matches(absenceX)) ? 1 : 0;
        countI += (element.matches(absenceI)) ? 1 : 0;
    }
}
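For comparison, here is a sketch that does all three counts in a single pass without regular expressions. It rests on assumptions that are not stated in the thread: that a line starting with # is a date line, and that on every other line the last space-separated field holds the attendance codes. Under those assumptions, letters inside names (such as the I in Ibolya) are not counted. The class name AbsenceCounter and the method count are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class AbsenceCounter {

    // Hypothetical helper. Returns {signLines, xCount, iCount}.
    // Assumption: only the last field of a non-# line holds X/I codes.
    public static int[] count(List<String> lines) {
        int countSign = 0;
        int countX = 0;
        int countI = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.startsWith("#")) {
                countSign++;
                continue; // a date line carries no attendance codes
            }
            // Everything after the last space (or the whole line if there is none)
            String codes = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
            for (int k = 0; k < codes.length(); k++) {
                char c = codes.charAt(k);
                if (c == 'X') {
                    countX++;
                } else if (c == 'I') {
                    countI++;
                }
            }
        }
        return new int[] {countSign, countX, countI};
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "# 03 26",
                "Jujuba Ibolya IXXXXXX",
                "Maracuja Kolos XXXXXXX");
        System.out.println(Arrays.toString(count(sample)));
    }
}
```

The else-if chain plays the same role as the continue statements in the question: once a character has been classified, no further checks are made for it.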

Merging sorted Files using multithreading

Multithreading is new to me, so sorry for any mistakes.
I have written the program below, which merges files using multithreading, but I can't figure out how to handle the last file or how to merge the newly created files after one iteration.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
public class MergerSorter extends Thread {
int fileNumber = 1;
public static void main(String[] args) {
startMergingfiles(9);
}
public MergerSorter(int fileNum) {
fileNumber = fileNum;
}
public static void startMergingfiles(int numberOfFiles) {
int objectcounter = 0;
while (numberOfFiles != 1) {
try {
ArrayList<MergerSorter> objectList = new ArrayList<MergerSorter>();
for (int j = 1; j <= numberOfFiles; j = j + 2) {
if (numberOfFiles == j) {// Last Single remaining File
} else {
objectList.add(new MergerSorter(j));
objectList.get(objectcounter).start();
objectList.get(objectcounter).join();
objectcounter++;
}
}
objectcounter = 0;
numberOfFiles = numberOfFiles / 2;
} catch (Exception e) {
System.out.println(e);
}
}
}
public void run() {
try {
FileReader fileReader1 = new FileReader("src/externalsort/" + Integer.toString(fileNumber));
FileReader fileReader2 = new FileReader("src/externalsort/" + Integer.toString(fileNumber + 1));
BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
String line1 = bufferedReader1.readLine();
String line2 = bufferedReader2.readLine();
FileWriter tmpFile = new FileWriter("src/externalsort/" + Integer.toString(fileNumber) + "op.txt", false);
int whichFileToRead = 0;
boolean file_1_reader = true;
boolean file_2_reader = true;
while (file_1_reader || file_2_reader) {
if (file_1_reader == false) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (file_2_reader == false) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else {
String value1 = line1.substring(0, 10);
String value2 = line2.substring(0, 10);
int ans = value1.compareTo(value2);
if (ans < 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else if (ans > 0) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (ans == 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
}
}
if (whichFileToRead == 1) {
line1 = bufferedReader1.readLine();
if (line1 == null)
file_1_reader = false;
} else {
line2 = bufferedReader2.readLine();
if (line2 == null)
file_2_reader = false;
}
}
tmpFile.close();
bufferedReader1.close();
bufferedReader2.close();
fileReader1.close();
fileReader2.close();
} catch (Exception e) {
System.out.println(e);
}
}
}
I am trying to merge sorted files using multithreading. Say I have 50 files and I want to merge them all into one final sorted file. I want to speed this up and use every core through multithreading, but I am not able to do it. The files are big, so they can't be held on the heap/in RAM; I have to read from every file and keep writing.
You can do this with merge sort, but instead of lots of little sorted lists, you'll need to use lots of little sorted files. Once you have broken all of the files down into small sorted files, you can start merging them together again until you end up with a single sorted file.
Unfortunately, you likely won't be able to achieve high CPU utilisation, as much of the time will be spent waiting for disk I/O to complete.
Edit: just read your response to a comment and it sounds like you are asking for help on the last step of the merge sort. The graphics in the wiki link above will also help you understand. So, assuming all of your files are sorted, here we go:
1. Read 1 item from each file.
2. Figure out which item is the lowest/smallest/whatever and write that line to the result file.
3. Read a new item from the file which just provided the last item.
4. Repeat steps 2 and 3 until all files have been completely read.
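The steps above can be sketched with a priority queue that always holds the next unread line of each file. This is a single-threaded illustration under assumptions: the class name KWayMerge and the method merge are hypothetical, each input is already sorted, and whole lines are compared with String.compareTo (the question compares only the first 10 characters, which would need a different comparator).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // One pending line together with the reader it was read from.
    private static final class Entry implements Comparable<Entry> {
        final String line;
        final BufferedReader source;

        Entry(String line, BufferedReader source) {
            this.line = line;
            this.source = source;
        }

        @Override
        public int compareTo(Entry other) {
            return line.compareTo(other.line);
        }
    }

    // Hypothetical helper: merges already-sorted files into one sorted file.
    public static void merge(List<Path> sortedInputs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            // Step 1: read one item from each file.
            for (Path p : sortedInputs) {
                BufferedReader r = Files.newBufferedReader(p);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new Entry(first, r));
                }
            }
            // Steps 2 to 4: write the smallest pending line, then refill
            // from the file that supplied it, until every file is drained.
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                out.write(smallest.line);
                out.write('\n');
                String next = smallest.source.readLine();
                if (next != null) {
                    heap.add(new Entry(next, smallest.source));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }
}
```

Only one line per input file is kept in memory at any time, which fits the constraint that the files are too big for the heap. Merging all files in one pass also sidesteps the question of what to do with an odd file left over after pairing.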

write to separate columns in csv

I am trying to write 2 different arrays to a csv. The first one I want in the first column, and second array in the second column, like so:
array1val1 array2val1
array1val2 array2val2
I am using the following code:
String userHomeFolder2 = System.getProperty("user.home") + "/Desktop";
String csvFile = (userHomeFolder2 + "/" + fileName.getText() + ".csv");
FileWriter writer = new FileWriter(csvFile);
final String NEW_LINE_SEPARATOR = "\n";
FileWriter fileWriter;
CSVPrinter csvFilePrinter;
CSVFormat csvFileFormat = CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR);
fileWriter = new FileWriter(fileName.getText());
csvFilePrinter = new CSVPrinter(fileWriter, csvFileFormat);
try (PrintWriter pw = new PrintWriter(csvFile)) {
pw.printf("%s\n", FILE_HEADER);
for(int z = 0; z < compSource.size(); z+=1) {
//below forces the result to get stored in below variable as a String type
String newStr=compSource.get(z);
String newStr2 = compSource2.get(z);
newStr.replaceAll(" ", "");
newStr2.replaceAll(" ", "");
String[] explode = newStr.split(",");
String[] explode2 = newStr2.split(",");
pw.printf("%s\n", explode, explode2);
}
}
catch (Exception e) {
System.out.println("Error in csvFileWriter");
e.printStackTrace();
} finally {
try {
fileWriter.flush();
fileWriter.close();
csvFilePrinter.close();
} catch (IOException e ) {
System.out.println("Error while flushing/closing");
}
}
However I am getting a strange output into the csv file:
[Ljava.lang.String;@17183ab4
I can run
pw.printf("%s\n", explode);
pw.printf("%s\n", explode2);
Instead of : pw.printf("%s\n", explode, explode2);
and it prints the actual strings but all in one same column.
Does anyone know how to solve this?
1. Your explode and explode2 are String arrays. You are printing the arrays themselves and not their values, so what ends up in the file is the array's default toString() output (its type name and hash code), which looks like an address.
You should loop over the arrays and print their elements:
for (int i = 0; i < explode.length; ++i) {
    pw.printf("%s,%s\n", explode[i], explode2[i]);
}
2. Also, the format string needs one %s per argument, for example "%s,%s\n". In ("%s\n", explode, explode2) there is only one %s, so only the first argument is printed.
Try it out and let me know if it worked.
After these lines:
newStr.replaceAll(" ", "");
newStr2.replaceAll(" ", "");
String[] explode = newStr.split(",");
String[] explode2 = newStr2.split(",");
Use this code:
int maxLength = Math.max(explode.length, explode2.length);
for (int i = 0; i < maxLength; i++) {
    String token1 = (i < explode.length) ? explode[i] : "";
    String token2 = (i < explode2.length) ? explode2[i] : "";
    pw.printf("%s %s\n", token1, token2);
}
This also covers the case where the arrays have different lengths.
I have removed all unused variables and made some assumptions about the content of compSource.
Moreover, don't forget that String is immutable. If you just call newStr.replaceAll(" ", ""); without assigning the result, the replacement is lost.
public class Tester {
    @Test
    public void test() throws IOException {
        // I assumed compSource and compSource2 are like below
        List<String> compSource = Arrays.asList("array1val1,array1val2");
        List<String> compSource2 = Arrays.asList("array2val1,array2val2");
        String userHomeFolder2 = System.getProperty("user.home") + "/Desktop";
        String csvFile = (userHomeFolder2 + "/test.csv");
        try (PrintWriter pw = new PrintWriter(csvFile)) {
            pw.printf("%s\n", "val1,val2");
            for (int z = 0; z < compSource.size(); z++) {
                String newStr = compSource.get(z);
                String newStr2 = compSource2.get(z);
                // String is immutable --> store the result otherwise it will be lost
                newStr = newStr.replaceAll(" ", "");
                newStr2 = newStr2.replaceAll(" ", "");
                String[] explode = newStr.split(",");
                String[] explode2 = newStr2.split(",");
                for (int k = 0; k < explode.length; k++) {
                    pw.println(explode[k] + "\t" + explode2[k]);
                }
            }
        }
    }
}
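To put the two arrays into separate CSV columns, each row needs a comma between the two values. The sketch below shows one way to build such rows; the class name TwoColumnCsv and the method toRows are hypothetical, and it assumes the values themselves contain no commas or quotes (otherwise a CSV library such as the CSVPrinter from the question would be the safer choice).

```java
import java.util.ArrayList;
import java.util.List;

public class TwoColumnCsv {

    // Hypothetical helper: pairs the arrays up by index and returns one
    // "left,right" row per index. A missing value becomes an empty cell.
    public static List<String> toRows(String[] left, String[] right) {
        int rowCount = Math.max(left.length, right.length);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            String a = (i < left.length) ? left[i] : "";
            String b = (i < right.length) ? right[i] : "";
            rows.add(a + "," + b);
        }
        return rows;
    }

    public static void main(String[] args) {
        String raw1 = "array1val1, array1val2";
        String raw2 = "array2val1, array2val2";
        // String is immutable: the result of replaceAll must be assigned.
        String[] col1 = raw1.replaceAll(" ", "").split(",");
        String[] col2 = raw2.replaceAll(" ", "").split(",");
        for (String row : toRows(col1, col2)) {
            System.out.println(row);
        }
    }
}
```

Each returned string can be passed to pw.println as one line of the file.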

Make an array of words in alphabetical order in java, after reading them from a file

I've got the following code that opens and read a file and separates it to words.
My problem is at making an array of these words in alphabetical order.
import java.io.*;
class MyMain {
public static void main(String[] args) throws IOException {
File file = new File("C:\\Kennedy.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
String line = null;
int line_count=0;
int byte_count;
int total_byte_count=0;
int fromIndex;
while( (line = br.readLine())!= null ){
line_count++;
fromIndex=0;
String [] tokens = line.split(",\\s+|\\s*\\\"\\s*|\\s+|\\.\\s*|\\s*\\:\\s*");
String line_rest=line;
for (int i=1; i <= tokens.length; i++) {
byte_count = line_rest.indexOf(tokens[i-1]);
//if ( tokens[i-1].length() != 0)
//System.out.println("\n(line:" + line_count + ", word:" + i + ", start_byte:" + (total_byte_count + fromIndex) + "' word_length:" + tokens[i-1].length() + ") = " + tokens[i-1]);
fromIndex = fromIndex + byte_count + 1 + tokens[i-1].length();
if (fromIndex < line.length())
line_rest = line.substring(fromIndex);
}
total_byte_count += fromIndex;
}
}
}
I would read the File with a Scanner [1] (and I would prefer the File(String, String) constructor to provide the parent folder). Also, remember to close your resources explicitly in a finally block, or use a try-with-resources statement. Finally, for sorting you can store your words in a TreeSet, in which the elements are ordered using their natural ordering [2]. Something like,
File file = new File("C:/", "Kennedy.txt");
try (Scanner scanner = new Scanner(file)) {
    Set<String> words = new TreeSet<>();
    int line_count = 0;
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        line_count++;
        String[] tokens = line.split(",\\s+|\\s*\\\"\\s*|\\s+|\\.\\s*|\\s*\\:\\s*");
        Stream.of(tokens).forEach(word -> words.add(word));
    }
    System.out.printf("The file contains %d lines, and in alphabetical order [%s]%n",
            line_count, words);
} catch (Exception e) {
    e.printStackTrace();
}
[1] Mainly because it requires less code.
[2] Or by a Comparator provided at set creation time.
If you are storing the tokens in a String array, use Arrays.sort() to sort it in natural order. Since the elements are Strings, you get the tokens in lexicographic order.
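A minimal sketch of the Arrays.sort() approach follows. The class name SortWords and both method names are hypothetical, and the text is split on whitespace only rather than with the longer pattern from the question. Note that the natural ordering of String is case-sensitive, so capitalised words sort before lowercase ones; String.CASE_INSENSITIVE_ORDER gives a more dictionary-like result.

```java
import java.util.Arrays;

public class SortWords {

    // Natural (case-sensitive, lexicographic) order.
    public static String[] naturalOrder(String text) {
        String[] tokens = text.trim().split("\\s+");
        Arrays.sort(tokens);
        return tokens;
    }

    // Same words, but ignoring upper/lower case when comparing.
    public static String[] ignoringCase(String text) {
        String[] tokens = text.trim().split("\\s+");
        Arrays.sort(tokens, String.CASE_INSENSITIVE_ORDER);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "banana Cherry apple";
        System.out.println(Arrays.toString(naturalOrder(text)));
        System.out.println(Arrays.toString(ignoringCase(text)));
    }
}
```

Unlike a TreeSet, a sorted array keeps duplicate words.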

Resetting Java while loop

I currently have 2 loops: one that gets a timestamp, and another while loop that finds the mapped information based on that timestamp and outputs it in a certain way.
The issue is that I am looping through a text file and want the second loop to start reading the file from the beginning again when isdone="N"; however, this does not seem to happen.
Code so far:
public static void organiseFile() throws FileNotFoundException {
String directory = "C:\\Users\\xxx\\Desktop\\Files\\ex1";
Scanner fileIn = new Scanner(new File(directory + "_temp.txt"));
Scanner readIn = new Scanner(new File(directory + ".txt"));
PrintWriter out = new PrintWriter(directory + "_ordered.txt");
ArrayList<String> lines = new ArrayList<String>();
String readTimeStamp = "";
String timeStampMapping = "";
String outputFirst = "";
String outputSecond = "";
String outputThird = "";
String previousTimeStamp = "";
String doneList = "";
String isdone = "";
int counter = 1;
// Loop to get time stamps
while(fileIn.hasNextLine()) {
readTimeStamp = fileIn.nextLine();
if(readTimeStamp != null && readTimeStamp.trim().length() > 0) {
readTimeStamp = readTimeStamp.substring(12, 25);
System.out.println(readTimeStamp);
// Previous time stamp found, no need to loop through it again
if(doneList.contains(readTimeStamp))
isdone = "Y";
// Counter in place to stop outputting the first record, otherwise output file and clear variables down
else if(!previousTimeStamp.equals(readTimeStamp) && counter > 1) {
out.println(outputFirst + outputSecond + outputThird);
System.out.println("Outputting....");
outputFirst = "";
outputSecond = "";
outputThird = "";
counter = 1;
}
// New time stamp found, start finding values in second loop
else
isdone = "N";
// Secondary loop to find match of record
while(readIn.hasNextLine() && isdone.equals("N")) {
System.out.println("Mapping...");
timeStampMapping = readIn.nextLine();
System.out.println(timeStampMapping);
// When a record has been found with matching time stamps, start ordering
if(timeStampMapping.contains(readTimeStamp)) {
previousTimeStamp = readTimeStamp;
System.out.println(previousTimeStamp);
if(timeStampMapping.contains("[EVENT=agentStateEvent]")) {
outputFirst += timeStampMapping + "\r\n";
} else if(timeStampMapping.contains("[EVENT=TerminalConnectionCreated]")) {
outputSecond += timeStampMapping + "\r\n";
} else {
outputThird += timeStampMapping + "\r\n";
doneList += readTimeStamp + ",";
}
counter++;
}
}
}
}
System.out.println("Outputting final record");
out.println(outputFirst + outputSecond + outputThird);
System.out.println("Complete!");
out.close();
}
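Note that Scanner.reset() does not rewind to the beginning of the file; it only resets the scanner's delimiter, locale, and radix. A Scanner cannot be rewound, so to read the file from the start again you need to close it and open a new one. For example, just before your second while-loop:
if (isdone.equals("N")) {
    readIn.close();
    readIn = new Scanner(new File(directory + ".txt"));
}
By the way, why are you using a String for isdone instead of a boolean?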
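As far as the Scanner documentation goes, reset() only restores the delimiter, locale, and radix, so it cannot move back to the start of a file. One way to get a fresh pass for every timestamp is to open a new Scanner each time, as in the sketch below. The class name RescanFile and the method linesContaining are hypothetical and stand in for the inner loop of the question.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class RescanFile {

    // Hypothetical helper: scans the file from its first line on every call
    // and returns the lines that contain the given text.
    public static List<String> linesContaining(Path file, String needle) throws IOException {
        List<String> matches = new ArrayList<>();
        // A new Scanner always starts at the beginning of the file;
        // try-with-resources closes it when the pass is finished.
        try (Scanner in = new Scanner(file)) {
            while (in.hasNextLine()) {
                String line = in.nextLine();
                if (line.contains(needle)) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```

If the file fits in memory, reading it once into a List<String> and looping over that list for each timestamp avoids reopening the file at all.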
