Merging sorted files using multithreading - Java

Multithreading is new to me, so sorry for mistakes.
I have written the program below, which merges files using multithreading, but I cannot figure out how to handle the last (odd) file, or how to merge the newly created files after one iteration.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
public class MergerSorter extends Thread {
int fileNumber = 1;
public static void main(String[] args) {
startMergingfiles(9);
}
public MergerSorter(int fileNum) {
fileNumber = fileNum;
}
public static void startMergingfiles(int numberOfFiles) {
int objectcounter = 0;
while (numberOfFiles != 1) {
try {
ArrayList<MergerSorter> objectList = new ArrayList<MergerSorter>();
for (int j = 1; j <= numberOfFiles; j = j + 2) {
if (numberOfFiles == j) {// Last Single remaining File
} else {
objectList.add(new MergerSorter(j));
objectList.get(objectcounter).start();
objectList.get(objectcounter).join();
objectcounter++;
}
}
objectcounter = 0;
numberOfFiles = numberOfFiles / 2;
} catch (Exception e) {
System.out.println(e);
}
}
}
public void run() {
try {
FileReader fileReader1 = new FileReader("src/externalsort/" + Integer.toString(fileNumber));
FileReader fileReader2 = new FileReader("src/externalsort/" + Integer.toString(fileNumber + 1));
BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
String line1 = bufferedReader1.readLine();
String line2 = bufferedReader2.readLine();
FileWriter tmpFile = new FileWriter("src/externalsort/" + Integer.toString(fileNumber) + "op.txt", false);
int whichFileToRead = 0;
boolean file_1_reader = true;
boolean file_2_reader = true;
while (file_1_reader || file_2_reader) {
if (file_1_reader == false) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (file_2_reader == false) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else {
String value1 = line1.substring(0, 10);
String value2 = line2.substring(0, 10);
int ans = value1.compareTo(value2);
if (ans < 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else if (ans > 0) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (ans == 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
}
}
if (whichFileToRead == 1) {
line1 = bufferedReader1.readLine();
if (line1 == null)
file_1_reader = false;
} else {
line2 = bufferedReader2.readLine();
if (line2 == null)
file_2_reader = false;
}
}
tmpFile.close();
bufferedReader1.close();
bufferedReader2.close();
fileReader1.close();
fileReader2.close();
} catch (Exception e) {
System.out.println(e);
}
}
}
I am trying to merge sorted files with multithreading. Say I have 50 files and I want to merge all of them into one final sorted file, but I want to speed this up and utilize every core with multithreading, and I am not able to do it. The files are too big to fit in heap/RAM, so I have to read each file line by line and keep writing.

You can do this with merge sort, but instead of lots of little sorted lists, you'll need to use lots of little sorted files. Once you have broken all of the files down into small sorted files, you can start merging them together again until you end up with a single sorted file.
Unfortunately, you likely won't be able to achieve high CPU utilisation, as much of the time will be spent waiting for disk I/O to complete.
Edit: I just read your response to a comment, and it sounds like you are asking for help on the last step of the merge sort. The graphics in the wiki link above will also help you understand. So, assuming all of your files are sorted, here we go:
Read one item from each file
Figure out which is the lowest/smallest and write that line to the result file
Read a new item from the file which just provided the last item
Repeat steps 2 and 3 until all files have been completely read.
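A minimal sketch of those four steps in Java, using a PriorityQueue for the "figure out which is the lowest" step. The file names and the whole-line comparison are placeholder assumptions; your real code would compare the first 10 characters, as in your run() method:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
public class KWayMerge {
    // One entry per input file: the reader plus its current (smallest unwritten) line.
    static class Source {
        final BufferedReader reader;
        String line;
        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine(); // step 1: read one item from each file
        }
    }
    // Merges the given sorted files into one sorted output file in a single pass.
    public static void merge(List<String> inputPaths, String outputPath) throws IOException {
        PriorityQueue<Source> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        List<Source> sources = new ArrayList<>();
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outputPath))) {
            for (String path : inputPaths) {
                Source s = new Source(new BufferedReader(new FileReader(path)));
                sources.add(s);
                if (s.line != null) heap.add(s); // skip empty files
            }
            while (!heap.isEmpty()) {
                Source smallest = heap.poll(); // step 2: the smallest current line
                out.write(smallest.line);
                out.newLine();
                smallest.line = smallest.reader.readLine(); // step 3: refill from the same file
                if (smallest.line != null) heap.add(smallest); // step 4: repeat until drained
            }
        } finally {
            for (Source s : sources) s.reader.close();
        }
    }
}
One more note on your posted code: each thread is join()ed immediately after start(), so the pairwise merges still run one at a time. Submitting the pairs to a fixed-size thread pool (ExecutorService) and waiting for all of them afterwards lets the merges actually overlap, although disk I/O will usually remain the limit.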

Related

Why does the stream position go to the end

I have a CSV file. After I overwrite one line with the Write method, every subsequent write to the file is appended at the end of the file instead of going to the specific line.
using System.Collections;
using System.Collections.Generic;
using UnityEngine.UI;
using UnityEngine;
using System.Text;
using System.IO;
public class LoadQuestion : MonoBehaviour
{
int index;
string path;
FileStream file;
StreamReader reader;
StreamWriter writer;
public Text City;
public string[] allQuestion;
public string[] addedQuestion;
private void Start()
{
index = 0;
path = Application.dataPath + "/Files/Questions.csv";
allQuestion = File.ReadAllLines(path, Encoding.GetEncoding(1251));
file = new FileStream(path, FileMode.Open, FileAccess.ReadWrite);
writer = new StreamWriter(file, Encoding.GetEncoding(1251));
reader = new StreamReader(file, Encoding.GetEncoding(1251));
writer.AutoFlush = true;
List<string> _questions = new List<string>();
for (int i = 0; i < allQuestion.Length; i++)
{
char status = allQuestion[i][0];
if (status == '0')
{
_questions.Add(allQuestion[i]);
}
}
addedQuestion = _questions.ToArray();
City.text = ParseToCity(addedQuestion[0]);
}
private string ParseToCity(string current)
{
string _city = "";
string[] data = current.Split(';');
_city = data[2];
return _city;
}
private void OnApplicationQuit()
{
writer.Close();
reader.Close();
file.Close();
}
public void IKnow()
{
string[] quest = addedQuestion[index].Split(';');
int indexFromFile = int.Parse(quest[1]);
string questBeforeAnsver = "";
for (int i = 0; i < quest.Length; i++)
{
if (i == 0)
{
questBeforeAnsver += "1";
}
else
{
questBeforeAnsver += ";" + quest[i];
}
}
Debug.Log("indexFromFile : " + indexFromFile);
for (int i = 0; i < allQuestion.Length; i++)
{
if (i == indexFromFile)
{
writer.Write(questBeforeAnsver);
break;
}
else
{
reader.ReadLine();
}
}
reader.DiscardBufferedData();
reader.BaseStream.Seek(0, SeekOrigin.Begin);
if (index < addedQuestion.Length - 1)
{
index++;
}
City.text = ParseToCity(addedQuestion[index]);
}
}
The lines in the file look like this:
0;0;Africa
0;1;London
0;2;Paris
The point is that this is a game, and only questions whose status is 0, that is, unanswered, are loaded from the file. If during the game the user clicks that he knows the answer, the corresponding line in the file is overwritten, only the status is no longer 0 but 1, so when the game is replayed that question will not load.
What happens for me is that the first question is overwritten successfully, but all subsequent ones are simply appended at the end of the file:
1;0;Africa
0;1;London
0;2;Paris1;1;London1;2;Paris
What's wrong?
The video shows everything in detail

Merging two files line by line Java

Is there a more efficient way than I'm currently using to merge two files line by line, appending the line from file2 onto file1?
If file1 contains
a1
b1
c1
And file2 contains
a2
b2
c2
Then the output file should contain
a1,a2
b1,b2
c1,c2
The current combineRecords method looks like
private FileSheet combineRecords(ArrayList<FileSheet> toCombine) throws IOException
{
ArrayList<String> filepaths = new ArrayList<String>();
for (FileSheet sheetIterator : toCombine)
{
filepaths.add(sheetIterator.filepath);
}
String filepathAddition = "";
for (String s : filepaths)
{
filepathAddition = filepathAddition + s.split(".select.")[1].replace(".csv", "") + ".";
}
String outputFilepath = subsheetDirectory + fileHandle.getName().split(".csv")[0] + ".select." + filepathAddition + "csv";
Log.log("Output filepath " + outputFilepath);
long mainFileLength = toCombine.get(0).recordCount();
for (FileSheet f : toCombine)
{
int ordinal = toCombine.indexOf(f);
if (toCombine.get(ordinal).recordCount() != mainFileLength)
{
Log.log("Error : Record counts for 0 + " + ordinal);
return null;
}
}
FileSheet finalValues;
Log.log("Starting iteration streams");
BufferedWriter out = new BufferedWriter(new FileWriter(outputFilepath, false));
List<BufferedReader> streams = new ArrayList<>();
for (FileSheet j : toCombine)
{
streams.add(new BufferedReader(new FileReader(j.filepath)));
}
String finalWrite = "";
for (int i = 0; i < toCombine.get(0).recordCount(); i++)
{
for (FileSheet j : toCombine)
{
int ordinal = toCombine.indexOf(j);
finalWrite = finalWrite + streams.get(ordinal).readLine();
if (toCombine.indexOf(j) != toCombine.size() - 1)
{
finalWrite = finalWrite + ",";
}
else
{
finalWrite = finalWrite + "\n";
}
}
if (i % 1000 == 0 || i == toCombine.get(0).recordCount() - 1)
{
// out.write(finalWrite + "\n");
Files.write(Paths.get(outputFilepath),(finalWrite).getBytes(),StandardOpenOption.APPEND);
finalWrite = "";
}
}
out.close();
Log.log("Finished combineRecords");
finalValues = new FileSheet(outputFilepath,0);
return finalValues;
}
I've tried both BufferedWriter and Files.write, and they take similar times to create file3, both in the 1:30 minute range, but I'm not sure if the bottleneck is reading or writing.
The sample files I'm using are currently at 36,000 records, but the actual file I'll be using is ~650,000, so taking (if it scales linearly) 1625 seconds is completely unfeasible for this operation.
Edit: I've modified the code to only open the files once, rather than per iteration; however, I'm now getting a stream closed exception when skipping to the nth line.
I thought that doing streams.get(ordinal).skip(i).findFirst().get(); would return a new stream instead of skipping and then closing the stream.
Edit 2: Modified the code to use BufferedReaders instead of streams and to write to the file every 1000 lines read, and that's determined that the bottleneck is reading, because it still takes ~1:30.
First of all, concatenating strings with the + operator is fine when it is not in a loop. But when you build strings up in a loop, you should use StringBuilder for better performance.
The second thing you can improve: write to the file once at the end, like:
StringBuilder finalWrite = new StringBuilder();
for (int i = 0; i < toCombine.get(0).recordCount(); i++)
{
for (FileSheet j : toCombine)
{
int ordinal = toCombine.indexOf(j);
finalWrite.append(streams.get(ordinal).readLine());
if (toCombine.indexOf(j) != toCombine.size() - 1)
{
finalWrite.append(",");
}
else
{
finalWrite.append("\n");
}
}
}
Files.write(Paths.get(outputFilepath), finalWrite.toString().getBytes());
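If the combined output is too large to keep in one StringBuilder, a middle ground is a single BufferedWriter that batches the disk writes for you, instead of reopening the file with Files.write every 1000 rows. A sketch, reusing the streams list of BufferedReaders from the question:
try (BufferedWriter out = new BufferedWriter(new FileWriter(outputFilepath, false))) {
    for (int i = 0; i < toCombine.get(0).recordCount(); i++) {
        StringBuilder row = new StringBuilder();
        for (int ordinal = 0; ordinal < streams.size(); ordinal++) {
            row.append(streams.get(ordinal).readLine());
            row.append(ordinal != streams.size() - 1 ? "," : "\n");
        }
        out.write(row.toString()); // BufferedWriter flushes to disk in large chunks
    }
}
This keeps memory use flat regardless of record count and avoids the open/seek/close cost that Files.write(..., StandardOpenOption.APPEND) pays on every call.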

How to open a huge excel file efficiently

I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)
Is there any way to open the excel file quicker? I'm open to even very outlandish suggestions (such as hadoop, spark, c, java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds if that's not a pipe dream. Also, the above example is using python, but it doesn't have to be python.
Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file
UPDATE: Answer with a working example of code that will open the following 200MB excel file in under 30 seconds will be rewarded with bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11).
Most programming languages that work with Office products have some middle layer, and this is usually where the bottleneck is; a good example is using PIAs/Interop or the Open XML SDK.
One way to get the data at a lower level (bypassing the middle layer) is using a Driver.
150MB one-sheet excel file that takes about 7 minutes.
The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:
Stopwatch sw = new Stopwatch();
sw.Start();
DataSet excelDataSet = new DataSet();
string filePath = @"c:\temp\BigBook.xlsx";
// For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with "';Extended Properties=\"Excel 8.0;HDR=YES;\"";
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
conn.Open();
OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
("select * from [Sheet1$]", conn);
objDA.Fill(excelDataSet);
//dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);
Win 7 x64, Intel i5, 2.3GHz, 8GB RAM, 250GB SSD.
If I could recommend a hardware solution as well: try to resolve it with an SSD if you're using standard HDDs.
Note: I can't download your Excel spreadsheet example as I'm behind a corporate firewall.
PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being OleDB is the fastest.
PS 2. Here's how you can do it with python:
http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/
I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.
The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).
public List<List<object>> ParseSpreadSheet()
{
List<List<object>> rows = new List<List<object>>();
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();
int i = 0;
foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
{
sharedStringCache.Add(i++, el.InnerText);
}
while (reader.Read())
{
if(reader.ElementType == typeof(Row))
{
reader.ReadFirstChild();
List<object> cells = new List<object>();
do
{
if (reader.ElementType == typeof(Cell))
{
Cell c = (Cell)reader.LoadCurrentElement();
if (c == null || c.DataType == null || !c.DataType.HasValue)
continue;
object value;
switch(c.DataType.Value)
{
case CellValues.Boolean:
value = bool.Parse(c.CellValue.InnerText);
break;
case CellValues.Date:
value = DateTime.Parse(c.CellValue.InnerText);
break;
case CellValues.Number:
value = double.Parse(c.CellValue.InnerText);
break;
case CellValues.InlineString:
case CellValues.String:
value = c.CellValue.InnerText;
break;
case CellValues.SharedString:
value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
break;
default:
continue;
}
if (value != null)
cells.Add(value);
}
} while (reader.ReadNextSibling());
if (cells.Any())
rows.Add(cells);
}
}
}
return rows;
}
I ran the program on a three-year-old laptop with an SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64 bits.
Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit the time goes up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your server with Linux.
The trick was to use the SAX approach as explained here:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
Well, if your excel is going to be as simple as a CSV file like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip and read every XML directly:
Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.
If you have a lot of RAM:
this code needs a lot of it, but takes 20~25 seconds. (You need the JVM parameter -Xmx7g.)
package com.devsaki.opensimpleexcel;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;
public class Multithread {
public static final char CHAR_END = (char) -1;
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
ZipFile zipFile = new ZipFile(excelFile);
long init = System.currentTimeMillis();
ExecutorService executor = Executors.newFixedThreadPool(4);
char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));
Object[][] sheet = futureSheet1.get();
String[] words = futureWords.get();
executor.shutdown();
long end = System.currentTimeMillis();
System.out.println("only read: " + (end - init) / 1000);
// Doing something with the file: saving as CSV
init = System.currentTimeMillis();
try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
for (Object[] rows : sheet) {
for (Object cell : rows) {
if (cell != null) {
if (cell instanceof Integer) {
writer.append(words[(Integer) cell]);
} else if (cell instanceof String) {
writer.append(toDate(Double.parseDouble(cell.toString())));
} else {
writer.append(cell.toString()); //Probably a number
}
}
writer.append(";");
}
writer.append("\n");
}
}
end = System.currentTimeMillis();
System.out.println("Main saving to csv: " + (end - init) / 1000);
}
private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);
// Excel stores dates as days since 1900-01-01, so every date number you read has to be added to that base date
public static String toDate(double s) {
return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
}
public static String readEntry(ZipFile zipFile, String entry) throws IOException {
System.out.println("Initialing readEntry " + entry);
long init = System.currentTimeMillis();
String result = null;
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
br.readLine();
result = br.readLine();
}
long end = System.currentTimeMillis();
System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
return result;
}
public static String[] processSharedStrings(CharReader br) throws IOException {
System.out.println("Initialing processSharedStrings");
long init = System.currentTimeMillis();
String[] words = null;
char[] wordCount = "Count=\"".toCharArray();
char[] token = "<t>".toCharArray();
String uniqueCount = extractNextValue(br, wordCount, '"');
words = new String[Integer.parseInt(uniqueCount)];
String nextWord;
int currentIndex = 0;
while ((nextWord = extractNextValue(br, token, '<')) != null) {
words[currentIndex++] = nextWord;
br.skip(11); //you can skip at least 11 chars "/t></si><si>"
}
long end = System.currentTimeMillis();
System.out.println("SharedStrings: " + (end - init) / 1000);
return words;
}
public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
System.out.println("Initialing processSheet1");
long init = System.currentTimeMillis();
char[] dimensionToken = "dimension ref=\"".toCharArray();
String dimension = extractNextValue(br, dimensionToken, '"');
int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
br.skip(30); //Between dimension and next tag c exists more or less 30 chars
Object[][] result = new Object[sizes[0]][sizes[1]];
int parallelProcess = 8;
int currentIndex = br.currentIndex;
CharReader[] charReaders = new CharReader[parallelProcess];
int totalChars = Math.round(br.chars.length / parallelProcess);
for (int i = 0; i < parallelProcess; i++) {
int endIndex = currentIndex + totalChars;
charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
currentIndex = endIndex;
}
Future[] futures = new Future[parallelProcess];
for (int i = charReaders.length - 1; i >= 0; i--) {
final int j = i;
futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
}
for (Future future : futures) {
future.get();
}
long end = System.currentTimeMillis();
System.out.println("Sheet1: " + (end - init) / 1000);
return result;
}
public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
System.out.println("Initialing inParallelProcess : " + br.identifier);
char[] tokenOpenC = "<c r=\"".toCharArray();
char[] tokenOpenV = "<v>".toCharArray();
char[] tokenAttributS = " s=\"".toCharArray();
char[] tokenAttributT = " t=\"".toCharArray();
String v;
int firstCurrentIndex = br.currentIndex;
boolean first = true;
while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
if (first && back != null) {
int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
first = false;
System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
back.plusLength(sum);
}
int[] indexes = extractSizeFromDimention(v);
int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
char type = 's'; //3 types: number (n), string (s) and date (d)
if (s == 0) { // Token S = number or date
char read = br.read();
if (read == '1') {
type = 'n';
} else {
type = 'd';
}
} else if (s == -1) {
type = 'n';
}
String c = extractNextValue(br, tokenOpenV, '<');
Object value = null;
switch (type) {
case 'n':
value = Double.parseDouble(c);
break;
case 's':
try {
value = Integer.parseInt(c);
} catch (Exception ex) {
System.out.println("Identifier Error : " + br.identifier);
}
break;
case 'd':
value = c.toString();
break;
}
result[indexes[0] - 1][indexes[1] - 1] = value;
br.skip(7); ///v></c>
}
}
static class CharReader {
char[] chars;
int currentIndex;
int length;
int identifier;
public CharReader(char[] chars) {
this.chars = chars;
this.length = chars.length;
}
public CharReader(char[] chars, int currentIndex, int length, int identifier) {
this.chars = chars;
this.currentIndex = currentIndex;
if (length > chars.length) {
this.length = chars.length;
} else {
this.length = length;
}
this.identifier = identifier;
}
public void plusLength(int n) {
if (this.length + n <= chars.length) {
this.length += n;
}
}
public char read() {
if (currentIndex >= length) {
return CHAR_END;
}
return chars[currentIndex++];
}
public void skip(int n) {
currentIndex += n;
}
}
public static int[] extractSizeFromDimention(String dimention) {
StringBuilder sb = new StringBuilder();
int columns = 0;
int rows = 0;
for (char c : dimention.toCharArray()) {
if (columns == 0) {
if (Character.isDigit(c)) {
columns = convertExcelIndex(sb.toString());
sb = new StringBuilder();
}
}
sb.append(c);
}
rows = Integer.parseInt(sb.toString());
return new int[]{rows, columns};
}
public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
char character;
int[] indexes = new int[tokens.length];
while ((character = br.read()) != CHAR_END) {
if (character == until) {
break;
}
for (int i = 0; i < indexes.length; i++) {
if (tokens[i][indexes[i]] == character) {
indexes[i]++;
if (indexes[i] == tokens[i].length) {
return i;
}
} else {
indexes[i] = 0;
}
}
}
return -1;
}
public static String extractNextValue(CharReader br, char[] token, char until) {
char character;
StringBuilder sb = new StringBuilder();
int index = 0;
while ((character = br.read()) != CHAR_END) {
if (index == token.length) {
if (character == until) {
return sb.toString();
} else {
sb.append(character);
}
} else {
if (token[index] == character) {
index++;
} else {
index = 0;
}
}
}
return null;
}
public static int convertExcelIndex(String index) {
int result = 0;
for (char c : index.toCharArray()) {
result = result * 26 + ((int) c - (int) 'A' + 1);
}
return result;
}
}
Old answer (doesn't need the -Xmx7g parameter, so it takes less memory):
It takes about 35 seconds to open and read the example file (200MB) with an HDD; with an SSD it takes a little less (30 seconds).
Here is the code:
https://github.com/csaki/OpenSimpleExcelFast.git
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;
public class Launcher {
public static final char CHAR_END = (char) -1;
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
long init = System.currentTimeMillis();
String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
ZipFile zipFile = new ZipFile(excelFile);
ExecutorService executor = Executors.newFixedThreadPool(4);
Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
String[] words = futureWords.get();
Object[][] sheet1 = futureSheet1.get();
executor.shutdown();
long end = System.currentTimeMillis();
System.out.println("Main only open and read: " + (end - init) / 1000);
// Doing something with the file: saving as CSV
init = System.currentTimeMillis();
try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
for (Object[] rows : sheet1) {
for (Object cell : rows) {
if (cell != null) {
if (cell instanceof Integer) {
writer.append(words[(Integer) cell]);
} else if (cell instanceof String) {
writer.append(toDate(Double.parseDouble(cell.toString())));
} else {
writer.append(cell.toString()); //Probably a number
}
}
writer.append(";");
}
writer.append("\n");
}
}
end = System.currentTimeMillis();
System.out.println("Main saving to csv: " + (end - init) / 1000);
}
private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);
// Excel stores dates as days since 1900-01-01, so every date number you read has to be added to that base date
public static String toDate(double s) {
return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
}
public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
String entry = "xl/worksheets/sheet1.xml";
Object[][] result = null;
char[] dimensionToken = "dimension ref=\"".toCharArray();
char[] tokenOpenC = "<c r=\"".toCharArray();
char[] tokenOpenV = "<v>".toCharArray();
char[] tokenAttributS = " s=\"".toCharArray();
char[] tokenAttributT = " t=\"".toCharArray();
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
String dimension = extractNextValue(br, dimensionToken, '"');
int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
br.skip(30); //Between dimension and next tag c exists more or less 30 chars
result = new Object[sizes[0]][sizes[1]];
String v;
while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
int[] indexes = extractSizeFromDimention(v);
int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
char type = 's'; //3 types: number (n), string (s) and date (d)
if (s == 0) { // Token S = number or date
char read = (char) br.read();
if (read == '1') {
type = 'n';
} else {
type = 'd';
}
} else if (s == -1) {
type = 'n';
}
String c = extractNextValue(br, tokenOpenV, '<');
Object value = null;
switch (type) {
case 'n':
value = Double.parseDouble(c);
break;
case 's':
value = Integer.parseInt(c);
break;
case 'd':
value = c.toString();
break;
}
result[indexes[0] - 1][indexes[1] - 1] = value;
br.skip(7); ///v></c>
}
}
return result;
}
public static int[] extractSizeFromDimention(String dimention) {
StringBuilder sb = new StringBuilder();
int columns = 0;
int rows = 0;
for (char c : dimention.toCharArray()) {
if (columns == 0) {
if (Character.isDigit(c)) {
columns = convertExcelIndex(sb.toString());
sb = new StringBuilder();
}
}
sb.append(c);
}
rows = Integer.parseInt(sb.toString());
return new int[]{rows, columns};
}
public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
String entry = "xl/sharedStrings.xml";
String[] words = null;
char[] wordCount = "Count=\"".toCharArray();
char[] token = "<t>".toCharArray();
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
String uniqueCount = extractNextValue(br, wordCount, '"');
words = new String[Integer.parseInt(uniqueCount)];
String nextWord;
int currentIndex = 0;
while ((nextWord = extractNextValue(br, token, '<')) != null) {
words[currentIndex++] = nextWord;
br.skip(11); //you can skip at least 11 chars "/t></si><si>"
}
}
return words;
}
public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
char character;
int[] indexes = new int[tokens.length];
while ((character = (char) br.read()) != CHAR_END) {
if (character == until) {
break;
}
for (int i = 0; i < indexes.length; i++) {
if (tokens[i][indexes[i]] == character) {
indexes[i]++;
if (indexes[i] == tokens[i].length) {
return i;
}
} else {
indexes[i] = 0;
}
}
}
return -1;
}
public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
char character;
StringBuilder sb = new StringBuilder();
int index = 0;
while ((character = (char) br.read()) != CHAR_END) {
if (index == token.length) {
if (character == until) {
return sb.toString();
} else {
sb.append(character);
}
} else {
if (token[index] == character) {
index++;
} else {
index = 0;
}
}
}
return null;
}
public static int convertExcelIndex(String index) {
int result = 0;
for (char c : index.toCharArray()) {
result = result * 26 + ((int) c - (int) 'A' + 1);
}
return result;
}
}
Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().
One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:
import win32com.client as win32
import pandas as pd
from datetime import datetime
print ("Starting")
start = datetime.now()
# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False
print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2)
excel.Application.Quit()
# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)
print(df.shape)
print("Done", datetime.now() - start)
Column types
The types for your columns can be specified by passing dtype and converters and parse_dates:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)
You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be
inferred, switch to a faster method of parsing them. In some cases
this can increase the parsing speed by 5-10x.
Also add dayfirst=True if dates are in the form DD/MM/YYYY.
Selective columns
If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])
The resulting dataframe would then only contain those 3 columns of data.
RAM drive
Using a RAM drive to store the temporary CSV file to would further speed up the load time.
Note: This does assume you are using a Windows PC with Excel available.
I have created a sample Java program which is able to load the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).
https://github.com/skadyan/largefile
This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.
An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from the excel. This interface defines only one method, which takes three arguments:
sheetname: String, the excel sheet name
row number: int, the row number of the data
data map: Map of excel cell reference to excel-formatted cell value
The class com.stackoverlfow.largefile.Main demonstrates one basic implementation of this interface, which just prints the row number to the console.
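Based on that description, the callback would look roughly like this (a sketch inferred from the text above; the exact signature lives in the linked repository):
import java.util.Map;
public interface RecordHandler {
    // Called once per row as the parser streams through the sheet.
    // sheetName: the excel sheet name
    // rowNumber: the row number of the data
    // data: map of excel cell reference to excel-formatted cell value
    void handle(String sheetName, int rowNumber, Map<String, String> data);
}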
Update
The Woodstox parser seems to have better performance than the standard SAXReader (code updated in the repo).
Also, in order to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. In that implementation, more optimized string/text value handling can be added and unnecessary text formatting operations may be skipped.
I'm using a Dell Precision T1700 workstation, and using C# I was able to open the file and read its contents in about 24 seconds, just using standard code to open a workbook via interop services. Using a reference to the Microsoft Excel 15.0 Object Library, here is my code.
My using statements:
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;
Code to open and read workbook:
public partial class MainWindow : Window {
public MainWindow() {
InitializeComponent();
Excel.Application xlApp;
Excel.Workbook wb;
Excel.Worksheet ws;
xlApp = new Excel.Application();
xlApp.Visible = false;
xlApp.ScreenUpdating = false;
wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");
ws = wb.Sheets["Sheet1"];
//string rng = ws.get_Range("A1").Value;
MessageBox.Show(ws.get_Range("A1").Value);
Marshal.FinalReleaseComObject(ws);
wb.Close();
Marshal.FinalReleaseComObject(wb);
xlApp.Quit();
Marshal.FinalReleaseComObject(xlApp);
GC.Collect();
GC.WaitForPendingFinalizers();
}
}
Looks like it is hardly achievable in Python at all. If we unpack the sheet data file, then it takes all of the required 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):
from __future__ import print_function
from lxml import etree
import time
start_ts = time.time()
for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',),
collect_ids=False, resolve_entities=False,
huge_tree=True):
pass
print(time.time() - start_ts)
The sample output: 27.2134890556
By the way, Excel itself needs about 40 seconds to load the workbook.
The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.
_bstr_t connStr(makeConnStr(excelFile, header).c_str());
TESTHR(pRec.CreateInstance(__uuidof(Recordset)));
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));
while(!pRec->adoEOF)
{
for(long i = 0; i < pRec->Fields->GetCount(); ++i)
{
_variant_t v = pRec->Fields->GetItem(i)->Value;
if(v.vt == VT_R8)
num[i] = v.dblVal;
if(v.vt == VT_BSTR)
str[i] = v.bstrVal;
++cellCount;
}
pRec->MoveNext();
}
On an i5-4460 and HDD machine, I find that 500 thousand cells in xls take 1.5s, but the same data in xlsx takes 2.829s, so it's possible to manipulate your data in under 30s.
If you really need under 30s, use a RAM drive to reduce file IO. It will significantly improve your process.
I cannot download your data to test it, so please tell me the result.
Another way that should largely improve the load/operation time is a RAMDrive:
create a RAMDrive with enough space for your file and 10%..20% extra space...
copy the file to the RAMDrive...
load the file from there... depending on your drive and filesystem,
the speed improvement should be huge...
My favourite is the IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/)
here you have a powerful command line to script everything...
I also recommend SoftPerfect RAM Disk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)
but that also depends on your OS...
I would like to have more info about the system where you
are opening the file... anyway:
look in your system for a Windows update called
"Office File Validation Add-In for Office ..."
if you have it... uninstall it...
the file should load much more quickly,
especially if it is loaded from a share
Have you tried loading the worksheet on demand, which has been available since version 0.7.1 of xlrd?
To do this you need to pass on_demand=True to open_workbook().
xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper
name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1,
file_contents=None, encoding_override=None, formatting_info=False,
on_demand=False, ragged_rows=False)
Other potential python solutions I found for reading an xlsx file:
Read the raw xml in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
Try the openpyxl library's Read Only mode, which claims to be optimized in memory usage for large files.
from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']
for row in ws.rows:
    for cell in row:
        print(cell.value)
If you are running on Windows you could use PyWin32 and 'Excel.Application'
import time
import win32com.client as win32
def excel():
xl = win32.gencache.EnsureDispatch('Excel.Application')
ss = xl.Workbooks.Add()
...

How to link a main class to a JFrame form in Java using NetBeans

Good day!
I have created a program using NetBeans and it executes the processes just fine.
Now, I want my input to be given and my output to be displayed through a user interface. I have therefore created 2 JFrames, one to collect the user's input and the other to display the results after execution by the code.
But I am unable to link the interface to the main class (called NgramBetaE), as I am not aware of how I can do so.
I highly welcome suggestions.
The main class in its entirety is;
package ngrambetae;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;
/**
*
* @author 201102144
*/
public class NgramBetaE {
static LinkedList<String> allWords = new LinkedList<String>();
static LinkedList<String> distinctWords = new LinkedList<String>();
static String[] hashmapWord = null;
static int wordCount;
public static HashMap<String,HashMap<String, Integer>> hashmap = new HashMap<>();
public static HashMap<String,HashMap<String, Integer>> bigramMap = new HashMap<>();
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
//prompt user input
Scanner input = new Scanner(System.in);
//read words from collected corpus; a number of .txt files
File directory = new File("Corpus");
File[] listOfFiles = directory.listFiles();//To read from all listed files in the "directory"
int lineNumber = 0;
String line;
String files;
String delimiters = "[()?!:;,.\\s]+";
//reading from a list of text files
for (File file : listOfFiles) {
if (file.isFile()) {
files = file.getName();
try {
if (files.endsWith(".txt") || files.endsWith(".TXT")) { //ensures a file being read is a text file
BufferedReader br = new BufferedReader(new FileReader(file));
while ((line = br.readLine()) != null) {
line = line.toLowerCase();
hashmapWord = line.split(delimiters);
//CALCULATING UNIGRAMS
for(int s = 0; s < hashmapWord.length; s++){
String read = hashmapWord[s];
allWords.add(read);
//count the total number of words in all the text files combined
//TEST
wordCount = 0;
for (int i = 0; i < allWords.size(); i++){
wordCount ++;
}
}
//CALCULATING BIGRAM FREQUENCIES
for(int s = 0; s < hashmapWord.length -1; s++){
String read = hashmapWord[s];
final String read1 = hashmapWord[s + 1];
HashMap<String, Integer> counter = bigramMap.get(read);
if (null == counter) {
counter = new HashMap<String, Integer>();
bigramMap.put(read, counter);
}
Integer count = counter.get(read1);
counter.put(read1, count == null ? 1 : count + 1);
}
//CALCULATING TRIGRAM FREQUENCIES
for(int s = 0; s < hashmapWord.length - 2; s++){
String read = hashmapWord[s];
String read1 = hashmapWord[s + 1];
final String read2 = hashmapWord[s + 2];
String readTrigrams = read + " " + read1;
HashMap<String, Integer> counter = hashmap.get(readTrigrams);
if (null == counter) {
counter = new HashMap<String, Integer>();
hashmap.put(readTrigrams, counter);
}
Integer count = counter.get(read2);
counter.put(read2, count == null ? 1 : count + 1);
}
}
br.close();
}
} catch (NullPointerException | IOException e) {
e.printStackTrace();
System.out.println("Unable to read files: " + e);
}
}
}
//COMPUTING THE TOTAL NUMBER OF WORDS FROM ALL THE TEXT FILES COMBINED
System.out.println("THE TOTAL NUMBER OF WORDS IN COLLECTED CORPUS IS : \t" + wordCount + "\n");
for(int i = 0, size = allWords.size(); i < size; i++){
String distinctWord = allWords.get(i);
//adding a word into the 'distinctWords' list if it doesn't already occur
if(!distinctWords.contains(distinctWord)){
distinctWords.add(distinctWord);
}
}
//PRINTING THE DISTINCT WORDS
System.out.println("THE DISTINCT WORDS IN TOTAL ARE :\t " + distinctWords.size() + "\n");
System.out.println("PRINTING CONTENTS OF THE BIGRAMS HASHMAP... ");
System.out.println(bigramMap);
System.out.println("================================================================================================================================================================================================================================================================================================================\n");
System.out.println("PRINTING CONTENTS OF THE TRIGRAMS HASHMAP... ");
System.out.println(hashmap);
System.out.println("================================================================================================================================================================================================================================================================================================================\n");
//QUITTING APPLICATION
String userInput = null;
while(true) {
System.out.println("\n**********************************************************************************************************************************************************************************************************************************");
System.out.println("\n\n\t\tPLEASE ENTER A WORD OR PHRASE YOU WOULD LIKE A PREDICTION OF THE NEXT WORD FROM:");
System.out.println("\t\t\t\t(OR TYPE IN 'Q' OR 'q' TO QUIT)");
userInput = input.nextLine();
if (userInput.equalsIgnoreCase("Q")) break;
//FORMAT USER INPUT
String[] users = userInput.toLowerCase().split("[?!,.\\s]+");
if (users.length < 2) {
userInput = users[0];
//System.out.println("\nENTRY '" + userInput + "' IS TOO SHORT TO PREDICT NEXT WORD. PLEASE ENTER 2 OR MORE WORDS");
//CALCULATING BIGRAM PROBABILITY
int sum = 0;
try {
for(String s : bigramMap.get(userInput).keySet()) {
sum += bigramMap.get(userInput).get(s);
}
String stringHolder = null;
double numHolder = 0.0;
for(String s : bigramMap.get(userInput).keySet()) {
//System.out.println("TWO");
double x = Math.round(bigramMap.get(userInput).put(s, bigramMap.get(userInput).get(s))/ (double)sum *100 );
if(s != null){
if(numHolder < x ){
stringHolder = s;
numHolder = x;
}
}
}
System.out.println("\nNEXT WORD PREDICTED IS '" + stringHolder + "'");
System.out.println("ITS PROBABILITY OF OCCURRENCE IS " + numHolder + "%");
} catch (Exception NullPointerException) {
System.out.println("\nSORRY. MATCH NOT FOUND.");
}
} else {
userInput = users[users.length - 2] + " " + users[users.length - 1];
// System.out.println("FROM USER WE GET....");
// System.out.println(bigrams.get(userInput).keySet());
/* CALCULATING TRIGRAM PROBABILITY*/
int sum = 0;
try {
for(String s : hashmap.get(userInput).keySet()) {
sum += hashmap.get(userInput).get(s);
}
String stringHolder = null;
double numHolder = 0.0;
for(String s : hashmap.get(userInput).keySet()) {
//System.out.println("TWO");
double x = Math.round(hashmap.get(userInput).put(s, hashmap.get(userInput).get(s))/ (double)sum *100 );
if(s != null){
if(numHolder < x ){
stringHolder = s;
numHolder = x;
}
}
}
System.out.println("\nNEXT WORD PREDICTED IS '" + stringHolder + "'");
System.out.println("ITS PROBABILITY OF OCCURRENCE IS " + numHolder + "%");
} catch (Exception NullPointerException) {
System.out.println("\nSORRY. MATCH NOT FOUND.");
}
}
}
input.close();
}
}
My first JFrame, which I would like to appear upon running the project, has a single textbox and a single button;
private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {
String usersInput = jTextField1.getText();
Interface1 s = new Interface1();
s.setVisible(true);
dispose();
}
I would like the user to enter data in the textbox, and when they click on the button 'predict next word', the output from the code execution should be displayed on the second JFrame, which has 3 labels and corresponding text areas.
NOTE: I couldn't paste the screenshots, but if you run the NgramBetaE class you will get an idea of how the interfaces would look, as I tried to explain them.
Thank you
Don't even try to link your GUI code to your NgramBetaE code as it stands; you have more work to do, since NgramBetaE is little more than one huge static main method that gets user input from the console with a Scanner and outputs to the console via printlns. Melding these two is like trying to put a square peg into a round hole.
Instead re-write the whole thing with an eye towards object-oriented coding, including creation of an OOP-compliant model class with instance fields and methods, and a single GUI that gets the input and displays it, that holds an instance of the model class and that calls instance methods on this instance.
Consider creating non-GUI classes and methods for --
Reading in data from your text files
Analyzing and hashing the data held in the text files including calculating word frequencies etc...
Returning the needed data after analysis, in whatever form it may be needed
A method allowing input of a String/phrase for testing, which returns its predicted probability
Then create GUI code for:
Getting selected text file from the user. A JFileChooser and supporting code works well here.
Button to start analysis
JTextField to allow entering of phrase
JTextArea or perhaps JTable to display results of analysis
Note that you should avoid having more than one JFrame in your GUI. For more on this, please have a look at The Use of Multiple JFrames, Good/Bad Practice?
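To make that shape concrete, here is a minimal sketch; all names are illustrative, not taken from your code. A plain model class owns the n-gram counts and exposes instance methods, and a single JFrame holds an instance of it:
import java.util.HashMap;
import java.util.Map;
import javax.swing.*;
// Non-GUI model: owns the n-gram counts, knows nothing about Swing.
class NgramModel {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    void addBigram(String word, String next) {
        bigramCounts.computeIfAbsent(word, k -> new HashMap<>()).merge(next, 1, Integer::sum);
    }
    // Returns the most frequent follower of the word, or null if unseen.
    String predictNextWord(String word) {
        Map<String, Integer> followers = bigramCounts.get(word.toLowerCase());
        if (followers == null) return null;
        return followers.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey).orElse(null);
    }
}
// Single GUI frame that holds the model and calls instance methods on it.
public class PredictorFrame extends JFrame {
    public PredictorFrame(NgramModel model) {
        super("Next-word prediction");
        JTextField input = new JTextField(20);
        JLabel result = new JLabel(" ");
        JButton predict = new JButton("Predict next word");
        predict.addActionListener(e -> {
            String word = model.predictNextWord(input.getText().trim());
            result.setText(word == null ? "No match found" : word);
        });
        JPanel panel = new JPanel();
        panel.add(input);
        panel.add(predict);
        panel.add(result);
        add(panel);
        pack();
        setDefaultCloseOperation(EXIT_ON_CLOSE);
    }
    public static void main(String[] args) {
        NgramModel model = new NgramModel();
        model.addBigram("hello", "world"); // training would really come from the corpus files
        SwingUtilities.invokeLater(() -> new PredictorFrame(model).setVisible(true));
    }
}
The GUI never touches the maps directly; it only calls predictNextWord, which is what makes the model testable without any window on screen.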

Java read text files, count numbers and write them to a JTable [closed]

I am still learning Java and have been trying to find a solution for my program for a few days, but I haven't gotten it fixed yet.
I have many text files (my program saves). The files look like this:
text (tab) number (tab) number (tab)...
text (tab) number (tab) number (tab)...
(tab) means that there is tabulation mark,
text means that is text (string),
number means that there is number (integer).
The number of files can be from 1 up to 32, with names like january1, january2, january3...
I need to read all of those files (ignoring the strings) and sum only the numbers, like so:
while ((line = br.readLine()) != null) {
counter=counter+1;
String[] info = line.split("\\s+");
for(int j = 2; j < 8; j++) {
int num = Integer.parseInt(info[j]);
data[j][counter]=data[j][counter]+num;
}
};
Simply put, I want to sum all those "tables" into an array of arrays (or any similar kind of variable) and then display it as a table. If someone knows any solution or can link any similar calculation, that would be awesome!
So, as I see it, you have four questions you need answered. This goes against the site etiquette of asking a single question, but I will give it a shot:
How to list a series of files, presumably using some kind of filter
How to read a file and process the data in some meaningful way
How to manage the data in data structure
Show the data in a JTable.
Listing files
Probably the simplest way to list files is to use File#listFiles and pass a FileFilter which meets your needs
File[] files = new File(".").listFiles(new FileFilter() {
@Override
public boolean accept(File pathname) {
return pathname.getName().toLowerCase().startsWith("january");
}
});
Now, I'd write a method which took a File object representing the directory you want to list and a FileFilter to use to search it...
public File[] listFiles(File dir, FileFilter filter) throws IOException {
if (dir.exists()) {
if (dir.isDirectory()) {
return dir.listFiles(filter);
} else {
throw new IOException(dir + " is not a valid directory");
}
} else {
throw new IOException(dir + " does not exist");
}
}
This way you could search for a number of different set of files based on different FileFilters.
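For example, assuming the method above is in scope, and with the usual try/catch for the IOException it declares (java.io.FileFilter is a functional interface, so a lambda works; "data" is a placeholder directory):
File[] januaryFiles = listFiles(new File("data"),
        pathname -> pathname.getName().toLowerCase().startsWith("january"));
Each different filter then reuses the same existence and directory checks.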
Of course, you could also use the newer Paths/Files API to find files as well
Reading files...
Reading multiple files comes down to the same thing, reading a single file...
// BufferedReader has a nice readline method which makes
// it easier to read text with. You could use a Scanner
// but I prefer BufferedReader, but that's me...
try (BufferedReader br = new BufferedReader(new FileReader(new File("...")))) {
String line = null;
// Read each line
while ((line = br.readLine()) != null) {
// Split the line into individual parts, on the <tab> character
String parts[] = line.split("\t");
int sum = 0;
// Starting from the first number, sum the line...
for (int index = 1; index < parts.length; index++) {
sum += Integer.parseInt(parts[index].trim());
}
// Store the key/value pairs together somehow
}
}
Now, we need some way to store the results of the calculations...
Have a look at Basic I/O for more details
Managing the data
Now, there are any number of ways you could do this, but since the amount of data is variable, you want a data structure that can grow dynamically.
My first thought would be to use a Map, but this assumes you want to combine rows with the same name; otherwise you should just use a List within a List, where the outer List represents the rows and the inner List represents the column values...
Map<String, Integer> data = new HashMap<>(25);
File[] files = listFiles(someDir, januaryFilter);
for (File file : files) {
readData(file, data);
}
Where readData is basically the code from before
protected void readData(File file, Map<String, Integer> data) throws IOException {
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
String line = null;
// Read each line
while ((line = br.readLine()) != null) {
//...
// Store the key/value pairs together somehow
String name = parts[0];
if (data.containsKey(name)) {
int previous = data.get(name);
sum += previous;
}
data.put(name, sum);
}
}
}
Have a look at the Collections Trail for more details
Showing the data
And finally, we need to show the data. You could simply use a DefaultTableModel, but you already have the data in a structure, so why not re-use it with a custom TableModel?
public class SummaryTableModel extends AbstractTableModel {
private Map<String, Integer> data;
private List<String> keyMap;
public SummaryTableModel(Map<String, Integer> data) {
this.data = new HashMap<>(data);
keyMap = new ArrayList<>(data.keySet());
}
@Override
public int getRowCount() {
return data.size();
}
@Override
public int getColumnCount() {
return 2;
}
@Override
public Class<?> getColumnClass(int columnIndex) {
Class type = Object.class;
switch (columnIndex) {
case 0:
type = String.class;
break;
case 1:
type = Integer.class;
break;
}
return type;
}
@Override
public Object getValueAt(int rowIndex, int columnIndex) {
Object value = null;
switch (columnIndex) {
case 0:
value = keyMap.get(rowIndex);
break;
case 1:
String key = keyMap.get(rowIndex);
value = data.get(key);
break;
}
return value;
}
}
Then you would simply apply it to a JTable...
add(new JScrollPane(new JTable(new SummaryTableModel(data))));
Take a look at How to Use Tables for more details
Conclusion
There are a lot of assumptions that have to be made which are missing from the context of the question: does the order of the files matter? Do you care about duplicate entries?
So it becomes near impossible to provide a single "answer" which will solve all of your problems.
I took all the january1 january2... files from the location and used your same function to calculate the value to be stored.
Then I created a table with two headers, Day and Number. Then just added rows according to the values generated.
DefaultTableModel model = new DefaultTableModel();
JTable table = new JTable(model);
String line;
model.addColumn("Day");
model.addColumn("Number");
BufferedReader br = null;
model.addRow(new Object[]{"a","b"});
for(int i = 1; i < 32; i++)
{
try {
String sCurrentLine;
String filename = "january"+i;
br = new BufferedReader(new FileReader("C:\\january"+i+".txt"));
int counter = 0;
while ((sCurrentLine = br.readLine()) != null) {
counter=counter+1;
String[] info = sCurrentLine.split("\\s+");
int sum = 0;
for(int j = 2; j < 8; j++) {
int num = Integer.parseInt(info[j]);
sum += num;
}
model.addRow(new Object[]{filename, sum+""});
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null)br.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
JFrame f = new JFrame();
f.setSize(300, 300);
f.add(new JScrollPane(table));
f.setVisible(true);
Use a labeled loop and try-catch. The piece below adds all the numbers in a line.
You could get some hints from here:
String line = "text 1 2 3 4 del";
String splitLine[] = line.split("\t");
int sumLine = 0;
int i = 0;
contSum: for (; i < splitLine.length; i++) {
try {
sumLine += Integer.parseInt(splitLine[i]);
} catch (Exception e) {
continue contSum;
}
}
System.out.println(sumLine);
Here is another example, using Vectors. In this example, directories will be searched for ".txt" files, which are added to the JTable.
The doIt method takes in the folder where your text files are located.
It will then, using recursion, look for files in folders.
Each file found will be split and summed, following your example file format.
public class FileFolderReader
{
private Vector<Vector> rows = new Vector<Vector>();
public static void main(String[] args)
{
FileFolderReader fileFolderReader = new FileFolderReader();
fileFolderReader.doIt("D:\\folderoffiles");
}
private void doIt(String path)
{
System.out.println(findFile(new File(path)) + " in total");
JFrame frame = new JFrame();
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
Vector<String> columnNames = new Vector<String>();
columnNames.addElement("File Name");
columnNames.addElement("Size");
JTable table = new JTable(rows, columnNames);
JScrollPane scrollPane = new JScrollPane(table);
frame.add(scrollPane, BorderLayout.CENTER);
frame.setSize(300, 150);
frame.setVisible(true);
}
private int findFile(File file)
{
int totalPerFile = 0;
int total = 0;
File[] list = file.listFiles(new FilenameFilter()
{
public boolean accept(File dir, String fileName)
{
return fileName.endsWith(".txt");
}
});
if (list != null)
for (File textFile : list)
{
if (textFile.isDirectory())
{
total = findFile(textFile);
}
else
{
totalPerFile = scanFile(textFile);
System.out.println(totalPerFile + " in " + textFile.getName());
Vector<String> rowItem = new Vector<String>();
rowItem.addElement(textFile.getName());
rowItem.addElement(Integer.toString(totalPerFile));
rows.addElement(rowItem);
total = total + totalPerFile;
}
}
return total;
}
public int scanFile(File file)
{
int sum = 0;
Scanner scanner = null;
try
{
scanner = new Scanner(file);
while (scanner.hasNextLine())
{
String line = scanner.nextLine();
String[] info = line.split("\\s+");
int count = 1;
for (String stingInt : info)
{
if (count != 1)
{
sum = sum + Integer.parseInt(stingInt);
}
count++;
}
}
scanner.close();
}
catch (FileNotFoundException e)
{
// you will need to handle this
// don't do this !
e.printStackTrace();
}
return sum;
}
}
