I've been looking for a solution for quite a while, but I'm still struggling with concurrency and parallelization.
Background: There's an ETL process in which I receive a big CSV (up to over a million rows). In production there will be live updates, too. I want to spell-check each row. For that I use an adapted LanguageTool. The check method (with my customization inside) takes quite a while, and I want to speed it up.
One aspect is of course the method itself, but I also want to check multiple rows at a time. The order of the rows is not important. The result is the corrected text, which should be written to a new CSV file for further processing.
I found that ExecutorService might be a reasonable choice, but since I'm not that familiar with it, some help would be appreciated.
That's how I use it so far in the ETL process:
private static SpellChecker spellChecker;

static {
    SpellChecker tmp = null;
    try {
        tmp = new SpellChecker(...);
    } catch (Exception e) {
        e.printStackTrace();
    }
    spellChecker = tmp;
}

public static String spellCheck(String input) {
    String output = input.replace("</li>", ".");
    output = searchAVC.removeHtml(output);
    try {
        output = spellChecker.correctText(output);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return output;
}
My spellChecker here is a custom library, and I create a static instance of it (because instantiation of LanguageTool takes some time).
I want to parallelize the execution of spellCheck.
I've already read stuff like this:
https://www.airpair.com/java/posts/parallel-processing-of-io-based-data-with-java-streams
What is the easiest way to parallelize a task in java?
Write to text file from multiple threads?
I don't really know how to combine all this information. What do I have to consider when reading the file? Writing the file? Processing the rows?
Create a Reader class responsible for reading from the file.
Create a Writer class responsible for writing to the file.
Create a Processor class responsible for the processing.
Now create a partitioner whose responsibility is to read chunk by chunk and dispatch each batch of rows to a reader; the reader uses the processor to process the rows and sends the batch on to the writer. A sketch of this is shown below.
To run it, create a thread pool to execute everything in a multithreaded environment.
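Here is a minimal sketch of that split using an ExecutorService. The file names are placeholders, and spellCheck stands in for the adapted LanguageTool call from the question. Note that the shared static SpellChecker must be thread-safe for this to work; if it isn't, give each worker thread its own instance (for example via ThreadLocal).

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSpellCheck {

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<String>> results = new ArrayList<>();

        // Reader: submit one task per row. For a million-row file you may
        // want to submit bounded batches instead, so this list stays small.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("input.csv"), StandardCharsets.UTF_8)) {
            String row;
            while ((row = in.readLine()) != null) {
                final String r = row;
                results.add(pool.submit(() -> spellCheck(r))); // processor
            }
        }

        // Writer: a single thread drains the futures, so the output file
        // needs no synchronization. If row order truly doesn't matter, an
        // ExecutorCompletionService would let you write rows as they finish.
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("output.csv"), StandardCharsets.UTF_8)) {
            for (Future<String> f : results) {
                out.write(f.get());
                out.newLine();
            }
        }
        pool.shutdown();
    }

    // stands in for the adapted LanguageTool call from the question
    static String spellCheck(String input) {
        return input;
    }
}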
I'm writing a Java program in Eclipse to scan keywords from resumes and filter the most suitable resume among them, apart from showing the keywords for each resume. The resumes can be in doc/pdf format.
I've successfully implemented a program to read PDF files and DOC files separately (by using Apache's PDFBox and POI jar packages and importing libraries for the required methods), display the keywords, and show resume strength in terms of the number of keywords found.
Now there are two issues I'm stuck on:
(1) I need to distinguish between a PDF file and a DOC file within the program, which is easily achievable with an if statement, but I'm confused about how to write the code to detect whether a file has a .pdf or .doc extension. (I intend to build an application to select the resumes, but the program has to decide whether to run the DOC-type file reading block or the PDF-type file reading block.)
(2) I intend to run the program over a list of resumes, for which I'll need a loop that runs the keyword scanning operations for each resume, but I can't think of a way to do it: even if the files were named 'resume1', 'resume2', etc., we can't put the loop's iteration variable into the file location like 'C:/Resumes_Folder/Resume[i]', as that's a path.
Any help would be appreciated!
You can use a FileFilter to read only one type or the other, then respond accordingly. It'll give you a list containing only files of the desired type.
The second requirement is confusing to me. I think you would be well served by creating a class that encapsulates the data and behavior that you want for a parsed Resume. Write a factory class that takes in an InputStream and produces a Resume with the data you need inside.
You are making a classic mistake: You are embedding all the logic in a main method. This will make it harder to test your code.
All problem solving consists of breaking big problems into smaller ones, solving the small problems, and assembling them to finally solve the big problem.
I would recommend that you decompose this problem into smaller classes. For example, don't worry about looping over a directory's worth of files until you can read and parse an individual PDF and DOC file.
Create an interface:
public interface ResumeParser {
    Resume parse(InputStream is) throws IOException;
}
Provide separate implementations for PDF and Word documents.
Create a factory to give you the appropriate ResumeParser based on file type:
public class ResumeParserFactory {
    public ResumeParser create(String fileType) {
        if (fileType.contains(".pdf")) {
            return new PdfResumeParser();
        } else if (fileType.contains(".doc")) {
            return new WordResumeParser();
        } else {
            throw new IllegalArgumentException("Unknown document type: " + fileType);
        }
    }
}
Be sure to write unit tests as you go. You should know how to use JUnit.
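For example, a small JUnit 4 test for the factory might look like this (assuming the classes above exist):

import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class ResumeParserFactoryTest {

    @Test
    public void createReturnsPdfParserForPdfFiles() {
        ResumeParserFactory factory = new ResumeParserFactory();
        assertTrue(factory.create("resume.pdf") instanceof PdfResumeParser);
    }

    @Test
    public void createReturnsWordParserForDocFiles() {
        ResumeParserFactory factory = new ResumeParserFactory();
        assertTrue(factory.create("resume.doc") instanceof WordResumeParser);
    }

    @Test(expected = IllegalArgumentException.class)
    public void createRejectsUnknownTypes() {
        new ResumeParserFactory().create("resume.txt");
    }
}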
An alternative to using a FileFilter is to use a DirectoryStream, because Files::newDirectoryStream makes it easy to specify the relevant file extensions:
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{doc,pdf}")) {
    for (Path entry : stream) {
        // process files here
    }
} catch (DirectoryIteratorException ex) {
    // I/O error encountered during the iteration; the cause is an IOException
    throw ex.getCause();
}
You can do something basic like:
// Put the path to the folder containing all the resumes here
File f = new File("C:\\");
ArrayList<String> names = new ArrayList<>(Arrays.asList(Objects.requireNonNull(f.list())));
for (String fileName : names) {
    if (fileName.length() > 3) {
        String type = fileName.substring(fileName.length() - 3);
        if (type.equalsIgnoreCase("doc")) {
            // doc file logic here
        } else if (type.equalsIgnoreCase("pdf")) {
            // pdf file logic here
        }
    }
}
But as DuffyMo's answer says, you can also use a FileFilter (it's definitely a better option than my quick code).
Hope it helps.
I am building an Android app which records accelerometer and gyroscope data to a text file. In most of the tutorials they use a method which involves creating two text files and opening and closing each of them 50 times per second, i.e.:
private static void writeToFile(File file, String data) {
    FileOutputStream stream = null;
    try {
        stream = new FileOutputStream(file, true); // open in append mode
        stream.write(data.getBytes());
    } catch (IOException e) { // FileNotFoundException is a subclass of IOException
        Log.e("History", "In catch");
        e.printStackTrace();
    } finally {
        if (stream != null) {
            try {
                stream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
I.e., on every SensorEvent you open the file, write the values, close the file, then open it again 20 milliseconds later.
It all seems to be working fine; I was just wondering if there is a better way of going about it. I tried some changes using a boolean flag to say whether the stream is already open, and a different writeToFile if the flag is set to true, but apparently the FileOutputStream can sometimes close itself within the 20-millisecond time frame, and the app crashes.
So I guess my question is: how many system resources does it take to open, write, and close a file that many times? Is it fine and not something I should worry about, or is there a better way of doing things? Bear in mind that continuous sensor logging already takes a toll on battery life, so I would like to do things as efficiently as possible.
Thanks
It's not a good way of doing it. A better way would be to create the FileOutputStream once, save it as an instance member of whatever class this is, and just write to it (possibly with an occasional call to flush() to make sure it actually gets written to disk).
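A minimal sketch of that idea, assuming the writes come from a sensor callback (the class and method names are illustrative):

import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class SensorLogger implements Closeable {

    private final BufferedWriter writer;

    public SensorLogger(File file) throws IOException {
        // open once, in append mode; BufferedWriter batches the small writes
        this.writer = new BufferedWriter(new FileWriter(file, true));
    }

    public void log(String data) throws IOException {
        writer.write(data);
        writer.newLine();
        // flush() here only if you need every event on disk immediately;
        // flushing less often is cheaper on battery
    }

    @Override
    public void close() throws IOException {
        writer.close(); // flushes any remaining buffered data
    }
}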
I've always been curious how rolling files are implemented in logging.
How would one even start creating a file-writing class, in any language, that ensures the file size is not exceeded?
The only possible solution I can think of is this:
write method:
    size = file size + size of string to write
    if (size > limit)
        close the file writer
        open file reader
        read the file
        close file reader
        open file writer (clears the whole file)
        remove the size from the beginning to accommodate the new string to write
        write the new truncated string
    write the string we received
This seems like a terrible implementation, but I cannot think of anything better.
Specifically, I would love to see a solution in Java.
EDIT: By "remove the size from the beginning" I mean: say I have a 20-byte string (which is the limit) and I want to write another 3-byte string; I remove 3 bytes from the beginning, am left with 17 bytes, and by appending the new string I get back to 20 bytes.
Because your question made me look into it, here's an example from the logback logging framework. The RollingFileAppender#rollover() method looks like this:
public void rollover() {
    synchronized (lock) {
        // Note: This method needs to be synchronized because it needs exclusive
        // access while it closes and then re-opens the target file.
        //
        // make sure to close the hereto active log file! Renaming under windows
        // does not work for open files
        this.closeOutputStream();
        try {
            rollingPolicy.rollover(); // this actually does the renaming of files
        } catch (RolloverFailure rf) {
            addWarn("RolloverFailure occurred. Deferring roll-over.");
            // we failed to roll-over, let us not truncate and risk data loss
            this.append = true;
        }
        try {
            // update the currentlyActiveFile
            currentlyActiveFile = new File(rollingPolicy.getActiveFileName());
            // This will also close the file. This is OK since multiple
            // close operations are safe.
            // COMMENT MINE: this also sets the new OutputStream for the new file
            this.openFile(rollingPolicy.getActiveFileName());
        } catch (IOException e) {
            addError("setFile(" + fileName + ", false) call failed.", e);
        }
    }
}
As you can see, the logic is pretty similar to what you posted. They close the current OutputStream, perform the rollover, then open a new one (openFile()). Obviously, this is all done in a synchronized block since many threads are using the logger, but only one rollover should occur at a time.
A RollingPolicy specifies how to perform a rollover, and a TriggeringPolicy specifies when to perform one. With logback, you usually base these policies on file size or time.
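If you want to build the same idea by hand, here's a stripped-down sketch of the close-rename-reopen approach with a single size-based policy; the class and names are illustrative, not logback's API:

import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class SimpleRollingWriter implements Closeable {

    private final File active;
    private final long maxBytes;
    private BufferedWriter out;

    public SimpleRollingWriter(File active, long maxBytes) throws IOException {
        this.active = active;
        this.maxBytes = maxBytes;
        this.out = new BufferedWriter(new FileWriter(active, true));
    }

    public synchronized void write(String line) throws IOException {
        if (active.length() + line.length() > maxBytes) {
            rollover();
        }
        out.write(line);
        out.newLine();
        out.flush(); // keep active.length() accurate for the next size check
    }

    private void rollover() throws IOException {
        out.close(); // must close before renaming, especially on Windows
        File archived = new File(active.getPath() + "." + System.currentTimeMillis());
        if (!active.renameTo(archived)) {
            throw new IOException("Rollover failed for " + active);
        }
        out = new BufferedWriter(new FileWriter(active)); // fresh, empty file
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}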
I'm creating a personal movie database and I want to populate a combo box with movie titles from IMDB. IMDB releases this information in text files, so I'm trying to populate it from those. I've got it working, but since the text file is VERY large, almost 80,000 rows with a title on every row, it takes way too long to load.
This might be the wrong way to go about it; does anyone know how to solve this, or what I should do?
The code for reading the file and returning the String[] for the combo box:
public String[] getMoviesFromFile() throws IOException {
    List<String> strings = new ArrayList<>(); // was presumably a field in the original
    BufferedReader input = new BufferedReader(new FileReader(filePath));
    try {
        String line = null;
        while ((line = input.readLine()) != null) {
            strings.add(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        input.close();
    }
    return strings.toArray(new String[]{});
}
The problem you're having is that you're blocking the Event Dispatching Thread, which will make your application come to a grinding halt while the file is being read. You should never perform time-consuming or blocking actions on the EDT.
You need to offload the loading to a background thread and build the list there, then re-sync the values back to the EDT (you should never create or modify any UI element outside of the EDT).
Have a look at Concurrency in Swing. In your case, I'd recommend taking a look at SwingWorker as it's designed to meet your actual requirements.
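A minimal sketch of that, assuming a JComboBox<String> field named movieBox and the getMoviesFromFile() method from the question (imports from javax.swing):

SwingWorker<String[], Void> worker = new SwingWorker<String[], Void>() {
    @Override
    protected String[] doInBackground() throws Exception {
        return getMoviesFromFile(); // runs on a background thread
    }

    @Override
    protected void done() {
        try {
            // done() runs on the EDT, so it is safe to touch the UI here
            movieBox.setModel(new DefaultComboBoxModel<>(get()));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
};
worker.execute();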
File I/O may be too slow for your needs; I might suggest you look at loading the text file into an SQL-style database, which may give faster results.
I'd suggest looking at HyperSQL or H2, which are both pure-Java SQL databases designed to be small and lightweight, and which also run in single-user mode, meaning you don't need to install a fully fledged SQL server in order to use them.
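For example, with H2's embedded driver (this assumes the H2 jar is on the classpath; the table and query are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class MovieDbDemo {
    public static void main(String[] args) throws Exception {
        // jdbc:h2:./movies creates or opens an embedded database file "movies"
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./movies", "sa", "")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS movies(title VARCHAR(255))");
            }
            // prefix search instead of loading all 80,000 titles at once
            try (PreparedStatement ps =
                     conn.prepareStatement("SELECT title FROM movies WHERE title LIKE ?")) {
                ps.setString(1, "A%");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }
}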
I've made two apps designed to run concurrently (I do not want to combine them); one reads from a certain file and the other writes to it. When only one or the other is running there are no errors, but if they are both running I get an "access is denied" error.
Relevant code of the first:
class MakeImage implements Runnable {
    @Override
    public void run() {
        File file = new File("C:/Users/jeremy/Desktop/New folder (3)/test.png");
        while (true) {
            try {
                // make image
                if (image != null) {
                    file.createNewFile();
                    ImageIO.write(image, "png", file);
                    hello.repaint();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
Relevant code of the second:
BufferedImage image = null;
try {
    // Read from a file
    image = ImageIO.read(new File("C:/Users/jeremy/Desktop/New folder (3)/test.png"));
} catch (Exception e) {
    e.printStackTrace();
}
if (image != null) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ImageIO.write(image, "png", baos);
    baos.flush();
    byte[] imageInByte = baos.toByteArray();
    baos.close();
    returns = Base64.encodeBase64String(imageInByte);
}
I looked at this: Java: how to handle two process trying to modify the same file, but that is when both are writing to the file, whereas here only one is. I tried the retry-later method suggested in the former's answer, without any luck. Any help would be greatly appreciated.
Unless you use OS-level file locking of some sort and check for the locks, you're not going to be able to do this reliably very easily. A fairly reliable way to manage it would be to use another file in the directory as a semaphore: "touch" it when you're writing or reading, remove it when you're done, and check for its existence before accessing the file. Otherwise you would need to use a database of some sort to store the file lock (guaranteed consistency) and check for it there. A rough sketch of the semaphore idea is shown below.
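A rough, best-effort sketch of that lock-file idea (the lock name is illustrative; File.createNewFile() is atomic, but a crash can leave a stale lock behind):

File target = new File("C:/Users/jeremy/Desktop/New folder (3)/test.png");
File lock = new File(target.getPath() + ".lock");

if (lock.createNewFile()) { // atomically returns false if the other app holds the lock
    try {
        ImageIO.write(image, "png", target); // or ImageIO.read(target) in the reader app
    } finally {
        lock.delete(); // release the semaphore
    }
} else {
    // the other process is using the file; back off and retry later
}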
That said, you really should just combine this into 1 program.
Try RandomAccessFile.
This is a useful but very dangerous feature. It goes like this: if you create different instances of RandomAccessFile for the same file, you can concurrently write to different parts of the file.
You can create multiple threads pointing to different parts of the file using the seek method, and multiple threads can update the file at the same time. seek allows you to move to any part of the file, even one that doesn't exist yet (past EOF), so you can move to any location in a newly created file and write bytes at that location. You can open multiple instances of the same file, seek to different locations, and write to multiple locations at the same time.
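A tiny sketch of two instances positioned at different offsets (the file name and offsets are illustrative, and the regions must not overlap):

import java.io.RandomAccessFile;

public class SeekDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile a = new RandomAccessFile("test.bin", "rw");
             RandomAccessFile b = new RandomAccessFile("test.bin", "rw")) {
            a.seek(0);    // first region
            b.seek(1024); // second region; may be past EOF, the file simply grows
            a.write("first".getBytes());
            b.write("second".getBytes());
        }
    }
}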
Use synchronized on the method that modifies the file.
Edited:
As per the definition of a thread-safe class: "A class is said to be thread safe if it works correctly in the presence of the underlying OS's interleaving and scheduling, with no means of synchronization required from the client side."
I believe the file is to be accessed from a different machine, so there must be some client-server mechanism; if there is, let the server side have the synchronization mechanism, and then it doesn't matter how many clients access it.
If not, synchronized is more than enough.
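A tiny sketch of that, with illustrative names; note that synchronized only coordinates threads within a single JVM, not two separate processes:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class ImageFileAccess {

    private final File file;

    public ImageFileAccess(File file) {
        this.file = file;
    }

    // only one thread at a time can be inside either method on this instance
    public synchronized void write(BufferedImage image) throws IOException {
        ImageIO.write(image, "png", file);
    }

    public synchronized BufferedImage read() throws IOException {
        return ImageIO.read(file);
    }
}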