TrueZip compression taking too much time - java

I am using TrueZip for compression. Here is what my code looks like
public String compress() throws IOException {
if (logLocations.isEmpty()) {
throw new IllegalStateException("no logs provided to compress");
}
removeDestinationIfExists(desiredArchive);
final TFile destinationArchive = new TFile(desiredArchive + "/diagnostics");
for (final String logLocation : logLocations) {
final TFile log = new TFile(logLocation);
if (!log.exists()) {
LOGGER.debug("{} does not exist, ignoring.");
continue;
}
if (log.isDirectory()) {
log.cp_r(destinationArchive);
} else {
final String newLogLocation =
new TFile(destinationArchive.getAbsolutePath()) + SLASH +
getLogNameFromPath(logLocation);
log.cp(new TFile(newLogLocation));
}
}
return destinationArchive.getEnclArchive().getAbsolutePath();
}
and my test
@Test
public void testBenchMarkWithHprof() throws IOException {
final FileWriter logLocations;
String logLocationPath = "/Users/harit/Downloads/tmp/logLocations.txt";
{
logLocations = new FileWriter(logLocationPath);
logLocations.write("Test3");
logLocations.write("\n");
logLocations.close();
}
final LPLogCompressor compressor = new LPLogCompressor("/Users/harit/Downloads/tmp",
new File(logLocationPath),
"/Users/harit/Downloads/tmp/TestOut");
final long startTime = System.currentTimeMillis();
compressor.compress();
System.out.println("Time taken (msec): " + (System.currentTimeMillis() - startTime));
}
and my data directory Test3 looks like
Test3/
java_pid1748.hprof
The file size is 2.83GB
When I ran the test, it took over 22 minutes.
However, when I compress the same file using the native OS X compressor (right click -> Compress), it takes only 2 minutes.
Why is there such a big difference?
Thanks
UPDATE
Based on @Satnam's recommendation, I attached a debugger to see what's going on, and this is what I found:
none of the TrueZip threads are running. Really? Apologies, I am using a profiler for the first time.

The reason in this case was the default Deflater level, which is Deflater.BEST_COMPRESSION.
I overrode the ZipDriver class to change the level:
import de.schlichtherle.truezip.fs.archive.zip.ZipDriver;
import de.schlichtherle.truezip.socket.IOPoolProvider;
import java.util.zip.Deflater;
public class OverrideZipDriver extends ZipDriver {
public OverrideZipDriver(final IOPoolProvider ioPoolProvider) {
super(ioPoolProvider);
}
@Override
public int getLevel() {
return Deflater.DEFAULT_COMPRESSION;
}
}
and then in my Compressor class, I did
public LPLogCompressor(final String logProcessorInstallPath, final File logLocationsSource,
final String desiredArchive) throws IOException {
this.desiredArchive = desiredArchive + DOT + getDateTimeStampFormat() + ZIP;
logLocations = getLogLocations(logProcessorInstallPath, logLocationsSource);
enableLogCompression();
}
private static void enableLogCompression() {
TConfig.get().setArchiveDetector(
new TArchiveDetector(TArchiveDetector.NULL, new Object[][]{
{"zip", new OverrideZipDriver(IOPoolLocator.SINGLETON)},}));
TConfig.push();
}
You can read the thread here

Related

How to apply multithreading in java to find a word in a directory(main directory and the word given) and recursively its subdirectories

I am quite new to Stack Overflow and a beginner in Java, so please forgive me if I have asked this question in an improper way.
PROBLEM
I have an assignment which tells me to make use of multi-threading to search files for a given word, which might be present in any file of type .txt or .html, at any level in the given directory (so basically the entire directory). The absolute file path of the file has to be displayed on the console if the file contains the given word.
WHAT HAVE I TRIED
So I thought of dividing the task into two sections, searching and multithreading respectively.
I was able to get the searching part working (File_search.java). It has given satisfactory results, searching through the directory and finding all the files in it that contain the given word.
File_search.java
public class File_search{
String fin_output = "";
public String searchInTextFiles(File dir,String search_word) {
File[] a = dir.listFiles();
for(File f : a){
if(f.isDirectory()) {
searchInTextFiles(f,search_word);
}
else if(f.getName().endsWith(".txt") || f.getName().endsWith(".html") || f.getName().endsWith(".htm") ) {
try {
searchInFile(f,search_word);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
return fin_output;
}
public void searchInFile(File f,String search_word) throws FileNotFoundException {
final Scanner sc = new Scanner(f);
while(sc.hasNextLine()) {
final String lineFromFile = sc.nextLine();
if(lineFromFile.contains(search_word)) {
fin_output += "FILE : "+f.getAbsolutePath().toString()+"\n";
}
}
}
Now I want to be able to use multiple threads to execute the task in File_search.java using the ThreadPoolExecutor service. I'm not sure if I can do it using Runnable, Callable, the Thread class, or some other method.
Can you please help me with the code to do the multi-threading part? Thanks :)
I agree with the comment of @chrylis -cautiouslyoptimistic, but for the purpose of understanding, the below will help you.
One simpler approach could be to do the traversal of directories in the main thread (the logic you have added in the function searchInTextFiles) and do the searching logic, as you did in the function searchInFile, in a thread pool of size, let's say, 10.
The sample code below will help you to understand it better.
public class Traverser {
private List<Future<String>> futureList = new ArrayList<Future<String>>();
private ExecutorService executorService;
public Traverser() {
executorService = Executors.newFixedThreadPool(10);
}
public static void main(String[] args) throws InterruptedException, ExecutionException {
System.out.println("Started");
long start = System.currentTimeMillis();
Traverser traverser = new Traverser();
traverser.searchInTextFiles(new File("Some Directory Path"), "Some Text");
for (Future<String> future : traverser.futureList) {
System.out.println(future.get());
}
traverser.executorService.shutdown();
while(!traverser.executorService.isTerminated()) {
System.out.println("Not terminated yet, sleeping");
Thread.sleep(1000);
}
long end = System.currentTimeMillis();
System.out.println("Time taken :" + (end - start));
}
public void searchInTextFiles(File dir,String searchWord) {
File[] filesList = dir.listFiles();
for(File file : filesList){
if(file.isDirectory()) {
searchInTextFiles(file,searchWord);
}
else if(file.getName().endsWith(".txt") || file.getName().endsWith(".html") || file.getName().endsWith(".htm") ) {
try {
futureList.add(executorService.submit(new SearcherTask(file,searchWord)));
} catch (Exception e) {
e.printStackTrace();
}
}
}
}}
public class SearcherTask implements Callable<String> {
private File inputFile;
private String searchWord;
public SearcherTask(File inputFile, String searchWord) {
this.inputFile = inputFile;
this.searchWord = searchWord;
}
@Override
public String call() throws Exception {
StringBuilder result = new StringBuilder();
Scanner sc = null;
try {
sc = new Scanner(inputFile);
while (sc.hasNextLine()) {
final String lineFromFile = sc.nextLine();
if (lineFromFile.contains(searchWord)) {
result.append("FILE : " + inputFile.getAbsolutePath().toString() + "\n");
}
}
} catch (Exception e) {
//log error
throw e;
} finally {
if (sc != null) {
sc.close();
}
}
return result.toString();
}}

Android Track progress of Azure CloudBlockBlob upload

How can I print the number of bytes that have been uploaded after calling blob.upload(new FileInputStream(imageFile), imageFile.length())? I want to log something like "100/totalBytes bytes have been uploaded, 224/totalBytes bytes have been uploaded..." so I can create a progress bar of the upload progress.
this is the code:
//AzureBlobLoader extends AsyncTask
public class AzureBlobUploader extends AzureBlobLoader {
private Activity act;
private String userName;
private TaggedImageObject img;
private Fragment histFragment;
public AzureBlobUploader(Fragment f, Activity act, String userName, TaggedImageObject img) {
super();
this.act = act;
this.userName = userName;
this.img = img;
this.histFragment = f;
}
@Override
protected Object doInBackground(Object[] params) {
File imageFile = new File(this.img.getImgPath());
try {
// Define the path to a local file.
final String filePath = imageFile.getPath();
// Create or overwrite the blob with contents from the local file.
String[] imagePathArray = filePath.split("/");
String imageName = imagePathArray[imagePathArray.length-1];
System.out.println("Image Name: " + imageName);
String containerName = userName + "/" + imageName;
System.out.println("Container Name: " + containerName);
CloudBlockBlob blob= this.getContainer().getBlockBlobReference(containerName);
//UPLOAD!
blob.upload(new FileInputStream(imageFile), imageFile.length());
//-----DATABASE-----//
//create client
this.setDBClient(
new MobileServiceClient(
"URL",
this.act.getApplicationContext()
)
);
this.setImageTable(this.getDBClient().getTable(Image.class));
this.setIcavTable(this.getDBClient().getTable(ICAV.class));
//IMG TABLE QUERY
String validImageID = containerName.replace("/", "_");
Log.d("Azure", "Valid Image ID: " + validImageID);
Image img = new Image(validImageID, this.img.getUser(), this.img.getLat(), this.img.getLon());
this.getImageTable().insert(img);
for(String context : this.img.getContextAttributeMap().keySet()){
Map<String,String> attributeValueMap = this.img.getContextAttributeMap().get(context);
for(String attribute : attributeValueMap.keySet()){
String value = attributeValueMap.get(attribute);
ICAV icavRow = new ICAV();
icavRow.setImageID(validImageID);
icavRow.setContextID(context);
icavRow.setAttributeID(attribute);
icavRow.setValue(value);
this.getIcavTable().insert(icavRow);
}
}
} catch (Exception e) {
System.out.println(e.toString());
}
return null;
}
@Override
protected void onProgressUpdate(Object... object) {
super.onProgressUpdate(object);
Log.d("progressUpdate", "progress: "+((Integer)object[0] * 2) + "%");
}
@Override
protected void onPostExecute(Object o) {
// to do
}
}
As you can see, the Azure SDK doesn't directly allow for that, but it should be fairly easy to wrap your InputStream in another InputStream that gives callbacks for bytes read. Something like this:
public class ListenableInputStream extends InputStream {
private final InputStream wrapped;
private final ReadListener listener;
private final long minimumBytesPerCall;
private long bytesRead;
public ListenableInputStream(InputStream wrapped, ReadListener listener, int minimumBytesPerCall) {
this.wrapped = wrapped;
this.listener = listener;
this.minimumBytesPerCall = minimumBytesPerCall;
}
@Override
public int read() throws IOException {
int read = wrapped.read();
if (read >= 0) {
bytesRead++;
}
if (bytesRead > minimumBytesPerCall || read == -1) {
listener.onRead(bytesRead);
bytesRead = 0;
}
return read;
}
@Override
public int available() throws IOException {
return wrapped.available();
}
@Override
public void close() throws IOException {
wrapped.close();
}
@Override
public synchronized void mark(int readlimit) {
wrapped.mark(readlimit);
}
@Override
public synchronized void reset() throws IOException {
wrapped.reset();
}
@Override
public boolean markSupported() {
return wrapped.markSupported();
}
interface ReadListener {
void onRead(long bytes);
}
}
minimumBytesPerCall should be initialised to some sensible number, as you probably don't want to be called back on every single byte; maybe every half a megabyte would be good.
And remember that this all gets called on the doInBackground thread, so act accordingly.
edit:
I've edited the class above; there was a small error in computing the bytesRead value.
The official documentation answers your follow-up questions: https://developer.android.com/reference/java/io/InputStream.html#read()
Reads the next byte of data from the input stream
So read() reads 1 byte of data (or returns -1 if it has reached the end). So yes, it must be called many times to read a whole image.
Then the method onRead(long) gets called every time at least minimumBytesPerCall bytes have been read (that's to avoid calling back for every single byte) and once more at the end of the stream (when it returns -1).
The value passed to onRead(long) is the amount that has been read since the last call. So, implementing this in your AsyncTask, you would have to accumulate this value and compare it with the total size of the file.
Something like the following code inside your AsyncTask should work fine (assuming the Progress generic parameter is a Long):
private long fileLength;
private long totalBytes;
private final ListenableInputStream.ReadListener readListener = new ListenableInputStream.ReadListener() {
@Override
public void onRead(long bytes) {
totalBytes += bytes;
publishProgress(totalBytes);
}
};
and inside your upload part you replace the call with:
FileInputStream fis = new FileInputStream(imageFile);
fileLength = imageFile.length();
ListenableInputStream lis = new ListenableInputStream(fis, readListener, 256 * 1024); // this will call onRead(long) every 256kb
blob.upload(lis, fileLength);
And as a last remark: remember that internally the CloudBlockBlob may just be caching the file in its own memory for later upload, or doing any other weird stuff that is out of your control. All this code does is track that the complete file was read.
happy coding!
Just another way to meet your needs: there is an MS blog post which introduces uploading a blob to Azure Storage with a progress bar and a variable upload block size. That code was written in C#, but it's very easy reading for a Java/Android developer, and I think you could easily rewrite it in Java for Android to compute the upload progress-bar ratio and share it via some public variables.
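For reference, a rough Java sketch of that block-by-block approach is below. It assumes the same com.microsoft.azure.storage SDK that provides CloudBlockBlob (its uploadBlock/commitBlockList methods); the class name BlockUploader and the 1 MB block size are made up for illustration, so treat this as a sketch rather than a drop-in implementation:
import android.util.Log;
import com.microsoft.azure.storage.blob.BlockEntry;
import com.microsoft.azure.storage.blob.CloudBlockBlob;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class BlockUploader {
    // Uploads the file in fixed-size blocks, reporting progress after each
    // block, then commits the block list so the blob becomes visible.
    public static void uploadWithProgress(CloudBlockBlob blob, File file) throws Exception {
        final int blockSize = 1024 * 1024; // 1 MB per block (tune to taste)
        final long total = file.length();
        long uploaded = 0;
        int blockNum = 0;
        List<BlockEntry> blockList = new ArrayList<>();
        try (FileInputStream in = new FileInputStream(file)) {
            byte[] buffer = new byte[blockSize];
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Block IDs must be Base64-encoded and all of the same length.
                String blockId = Base64.getEncoder().encodeToString(
                        String.format("%08d", blockNum++).getBytes(StandardCharsets.UTF_8));
                blob.uploadBlock(blockId, new ByteArrayInputStream(buffer, 0, read), read);
                blockList.add(new BlockEntry(blockId));
                uploaded += read;
                // Inside an AsyncTask you would call publishProgress(uploaded) here.
                Log.d("upload", uploaded + "/" + total + " bytes have been uploaded");
            }
        }
        blob.commitBlockList(blockList);
    }
}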
Hope it helps.

How to generate PDF documents from rpt in a multi-threaded approach?

I have an rpt file, from which I will be generating multiple reports in PDF format, using the Engine class from inet Clear Reports. The process takes very long, as I have nearly 10,000 reports to be generated. Can I use multi-threading or some other approach to speed up the process?
Any help on how it can be done would be appreciated.
My partial code.
//Loops
Engine eng = new Engine(Engine.EXPORT_PDF);
eng.setReportFile(rpt); //rpt is the report name
if (cn == null || cn.isClosed()) {
cn = ds.getConnection();
}
eng.setConnection(cn);
System.out.println(" After set connection");
eng.setPrompt(data[i], 0);
ReportProperties repprop = eng.getReportProperties();
repprop.setPaperOrient(ReportProperties.DEFAULT_PAPER_ORIENTATION, ReportProperties.PAPER_FANFOLD_US);
eng.execute();
System.out.println(" After excecute");
try {
PDFExportThread pdfExporter = new PDFExportThread(eng, sFileName, sFilePath);
pdfExporter.execute();
} catch (Exception e) {
e.printStackTrace();
}
PDFExportThread execute
public void execute() throws IOException {
FileOutputStream fos = null;
try {
String FileName = sFileName + "_" + (eng.getPageCount() - 1);
File file = new File(sFilePath + FileName + ".pdf");
if (!file.getParentFile().exists()) {
file.getParentFile().mkdirs();
}
if (!file.exists()) {
file.createNewFile();
}
fos = new FileOutputStream(file);
for (int k = 1; k <= eng.getPageCount(); k++) {
fos.write(eng.getPageData(k));
}
fos.flush();
fos.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (fos != null) {
fos.close();
fos = null;
}
}
}
This is very basic code. A ThreadPoolExecutor with a fixed number of threads in a pool is the backbone.
Some considerations:
The thread pool size should be equal to or less than the DB connection pool size, and it should be an optimal number which is reasonable for running Engines in parallel.
The main thread should wait for sufficient time before killing all threads. I have put 1 hour as the wait time, but that's just an example.
You'll need to have proper exception handling.
From the API doc, I saw stopAll and shutdown methods on the Engine class, so I'm invoking those as soon as our work is done. That's, again, just an example.
Hope this helps.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.sql.Connection;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
public class RunEngine {
public static void main(String[] args) throws Exception {
final String rpt = "/tmp/rpt/input/rpt-1.rpt";
final String sFilePath = "/tmp/rpt/output/";
final String sFileName = "pdfreport";
final Object[] data = new Object[10];
ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(10);
for (int i = 0; i < data.length; i++) {
PDFExporterRunnable runnable = new PDFExporterRunnable(rpt, data[i], sFilePath, sFileName, i);
executor.execute(runnable);
}
executor.shutdown();
executor.awaitTermination(1L, TimeUnit.HOURS);
Engine.stopAll();
Engine.shutdown();
}
private static class PDFExporterRunnable implements Runnable {
private final String rpt;
private final Object data;
private final String sFilePath;
private final String sFileName;
private final int runIndex;
public PDFExporterRunnable(String rpt, Object data, String sFilePath,
String sFileName, int runIndex) {
this.rpt = rpt;
this.data = data;
this.sFilePath = sFilePath;
this.sFileName = sFileName;
this.runIndex = runIndex;
}
@Override
public void run() {
// Loops
Engine eng = new Engine(Engine.EXPORT_PDF);
eng.setReportFile(rpt); // rpt is the report name
Connection cn = null;
/*
* DB connection related code. Check and use.
*/
//if (cn == null || cn.isClosed()) {
//cn = ds.getConnection();
//}
eng.setConnection(cn);
System.out.println(" After set connection");
eng.setPrompt(data, 0);
ReportProperties repprop = eng.getReportProperties();
repprop.setPaperOrient(ReportProperties.DEFAULT_PAPER_ORIENTATION,
ReportProperties.PAPER_FANFOLD_US);
eng.execute();
System.out.println(" After excecute");
FileOutputStream fos = null;
try {
String FileName = sFileName + "_" + runIndex;
File file = new File(sFilePath + FileName + ".pdf");
if (!file.getParentFile().exists()) {
file.getParentFile().mkdirs();
}
if (!file.exists()) {
file.createNewFile();
}
fos = new FileOutputStream(file);
for (int k = 1; k <= eng.getPageCount(); k++) {
fos.write(eng.getPageData(k));
}
fos.flush();
fos.close();
} catch (Exception e) {
e.printStackTrace();
} finally {
if (fos != null) {
try {
fos.close();
} catch (IOException e) {
e.printStackTrace();
}
fos = null;
}
}
}
}
/*
* Dummy classes to avoid compilation errors.
*/
private static class ReportProperties {
public static final String PAPER_FANFOLD_US = null;
public static final String DEFAULT_PAPER_ORIENTATION = null;
public void setPaperOrient(String defaultPaperOrientation, String paperFanfoldUs) {
}
}
private static class Engine {
public static final int EXPORT_PDF = 1;
public Engine(int exportType) {
}
public static void shutdown() {
}
public static void stopAll() {
}
public void setPrompt(Object singleData, int i) {
}
public byte[] getPageData(int k) {
return null;
}
public int getPageCount() {
return 0;
}
public void execute() {
}
public ReportProperties getReportProperties() {
return null;
}
public void setConnection(Connection cn) {
}
public void setReportFile(String reportFile) {
}
}
}
I will offer this "answer" as a possible quick & dirty solution to get you started on a parallelization effort.
One way or another you're going to build a render farm.
I don't think there is a trivial way to do this in Java; I would love to have someone post an answer that shows how to parallelize your example in just a few lines of code. But until that happens this will hopefully help you make some progress.
You're going to have limited scaling in the same JVM instance.
But... let's see how far you get with that and see if it helps enough.
Design challenge #1: restarting.
You will probably want a place to keep the status for each of your reports e.g. "units of work".
You want this in case you need to re-start everything (maybe your server crashes) and you don't want to re-run all of the reports thus far.
Lots of ways you can do this: a database, or checking whether a "completed" file exists in your report folder (it's not sufficient for the *.pdf to exist, as that may be incomplete... for xyz_200.pdf you could maybe make an empty xyz_200.done or xyz_200.err file to help with re-running any problem children... and by the time you code up that file manipulation/checking/initialization logic, it may have been easier to add a column to your database which holds the list of work to be done).
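A minimal sketch of the marker-file variant (the class name and file layout here are made up for illustration; the database variant would replace all of this with one table):
import java.io.File;
import java.io.IOException;

public class WorkStatus {
    private final File outputDir;

    public WorkStatus(File outputDir) {
        this.outputDir = outputDir;
    }

    // A report counts as complete only if its ".done" marker exists;
    // the .pdf alone is not enough, since it may be partially written.
    public boolean isComplete(String reportName) {
        return new File(outputDir, reportName + ".done").exists();
    }

    // Call only after the PDF has been fully written and flushed.
    public void markComplete(String reportName) throws IOException {
        new File(outputDir, reportName + ".done").createNewFile();
    }

    // Marks a problem child for later re-runs.
    public void markFailed(String reportName) throws IOException {
        new File(outputDir, reportName + ".err").createNewFile();
    }
}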
Design consideration #2: maximizing throughput (avoiding overload).
You don't want to saturate your system and run one thousand reports in parallel.
Maybe 10.
Maybe 100.
Probably not 5,000.
You will need to do some sizing research and see what gets you near 80 to 90% system utilization.
Design consideration #3: scaling across multiple servers
Overly complex, outside the scope of a Stack Exchange answer.
You'd have to spin up JVMs on multiple systems that are running something like the workers below, and a report-manager that can pull work items from a shared "queue" structure; again, a database table is probably easier here than something file-based (or a network feed).
Sample Code
Caution: None of this code is well tested, it almost certainly has an abundance of typos, logic errors and poor design. Use at your own risk.
So anyway... I do want to give you the basic idea of a rudimentary task runner.
Replace your "// Loops" example in the question with code like the following:
main loop (original code example)
This is more or less doing what your example code did, modified to push most of the work into ReportWorker (new class, see below). Lots of stuff seems to be packed into your original question's example of "// Loop", so I'm not trying to reverse engineer that.
fwiw, it was unclear to me where "rpt" and "data[i]" are coming from so I hacked up some test data.
public class Main {
public static boolean complete( String data ) {
return false; // for testing nothing is complete.
}
public static void main(String args[] ) {
String data[] = new String[] {
"A",
"B",
"C",
"D",
"E" };
String rpt = "xyz";
// Loop
ReportManager reportMgr = new ReportManager(); // a new helper class (see below), it assigns/monitors work.
long startTime = System.currentTimeMillis();
for( int i = 0; i < data.length; ++i ) {
// complete is something you should write that knows if a report "unit of work"
// finished successfully.
if( !complete( data[i] ) ) {
reportMgr.assignWork( rpt, data[i] ); // so... where did values for your "rpt" variable come from?
}
}
reportMgr.waitForWorkToFinish(); // out of new work to assign, let's wait until everything in-flight complete.
long endTime = System.currentTimeMillis();
System.out.println("Done. Elapsed time = " + (endTime - startTime)/1000 +" seconds.");
}
}
ReportManager
This class is not thread safe; just have your original loop keep calling assignWork() until you're out of reports to assign, then call waitForWorkToFinish() until all work is done, as shown above. (fwiw, I don't think you could say any of the classes here are especially thread safe.)
public class ReportManager {
public int polling_delay = 500; // wait 0.5 seconds for testing.
//public int polling_delay = 60 * 1000; // wait 1 minute.
// not high throughput millions of reports / second, we'll run at a slower tempo.
public int nWorkers = 3; // just 3 for testing.
public int assignedCnt = 0;
public ReportWorker workers[];
public ReportManager() {
// initialize our manager.
workers = new ReportWorker[ nWorkers ];
for( int i = 0; i < nWorkers; ++i ) {
workers[i] = new ReportWorker( i );
System.out.println("Created worker #"+i);
}
}
private ReportWorker handleWorkerError( int i ) {
// something went wrong, update our "report" status as one of the reports failed.
System.out.println("handlerWokerError(): failure in "+workers[i]+", resetting worker.");
workers[i].teardown();
workers[i] = new ReportWorker( i ); // just replace everything.
return workers[i]; // the new worker will, incidentally, be available.
}
private ReportWorker handleWorkerComplete( int i ) {
// this unit of work was completed, update our "report" status tracker as success.
System.out.println("handleWorkerComplete(): success in "+workers[i]+", resetting worker.");
workers[i].teardown();
workers[i] = new ReportWorker( i ); // just replace everything.
return workers[i]; // the new worker will, incidentally, be available.
}
private int activeWorkerCount() {
int activeCnt = 0;
for( int i = 0; i < nWorkers; ++i ) {
ReportWorker worker = workers[i];
System.out.println("activeWorkerCount() i="+i+", checking worker="+worker);
if( worker.hasError() ) {
worker = handleWorkerError( i );
}
if( worker.isComplete() ) {
worker = handleWorkerComplete( i );
}
if( worker.isInitialized() || worker.isRunning() ) {
++activeCnt;
}
}
System.out.println("activeWorkerCount() activeCnt="+activeCnt);
return activeCnt;
}
private ReportWorker getAvailableWorker() {
// check each worker to see if anybody recently completed...
// This (rather lazily) creates completely new ReportWorker instances.
// You might want to try pooling (salvaging and reinitializing them)
// to see if that helps your performance.
System.out.println("\n-----");
ReportWorker firstAvailable = null;
for( int i = 0; i < nWorkers; ++i ) {
ReportWorker worker = workers[i];
System.out.println("getAvailableWorker(): i="+i+" worker="+worker);
if( worker.hasError() ) {
worker = handleWorkerError( i );
}
if( worker.isComplete() ) {
worker = handleWorkerComplete( i );
}
if( worker.isAvailable() && firstAvailable==null ) {
System.out.println("Apparently worker "+worker+" is 'available'");
firstAvailable = worker;
System.out.println("getAvailableWorker(): i="+i+" now firstAvailable = "+firstAvailable);
}
}
return firstAvailable; // May (or may not) be null.
}
public void assignWork( String rpt, String data ) {
ReportWorker worker = getAvailableWorker();
while( worker == null ) {
System.out.println("assignWork: No workers available, sleeping for "+polling_delay);
try { Thread.sleep( polling_delay ); }
catch( InterruptedException e ) { System.out.println("assignWork: sleep interrupted, ignoring exception "+e); }
// any workers available now?
worker = getAvailableWorker();
}
++assignedCnt;
worker.initialize( rpt, data ); // or whatever else you need.
System.out.println("assignment #"+assignedCnt+" given to "+worker);
Thread t = new Thread( worker );
t.start( ); // that is pretty much it, let it go.
}
public void waitForWorkToFinish() {
int active = activeWorkerCount();
while( active >= 1 ) {
System.out.println("waitForWorkToFinish(): #active workers="+active+", waiting...");
// wait a minute....
try { Thread.sleep( polling_delay ); }
catch( InterruptedException e ) { System.out.println("assignWork: sleep interrupted, ignoring exception "+e); }
active = activeWorkerCount();
}
}
}
ReportWorker
public class ReportWorker implements Runnable {
int test_delay = 10*1000; //sleep for 10 seconds.
// (actual code would be generating PDF output)
public enum StatusCodes { UNINITIALIZED,
INITIALIZED,
RUNNING,
COMPLETE,
ERROR };
int id = -1;
StatusCodes status = StatusCodes.UNINITIALIZED;
boolean initialized = false;
public String rpt = "";
public String data = "";
//Engine eng;
//PDFExportThread pdfExporter;
//DataSource_type cn;
public boolean isInitialized() { return initialized; }
public boolean isAvailable() { return status == StatusCodes.UNINITIALIZED; }
public boolean isRunning() { return status == StatusCodes.RUNNING; }
public boolean isComplete() { return status == StatusCodes.COMPLETE; }
public boolean hasError() { return status == StatusCodes.ERROR; }
public ReportWorker( int id ) {
this.id = id;
}
public String toString( ) {
return "ReportWorker."+id+"("+status+")/"+rpt+"/"+data;
}
// the example code doesn't make clear if there is a relationship between rpt & data[i].
public void initialize( String rpt, String data /* data[i] in original code */ ) {
try {
this.rpt = rpt;
this.data = data;
/* uncomment this part where you have the various classes available.
* I have it commented out for testing.
cn = ds.getConnection();
Engine eng = new Engine(Engine.EXPORT_PDF);
eng.setReportFile(rpt); //rpt is the report name
eng.setConnection(cn);
eng.setPrompt(data, 0);
ReportProperties repprop = eng.getReportProperties();
repprop.setPaperOrient(ReportProperties.DEFAULT_PAPER_ORIENTATION, ReportProperties.PAPER_FANFOLD_US);
*/
status = StatusCodes.INITIALIZED;
initialized = true; // want this true even if we're running.
} catch( Exception e ) {
status = StatusCodes.ERROR;
throw new RuntimeException("initialze(rpt="+rpt+", data="+data+")", e);
}
}
public void run() {
status = StatusCodes.RUNNING;
System.out.println("run().BEGIN: "+this);
try {
// delay for testing.
try { Thread.sleep( test_delay ); }
catch( InterruptedException e ) { System.out.println(this+".run(): test interrupted, ignoring "+e); }
/* uncomment this part where you have the various classes available.
* I have it commented out for testing.
eng.execute();
PDFExportThread pdfExporter = new PDFExportThread(eng, sFileName, sFilePath);
pdfExporter.execute();
*/
status = StatusCodes.COMPLETE;
System.out.println("run().END: "+this);
} catch( Exception e ) {
System.out.println("run().ERROR: "+this);
status = StatusCodes.ERROR;
throw new RuntimeException("run(rpt="+rpt+", data="+data+")", e);
}
}
public void teardown() {
if( ! isInitialized() || isRunning() ) {
System.out.println("Warning: ReportWorker.teardown() called but I am uninitailzied or running.");
// should never happen, fatal enough to throw an exception?
}
/* commented out for testing.
try { cn.close(); }
catch( Exception e ) { System.out.println("Warning: ReportWorker.teardown() ignoring error on connection close: "+e); }
cn = null;
*/
// any need to close things on eng?
// any need to close things on pdfExporter?
}
}

Java logging to File: Alternative persistence scheme

New to Java. I would like to use the logger, but with a different file persistence scheme. Instead of rotating files and overwriting them, I would like the logs to be created in a time-based file system hierarchy, where each log file contains the logs of the past minute. Example: if a log is generated on 2015-03-08 13:05, it will be placed in log_05.txt under /home/myUser/logs/2015/03/08/13;
in other words, the file's full path would be /home/myUser/logs/2015/03/08/13/log_05.txt.
Any suggestions?
I ended up implementing a library. Tested on Linux & Windows. It provides the desired file persistence scheme, and allows asynchronous logging. Would appreciate comments.
package com.signin.ems;

import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;
/**
* The EMSLogger JAR wraps the Java Logger for two purposes:
* 1. Implement a custom file persistence scheme (other than a single file, or a rotating scheme).
* In particular, the scheme implemented is one minute files, placed in hourly directories.
* The file name format is <mm>.log (mm=00..59), and the directory name format is YYYYMMDD24HH.
*
* 2. Logging should be done asynchronously. For this, a dedicated thread is created. When a message is logged,
* the LogRecord is placed in a BlockingQueue instead of writing the LogRecord to file. The dedicated thread
* performs a blocking wait on the queue. Upon retrieving a LogRecord object, it writes the LogRecord to the
* proper file
*
*
*/
public class EMSLogger
{
private static final int m_iQueSize = 100000;
private static BlockingQueue<LogRecord> m_LogRecordQueue;
private static EMSLoggerThread m_EMSLoggerThread;
private static Thread m_thread;
private static final Logger m_instance = createInstance();
protected EMSLogger()
{
}
public static Logger getInstance() {
return m_instance;
}
private static Logger createInstance()
{
MyFileHandler fileHandler = null;
Logger LOGGER = null;
try
{
// initialize the Log queue
m_LogRecordQueue = new ArrayBlockingQueue<LogRecord>(m_iQueSize);
// get top level logger
LOGGER = Logger.getLogger("");
LOGGER.setLevel(Level.ALL);
// create our file handler
fileHandler = new MyFileHandler(m_LogRecordQueue);
fileHandler.setLevel(Level.ALL);
LOGGER.addHandler(fileHandler);
// create the logging thread
m_EMSLoggerThread = new EMSLoggerThread(m_LogRecordQueue, fileHandler);
m_thread = new Thread(m_EMSLoggerThread);
m_thread.start();
}
catch (IOException e)
{
e.printStackTrace();
}
return LOGGER;
}
public static void Terminate ()
{
m_thread.interrupt();
}
}
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.BlockingQueue;
import java.util.logging.FileHandler;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;

public class MyFileHandler extends FileHandler
{
private final BlockingQueue<LogRecord> m_queue;
private BufferedOutputStream m_BufferedOutputStream;
private String m_RootFolderName;
private String m_CurrentDirectoryName;
private String m_CurrentFileName;
private SimpleDateFormat m_SDfh;
private SimpleDateFormat m_SDfm;
public MyFileHandler (BlockingQueue<LogRecord> q) throws IOException, SecurityException
{
super ();
// use simple formatter. Do not use the default XML
super.setFormatter (new SimpleFormatter ());
// get root folder from which to create the log directory hierarchy
m_RootFolderName = System.getProperty ("user.home") + "/logs";
// Service can optionally set its name. All hourly directories will
// be created below the provided name. If no name is given, "Default"
// is used
String sName = System.getProperty ("EMS.ServiceName");
if (sName != null)
{
System.out.println ("EMS.ServiceName = " + sName);
}
else
{
sName = "Default";
System.out.println ("Using \"" + sName + "\" as service name");
}
m_RootFolderName += "/" + sName;
// make sure the root folder is created
new File (m_RootFolderName).mkdirs ();
// initialize format objects
m_SDfh = new SimpleDateFormat ("yyyyMMddHH");
m_SDfm = new SimpleDateFormat ("mm");
m_CurrentDirectoryName = "";
m_CurrentFileName = "";
m_BufferedOutputStream = null;
m_queue = q;
}
// post the record to the queue. Actual writing to the log is done in a dedicated thread
// note that placing in the queue is done without blocking while waiting for available space
@Override
public void publish (LogRecord record)
{
m_queue.offer (record);
}
// check if a new file needs to be created
private void SetCurrentFile ()
{
boolean bChangeFile = false;
Date d = new Date (System.currentTimeMillis());
String newDirectory = m_RootFolderName + "/" + m_SDfh.format(d);
String newFile = m_SDfm.format(d);
if (!newDirectory.equals(m_CurrentDirectoryName))
{
// need to create a new directory and a new file
m_CurrentDirectoryName = newDirectory;
new File(m_CurrentDirectoryName).mkdirs();
bChangeFile = true;
}
if (!newFile.equals(m_CurrentFileName))
{
// need to create a new file
m_CurrentFileName = newFile;
bChangeFile = true;
}
if (bChangeFile)
{
try
{
if (m_BufferedOutputStream != null)
{
m_BufferedOutputStream.close ();
}
System.out.println("Creating File: " + m_CurrentDirectoryName + "/" + m_CurrentFileName + ".log");
m_BufferedOutputStream = new BufferedOutputStream
(new FileOutputStream (m_CurrentDirectoryName + "/" + m_CurrentFileName + ".log", true),2048);
this.setOutputStream(m_BufferedOutputStream);
}
catch (IOException e)
{
e.printStackTrace();
}
}
}
// method _publish is called from the dedicated thread
public void _publish(LogRecord record)
{
// check if a new file needs to be created
SetCurrentFile ();
super.publish(record);
}
}
import java.util.concurrent.BlockingQueue;
import java.util.logging.LogRecord;

class EMSLoggerThread implements Runnable
{
private final BlockingQueue<LogRecord> m_queue;
private final MyFileHandler m_MyFileHandler;
// Constructor
EMSLoggerThread(BlockingQueue<LogRecord> q, MyFileHandler fh)
{
m_queue = q;
m_MyFileHandler = fh;
}
public void run()
{
try
{
while (true)
{
m_MyFileHandler._publish(m_queue.take());
}
}
catch (InterruptedException ex)
{
}
}
}
This is covered in How to create the log file for each record in a specific format using java util logging framework. You have to modify those examples to create directories since the FileHandler will not create directories. If you are going to create an asynchronous handler, you should follow the advice in Using java.util.logger with a separate thread to write on file.
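For the directory-creation point, a minimal sketch is below, using the questioner's per-minute layout. The factory class name is made up for illustration; the path is built by hand rather than through FileHandler's pattern syntax:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.FileHandler;

public class MinuteFileHandlerFactory {
    // Builds a FileHandler for the current minute, creating the
    // logs/yyyy/MM/dd/HH directory first, since FileHandler itself
    // will not create missing directories.
    public static FileHandler forNow(String rootDir) throws IOException {
        Date now = new Date();
        String dir = new SimpleDateFormat("yyyy/MM/dd/HH").format(now);
        String file = "log_" + new SimpleDateFormat("mm").format(now) + ".txt";
        Path target = Paths.get(rootDir, dir);
        Files.createDirectories(target);
        return new FileHandler(target.resolve(file).toString(), true); // append mode
    }
}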

Splitting huge CSV by custom filter?

I have a huge (>5 GB) CSV file in the format:
username,transaction
I want to have as output a separate CSV file for each user, with all of his transactions in the same format. I have a few ideas in mind, but I want to hear other ideas for an effective (fast and memory-efficient) implementation.
Here is what I have done up to now. The first test reads/processes/writes in a single thread; the second test uses many threads. Performance is not that good, so I think I'm doing something wrong. Please correct me.
public class BatchFileReader {
private ICsvBeanReader beanReader;
private double total;
private String[] header;
private CellProcessor[] processors;
private DataTransformer<HashMap<String, List<LoginDto>>> processor;
private boolean hasMoreRecords = true;
public BatchFileReader(String file, DataTransformer<HashMap<String, List<LoginDto>>> processor) {
try {
this.processor = processor;
this.beanReader = new CsvBeanReader(new FileReader(file), CsvPreference.STANDARD_PREFERENCE);
header = CSVUtils.getHeader(beanReader.getHeader(true));
processors = CSVUtils.getProcessors();
} catch (IOException e) {
e.printStackTrace();
}
}
public void read() {
try {
readFile();
} catch (IOException e) {
e.printStackTrace();
} finally {
if (beanReader != null) {
try {
beanReader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
private void readFile() throws IOException {
while (hasMoreRecords) {
long start = System.currentTimeMillis();
HashMap<String, List<LoginDto>> usersBatch = readBatch();
long end = System.currentTimeMillis();
System.out.println("Reading batch for " + ((end - start) / 1000f) + " seconds.");
total +=((end - start)/ 1000f);
if (processor != null && !usersBatch.isEmpty()) {
processor.transform(usersBatch);
}
}
System.out.println("total = " + total);
}
private HashMap<String, List<LoginDto>> readBatch() throws IOException {
HashMap<String, List<LoginDto>> users = new HashMap<String, List<LoginDto>>();
int readLoginCount = 0;
while (readLoginCount < CONFIG.READ_BATCH_SIZE) {
LoginDto login = beanReader.read(LoginDto.class, header, processors);
if (login != null) {
if (!users.containsKey(login.getUsername())) {
List<LoginDto> logins = new LinkedList<LoginDto>();
users.put(login.getUsername(), logins);
}
users.get(login.getUsername()).add(login);
readLoginCount++;
} else {
hasMoreRecords = false;
break;
}
}
return users;
}
}
public class BatchFileWriter<T> {
private final String file;
private final List<T> processedData;
public BatchFileWriter(final String file, List<T> processedData) {
this.file = file;
this.processedData = processedData;
}
public void write() {
try {
writeFile(file, processedData);
} catch (IOException e) {
e.printStackTrace();
} finally {
}
}
private void writeFile(final String file, final List<T> processedData) throws IOException {
System.out.println("START WRITE " + " " + file);
FileWriter writer = new FileWriter(file, true);
long start = System.currentTimeMillis();
for (T record : processedData) {
writer.write(record.toString());
writer.write("\n");
}
writer.flush();
writer.close();
long end = System.currentTimeMillis();
System.out.println("Writing in file " + file + " complete for " + ((end - start) / 1000f) + " seconds.");
}
}
public class LoginsTest {
private static final ExecutorService executor = Executors.newSingleThreadExecutor();
private static final ExecutorService procExec = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);
@Test
public void testSingleThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
long start = System.currentTimeMillis();
DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor = new DataTransformer<HashMap<String, List<LoginDto>>>() {
@Override
public void transform(HashMap<String, List<LoginDto>> data) {
for (String field : data.keySet()) {
new BatchFileWriter<LoginDto>(field + ".csv", data.get(field)).write();
}
}
};
BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
reader.read();
long end = System.currentTimeMillis();
System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");
}
@Test
public void testMultiThreadCSVtoCSVSplit() throws InterruptedException, ExecutionException {
long start = System.currentTimeMillis();
System.out.println(start);
final DataTransformer<HashMap<String, List<LoginDto>>> simpleSplitProcessor = new DataTransformer<HashMap<String, List<LoginDto>>>() {
@Override
public void transform(HashMap<String, List<LoginDto>> data) {
System.out.println("transform");
processAsync(data);
}
};
final CountDownLatch readLatch = new CountDownLatch(1);
executor.execute(new Runnable() {
@Override
public void run() {
BatchFileReader reader = new BatchFileReader("loadData.csv", simpleSplitProcessor);
reader.read();
System.out.println("read latch count down");
readLatch.countDown();
}});
System.out.println("read latch before await");
readLatch.await();
System.out.println("read latch after await");
procExec.shutdown();
executor.shutdown();
long end = System.currentTimeMillis();
System.out.println("TOTAL " + ((end - start)/ 1000f) + " seconds.");
}
private void processAsync(final HashMap<String, List<LoginDto>> data) {
procExec.execute(new Runnable() {
@Override
public void run() {
for (String field : data.keySet()) {
writeASync(field, data.get(field));
}
}
});
}
private void writeASync(final String field, final List<LoginDto> data) {
procExec.execute(new Runnable() {
@Override
public void run() {
new BatchFileWriter<LoginDto>(field + ".csv", data).write();
}
});
}
}
Would it not be better to use unix commands to sort and then split the original file?
Something like: cat txn.csv | sort > txn-sorted.csv
From there, get a listing of the unique usernames via grep, and then grep the sorted file for each username.
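Rather than grepping once per username, the split after sorting can also be a single sequential pass in Java with only one output file open at a time. A minimal sketch, assuming the input is already sorted by username, has no header row, and contains no quoted commas (otherwise use a real CSV parser):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class SortedSplitter {
    public static void split(String sortedFile) throws IOException {
        String currentUser = null;
        PrintWriter out = null;
        try (BufferedReader in = new BufferedReader(new FileReader(sortedFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String user = line.substring(0, line.indexOf(',')); // naive parse
                if (!user.equals(currentUser)) {
                    // Sorted input guarantees we never revisit a user,
                    // so one open writer at a time is enough.
                    if (out != null) out.close();
                    out = new PrintWriter(new FileWriter(user + ".csv"));
                    currentUser = user;
                }
                out.println(line);
            }
        } finally {
            if (out != null) out.close();
        }
    }
}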
If you know Camel already, I'd write a simple Camel route to:
Read line from file
Parse the line
Write to the correct output file
It's a very simple route, but if you want it as fast as possible it is then trivially easy to make it multithreaded.
eg your route would look something like:
from("file:/myfile.csv")
.beanRef("lineParser")
.to("seda:internal-queue");
from("seda:internal-queue")
.concurrentConsumers(5)
.to("fileWriter");
If you don't know Camel, then it's not worth learning it for this one task. However, you are probably going to need to make it multithreaded to get the maximum performance. You'll have to experiment where best to put the threading, as it will depend on which parts of the operation are slowest.
The multithreading will use up more memory, so you'll need to balance memory efficiency against performance.
I would open/append a new output file for each user. If you wanted to minimize memory usage and incur more I/O overhead, you could do something like the following, though you'd probably want to use a real CSV parser like Super CSV (http://supercsv.sourceforge.net/index.html):
Scanner s = new Scanner(new File("/my/dir/users-and-transactions.txt"));
while (s.hasNextLine()) {
String line = s.nextLine();
String[] tokens = line.split(",");
String user = tokens[0];
String transaction = tokens[1];
PrintStream out = new PrintStream(new FileOutputStream("/my/dir/" + user, true));
out.println(transaction);
out.close();
}
s.close();
If you've got a reasonable amount of memory, you could create a Map of user name to OutputStream. Each time you see a user string, you could get the existing OutputStream for that user name or create a new one if none exists.
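A minimal sketch of that map-of-writers idea (the class name is made up for illustration; note that the number of open file handles grows with the number of distinct users):
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class WriterCache {
    private final Map<String, PrintWriter> writers = new HashMap<>();

    // Returns the existing writer for this user, or opens one (in append
    // mode) the first time the user is seen.
    public PrintWriter forUser(String user) {
        return writers.computeIfAbsent(user, u -> {
            try {
                return new PrintWriter(new FileWriter(u + ".csv", true));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Call once after the whole input has been processed.
    public void closeAll() {
        for (PrintWriter w : writers.values()) {
            w.close();
        }
    }
}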
