I'm trying to find a reliable way to measure disk read speed, but I can't get the cache out of the equation.
An answer by simgineer to "How to measure Disk Speed in Java for Benchmarking" offers a utility for exactly this, but for some reason I failed to replicate its behaviour, and running the utility itself does not yield precise read results either.
Following a suggestion in a different answer, making the test file bigger than main memory seems to work, but I cannot afford to spend a whole four minutes waiting for the system to allocate a 130GB file. (Not writing anything to the file produces a sparse file and returns bogus times.)
A sufficient file size seems to be somewhere between
Runtime.getRuntime().maxMemory()
and
Runtime.getRuntime().maxMemory()*2
The source code of my current solution:
File file = new File(false ? "D:/work/bench.dat" : "./work/bench.dat");
RandomAccessFile wFile = null, rFile = null;
try {
System.out.println("Allocating test file ...");
int blockSize = 1024*1024;
long size = false ? 10L*1024L*(long)blockSize : Runtime.getRuntime().maxMemory()*2;
byte[] block = new byte[blockSize];
for(int i = 0; i<blockSize; i++) {
if(i % 2 == 0) block[i] = (byte) (i & 0xFF);
}
System.out.println("Writing ...");
wFile = new RandomAccessFile(file,"rw");
wFile.setLength(size);
for(long i = 0; i<size-blockSize; i+= blockSize) {
wFile.write(block);
}
wFile.close();
System.out.println("Running read test ...");
long t0 = System.nanoTime();
rFile = new RandomAccessFile(file,"r");
int blockCount = (int)(size/blockSize)-1;
Random rnd = new Random();
for(int i = 0; i<testCount; i++) {
rFile.seek((long)rnd.nextInt(blockCount)*(long)blockSize);
rFile.readFully(block, 0, blockSize);
}
rFile.close();
long t1 = System.nanoTime();
double readB = ((double)testCount*(double)blockSize);
double timeNs = (double)(t1-t0);
return (readB/(1024*1024))/(timeNs/(1000*1000*1000));
} catch (Exception e) {
Logger.logError("Failed to benchmark drive speed!", e);
return 0;
} finally {
if(wFile != null) {try {wFile.close();} catch (IOException e) {}}
if(rFile != null) {try {rFile.close();} catch (IOException e) {}}
if(file.exists()) {file.delete();}
}
I had hoped for a benchmark that finishes within seconds, with only the first execution a bit slower and the result cached for subsequent runs.
Technically I could crawl the filesystem and benchmark reads of files that are already on the drive, but that smells like a lot of undefined behaviour, and firewalls are not happy about it either.
Are there any other options left? (Platform-dependent libraries are off the table.)
In the end I decided to solve the problem by scouring the local work folder for files and loading those, hoping we packaged enough with the application to reach the drive's rated speed. In my current test case the answer is luckily yes, but there are no guarantees, so I keep the approach from the question as a backup plan.
This is not a perfect solution, but it somewhat works, reaching the rated speed at about 2000 test files. Bear in mind that the test cannot be rerun with the same results, as all test files from the previous execution are now probably cached.
You can always call flushmem ( https://chadaustin.me/flushmem/ ) by Chad Austin, but that takes about as long to execute as the original approach, so I would advise simply caching the result of the first run and hoping for the best.
Used code:
final int MIN_FILE_SIZE = 1024*10;
final int MAX_READ = 1024*1024*50;
final int FILE_COUNT_FRACTION = 4;
// Scour the location of the runtime for any usable files.
ArrayList<File> found = new ArrayList<>();
ArrayList<File> queue = new ArrayList<>();
queue.add(new File("./"));
while(!queue.isEmpty() && found.size() < testCount) {
File tested = queue.remove(queue.size()-1);
if(tested.isDirectory()) {
File[] children = tested.listFiles(); // null if the directory cannot be read
if(children != null) queue.addAll(Arrays.asList(children));
} else if(tested.length()>MIN_FILE_SIZE){
found.add(tested);
}
}
// If amount of found files is not sufficient, perform test with new file.
if(found.size() < testCount/FILE_COUNT_FRACTION) {
Logger.logInfo("Disk to CPU transfer benchmark failed to find "
+ "sufficient amount of files to read, slow version "
+ "will be performed!", found.size());
return benchTransferSlowDC(testCount);
}
System.out.println(found.size());
byte[] block = new byte[MAX_READ];
Collections.shuffle(found);
RandomAccessFile raf = null;
long readB = 0;
try {
long t0 = System.nanoTime();
for(int i = 0; i<Math.min(found.size(), testCount); i++) {
File file = found.get(i);
int size = (int) Math.min(file.length(), MAX_READ);
raf = new RandomAccessFile(file,"r");
raf.readFully(block, 0, size); // read() may return fewer bytes than requested
raf.close();
readB += size;
}
long t1 = System.nanoTime();
return ((double)readB/(1024*1024))/((double)(t1-t0)/(1000*1000*1000));
//return (double)(t1-t0) / (double)readB;
} catch (Exception e) {
Logger.logError("Failed to benchmark drive speed!", e);
if(raf != null) try {raf.close();} catch(Exception ex) {}
return 0;
}
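The advice above to cache the result of the first run could be sketched roughly as follows. This is only an illustration: the class name BenchmarkCache, the method loadOrMeasure and the idea of a small text file as the cache are assumptions, not part of the original code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.DoubleSupplier;

/** Hypothetical helper: remembers the first benchmark result in a small text file. */
final class BenchmarkCache {
    private BenchmarkCache() {}

    /**
     * Returns the cached value if cacheFile holds one; otherwise runs the
     * benchmark once and stores its result for later runs.
     */
    static double loadOrMeasure(Path cacheFile, DoubleSupplier benchmark) {
        try {
            if (Files.isRegularFile(cacheFile)) {
                String text = new String(Files.readAllBytes(cacheFile), StandardCharsets.UTF_8).trim();
                try {
                    return Double.parseDouble(text);
                } catch (NumberFormatException ignored) {
                    // Unreadable cache content: fall through and measure again.
                }
            }
            double result = benchmark.getAsDouble();
            if (result > 0) {
                // The benchmark methods above return 0 on failure; do not cache that.
                Files.write(cacheFile, Double.toString(result).getBytes(StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call could look like BenchmarkCache.loadOrMeasure(Paths.get("disk-speed.txt"), () -> benchTransferSlowDC(testCount)), where the file name is again only a placeholder.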
We're using Apache Spark for processing. We have several steps where we use collect() to turn a JavaRDD into a list so that we can operate on the list, but we want to avoid doing this because it brings everything back to the driver. We end up running out of memory because we are processing anywhere from 5 million to 200 million records. Here's an example of what we have so far.
private InputStream createCSVObject(JavaRDD<Object[]> args) {
System.out.println("inside createCSVObject");
try {
StringBuilder value = new StringBuilder(CHUNK_SIZE);
args.collect().forEach(i -> {
value.append(i[0].toString());
for (int j = 1; j < i.length; ++j) {
value.append("," + i[j]);
}
value.append("\n");
});
System.out.println("Out of createCSVObject for loops");
byte[] strBytes = value.toString().getBytes();
InputStream myInputStream = new ByteArrayInputStream(strBytes);
return (myInputStream);
} catch (Exception e) {
System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
return null;
}
}
I've searched for this over and over across SO and Google, and can't come up with anything definitive. Does anyone have any ideas?
Note: the problem is the collect() call at args.collect().
EDIT:
After looking into the proposed answer below, we devised a simple proof of concept for it, and what we came up with completes one iteration of the loop every 40 s. The logic is not complex, so why is it so slow?
System.out.println("inside createCSVObject");
try {
StringBuilder value = new StringBuilder();
System.out.println("args length " + args.toLocalIterator().next().length);
while (args.toLocalIterator().hasNext()) {
Object[] objects = args.toLocalIterator().next();
System.out.println("Inside iterator");
value.append(objects[0].toString());
for (int j = 1; j < objects.length; ++j) {
value.append("," + objects[j]);
}
value.append("\n");
}
System.out.println("Out of createCSVObject for loops");
byte[] strBytes = value.toString().getBytes();
InputStream myInputStream = new ByteArrayInputStream(strBytes);
return (myInputStream);
} catch (Exception e) {
System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
e.printStackTrace();
return null;
}
You can use JavaRDD.toLocalIterator() to iterate through the entire RDD on the driver without collecting it all into a list. Instead, it brings each partition over to the driver one at a time, so it doesn't use more memory than the largest partition requires (see the documentation).
Obviously, in the example you've given, you still have the problem that you're collecting everything into a massive byte array, which will still use quite a lot of memory. Instead, you could write a custom InputStream class that wraps an Iterator (as returned by toLocalIterator) and buffers only one element at a time, calling next() on the iterator only when InputStream.read() demands more data.
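A minimal sketch of such an iterator-backed InputStream might look like the following. The class name CsvRowInputStream is made up for illustration, and the row formatting simply mirrors the comma-joined layout from the question; it is one possible shape, not the only way to do it.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

/** Hypothetical InputStream that renders one CSV row at a time from an iterator. */
final class CsvRowInputStream extends InputStream {
    private final Iterator<Object[]> rows;
    private byte[] current = new byte[0]; // bytes of the row currently being served
    private int pos = 0;                  // next unread index in current

    CsvRowInputStream(Iterator<Object[]> rows) {
        this.rows = rows;
    }

    /** Ensures unread bytes are buffered; returns false once the iterator is exhausted. */
    private boolean fill() {
        while (pos >= current.length) {
            if (!rows.hasNext()) {
                return false;
            }
            Object[] row = rows.next();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < row.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(row[i]);
            }
            sb.append('\n');
            current = sb.toString().getBytes(StandardCharsets.UTF_8);
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() {
        return fill() ? (current[pos++] & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (len == 0) return 0;
        if (!fill()) return -1;
        int n = Math.min(len, current.length - pos);
        System.arraycopy(current, pos, b, off, n);
        pos += n;
        return n;
    }
}
```

It would be constructed once, as in new CsvRowInputStream(args.toLocalIterator()). Note that the iterator should be obtained a single time: each call to toLocalIterator() creates a fresh iterator, which is probably why the loop in the EDIT above is so slow and never advances past the first element.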
I'm modifying the source code of H2 MVStore 1.4.191 so that it sleeps between file writes.
The big change is that the file is no longer written in one go, but in chunks of 2^16 bytes.
MVStore uses the Java NIO FileChannel and ByteBuffer classes to write its file. The problem is that the result differs from the original version. It seems that FileChannel adds space characters (0x20 in ASCII), more than 40 in a row. Or maybe it fails to remove these spaces, unlike the original version; I don't know.
I suppose it's due to how the file is written.
In the original version of H2, the method file.write(buffer, position), where file is a FileChannel object, returns the number of bytes written, and that number is sometimes smaller than the buffer size. In my version, that never happens.
Do you have any tips about ByteBuffer, FileChannel and my problem?
The original code calls the writeFully function a few times (it writes a header, a footer and the data):
int off = 0;
do {
int len = file.write(src, pos + off);
off += len;
} while (src.remaining() > 0);
src is the ByteBuffer and file is a FileChannelImpl from sun.nio.ch. The buffer can contain more than 50MB of data.
From this code, I developed a solution that splits the ByteBuffer into 2^16-sized buffers and writes them one by one, sleeping between each of them:
int off = 0;
byte[] buffer = src.array();
int size = src.array().length;
int chunkSize = 128;
List<byte[]> splittedBuffer = new ArrayList<byte[]>();
int i = 0;
while (i < size) {
int start = i;
int end = i + chunkSize;
if (end > size)
{
//if buffer size is not a multiple of 2^16, the last
//chunk will be smaller
end = size;
}
splittedBuffer.add(Arrays.copyOfRange(src.array(), start, end));
try {
Thread.sleep(5);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
i += chunkSize;
}
int offset = 0;
for (byte[] chunk : splittedBuffer) {
int len=file.write(ByteBuffer.wrap(chunk),pos+offset);
offset+=len;
try {
Thread.sleep(5);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
In the end, the problem may not be that whitespace is added, but that part of the data is written in the wrong place. I'm going to check it.
Ok,
The problem was that I used the size of the ByteBuffer's backing array to split it, instead of its limit, which is smaller (set by H2 during its processing).
Thanks for the help
Regards
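For anyone hitting the same issue, here is a rough sketch of a chunked write that respects the buffer's position and limit instead of the backing array length. The class and method names (ChunkedWriter, writeInChunks) and the parameters are illustrative assumptions, not H2 code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/** Hypothetical helper, not part of H2. */
final class ChunkedWriter {
    private ChunkedWriter() {}

    /**
     * Writes the remaining bytes of src (position..limit) to the channel,
     * starting at filePos, in slices of at most chunkSize bytes and pausing
     * pauseMillis between slices.
     */
    static void writeInChunks(FileChannel file, ByteBuffer src, long filePos,
                              int chunkSize, long pauseMillis)
            throws IOException, InterruptedException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long off = 0;
        while (src.hasRemaining()) {
            // slice() covers only position..limit, never the whole backing array
            ByteBuffer chunk = src.slice();
            chunk.limit(Math.min(chunkSize, chunk.remaining()));
            int chunkLength = chunk.limit();
            // write() may write fewer bytes than requested, so loop until the slice is drained
            while (chunk.hasRemaining()) {
                off += file.write(chunk, filePos + off);
            }
            src.position(src.position() + chunkLength);
            if (pauseMillis > 0 && src.hasRemaining()) {
                Thread.sleep(pauseMillis);
            }
        }
    }
}
```

Because it never calls array(), this also works for direct buffers and for buffers whose limit is smaller than their capacity.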
I have a problem when a user uploads large files (> 1 GB) using the flow.js library: it creates hundreds of thousands of small chunk files (e.g. 100 KB each) inside a temporary directory, but merging them into a single file fails with an OutOfMemoryError. This does not happen when the file is under 1 GB. I know it sounds tedious and you will probably suggest that I increase -Xmx in my container, but I want another angle besides that.
Here is my code
private void mergeFile(String identifier, int totalFile, String outputFile) throws AppException{
File[] fileDatas = new File[totalFile]; //we know the size of file here and create specific amount of the array
byte fileContents[] = null;
int totalFileSize = 0;
int filePartUploadSize = 0;
int tempFileSize = 0;
//I'm creating array of file and append the length
for (int i = 0; i < totalFile; i++) {
fileDatas[i] = new File(identifier + "." + (i + 1)); //indentifier is the name of the file
totalFileSize += fileDatas[i].length();
}
try {
fileContents = new byte[totalFileSize];
InputStream inStream;
for (int j = 0; j < totalFile; j++) {
inStream = new BufferedInputStream(new FileInputStream(fileDatas[j]));
filePartUploadSize = (int) fileDatas[j].length();
inStream.read(fileContents, tempFileSize, filePartUploadSize);
tempFileSize += filePartUploadSize;
inStream.close();
}
} catch (FileNotFoundException ex) {
throw new AppException(AppExceptionCode.FILE_NOT_FOUND);
} catch (IOException ex) {
throw new AppException(AppExceptionCode.ERROR_ON_MERGE_FILE);
} finally {
write(fileContents, outputFile);
for (int l = 0; l < totalFile; l++) {
fileDatas[l].delete();
}
}
}
Please point out what is inefficient about this method. Once again, only large files cannot be merged using this method; smaller ones (< 1 GB) cause no problem at all.
I would appreciate it if you did not suggest increasing the heap memory, but instead showed me the fundamental error of this method.
Thanks
It's unnecessary to allocate the entire file size in memory by declaring a byte array of the entire size. Building the concatenated file in memory in general is totally unnecessary.
Just open an OutputStream for your target file, then read each file you are combining as an InputStream and write its bytes to the OutputStream, closing each input as you finish with it. When you're done with them all, close the output file. Total memory use will be a few thousand bytes for the buffer.
Also, don't do I/O operations in a finally block (except closing resources).
Here is a rough example you can play with.
ArrayList<File> files = new ArrayList<>();// put your files here
File output = new File("yourfilename");
BufferedOutputStream boss = null;
try
{
boss = new BufferedOutputStream(new FileOutputStream(output));
for (File file : files)
{
BufferedInputStream bis = null;
try
{
bis = new BufferedInputStream(new FileInputStream(file));
int data;
// read() returns -1 at end of stream; stop there so -1 is never written to the output
while ((data = bis.read()) != -1)
{
boss.write(data);
}
}
catch (Exception e)
{
//do error handling stuff, log it maybe?
}
finally
{
try
{
bis.close();//do this in a try catch just in case
}
catch (Exception e)
{
//handle this
}
}
}
} catch (Exception e)
{
//handle this
}
finally
{
try
{
boss.close();
}
catch (Exception e) {
//handle this
}
}
... show me the fundamental error of this method
The implementation flaw is that you are creating a byte array (fileContents) whose size is the total file size. If the total file size is too big, that will cause an OOME. Inevitably.
Solution - don't do that! Instead "stream" the file by reading from the "chunk" files and writing to the final file using a modest sized buffer.
There are other problems with your code too. For instance, it could leak file descriptors because you do not ensure that inStream is closed under all circumstances. Read up on the "try-with-resources" construct.
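As a rough illustration of the streaming approach combined with try-with-resources, something along these lines should work. The class name ChunkMerger and the 8 KB buffer size are arbitrary choices for this sketch, and AppException handling from the question is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical replacement for mergeFile(): streams the parts instead of buffering them all. */
final class ChunkMerger {
    private ChunkMerger() {}

    /** Appends every part, in list order, to target, then deletes the parts. */
    static void merge(List<Path> parts, Path target) throws IOException {
        byte[] buffer = new byte[8192]; // the only memory needed, regardless of total size
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                // try-with-resources closes each input even if copying fails
                try (InputStream in = Files.newInputStream(part)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
        // Delete the chunk files only after the merged file was written successfully.
        for (Path part : parts) {
            Files.delete(part);
        }
    }
}
```

Deleting the parts after the try block, rather than in a finally block, means a failed merge leaves the chunks in place so the merge can be retried.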
I'm trying to create a java program that downloads certain asset files from an FTP server to a local file. Because my (free) FTP server doesn't support file sizes over a few megabytes, I decided to split up the files when they are uploaded and recombine them when the program downloads them. This works, but it is rather slow, because for each file, it has to get the InputStream, which takes some time.
The FTP server I use has a way to download the files without actually logging into the server, so I'm using this code to get the InputStream:
private static final InputStream getInputStream(String file) throws IOException {
return new URL("http://site.website.com/path/" + file).openStream();
}
To get the InputStream of a part of the asset file I'm using this code:
public static InputStream getAssetInputStream(String asset, int num) throws IOException, FTPException {
try {
return getInputStream("assets/" + asset + "_" + num + ".raf");
} catch (Exception e) {
// error handling
}
}
Because the getAssetInputStream(String, int) method takes some time to run (especially if the file size is more than a megabyte), I decided to make the code that actually downloads the file multi-threaded. Here is where my problem lies.
final Map<Integer, Boolean> done = new HashMap<Integer, Boolean>();
final Map<Integer, byte[]> parts = new HashMap<Integer, byte[]>();
for (int i = 0; i < numParts; i++) {
final int part = i;
done.put(part, false);
new Thread(new Runnable() {
@Override
public void run() {
try {
InputStream is = FTP.getAssetInputStream(asset, part);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
int len = 0;
while ((len = is.read(buf)) > 0) {
baos.write(buf, 0, len);
curDownload.addAndGet(len);
totAssets.addAndGet(len);
}
parts.put(part, baos.toByteArray());
done.put(part, true);
} catch (IOException e) {
// error handling
} catch (FTPException e) {
// error handling
}
}
}, "Download-" + asset + "-" + i).start();
}
while (done.values().contains(false)) {
try {
Thread.sleep(100);
} catch(InterruptedException e) {
e.printStackTrace();
}
}
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
FileOutputStream fos = new FileOutputStream(assetFile);
for (int i = 0; i < numParts; i++) {
fos.write(parts.get(i));
}
fos.close();
This code works, but not always. When I run it on my desktop computer, it works almost always. Not 100% of the time, but often it works just fine. On my laptop, which has a far worse internet connection, it almost never works. The result is a file that is incomplete. Sometimes, it downloads 50% of the file. Sometimes, it downloads 90% of the file, it differs every time.
Now, if I replace the .start() by .run(), the code works just fine, 100% of the time, even on my laptop. It is, however, incredibly slow, so I'd rather not use .run().
Is there a way I could change my code so it does work multi-threaded? Any help will be appreciated.
Firstly, get your FTP server replaced, there are plenty of free FTP servers that support arbitrary file size serving with additional features, but I digress...
Your code seems to have many unrelated problems that could potentially all cause the behavior you are seeing, addressed below:
You have race conditions caused by unprotected/unsynchronized access to the done and parts maps from multiple threads. This could cause data corruption and loss of synchronization for these variables between threads, potentially causing done.values().contains(false) to return the wrong answer.
You are calling done.values().contains() repeatedly at a high frequency. Whilst the javadoc doesn't explicitly state, a hash map likely traverses every value in a O(n) fashion to check if a given map contains a value. Coupled with the fact that other threads are modifying the map, you'll get undefined behavior. According to values() javadoc:
If the map is modified while an iteration over the collection is in progress (except through the iterator's own remove operation), the results of the iteration are undefined.
You are calling new URL("http://site.website.com/path/" + file).openStream(); but stating you are using FTP. The http:// in the link defines the protocol openStream() uses, and http:// is not ftp://. Not sure whether this is a typo or whether you meant HTTP (or have an HTTP server serving identical files).
Any thread raising any type of Exception will cause the code to fail, given that not all parts will have "completed" (based on your busy-wait loop design). Granted, you may have redacted some other logic that guards against this, but otherwise this is a potential problem with the code.
You aren't closing any streams that you've opened. This could mean that the underlying socket itself is also left open. Not only does this constitute resource leakage, if the server itself has some sort of maximum number of simultaneous connection limit, you are only causing new connections to fail because the old, completed transfers are not closed.
Based on the issues above, I propose moving the download logic into a Callable task and running them through an ExecutorService as follows:
LinkedList<Callable<byte[]>> tasksToExecute = new LinkedList<>();
// Populate tasks to run
for(int i = 0; i < numParts; i++){
final int part = i;
// Lambda that downloads one part and returns its bytes (null on failure)
tasksToExecute.add(() -> {
InputStream is = null;
try{
is = FTP.getAssetInputStream(asset, part);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
int len = 0;
while((len = is.read(buf)) > 0){
baos.write(buf, 0, len);
curDownload.addAndGet(len);
totAssets.addAndGet(len);
}
return baos.toByteArray();
}catch(IOException e){
// handle exception
}catch(FTPException e){
// handle exception
}finally{
if(is != null){
try{
is.close();
}catch(IOException ignored){}
}
}
return null;
});
}
// Retrieve an ExecutorService instance, note the use of work stealing pool is Java 8 only
// This can be substituted for newFixedThreadPool(nThreads) for Java < 8 as well for tight control over number of simultaneous links
ExecutorService executor = Executors.newWorkStealingPool(4);
// Tells the executor to execute all the tasks and give us the results
List<Future<byte[]>> resultFutures = executor.invokeAll(tasksToExecute);
// Populates the file
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
try(FileOutputStream fos = new FileOutputStream(assetFile)){
// Iterate through the futures, writing them to file in order
for(Future<byte[]> result : resultFutures){
byte[] partData = result.get();
if(partData == null){
// exception occured during downloading this part, handle appropriately
}else{
fos.write(partData);
}
}
}catch(IOException | InterruptedException | ExecutionException ex){
// handle exception
}
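Using the executor service, you further optimize your multi-threading scenario since threads are reused, which saves on thread creation costs. Note that invokeAll() returns only once every task has completed, so the output file is written after all parts have been downloaded; to start writing as soon as pieces are available (in order), submit the tasks individually with submit() and call get() on the futures in order.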
As mentioned, there could be the case where too many simultaneous links causes the server to reject connections (or even more dangerously, write an EOF to make you think the part was downloaded). In this case, the number of worker threads can be tweaked by newFixedThreadPool(nThreads) to ensure at any given time, only nThreads amount of downloads can happen concurrently.
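To make the bounded-parallelism idea concrete, here is a small self-contained sketch that uses submit() with a fixed thread pool and writes the parts in order as each one finishes. The class name OrderedPartWriter and its parameters are hypothetical; the real download logic would go into the Callable tasks.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs part downloads in parallel and writes them in order. */
final class OrderedPartWriter {
    private OrderedPartWriter() {}

    /**
     * Runs at most maxParallel tasks at a time. Each part is written to sink
     * as soon as it and all earlier parts are done. A failed task surfaces as
     * an ExecutionException instead of silently producing a short file.
     */
    static void downloadAndWrite(List<Callable<byte[]>> partTasks, int maxParallel, OutputStream sink)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Callable<byte[]> task : partTasks) {
                futures.add(executor.submit(task)); // returns immediately
            }
            for (Future<byte[]> future : futures) {
                sink.write(future.get()); // blocks only until this particular part is ready
            }
        } finally {
            executor.shutdownNow(); // also cancels outstanding downloads after a failure
        }
    }
}
```

Letting the tasks throw, rather than returning null, keeps a failed part from going unnoticed; whether to retry or abort is then a decision for the caller.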
I'm working on a homework assignment that has the purpose of showing how increasing the number of threads can help or hurt a program's performance. The basic idea is to thread individual requests for data from a website, then determine how long it takes to perform all the queries when one runs n queries simultaneously.
I think I have the threading and the clocking done properly, but something odd is going on with the requests. I am using java.net.URLConnection to connect to the databases. My first three thousand or so connections will succeed and load. Then, several hundred or so calls fail without any evidence of Java having tried for the specified timeout period.
The code I run in a thread is as follows:
/* This code to get the contents from an URL was adapted from a
* StackOverflow question found at http://goo.gl/QPqR4 .
*/
private static String loadContent(String address) throws Exception {
String toReturn = "";
try {
URL url = new URL(address);
URLConnection con = url.openConnection();
con.setConnectTimeout(5000);
con.setReadTimeout(5000);
InputStream stream = con.getInputStream();
Reader r = new InputStreamReader(stream, "ISO-8859-1");
while (true) {
int ch = r.read();
if (ch < 0) {
break;
}
toReturn += (char) ch;
}
r.close();
stream.close();
} catch (Exception e) {
System.out.println(address + ": " + e.getMessage());
throw e;
}
return toReturn;
}
The code for running the threads is as follows. The NormalPerformance class is one I wrote to simplify calculating the mean and variance of a series of observations.
/* This code is patterned after code provided by my professor.
*/
private static NormalPerformance performExperiment(int threads, int runs)
throws Exception
{
NormalPerformance toReturn = new NormalPerformance();
for (int i = 0; i < runs; i++) {
final List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
for (int j = 0; j < URLS.length; j++) {
final String url = URLS[i];
tasks.add(new Callable<Void>() {
public Void call() throws Exception {
loadContent(url);
return null;
}
});
}
long start = System.nanoTime();
final ExecutorService executorPool = Executors.newFixedThreadPool(threads);
executorPool.invokeAll(tasks);
executorPool.shutdown();
double time = (System.nanoTime() - start) / 1000000000.;
toReturn.addObservation(time);
System.out.println("" + threads + " " + (i + 1) + ": " + time);
}
return toReturn;
}
Why am I seeing this odd pattern of success and failure? Even stranger, there are times when killing the program and restarting does nothing to stop the run of failures. I've tried things like forcing threads to sleep, calling System.gc(), and increasing the connection and reading timeout values, but none of these, alone or combined, have fixed this.
How can I guarantee that my connections have the best chance possible of connecting?
Environment:
Windows 7 64-bit,
Eclipse Juno 64-bit,
JRE 7