Avoid using collect() on a large dataset - Java

We're using Apache Spark for processing. We have several steps where we use collect() to turn a JavaRDD into a list so we can operate on it, but we want to avoid doing this because it brings everything back to the driver. We end up running out of memory because we are processing anywhere from 5 million to 200 million records. Here's an example of what we have so far.
private InputStream createCSVObject(JavaRDD<Object[]> args) {
    System.out.println("inside createCSVObject");
    try {
        StringBuilder value = new StringBuilder(CHUNK_SIZE);
        args.collect().forEach(i -> {
            value.append(i[0].toString());
            for (int j = 1; j < i.length; ++j) {
                value.append("," + i[j]);
            }
            value.append("\n");
        });
        System.out.println("Out of createCSVObject for loops");
        byte[] strBytes = value.toString().getBytes();
        InputStream myInputStream = new ByteArrayInputStream(strBytes);
        return (myInputStream);
    } catch (Exception e) {
        System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
        return null;
    }
}
I've searched for this over and over across SO and Google and can't come up with anything definitive. Does anyone have any ideas?
Note the collect() call at args.collect().
EDIT:
After looking into the proposed answer below, we devised a simple proof of concept for it, and what we came up with takes roughly 40 seconds per iteration. The logic is not complex, so why is it so slow?
System.out.println("inside createCSVObject");
try {
    StringBuilder value = new StringBuilder();
    System.out.println("args length " + args.toLocalIterator().next().length);
    while (args.toLocalIterator().hasNext()) {
        Object[] objects = args.toLocalIterator().next();
        System.out.println("Inside iterator");
        value.append(objects[0].toString());
        for (int j = 1; j < objects.length; ++j) {
            value.append("," + objects[j]);
        }
        value.append("\n");
    }
    System.out.println("Out of createCSVObject for loops");
    byte[] strBytes = value.toString().getBytes();
    InputStream myInputStream = new ByteArrayInputStream(strBytes);
    return (myInputStream);
} catch (Exception e) {
    System.err.println(String.format("ERROR: FileWriterService - writeFile: %s", e.getMessage()));
    e.printStackTrace();
    return null;
}

You can use JavaRDD.toLocalIterator() to iterate through the entire RDD on the driver without collecting it all into a list. Instead, it brings each partition over to the driver one at a time, so it doesn't use more memory than the size of the largest partition (documentation).
Obviously, in the example you've given, you still have the problem that you're collecting everything into a massive byte array, which will still use quite a lot of memory. Instead, you could write a custom InputStream class that wraps an Iterator (as returned by toLocalIterator) and only buffers one element at a time, calling next() on the iterator only when InputStream.read() demands more data.
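A minimal sketch of that idea (not tested against the asker's setup; the class name IteratorBackedInputStream and the single-byte read() are just illustrative — a real version would also override read(byte[], int, int) or be wrapped in a BufferedInputStream):

class IteratorBackedInputStream extends InputStream {
    // Wraps an Iterator<Object[]> so rows are only pulled and CSV-encoded
    // when the consumer of the stream actually asks for more bytes.
    private final Iterator<Object[]> rows;
    private byte[] current = new byte[0];
    private int pos = 0;

    IteratorBackedInputStream(Iterator<Object[]> rows) {
        this.rows = rows;
    }

    @Override
    public int read() {
        while (pos >= current.length) {      // current line exhausted
            if (!rows.hasNext()) {
                return -1;                   // no more rows: end of stream
            }
            Object[] row = rows.next();
            StringBuilder line = new StringBuilder();
            line.append(row[0]);
            for (int j = 1; j < row.length; ++j) {
                line.append(',').append(row[j]);
            }
            line.append('\n');
            current = line.toString().getBytes();
            pos = 0;
        }
        return current[pos++] & 0xFF;
    }
}

The method in the question could then return new IteratorBackedInputStream(args.toLocalIterator()) instead of building one giant StringBuilder.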

Related

Spring endpoint that provides data in chunks

In my team, we have an issue with a specific endpoint which, when called with certain parameters, provides a huge JSON in chunks. So, for example, if the JSON had 1,000 rows, after about 30 seconds of opening the URL in our browser (it's a GET endpoint) we get 100 rows, then wait a few more seconds and get the next 200, and so on until the JSON is exhausted. This is a problem for us because our application times out before retrieving the whole JSON. We want to emulate the behavior of the endpoint with an example endpoint of our own, for debugging purposes.
So far, the following is what I have. For simplicity, I'm not even reading a JSON, just a randomly generated string. The logs show me that I'm reading the data a few bytes at a time, writing it and flushing the OutputStream. The crucial difference is that my browser (or Postman) shows me the data all at once at the very end, not in chunks. Is there anything I can do so that I can see the data coming back in chunks?
private static final int readBufSize = 10;
private static final int generatedStringSize = readBufSize * 10000;

@GetMapping(path = "/v2/payload/mocklargepayload")
public void simulateLargePayload(HttpServletResponse response) {
    try (InputStream inputStream = IOUtils.toInputStream(RandomStringUtils.randomAlphanumeric(generatedStringSize));
         OutputStream outputStream = response.getOutputStream()) {
        final byte[] buffer = new byte[readBufSize];
        for (int i = 0; i < generatedStringSize; i += readBufSize) {
            inputStream.read(buffer, 0, readBufSize - 1);
            buffer[buffer.length - 1] = '\n';
            log.info("Read bytes: {}", buffer);
            outputStream.write(buffer);
            log.info("Wrote bytes {}", buffer);
            Thread.sleep(500);
            log.info("Flushing stream");
            outputStream.flush();
        }
    } catch (IOException | InterruptedException e) {
        log.error("Received exception: {}", e.getClass().getSimpleName());
    }
}
Your endpoint should return a "Content-Length" header specifying the total size of the data the endpoint will return. That tells your client how much data to expect. Also, you can read the data chunk by chunk as it becomes available. I had the reverse problem: I was writing a large upload into my endpoint (POST), and the endpoint was reading it faster than I was writing, so at some point it had consumed everything available so far and stopped reading, thinking it was done. So I wrote this code, which you can implement the same way on your client side:
@PostMapping
public ResponseEntity<String> uploadTest(HttpServletRequest request) {
    try {
        String lengthStr = request.getHeader("content-length");
        int length = TextUtils.parseStringToInt(lengthStr, -1);
        if (length > 0) {
            byte[] buff = new byte[length];
            ServletInputStream sis = request.getInputStream();
            int counter = 0;
            while (counter < length) {
                int chunkLength = sis.available();
                byte[] chunk = new byte[chunkLength];
                sis.read(chunk);
                for (int i = counter, j = 0; i < counter + chunkLength; i++, j++) {
                    buff[i] = chunk[j];
                }
                counter += chunkLength;
                if (counter < length) {
                    TimeUtils.sleepFor(5, TimeUnit.MILLISECONDS);
                }
            }
            Files.write(Paths.get("C:\\Michael\\tmp\\testPic.jpg"), buff);
        }
    } catch (Exception e) {
        System.out.println(TextUtils.getStacktrace(e));
    }
    return ResponseEntity.ok("Success");
}
Also, I wrote a general read/write feature for the same problem (again for the server side), but you can implement the same logic on the client side as well. The feature reads the data in chunks as it becomes available. It comes with the open-source library MgntUtils (written and maintained by me); see the class WebUtils. The library with source code and Javadoc is available on GitHub here. The Javadoc is here. It is also available as a Maven artifact here.

Java benchmark disk speed

I'm trying to find a reliable method of measuring disk read speed, but I'm failing to take the cache out of the equation.
In How to measure Disk Speed in Java for Benchmarking there is an answer from simgineer with a utility for exactly this, but for some reason I failed to replicate its behaviour, and running the utility does not yield anything precise either (for reads).
Following a suggestion in a different answer, setting the test file to something bigger than main memory seems to work, but I cannot afford to spend a whole four minutes waiting for the system to allocate a 130 GB file. (Not writing anything into the file results in a sparse file and returns bogus times.)
A sufficient file size seems to be somewhere between Runtime.getRuntime().maxMemory() and Runtime.getRuntime().maxMemory() * 2.
The source code of my current solution:
File file = new File(false ? "D:/work/bench.dat" : "./work/bench.dat");
RandomAccessFile wFile = null, rFile = null;
try {
    System.out.println("Allocating test file ...");
    int blockSize = 1024 * 1024;
    long size = false ? 10L * 1024L * (long) blockSize : Runtime.getRuntime().maxMemory() * 2;
    byte[] block = new byte[blockSize];
    for (int i = 0; i < blockSize; i++) {
        if (i % 2 == 0) block[i] = (byte) (i & 0xFF);
    }
    System.out.println("Writing ...");
    wFile = new RandomAccessFile(file, "rw");
    wFile.setLength(size);
    for (long i = 0; i < size - blockSize; i += blockSize) {
        wFile.write(block);
    }
    wFile.close();
    System.out.println("Running read test ...");
    long t0 = System.nanoTime();
    rFile = new RandomAccessFile(file, "r");
    int blockCount = (int) (size / blockSize) - 1;
    Random rnd = new Random();
    for (int i = 0; i < testCount; i++) {
        rFile.seek((long) rnd.nextInt(blockCount) * (long) blockSize);
        rFile.readFully(block, 0, blockSize);
    }
    rFile.close();
    long t1 = System.nanoTime();
    double readB = ((double) testCount * (double) blockSize);
    double timeNs = (double) (t1 - t0);
    return (readB / (1024 * 1024)) / (timeNs / (1000 * 1000 * 1000));
} catch (Exception e) {
    Logger.logError("Failed to benchmark drive speed!", e);
    return 0;
} finally {
    if (wFile != null) { try { wFile.close(); } catch (IOException e) {} }
    if (rFile != null) { try { rFile.close(); } catch (IOException e) {} }
    if (file.exists()) { file.delete(); }
}
I was hoping for a benchmark that finishes within seconds (caching the result for subsequent runs), with only the first execution being a bit slower.
I could technically crawl the filesystem and benchmark reads on files that are already on the drive, but that smells like a lot of undefined behaviour, and firewalls are not happy about it either.
Any other options left? (platform dependent libraries are off the table)
In the end I decided to solve the problem by scouring the local work folder for files and loading those, hoping we packaged enough with the application to reach the drive's rated speed. In my current test case the answer is luckily yes, but there are no guarantees, so I keep the approach from the question as a backup plan.
This is not exactly a perfect solution, but it somewhat works, reaching the rated speed with about 2000 test files. Bear in mind that this test cannot be rerun with the same results, as all the test files from the previous execution are now probably cached.
You can always call flushmem (https://chadaustin.me/flushmem/) by Chad Austin, but that takes about as much time to execute as the original approach, so I would advise simply caching the result of the first run and hoping for the best.
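A minimal sketch of that caching idea, assuming a small properties file next to the application (the file name, the property key, and the call to benchTransferSlowDC from the code below are just illustrative):

// Hypothetical helper: measure once, store the result, and reuse it on later runs.
static double cachedOrMeasuredReadSpeed(int testCount) throws IOException {
    File cache = new File("./work/disk-bench.properties");
    Properties props = new Properties();
    if (cache.exists()) {
        try (FileReader r = new FileReader(cache)) {
            props.load(r);
        }
        String cached = props.getProperty("readMBps");
        if (cached != null) {
            return Double.parseDouble(cached); // first-run measurement, taken before the OS cache got warm
        }
    }
    double measured = benchTransferSlowDC(testCount); // the slow, full benchmark
    props.setProperty("readMBps", Double.toString(measured));
    try (FileWriter w = new FileWriter(cache)) {
        props.store(w, "first-run disk read benchmark");
    }
    return measured;
}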
Used code:
final int MIN_FILE_SIZE = 1024 * 10;
final int MAX_READ = 1024 * 1024 * 50;
final int FILE_COUNT_FRACTION = 4;

// Scour the location of the runtime for any usable files.
ArrayList<File> found = new ArrayList<>();
ArrayList<File> queue = new ArrayList<>();
queue.add(new File("./"));
while (!queue.isEmpty() && found.size() < testCount) {
    File tested = queue.remove(queue.size() - 1);
    if (tested.isDirectory()) {
        queue.addAll(Arrays.asList(tested.listFiles()));
    } else if (tested.length() > MIN_FILE_SIZE) {
        found.add(tested);
    }
}

// If amount of found files is not sufficient, perform test with new file.
if (found.size() < testCount / FILE_COUNT_FRACTION) {
    Logger.logInfo("Disk to CPU transfer benchmark failed to find "
            + "sufficient amount of files to read, slow version "
            + "will be performed!", found.size());
    return benchTransferSlowDC(testCount);
}

System.out.println(found.size());
byte[] block = new byte[MAX_READ];
Collections.shuffle(found);
RandomAccessFile raf = null;
long readB = 0;
try {
    long t0 = System.nanoTime();
    for (int i = 0; i < Math.min(found.size(), testCount); i++) {
        File file = found.get(i);
        int size = (int) Math.min(file.length(), MAX_READ);
        raf = new RandomAccessFile(file, "r");
        raf.read(block, 0, size);
        raf.close();
        readB += size;
    }
    long t1 = System.nanoTime();
    return ((double) readB / (1024 * 1024)) / ((double) (t1 - t0) / (1000 * 1000 * 1000));
    //return (double)(t1-t0) / (double)readB;
} catch (Exception e) {
    Logger.logError("Failed to benchmark drive speed!", e);
    if (raf != null) try { raf.close(); } catch (Exception ex) {}
    return 0;
}

Java OutOfMemoryError while merge large file parts from chunked files

I have a problem when the user uploads large files (> 1 GB) (I'm using the flow.js library): it creates hundreds of thousands of small chunked files (e.g. 100 KB each) inside a temporary directory, but fails to merge them into a single file due to an OutOfMemoryError. This does not happen when the file is under 1 GB. I know it sounds tedious and you'll probably suggest increasing the -Xmx for my container, but I want another angle besides that.
Here is my code
private void mergeFile(String identifier, int totalFile, String outputFile) throws AppException {
    File[] fileDatas = new File[totalFile]; // we know the number of files here and create an array of that size
    byte fileContents[] = null;
    int totalFileSize = 0;
    int filePartUploadSize = 0;
    int tempFileSize = 0;
    // I'm creating the array of files and adding up their lengths
    for (int i = 0; i < totalFile; i++) {
        fileDatas[i] = new File(identifier + "." + (i + 1)); // identifier is the name of the file
        totalFileSize += fileDatas[i].length();
    }
    try {
        fileContents = new byte[totalFileSize];
        InputStream inStream;
        for (int j = 0; j < totalFile; j++) {
            inStream = new BufferedInputStream(new FileInputStream(fileDatas[j]));
            filePartUploadSize = (int) fileDatas[j].length();
            inStream.read(fileContents, tempFileSize, filePartUploadSize);
            tempFileSize += filePartUploadSize;
            inStream.close();
        }
    } catch (FileNotFoundException ex) {
        throw new AppException(AppExceptionCode.FILE_NOT_FOUND);
    } catch (IOException ex) {
        throw new AppException(AppExceptionCode.ERROR_ON_MERGE_FILE);
    } finally {
        write(fileContents, outputFile);
        for (int l = 0; l < totalFile; l++) {
            fileDatas[l].delete();
        }
    }
}
Please show me the inefficiency of this method. Once again, only large files cannot be merged with this method; smaller ones (< 1 GB) are no problem at all.
I'd appreciate it if you did not suggest increasing the heap memory, and instead showed me the fundamental error of this method. Thanks.
It's unnecessary to allocate the entire file size in memory by declaring a byte array of the entire size. Building the concatenated file in memory in general is totally unnecessary.
Just open an OutputStream for your target file, and then, for each file you are combining, read it as an InputStream and write its bytes to the OutputStream, closing each input as you finish. When you're done with them all, close the output file. Total memory use will be a few thousand bytes for the buffer.
Also, don't do I/O operations in a finally block (other than closing resources).
Here is a rough example you can play with.
ArrayList<File> files = new ArrayList<>(); // put your files here
File output = new File("yourfilename");
BufferedOutputStream boss = null;
try
{
    boss = new BufferedOutputStream(new FileOutputStream(output));
    for (File file : files)
    {
        BufferedInputStream bis = null;
        try
        {
            bis = new BufferedInputStream(new FileInputStream(file));
            int data;
            while ((data = bis.read()) >= 0) // stop before writing the end-of-stream marker
            {
                boss.write(data);
            }
        }
        catch (Exception e)
        {
            // do error handling stuff, log it maybe?
        }
        finally
        {
            try
            {
                bis.close(); // do this in a try catch just in case
            }
            catch (Exception e)
            {
                // handle this
            }
        }
    }
} catch (Exception e)
{
    // handle this
}
finally
{
    try
    {
        boss.close();
    }
    catch (Exception e) {
        // handle this
    }
}
... show me the fundamental error of this method
The implementation flaw is that you are creating a byte array (fileContents) whose size is the total file size. If the total file size is too big, that will cause an OOME. Inevitably.
Solution - don't do that! Instead, "stream" the file by reading from the "chunk" files and writing to the final file using a modest-sized buffer.
There are other problems with your code too. For instance, it could leak file descriptors because you do not ensure that inStream is closed under all circumstances. Read up on the "try-with-resources" construct.
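A minimal sketch of that streaming approach using try-with-resources; it mirrors the names from the question (AppException, AppExceptionCode, the identifier + "." + (i + 1) chunk naming), but it is illustrative rather than the original implementation:

private void mergeFile(String identifier, int totalFile, String outputFile) throws AppException {
    byte[] buffer = new byte[8192]; // modest fixed-size buffer instead of the whole file in memory
    try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile))) {
        for (int i = 0; i < totalFile; i++) {
            File part = new File(identifier + "." + (i + 1));
            // try-with-resources guarantees each chunk's stream is closed, even on errors
            try (InputStream in = new BufferedInputStream(new FileInputStream(part))) {
                int len;
                while ((len = in.read(buffer)) > 0) {
                    out.write(buffer, 0, len);
                }
            }
            part.delete();
        }
    } catch (FileNotFoundException ex) {
        throw new AppException(AppExceptionCode.FILE_NOT_FOUND);
    } catch (IOException ex) {
        throw new AppException(AppExceptionCode.ERROR_ON_MERGE_FILE);
    }
}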

Inconsistent output from multithreaded FTP InputStreams

I'm trying to create a java program that downloads certain asset files from an FTP server to a local file. Because my (free) FTP server doesn't support file sizes over a few megabytes, I decided to split up the files when they are uploaded and recombine them when the program downloads them. This works, but it is rather slow, because for each file, it has to get the InputStream, which takes some time.
The FTP server I use has a way to download the files without actually logging into the server, so I'm using this code to get the InputStream:
private static final InputStream getInputStream(String file) throws IOException {
    return new URL("http://site.website.com/path/" + file).openStream();
}
To get the InputStream of a part of the asset file I'm using this code:
public static InputStream getAssetInputStream(String asset, int num) throws IOException, FTPException {
    try {
        return getInputStream("assets/" + asset + "_" + num + ".raf");
    } catch (Exception e) {
        // error handling
    }
}
Because the getAssetInputStream(String, int) method takes some time to run (especially if the file size is more than a megabyte), I decided to make the code that actually downloads the file multi-threaded. Here is where my problem lies.
final Map<Integer, Boolean> done = new HashMap<Integer, Boolean>();
final Map<Integer, byte[]> parts = new HashMap<Integer, byte[]>();
for (int i = 0; i < numParts; i++) {
    final int part = i;
    done.put(part, false);
    new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                InputStream is = FTP.getAssetInputStream(asset, part);
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
                int len = 0;
                while ((len = is.read(buf)) > 0) {
                    baos.write(buf, 0, len);
                    curDownload.addAndGet(len);
                    totAssets.addAndGet(len);
                }
                parts.put(part, baos.toByteArray());
                done.put(part, true);
            } catch (IOException e) {
                // error handling
            } catch (FTPException e) {
                // error handling
            }
        }
    }, "Download-" + asset + "-" + i).start();
}
while (done.values().contains(false)) {
    try {
        Thread.sleep(100);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
FileOutputStream fos = new FileOutputStream(assetFile);
for (int i = 0; i < numParts; i++) {
    fos.write(parts.get(i));
}
fos.close();
This code works, but not always. When I run it on my desktop computer, it works almost always. Not 100% of the time, but often it works just fine. On my laptop, which has a far worse internet connection, it almost never works. The result is an incomplete file. Sometimes it downloads 50% of the file, sometimes 90%; it differs every time.
Now, if I replace the .start() by .run(), the code works just fine, 100% of the time, even on my laptop. It is, however, incredibly slow, so I'd rather not use .run().
Is there a way I could change my code so it does work multi-threaded? Any help will be appreciated.
Firstly, get your FTP server replaced; there are plenty of free FTP servers that can serve files of arbitrary size and offer additional features, but I digress...
Your code seems to have many unrelated problems that could potentially all cause the behavior you are seeing, addressed below:
You have race conditions due to unprotected/unsynchronized access to the done and parts maps from multiple threads. This can cause data corruption and loss of synchronization of these variables between threads, potentially causing done.values().contains(false) to return true even when it shouldn't.
You are calling done.values().contains() repeatedly at a high frequency. Whilst the javadoc doesn't explicitly state it, a HashMap likely traverses every value in O(n) fashion to check whether the map contains a given value. Coupled with the fact that other threads are modifying the map, you'll get undefined behavior. According to the values() javadoc:
If the map is modified while an iteration over the collection is in progress (except through the iterator's own remove operation), the results of the iteration are undefined.
You are calling new URL("http://site.website.com/path/" + file).openStream(); but stating that you are using FTP. The http:// in the link defines the protocol openStream() uses, and http:// is not ftp://. Not sure if this is a typo or whether you really meant HTTP (or have an HTTP server serving identical files).
Any thread raising any type of exception will cause the code to fail (the busy-wait loop will never finish), given that not all parts will ever be marked "completed". Granted, you may have redacted some other logic that guards against this, but otherwise it is a potential problem with the code.
You aren't closing any of the streams that you've opened. This could mean that the underlying sockets are also left open. Not only is this a resource leak; if the server has some sort of limit on the maximum number of simultaneous connections, you are causing new connections to fail because the old, completed transfers are never closed.
Based on the issues above, I propose moving the download logic into a Callable task and running them through an ExecutorService as follows:
LinkedList<Callable<byte[]>> tasksToExecute = new LinkedList<>();
// Populate tasks to run
for (int i = 0; i < numParts; i++) {
    final int part = i;
    // Lambda task that downloads one part and returns its bytes
    tasksToExecute.add(() -> {
        InputStream is = null;
        try {
            is = FTP.getAssetInputStream(asset, part);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buf = new byte[DOWNLOAD_BUFFER_SIZE];
            int len = 0;
            while ((len = is.read(buf)) > 0) {
                baos.write(buf, 0, len);
                curDownload.addAndGet(len);
                totAssets.addAndGet(len);
            }
            return baos.toByteArray();
        } catch (IOException e) {
            // handle exception
        } catch (FTPException e) {
            // handle exception
        } finally {
            if (is != null) {
                try {
                    is.close();
                } catch (IOException ignored) {}
            }
        }
        return null;
    });
}
// Retrieve an ExecutorService instance; note that the work-stealing pool is Java 8 only.
// It can be substituted with newFixedThreadPool(nThreads) for Java < 8, or for tight control over the number of simultaneous connections.
ExecutorService executor = Executors.newWorkStealingPool(4);
// Populate the file
File assetFile = new File(dir, "assets/" + asset + ".raf");
assetFile.createNewFile();
try (FileOutputStream fos = new FileOutputStream(assetFile)) {
    // Tell the executor to execute all the tasks and give us the results
    List<Future<byte[]>> resultFutures = executor.invokeAll(tasksToExecute);
    // Iterate through the futures, writing them to the file in order
    for (Future<byte[]> result : resultFutures) {
        byte[] partData = result.get();
        if (partData == null) {
            // an exception occurred while downloading this part, handle appropriately
        } else {
            fos.write(partData);
        }
    }
} catch (IOException | InterruptedException | ExecutionException ex) {
    // handle exception
}
Using the executor service, you further optimize your multi-threading scenario, since the output file starts being written as soon as the pieces are available (in order), and the threads themselves are reused to save on thread-creation costs.
As mentioned, there could be cases where too many simultaneous connections cause the server to reject them (or, even more dangerously, write an EOF that makes you think the part was fully downloaded). In that case, the number of worker threads can be tweaked with newFixedThreadPool(nThreads) to ensure that only nThreads downloads happen concurrently at any given time.
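For example (the pool size of 4 here is just an illustrative choice, and the shutdown call is an addition the sketch above does not show):

// Cap concurrent downloads at a fixed number of worker threads instead of a work-stealing pool.
ExecutorService executor = Executors.newFixedThreadPool(4);
// ... invokeAll the tasks and write the file as above ...
// Release the pool's threads once all parts have been written.
executor.shutdown();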

How to count number of objects stored in a *.ser file

I'm trying to read all the objects stored in a *.ser file and store them in an array of objects. How can I get the number of objects stored in that file (so that I can declare the array to be number_of_objects long)?
I've checked the API and was unable to find a suitable function.
-edit-
A part of the code:
Ser[] objTest2 = new Ser[number_of_objects];
for (int i = 0; i < number_of_objects; i++) {
    objTest2[i] = (Ser) testOS2.readObject();
    objTest2[i].printIt();
}
What you want to look at is the ArrayList class.
It is basically a dynamically growing Array.
You can add items to it like so:
ArrayList list = new ArrayList();
list.add(someObject);
list.add(anotherBoject);
The list will grow as you add new items to it. So you don't have to know the size ahead of time.
If you need to get an array out of the List at the end, you can use the toArray() method of List.
Object[] arr = list.toArray(new Object[list.size()]);
Edit:
Here is a general implementation of what you need:
List<Ser> objTest2 = new ArrayList<Ser>();
while (testOS2.available() > 0) {
    Ser toAdd = (Ser) testOS2.readObject();
    toAdd.printIt();
    objTest2.add(toAdd);
}
I don't think available() is a reliable test for whether or not there are more bytes to read.
Years later, this post is still relevant. I was looking for a way to loop through a .ser file while deserializing each object, and to some extent Rohit Singh's post helped. This is my version of the same, though:
ArrayList<Profile> availableProfiles = new ArrayList<Profile>();
try {
    FileInputStream fileStream = new FileInputStream("profiles.ser");
    ObjectInputStream os = new ObjectInputStream(fileStream);
    Object profileObject = null;
    while ((profileObject = os.readObject()) != null) {
        Profile castObject = (Profile) profileObject;
        availableProfiles.add(castObject);
    }
    os.close();
} catch (Exception ex) {
    if (ex instanceof EOFException) {
        out.println("End of file reached!");
        out.println("Total profiles found is: " + availableProfiles.size());
    } else if (ex instanceof FileNotFoundException) {
        out.println("File not found! \n Answer the following to create your profile");
        createProfile();
    }
}
The most important part is the position of the while loop. In my version, the loop does not create a new FileInputStream or ObjectInputStream the way Singh's does. Doing that makes the ObjectInputStream read the .ser file afresh each time those two are created, and as a result you only add() one Profile object to the ArrayList (the first one that was serialized) each time the loop restarts.
Instead, we just keep calling readObject() in the loop until the end of the file is reached and it throws an EOFException, which ends the loop.
while (true)
{
    try
    {
        Employee e = (Employee) ois.readObject();
        System.out.println("successfully deserialized.........showing details of object.");
        e.display();
    }
    catch (Exception e)
    {
        if (e instanceof java.io.EOFException)
        {
            System.out.println("All objects read and displayed");
            break;
        }
        else
        {
            System.out.println("Some Exception Occured.");
            e.printStackTrace();
        }
    }
}
Just keep reading objects until you get EOFException. That's what it's for. And use a List instead of an array so you don't need the count in advance.
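A compact sketch of that advice, assuming the stream was written with an ObjectOutputStream, that Ser is the serialized type from the question, and that "objects.ser" is just an illustrative file name:

// Read every object until EOFException signals the end of the stream; no count needed up front.
List<Ser> objects = new ArrayList<>();
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("objects.ser"))) {
    while (true) {
        objects.add((Ser) in.readObject());
    }
} catch (EOFException endOfFile) {
    // normal termination: every object has been read
} catch (IOException | ClassNotFoundException e) {
    e.printStackTrace();
}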
