Uploading large gzipped data files to HDFS - java

I have a use case where I want to upload big gzipped text data files (~60 GB) to HDFS.
My code below takes about 2 hours to upload these files in chunks of 500 MB. The pseudocode follows. Can somebody help me reduce this time?
int fileFetchBuffer = 500000000;
System.out.println("file fetch buffer is: " + fileFetchBuffer);
int offset = 0;
int bytesRead = -1;
try {
fileStream = new FileInputStream (file);
if (fileName.endsWith(".gz")) {
stream = new GZIPInputStream(fileStream);
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
String[] fileN = fileName.split("\\.");
System.out.println("fil 0 : " + fileN[0]);
System.out.println("fil 1 : " + fileN[1]);
//logger.info("First line is: " + streamBuff.readLine());
byte[] buffer = new byte[fileFetchBuffer];
FileSystem fs = FileSystem.get(conf);
int charsLeft = fileFetchBuffer;
while (true) {
charsLeft = fileFetchBuffer;
logger.info("charsLeft outside while: " + charsLeft);
FSDataOutputStream dos = null;
while (charsLeft != 0) {
bytesRead = stream.read(buffer, 0, charsLeft);
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
offset = offset + bytesRead;
charsLeft = charsLeft - bytesRead;
logger.info("offset in record: " + offset);
logger.info("charsLeft: " + charsLeft);
logger.info("bytesRead in record: " + bytesRead);
//prettyPrintHex(buffer);
String outFileStr = Utils.getOutputFileName(
stagingDir,
fileN[0],
outFileNum);
if (dos == null) {
Path outFile = new Path(outFileStr);
if (fs.exists(outFile)) {
fs.delete(outFile, false);
}
dos = fs.create(outFile);
}
dos.write(buffer, 0, bytesRead);
}
logger.info("done writing: " + outFileNum);
dos.flush();
dos.close();
if (bytesRead < 0) {
dos.flush();
dos.close();
break;
}
outFileNum++;
} // end of if
} else {
// Assume uncompressed file
stream = fileStream;
}
} catch(FileNotFoundException e) {
logger.error("File not found" + e);
}

You should consider using the Apache Commons IO package.
It has a method
IOUtils.copy( InputStream, OutputStream )
that copies the stream through a small internal buffer, which may noticeably reduce the time needed to copy your files.
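To illustrate the streaming idea, here is a minimal sketch that uses only java.io and java.util.zip rather than Commons IO or the Hadoop API. It decompresses a gzip stream and writes the result in fixed-size chunks through a small reusable buffer instead of a 500 MB array. The class name `ChunkedGzipCopy`, the `ChunkSink` hook and the 64 KB buffer size are assumptions made for illustration; with HDFS, the sink would presumably return the stream obtained from `fs.create(outFile)`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

class ChunkedGzipCopy {

    /** Hypothetical hook: opens the output stream for chunk number n. */
    public interface ChunkSink {
        OutputStream open(int chunkNumber) throws IOException;
    }

    /**
     * Decompresses gzippedInput and writes the plain bytes into chunks of at
     * most chunkSize bytes each. Closes the input when done and returns the
     * number of chunks written.
     */
    static int copyInChunks(InputStream gzippedInput, ChunkSink sink, long chunkSize)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[64 * 1024]; // small reusable buffer (assumed size)
        int chunkNumber = 0;
        long bytesInChunk = 0;
        OutputStream out = null;
        try (InputStream in = new GZIPInputStream(gzippedInput)) {
            while (true) {
                // never read past the end of the current chunk
                int toRead = (int) Math.min(buffer.length, chunkSize - bytesInChunk);
                int n = in.read(buffer, 0, toRead);
                if (n < 0) {
                    break; // end of the decompressed stream
                }
                if (out == null) {
                    out = sink.open(chunkNumber); // opened lazily, so no empty chunk is created
                }
                out.write(buffer, 0, n);
                bytesInChunk += n;
                if (bytesInChunk == chunkSize) {
                    out.close();
                    out = null;
                    chunkNumber++;
                    bytesInChunk = 0;
                }
            }
            if (out != null) { // last, partially filled chunk
                out.close();
                out = null;
                chunkNumber++;
            }
        } finally {
            if (out != null) {
                out.close(); // only still open here if an exception interrupted the copy
            }
        }
        return chunkNumber;
    }
}
```

Whether this is faster than the original loop depends on where the time actually goes (decompression, network, or the per-read logging), so it is worth measuring before and after.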

I tried with buffered input stream and saw no real difference.
I suppose a file channel implementation could be even more efficient. Tell me if it's not fast enough.
package toto;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
public class Slicer {
private static final int BUFFER_SIZE = 50000;
public static void main(String[] args) {
try
{
slice( args[ 0 ], args[ 1 ], Long.parseLong( args[2]) );
}//try
catch (IOException e)
{
e.printStackTrace();
}//catch
catch( Exception ex )
{
ex.printStackTrace();
System.out.println( "Usage : toto.Slicer <big file> <chunk name radix > <chunks size>" );
}//catch
}//met
/**
* Slices a huge file into chunks.
* @param inputFileName the big file to slice.
* @param outputFileRadix the base name of slices generated by the slicer. All slices will then be numbered outputFileRadix0,outputFileRadix1,outputFileRadix2...
* @param chunkSize the size of chunks in bytes
* @return the number of slices.
*/
public static int slice( String inputFileName, String outputFileRadix, long chunkSize ) throws IOException
{
//I would add some code to pretty-print the output file names,
//i.e. pad chunkNumber with leading zeros in the output file name
//so that all names have the same number of characters:
//use java.io.File to get the input size, estimate the number of chunks, and derive the number of digits needed
//just to get some stats
long timeStart = System.currentTimeMillis();
long timeStartSlice = timeStart;
long timeEnd = 0;
//io streams and chunk counter
int chunkNumber = 0;
FileInputStream fis = null;
FileOutputStream fos = null;
try
{
//open files
fis = new FileInputStream( inputFileName );
fos = new FileOutputStream( outputFileRadix + chunkNumber );
//declare state variables
boolean finished = false;
byte[] buffer = new byte[ BUFFER_SIZE ];
int bytesRead = 0;
long bytesInChunk = 0;
while( !finished )
{
//System.out.println( "bytes to read " +(int)Math.min( BUFFER_SIZE, chunkSize - bytesInChunk ) );
bytesRead = fis.read( buffer,0, (int)Math.min( BUFFER_SIZE, chunkSize - bytesInChunk ) );
if( bytesRead == -1 )
finished = true;
else
{
fos.write( buffer, 0, bytesRead );
bytesInChunk += bytesRead;
if( bytesInChunk == chunkSize )
{
if( fos != null )
{
fos.close();
timeEnd = System.currentTimeMillis();
System.out.println( "Chunk "+chunkNumber + " has been generated in "+ (timeEnd - timeStartSlice) +" ms");
chunkNumber ++;
bytesInChunk = 0;
timeStartSlice = timeEnd;
System.out.println( "Creating slice number " + chunkNumber );
fos = new FileOutputStream( outputFileRadix + chunkNumber );
}//if
}//if
}//else
}//while
}
catch (Exception e)
{
System.out.println( "A problem occurred during slicing : " );
e.printStackTrace();
}//catch
finally
{
//whatever happens close all files
System.out.println( "Closing all files.");
if( fis != null )
fis.close();
if( fos != null )
fos.close();
}//fin
timeEnd = System.currentTimeMillis();
System.out.println( "Total slicing time : " + (timeEnd - timeStart) +" ms" );
System.out.println( "Total number of slices "+ (chunkNumber +1) );
return chunkNumber+1;
}//met
}//class
Greetings,
Stéphane
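The file channel idea mentioned in this answer could be sketched roughly as follows. This is an illustration only, not a measured improvement: the class name `ChannelSlicer` is made up, and the approach only applies when raw bytes are copied from a local file, since gzip decompression still has to go through a stream. Unlike the `Slicer` above, this sketch does not create a trailing empty slice when the file size is an exact multiple of the chunk size.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class ChannelSlicer {

    /**
     * Slices inputFile into pieces of at most chunkSize bytes named
     * outputRadix0, outputRadix1, ... and returns the number of slices.
     */
    static int slice(Path inputFile, String outputRadix, long chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunkNumber = 0;
        try (FileChannel in = FileChannel.open(inputFile, StandardOpenOption.READ)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                long end = Math.min(position + chunkSize, size);
                Path target = Paths.get(outputRadix + chunkNumber);
                try (FileChannel out = FileChannel.open(target,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.WRITE,
                        StandardOpenOption.TRUNCATE_EXISTING)) {
                    // transferTo may move fewer bytes than requested, so loop until the slice is full
                    while (position < end) {
                        position += in.transferTo(position, end - position, out);
                    }
                }
                chunkNumber++;
            }
        }
        return chunkNumber;
    }
}
```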

Related

Name the images that i get from a web site in Java

I am scraping a web site and, as the last step, I save its product images to a folder. I want to name these images as product_id plus the image number; that is, if a product has 2 images, there will be 2 PNG files named (productId_1) and (productId_2).
I already have the productId and the images, so that part is no problem. I just want to know how to name the files the way I want. Here is my code.
for(Element imageElement : imageElements){
String strImageURL = imageElement.attr("src");
String strImageName =product_id + "_" + ??;
try {
URL urlImage = new URL(strImageURL);
InputStream in = urlImage.openStream();
byte[] buffer = new byte[4096];
int n = -1;
OutputStream os = new FileOutputStream( IMAGE_DESTINATION_FOLDER + "/" + strImageName );
while ( (n = in.read(buffer)) != -1 ){
os.write(buffer, 0, n);
}
//close the stream
os.close();
} catch (IOException e) {
System.out.println("sponsored product");
}
// for loop images
}
I assume you are asking what to write instead of the ?? in the code in your question. Just create a counter variable.
int counter = 0;
for(Element imageElement : imageElements){
String strImageURL = imageElement.attr("src");
String strImageName = product_id + "_" + (++counter);
try {
URL urlImage = new URL(strImageURL);
InputStream in = urlImage.openStream();
byte[] buffer = new byte[4096];
int n = -1;
OutputStream os = new FileOutputStream( IMAGE_DESTINATION_FOLDER + "/" + strImageName );
while ( (n = in.read(buffer)) != -1 ){
os.write(buffer, 0, n);
}
//close the stream
os.close();
} catch (IOException e) {
System.out.println("sponsored product");
}
// for loop images
}

Reading/Processing one CSV File using Multithreads in Java

In this example file reader, the solution focuses on just reading any file and loading it into memory.
I've been working on improving it so that it processes a CSV file while keeping the header in each thread, so each thread can output a separate, correctly formatted CSV file.
Unfortunately I'm not able to do so, since it reads from arbitrary byte offsets; this means a chunk might start in the middle of a line and I'll get lines mixed up.
Is there a way to use this code and make it CSV-specific?
Here is the code I changed:
public static void main(String[] args) throws IOException {
long start = System.currentTimeMillis();
CSVReader reader = new CSVReader(new FileReader("file.csv"));
String[] columnsNames = reader.readNext();
reader.close();
FileInputStream fileInputStream = new FileInputStream("file.csv");
FileChannel channel = fileInputStream.getChannel();
long remaining_size = channel.size(); //get the total number of bytes in the file
long chunk_size = remaining_size / Integer.parseInt("4"); //file_size/threads
//Max allocation size allowed is ~2GB
if (chunk_size > (Integer.MAX_VALUE - 5))
{
chunk_size = (Integer.MAX_VALUE - 5);
}
//thread pool
ExecutorService executor = Executors.newFixedThreadPool(Integer.parseInt("4"));
long start_loc = 0;//file pointer
int i = 0; //loop counter
while (remaining_size >= chunk_size)
{
//launches a new thread
executor.execute(new FileRead(start_loc, toIntExact(chunk_size), channel, i, String.join(",", columnsNames)));
remaining_size = remaining_size - chunk_size;
start_loc = start_loc + chunk_size;
i++;
}
//load the last remaining piece
executor.execute(new FileRead(start_loc, toIntExact(remaining_size), channel, i, String.join(",", columnsNames)));
//Tear Down
executor.shutdown();
//Wait for all threads to finish
while (!executor.isTerminated())
{
//wait for infinity time
}
System.out.println("Finished all threads");
fileInputStream.close();
long finish = System.currentTimeMillis();
System.out.println( "Time elapsed: " + (finish - start) );
}
class FileRead implements Runnable {
private FileChannel _channel;
private long _startLocation;
private int _size;
int _sequence_number;
String _columns;
public FileRead(long loc, int size, FileChannel chnl, int sequence, String header) {
_startLocation = loc;
_size = size;
_channel = chnl;
_sequence_number = sequence;
_columns = header;
}
@Override
public void run() {
try {
System.out.println( "Reading the channel: " + _startLocation + ":" + _size );
//allocate memory
ByteBuffer buff = ByteBuffer.allocate( _size );
//Read file chunk to RAM
_channel.read( buff, _startLocation );
//chunk to String
String string_chunk = new String( buff.array(), Charset.forName( "UTF-8" ) );
string_chunk = _columns + System.getProperty( "line.separator" ) + string_chunk;
if (string_chunk.length() > 0) {
BufferedWriter out = new BufferedWriter( new FileWriter( "output_" + System.currentTimeMillis() + ".csv" ) );
try {
out.write( string_chunk ); //Replace with the string
//you are trying to write
} catch (IOException e) {
System.out.println( "Exception " );
} finally {
out.close();
}
}
System.out.println( "Done Reading the channel: " + _startLocation + ":" + _size );
} catch (Exception e) {
e.printStackTrace();
}
}
}
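One possible way to avoid chunks that start in the middle of a line is to move each chunk boundary forward to the byte just after the next newline before handing the ranges to the threads. The sketch below is only an assumption about how that could be done: `LineAlignedChunks`, `boundaries` and `nextLineStart` are hypothetical names, and it assumes UTF-8 (or ASCII) data with no line breaks inside quoted CSV fields.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

class LineAlignedChunks {

    /**
     * Returns boundaries [b0, b1, ..., bn] where b0 = dataStart, bn = file size
     * and every inner boundary sits just after a '\n'.
     * Chunk i then covers the bytes [b(i), b(i+1)).
     */
    static List<Long> boundaries(FileChannel channel, long dataStart, int chunks)
            throws IOException {
        long size = channel.size();
        List<Long> result = new ArrayList<>();
        result.add(dataStart);
        long approx = Math.max(1, (size - dataStart) / chunks);
        for (int i = 1; i < chunks; i++) {
            long guess = dataStart + i * approx;
            long last = result.get(result.size() - 1);
            // push the rough split point forward to the start of the next line
            long aligned = nextLineStart(channel, Math.max(guess, last), size);
            if (aligned > last && aligned < size) {
                result.add(aligned);
            }
        }
        result.add(size);
        return result;
    }

    /** Scans forward from pos and returns the offset just past the next '\n', or size if there is none. */
    static long nextLineStart(FileChannel channel, long pos, long size) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (pos < size) {
            buf.clear();
            int n = channel.read(buf, pos);
            if (n <= 0) {
                break;
            }
            for (int i = 0; i < n; i++) {
                if (buf.get(i) == '\n') {
                    return pos + i + 1;
                }
            }
            pos += n;
        }
        return size;
    }
}
```

The offset where the data rows begin can be found with `nextLineStart(channel, 0, channel.size())`, which skips the header line. Each `FileRead` task would then receive one `[start, end)` pair instead of a fixed `chunk_size`.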

sending file over socket only works one time

I wrote code that sends files from one computer to another.
The problem is that after one transfer it doesn't work anymore.
I know the problem occurs when I'm writing to the writer, but I don't know why it's not working.
client:
File file =new File(path);
long fileSize = file.length();
long completed = 0;
int step = 150000;
Request req = new Request(RequetType.DOWNLOAD_FILE,file.getName());
writer.writeObject(req);
writer.flush();
// creates the file stream
FileInputStream fileStream = new FileInputStream(file);
// sending a message before streaming the file
// writer.writeObject("SENDING_FILE|" + file.getName() +"|" + fileSize);
writer.reset();
byte[] buffer = new byte[step];
while (completed <= fileSize) {
fileStream.read(buffer);
writer.write(buffer);
completed += step;
}
System.out.println(completed);
//writer.writeObject("SEND_COMPLETE");
fileStream.close();
server:
String filename = (String)req.getContent();
try {
FileOutputStream outStream =new FileOutputStream(Startdir+""+filename);
byte[] buffer = new byte[200000];
int bytesRead = 0, counter = 0;
bytesRead = this.reader.read(buffer);
if (bytesRead >= 0) {
outStream.write(buffer, 0, bytesRead);
counter += bytesRead;
System.out.println("total bytes read: " +
counter);
}
if (bytesRead < 1024) {
outStream.flush();
}
while (true)
{
bytesRead = this.reader.read(buffer);
if (bytesRead >= 0) {
outStream.write(buffer, 0, bytesRead);
counter += bytesRead;
System.out.println("total bytes read: " +
counter);
}
if (bytesRead ==0)
{
outStream.flush();
break;
}
}
System.out.println("Sent:"+filename+" from:"+MainApp.computersconnection.getIp());
} catch (Exception e) {
System.out.println("Error on downloading file!");
}
You need to flush the streams at the end even if the file isn't 0 bytes long. Try implementing that change and tell me if it still gives you trouble.
(Flush the output stream when you're done sending a file.)
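A minimal sketch of a send loop that flushes when the file is done is shown below. It goes a little beyond the flush suggestion: it writes only the bytes that `read` actually returned, and it sends the file length first so the receiver knows when to stop without the connection being closed. The length prefix and the names `FramedTransfer`, `send` and `receive` are assumptions for illustration, not something taken from the question's protocol.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class FramedTransfer {

    /** Sends the length first, then exactly that many bytes, then flushes. */
    static void send(InputStream file, long length, DataOutputStream out) throws IOException {
        out.writeLong(length);
        byte[] buffer = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int n = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("file ended before " + length + " bytes were read");
            }
            out.write(buffer, 0, n); // write only the bytes actually read
            remaining -= n;
        }
        out.flush(); // push the last bytes out once the file is complete
    }

    /** Reads the length, then copies exactly that many bytes into target. */
    static long receive(DataInputStream in, OutputStream target) throws IOException {
        long length = in.readLong();
        byte[] buffer = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("connection closed before the file was complete");
            }
            target.write(buffer, 0, n);
            remaining -= n;
        }
        target.flush();
        return length;
    }
}
```

Because the receiver stops after exactly `length` bytes, the same connection can carry a second file right after the first one.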

Error in my mp3 splitter program?

So, I made a program that splits an .mp3 file in Java. It works fine on some files, but on others the first split file hits an error partway through playback. The other split files play completely fine, though.
I think it has something to do with the file size not being a multiple of the size of my array, so some remainder (mod value) should be left over. Can anybody please identify the error in this code and correct it?
(Here, splitval = number of splits to be made, filename1 = the selected file.)
int splitsize=filesize/splitval;
String filecalled;
try
{
byte []b=new byte[splitsize];
FileInputStream fis = new FileInputStream(filename1);
name1=filename2.replaceAll(".mp3", "");
for(int j=1;j<=splitval;j++)
{
filecalled=name1+"_split_"+j+".mp3";
FileOutputStream fos = new FileOutputStream(filecalled);
int i=fis.read(b);
fos.write(b, 0, i);
//System.out.println("no catch");
}
JOptionPane.showMessageDialog(this, "split process successful");
}
catch(IOException e)
{
System.out.println(e.getMessage());
}
Thanks in advance!
EDIT:
I edited the code as suggested and ran it. Here is the output:
C:\Users\dell5050\Desktop\Julien.mp3 5383930 bytes
C:\Users\dell5050\Desktop\ Julien_split_1.mp3 1345984 bytes
C:\Users\dell5050\Desktop\ Julien_split_2.mp3 1345984 bytes
C:\Users\dell5050\Desktop\ Julien_split_3.mp3 1345984 bytes
C:\Users\dell5050\Desktop\ Julien_split_4.mp3 1345978 bytes
The last file is a few bytes smaller, which means the filesize % splitval problem is solved. But the first file, the one containing '_split_1', still has an error while playing some of the last part.
The second file, containing '_split_2', starts exactly where the first ended, so the split process is correct. Then what exactly is the extra empty part at the end of the first file?
Also, I noticed that the artwork and info of the original file carry over into the first file ONLY, not into the other files. Does it have something to do with that? The same thing doesn't happen with some other mp3 files.
CODE:
FileInputStream fis;
FileOutputStream fos;
int splitsize = (int)(filesize / splitval) + (int)(filesize % splitval);
byte[] b = new byte[splitsize];
System.out.println(filename1 + " " + filesize + " bytes");
try
{
fis = new FileInputStream(file);
name1 = filename2.replaceAll(".mp3", "");
for (int j = 1; j <= splitval; j++)
{
String filecalled = name1 + "_split_" + j + ".mp3";
fos = new FileOutputStream(filecalled);
int i = fis.read(b);
fos.write(b, 0, i);
fos.close();
System.out.println(filecalled + " " + i + " bytes");
}
}
catch(IOException ie)
{
System.out.println(ie.getMessage());
}
I doubt you can split an mp3 file just by copying n bytes to a file and moving on to the next. Mp3 has a specific format, and you'll probably need a library to handle this format.
EDIT regarding the size of the part files being all equal:
You are not writing all the bytes of the file to the split files. If you sum the sizes of all split files and compare the total to the size of the original file, you'll find that you're missing some bytes. This is because your loop runs from 1 to splitval and always writes exactly splitsize bytes to each part file. So the number of bytes you are missing is filesize % splitval.
To resolve this problem, simply add filesize % splitval to splitsize. This way you'll not be missing any bytes. The files from 1 to splitval - 1 will have the same size; the last file will be smaller.
Here is a corrected version of your code, with some additions that merge the split files so that the result can be checked against the original using a SHA-1 checksum.
Disclaimer: the output files are not expected to be proper mp3 files.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import junit.framework.Assert;
import org.junit.Test;
public class SplitFile {
@Test
public void splitFile() throws IOException, NoSuchAlgorithmException {
String filename1 = "mp3/Innocence_-_Nero.mp3";
File file = new File(filename1);
FileInputStream fis = null;
FileOutputStream fos = null;
long filesize = file.length();
long filesizeActual = 0L;
int splitval = 5;
int splitsize = (int)(filesize / splitval) + (int)(filesize % splitval);
byte[] b = new byte[splitsize];
System.out.println(filename1 + " " + filesize + " bytes");
try {
fis = new FileInputStream(file);
String name1 = filename1.replaceAll(".mp3", "");
String mergeFile = name1 + "_merge.mp3";
for (int j = 1; j <= splitval; j++) {
String filecalled = name1 + "_split_" + j + ".mp3";
fos = new FileOutputStream(filecalled);
int i = fis.read(b);
fos.write(b, 0, i);
fos.close();
fos = null;
System.out.println(filecalled + " " + i + " bytes");
filesizeActual += i;
}
Assert.assertEquals(filesize, filesizeActual);
mergeFileParts(filename1, splitval);
check(filename1, mergeFile);
} finally {
if(fis != null) {
fis.close();
}
if(fos != null) {
fos.close();
}
}
}
private void mergeFileParts(String filename1, int splitval) throws IOException {
FileInputStream fis = null;
FileOutputStream fos = null;
try {
String name1 = filename1.replaceAll(".mp3", "");
String mergeFile = name1 + "_merge.mp3";
fos = new FileOutputStream(mergeFile);
for (int j = 1; j <= splitval; j++) {
String filecalled = name1 + "_split_" + j + ".mp3";
File partFile = new File(filecalled);
fis = new FileInputStream(partFile);
int partFilesize = (int) partFile.length();
byte[] b = new byte[partFilesize];
int i = fis.read(b, 0, partFilesize);
fos.write(b, 0, i);
fis.close();
fis = null;
}
} finally {
if(fis != null) {
fis.close();
}
if(fos != null) {
fos.close();
}
}
}
private void check(String expectedPath, String actualPath) throws IOException, NoSuchAlgorithmException {
System.out.println("check...");
FileInputStream fis = null;
try {
File expectedFile = new File(expectedPath);
long expectedSize = expectedFile.length();
File actualFile = new File(actualPath);
long actualSize = actualFile.length();
System.out.println("exp=" + expectedSize);
System.out.println("act=" + actualSize);
Assert.assertEquals(expectedSize, actualSize);
fis = new FileInputStream(expectedFile);
String expected = makeMessageDigest(fis);
fis.close();
fis = null;
fis = new FileInputStream(actualFile);
String actual = makeMessageDigest(fis);
fis.close();
fis = null;
System.out.println("exp=" + expected);
System.out.println("act=" + actual);
Assert.assertEquals(expected, actual);
} finally {
if(fis != null) {
fis.close();
}
}
}
public String makeMessageDigest(InputStream is) throws NoSuchAlgorithmException, IOException {
byte[] data = new byte[1024];
MessageDigest md = MessageDigest.getInstance("SHA1");
int bytesRead = 0;
while(-1 != (bytesRead = is.read(data, 0, 1024))) {
md.update(data, 0, bytesRead);
}
return toHexString(md.digest());
}
private String toHexString(byte[] digest) {
StringBuffer sha1HexString = new StringBuffer();
for(int i = 0; i < digest.length; i++) {
sha1HexString.append(String.format("%1$02x", Byte.valueOf(digest[i])));
}
return sha1HexString.toString();
}
}
Output (for my test file)
mp3/Innocence_-_Nero.mp3 5048528 bytes
mp3/Innocence_-_Nero_split_1.mp3 1009708 bytes
mp3/Innocence_-_Nero_split_2.mp3 1009708 bytes
mp3/Innocence_-_Nero_split_3.mp3 1009708 bytes
mp3/Innocence_-_Nero_split_4.mp3 1009708 bytes
mp3/Innocence_-_Nero_split_5.mp3 1009696 bytes
check...
exp=5048528
act=5048528
exp=e81cf2dc65ab84e3df328e52d63a55301232b917
act=e81cf2dc65ab84e3df328e52d63a55301232b917
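The size arithmetic behind these numbers can be checked in isolation. The helper below is a small sketch (the class and method names `SplitSizes` and `partSizes` are made up for this illustration) that computes the expected part sizes from filesize and splitval with the same formula as above, without touching any file.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSizes {

    /**
     * Computes the size of each part when every part holds
     * filesize / splitval + filesize % splitval bytes and the
     * last parts receive whatever is left.
     */
    static List<Long> partSizes(long filesize, int splitval) {
        long splitsize = filesize / splitval + filesize % splitval;
        List<Long> sizes = new ArrayList<>();
        long remaining = filesize;
        for (int j = 1; j <= splitval; j++) {
            long part = Math.min(splitsize, remaining); // a part can never exceed what is left
            sizes.add(part);
            remaining -= part;
        }
        return sizes;
    }
}
```

For the test file above, `partSizes(5048528, 5)` gives four parts of 1009708 bytes and a last part of 1009696 bytes, matching the output. With a large remainder a trailing part can come out as 0 (for example filesize 9 and splitval 5 gives 5, 4, 0, 0, 0); in the split loop, `fis.read(b)` would then return -1 at end of file, so that case would need its own handling.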

how to add ProgressMonitorInputStream to ftp upload?

Can anybody see what is wrong with this code? It does not show the progress bar, but it uploads all the files.
I did check out the Sun tutorial and SwingWorker as well, but I couldn't fix it yet.
private static boolean putFile(String m_sLocalFile, FtpClient m_client) {
boolean success = false;
int BUFFER_SIZE = 10240;
if (m_sLocalFile.length() == 0) {
System.out.println("Please enter file name");
}
byte[] buffer = new byte[BUFFER_SIZE];
try {
File f = new File(m_sLocalFile);
int size = (int) f.length();
System.out.println("File " + m_sLocalFile + ": " + size + " bytes");
System.out.println(size);
FileInputStream in = new FileInputStream(m_sLocalFile);
//test
InputStream inputStream = new BufferedInputStream(
new ProgressMonitorInputStream(null,"Uploading " + f.getName(),in));
//test
OutputStream out = m_client.put(f.getName());
int counter = 0;
while (true) {
int bytes = inputStream.read(buffer); //in
if (bytes < 0)
break;
out.write(buffer, 0, bytes);
counter += bytes;
System.out.println(counter);
}
out.close();
in.close();
inputStream.close();
success =true;
} catch (Exception ex) {
System.out.println("Error: " + ex.toString());
}
return success;
}
I think your code is fine.
Maybe the task isn't taking long enough for the progress bar to be needed?
Here's a modified version of your code which reads from a local file and writes to another local file.
I have also added a delay to the write so that it gives the progress bar time to kick in.
This works fine on my system with a sample 12MB PDF file, and shows the progress bar.
If you have a smaller file then just increase the sleep from 5 milliseconds to 100 or something - you would need to experiment.
And I didn't even know that the ProgressMonitorInputStream class existed, so I've learnt something myself ;].
/**
* main
*/
public static void main(String[] args) {
try {
System.out.println("start");
final String inf = "d:/testfile.pdf";
final String outf = "d:/testfile.tmp.pdf";
final FileOutputStream out = new FileOutputStream(outf) {
@Override
public void write(byte[] b, int off, int len) throws IOException {
super.write(b, off, len);
try {
// We delay the write by a few millis to give the progress bar time to kick in
Thread.sleep(5);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
};
putFile(inf, out);
System.out.println("end");
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
private static boolean putFile(String m_sLocalFile, OutputStream out /*FtpClient m_client*/) {
boolean success = false;
int BUFFER_SIZE = 10240;
if (m_sLocalFile.length() == 0) {
System.out.println("Please enter file name");
}
byte[] buffer = new byte[BUFFER_SIZE];
try {
File f = new File(m_sLocalFile);
int size = (int) f.length();
System.out.println("File " + m_sLocalFile + ": " + size + " bytes");
System.out.println(size);
FileInputStream in = new FileInputStream(m_sLocalFile);
//test
InputStream inputStream = new BufferedInputStream(
new ProgressMonitorInputStream(null,"Uploading " + f.getName(),in));
//test
//OutputStream out = m_client.put(f.getName());
int counter = 0;
while (true) {
int bytes = inputStream.read(buffer); //in
if (bytes < 0)
break;
out.write(buffer, 0, bytes);
counter += bytes;
System.out.println(counter);
}
out.close();
in.close();
inputStream.close();
success =true;
} catch (Exception ex) {
System.out.println("Error: " + ex.toString());
}
return success;
}
