Write into Hadoop File System in parallel - java

I'm quite new to Hadoop, and I have an issue...
I have an output file (the result of a task) and I would like to modify it. As it can be a very big file, I want to do this operation in parallel.
Note: I don't want to simply append data; I want to modify the file structurally (even its size), so I have to read it completely and write it back.
Reading the file isn't a problem: I give each worker a part of the file, and they simply read it and make the changes they want.
But writing the new file back to HDFS seems more tricky.
My question is: how can I create a big file in HDFS and have my workers write into it simultaneously? (I know the size of each part, so two workers will never attempt to write at the same position.)
Thanks in advance :)

Since the job is to read the input file and write selected content from it to an output location in parallel, this is a mapper-only job.
Create a Mapper class to read the file and perform your operations on it.
Set the number of mappers in your driver class:
job.setNumMapTasks(n); // n = number of mappers (old JobConf API)
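For reference, here is a minimal sketch of such a driver using the newer org.apache.hadoop.mapreduce API, where there is no setNumMapTasks: the number of mappers follows from the input splits, and a map-only job is requested by setting the reducer count to zero. The class names are placeholders, and the mapper just passes lines through.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SelectContentDriver {

    // Placeholder mapper: emits each input line unchanged; your selection logic goes here.
    public static class SelectContentMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "select-content");
        job.setJarByClass(SelectContentDriver.class);
        job.setMapperClass(SelectContentMapper.class);
        job.setNumReduceTasks(0);               // map-only: mapper output is written straight to the output path
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}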

Related

Saving spark dataset to an existing csv file

I am trying to save the contents of a dataset to CSV using:
df.coalesce(1)
.write()
.format("csv")
.mode("append")
.save(PATH+"/trial.csv");
My aim is to keep appending the results of the dataset to the trial.csv file. However, it creates a folder called trial.csv and creates a CSV file inside it. When I run it again, it creates another CSV file inside the trial.csv folder. But I just want it to keep appending to one CSV file, which I am unable to do.
I know we could do this with a script outside the code (program), but can we achieve it from inside our code? I am using Java.
Appending to an existing file is hard to do for a distributed, multi-threaded application; it would turn something parallelized into a sequential task. The way this is usually handled is that each thread or task in Spark persists a single file in the specified path, and that path is a folder containing all the files. To read them, you read the complete folder.
If your data is not big and you really need a single file, try the repartition method with 1; this makes a single task write the new data, but it will never append the data to previous files.
You have to be careful, but you can do something like this:
df.union(spark.read().csv(PATH + "/trial.csv"))
  .coalesce(1)
  .write()
  .format("csv")
  .mode("append")
  .save(PATH + "/trial_auxiliar.csv");
Then move it to the previous folder, with Spark or with a Hadoop move command. Never write to and read from the same folder in the same job, and keep in mind that this won't guarantee the data order.
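If you prefer to do that move from inside the Java code rather than with a shell command, one option is the Hadoop FileSystem API. A sketch, assuming spark is the active SparkSession and the path layout from the snippet above (the helper name is made up):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper: call it only after the save() to trial_auxiliar.csv has finished.
static void promoteAuxiliarFolder(SparkSession spark, String basePath) throws IOException {
    Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
    FileSystem fs = FileSystem.get(hadoopConf);

    Path auxiliar = new Path(basePath + "/trial_auxiliar.csv");
    Path target = new Path(basePath + "/trial.csv");

    fs.delete(target, true);      // drop the old trial.csv folder
    fs.rename(auxiliar, target);  // move the freshly written folder into its place
}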

Update file after FlatFileItemReader in Spring Batch

I currently have the following processing in a Spring Batch job:
FlatFileItemReader reads a CSV file
Processor does some work
FlatFileItemWriter creates a mirror of the read file, but updates the file to reflect processing
I don't want to write to a new file, but I want to update the same file that is being read during processing.
My question is, is there a typical method in Spring to use the FlatFileItemReader and then update that same file per row in the processor at runtime?
Thanks for any help.
You can always write a custom writer in Spring Batch, as in the example below, where you read data from the file into memory and then update the same file with the data you intend to write.
https://github.com/pkainulainen/spring-batch-examples/tree/master/spring/src/main/java/net/petrikainulainen/springbatch/custom/in
More than that, FlatFileItemReader is not thread-safe. There are of course hacks to achieve thread safety, but relying on such hacks is not good practice; it's always better to create a custom writer.
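For illustration only, a minimal sketch of such a custom writer, assuming Spring Batch 4's ItemWriter signature; the target path and String item type are hypothetical, and the original file is only rewritten when the step's streams are closed, i.e. after the reader has finished:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;
import org.springframework.batch.item.ItemWriter;

public class InMemoryRewriteWriter implements ItemWriter<String>, ItemStream {

    private final Path target = Paths.get("data/input.csv"); // placeholder path
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void write(List<? extends String> items) {
        buffer.addAll(items); // hold processed rows in memory while the reader still owns the file
    }

    @Override
    public void close() throws ItemStreamException {
        try {
            // overwrite the original file once, after all chunks have been processed
            Files.write(target, buffer, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        } catch (IOException e) {
            throw new ItemStreamException("Could not rewrite " + target, e);
        }
    }

    @Override
    public void open(ExecutionContext executionContext) { }

    @Override
    public void update(ExecutionContext executionContext) { }
}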
The short answer is no, Spring Batch doesn't allow you to overwrite the same file you are reading from.
A better practice is to write an intermediate file and then perform a delete/rename.
Writing a temporary file is not a 'bad thing', especially if you are working with a huge input file and an OutOfMemoryException is around the corner; using a temporary file can also make your step restartable and lets you manually retrieve the translated file if the delete/rename step fails.
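The delete/rename itself fits naturally in a small Tasklet step placed after the chunk step. A sketch, with both paths hypothetical (in a real job they would be injected or taken from job parameters):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ReplaceOriginalFileTasklet implements Tasklet {

    private final Path original = Paths.get("data/input.csv");       // assumed input file
    private final Path translated = Paths.get("data/input.csv.tmp"); // assumed intermediate file

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Replace the original only after the whole file has been translated,
        // so a failure here leaves both files on disk for a manual retry.
        Files.move(translated, original, StandardCopyOption.REPLACE_EXISTING);
        return RepeatStatus.FINISHED;
    }
}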

Several FileOutputStreams at a time?

The situation is that:
I have a csv file with records (usually 10k but up to 1m records)
I will process each record (very basic arithmetic with 5 basic select queries to the DB for every record)
Each record (now processed) will then be written to a file BUT not the same file every time. A record CAN be written to another file instead.
Basically I have 1 input file but several possible output files (around 1-100 possible output files).
The process itself is basic so I am focusing on how I should handle the records.
Which option is appropriate for this situation?
Store several Lists, one per possible output file, and then write each List one by one at the end?
Or, to avoid several very large Lists, write each record to its respective output file immediately after it is processed? But this requires having several streams open at a time.
Please enlighten me on this. Thanks.
The second option is OK: create the file output streams on demand, and keep them open as long as it takes (track them in a Map, for example).
The operating system may restrict how many open file handles it allows, but those limits are usually well beyond a couple of hundred files.
A third option:
You could also just append to files, FileOutputStream allows that option in the constructor:
new FileOutputStream(File file, boolean append)
This is less performant than keeping the FileOutputStreams open, but works as well.
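A sketch of the second option, with the streams opened on demand and tracked in a Map keyed by output file name (the class and method names are made up for illustration):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RecordRouter implements AutoCloseable {

    private final Map<String, Writer> writers = new HashMap<>();

    public void write(String outputFile, String processedRecord) throws IOException {
        Writer w = writers.get(outputFile);
        if (w == null) {
            // open on demand; 'true' appends, so a rerun adds to earlier output instead of clobbering it
            w = new OutputStreamWriter(new FileOutputStream(outputFile, true), StandardCharsets.UTF_8);
            writers.put(outputFile, w);
        }
        w.write(processedRecord);
        w.write(System.lineSeparator());
    }

    @Override
    public void close() throws IOException {
        for (Writer w : writers.values()) {
            w.close();
        }
    }
}
Opening each stream in append mode also ties in the third option: while the program runs, an already-open writer is reused, and a rerun simply appends to the files that are already on disk.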

How to Override InputFormat and OutputFormat In hadoop Application

I have an application which needs to read a file that is the serialized form of an ArrayList (ArrayList<String>, 50,000 records in the list, size: 20MB).
I don't know exactly how to read the data into the Hadoop platform. I only have some sense that I need to override InputFormat and OutputFormat.
I'm a beginner on the Hadoop platform. Could you give me some advice?
Thanks,
Zheng.
To start with, you'll need to extend FileInputFormat, notably implementing the abstract FileInputFormat.createRecordReader method.
You can look through the source of something like the LineRecordReader (which is what TextInputFormat uses to process text files).
From there you're pretty much on your own (i.e. it depends on how your ArrayList has been serialized). Look through the source of LineRecordReader and try to relate it to how your ArrayList has been serialized.
Some other points of note: is your file format splittable? That is, can you seek to an offset in the file and recover the stream from there? (Text files can, as they just scan forward to the end of the current line and then start from there.) If your file format uses compression, you also need to take this into account (you cannot, for example, randomly seek to a position in a gzip file). By default FileInputFormat.isSplitable returns true, which you may want to override to return false initially. If you do stick with 'unsplittable', then note that your file will be processed by a single mapper (no matter its size).
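As a concrete starting point, a sketch of an input format that keeps the file unsplittable; the class name is made up, and LineRecordReader is only a stand-in until you write a reader that matches your serialization:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // the whole file goes to a single mapper
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // LineRecordReader is just a starting point, as suggested above; replace it
        // with a reader that understands how the ArrayList was actually serialized.
        return new LineRecordReader();
    }
}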
Before processing data on Hadoop, you should upload the data to HDFS or another supported file system, if of course it wasn't uploaded there by something else. If you control the uploading process, you can convert the data at the upload stage into something you can easily process, such as:
a simple text file (one line per array item)
a SequenceFile, if the array items can contain '\n'
This is the simplest solution, since you don't have to interfere with Hadoop's internals.
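A sketch of that conversion, assuming the list was written with Java's standard ObjectOutputStream serialization and that the items contain no newlines (both assumptions need checking against your actual file):
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ListToTextConverter {
    // args[0] = serialized ArrayList file, args[1] = plain text output (one item per line)
    public static void main(String[] args) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(args[0]));
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            @SuppressWarnings("unchecked")
            List<String> items = (List<String>) in.readObject();
            for (String item : items) {
                out.write(item);
                out.newLine();
            }
        }
    }
}
The resulting text file can then be uploaded with hdfs dfs -put and read with the stock TextInputFormat.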

Hadoop MapReduce - one output file for each input

I'm new to Hadoop and I'm trying to figure out how it works. As an exercise I should implement something similar to the WordCount example. The task is to read in several files, do the WordCount, and write an output file for each input file.
Hadoop uses a combiner and shuffles the output of the map part as input for the reducer, then writes one output file (I guess for each instance that is running). I was wondering if it is possible to write one output file for each input file (so keep the words of inputfile1 and write the result to outputfile1, and so on). Is it possible to override the Combiner class, or is there another solution for this? (I'm not sure if this should even be solved in a Hadoop task, but this is the exercise.)
Thanks...
The map.input.file configuration parameter holds the name of the file the mapper is processing. Get this value in the mapper and use it as the mapper's output key; then all the k/v pairs from a single file will go to one reducer.
The code in the mapper (BTW, I am using the old MR API):
private JobConf conf;

@Override
public void configure(JobConf conf) {
    this.conf = conf;
}

@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String filename = conf.get("map.input.file");
    output.collect(new Text(filename), value);
}
And use MultipleOutputFormat; this allows writing multiple output files for the job, with file names derived from the output keys and values.
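For example, with the old API's MultipleTextOutputFormat the output file name can be derived from the key emitted above; a sketch (the class name and the choice to strip the directory part are just for illustration):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class FileNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // the key is the input file path emitted by the mapper; keep just its last component
        String path = key.toString();
        return path.substring(path.lastIndexOf('/') + 1);
    }
}
It is registered on the JobConf with conf.setOutputFormat(FileNameOutputFormat.class).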
Hadoop 'chunks' data into blocks of a configured size. The default is 64MB blocks. You may see where this causes issues for your approach: each mapper may get only a piece of a file. On the other hand, if a file is less than 64MB (or whatever value is configured), then each mapper gets exactly one whole file.
I've had a very similar constraint; I needed a set of files (output from a previous reducer in the chain) to be entirely processed by a single mapper. I use the <64MB fact in my solution.
The main thrust of my solution is that I set it up to provide the mapper with the name of the file it needs to process, and internally the mapper loads/reads that file. This allows a single mapper to process an entire file. It's not distributed processing of the file, but with the constraint of "I don't want individual files distributed", it works. :)
I had the process that launched my MR job write out the names of the files to process into individual files. Those files were written to the input directory. As each of them is <64MB, a single mapper will be generated per file, and the map method will be called exactly once (as there is only one entry in the file).
I then take the value passed to the mapper, open that file, and do whatever mapping I need to do.
Since Hadoop tries to be smart about how it runs Map/Reduce processes, it may be necessary to specify the number of reducers to use so that each mapper goes to a single reducer. This can be set via the mapred.reduce.tasks configuration; I do it via job.setNumReduceTasks([NUMBER OF FILES HERE]);
My process had some additional requirements/constraints that may have made this specific solution appealing; but for an example of a 1:in to 1:out; I've done it, and the basics are laid out above.
HTH
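A sketch of that pattern with the newer API, where each input record is itself a file name and the mapper opens and processes that file (the class name and the emitted key/value choice are illustrative):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WholeFileByNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 'value' is the name of the file to process, written out by the launching process
        Path fileToProcess = new Path(value.toString());
        FileSystem fs = fileToProcess.getFileSystem(context.getConfiguration());

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(fileToProcess), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // key the output by the file name so everything from one file
                // ends up in the same reducer (and hence the same output file)
                context.write(value, new Text(line));
            }
        }
    }
}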
