I have a JavaRDD<Model> that I need to write out as more than one file, each with a different layout (one or two fields in the RDD differ between layouts).
When I use saveAsTextFile(), it calls the toString() method of Model, so the same layout is written to every output.
Currently I iterate the RDD using a map transformation and return a different model with the other layout, so I can use the saveAsTextFile() action to write a different output file.
Just because one or two fields differ, I have to iterate the entire RDD again, create a new RDD, and then save it as an output file.
For example:
Current RDD with fields:
RoleIndicator, Name, Age, Address, Department
Output File 1:
Name, Age, Address
Output File 2:
RoleIndicator, Name, Age, Department
Is there any optimal solution for this?
Regards,
Shankar
You want to use foreach, not collect.
You should define your function as an actual named class that extends VoidFunction. Create instance variables for both files, and add a close() method that closes the files. Your call() implementation will write whatever you need.
Remember to call close() on your function object after you're done.
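The named-class suggestion can be sketched as follows. This is a framework-free sketch: VoidFunction is declared locally to mirror Spark's org.apache.spark.api.java.function.VoidFunction, and Model with its five fields is a hypothetical stand-in for the question's class. Note that in a real cluster, foreach runs on the executors, so the files would be created on the worker machines, not the driver.

```java
import java.io.PrintWriter;
import java.io.Writer;

// Stand-in for Spark's org.apache.spark.api.java.function.VoidFunction,
// declared locally so the sketch compiles without Spark on the classpath.
interface VoidFunction<T> {
    void call(T t) throws Exception;
}

// Hypothetical Model with the five fields from the question.
class Model {
    String roleIndicator, name, age, address, department;
    Model(String role, String name, String age, String address, String dept) {
        this.roleIndicator = role; this.name = name; this.age = age;
        this.address = address; this.department = dept;
    }
}

// A named function holding both sinks; call() writes each element in both
// layouts in a single pass, and close() releases the sinks when done.
class TwoLayoutWriter implements VoidFunction<Model> {
    private final PrintWriter out1; // layout 1: Name, Age, Address
    private final PrintWriter out2; // layout 2: RoleIndicator, Name, Age, Department

    TwoLayoutWriter(Writer w1, Writer w2) {
        out1 = new PrintWriter(w1);
        out2 = new PrintWriter(w2);
    }

    @Override
    public void call(Model m) {
        out1.println(m.name + "," + m.age + "," + m.address);
        out2.println(m.roleIndicator + "," + m.name + "," + m.age + "," + m.department);
    }

    void close() {
        out1.close();
        out2.close();
    }
}
```

This writes both layouts in one pass over the data instead of building a second RDD.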
It is possible with a pair RDD.
A pair RDD can be stored in multiple files in a single pass by using a custom Hadoop output format:
rdd.saveAsHadoopFile(path, key.class, value.class, FileGroupingTextOutputFormat.class, jobConf);
public class FileGroupingTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected Text generateActualKey(Text key, Text value) {
        return new Text();
    }

    @Override
    protected Text generateActualValue(Text key, Text value) {
        return value;
    }

    // Returns a dynamic file name based on each RDD element
    // (getSomeField() is a placeholder for however you derive the layout name).
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return value.getSomeField() + "-" + name;
    }
}
Based on this example, this works. I have tried the same on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Considering each line as a string, my mapper output is:
key -> string[2], value -> the whole string.
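As a framework-free sanity check of that key extraction (assuming the ';' delimiter from the sample lines above):

```java
// Split each ';'-delimited line and use field index 2 as the map output key,
// mirroring the mapper described above.
class KeyExtractor {
    static String extractKey(String line) {
        String[] parts = line.split(";");
        return parts[2];
    }
}
```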
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if (keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my data set most ids are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining ids. I am getting two output files, but the ids are evenly distributed across both. What's going wrong in my program?
Explicitly set in the driver that you want to use your custom Partitioner, with job.setPartitionerClass(YourPartitioner.class);. If you don't, the default HashPartitioner is used.
Change the String comparison from == to .equals(), i.e. change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it may be faster to declare a Text variable once at the top of the partitioner, like this: Text KEY = new Text("137176");, and then, without converting your input key to a String every time, just compare it with the KEY variable (again using equals()). But perhaps those are equivalent. So, what I suggest is:
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion: if the network load is heavy, emit the map output key as a VIntWritable and change the Partitioner accordingly.
I am trying to get as many reducers as there are keys.
public class CustomPartitioner extends Partitioner<Text, Text> {

    public int getPartition(Text key, Text value, int numReduceTasks) {
        System.out.println("In CustomP");
        return (key.toString().hashCode()) % numReduceTasks;
    }
}
Driver class
job6.setMapOutputKeyClass(Text.class);
job6.setMapOutputValueClass(Text.class);
job6.setOutputKeyClass(NullWritable.class);
job6.setOutputValueClass(Text.class);
job6.setMapperClass(LastMapper.class);
job6.setReducerClass(LastReducer.class);
job6.setPartitionerClass(CustomPartitioner.class);
job6.setInputFormatClass(TextInputFormat.class);
job6.setOutputFormatClass(TextOutputFormat.class);
But I am getting output in a single file.
Am I doing anything wrong?
You cannot control the number of reducers without specifying it :-). But even then, there is no guarantee that every key lands on a different reducer, because you don't know in advance how many distinct keys the input data contains, and your hash partition function may return the same number for two distinct keys. If you want to achieve your goal, you have to know the number of distinct keys in advance and modify your partition function accordingly.
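To see why distinct keys can share a partition, here is a minimal hash-based partition function. The & Integer.MAX_VALUE mask is an extra safeguard the code above lacks: String.hashCode() can be negative, which would otherwise produce a negative partition index.

```java
// Minimal hash-based partition function. Two pitfalls: (1) distinct keys can
// collide on the same partition, and (2) hashCode() can be negative, so mask
// the sign bit before the modulo to keep the index in [0, numReduceTasks).
class HashPartitionDemo {
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```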
You need to specify a number of reduce tasks equal to the number of keys, and you also need to return partitions based on your keys in the partitioner class. For example, if your input has 4 keys (here wood, Masonry, Reinforced Concrete, etc.), then your getPartition method would look like this:
public int getPartition(Text key, PairWritable value, int numReduceTasks) {
    String s = value.getone();
    if (numReduceTasks == 0) {
        return 0;
    }
    if (s.equalsIgnoreCase("wood")) {
        return 0;
    }
    if (s.equalsIgnoreCase("Masonry")) {
        return 1 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Concrete")) {
        return 2 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Masonry")) {
        return 3 % numReduceTasks;
    } else {
        return 4 % numReduceTasks;
    }
}
The corresponding output will be collected in the respective reducers. Try running from the CLI instead of Eclipse.
You haven't configured the number of reducers to run.
You can configure it using the API below:
job.setNumReduceTasks(10); // change the number according to your cluster
You can also set it when executing from the command line:
-D mapred.reduce.tasks=10
Hope this helps.
Veni, you need to chain the tasks as below:
Mapper1 --> Reducer --> Mapper2 (a post-processing mapper that creates a file for each key)
Mapper2's InputFormat should be NLineInputFormat, so that for each key emitted by the reducer there is a corresponding mapper, and each mapper's output becomes a separate file for its key.
Mapper1 and the Reducer are your existing MR job.
Hope this helps.
Cheers
Nag
private void initialize() {
    loadPersistenceContext();
    List<Event> events = getEventsChoiceBox(getPersistenceContext());
    ObservableList<Event> data = FXCollections.observableList(events);
    cbEvent.setItems(data); // Inserting data into the ChoiceBox
}
This is my main code. The problem is that when the form is loaded, the objects themselves are inserted in the ChoiceBox, not their properties.
This is the content of my list of events:
Object[]
|- String
|- Integer
Object[]
|- String
|- Integer
So I want a ChoiceBox with that String property showing up and the Integer mapped to my controller.
I tried a lot of things but couldn't figure it out.
Here is another simple implementation from forums.oracle.com
Create a class for the key-value pair:
public class KeyValuePair {
    private final String key;
    private final String value;

    public KeyValuePair(String key, String value) {
        this.key = key;
        this.value = value;
    }

    public String getKey() { return key; }

    public String toString() { return value; }
}
Then create the ChoiceBox as:
ChoiceBox<KeyValuePair> choiceBox = new ChoiceBox<KeyValuePair>();
Fill in the elements as:
choiceBox.getItems().add(new KeyValuePair("1", "Active"));
Hint: retrieve the key-value pairs from your database into an ArrayList and iterate.
To retrieve the value:
choiceBox.getValue().getKey(); // returns the "1"
choiceBox.getValue().toString(); // returns the "Active"
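Since KeyValuePair is plain Java, the display/lookup split can be verified without JavaFX (the class is repeated here so the snippet stands alone):

```java
// KeyValuePair as defined above: toString() drives what the ChoiceBox
// displays, while getKey() returns the backing database id.
class KeyValuePair {
    private final String key;
    private final String value;

    KeyValuePair(String key, String value) {
        this.key = key;
        this.value = value;
    }

    String getKey() { return key; }

    @Override
    public String toString() { return value; }
}
```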
See this example of a JavaFX ChoiceBox control backed by Database IDs.
The example works by defining a Choice class consisting of a database row ID and the string to display in the ChoiceBox. The default toString method of Choice is overridden with a custom implementation that returns the display string rather than the database ID. When you add the choices to the ChoiceBox, it converts each Choice to a string for display, so what the user sees is just the choice text, rather than the database ID or the meaningless object reference the default toString would produce.
Output of choicebox sample app:
Also consider a ComboBox for such an implementation, as it has mechanisms built in to abstract the values of nodes from how they are displayed (via a CellFactory). A ComboBox is, however, often more complex to use than a ChoiceBox.
Or simply do myChoiceBox.setConverter(myStringConverter), passing an instance of your own subclass of javafx.util.StringConverter (JavaDoc).
Overriding the toString (and fromString) gives you full control over how your object is displayed without having to implement a toString in the object itself.
I am new to Hadoop and MapReduce and have been trying to write output to multiple files based on keys. Could anyone please provide a clear explanation or a Java code snippet showing how to do it? My mapper works exactly as intended and, after the shuffle, keys and their corresponding values are obtained as expected. Thanks!
What I am trying to do is output only a few records from the input file to a new file.
The new output file should thus contain only the required records, ignoring the rest of the irrelevant records.
This would work fine even if I don't use MultipleTextOutputFormat.
The logic I implemented in the mapper is as follows:
public static class MapClass extends Mapper<LongWritable, Text, Text, Text> {

    Text kword = new Text();
    Text vword = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        kword.set(parts[4]);
        vword.set(line);
        context.write(kword, vword);
    }
}
Input to reduce is like this:
[key1]--> [value1, value2, ...]
[key2]--> [value1, value2, ...]
[key3]--> [value1, value2, ...] & so on
My interest is in [key2] --> [value1, value2, ...], ignoring the other keys and their corresponding values. Please help me out with the reducer.
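Stripped of the Hadoop plumbing, the reduce-side filter being asked for is just: emit values only when the key matches. A framework-free sketch (in a real Reducer, reduce(Text key, Iterable<Text> values, Context context) would call context.write instead of returning a list; "key2" stands for whatever key you care about):

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free sketch of the reduce-side filter: keep values only for the
// single key of interest and drop everything else.
class KeyFilter {
    private final String wantedKey;

    KeyFilter(String wantedKey) { this.wantedKey = wantedKey; }

    List<String> reduce(String key, List<String> values) {
        if (!wantedKey.equals(key)) {
            return new ArrayList<>();   // ignore every other key
        }
        return new ArrayList<>(values); // emit all values for the wanted key
    }
}
```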
Using MultipleOutputs lets you emit records to multiple files, but only to a pre-defined number/type of files, not to an arbitrary number of files with filenames decided on the fly from the key/value.
You can create your own OutputFormat by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat. Your OutputFormat class should allow the output file name, as well as its folder, to be decided from the key/value emitted by the reducer. This can be achieved as follows:
package oddjob.hadoop;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
public class MultipleTextOutputFormatByKey extends MultipleTextOutputFormat<Text, Text> {

    /**
     * Use the key as part of the path for the final output file.
     */
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leaf) {
        return new Path(key.toString(), leaf).toString();
    }

    /**
     * When actually writing the data, discard the key since it is already in
     * the file path.
     */
    @Override
    protected Text generateActualKey(Text key, Text value) {
        return null;
    }
}
For more info read here.
PS: You will need to use the old mapred API to achieve this, as the newer API does not support MultipleTextOutputFormat yet. Refer to this.
I would like to export a Java Bean or a ResultSet (JDBC) to a CSV file through reflection.
I have seen this API:
http://opencsv.sourceforge.net/apidocs/au/com/bytecode/opencsv/bean/BeanToCsv.html
but it's not released yet.
Also, it would be nice to be able to set filters to avoid mapping certain fields.
Do you know of a known API that has these features?
Unless there is some ready-made API, I would use Apache Commons' ReflectionToStringBuilder (http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/builder/ReflectionToStringBuilder.html) to get a String representation of a JavaBean. By setting your own ToStringStyle it is possible to create a CSV-style String. There are many settings for styling the String, including excluding fields and so on.
And then, of course, write it to a file.
You can just write out to a .csv file as you would to a normal .txt file, using an OutputStream or similar.
If you need more advanced Excel-like features, I recommend Apache POI. It has always done the job nicely and cleanly for me.
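A minimal sketch of the plain-java.io approach: build each line by quoting every field and doubling embedded quotes. This handles only the basics; a library like opencsv or Commons CSV covers more edge cases.

```java
import java.util.List;

// Minimal CSV line construction with no external libraries: quote each field
// and double any embedded quotes, then join with commas.
class SimpleCsvWriter {
    static String toCsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(fields.get(i).replace("\"", "\"\"")).append('"');
        }
        return sb.toString();
    }
}
```

Each line produced this way can then be written out through any Writer or OutputStream.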
Adding to Kennet's answer:
I implemented two classes: One for the header (if needed) and one for the body (actual data)
HEADER
The header style class needs to extend ToStringStyle
Invoke toString with a single element, e.g. ReflectionToStringBuilder.toString(firstElement, headerStyle)
Constructor:
this.setUseClassName(false);
this.setUseIdentityHashCode(false);
this.setContentStart("");
this.setUseFieldNames(true);
this.setFieldNameValueSeparator("");
this.setContentEnd("\n");
Override Method:
@Override
public void append(StringBuffer buffer, String fieldName, Object value, Boolean fullDetail) {
    super.append(buffer, fieldName, "", fullDetail);
}
BODY
The body class needs to extend RecursiveToStringStyle
Invoke toString with an array, e.g. ReflectionToStringBuilder.toString(array, bodyStyle)
Constructor:
this.setUseClassName(false);
this.setUseIdentityHashCode(false);
this.setContentStart("");
this.setUseFieldNames(false);
this.setContentEnd("");
this.setNullText("n.a.");
this.setArrayStart("");
this.setArrayEnd("");
this.setArraySeparator("\n");
Override Method:
@Override
public void append(StringBuffer buffer, String fieldName, Object value, Boolean fullDetail) {
    String csvField = Optional.ofNullable(value)
            .map(Objects::toString)
            .map(this::escapeLineBreak)
            .map(this::escapeDoubleQuote)
            .map(this::escapeField)
            .orElse(null);
    super.append(buffer, fieldName, csvField, fullDetail);
}
Formatting Methods:
private String escapeDoubleQuote(final String field) {
    return field.replace("\"", "\"\"");
}

private String escapeLineBreak(final String field) {
    return field.replaceAll("\\R", " ");
}

private String escapeField(final String field) {
    return "\"" + field + "\"";
}