Hadoop use one instance for each mapper - java

I'm using Hadoop's MapReduce to parse XML files. I have a class called Parser with a parse() method that parses the XML files, and I want to call it from the Mapper's map() function.
However, this means that every time I want to use a Parser, I need to create a Parser instance. That instance should be the same for every call within a map job, so I'm wondering whether I can instantiate the Parser just once.
As an add-on question: why is the Mapper class always static?

To ensure one parser instance per Mapper, instantiate the parser in the mapper's setup() method and release it in the cleanup() method.
We did the same thing for a protobuf parser we had, but you need to make sure that your parser instance is thread-safe and holds no shared data.
Note: setup() and cleanup() are each called only once per mapper (that is, once per map task), so we can initialize private variables there.
To clarify what cricket_007 said in "In a distributed computing environment, sharing instances of a variable isn't possible...":
it is common practice to reuse Writable objects instead of creating new ones every time one is needed. We can instantiate a Writable once and reset it multiple times, as described in Tip 6.
Parser objects can be reused in the same way (Tip 6 style), as the code below shows.
For example:
private YourXMLParser xmlParser = null;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    xmlParser = new YourXMLParser();
}

@Override
protected void cleanup(Mapper<ImmutableBytesWritable, Result, NullWritable, Put>.Context context)
        throws IOException, InterruptedException {
    super.cleanup(context);
    xmlParser = null;
}
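The lifecycle described above can be tried out without a cluster. The following is only a sketch under stated assumptions: XmlRecordMapper is a hypothetical stand-in that mimics the setup/map/cleanup call order and does not extend Hadoop's Mapper, and the JDK's DocumentBuilder takes the place of the question's Parser class. The parsersCreated counter exists only to make the reuse visible.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a Mapper: same call order, no Hadoop types.
class XmlRecordMapper {
    private DocumentBuilder xmlParser; // one parser per mapper instance
    int parsersCreated = 0;            // kept only to show how often a parser is built

    // Called once before the first record, like Mapper.setup(Context).
    void setup() {
        try {
            xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            parsersCreated++;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("cannot create XML parser", e);
        }
    }

    // Called once per record, like Mapper.map(key, value, context).
    // Returns the tag name of the record's root element.
    String map(String xmlRecord) {
        try {
            return xmlParser
                    .parse(new ByteArrayInputStream(xmlRecord.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement()
                    .getTagName();
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("malformed XML record", e);
        }
    }

    // Called once after the last record, like Mapper.cleanup(Context).
    void cleanup() {
        xmlParser = null;
    }

    // Drives the lifecycle the way the framework would for a single task.
    static List<String> runTask(XmlRecordMapper mapper, List<String> records) {
        List<String> rootTags = new ArrayList<>();
        mapper.setup();
        try {
            for (String record : records) {
                rootTags.add(mapper.map(record));
            }
        } finally {
            mapper.cleanup();
        }
        return rootTags;
    }
}
```

However many records are passed to runTask, the parser is built exactly once, which is the point of moving the instantiation out of map().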

Related

Java Cucumber storing value returned by step

I'm writing a Cucumber test and have run into a difficulty:
I have a step that creates a DTO and saves it using a save client, which returns the DTO back. I need to use that returned DTO in another step, but I don't know how to do it.
Here's how it looks in code:
commonExpenseCreationSteps.java
@Given("^new \"([^\"]*)\" expense with type \"([^\"]*)\"$")
public ExpenseDTO newExpense(String description, String expenseType) throws Throwable {
    ExpenseDTO expenseDTO = new ExpenseDTO();
    expenseDTO.setDefaultPurpose(description);
    expenseDTO.setExpenseType(expenseType);
    return expenseSaveClient.save(expenseDTO);
}
expenseTransactionsSendSteps.java
@Given("^send expense for Approval$")
public void sendExpenseForApproval() throws InterruptedException {
    expenseTransactionSendClient.sendToApproval(expenseDTO);
}
How can I store the value returned by one step and use it in another? In this case I return an ExpenseDTO from the newExpense method, but I need to use it in sendExpenseForApproval and don't know how.
Create the expenseDTO object outside of your glue code, for example in your step definition class's constructor.
ExpenseDTO expenseDTO = new ExpenseDTO();
The way to share state between steps in the same class is to use instance variables. Set the value in one step and use that value in a later step.
The way to share state between steps with two or more step classes is to use dependency injection.
I wrote a blog post that describes how it can be done using PicoContainer.
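As a rough illustration of the dependency-injection approach, here is a plain-Java sketch with the Cucumber annotations and the clients left out. ExpenseContext is a hypothetical holder class, the ExpenseDTO here is a simplified stand-in for the one in the question, and the lastSent field exists only so the hand-off can be observed. The assumption is that a container such as PicoContainer would create one ExpenseContext per scenario and pass it to both step classes through their constructors; the sketch does that wiring by hand.

```java
// Simplified stand-in for the question's ExpenseDTO.
class ExpenseDTO {
    String defaultPurpose;
    String expenseType;
}

// Hypothetical per-scenario holder for state that several step classes need.
class ExpenseContext {
    ExpenseDTO savedExpense;
}

class CommonExpenseCreationSteps {
    private final ExpenseContext context;

    // A DI container would supply the shared context here.
    CommonExpenseCreationSteps(ExpenseContext context) {
        this.context = context;
    }

    // Would carry the @Given annotation in real glue code.
    void newExpense(String description, String expenseType) {
        ExpenseDTO dto = new ExpenseDTO();
        dto.defaultPurpose = description;
        dto.expenseType = expenseType;
        // Real glue code would store the result of expenseSaveClient.save(dto).
        context.savedExpense = dto;
    }
}

class ExpenseTransactionsSendSteps {
    private final ExpenseContext context;
    ExpenseDTO lastSent; // records what was sent, for this sketch only

    ExpenseTransactionsSendSteps(ExpenseContext context) {
        this.context = context;
    }

    // Would carry the @Given annotation in real glue code.
    void sendExpenseForApproval() {
        if (context.savedExpense == null) {
            throw new IllegalStateException("no expense has been created in this scenario");
        }
        // Real glue code would call expenseTransactionSendClient.sendToApproval(...).
        lastSent = context.savedExpense;
    }
}
```

Because both step classes hold the same ExpenseContext instance, the second step sees whatever the first step stored, and no step method needs a return value.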

Logic inside BuilderPattern

Recently I came across a builder pattern that intrigued me.
So, I have an EntityBuilder which builds an Entity, but it doesn't return the entity. Here is the method signature:
public void build();
Instead, inside the build() method, it delivers the new object created, the Entity, to a CacheImplementation instance to store it.
Note: the CacheImpl is injected in the builder's constructor.
public void build(){
//create new entity
cacheImplementation.add(entity);
}
Does this sound like best practice?
Later edit 0
public interface EntityBuilder {
void setProperty0(PropertyObject propertyObject0);
void setProperty1(PropertyObject propertyObject1);
void setProperty2(PropertyObject propertyObject2);
//...
void build();
}
public class EntityBuilderImpl implements EntityBuilder {
    PropertyObject propertyObject0;
    PropertyObject propertyObject1;
    PropertyObject propertyObject2;
    //...
    // setters for all properties
    @Override
    public void build() {
        //create new entity
        cacheImplementation.add(entity);
    }
}
The builder is used in the following way:
public class EntityProcessor{
private EntityBuilderFactory entityBuilderFactory;//initialized in constructor
void process(EntityDetails entityDetails){
EntityBuilder entityBuilder = this.entityBuilderFactory.getNewEntitytBuilder();
//..
// entityBuilder.set all properties from entityDetails
entityBuilder.build();
}
}
Note: the cacheImpl instance just stores the entities in a List<>, which is accessed every N seconds.
Does this sound like best practice?
The traditional builder pattern doesn't store the created object anywhere, it simply returns it.
I can imagine a variation where the builder also has a role of instance control to avoid creating duplicate objects, and managing a store of immutable objects.
The decision to not return an instance could be to make it clear that the method has a side effect. If the method returned the object, it might mislead to thinking that it's a traditional builder without side effects, when that's not the case here.
In any case, all this is just speculation, as we haven't seen the rest of the code where this is used and the way it is implemented and used. We don't have enough context to really judge.
There's nothing wrong with inventing new patterns, but it can be done well or badly.
I've seen a similar void build() method in the JCodeModel class. As you can see, it throws IOException because of the resources it manages:
public void build(File destDir,
PrintStream status)
throws IOException
You basically ask it to carry out the operation for you, and if no error occurs you can continue with the workflow.
In general, a builder is used in the following way:
Some class uses the builder to create an object. Simple.
Now you have an additional piece of complexity: caching. You can put the caching inside the Builder or one level higher, inside the Processor.
What are the implications of putting cache management inside the builder?
The Builder no longer has a single responsibility.
It does not work the way you would expect at first glance.
You are unable to create an object without putting it into the cache.
These problems will not occur if you move cache management to a separate class.
I would say that it is not a terrible solution, but it will certainly decrease the maintainability of your code.
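The separation suggested above might look like the following sketch. It makes several simplifying assumptions: the properties are plain Strings rather than PropertyObject, there are only two of them, and the names EntityCache and process(String, String) are made up for illustration. The builder only builds and returns the object; the processor decides whether to cache it.

```java
import java.util.ArrayList;
import java.util.List;

final class Entity {
    final String property0;
    final String property1;

    Entity(String property0, String property1) {
        this.property0 = property0;
        this.property1 = property1;
    }
}

// The builder has one job: assemble an Entity and hand it back.
class EntityBuilder {
    private String property0;
    private String property1;

    EntityBuilder setProperty0(String value) {
        this.property0 = value;
        return this;
    }

    EntityBuilder setProperty1(String value) {
        this.property1 = value;
        return this;
    }

    Entity build() {
        return new Entity(property0, property1);
    }
}

// Hypothetical cache: just a list, as in the question.
class EntityCache {
    private final List<Entity> entities = new ArrayList<>();

    void add(Entity entity) {
        entities.add(entity);
    }

    int size() {
        return entities.size();
    }
}

// Caching lives one level up, in the processor.
class EntityProcessor {
    private final EntityCache cache;

    EntityProcessor(EntityCache cache) {
        this.cache = cache;
    }

    Entity process(String property0, String property1) {
        Entity entity = new EntityBuilder()
                .setProperty0(property0)
                .setProperty1(property1)
                .build();
        cache.add(entity); // the side effect is visible at the call site
        return entity;
    }
}
```

With this split, code that only needs an Entity (a unit test, for instance) can call the builder directly without touching the cache.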

Cucumber Java - How to use returned parameters from a step in a new step?

I'm using Cucumber with Java to test my app.
I would like to know whether it is possible to take an object returned from the first scenario step and use it in other steps.
Here is an example of the desired feature file and Java code:
Scenario: create and check an object
Given I create an object
When I am using this object(#from step one)
Then I check the object is ok
@Given("^I create an object$")
public myObj I_create_an_object() throws Throwable {
    myObj newObj = new myObj();
    return newObj;
}

@When("^I am using this object$")
public void I_am_using_this_object(myObj obj) throws Throwable {
    doSomething(obj);
}

@Then("^I check the object is ok$")
public void I_check_the_object_is_ok() throws Throwable {
    check(obj);
}
I would rather not use variables in the class
(because then all method variables would be at class level),
but I'm not sure that's possible.
Is it possible to use the return value of one step method as the input of the next step?
There is no direct support for using the return values from step methods in other steps. As you said, you can achieve sharing of state via instance variables, which works fine for smaller examples. Once you get more steps and want to reorganize them into separate classes you might run into problems.
An alternative would be to encapsulate the state in its own class that manages it using ThreadLocals; you would have to make sure to initialize or reset this state, perhaps using hooks.
If you are using a dependency injection framework like Spring, you could use the provided scenario scope.
@Component
@Scope("cucumber-glue")
public class SharedContext { ... }
This context object could then be injected into multiple classes containing the steps.
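For the ThreadLocal alternative mentioned above, a minimal sketch could look like this. ScenarioState and its method names are hypothetical, and the assumption is that reset() would be called from a Cucumber hook before or after each scenario so that state never leaks from one scenario to the next.

```java
// Hypothetical holder that keeps one value per thread, so scenarios
// running in parallel on different threads do not see each other's state.
final class ScenarioState {
    private static final ThreadLocal<Object> CURRENT = new ThreadLocal<>();

    private ScenarioState() {
        // static access only
    }

    // Called by the step that creates the object.
    static void set(Object value) {
        CURRENT.set(value);
    }

    // Called by later steps that need the object.
    static Object get() {
        Object value = CURRENT.get();
        if (value == null) {
            throw new IllegalStateException("no object stored for this scenario");
        }
        return value;
    }

    // True if the current thread has a stored value.
    static boolean isSet() {
        return CURRENT.get() != null;
    }

    // Intended to be called from a hook between scenarios.
    static void reset() {
        CURRENT.remove();
    }
}
```

The first step would call ScenarioState.set(newObj) instead of returning the object, and later steps would call ScenarioState.get(). The cost is hidden global state, which is why the injected context object is usually easier to follow once there are several step classes.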

How do I replace an existing Singleton object with one from a saved file using ObjectInputStream?

I'm making a Java application that has basic Saving / Opening capabilities. All I need to save is the instance of my class ModeleImage which is a Singleton. My saving apparently works and looks like this:
ObjectOutputStream outputStream = new ObjectOutputStream(new FileOutputStream(file));
outputStream.writeObject(ModeleImage.getInstance());
outputStream.flush();
outputStream.close();
Now I'm trying to open that file with ObjectInputStream. I'm not sure if there's a way to simply replace my Singleton (ModeleImage) with the saved one but right now I'm only trying to copy and replace each attribute. My opening looks like this:
FileInputStream fis = new FileInputStream(fileChooser.getSelectedFile());
ObjectInputStream ois = new ObjectInputStream(fis);
//Get each attribute from the file and set them in my existing ModeleImage Singleton
ModeleImage.getInstance().setImage(((ModeleImage) ois.readObject()).getImage());
ModeleImage.getInstance().setLargeurImage(((ModeleImage) ois.readObject()).getLargeurImage());
ModeleImage.getInstance().setHauteurImage(((ModeleImage) ois.readObject()).getHauteurImage());
ModeleImage.getInstance().setxImage(((ModeleImage) ois.readObject()).getxImage());
ModeleImage.getInstance().setyImage(((ModeleImage) ois.readObject()).getyImage());
I also put try/catch around each. The problem is that my opening part catches an IOException when trying to replace attributes.
ModeleImage.getInstance().setImage(((ModeleImage) ois.readObject()).getImage());
//This catches an IOException
What am I doing wrong?
Is it because it's a Singleton or am I misunderstanding how ObjectInputStream and readObject() work?
By using a built-in feature of the serialization mechanism, you can customize the normal process by providing two methods inside your class. Those methods are:
private void writeObject(ObjectOutputStream out) throws IOException;
private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException;
Implement these methods in the ModeleImage class and you will control all aspects of serialization and have access to the internal state of the singleton.
You should only be calling readObject() once since you only wrote one object:
ModeleImage image = (ModeleImage) ois.readObject();
ModeleImage.getInstance().setImage(image.getImage());
ModeleImage.getInstance().setLargeurImage(image.getLargeurImage());
ModeleImage.getInstance().setHauteurImage(image.getHauteurImage());
ModeleImage.getInstance().setxImage(image.getxImage());
ModeleImage.getInstance().setyImage(image.getyImage());
What you should do is have a static block that checks for a serialized instance of your class. If it can find one, it sets it as your singleton instance (thus making sure you load the one from the file). If it can't find one (perhaps because your program is launching for the first time), then it should create an instance and assign it to your singleton variable.
You could create a save method, or override the finalize method to save your singleton, so that you can check for it in the static block the next time the class is loaded.
Make sense?
Implement readResolve on your serializable Singleton class to ensure there is only ever a single instance, and override the properties there, i.e.
private Object readResolve() throws ObjectStreamException
{
    instance.setImage(getImage());
    instance.setLargeurImage(getLargeurImage());
    ...
    return instance;
}
This concept is described nicely at http://www.javalobby.org/java/forums/t17491.html, or check out http://docs.oracle.com/javase/1.3/docs/guide/serialization/spec/input.doc6.html for more on readResolve(). Hope that helps.
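To see the readResolve approach end to end, here is a self-contained sketch. It is a simplification, not the asker's class: this ModeleImage has only two int fields (the image itself is left out), and the save() and load() helpers are hypothetical and use an in-memory byte array instead of a file. Note that load() calls readObject() exactly once, matching the single writeObject() call.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class ModeleImage implements Serializable {
    private static final long serialVersionUID = 1L;
    private static final ModeleImage INSTANCE = new ModeleImage();

    private int largeurImage;
    private int hauteurImage;

    private ModeleImage() {
    }

    static ModeleImage getInstance() {
        return INSTANCE;
    }

    int getLargeurImage() { return largeurImage; }
    int getHauteurImage() { return hauteurImage; }
    void setLargeurImage(int largeurImage) { this.largeurImage = largeurImage; }
    void setHauteurImage(int hauteurImage) { this.hauteurImage = hauteurImage; }

    // Invoked by deserialization on the freshly read copy ("this").
    // Copies its state into the singleton and returns the singleton,
    // so the temporary copy never escapes.
    private Object readResolve() {
        INSTANCE.largeurImage = this.largeurImage;
        INSTANCE.hauteurImage = this.hauteurImage;
        return INSTANCE;
    }

    // Hypothetical helper: serializes the singleton to a byte array.
    static byte[] save() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(INSTANCE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Hypothetical helper: one readObject() call for the one object written.
    static ModeleImage load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (ModeleImage) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After load() returns, ModeleImage.getInstance() is still the same object as before, but it now carries the saved field values, so code that already holds a reference to the singleton sees the restored state.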

Sending a variable to the Mapper Class

I am trying to get an input from the user and pass it to the mapper class I have created, but the value always initialises to zero instead of using the actual value the user entered.
How can I make sure that whenever I get the variable it always keeps the same value? I have noticed that job1.setMapperClass(Parallel_for.class); creates an instance of the class, forcing the variable to reinitialize to its original value. Below are links to the two classes. I am trying to get the value of times from the RunnerTool class.
Link to Java TestFor class
Link to RunnerTool class
// setup method in the Mapper
@Override
public void setup(Context context) {
    int defaultValue = 1;
    times = context.getConfiguration().getInt("parallel_for_iteration", defaultValue);
    LOG.info(context.getConfiguration().get("parallel_for_iteration") + " Actually name from the commandline");
    LOG.info(times + " Actually number of iteration from the commandline");
}
// RunnerTools class
conf.setInt(ITERATION, times);
Note that the mapper class will be instantiated on many cluster nodes, so any initialization done to an instance of the mapper class in the driver when running the job will not affect other nodes. Technically, the relevant jar file(s) are distributed among the nodes and the mappers are then created there.
So, as pointed out in the other answer, the only way to pass information to the mappers is through the Configuration class.
The Mapper gets instantiated by reflection, so you cannot let the user interact with the mapper class directly.
Instead you have your Configuration object, which you have to provide when setting up your job. There you can set a value using conf.set("YOUR KEY", "YOUR VALUE"). In your Mapper class you can override the method setup(Context context) and read the value there using context.getConfiguration().get("YOUR KEY"), then save it in a field of your mapper.
