Adding Writable objects to Hadoop Configuration

Adding Writable objects to Hadoop Configuration - java

I see that the Configuration class in Hadoop is writable http://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html. However, I do not see any of the methods that it has exposed that can be used to add a writable object (I see a lot of methods to set and get primitive types like int, long). Let us say, I have my own writable object and I want to add it to the configuration for all my mappers and reduces to use, how do I do this?
Thanks,
Venkat

The configuration is really not for passing entire objects. The configuration should be used more for setting simple parameters that are needed for the setup of the Mappers/Reducers. Think of the conf as you set the variables at the beginning of the job. If you make changes during the middle of a run to the configuration, it most likely won't be there at the end as it's not really meant to dynamically pass data.
What you are looking for if you want to pass around entire Objects between nodes is the Distributed Cache. Technically speaking these are files, but you can use standard object serialization to add them. About the Distributed Cache.
*apologies for linking different hadoop versions, their pages are a bit muddled and hard to find what you need sometimes.

You can check HBase sources (starting from HBase 0.94.6) MultiTableInputFormat.setConf() class method and appropriate TableMapReduceUtil code (for example .initTableMapperJob()). They pass Scan objects through configuration. Earlier TableInputFormat.setConf() class uses very similar mechanics.
Usually only minimal attributes are passed through config but this is probably case closer to your one.
Hope it will help.

Related

Best approach storing and accessing Java application data

I'm in the middle of a massive refactoring project, the code has a 5000 line main class which was injected into everything, stored everything and had all of the common code.
I'm no expert on analysis and design but I've separated out things to the best of my ability and I'm about 80% through refactoring the classes that depend on the main class to use the new classes I've created.
There are some types of data which are initialised when the application starts and accessed by pretty much everything throughout the life of the application. For instance there is a Config class which holds hundreds of parameters.
The approach I've taken is to create several singletons the two most central are GUIData and ClientData. GUIData contains a reference to the mainframe of the application and clientdata maintains references to the config and other similar classes.
This allows me to call ClientData.getInstance().getConfig().getParam("param") from anywhere in the code but I don't feel like this is the best approach.
I considered individual static classes instead of these data singletons which contain instances of the classes but some of the classes do need constructors.
I've been googling on and off for a week trying to find a better way to do this but somehow I always end up on threads talking about database caching

Immutable (configuration) instances provide "thread-safe application-wide data access".
Typesafe's config (as suggested in a comment by Brian Kent) does exactly that.
Note that this does not involve static classes or singletons. Static classes and singletons may serve your purposes now,
but they could prove bothersome in the future. They can be handy ofcourse, but try limiting their use.
Initialization will have to be done after reading and parsing the configuration data. It is typically done at application startup, before other processing threads are started. The initialization will have to validate the configuration data as much as possible in order to fail fast and terminate the program if the configuration data is no good.
Having a lot of configuration data bundled together can create "hidden lines of communication". E.g. you update one value and the application fails because it required updates to other values as well. It's perfectly fine to put all configuration data in one file and load it from there, but your application (with hundreds of configuration options) should divide the configuration data in sets that are used by different parts of your application. This improves isolation, helps unit-testing and makes it possible to change the application in the future without getting too many nasty surprises.
There are two ways to use a set of configuration data:
from within an object call a singleton Settings.getInstance().getConfigForThisModule().
provide each object that uses configuration data with the configuration data via the constructor or via setConfig(ConfigForThisModule config).
The first approach depends on a convention not to call Settings.getInstance().getConfigForACompletelyUnrelatedModule() which could be a weakness. The second approach is more in line with "dependency injection" and could be more future proof.
You could mix both approaches while you are refactoring, just make sure to be consistent (e.g. only use the singleton approach for configuration data that is used in all parts of the application).
To further improve your design for using the configuration data, keep the following (likely) future functional requirement in mind: when the configuration file is updated, configuration data is reloaded and used in the application. Most logging frameworks manage to support this functional requirement without affecting the performance of multi-threaded applications. Among other things, it requires the following of your application:
if the new configuation data is no good, the program is not terminated but an error is logged instead and the old configuration data remains in use. Your initialization procedure will need to handle both "load at fresh start" and "reload" scenarios. The main thing to take away from this is that your initialization procedure needs to be re-usable and should not affect other (running) parts of your application (isolation, again).
long-lived objects may not keep a local copy of configuration data or a reference to an instance of ConfigForThisModule, instead Settings.getInstance()... (or some other method that can return an updated instance) should be called regurarly.
replacing old configuration with new configuration may not result in errors. Technically, replacing the configuration is as simple as updating an AtomicReference with a new configuration instance returned with Settings.getInstance().... But this is also where the isolation of the configuration data sets are tested: there should be no problem using an old set in one module and a new set in another module at the same.
Configuration data can be seen as a sort of "global state". With that in mind, further design points on what to do and what to avoid (partially blatantly copied to this answer) are discussed in the following two questions:
Why is Global State so Evil?
How are globals any different from a database?

Sorry, the question is a bit vague, are you looking to store the config or the cached objects used by other parts of your program ?
Since you have 100s of params, start with splitting up the config into manageable blocks
1) Split up your configuration parameters into logical blocks that have 1:1 correspondence with a simple properties file -its going to take some time
2) These property files must be externalized so that you can change them at any point in time, make sure that you pass in the base location via a env variable to the program
3) Write a utility class (singleton) that wraps Apache commons configuration to hold your config. (read *.properties from the base location and merge the properties into one configuration object) this must be done before any threads are kicked off.
4) Refer to the configuration param in your code using config.getXXXX() methods
Apache commons config also has ability to reload the config when your properties file changes on the filesystem.
Once this is done, use a DI container like Spring or Guice to cache the configured objects.

If it's just String property values you need, you don't even need a class for that - a global facility exists for you already: System.getProperties()
All you need do is first load the property values on start up:
System.setProperty("myKey", "myValue"); // see below how load properties from a file
Then read it anywhere in your code:
String myValue = System.getProperty("myKey");
or
String myValue = System.getProperty("myKey", "my desired default");
If your container doesn't support property loading out of the box, to load properties from an external file that looks like this:
key1=value
key2=some other value
etc...
you can use this code:
Files.lines(Paths.get("path/to/file"))
.filter(line -> !line.startsWith("#") || !line.contains("=")) // ignore comment/blank
.map(line -> line.split("=", 2)) // split into key/value
.forEach(split -> System.setProperty(split[0], split[1])); // load as property

you can use the Java Properties class util, basically its a HashTable
reference : https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
you create a file fileName.properties and store your data in key value pairs, for example:
username=your name
port=8080
then you load it into Properties Object and get the data like the following:
Properties prop = new Properties();
load the file...
String userName = prop.getProperty("username")
String port = prop.getProperty("port")// you can parse it to int if needed
what i suggest is to create a property file for each type of configuration like:
clientData.properties
appConfig.properties
you can follow this simple tutorial
http://www.mkyong.com/java/java-properties-file-examples/

How can I compare 2 large objects running on separate jvm's?

I am looking at changing the way some large objects which maintain the data for a large website are reloaded, they contain data relating to catalogue structure, products etc and get reloaded daily.
After changing how they are reloaded I need to be able to see whether there is any difference in the resulting data so the intention is to reload both and compare the content.
There may be some issues(ie. lists used when ordering is not imporatant) that make the comparison harder so I would need to be able to alter the structure before comparison. I have tried to serialise to json using gson but I run out of memory. I'm thinking of trying other serialisation methods or writing my own simple one.
I imagine this is something that other people will have wanted to do when changing critical things like this but I haven't managed to find anythign about it.

In this special case (separate VMs) I suggest adding something like a dump method to each class which writes the relevant content into a file (human readable text). This method calls dump on each aggregated object as well.
In the end you have to files from each VM, and then you can compare them using an MD5 checksum for example.
This is probably a lot of work, but if you encounter any differences, you can use diff on both files, and this will be a great help.
You can start with a simple version, and refine it step-by-step by adding more output.
Adding (complete) serialization later to a class is cumbersome. There might be tools which simplify this (using reflection etc.), but in my experience you have to tweak your classes: Exclude fields which are not relevant, define a sort order for lists, cyclic relations etc.
Actually I use a similar approach for the same reasons (to check whether a new version still returns the same result): The application contains multiple services (for each version), the results are always data transfer objects, serialization is added immediately to the DTOs, and DTOs must provide a comparison method dedicated for this purpose.

Looking at the complications and memory issues, also as you have mentioned you dont want to maintain versions, i would look to use database for comparison.
It will need some effort in terms of mapping your data in jvm to db table but once you have done that, it will be staright forward. You can dump data from one large object in db tables and then you can simply run a check from 2nd object in db.
Creating a stored proc can simplify things. This solution can support data check from any number of jvms.

Design pattern for parameter settings that is maintainable in decent size java project

I am looking for concrete ideas of how to manage a lot of different parameter settings for my java program. I know this question is a bit diffuse but I need some ideas about the big picture so that my code becomes more maintainable.
What my project does is to perform many processing steps on data, mostly text. These processing steps are algorithms of varying complexity that often have many settings. I would also like to change which processing steps are used by e.g. configuration files.
The reason for my program is to do repeatable experiments, and because of this I need to be able to get a complete view of all the parameters used in the different parts of the code, preferably in a nice format.
At this (prototype) stage I have the settings in source code like:
public static final param1=0.35;
and each class that is responsible for some processing step has its own hard coded settings. It is actually quite scary because there is no simple way to change things or to even see what is done and with what parameters/settings.
My idea is to have a central key/value store for all settings that also supports a dump of all settings. Example:
k:"classA_parameter1",v:"0.35"
k:"classC_parameter5",v:"false"
However, I would not really like to just store the parameters as strings but have them associated to an actual java class or object.
Is it smarter to have a singleton "SettingsManager" that manages everything. Or to have a SettingsManager object in each class that main has access to? I don't really like storing string descriptions of the settings but I cant see any other way (Lets say one setting is a SAXparser implementation that is used and another parameter is a double, e.g. percentage) since I really don't want to store them as Objects and cast them.
Experience and links to pages about relevant design patterns is greatly appreciated.
To clarify, my experiments could be viewed as a series of algorithms that are working on data from files/databases. These algorithms are grouped into different classes depending on their task in the whole process, e.g.
Experiment //main
InternetLookup //class that controls e.g. web scraping
ThreadedWebScraper
LanguageDetection //from "text analysis" package
Statistics //Calculate and store statistics
DatabaseAccess
DecisionMaking //using the data that we have processed earlier, make decisions (machine learning)
BuildModel
Evaluate
Each of the lowest level classes have parameters and are different but I still want a to get a view of everything that is going on.

You have the following options, starting with the simplest one:
A Properties file
Apache Commons Configuration
Spring Framework
The latter allows creation of any Java object from an XML config file but note that it's a framework, not a library: this means that it affects the design of the whole application (it promotes the Inversion of Control pattern).

This wheel has been invented multiple times already.
From the most basic java.util.Properties to the more advanced frameworks like Spring, which offers advanced features like value injection and type conversion.
Building it yourself is probably the worst approach.

Maybe not a complete answer to your question, but some points to consider:
Storing values as strings (and parsing the strings into other types via your SettingsManager) is the usual approach. If your configuration value is too complex to do this then it's probably not really a configuration value, but part of your implementation.
Consider injecting the individual configuration values required by each class via constructor arguments, rather than just passing in the whole SettingsManager object (see Law of Demeter)
Avoid creating a Singleton SettingsManager if possible, singletons harm testability and damage the design of your application in various ways.

If the number of parameters is big I would split them to several config files. Apache Commons Configuration, as mentioned by #Pino is really a nice library to handle them.
On the Java-side I would probably create one config-class per file and wrap Commons Configuration config to load settings, eg:
class StatisticsConfig {
private Configuration config = ... ;
public double getParameter1() {
return config.getDouble("classA_parameter1");
}
}
This may need quite a lot of boilerplate code if the number of parameters is big but I think it is quite clean solution (and easy to refactor).

Configuration design pattern across Java system

The question is old hat - what is a proper design to support a configuration file or system configurations across our system? I've identified the following requirements:
Should be able to reload live and have changes picked up instantly with no redeploying
For software applications that rely on the same, e.g., SQL or memcached credentials, should be possible to introduce the change in an isolated place and deploy in one swoop, even if applications are on separate machines in separate locations
Many processes/machines running the same application supported
And the parts of this design I am struggling with:
Should each major class take its own "Config" class as an input parameter to the constructor? Should there be a factory responsible for instantiating with respect to the right config? Or should each class just read from its own config and reload somewhat automatically?
If class B derives from class A, or composes around it, would it make sense for the Config file to be inherited?
Say class A is constructed by M1 and M2 (M for "main") and M1 is responsible for instantiating a resource. Say the resource relies on MySQL credentials that I expect to be common between M1 and M2, is there a way to avoid the tradeoff of break ownership and put in A's config v. duplicate the resource across M1 and M2's config?
These are the design issues I'm dealing with right now and don't really know the design patterns or frameworks that work here. I'm in Java so any libraries that solve this are very welcome.

You may want to check out Apache Commons Config, which provides a wide range of features. You can specify multiple configuration sources, and arrange these into a hierarchy. One feature of particular interest is the provision for Configuration Events, allowing your components to register their interest in configuration changes.
The goal of changing config on the fly is seductive, but requires some thought around the design. You need to manage those changes carefully (e.g. what happens if you shrink queue sizes - do you throw away existing elements on the queue ?)

Should each major class take its own "Config" class as an input parameter to the constructor?
No, that sounds like an awful design which would unnecessarily overcomplicate a lot of code. I would recommend you to implement a global Configuration class as a singleton. Singleton means that there is only one configuration object, which is a private static variable of your Configuration class and can be acquired with a public static getInstance() method whenever it is needed.
This configuration object should store all configuration parameters as key/value pairs.

Why do some APIs (like JCE, JSSE, etc) provide their configurable properties through singleton maps?

For example:
Security.setProperty("ocsp.enable", "true");
And this is used only when a CertPathValidator is used. I see two options for imporement:
again singleton, but with getter and setter for each property
an object containing the properties relevant to the current context:
CertPathValidator.setValidatorProperties(..) (it already has a setter for PKIXParameters, which is a good start, but it does not include everything)
Some reasons might be:
setting the properties from the command line - a simple transformer from command-line to default values in the classes suggested above would be trivial
allowing additional custom properties by different providers - they can have public Map getProviderProperties(), or even public Object .. with casting.
I'm curious, because these properties are not always in the most visible place, and instead of seeing them while using the API, you have to go though dozens of google results before (if lucky) getting them. Because - in the first place - you don't always know what exactly you are looking for.
Another fatal drawback I just observed is that this is not thread-safe. For example if two threads want to check a revocation via ocsp, they have to set the ocsp.responderURL property.. and perhaps override the settings of each other.

This is actually a great question that forces you to think about design decisions you may have made in the past. Thanks for asking a question that should have occurred to me years ago!
It sounds like the objection is not so much the singleton aspect of this (although an entirely different discussion could occur about that) - but the use of string keys.
I've worked on APIs that used this sort of scheme, and the reasons you outline above were definitely the driving factors - it makes it crazy simple to parse a command line or properties file, and it allows for 3rd party extensibility without impact to the official API.
In our library, we actually had a class with a bunch of static final String entries for each of the official parameters. This gave us the best of both worlds - the developer could still use code completion where it made sense to do so. It also becomes possible to construct hierarchies of related settings using inner classes.
All that said, I think that the first reason (easy parsing of command line) doesn't really cut it. Creating a reflection driven mechanism for pushing settings into a bunch of setters would be fairly easy, and it would prevent the cruft of String->object transformation from drifting into the main application classes.
Extensibility is a bit trickier, but I think it could still be handled using a reflection driven system. The idea would be to have the main configuration object (the one with all the setters in it) also have a registerExtensionConfiguration(xxx) method. A standard notation (probably dot separated) could be used to dive into the resultant acyclic graph of configuration objects to determine where the setter should be called.
The advantage of the above approach is that it puts all of the command line argument/properties file parsing exception handling in one place. There isn't a risk of a mis-formatted argument floating around for weeks before it gets hit.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.