Hadoop: MapReduce MinMax result different from original dataset - java

I am new to Hadoop.
I am trying to use MapReduce to get the min and max Monthly Precipitation value for each year.
Here is what one year of the data set looks like:
Product code,Station number,Year,Month,Monthly Precipitation Total (millimetres),Quality
IDCJAC0001,023000,1839,01,11.5,Y
IDCJAC0001,023000,1839,02,11.4,Y
IDCJAC0001,023000,1839,03,20.8,Y
IDCJAC0001,023000,1839,04,10.5,Y
IDCJAC0001,023000,1839,05,4.8,Y
IDCJAC0001,023000,1839,06,90.4,Y
IDCJAC0001,023000,1839,07,54.2,Y
IDCJAC0001,023000,1839,08,97.4,Y
IDCJAC0001,023000,1839,09,41.4,Y
IDCJAC0001,023000,1839,10,40.8,Y
IDCJAC0001,023000,1839,11,113.2,Y
IDCJAC0001,023000,1839,12,8.9,Y
And this is the result I get for the year 1839:
1839 1.31709005E9 1.3172928E9
Obviously, the result does not match the original data, but I cannot figure out why this happens...

Your code has multiple issues.
(1) In MinMaxExposure, you write doubles but read ints. You also use the Double type (implying that you care about nulls) but do not handle nulls in serialization/deserialization. If you really need nulls, you should write something like this:
// write
out.writeBoolean(value != null);
if (value != null) {
    out.writeDouble(value);
}

// read
if (in.readBoolean()) {
    value = in.readDouble();
} else {
    value = null;
}
If you do not need to store nulls, replace Double with double.
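If you go the primitive route, a minimal sketch of such a Writable could look like the following. This is an assumption about your class based on the getters and setters used in the reducer code below, not your actual code:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal sketch: both fields are primitive doubles, written and read in the same order.
public class MinMaxExposure implements Writable {
    private double minExposure;
    private double maxExposure;

    public double getMinExposure() { return minExposure; }
    public void setMinExposure(double minExposure) { this.minExposure = minExposure; }
    public double getMaxExposure() { return maxExposure; }
    public void setMaxExposure(double maxExposure) { this.maxExposure = maxExposure; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(minExposure); // write and read must use the same types and order
        out.writeDouble(maxExposure);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        minExposure = in.readDouble();
        maxExposure = in.readDouble();
    }

    @Override
    public String toString() {
        return minExposure + "\t" + maxExposure; // used by TextOutputFormat when the reducer emits this value
    }
}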
(2) In the map function you wrap your code in IOException catch blocks. This doesn't make any sense: if the input data has records in an incorrect format, you will most likely get a NullPointerException or NumberFormatException from Double.parseDouble(), and you do not handle those exceptions.
Checking for null after you have called parseDouble also doesn't make sense.
(3) You pass the map key to the reducer as Text. I would recommend passing the year as IntWritable (and configuring your job with job.setMapOutputKeyClass(IntWritable.class);); see the mapper sketch after this list.
(4) maxExposure must be handled similarly to minExposure in the reducer code. Currently you just return the value from the last record.
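Putting points (2) and (3) together, a mapper along these lines might work. This is only a sketch: the class name is illustrative, the column positions follow the CSV sample in the question, and if you emit IntWritable keys the reducer signature and job configuration must use IntWritable as well:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExposureMapper
        extends Mapper<LongWritable, Text, IntWritable, MinMaxExposure> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        // Skip the header row and anything too short to contain a precipitation value.
        if (fields.length < 5 || fields[0].startsWith("Product")) {
            return;
        }
        try {
            int year = Integer.parseInt(fields[2]);
            double precipitation = Double.parseDouble(fields[4]);
            MinMaxExposure exposure = new MinMaxExposure();
            exposure.setMinExposure(precipitation);
            exposure.setMaxExposure(precipitation);
            context.write(new IntWritable(year), exposure);
        } catch (NumberFormatException e) {
            // Malformed numeric field: skip this record instead of crashing.
        }
    }
}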

Your logic to find the min and max exposure in the Reducer seems off: you set maxExposure twice and never check whether it is actually the maximum. I'd go with:
public void reduce(Text key, Iterable<MinMaxExposure> values,
        Context context) throws IOException, InterruptedException {
    // Note: Double.MIN_VALUE is the smallest *positive* double, so it is not a safe
    // initial maximum; start from the infinities instead.
    double minExposure = Double.POSITIVE_INFINITY;
    double maxExposure = Double.NEGATIVE_INFINITY;
    for (MinMaxExposure val : values) {
        if (val.getMinExposure() < minExposure) {
            minExposure = val.getMinExposure();
        }
        if (val.getMaxExposure() > maxExposure) {
            maxExposure = val.getMaxExposure();
        }
    }
    MinMaxExposure resultRow = new MinMaxExposure();
    resultRow.setMinExposure(minExposure);
    resultRow.setMaxExposure(maxExposure);
    context.write(key, resultRow);
}

Related

ConcurrentHashMap throws recursive update exception

Here is my Java code:
static Map<BigInteger, Integer> cache = new ConcurrentHashMap<>();

static Integer minFinder(BigInteger num) {
    if (num.equals(BigInteger.ONE)) {
        return 0;
    }
    if (num.mod(BigInteger.valueOf(2)).equals(BigInteger.ZERO)) {
        // focus on what's happening inside this block, since with the given inputs it won't reach the last return
        return 1 + cache.computeIfAbsent(num.divide(BigInteger.valueOf(2)),
                n -> minFinder(n));
    }
    return 1 + Math.min(cache.computeIfAbsent(num.subtract(BigInteger.ONE), n -> minFinder(n)),
            cache.computeIfAbsent(num.add(BigInteger.ONE), n -> minFinder(n)));
}
I tried to memoize a function that returns the minimum number of actions (dividing by 2, subtracting one, or adding one).
The problem I'm facing is that when I call it with smaller inputs such as:
minFinder(new BigInteger("32"))
it works, but with bigger values like:
minFinder(new BigInteger("64"))
it throws a "Recursive update" exception.
Is there any way to increase the recursion size to prevent this exception, or any other way to solve this?
From the API docs of Map.computeIfAbsent():
The mapping function should not modify this map during computation.
The API docs of ConcurrentHashMap.computeIfAbsent() make that stronger:
The mapping function must not modify this map during computation.
(Emphasis added)
You are violating that by using your minFinder() method as the mapping function. That it seems nevertheless to work for certain inputs is irrelevant. You need to find a different way to achieve what you're after.
Is there any way to increase recursion size to prevent this exception or any other way to solve this?
You could avoid computeIfAbsent() and instead do the same thing the old-school way:
BigInteger halfNum = num.divide(BigInteger.valueOf(2));
Integer cachedValue = cache.get(halfNum);
if (cachedValue == null) {
    cachedValue = minFinder(halfNum);
    cache.put(halfNum, cachedValue);
}
return 1 + cachedValue;
But that's not going to be sufficient if the computation loops. You could perhaps detect that by putting a sentinel value into the map before you recurse, so that you can recognize loops.
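A rough sketch of that sentinel idea follows. The IN_PROGRESS marker, the plain HashMap, and the exception are additions for illustration, not part of the original code, and thread safety is ignored for simplicity:
import java.math.BigInteger;
import java.util.HashMap;
import java.util.Map;

class MinFinder {
    // Sentinel stored while a key is being computed; seeing it again means we re-entered the same key.
    private static final Integer IN_PROGRESS = Integer.MIN_VALUE;
    private static final Map<BigInteger, Integer> cache = new HashMap<>();

    static Integer minFinder(BigInteger num) {
        if (num.equals(BigInteger.ONE)) {
            return 0;
        }
        Integer cached = cache.get(num);
        if (cached != null) {
            if (cached.equals(IN_PROGRESS)) {
                throw new IllegalStateException("Cycle detected at " + num);
            }
            return cached;
        }
        cache.put(num, IN_PROGRESS); // mark as "being computed" before recursing
        int result;
        if (num.testBit(0)) { // odd: try both neighbours
            result = 1 + Math.min(minFinder(num.subtract(BigInteger.ONE)),
                                  minFinder(num.add(BigInteger.ONE)));
        } else {              // even: halve
            result = 1 + minFinder(num.shiftRight(1));
        }
        cache.put(num, result); // replace the sentinel with the real value
        return result;
    }
}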

Handle long min value condition

When I run my program, Long.MIN_VALUE is getting persisted instead of the original value coming from the backend.
I am using the code:
if (columnName.equals(Fields.NOTIONAL)) {
    orderData.notional(getNewValue(data));
}
As output of this, I am getting Long.MIN_VALUE instead of the original value.
I tried using this method to handle the scenario:
public String getNewValue(Object data) {
    return ((Long) data).getLong("0") == Long.MIN_VALUE ? "" : ((Long) data).toString();
}
but it doesn't work.
Please suggest.
EDITED: I misread the code in the question; rereading it, I now get what the author is trying to do, and have cleaned up the suggestion accordingly.
((Long) data).getLong("0") is a silly way to write null, because that call doesn't do what you think. It retrieves the system property named '0' and attempts to parse it as a Long value. As in, if you start your VM with java -D0=1234 com.foo.YourClass, it returns 1234. I don't even know what you're attempting to accomplish with this call. Obviously it is not equal to Long.MIN_VALUE, so the method returns ((Long) data).toString(). If data is in fact a Long representing MIN_VALUE, you'll get the digits of MIN_VALUE, which is clearly not what you wanted.
Try this:
public String getNewValue(Object data) {
    if (data instanceof Number) {
        long v = ((Number) data).longValue();
        return v == Long.MIN_VALUE ? "" : data.toString();
    }
    // what do you want to return if the input isn't a numeric object at all?
    return "";
}

Hadoop Custom Partitioner not behaving according to the logic

Based on this example here, this works. I have tried the same on my dataset.
Sample Dataset:
OBSERVATION;2474472;137176;
OBSERVATION;2474473;137176;
OBSERVATION;2474474;137176;
OBSERVATION;2474475;137177;
Considering each line as a string, my Mapper output is:
key -> string[2], value -> string.
My Partitioner code:
@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    String keyStr = key.toString();
    if (keyStr == "137176") {
        return 0;
    } else {
        return 1 % reducersDefined;
    }
}
In my data set most IDs are 137176. Reducers declared: 2. I expect two output files, one for 137176 and a second for the remaining IDs. I am getting two output files, but the IDs are evenly distributed across both output files. What's going wrong in my program?
Explicitly set in the Driver method that you want to use your custom Partitioner, by using: job.setPartitionerClass(YourPartitioner.class);. If you don't do that, the default HashPartitioner is used.
Change String comparison method from == to .equals(). i.e., change if(keyStr == "137176") { to if(keyStr.equals("137176")) {.
To save some time, it will perhaps be faster to declare a Text variable at the beginning of the partitioner, like this: Text KEY = new Text("137176");, and then, instead of converting your input key to String every time, just compare it with the KEY variable (again using the equals() method). But perhaps those are equivalent. So, what I suggest is:
Text KEY = new Text("137176");

@Override
public int getPartition(Text key, Text value, int reducersDefined) {
    return key.equals(KEY) ? 0 : 1 % reducersDefined;
}
Another suggestion, if the network load is heavy, parse the map output key as VIntWritable and change the Partitioner accordingly.
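For completeness, a minimal sketch of the relevant driver lines (inside main) might look like the following. The Observation* class names and the input/output paths are placeholders; YourPartitioner stands for whatever your partitioner class is called:
// Register the custom partitioner and request two reducers explicitly.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "partition observations by id");
job.setJarByClass(ObservationDriver.class);
job.setMapperClass(ObservationMapper.class);
job.setReducerClass(ObservationReducer.class);
job.setPartitionerClass(YourPartitioner.class); // without this, the default HashPartitioner is used
job.setNumReduceTasks(2);                       // two partitions -> two output files
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);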

Why isn't it an error?

The following program is a recursive program to find the maximum and minimum of an array. (I think! Please tell me if it is not a valid recursive program. Though there are easier ways to find the maximum and minimum in an array, I'm doing it recursively only as part of an exercise!)
This program works correctly and produces the output as expected.
On the line where I have marked "Doubt here!", I am unable to understand why an error is not given during compilation. The return type is clearly an integer array (as specified in the method definition), but I have not assigned the returned data to any integer array, yet the program still works. I was expecting an error during compilation if I did it this way, but it worked. If someone could help me figure this out, it'd be helpful! :)
import java.io.*;

class MaxMin_Recursive
{
    static int i = 0, max = -999, min = 999;

    public static void main(String[] args) throws IOException
    {
        BufferedReader B = new BufferedReader(new InputStreamReader(System.in));
        int[] inp = new int[6];
        System.out.println("Enter a maximum of 6 numbers..");
        for (int i = 0; i < 6; i++)
            inp[i] = Integer.parseInt(B.readLine());
        int[] numbers_displayed = new int[2];
        numbers_displayed = RecursiveMinMax(inp);
        System.out.println("The maximum of all numbers is " + numbers_displayed[0]);
        System.out.println("The minimum of all numbers is " + numbers_displayed[1]);
    }

    static int[] RecursiveMinMax(int[] inp_arr) // remember to specify that the return type is an integer array
    {
        int[] retArray = new int[2];
        if (i < inp_arr.length)
        {
            if (max < inp_arr[i])
                max = inp_arr[i];
            if (min > inp_arr[i])
                min = inp_arr[i];
            i++;
            RecursiveMinMax(inp_arr); // Doubt here!
        }
        retArray[0] = max;
        retArray[1] = min;
        return retArray;
    }
}
The return type is clearly an integer array (as specified in the method definition), but I have not assigned the returned data to any integer array, but the program still works.
Yes, because it's simply not an error to ignore the return value of a method. Not as far as the compiler is concerned. It may well represent a bug, but it's a perfectly valid use of the language.
For example:
Console.ReadLine(); // User input ignored!
"text".Substring(10); // Result ignored!
Sometimes I wish it could be used as a warning - and indeed ReSharper will give warnings when it can detect that "pure" methods (those without any side effects) are called without using the return value. In particular, calls which cause problems in real life:
- Methods on string such as Replace and Substring, where users assume that calling the method alters the existing string
- Stream.Read, where users assume that all the data they've requested has been read, when actually they should use the return value to see how many bytes were actually read
There are times where it's entirely appropriate to ignore the return value for a method, even when it normally isn't for that method. For example:
TValue GetValueOrDefault<TKey, TValue>(Dictionary<TKey, TValue> dictionary, TKey key)
{
    TValue value;
    dictionary.TryGetValue(key, out value);
    return value;
}
Normally when you call TryGetValue you want to know whether the key was found or not - but in this case value will be set to default(TValue) even if the key wasn't found, so we're going to return the same thing anyway.
In Java (as in C and C++) it is perfectly legal to discard the return value of a function. The compiler is not obliged to give any diagnostic.
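A tiny Java counterpart of the C# snippets above (purely illustrative) shows the same thing compiling without complaint:
public class IgnoreReturn {
    public static void main(String[] args) {
        "text".substring(1);            // result thrown away - compiles, but almost certainly a bug
        Integer.parseInt("42");         // likewise ignored
        String s = "text".substring(1); // the value is only kept if you assign or use it
        System.out.println(s);          // prints "ext"
    }
}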

Meaning of initializing the Big Decimal to -99 in the below code

I know Hashtable doesn't allow null keys, but then how is the below code working?
And what does initializing the BigDecimal to -99 in the below code do?
private static final BigDecimal NO_REGION = new BigDecimal(-99);

public List getAllParameters(BigDecimal region, String key) {
    List values = null;
    if (region == null) {
        region = NO_REGION;
    }
    Hashtable paramCache = (Hashtable) CacheManager.getInstance().get(ParameterCodeConstants.PARAMETER_CACHE);
    if (paramCache.containsKey(region)) {
        values = (List) ((Hashtable) paramCache.get(region)).get(key);
    }
    return values;
}
I have been struggling with this for a long time and don't understand it.
This is an implementation of the null object pattern: a special object, BigDecimal(-99), is designated to play the role of null in a situation where "real" nulls are not allowed.
The only requirement is that the null object must be different from all "regular" objects. This way, the next time the program needs to find entries with no region, all it needs to do is a lookup by the NO_REGION key.
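As a hedged illustration of the write side (the method, the generics, and the variable names are invented for this sketch, not taken from the original code), populating such a cache typically substitutes the same sentinel before inserting:
import java.math.BigDecimal;
import java.util.Hashtable;
import java.util.List;

class ParameterCacheWriter {
    private static final BigDecimal NO_REGION = new BigDecimal(-99);

    static void putParameter(Hashtable<BigDecimal, Hashtable<String, List<String>>> paramCache,
                             BigDecimal region, String key, List<String> values) {
        // A Hashtable cannot hold null keys, so "no region" is mapped to the NO_REGION sentinel.
        BigDecimal regionKey = (region == null) ? NO_REGION : region;
        paramCache.computeIfAbsent(regionKey, r -> new Hashtable<>()).put(key, values);
    }
}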
Regions are identified by a BigDecimal key in the hashtable; when no region is provided (null), a default value of -99 is used.
It just looks like poor code to me - if something that short makes you "struggle for a long time", that is usually the best indicator.
Just cleaning it up a little and it probably will make a lot more sense:
private static Hashtable paramCache = (Hashtable) CacheManager.getInstance().get(ParameterCodeConstants.PARAMETER_CACHE);

public List getAllParameters(BigDecimal region, String key) {
    List values = null;
    if (region != null && paramCache.containsKey(region)) {
        Hashtable regionMap = (Hashtable) paramCache.get(region);
        values = (List) regionMap.get(key);
    }
    return values;
}
It seems the writer into the hashtable used NO_REGION as the key for values without a region, so the reader is doing the same thing.
