I created a custom LoadFunc with a custom InputFormat and RecordReader.
Whenever the InputFormat returns more than one input split, the resulting PigSplit always contains only one input split and only one mapper is used.
The implementation is too big to post here, but are there any obvious reasons why this might happen?
Edit: I'm using Pig 0.13. By adding some logging I found that the InputFormat created by the LoadFunc returns a list containing two input splits, and that PigInputFormat uses this list to create PigSplits.
I still can't find out where Pig drops one of these input splits and uses only the first one.
This is the code from PigInputFormat.java (src), around line 273:
InputFormat inpFormat = loadFunc.getInputFormat();
List<InputSplit> oneInputSplits = inpFormat.getSplits(
        HadoopShims.createJobContext(inputSpecificJob.getConfiguration(),
                jobcontext.getJobID()));
List<InputSplit> oneInputPigSplits = getPigSplits(
        oneInputSplits, i, inpTargets.get(i),
        HadoopShims.getDefaultBlockSize(fs, isFsPath ? path : fs.getWorkingDirectory()),
        combinable, confClone);
splits.addAll(oneInputPigSplits);
I made sure that the LoadFunc returns two input splits, but somehow only one PigSplit is created.
Any clues on how this can be figured out?
Edit 2: So I downloaded the source code for Pig 0.13, compiled it, and ran my script against it, and surprisingly it worked fine and used both splits. Unfortunately I can't do that on the server node.
What I noticed is that the stack trace for creating the input splits differs between the precompiled Cloudera version and the version I compiled myself.
The Cloudera version creates the InputSplits using org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat, while the version I compiled uses org.apache.pig.impl.io.ReadToEndLoader.
I'm really getting confused about this one.
So after investigating this, it turns out there is a bug in Pig versions <= 0.13 that assumes each InputSplit has a length (it always assumes it is reading from a file). Because in my case CustomInputSplit.getLength() was returning 0, Pig took only the first InputSplit and dropped the others.
The workaround is simply to return a nonzero value from getLength() for the input split.
As I mentioned in the question, the behavior of loading the InputSplits changed after that, and the workaround is not needed in those cases.
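To illustrate the workaround, here is a minimal sketch of a custom split that reports a dummy nonzero length; the class name and the empty Writable methods are placeholders, not my actual implementation:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class CustomInputSplit extends InputSplit implements Writable {

    @Override
    public long getLength() throws IOException, InterruptedException {
        // Pig <= 0.13 drops splits whose reported length is 0,
        // so return any positive value even if there is no backing file.
        return 1;
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[0];
    }

    // Writable methods so the split can be serialized to the mappers;
    // real state (hosts, offsets, etc.) would be written/read here.
    @Override
    public void write(DataOutput out) throws IOException { }

    @Override
    public void readFields(DataInput in) throws IOException { }
}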
I have a map (a Guava BiMap) in which both keys and values are strings. I want to write a program that parses a given file and replaces every string in the file that is also a key in the BiMap with the corresponding value from the BiMap.
For example, I have a file called test.txt with the following text:
Java is a set of several computer software and specifications developed by Sun Microsystems.
and my BiMap has
"java i" => "value1"
"everal computer" => "value2" etc..
So now I want my program to take test.txt and the BiMap as input and produce output that looks something like this:
value1s a set of svalue2 software and specifications developed by Sun Microsystems.
Please point me towards any algorithm that can do this; the program takes large files as input, so brute force may not be a good idea.
Edit: I'm using fixed-length strings for keys and values.
That example was just intended to show the operation.
Thanks.
For a batch operation like this, I would avoid putting a lot of data into memory. Therefore I'd recommend writing the new content into a new file. If the result must end up in the exact same file, you can still replace one file with the other at the end of the process. Read, write, and flush each line separately, and you won't have any memory issues.
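A minimal sketch of that approach follows; the file names and map entries are placeholders, and it assumes the BiMap fits in memory, matching is case-sensitive, and no key spans a line break:

import com.google.common.collect.BiMap;
import com.google.common.collect.HashBiMap;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

public class StreamingReplacer {
    public static void main(String[] args) throws IOException {
        BiMap<String, String> map = HashBiMap.create();
        map.put("java i", "value1");
        map.put("everal computer", "value2");

        Path in = Paths.get("test.txt");
        Path out = Paths.get("test.out.txt");

        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Replace every occurrence of every key on this line only,
                // so memory usage stays flat regardless of file size.
                for (Map.Entry<String, String> e : map.entrySet()) {
                    line = line.replace(e.getKey(), e.getValue());
                }
                writer.write(line);
                writer.newLine();
            }
            writer.flush();
        }
        // If the original file must be overwritten, move 'out' over 'in' here.
    }
}

If scanning each line once per key turns out to be too slow for very large maps, a single-pass multi-pattern matcher such as Aho-Corasick is the usual next step, but that is beyond this sketch.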
My project is to develop a simple document index using MapReduce in Hadoop. I need to retrieve the start position (the byte offset from the beginning of the file) of the FileSplit that the map() function is currently working on. As far as I understand, the input split given to the Mapper is logically split into parts by a RecordReader, each of which is later map()-ed.
I read the FileSplit documentation and I tried doing:
((FileSplit) context.getInputSplit()).getStart()
but this always returns 0. Also, I am sure that the files are split into more than one part, as I did some printing, so I expected non-zero values here and there.
Has someone else run into the same problem? I should also mention that I have little experience with Hadoop.
Edit:
There are 6 input files, each around 16KB (8KB compressed). All files seem to be split into two (Map input records=12). Each Mapper has its map() called twice, but both times getStart() returns 0.
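For context, this is roughly how the call is made inside the mapper; a minimal sketch assuming the new mapreduce API and TextInputFormat, with the class name being just a placeholder:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OffsetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Ask the framework for the split this mapper is processing.
        FileSplit split = (FileSplit) context.getInputSplit();
        long start = split.getStart();   // this is the value that keeps coming back as 0
        context.write(new Text(split.getPath().getName()), new LongWritable(start));
    }
}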
I've built a parallel topic model using MALLET, and I want to get the top words for each document.
To do that, I'm trying to get a word-topic probability matrix.
How would I achieve this?
When you are building topics using MALLET, you have an option called --word-topic-counts-file. When you give this option and specify a file, MALLET writes one line per word with its per-topic counts. You can later read this file in C, Java or R (any language, really) to create the matrix you want.
Just to make one point regarding Praveen's answer.
Using --word-topic-counts-file, MALLET will create a file whose first few rows look something like this:
0 elizabeth 19:1
1 needham 19:2 17:1
2 died 19:2
3 mother 17:1 19:1 14:1
where the first line means that the word elizabeth appears in topic 19 once; the second line means that the word needham is associated with topic 19 twice and with topic 17 once; and so on.
Although this file doesn't give you explicit probabilities, you can use it to calculate them.
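For instance, here is a rough sketch of turning those counts into per-topic word probabilities; the file name is a placeholder, and it assumes the format shown above (an index, the word, then topic:count pairs):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class WordTopicProbabilities {
    public static void main(String[] args) throws IOException {
        // word -> (topic -> count)
        Map<String, Map<Integer, Long>> counts = new HashMap<>();
        // topic -> total count over all words, used to normalize into P(word | topic)
        Map<Integer, Long> topicTotals = new HashMap<>();

        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("word-topic-counts.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] tokens = line.trim().split("\\s+");
                if (tokens.length < 3) continue;   // expect: index, word, topic:count ...
                String word = tokens[1];
                Map<Integer, Long> perTopic = counts.computeIfAbsent(word, w -> new HashMap<>());
                for (int i = 2; i < tokens.length; i++) {
                    String[] pair = tokens[i].split(":");
                    int topic = Integer.parseInt(pair[0]);
                    long count = Long.parseLong(pair[1]);
                    perTopic.merge(topic, count, Long::sum);
                    topicTotals.merge(topic, count, Long::sum);
                }
            }
        }

        // P(word | topic) = count(word, topic) / total count of topic
        for (Map.Entry<String, Map<Integer, Long>> wordEntry : counts.entrySet()) {
            for (Map.Entry<Integer, Long> topicEntry : wordEntry.getValue().entrySet()) {
                double p = (double) topicEntry.getValue() / topicTotals.get(topicEntry.getKey());
                System.out.printf("P(%s | topic %d) = %.6f%n",
                        wordEntry.getKey(), topicEntry.getKey(), p);
            }
        }
    }
}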
I'm attempting to use a precomputed kernel with LibSVM 3.17 (Java Version) but am encountering an error which states: 'Wrong input format: sample_serial_number out of range' within the read_problem() method in the svm_train class.
I am using a linear kernel to begin with i.e. taking the dot-product of two vectors. The data I'm using has been scaled using svm_scale in the range [-1,1]. When saving my precomputed kernel, I'm saving out the ID of the row (which is effectively a unique identifier for the row) for my first column and the contents of the matrix for subsequent columns. My generated matrix is symmetric and I've included the first couple of entries of the file contents below for your evaluation:
1 0:10.3098007199 1:9.691388073999995 2:8.269529587900001 3:10.836359234799996
2 0:9.691388073999995 1:10.441238090599997 2:7.5937360488 3:9.193978496500002
3 0:8.269529587900001 1:7.5937360488 2:8.1263441462 3:9.8885507424
4 0:10.836359234799996 1:9.193978496500002 2:9.8885507424 3:13.705259598099996
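For comparison, the LibSVM README describes the precomputed-kernel training format (used with -t 4) roughly as follows, with a 1-based sample serial number at index 0 and the kernel values starting at index 1:

<label> 0:<sample_serial_number> 1:K(x_i, x_1) 2:K(x_i, x_2) ... L:K(x_i, x_L)

so, taking the first row above and a placeholder label of 1, a line would look something like:

1 0:1 1:10.3098007199 2:9.691388073999995 3:8.269529587900001 4:10.836359234799996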
The error itself occurs when the value:
48:0.015231278900000159
is encountered in my precomputed kernel file (which happens to be on the first line). The error arises because the value above fails the following test:
if ((int)prob.x[i][0].value <= 0 || (int)prob.x[i][0].value > max_index)
where prob.x[i][0].value = 0.015231278900000159 within read_problem() in svm_train.
I'm a bit stuck as to how to proceed with this. I'm wondering if I have saved the data in the correct file format? I have read the README within LibSVM and I think I'm doing everything correctly (but obviously not)!! I've also looked at other answers given already, such as:
Libsvm precomputed kernels and
Precomputed Kernels with LibSVM in Python
but unfortunately I can't see the answer within them.
One final note: when I scaled the data in the range [0,1], the above error did not happen (as all the values in the matrix were now >= 1), but I'm puzzled as to why a negative value within the matrix seems to be causing problems in the first place.
Any help/insight offered would be greatly appreciated.
I solved the problem with another careful reading of John Robertson's post at the following:
Precomputed Kernels with LibSVM in Python
I received the same message, and the solution turned out to be that the range of the parameters specified for training was not valid. For example, in my case I was trying to enter '-t 4' while the t flag options were (0,1,2,3) only.
I am trying to convert an XML/Bean (either one) to a fixed-length flat file with JRecord. I am not able to output it correctly as a String in a file; I can only get binary output.
So I just want to convert an XML/Bean to a String, not to a binary fixed-length format and so on.
Has anyone solved this issue with JRecord?
Is there any other framework you can recommend, ideally with an example?
I'm the author of JRecordBind, a Java library that does the same as JRecord (I think) and is based on XML Schema.
There are a couple of examples on the homepage, while more are available as input for the tests:
https://github.com/ffissore/jrecordbind/tree/master/jrecordbind-test/src/test/resources