My project is to develop a simple document index using MapReduce in Hadoop. I need to retrieve the start position (the byte offset from the beginning of the file) of the FileSplit that the map() function is currently working on. As far as I understand, the input split given to the Mapper is logically split into records by a RecordReader, each of which is later map()-ed.
I read the FileSplit documentation and I tried doing:
((FileSplit) context.getInputSplit()).getStart()
, but this always returns 0. Also, I am sure that the files are split into more than one part, as I did some printing, so I expected non-zero values here and there.
Has anyone else run into the same problem? I should also mention that I have little experience with Hadoop.
Edit:
There are 6 input files, each around 16KB (8KB compressed). All files seem to be split into two (Map input records=12). Each Mapper has its map() called twice, but both times getStart() returns 0.
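For reference, a minimal sketch of how the split information can be printed from inside a Mapper (the mapper class and key/value types here are placeholders, using the org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitLoggingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void setup(Context context) {
        // Cast the generic InputSplit back to a FileSplit to reach the file
        // path, start offset and length of this particular split.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.err.printf("split: path=%s start=%d length=%d%n",
                split.getPath(), split.getStart(), split.getLength());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // With TextInputFormat the LongWritable key is already the byte offset
        // of the current line from the beginning of the file; getStart() only
        // becomes non-zero when a single file is divided into several splits.
        context.write(value, key);
    }
}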
I have read a bit about multidimensional arrays. Would it make sense to solve this problem using such a data structure in Java, or how should I proceed?
Problem
I have a text file containing records which contain multiple lines. One record is anything between <SUBBEGIN and <SUBEND.
The lines in a record follow no predefined order and may be absent. In the input file (see below) I am only interested in the MSISDN, CB, CF and ODBIC fields.
For each of these fields I would like to apply a regular expression to extract the value to the right of the equals sign.
The output file would be a comma-separated file containing these values. For example, for the line
MSISDN=431234567893 the value 431234567893 is written to the output file.
Error checking
NoMSISDNnofound when no MSISDN is found in a record
noCFUALLPROVNONE when no CFU-ALL-PROV-NONE is found in a record
Search and replace operations
CFU-ALL-PROV-NONE should be replaced by CFU-ALL-PROV-1/1/1
CFU-TS10-ACT-914369223311 should be replaced by CFU-TS10-ACT-1/1/0/4369223311
Output for first record
431234567893,BAOC-ALL-PROV,BOIC-ALL-PROV,BOICEXHC-ALL-PROV,BICROAM-ALL-PROV,CFU-ALL-PROV-1/1/1,CFB-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFNRY-ALL-PROV-1/1/1,CFU-TS10-ACT-1/1/1/4369223311,BAIC,BAOC
Input file
<BEGINFILE>
<SUBBEGIN
IMSI=11111111111111;
MSISDN=431234567893;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
IMEISV=4565676567576576;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-
YES-YES-NO;
ODBIC=BAIC;
ODBOC=BAOC;
ODBROAM=ODBOHC;
ODBPRC=ENTER;
ODBPRC=INFO;
ODBPLMN=NONE;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=YES;
ODBADULTSMS=YES;
<SUBEND
<SUBBEGIN
IMSI=11111111111133;
MSISDN=431234567899;
CB=BAOC-ALL-PROV;
CB=BOIC-ALL-PROV;
CB=BOICEXHC-ALL-PROV;
CB=BICROAM-ALL-PROV;
CW=CW-ALL-PROV;
CF=CFU-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO+-NO-NO;
CF=CFB-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRY-ALL-PROV-NONE-YES-YES-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFNRC-ALL-PROV-NONE-YES-NO-NONE-YES-65535-NO-NO-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFU-TS10-ACT-914369223311-YES-NO-NONE-YES-65535-YES-YES-NO-NO-NO-NO-NO-NO-NO-NO;
CF=CFD-TS10-REG-91430000000-YES-YES-25-YES-65535-YES-YES-NO-NO-NO-YES-YES-YES-YES-NO;
ODBIC=BICCROSSDOMESTIC;
ODBOC=BAOC;
ODBROAM=ODBOH;
ODBPRC=INFO;
ODBPLMN=PLMN1
ODBPLMN=PLMN3;
ODBPOS=NOBPOS-BOTH;
ODBECT=OdbAllECT;
ODBDECT=YES;
ODBMECT=YES;
ODBPREMSMS=NO;
ODBADULTSMS=YES;
<SUBEND
From what I understand, you are simply reading a text file, processing it and maybe replacing some words. You therefore do not need a data structure to store the words in. Instead you can simply read the file line by line, pass each line through a set of if statements (maybe with a couple of booleans to track whether the specific fields you are searching for have been found), and then write the lines you want to a new file.
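A minimal sketch of that idea (the regular expression, file names and replacement rules are assumptions based on the description above; it does not handle CF lines that wrap onto a second line, trim their trailing flag fields, or implement the noCFUALLPROVNONE check):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordExtractor {

    // captures the value to the right of the equals sign, up to the ';'
    private static final Pattern VALUE = Pattern.compile("^(MSISDN|CB|CF|ODBIC)=([^;]+);?$");

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter("output.csv")) {

            List<String> fields = new ArrayList<>();
            boolean msisdnFound = false;
            String line;

            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.startsWith("<SUBBEGIN")) {          // start of a record
                    fields.clear();
                    msisdnFound = false;
                } else if (line.startsWith("<SUBEND")) {      // end of a record: emit one CSV line
                    if (!msisdnFound) {
                        out.println("NoMSISDNnofound");
                    } else {
                        out.println(String.join(",", fields));
                    }
                } else {
                    Matcher m = VALUE.matcher(line);
                    if (m.matches()) {
                        msisdnFound |= m.group(1).equals("MSISDN");
                        // search-and-replace rules from the question
                        String value = m.group(2)
                                .replace("CFU-ALL-PROV-NONE", "CFU-ALL-PROV-1/1/1")
                                .replace("CFU-TS10-ACT-914369223311", "CFU-TS10-ACT-1/1/0/4369223311");
                        fields.add(value);
                    }
                }
            }
        }
    }
}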
Dealing with big files to feed data into machine learning algorithms, I did it by reading the whole file contents into a variable and then, using the String.split("delimiter") method (available since Java 1.4), breaking the contents into a one-dimensional array, where each cell holds the text before the delimiter.
First read the file via a Scanner or however you prefer (let content be the variable holding your data), and then break it up with
content.split("<SUBEND");
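For example (a rough sketch; the file name is assumed, and reading the whole file into memory is fine at these sizes):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SplitIntoRecords {
    public static void main(String[] args) throws Exception {
        // read the whole file into one string (fine for small files)
        String content = new String(Files.readAllBytes(Paths.get("input.txt")),
                                    StandardCharsets.UTF_8);

        // each array element is now one record (everything up to the next <SUBEND)
        String[] records = content.split("<SUBEND");

        for (String record : records) {
            // the per-record field extraction would go here, e.g. matching "MSISDN="
            System.out.println("record with " + record.split("\n").length + " lines");
        }
    }
}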
I want to run a Dataflow job over multiple inputs from Google Cloud Storage, but the paths I want to pass to the job can't be specified with just the * glob operator.
Consider these paths:
gs://bucket/some/path/20160208/input1
gs://bucket/some/path/20160208/input2
gs://bucket/some/path/20160209/input1
gs://bucket/some/path/20160209/input2
gs://bucket/some/path/20160210/input1
gs://bucket/some/path/20160210/input2
gs://bucket/some/path/20160211/input1
gs://bucket/some/path/20160211/input2
gs://bucket/some/path/20160212/input1
gs://bucket/some/path/20160212/input2
I want my job to work on the files in the 20160209, 20160210 and 20160211 directories, but not on 20160208 (the first) and 20160212 (the last). In reality there's a lot of more dates, and I want to be able to specify an arbitrary range of dates for my job to work on.
The docs for TextIO.Read say:
Standard Java Filesystem glob patterns ("*", "?", "[..]") are supported.
But I can't get this to work. There's a link to Java Filesystem glob patterns, which in turn links to getPathMatcher(String), which lists all the globbing options. One of them is {a,b,c}, which looks like exactly what I need. However, if I pass gs://bucket/some/path/201602{09,10,11}/* to TextIO.Read#from, I get "Unable to expand file pattern".
Maybe the docs mean that only *, ? and […] are supported, and if that is the case, how can I construct a glob that Dataflow will accept and that can match an arbitrary date range like the one I describe above?
Update: I've figured out that I can write a chunk of code so that I can pass in the path prefixes as a comma-separated list, create an input from each and use the Flatten transform, but that seems like a very inefficient way of doing it. It looks like the first step reads all the input files and immediately writes them out again to a temporary location on GCS. Only when all the inputs have been read and written does the actual processing start. This step is completely unnecessary in the job I'm writing. I want the job to read the first file, start processing it, read the next one, and so on. This approach also caused a ton of other problems; I'll try to make it work, but it feels like a dead end because of the initial rewriting.
The docs do, indeed, mean that only *, ?, and [...] are supported. This means that arbitrary subsets or ranges in alphabetical or numeric order cannot be expressed as a single glob.
Here are some approaches that might work for you:
If the date represented in the file path is also present in the records in the files, then the simplest solution is to read them all and use a Filter transform to select the date range you are interested in.
The approach you tried, of many reads in separate TextIO.Read transforms that are then flattened together, is OK for small sets of files; our tf-idf example does this. You can express arbitrary numerical ranges with a small number of globs, so this need not be one read per file (for example, the two-character range "23 through 67" is 2[3-9] plus [3-5][0-9] plus 6[0-7]).
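For example, a rough sketch of that pattern in Java (written against the Apache Beam style of the SDK; exact method names such as TextIO.read() differ between SDK versions, so treat it as illustrative rather than exact):

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class GlobUnionExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // a small number of globs covering the 09-11 range
        List<String> patterns = Arrays.asList(
                "gs://bucket/some/path/20160209/*",        // the single day 09
                "gs://bucket/some/path/2016021[0-1]/*");   // days 10 and 11 as one glob

        // one read per glob, flattened into a single PCollection
        PCollectionList<String> pieces = PCollectionList.empty(p);
        for (String pattern : patterns) {
            PCollection<String> lines = p.apply("Read " + pattern, TextIO.read().from(pattern));
            pieces = pieces.and(lines);
        }
        PCollection<String> allLines = pieces.apply(Flatten.pCollections());

        // ... the rest of the job operates on allLines ...
        p.run();
    }
}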
If the subset of files is more arbitrary, then the number of globs/filenames may exceed the maximum graph size. In that case my last recommendation is to put the list of files into a PCollection and use a ParDo transform to read each file and emit its contents.
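And a rough sketch of that last approach, again in the Beam style (FileSystems is used here only for illustration, and error handling is omitted):

import java.io.BufferedReader;
import java.nio.channels.Channels;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.transforms.DoFn;

// Reads each file named in the input PCollection and emits its lines.
class ReadFileFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        try (BufferedReader reader = new BufferedReader(Channels.newReader(
                FileSystems.open(FileSystems.matchNewResource(c.element(), false)),
                "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                c.output(line);
            }
        }
    }
}

// Usage inside the pipeline: the file list itself becomes the input data.
// PCollection<String> lines = p
//         .apply(Create.of(listOfGcsPaths))
//         .apply(ParDo.of(new ReadFileFn()));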
I hope this helps!
I created a custom loadFunc with a custom InputFormat and RecordReader.
Whenever the InputFormat returns more than one input split, the PigSplit still contains only one input split and only one mapper is used.
The implementation is too big to be posted here, but are there any obvious reasons why this might happen?
Edit: I'm using Pig 0.13 and, by adding some logging, I found that
the InputFormat created by the LoadFunc returns a list that contains two input splits, and PigInputFormat then uses this list for creating PigSplits.
I still can't find out where Pig drops one of these input splits and uses only the first one.
This is the code from PigInputFormat.java (src), around line 273:
InputFormat inpFormat = loadFunc.getInputFormat();
List<InputSplit> oneInputSplits = inpFormat.getSplits(
        HadoopShims.createJobContext(inputSpecificJob.getConfiguration(), jobcontext.getJobID()));
List<InputSplit> oneInputPigSplits = getPigSplits(oneInputSplits, i, inpTargets.get(i), HadoopShims.getDefaultBlockSize(fs, isFsPath? path: fs.getWorkingDirectory()),
        combinable, confClone);
splits.addAll(oneInputPigSplits);
I made sure that the LoadFunc returns 2 input splits, but somehow only one PigSplit is created.
Any clues on how this can be figured out?
Edit 2: So I downloaded the source code for Pig 0.13, compiled it and ran my script, and surprisingly it worked fine and used the two splits. Unfortunately I can't do that on the server node.
What I noticed is that the stack trace for creating the input splits is different between the pre-compiled Cloudera version and the version I compiled myself.
The Cloudera version creates the InputSplits using org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat, while the version I compiled uses org.apache.pig.impl.io.ReadToEndLoader.
I'm really getting confused about this one.
So after investigating this, it turns out that there is a bug in Pig versions <= 0.13 that assumes every InputSplit has a length (it always assumes that it is reading from a file). Because in my case CustomInputSplit.getLength() was returning 0, Pig was taking only the first InputSplit and dropping the others.
The workaround is simply to return a non-zero value from getLength() for the input split.
As I mentioned in the question, the behavior of loading the InputSplits is different in later versions, so the workaround is not needed in those cases.
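A minimal sketch of the workaround (CustomInputSplit stands for the custom split class mentioned above; the only important part is that getLength() does not return 0):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

public class CustomInputSplit extends InputSplit implements Writable {

    @Override
    public long getLength() throws IOException, InterruptedException {
        // Pig <= 0.13 drops splits it considers empty, so report a dummy
        // non-zero length even though this split is not backed by a file.
        return 1;
    }

    @Override
    public String[] getLocations() throws IOException, InterruptedException {
        return new String[0];
    }

    // Writable is required so the split can be serialized to the tasks.
    @Override
    public void write(DataOutput out) throws IOException {
        // serialize whatever state the split carries
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // restore that state
    }
}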
If yes, how does HDFS split the input file into N lines to be read by each mapper?
I believe it's impossible!
When the splitter only needs an offset or a number of bytes to split on, it is possible to split without processing the whole input file.
But when the number of '\n' (newline) characters matters, it is necessary to process the entire input file before splitting (to count the newline characters).
For NLineInputFormat to work, each split needs to know where its Nth line starts. As you note in your comment to Tariq's answer, the mapper can't just know where the 3rd line ("banana") starts; it acquires this information from the map's InputSplit.
This is actually taken care of in the input format's getSplitsForFile method, which opens each input file and discovers the byte offsets where every Nth line starts (generating an InputSplit to be processed by a map task).
As you can imagine, this doesn't scale well for large input files (or for huge sets of input files), as the InputFormat needs to open and read every single file to discover the split boundaries.
I've never used this input format myself, but I imagine it's probably best used when you have a lot of CPU-intensive work to do for every line in a smallish input file; rather than one mapper doing all the work for a 100-record file, you can partition the load across many mappers (say 10 lines across 10 mappers).
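For completeness, a minimal driver-side sketch of configuring NLineInputFormat that way (class and path names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "nline example");
        job.setJarByClass(NLineDriver.class);

        // Each InputSplit (and therefore each map task) gets 10 lines of input,
        // so a 100-line file is spread across 10 mappers.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        NLineInputFormat.addInputPath(job, new Path(args[0]));

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // job.setMapperClass(...), output key/value classes, etc. go here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}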
Yes.
It's possible!
Reason:
The mechanism is still the same and works on the raw data. The N in NLineInputFormat refers to the number of lines of input that each mapper receives (the number of records, to be precise). Since NLineInputFormat uses LineRecordReader, each line is one record. It doesn't change the way splits are created, which is normally based on the size of an HDFS block (remember that NLineInputFormat is a member of the FileInputFormat family).
Possible Duplicate:
do searching in a very big ARPA file in a very short time in java
My file's format:
\data\
ngram 1=19
ngram 2=234
ngram 3=1013
\1-grams:
-1.7132 puluh -3.8008
-1.9782 satu -3.8368
\2-grams:
-1.5403 dalam dua -1.0560
-3.1626 dalam ini 0.0000
\3-grams:
-1.8726 itu dan tiga
-1.9654 itu dan untuk
\end\
As you can see, I have a number of lines in ngram 1, 2 and 3. There is no need to read the whole file. If the input string is a one-word string, the program can just search the \1-grams: part. If the input string is a two-word string, the program can just search the \2-grams: part, and so on. Finally, if the program finds the input string in the file, it has to return the two numbers located to the left and right of the string. I should also say that each part of the file is sorted. I am sure that I do not have to read the file completely, and using an index file cannot solve my problem. Those approaches take a lot of time, and my lecturer said that the search has to be done in less than 1 minute for such a big file. I think the best thing is to find a way to jump to a specific line (not byte) of the file, but I do not know how I can do that. It would be great if someone could help me solve my problem.
My file is almost 800MB. I have found that using a BufferedReader is a good way to read a file very fast, but when I read such a big file and put it into an array line by line, it takes more than 30 minutes.
How big is your file? A minute is a very long time. I would suggest using a BufferedReader for efficiency (and also for its readLine method).
If that really takes too long, two approaches come to mind that don't use indexes:
Force every line in the file to be the same length. Then you can jump to a specific line by calculating its start offset. If you don't know which line number you need, you can at least use this to do an efficient binary search of the entire file.
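A rough sketch of this first idea, assuming every line has been padded to the same fixed length (the record length, the key extraction and the sortedness of the searched region are assumptions):

import java.io.RandomAccessFile;

public class FixedLengthSearch {

    static final int RECORD_LENGTH = 64;   // bytes per padded line, including '\n'

    // Binary search over the records between line firstLine (inclusive) and
    // lastLine (exclusive); the keys must be sorted in that region of the file.
    static String search(RandomAccessFile file, long firstLine, long lastLine,
                         String key) throws Exception {
        long lo = firstLine, hi = lastLine;
        while (lo < hi) {
            long mid = (lo + hi) / 2;
            file.seek(mid * RECORD_LENGTH);        // jump straight to a line
            String line = file.readLine();
            if (line == null) return null;         // past the end of the file
            int cmp = extractKey(line.trim()).compareTo(key);
            if (cmp == 0) return line.trim();
            if (cmp < 0) lo = mid + 1; else hi = mid;
        }
        return null;                               // not found
    }

    // For an ARPA line like "-1.9782 satu -3.8368" the key is the middle part.
    static String extractKey(String line) {
        String[] parts = line.split("\\s+", 2);
        return parts.length > 1 ? parts[1].replaceAll("\\s+-?\\d.*$", "") : line;
    }
}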
Jump to an arbitrary position and read forward until you get to a line that starts with a \. That will tell you whether you've found the right part, or whether you need to jump forward or backward from the arbitrary position you jumped to. This can also be used to build a binary-search strategy for the data you need. It relies on the \ being a reliable indicator of the start of a part.
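And a rough sketch of this second idea: seek to an arbitrary byte offset, discard the partial line, and scan forward for the next line starting with \ to see which part you have landed in (the file name and charset handling are simplified):

import java.io.RandomAccessFile;

public class SectionProbe {

    // Returns the first section header ("\1-grams:", "\2-grams:", ...) found
    // at or after the given byte offset, or null if none follows it.
    static String sectionAfter(RandomAccessFile file, long offset) throws Exception {
        file.seek(offset);
        file.readLine();                       // discard the (probably partial) line
        String line;
        while ((line = file.readLine()) != null) {
            if (line.startsWith("\\")) {       // "\1-grams:", "\2-grams:", "\end\"
                return line.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("model.arpa", "r")) {
            long middle = file.length() / 2;
            // Tells you whether the middle of the file is before or after the
            // part you need, so you can keep halving the range from there.
            System.out.println("section after midpoint: " + sectionAfter(file, middle));
        }
    }
}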